Polymarket Earnings Prediction: Beating the Crowd with ML + Alternative Data
An end-to-end ML pipeline that predicts S&P 500 and Russell 2000 earnings beats, finds mispriced markets on Polymarket, and sizes positions using Diversified Kelly criterion with GPT-4o, FinBERT, insider trading, and options sentiment as risk filters. 81.5% win rate, +278% return on backtest.
Most earnings 'beats' are already priced in before a single trade hits Polymarket. The consensus estimate, last quarter's surprise, the past month of price action... all of that is baked into the CLOB price on your screen. The question behind this project was straightforward: can an ML model trained on thousands of historical earnings events spot where the crowd has it wrong, and size bets around that?
This article goes through the full pipeline cell by cell, with real numbers from every stage. Raw fundamentals to calibrated probabilities, Kelly sizing to a multi-layered risk filter that pulls in GPT-4o, FinBERT, insider trading data, and options sentiment to decide which trades to skip. Everything is open-source, and the notebook lets you reproduce every result.
1. The Dataset
The feature set covers S&P 500 and Russell 2000 names from 2007 through early 2026:
52,512 rows, 50 features. Russell 2000: 40,781 rows across 1,611 symbols (68.8% beat rate). S&P 500: 11,731 rows across 502 symbols (82.8% beat rate). Overall beat rate sits at 71.9%.
The 50 features break down into four groups:
- Analyst estimate features: EPS estimates from multiple horizons (7d, 30d, 60d, 90d ago), analyst count, estimate spread, high/low bounds, and net revisions at 7- and 30-day intervals. Revision momentum matters here: if the 7-day number is diverging from the 30-day, something is shifting.
- Historical beat features: 8-quarter beat rate, 4-quarter beat streak, and lagged beat indicators going back 5 quarters. This is basically serial correlation in estimate beats.
- Fundamental features: prior-quarter margins (gross, net, operating), both levels and changes over 1- and 4-quarter horizons. A company with compressing margins quarter over quarter is a different bet than one with stable margins.
- Price momentum: 5-, 15-, 30-, 45-, and 60-day pre-earnings price changes. Markets tend to price in expected beats through pre-earnings drift.
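The momentum group is simple to reproduce. Here's a minimal sketch of how the pre-earnings price-change features could be computed from a daily close series; the function name and the convention of anchoring on the close one day before the report are assumptions, not the notebook's exact code:

```python
import pandas as pd

def momentum_features(closes: pd.Series, report_idx: int,
                      windows=(5, 15, 30, 45, 60)) -> dict:
    """Percent price change over each lookback window, ending the day before earnings."""
    feats = {}
    last = closes.iloc[report_idx - 1]          # close the day before the report
    for w in windows:
        past = closes.iloc[report_idx - 1 - w]  # close w trading days earlier
        feats[f"price_chg_{w}d"] = last / past - 1.0
    return feats

prices = pd.Series([100 + i for i in range(80)])  # toy uptrend
feats = momentum_features(prices, report_idx=70)
```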
Here's what a slice of the data looks like for seven recognizable names (most recent quarter each), with one representative feature per category:

Look at the contrast: TSLA has a beat_rate_8q of just 0.13 (1 beat in 8 quarters), a previous-quarter miss of -10%, the widest estimate spread by far (315%), and the steepest pre-earnings price drop. And it still beat. Cases like this are exactly why the model needs to learn nonlinear interactions instead of leaning on any single feature. AMZN, on the other hand, had a perfect 8-quarter beat rate and strong prior surprise but missed. Even the "easy" names are unpredictable.
The target is binary: did the company beat the consensus EPS estimate?
2. Exploratory Data Analysis: Correlations and Structure
Before touching any model, we ran a full EDA pass in the earnings_pipeline notebook. Three things came out of it that shaped the modeling strategy.
Feature-target correlation (point-biserial)

We computed Pearson correlation between each feature and the binary beat target. The strongest positive predictors were beat_rate_8q and prev_eps_surprise_pct. Companies that beat consistently and beat big last time tend to keep beating. On the negative side, price momentum features stood out: stocks that rallied hard into earnings had already priced in the good news, so a beat carried less signal.
A heatmap of the top 15 features' inter-correlations showed obvious multicollinearity clusters. The EPS estimate horizons (7d, 30d, 60d, 90d) were correlated above 0.95 with each other, same with the operating margin variants. That's what motivated the 0.95 correlation filter in preprocessing. Nine features got dropped before any model saw the data.
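The 0.95 filter itself is a few lines of pandas. This is a generic sketch of the technique (drop the later member of each highly correlated pair, using only train-set correlations); the column names in the toy example are illustrative:

```python
import numpy as np
import pandas as pd

def correlation_filter(train: pd.DataFrame, threshold: float = 0.95) -> list:
    """Columns to drop: for each pair with |corr| above threshold, drop the later one."""
    corr = train.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "eps_est_7d": base,
    "eps_est_30d": base + rng.normal(scale=0.01, size=500),  # near-duplicate horizon
    "beat_rate_8q": rng.normal(size=500),
})
dropped = correlation_filter(df)
```

Run on the train split only, then apply the resulting drop list to validation and test, so the filter never sees future data.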
Cross-Company Beat Correlation

We pivoted the full dataset into a binary matrix (stocks × quarters) and computed pairwise beat correlations across all 500+ S&P 500 names. The distribution centered near zero, meaning most stocks' beat/miss outcomes are roughly independent of each other. Good news for portfolio diversification.
The highest positive correlations (1.0) included GOOG/GOOGL (same company, obviously) and pairs like ADP/MSFT, ADSK/CRM, CDNS/KO. On the other end, the strongest negative pairs were EW/TSLA (-0.43), PSX/SRE (-0.42), and ETR/MOH (-0.41). These hedge pairs fed into the correlation penalty in the Diversified Kelly position sizer later on.
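The pivot-and-correlate step is straightforward; a minimal sketch with hypothetical column names:

```python
import pandas as pd

# Long event table -> binary (quarters x symbols) matrix -> pairwise beat correlations.
events = pd.DataFrame({
    "symbol":  ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB"],
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "beat":    [1, 0, 1, 0,  0, 1, 0, 1],
})
matrix = events.pivot(index="quarter", columns="symbol", values="beat")
pairwise = matrix.corr()   # symbol x symbol correlation matrix
# AAA and BBB disagree every quarter here, so their correlation is -1.0
```

On the real dataset this produces the 500+ × 500+ matrix whose extremes (GOOG/GOOGL at +1.0, EW/TSLA at -0.43) are quoted above.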
Sector-Level Beat Correlation

We averaged beat rates by sector per quarter and built a sector × sector correlation matrix. This shows which sectors beat and miss together, which matters for portfolio concentration risk. Sectors with high positive correlation (say, Technology and Communication Services) get hit at the same time in a bad earnings season, while negatively correlated sectors give you natural diversification.
This EDA phase mattered a lot. It told us three things: features are predictive but multicollinear, individual stock beats are mostly independent, and sector-level clustering exists but isn't dominant. All three fed directly into how we built the model and sized the portfolio.
3. Temporal Split and Preprocessing
We split strictly by time, never randomly, to avoid data leakage:
- Train (2017–2022): 31,808 rows, 72.7% beat rate.
- Validation (2023): 6,818 rows, 70.6% beat rate.
- Test (2024–2025): 13,818 rows, 70.9% beat rate.
Pre-2017 data got dropped due to sparsity. We confirmed there's no leakage: train ends at 2022-12-31, validation starts at 2023-01-27, test starts at 2024-01-29.
Preprocessing was kept leakage-safe:
First, we ran a correlation filter on the train set and dropped 9 features with pairwise correlation above 0.95, mostly redundant estimate horizons and operating margin variants. That brought the feature count down to 41.
Then we applied RobustScaler, fit only on train data and transformed onto validation and test. RobustScaler uses median and IQR instead of mean and standard deviation, so it handles the heavy outliers you get in financial data.
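The fit-on-train-only discipline looks like this in scikit-learn (a sketch with synthetic data; the real pipeline works on the 41 surviving features):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 3))
X_train[0] = [50.0, -40.0, 99.0]      # heavy outliers, common in financial data
X_val = rng.normal(size=(200, 3))

scaler = RobustScaler().fit(X_train)  # median/IQR come from train only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)     # never refit on validation or test
```

Because the centering statistic is the train median, outliers like the injected row shift the scaling far less than they would with a mean/std `StandardScaler`.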
4. Model Selection: Simplicity First
We trained a dummy baseline plus four model families on the 41-feature train set:
- Dummy classifier (majority class) is just the floor: MCC = 0.0000, AUC-ROC = 0.5000. It predicts "beat" every time.
- Logistic regression (L2, balanced class weights) gave MCC = +0.2154, AUC-ROC = 0.6676. A decent baseline; the balanced weights keep it from collapsing into "predict beat always."
- Random Forest (500 trees, balanced subsample) came in at MCC = +0.2908, AUC-ROC = 0.7290, OOB score = 0.7422. Best performer on validation right away.
- XGBoost (1,000 estimators, early stopping at 846) hit MCC = +0.2928, AUC-ROC = 0.7210. Competitive with RF on discrimination.
- LightGBM (early stopped at iteration 11) only managed MCC = +0.1220, AUC-ROC = 0.7157. It stopped far too early because the default learning rate was too aggressive for the tree structure.
Full validation comparison:

We used MCC as the selection metric because it's the most reliable single number for imbalanced binary classification. It penalizes false positives and false negatives equally regardless of class skew, while accuracy and F1 can look misleadingly good when the model just predicts the majority class. That said, for a trading system what you actually care about is well-calibrated probabilities, not just correct labels. Brier Score and Log Loss measure probability quality directly, and on those metrics Random Forest edges out XGBoost at this stage (Brier 0.1865 vs 0.2000). We come back to this in the calibration section, where we apply Platt scaling and isotonic regression and compare the two models head-to-head on calibrated probability quality, which is what actually drives Kelly sizing and P&L.
5. Feature Importance
Permutation importance on the validation set (15 repeats, scored on AUC-ROC) shows which features the model actually relies on:
The top two features, 8-quarter beat rate and previous EPS surprise, dominate by a wide margin. This makes sense financially: companies that consistently beat tend to keep beating (management "guides down" as a pattern), and the size of the previous surprise is predictive of the next one. Analyst count and estimate spread pick up uncertainty. High coverage with tight estimates means the market is confident, and deviations from that carry more weight.
6. Hyperparameter Tuning
LightGBM's premature stopping was the most obvious issue. We tuned both XGBoost and LightGBM with RandomizedSearchCV (40 iterations each), using a PredefinedSplit to preserve temporal order. No shuffling.
XGBoost tuned: learning_rate=0.01, max_depth=5, n_estimators=1500, reg_alpha=1.0, reg_lambda=5.0, best iteration=1,378. AUC-ROC = 0.7356.
LightGBM tuned: learning_rate=0.02, num_leaves=15, min_child_samples=50, best iteration=27. AUC-ROC = 0.7330.
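The temporal-order constraint is enforced with a `PredefinedSplit`. A sketch of the construction, using the row counts from the split above:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 marks rows that are always in the training fold; 0 marks the single
# validation fold. Rows are assumed to be sorted chronologically.
n_train, n_val = 31808, 6818
test_fold = np.r_[np.full(n_train, -1), np.zeros(n_val)]
ps = PredefinedSplit(test_fold)

(train_idx, val_idx), = ps.split()
```

Passing `cv=ps` to `RandomizedSearchCV` gives exactly one fold: every candidate trains on the 2017–2022 rows and scores on 2023, with no shuffling across the time boundary.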
Before and after on validation:

After tuning, XGBoost took the lead on both MCC and AUC-ROC. LightGBM stopped at a more reasonable point but still lagged on discrimination.
7. Threshold Tuning
The default 0.5 classification threshold is arbitrary for imbalanced targets. We swept thresholds on the validation set to maximize MCC:

Random Forest benefited most from threshold tuning. Its optimal threshold of 0.635, well above the 0.5 default, reflects the high base rate: the model needs to be more confident before predicting "beat."
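The sweep itself is a one-liner per threshold. A sketch of maximizing MCC on validation probabilities (grid bounds are illustrative):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_mcc_threshold(y_true, proba, grid=None):
    """Sweep classification thresholds and return the MCC-optimal one."""
    grid = np.arange(0.30, 0.80, 0.005) if grid is None else grid
    scores = [matthews_corrcoef(y_true, (proba >= t).astype(int)) for t in grid]
    i = int(np.argmax(scores))
    return float(grid[i]), float(scores[i])

# Toy check with perfectly separated probabilities
y = np.array([0] * 50 + [1] * 50)
p = np.r_[np.linspace(0.0, 0.45, 50), np.linspace(0.55, 1.0, 50)]
thr, mcc = best_mcc_threshold(y, p)
```

The key discipline is that the sweep runs on the validation set, and the chosen threshold is then frozen before touching the test set.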
8. Probability Calibration
A prediction market doesn't care about your accuracy. It cares about whether your probability estimates are better than the price. If Polymarket says 70% beat and your calibrated model says 78%, that's an edge. If your model says 72%, that's noise.
We applied Platt scaling (sigmoid) and isotonic regression on the tuned XGBoost, calibrating with cv='prefit' on the validation set.
Calibration comparison on test set:

Raw XGBoost actually had a negative Brier Skill Score, meaning its probabilities were worse than just predicting the base rate every time. After sigmoid calibration, BSS jumped to +0.1236, so the model's probabilities are now about 12.4% better than the naive baseline. This is the number that matters for trading: your probabilities need to be calibrated, not just discriminative.
Winner: XGBoost (tuned) + sigmoid at threshold 0.631.
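Platt (sigmoid) scaling can be sketched as a one-feature logistic regression fit on held-out validation probabilities. This mirrors what sigmoid calibration on a prefit model does; the class name and synthetic data (where raw probabilities are assumed systematically overconfident) are illustrative, not the notebook's exact code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class PlattCalibrator:
    """Map raw model probabilities to calibrated ones via logistic regression."""
    def fit(self, raw_proba, y_val):
        self.lr = LogisticRegression()
        self.lr.fit(np.asarray(raw_proba).reshape(-1, 1), y_val)
        return self

    def predict_proba(self, raw_proba):
        return self.lr.predict_proba(np.asarray(raw_proba).reshape(-1, 1))[:, 1]

rng = np.random.default_rng(7)
raw = rng.uniform(0.05, 0.95, size=2000)
y = (rng.uniform(size=2000) < raw ** 2).astype(int)   # true prob = raw^2: raw is overconfident

cal = PlattCalibrator().fit(raw, y).predict_proba(raw)
brier_raw = float(np.mean((raw - y) ** 2))
brier_cal = float(np.mean((cal - y) ** 2))
```

Fitting the mapping on a set the base model never trained on is what makes the resulting probabilities honest enough to bet with.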
We also calibrated Random Forest separately and ran a head-to-head:

The result is close. RF has slightly better discrimination (MCC), XGBoost has better calibration (BSS). Both end up in the Polymarket backtest: XGBoost for edge calculation, RF probability as a confidence filter.

9. The Polymarket Connection
With calibrated probabilities from both models, we matched predictions against around 600 closed Polymarket earnings markets. CLOB prices came from Polymarket's API and were cached locally. Out of those 600, 314 matched to our dataset.
The edge for each market:
edge = model_probability − polymarket_implied_probability
Flat-bet backtest results ($10 per trade):

The naive strategy of buying "Yes" on every market using Polymarket's own price returns almost nothing (+0.4%). The market is efficient in aggregate. But the ML model finds the mispriced subset: at a 15% edge cutoff, accuracy hits 78.5% and ROI reaches +49.7%.
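The flat-bet filter can be sketched in a few lines (column names are illustrative; each Polymarket share pays $1 on a win):

```python
import numpy as np
import pandas as pd

def flat_bet_roi(df: pd.DataFrame, min_edge: float, stake: float = 10.0) -> float:
    """ROI of $10 flat bets on markets where model prob beats CLOB price by min_edge."""
    picks = df[df["model_prob"] - df["clob_price"] >= min_edge]
    if picks.empty:
        return 0.0
    shares = stake / picks["clob_price"]            # shares bought per market
    pnl = np.where(picks["beat"] == 1, shares - stake, -stake)
    return float(pnl.sum() / (stake * len(picks)))  # return on capital staked

markets = pd.DataFrame({
    "model_prob": [0.90, 0.80, 0.55],
    "clob_price": [0.70, 0.72, 0.50],
    "beat":       [1, 1, 0],
})
roi = flat_bet_roi(markets, min_edge=0.15)  # only the first market clears the cutoff
```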
10. Position Sizing: Diversified Kelly Criterion
Flat betting doesn't cut it for real capital deployment. Position sizing needs to account for edge magnitude and risk.
The simulation runs day by day starting with $100. Each day, we first check if any open positions resolved overnight and settle them (a win pays shares × $1, a loss pays $0). Then we scan for new markets where the model has an edge, allocate from available cash using Kelly sizing, and record a daily snapshot of cash, locked capital, and total equity. Capital is locked from the moment a bet is placed until the market resolves, so the strategy has to manage liquidity rather than treating all capital as always available.
Full Kelly is theoretically optimal but practically suicidal. One bad streak wipes you out. We used quarter-Kelly as the base and added three risk controls on top:
- Hard position cap: no single trade exceeds 12% of total equity.
- Daily deploy limit: max 70% of available cash per day, which keeps dry powder for future opportunities.
- Correlation penalty: if two stocks report on the same day and are historically correlated in their beat/miss patterns, the second stock's allocation gets reduced. We built a binary correlation matrix from 28 quarters across 1,276 stocks, with adaptive shrinkage (λ = 0.500 based on the N/T ratio of 45.6).
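The sizing rules above combine into one small function. This is a sketch under the article's stated parameters (quarter-Kelly, 12% cap, 70% daily deploy limit); the function itself and the correlation haircut's exact form are illustrative:

```python
def position_size(p_model: float, price: float, equity: float, cash: float,
                  kelly_frac: float = 0.25, pos_cap: float = 0.12,
                  daily_cash_cap: float = 0.70, corr_penalty: float = 1.0) -> float:
    """Dollar stake for a binary contract bought at `price` that pays $1 on a win."""
    f_star = (p_model - price) / (1.0 - price)  # full Kelly fraction
    if f_star <= 0:
        return 0.0                               # no edge, no bet
    f = f_star * kelly_frac * corr_penalty       # quarter-Kelly + correlation haircut
    stake = min(f * equity,                      # Kelly-sized stake
                pos_cap * equity,                # hard 12% position cap
                daily_cash_cap * cash)           # keep dry powder
    return max(stake, 0.0)

# 85% model prob vs 70c price on $100 equity: quarter-Kelly wants $12.50,
# but the 12% position cap binds first
stake = position_size(p_model=0.85, price=0.70, equity=100.0, cash=100.0)
```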

Diversified Kelly Results:

Naive Kelly generated a higher raw return (+363% vs +279%), but at the cost of a 30.4% max drawdown and a catastrophic single-trade loss of -$116.53 (BRZE). Diversified Kelly cut the max drawdown by 17.8 percentage points while giving up only a modest share of terminal equity.
Risk-adjusted comparison:

Diversified Kelly is superior on every risk-adjusted metric, with nearly 2× the Sortino ratio.
11. Statistical Significance: Is the Edge Real?
With 146 trades, any backtest could just be luck. We ran four statistical tests on the Diversified Kelly results.
Binomial test: is the 81.5% win rate significantly above the 68.0% break-even rate? p-value = 0.000177. Significant.
One-sample t-test: is the mean per-trade P&L ($+1.91) significantly greater than zero? t-stat = 2.905, p-value = 0.002122. Significant.
Bootstrap (10,000 resamples):

All confidence intervals sit comfortably above their null references. ROI is positive with 99.8% probability.
Permutation test (10,000 shuffles): could a break-even process (WR = 68%) produce $+278.79? p-value = 0.5424. Not significant.
The permutation test fails because the null hypothesis generates a P&L distribution with a similar mean ($+286). At a 68% win rate with our average win/loss ratio, even random trading produces profit. This doesn't mean the strategy isn't working; it just reflects the favorable odds in binary earnings markets where the base rate is already high. The other three tests, which check whether our specific win rate and trade-level returns exceed their null benchmarks, all pass convincingly.
Verdict: 3 out of 4 tests passed at α = 0.05. Moderate-to-strong evidence of a real edge.
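The binomial test's structure is easy to reproduce with scipy (81.5% of 146 trades is about 119 wins, tested against the 68% break-even rate):

```python
from scipy.stats import binomtest

wins = round(0.815 * 146)   # ~119 winning trades
result = binomtest(wins, n=146, p=0.68, alternative="greater")
# result.pvalue is well below 0.05, consistent with the ~0.0002 reported above
```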
12. Loss Autopsy: Understanding What Went Wrong
Before adding filters, we dissected the 27 losing trades. The biggest losers:

These all had positive model edges and reasonably high Polymarket implied probabilities. The model and the market both expected a beat, yet the companies missed. The key question: were there warning signs in data the model couldn't see?
13. Risk Filter Layer 1: GPT-4o Transcript Analysis
For each of the 221 tradeable tickers, we pulled the previous quarter's earnings call transcript from Alpha Vantage. GPT-4o analyzed the forward guidance section and returned a structured JSON with guidance_tone (POSITIVE/NEUTRAL/CAUTIOUS/NEGATIVE) and a miss_risk_score from 0.0 to 1.0.
Guidance distribution across 221 tickers: POSITIVE 154 (69.7%), CAUTIOUS 57 (25.8%), UNKNOWN 8 (3.6%), NEUTRAL 2 (0.9%).
At the aggressive filter level (NEGATIVE tone, or CAUTIOUS with miss_risk ≥ 0.6), skipping 41 tickers catches 8 of the 27 original losses while only sacrificing 15 winners.
Some of the caught losers: MKC, ORLY, AAL, AMZN, TSLA, MCD. All had CAUTIOUS guidance with miss_risk in the 0.6–0.7 range. The ones that slipped through: BRZE, YUM, WAY, TXN. GPT labeled these POSITIVE. The weakness here is that GPT reads overall tone, but negative signals can hide inside an otherwise optimistic transcript.
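Once GPT-4o's structured JSON is in hand, the aggressive skip rule is trivial to apply. A sketch (the sample payload is made up; the field names follow the schema described above):

```python
import json

def should_skip(analysis: dict) -> bool:
    """Aggressive filter: skip on NEGATIVE tone, or CAUTIOUS with miss_risk >= 0.6."""
    tone = analysis.get("guidance_tone", "UNKNOWN")
    risk = analysis.get("miss_risk_score", 0.0)
    return tone == "NEGATIVE" or (tone == "CAUTIOUS" and risk >= 0.6)

sample = json.loads('{"guidance_tone": "CAUTIOUS", "miss_risk_score": 0.65}')
skip = should_skip(sample)   # CAUTIOUS with miss_risk >= 0.6 -> skipped
```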
Backtest impact:

The GPT AGGRESSIVE filter improved PnL from +$279 to +$389 (a +39% improvement), lifted the win rate from 81.5% to 84.7%, and raised the profit factor from 2.08 to 2.64, all while slightly reducing max drawdown. This is the single most impactful risk filter.

14. Risk Filter Layer 2: FinBERT Sentence-Level Sentiment
GPT gives a document-level label, but negative signals can hide inside an otherwise positive transcript. FinBERT (ProsusAI/finbert) scores every individual sentence. For each transcript, we extracted:
- neg_ratio: percentage of sentences classified as negative
- sentiment_mean: average score across all sentences (-1 to +1)
- worst_5_mean: average score of the 5 most negative sentences
- qa_neg_ratio: negative sentence ratio in the Q&A section only
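Aggregating per-sentence model output into these transcript-level features is a small reduction step. A sketch assuming the per-sentence labels and signed scores have already been produced by FinBERT (the toy inputs below are made up):

```python
import numpy as np

def transcript_features(labels, scores):
    """Collapse sentence-level (label, score) pairs into transcript-level features."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    return {
        "neg_ratio": float((labels == "negative").mean()),
        "sentiment_mean": float(scores.mean()),
        "worst_5_mean": float(np.sort(scores)[:5].mean()),  # 5 most negative sentences
    }

labels = ["positive"] * 7 + ["negative"] * 2 + ["neutral"]
scores = [0.8, 0.7, 0.6, 0.9, 0.5, 0.4, 0.6, -0.7, -0.9, 0.0]
feats = transcript_features(labels, scores)
```

The `worst_5_mean` feature is the one designed to catch negative signals buried inside an otherwise positive transcript, which document-level GPT labels can miss.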
Distribution across 221 transcripts:

Winners vs Losers comparison:

Interestingly, losers actually had slightly better (less negative) FinBERT scores than winners. This suggests the losses came from companies with genuinely positive transcripts where the miss was driven by something outside the call: macro shifts, supply chain surprises, or one-off items.
FinBERT on its own was a poor filter. The combined GPT+FB filter (+$345, 84.5% WR) didn't beat the GPT-only filter (+$389, 84.7% WR). FinBERT adds value as a complementary signal, not a standalone one.
15. Risk Filter Layer 3: Insider Trading
Via FMP API, we pulled SEC insider transactions for each ticker in the 90 days before earnings. Key features: net_sell_ratio, number of distinct sellers, total sell value, and whether C-suite executives sold.
Distribution (154 tickers with data):
- net_sell_ratio: mean = 0.793, median = 1.000 (most insiders are net sellers; this is normal)
- exec_sold: 97% of tickers had at least one executive selling
- sell_value: mean = $3.6M
Winners vs Losers:

Again, this looks backwards at first: winners actually had higher insider selling. But it makes sense. Insiders at well-performing companies sell more stock because their shares are worth more. Raw insider selling isn't a bearish signal on its own; you'd need context-adjusted selling (unusual volume relative to history, for example) for that to mean anything.
That said, the combination of GPT + Insider produced the best risk-adjusted result of any filter:

GPT+INS combined brought the max drawdown down to 8.3% and pushed the win rate to 87.4% with a profit factor of 2.90, the highest of any configuration. It gave up some absolute PnL relative to the GPT-only filter's +$389, but the risk-adjusted return improved significantly.
Notable loser detail:

WAY stands out: 11 insiders dumping $885M worth of stock, yet GPT called the transcript POSITIVE. The insider filter caught this one; GPT missed it.

16. Risk Filter Layer 4: Options Sentiment
Using Alpha Vantage historical options data, we computed pre-earnings options features from the chain 1 business day before the report: put/call volume ratio, put/call open interest ratio, ATM implied volatility, and IV skew (put IV minus call IV).
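Given a chain snapshot, these features reduce to a few groupbys. A sketch with a toy chain; the column names are illustrative, not Alpha Vantage's exact schema:

```python
import pandas as pd

def options_features(chain: pd.DataFrame, atm_strike: float) -> dict:
    """Put/call ratios and ATM IV skew from a single pre-earnings chain snapshot."""
    puts = chain[chain["type"] == "put"]
    calls = chain[chain["type"] == "call"]
    atm_put_iv = puts.loc[puts["strike"] == atm_strike, "iv"].mean()
    atm_call_iv = calls.loc[calls["strike"] == atm_strike, "iv"].mean()
    return {
        "pc_volume_ratio": puts["volume"].sum() / calls["volume"].sum(),
        "pc_oi_ratio": puts["open_interest"].sum() / calls["open_interest"].sum(),
        "iv_skew": atm_put_iv - atm_call_iv,   # put IV minus call IV
    }

chain = pd.DataFrame({
    "type":          ["put", "put", "call", "call"],
    "strike":        [100, 95, 100, 105],
    "volume":        [300, 150, 200, 100],
    "open_interest": [1000, 500, 800, 400],
    "iv":            [0.55, 0.60, 0.50, 0.48],
})
feats = options_features(chain, atm_strike=100)
```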
Winners vs Losers:

Surprisingly, winners had higher put/call ratios than losers. This contradicts the textbook signal but makes sense in context: options market makers hedge their earnings exposure, and heavy put buying often reflects hedging of existing long positions rather than directional bearish bets. A high P/C ratio before a company that ends up beating might mean institutional holders are buying protection while staying long.
Options filter results:

The full stack (GPT + Insider + Options) catches 22 of 27 losses but also kills 85 winners, reducing trade count to just 52 and PnL to $+101. There's a clear point of diminishing returns: more filters mean fewer opportunities.

17. Filter Comparison: The Pareto Frontier
Across all filter configurations, here's the trade-off surface:

Two configurations stand out:
If you want maximum absolute return: GPT aggressive filter. 131 trades, +389% ROI, 84.7% WR, profit factor 2.64. This is your scale strategy.
If you want the best risk-adjusted return: GPT+INS combined. 87 trades, 87.4% WR, profit factor 2.90, 8.3% max drawdown, Calmar 33.3. This is your sleep-at-night strategy.
Adding more filters beyond this point starts to hurt. You run out of trades before you run out of risk.
18. What Worked and What Didn't
What worked:
- Probability calibration was the single most important technical step. Raw model probabilities had a negative Brier Skill Score; after calibration they actually became tradeable.
- The GPT transcript filter alone improved PnL by +39% while cutting drawdown. It catches the cases where management was hedging, which pure fundamentals miss.
- Diversified Kelly with risk controls brought max drawdown from 30.4% down to 12.6% compared to naive Kelly, at only a modest cost in terminal equity.
- Using RF and XGBoost together as a dual-model system, XGBoost for edge calculation and RF probability as a confidence gate, worked better than either one alone.
What didn't work:
- Full mean-variance portfolio optimization. The scipy optimizer killed returns without meaningfully improving risk metrics. Kelly sizing with surgical risk controls worked much better.
- FinBERT as a standalone filter. Sentence-level sentiment was, counter-intuitively, less negative for losers than winners. Its value is complementary, not primary.
- Raw insider trading signals. In this universe everyone sells, so selling alone isn't informative without additional context.
- Aggressive multi-filter stacking. The GPT+INS+OPT "all filters" configuration caught 22 of 27 losses but also removed too many winners, reducing absolute PnL to a third of the optimal.
19. Reproducing This Work
Two notebooks are provided:
earnings_pipeline_publish.ipynb + earnings_pipeline_publish.py: The data layer. Collects raw data from Alpha Vantage and FMP, engineers all 50 features, and runs the full EDA (correlation heatmaps, sector analysis, cross-company beat patterns). If you want to rebuild the dataset from scratch or add new features, start here. Requires FMP and Alpha Vantage API keys (both have free tiers).
modeling_publish.ipynb: The modeling and trading layer. Takes the pre-built CSVs and runs the full ML pipeline. Self-contained:
Cells 1–12: Pure ML pipeline. Requires only the CSV data files (data/ml_ready/). No API keys needed. You'll get trained models, calibrated probabilities, evaluation dashboards, and feature importance plots.
Cell 0B: Quick-load shortcut. After running cells 1–12 once, the checkpoint system saves everything to disk. Next time, run Cell 1 → Cell 0B → skip to Cell 13.
Cells 13–20: Polymarket backtest. Requires polymarket_earnings_raw.csv and polymarket_clob_prices.csv (included). Produces day-by-day trading simulations, Kelly sizing, and equity curves.
Cells 21–28: Risk filter layer. Requires API keys for Alpha Vantage, FMP, and OpenAI, and optionally Perplexity and NewsData. All results are cached to JSON files in the polymarket/ directory, so you only pay for API calls once.
If you don't have the API keys, cells 1–20 still give you a complete working system with ML models, Polymarket edge calculation, diversified Kelly sizing, and statistical significance tests.