
Building a Statistical Arbitrage Engine: Cointegration, Ornstein-Uhlenbeck, and Why Most Pairs Trading Backtests Are Lies

Pairs trading is the most popular 'quant' strategy for retail traders — and the one most commonly implemented incorrectly. The gap between backtest Sharpe and live Sharpe is enormous, systematic, and caused by mistakes that are entirely avoidable if you understand the math.

Statistical arbitrage — specifically pairs trading — is the gateway drug of quantitative finance. Every aspiring quant builds one. The pitch is irresistible: find two correlated stocks, go long the cheap one, short the expensive one, wait for convergence, collect risk-free profit. The backtests look beautiful. Sharpe ratios of 2.5, 3.0, sometimes 4.0. Smooth equity curves with barely a drawdown. Then you go live and the strategy bleeds money for six months straight.

This is not a coincidence. The gap between backtested and live performance in stat arb is systematic, predictable, and caused by a chain of methodological errors that almost every retail implementation makes. This post is a technical walkthrough of how to build a statistical arbitrage engine correctly — the math, the pitfalls, and the framework that separates strategies that work on paper from strategies that work with real capital.

Correlation vs. Cointegration: The Fundamental Confusion

The first and most catastrophic mistake is using correlation as the selection criterion for pairs. Correlation measures whether two time series move in the same direction at the same time. Two stocks can have a correlation of 0.95 and diverge permanently — one doubles while the other goes to zero, as long as they move in the same direction on most days. Correlation says nothing about whether the spread between two series is stable.

Cointegration is the correct concept. Two time series X(t) and Y(t) are cointegrated if there exists a linear combination Y(t) - beta * X(t) = S(t) where S(t) is stationary — meaning S(t) has a constant mean and variance over time and tends to revert to that mean. Cointegration implies that even though X and Y individually wander like random walks, their spread S has a stable distribution. This is the mathematical property that makes pairs trading possible: you are betting that the spread will revert, not that the individual stocks will move in a particular direction.

The Engle-Granger Two-Step Method

The standard test for cointegration between two series is the Engle-Granger method. Step one: regress Y(t) on X(t) using ordinary least squares to estimate the hedge ratio beta. This gives you the spread S(t) = Y(t) - beta_hat * X(t). Step two: test S(t) for stationarity using the Augmented Dickey-Fuller (ADF) test. The ADF test regresses delta_S(t) = alpha + gamma * S(t-1) + sum(phi_i * delta_S(t-i)) + epsilon and tests whether gamma is significantly negative. If gamma < 0 and statistically significant (p-value < 0.05), you reject the null hypothesis of a unit root, and the spread is stationary — the pair is cointegrated.
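
The two steps can be sketched in a few lines. This is a minimal illustration using plain numpy and a Dickey-Fuller regression with no lag terms; production code should use statsmodels' adfuller or coint, which handle lag selection and the stricter Engle-Granger critical values (roughly -3.34 at 5 percent, versus the plain ADF -2.86 used here):

```python
import numpy as np

def engle_granger(x, y, crit=-2.86):
    """Sketch of the Engle-Granger two-step test.

    Step 1: OLS of y on x (with intercept) gives the hedge ratio beta.
    Step 2: Dickey-Fuller regression on the residual spread; a t-stat on
    gamma below the critical value rejects the unit-root null.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    spread = y - alpha - beta * x

    # Dickey-Fuller regression: dS(t) = a + gamma * S(t-1) + eps
    ds = np.diff(spread)
    lag = spread[:-1]
    Z = np.column_stack([np.ones_like(lag), lag])
    coef, *_ = np.linalg.lstsq(Z, ds, rcond=None)
    resid = ds - Z @ coef
    sigma2 = resid @ resid / (len(ds) - 2)
    se_gamma = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])
    t_stat = coef[1] / se_gamma
    return beta, spread, t_stat, bool(t_stat < crit)
```

Note that because beta is itself estimated in step one, the residual-based test needs the more negative Engle-Granger critical values in practice; the sketch above is for intuition, not production.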

The Johansen test extends this to multivariate settings — testing whether a basket of N > 2 assets contains one or more cointegrating relationships. It uses a vector autoregressive framework and reports the number of independent cointegrating vectors via trace and maximum eigenvalue statistics. For pairs trading, Engle-Granger is sufficient. For basket stat arb with 10-50 assets, Johansen is necessary.

Correlation is a statement about returns. Cointegration is a statement about prices. A pairs trading strategy is a bet on price convergence, not return similarity. Using correlation to select pairs is like using a thermometer to measure wind speed — the instrument measures the wrong thing entirely.

The critical flaw in cointegration testing: it is estimated on historical data. A pair that was cointegrated for the past 3 years can lose cointegration tomorrow due to a structural break — a merger, a regulatory change, a shift in business model. Cointegration is not a permanent property of a pair. It is a regime-dependent statistical relationship that must be continuously re-estimated and validated out-of-sample. This single fact invalidates the majority of pairs trading backtests published online.

The Ornstein-Uhlenbeck Process

Once you have a cointegrated pair and a stationary spread, you need a mathematical model of how the spread behaves. The standard model is the Ornstein-Uhlenbeck (OU) process, a continuous-time stochastic process defined by the stochastic differential equation: dS = theta * (mu - S) * dt + sigma * dW. Here S is the spread level, theta is the mean reversion speed, mu is the long-term mean, sigma is the volatility of the spread, and dW is a Wiener process (Brownian motion increment).

The key parameter is theta, the mean reversion speed. When the spread deviates from mu, the drift term theta * (mu - S) pulls it back. Higher theta means faster reversion. The half-life of the process — the expected time for the spread to revert halfway from its current level to mu — is given by half_life = ln(2) / theta. If theta = 0.05 per day, the half-life is ln(2) / 0.05 = 13.9 days. If theta = 0.01, the half-life is 69.3 days. This is the single most important number in a stat arb strategy: if the half-life exceeds your holding period tolerance or your risk budget, the trade does not work regardless of how cointegrated the pair appears.

Calibrating OU Parameters via OLS

You calibrate the OU process from discrete observations using the discretized version: S(t+1) - S(t) = a + b * S(t) + epsilon(t). Run an OLS regression of delta_S on S. The coefficient b estimates -(1 - exp(-theta * dt)), which for daily data (dt = 1) simplifies to b approximately equals -theta when theta is small. So theta = -b, mu = -a/b, and sigma is estimated from the standard deviation of the residuals epsilon, scaled appropriately: sigma_OU = sigma_epsilon * sqrt(2 * theta / (1 - exp(-2 * theta * dt))).
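
The calibration described above fits in one small function. This sketch uses the exact inversion theta = -ln(1 + b)/dt rather than the small-theta approximation theta = -b, and returns the half-life alongside the parameters:

```python
import numpy as np

def fit_ou(spread, dt=1.0):
    """Calibrate OU parameters from the regression
    S(t+1) - S(t) = a + b * S(t) + eps, as described above."""
    s = np.asarray(spread, dtype=float)
    ds = np.diff(s)
    lag = s[:-1]
    X = np.column_stack([np.ones_like(lag), lag])
    (a, b), *_ = np.linalg.lstsq(X, ds, rcond=None)
    resid = ds - X @ np.array([a, b])

    theta = -np.log1p(b) / dt        # exact inversion of b = -(1 - exp(-theta*dt))
    mu = -a / b                      # long-term mean
    sigma_eps = resid.std(ddof=2)
    sigma = sigma_eps * np.sqrt(2 * theta / (1 - np.exp(-2 * theta * dt)))
    half_life = np.log(2) / theta
    return theta, mu, sigma, half_life
```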

The Hurst Exponent: A Second Filter

The Hurst exponent H provides a complementary measure of mean-reversion strength. For a time series, H < 0.5 indicates mean-reverting behavior, H = 0.5 indicates a random walk, and H > 0.5 indicates trending (momentum) behavior. The further below 0.5, the stronger the mean reversion. A spread with H = 0.35 is aggressively mean-reverting. A spread with H = 0.48 is barely distinguishable from a random walk.

Two common estimation methods exist. Rescaled range (R/S) analysis computes the ratio of the range of cumulative deviations to the standard deviation over varying time windows, then fits a power law. Detrended Fluctuation Analysis (DFA) removes polynomial trends from the series and measures the scaling of residual fluctuations. DFA is generally preferred for financial time series because it is more robust to short-range autocorrelation and non-stationarities.
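
A quick-and-dirty third option, useful as a sketch, estimates H from the scaling of lagged differences: std(x(t+lag) - x(t)) grows like lag**H. It is cruder than R/S or DFA (libraries such as nolds implement both) but captures the idea in a few lines:

```python
import numpy as np

def hurst(series, max_lag=50):
    """Rough Hurst estimate from the power-law scaling of lagged
    differences: std(x(t+lag) - x(t)) ~ lag**H."""
    x = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    tau = [np.std(x[lag:] - x[:-lag]) for lag in lags]
    # Slope of log(std) vs. log(lag) is the Hurst exponent
    slope, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return slope
```

On a pure random walk this estimator lands near 0.5; on a strongly mean-reverting series the difference standard deviation saturates quickly and the fitted slope drops well below 0.5.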

The critical point: you need H < 0.5 AND cointegration AND a reasonable half-life. Any single metric alone is insufficient. You can find pairs with H = 0.4 that are not cointegrated (the spread mean-reverts locally but drifts over longer horizons). You can find pairs that pass the ADF test but have a half-life of 90 days (statistically cointegrated but economically untradeable). The conjunction of all three filters is what separates viable pairs from statistical noise.

Why Most Backtests Are Lies

Now we arrive at the core problem. You have read a Medium article, run cointegration tests on 200 pairs, found 15 that pass, backtested a z-score entry/exit strategy, and your backtest shows a Sharpe ratio of 3.2 with 82 percent win rate and maximum drawdown of 4 percent. You are about to wire money to your brokerage. Let me explain why your backtest is lying to you.

Survivorship Bias

You tested pairs of stocks that exist today. But stocks that were delisted, went bankrupt, or were acquired in the past 5 years are not in your dataset. These are precisely the stocks that would have caused catastrophic losses in a pairs strategy — the spread diverges to infinity when one leg goes to zero. Your backtest systematically excludes the worst outcomes. Use a survivorship-bias-free dataset (CRSP, or point-in-time constituents of an index) or your results are meaningless.

Look-Ahead Bias in Parameter Estimation

This is the most common and most damaging error. You estimated the cointegration parameters (hedge ratio beta, mean mu, OU parameters) on the full sample — say, 5 years of data — and then backtested the strategy on the same 5 years. This is circular. Of course the spread reverts to the mean you estimated from that same data. The correct approach is walk-forward estimation: estimate parameters on data up to time t, generate signals at time t, then advance t. Never use future data to compute parameters applied to past decisions.

Transaction Costs and Slippage

Stat arb strategies trade frequently. A typical pairs strategy rebalances every 1-5 days with holding periods of 5-20 days. Each round trip involves entering and exiting two positions — four legs in total. At 5 basis points of slippage per leg (conservative for liquid large-caps, aggressive for small-caps), a round trip costs 4 * 5 = 20 basis points of traded notional. A strategy that trades 200 round trips per year, with positions sized at, say, 10 percent of equity each, therefore pays roughly 400 basis points of annual cost drag on the portfolio. If your backtest gross Sharpe is 2.0, the net Sharpe after costs might be 0.6. Small-cap pairs are worse — bid-ask spreads of 20-50 bps destroy profitability entirely.

The Multiple Testing Problem

If you scan 5000 pairs and test each for cointegration at the 5 percent significance level, you expect 250 false positives — pairs that appear cointegrated by pure chance. This is the multiple comparisons problem. If you then backtest these 250 and pick the 15 with the best Sharpe ratios, you have selected on noise twice. The Bonferroni correction (divide your p-value threshold by the number of tests) is the blunt fix. A more sophisticated approach uses the Benjamini-Hochberg procedure to control the false discovery rate at 5 percent instead of the family-wise error rate.
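
The Benjamini-Hochberg procedure is simple enough to sketch directly (statsmodels' multipletests with method='fdr_bh' does the same thing in production): sort the p-values, find the largest k such that p_(k) <= (k/m) * alpha, and accept the k smallest as discoveries:

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Benjamini-Hochberg: control the false discovery rate at `fdr`.
    Returns a boolean mask of accepted hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    accept = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest passing index
        accept[order[: k + 1]] = True      # accept the k+1 smallest p-values
    return accept
```

Note the key property: a p-value can be accepted even if it exceeds its own threshold, as long as some larger p-value passes; the cutoff is the largest passing rank, not a per-test comparison.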

A backtest Sharpe ratio of 3.0 in pairs trading should be treated as a bug report, not a discovery. The real question is not 'how high is my Sharpe' but 'what did I do wrong to get a Sharpe this high.' In live trading, stat arb Sharpe ratios above 1.5 sustained over multiple years are exceptional. Above 2.0 is world-class. Above 3.0 is almost certainly overfitting.

Capacity Constraints

A strategy that generates 200 basis points of annual alpha on a $50K account may generate zero alpha on a $5M account. Market impact — the price movement caused by your own order — scales roughly with the square root of order size relative to average daily volume. A pair of mid-cap stocks trading $20M per day can absorb a $100K position without meaningful impact. A $2M position moves the market 3-5 basis points per side. A $10M position is effectively untradeable without multi-day execution algorithms. Your backtest assumes instantaneous execution at mid-price. Reality does not cooperate.

Building It Right: A Walk-Forward Framework

Now that we have thoroughly demolished the naive approach, here is how to build a stat arb engine that has a chance of working in production. The core principle is walk-forward validation with rolling parameter re-estimation, combined with economically motivated pair selection.

Rolling Window Cointegration

Re-estimate cointegration parameters monthly using a rolling 12-month window. At the start of each month, for each candidate pair: run the Engle-Granger test on the trailing 252 trading days, estimate the OU parameters on the same window, compute the Hurst exponent, and compute the half-life. Only pairs that pass all three filters (ADF p-value < 0.05, H < 0.45, half-life between 3 and 25 days) enter the tradeable universe for that month. Pairs that were tradeable last month but fail this month are closed immediately.
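
The monthly gate is a simple conjunction. A minimal sketch, where the metric inputs are assumed to come from the estimation code run on the trailing 252-day window and the thresholds mirror the ones above:

```python
def tradeable(adf_pvalue, hurst_exp, half_life_days,
              p_max=0.05, h_max=0.45, hl_min=3.0, hl_max=25.0):
    """Monthly universe filter: a pair is tradeable this month only if it
    passes all three tests on the trailing window."""
    return (adf_pvalue < p_max
            and hurst_exp < h_max
            and hl_min <= half_life_days <= hl_max)
```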

Walk-Forward Validation Protocol

The backtest must mirror live operation exactly. Split the history into contiguous train/trade windows: estimate the hedge ratio, spread mean, and OU parameters on the trailing 12-month window, trade the following month using only those frozen parameters, then roll both windows forward and re-estimate. Performance is measured exclusively on the out-of-sample trading months. If any parameter applied in a given month was computed using data from that month or later, the backtest is contaminated.

Entry, Exit, and Position Sizing

Compute the z-score of the current spread: z(t) = (S(t) - mu) / sigma_S, where mu and sigma_S are estimated from the training window. Entry signal: open a mean-reversion position when |z| > 2.0 (long the spread when z < -2.0, short the spread when z > 2.0). Exit signal: close the position when |z| < 0.5 (spread has reverted close to the mean). Stop-loss: close the position if |z| > 4.0 (spread has diverged further instead of reverting) or if the position has been open for more than 2 * half_life days (time stop).
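
The entry/exit/stop rules above are a small state machine. A sketch (the time stop is omitted because it needs trade timestamps; in a real engine you would also track the entry date and force an exit after 2 * half_life days):

```python
def signal(z, position, entry=2.0, exit_band=0.5, stop=4.0):
    """Target position in the spread given the current z-score and the
    currently held position (+1 long spread, -1 short spread, 0 flat)."""
    if position == 0:
        if z < -entry:
            return +1        # spread cheap: long the spread
        if z > entry:
            return -1        # spread rich: short the spread
        return 0
    # In a position: exit on reversion toward the mean, or stop out on
    # further divergence instead of reversion.
    if abs(z) < exit_band or abs(z) > stop:
        return 0
    return position
```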

For position sizing, the Kelly criterion adapted for mean-reversion gives optimal fraction f = (p * W - q * L) / W, where p is the probability of mean reversion (estimated from the training window hit rate), W is the average winning trade P&L, L is the average losing trade P&L, and q = 1 - p. In practice, use half-Kelly (f/2) or quarter-Kelly (f/4) because parameter estimation uncertainty inflates the true Kelly fraction. Size each pair position so that a 4-sigma spread move results in a maximum loss of 2 percent of portfolio equity.
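
The Kelly formula above, with the fractional scaling applied, is a one-liner worth writing down so the clipping behavior is explicit:

```python
def kelly_fraction(p, avg_win, avg_loss, scale=0.25):
    """Kelly fraction f = (p*W - q*L) / W from the text, scaled down
    (quarter-Kelly by default) to account for parameter estimation error.
    Clipped at zero: a negative-edge trade gets no allocation."""
    q = 1.0 - p
    f = (p * avg_win - q * avg_loss) / avg_win
    return max(0.0, f * scale)
```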

Universe Selection: Economics Over Statistics

Do not scan 5000 random pairs. Select candidate pairs with an economic rationale: same sector, similar market capitalization, same supply chain, same regulatory environment. KO and PEP (Coca-Cola and PepsiCo) are cointegrated because they sell substitute products to the same customers in the same geographies — there is an economic force that binds their valuations. Two random stocks that happen to pass a cointegration test on historical data have no such binding force, and the statistical relationship will break at the first regime change.

The Regime Problem

Cointegration relationships are not constants of nature. They are regime-dependent. During normal market conditions, KO and PEP are tightly cointegrated. During a sector rotation out of consumer staples, or a hostile takeover bid for one of them, or a global pandemic that shifts consumption patterns — the spread can diverge permanently. The regime transition happens precisely when you are maximally positioned: you entered at z = 2.0, the spread moved to z = 3.0 so you added to the position (or held), and now z = 6.0 and the pair is no longer cointegrated. This is how stat arb funds blow up.

Hidden Markov Models (HMMs) offer a framework for regime detection. Fit a 2-state or 3-state HMM to the spread dynamics: State 1 (mean-reverting, low volatility), State 2 (mean-reverting, high volatility), and State 3 (trending, cointegration broken). When the Viterbi-decoded state transitions from State 1 or 2 to State 3, exit all positions immediately and halt trading on that pair until cointegration is re-established. The HMM does not predict regime changes — it detects them after a few observations, which is still far faster than a stop-loss that triggers only on spread level.
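
To make the detection step concrete, here is a minimal Viterbi decode for a Gaussian-emission HMM with hand-specified parameters: a calm low-volatility state and a turbulent high-volatility state for the spread changes. In practice you would fit the means, variances, and transition matrix from data (e.g. with hmmlearn's GaussianHMM) rather than assume them:

```python
import numpy as np

def viterbi_gaussian(obs, means, stds, trans, start):
    """Most likely state path for a Gaussian-emission HMM (log domain)."""
    obs = np.asarray(obs, dtype=float)
    n, k = len(obs), len(means)
    # Log-likelihood of each observation under each state's Gaussian
    log_emit = (-0.5 * ((obs[:, None] - means) / stds) ** 2
                - np.log(stds) - 0.5 * np.log(2 * np.pi))
    log_t = np.log(trans)
    delta = np.log(start) + log_emit[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = delta[:, None] + log_t      # cand[i, j]: best path ending i -> j
        back[t] = np.argmax(cand, axis=0)
        delta = np.max(cand, axis=0) + log_emit[t]
    states = np.zeros(n, dtype=int)
    states[-1] = int(np.argmax(delta))
    for t in range(n - 2, -1, -1):         # backtrack the best path
        states[t] = back[t + 1, states[t + 1]]
    return states
```

With sticky transition probabilities (0.95 on the diagonal), the decoded path ignores isolated outliers and flips state only when the evidence persists, which is exactly the behavior you want from a regime kill-switch.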

Long-Term Capital Management had the most sophisticated convergence trading operation in history, run by two Nobel laureates in economics. Their models were correct — the spreads did eventually converge. But the market can stay divergent longer than you can stay solvent. Position sizing and regime awareness are not optional risk controls. They are the strategy.

Modern Stat Arb vs. Classical Pairs Trading

Classical pairs trading — two stocks, one spread, z-score entry and exit — is what you find in textbooks and blog posts. Modern statistical arbitrage at institutional scale looks nothing like this. The evolution from classical to modern involves three major shifts: from pairs to baskets, from regression to factor models, and from static parameters to adaptive machine learning.

From Pairs to Baskets and Factor Residuals

Instead of trading a single pair, modern stat arb constructs a portfolio of 20-50 stocks where the portfolio is market-neutral and factor-neutral. The construction starts with PCA (Principal Component Analysis) on the cross-section of stock returns. The first 5-10 principal components capture systematic risk factors (market, sector, value, momentum, size). The residual — what is left after removing factor exposure — is the idiosyncratic component. Stat arb trades the mean reversion of these residuals. Because the portfolio is neutralized against the dominant risk factors, the residual returns are more stationary and the strategy is less exposed to regime changes in factor premia.
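
A minimal sketch of the residual extraction via SVD, assuming a T x N panel of daily returns. Real implementations add rolling estimation windows, shrinkage, and explicit neutralization constraints, but the core projection is this:

```python
import numpy as np

def factor_residuals(returns, n_factors=5):
    """Remove the top principal components from a T x N return panel and
    keep the idiosyncratic residuals that stat arb trades."""
    R = returns - returns.mean(axis=0)       # demean each stock's returns
    # Right singular vectors of the panel are the PCA factor loadings
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    loadings = Vt[:n_factors]                # n_factors x N
    factors = R @ loadings.T                 # T x n_factors factor returns
    systematic = factors @ loadings          # projection onto the factor span
    return R - systematic                    # idiosyncratic residuals
```

The residual panel is orthogonal to the removed components by construction, so a portfolio built purely from these residuals carries no exposure to the top factors over the estimation window.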

Machine Learning Approaches

The frontier of stat arb uses gradient-boosted trees and recurrent neural networks to predict spread dynamics. Instead of assuming a fixed OU process, the model learns the conditional mean and variance of spread changes as a function of features: spread level, spread momentum, implied volatility ratio, sector ETF flows, earnings calendar proximity, short interest differential, and order book imbalance. XGBoost with 50-100 features trained on rolling 3-year windows can capture nonlinear mean-reversion patterns that the OU model misses entirely — at the cost of requiring an order of magnitude more data and rigorous cross-validation to avoid overfitting.

Practical Advice for Implementation

If you are building a stat arb engine from scratch, start with ETF pairs rather than individual stocks. ETFs have lower idiosyncratic risk, higher liquidity, tighter spreads, and more stable cointegration relationships. GLD/GDX (gold vs. gold miners), EWA/EWC (Australia vs. Canada — both commodity economies), XLF/KRE (financials vs. regional banks) — these pairs have economic linkages that provide structural cointegration, and the ETF structure smooths out the single-stock risks that destroy individual equity pairs.

The honest truth about stat arb in 2026: it works, but the easy alpha is gone. The strategies that generate meaningful risk-adjusted returns require either significant infrastructure (data pipelines, execution systems, risk management frameworks) or a genuine informational or structural edge (access to alternative data, co-location for execution speed, or domain expertise in niche markets like commodity pairs or cross-listed equities). Building the engine correctly is table stakes. The alpha comes from what you feed into it.

Build Quantitative Systems with Accelar

Accelar brings deep quantitative engineering to every system we build — from statistical modeling and time-series analysis to production-grade data pipelines and real-time decision engines. If you are building systems where mathematical rigor meets engineering discipline, whether in finance, operations research, or data-intensive applications, let's talk.
