Systematic investing is in oversupply. Equity statistical arbitrage managers can run thousands of signals across hundreds of stocks. Macro quant teams can test dozens of model variants across rates, FX, commodities, and futures curves. The result is a flood of hypothetical performance—often presented with the confidence of a live record.

Backtests still matter. They are the first test of whether an idea has any chance of surviving. But they are also the easiest place to manufacture a “great” strategy by trying enough variants until something looks exceptional.

The most important upgrade to allocator thinking is this: the backtest problem is not just optimism. It is multiple testing. Campbell Harvey and Yan Liu formalize this in “Backtesting” and show how “routine Sharpe haircuts” should be replaced by an explicit framework that accounts for the number of strategies tried and their correlation.

Define the Evidence: Backtest vs Simulated vs Live

Before debating returns, identify what you are actually looking at.

  • Backtest: Strategy rules applied to historical data with assumed execution, costs, and constraints. Useful for understanding behavior. Not a track record.

  • Simulated / pro forma / model track record: A time series presented in “performance format,” often spanning periods when the strategy did not run operationally. Treat as marketing unless the full production context is documented.

  • Paper trading: Live signal generation without capital. Valuable because it tests data availability, timing, and production discipline. Still not proof of execution quality.

  • Live track record (net): Real capital, real slippage, real operational constraints, ideally audited. This is the only “track record” that counts.

Allocator rule: separate evidence tiers and weight them accordingly. A clean year of live results under the stated process tells you more than a decade of simulated returns built in a research notebook.

The Failure Modes: Why Equity Curves Lie

Most allocator disappointment comes from predictable failure modes.

Research bias: false edges
  • Overfitting / data mining: Testing many signals, parameters, universes, and start dates until one works.

  • Look-ahead and leakage: Using information that was not available at the time (especially common with corporate actions, fundamentals, and alternative data); a minimal sketch follows this list.

  • Survivorship and selection bias: Using a “clean” universe that quietly excludes failures.

  • Regime cherry-picking: Building a model that “works” because the chosen sample embeds a single dominant regime.
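
Look-ahead bias is frequently a one-line mistake: the position for day t is computed with information that only becomes known at or after day t's close. A minimal sketch of the difference, using a toy signal and hypothetical data in pandas (nothing here corresponds to a real strategy):

```python
import numpy as np
import pandas as pd

# Hypothetical daily closes; in practice this is the point-in-time price history.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
returns = prices.pct_change()

# Toy signal built from day-t information (any signal computed on day-t data behaves the same way).
signal = np.sign(returns)

# WRONG: the day-t signal captures the very day-t return it was built from.
leaky = (signal * returns).mean() * 252

# RIGHT: lag the signal so the day-t position uses only information through day t-1.
honest = (signal.shift(1) * returns).mean() * 252

print(f"with leakage:    {leaky:.1%} annualized")
print(f"without leakage: {honest:.1%} annualized")
```

The same mechanics appear in subtler forms with restated fundamentals and backfilled alternative data.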

Key point from Harvey–Liu: the common “just haircut the Sharpe by 50%” approach is not a robust fix. When the research process involves many trials, the inflation in the reported Sharpe is structural. Their framework makes this explicit: the appropriate haircut depends on how many strategies were tested and how correlated they were.
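
The structural nature of the inflation is easy to demonstrate: generate strategies with no edge at all, test enough of them, and the best in-sample Sharpe starts to look fundable. A minimal simulation sketch (all parameters hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

n_trials = 1_000       # strategy variants tried in research
n_days = 252 * 5       # five years of daily returns
daily_vol = 0.01       # 1% daily volatility, zero true mean by construction

# Every "strategy" is pure noise, so the expected Sharpe of each one is zero.
returns = rng.normal(0.0, daily_vol, size=(n_trials, n_days))
sharpes = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(f"average Sharpe across trials: {sharpes.mean():+.2f}")
print(f"best in-sample Sharpe:        {sharpes.max():+.2f}")
```

Correlated variants inflate the maximum less than independent ones, which is exactly why Harvey–Liu condition the adjustment on both the number of trials and their correlation.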

Implementation bias: edges that don’t survive markets
  • Costs and impact ignored: Bid/ask spreads, commissions, slippage, market impact, and borrow costs erase gross returns.

  • Liquidity and constraints ignored: Trading “small-cap-like” liquidity with institutional AUM assumptions.

  • Latency and timing mismatch: Using end-of-day data as if it were tradable at the close; assuming instant rebalancing across instruments.

Portfolio construction bias: hidden exposures
  • “Market neutral” that loads on value/momentum/carry at the wrong times.

  • Macro models that are effectively long duration (or short volatility) in disguise.

  • Performance driven by one period of structural beta rather than repeatable alpha.

Allocator stance: treat any smooth curve as a red flag until the manager proves it’s not an artifact of testing and assumptions.

What “Good” Looks Like: Robustness and Falsification Tests

A credible backtest does not try to impress. It tries to survive attempts to kill it.

Statistical robustness
  • Walk-forward / out-of-sample testing: Results must hold when the model is trained on one window and tested on the next; a minimal harness is sketched after this list.

  • Sensitivity analysis: Small changes to parameters should not flip the strategy from “great” to “dead.”

  • Multiple-testing discipline: This is where the Harvey–Liu paper is directly actionable.
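
A walk-forward harness does not need to be elaborate. The sketch below assumes a generic `signal_fn` supplied by the researcher; the function name, window lengths, and toy data are placeholders rather than anyone's production setup:

```python
import numpy as np

def walk_forward_sharpes(returns, signal_fn, train_days=756, test_days=126):
    """Fit on a rolling training window, then evaluate on the following out-of-sample window.

    `returns` is a 1-D array of daily returns; `signal_fn(train)` returns the position
    (here a single -1/0/+1) to hold over the next test window. Both are placeholders.
    """
    oos_sharpes = []
    start = 0
    while start + train_days + test_days <= len(returns):
        train = returns[start : start + train_days]
        test = returns[start + train_days : start + train_days + test_days]
        pnl = signal_fn(train) * test
        oos_sharpes.append(pnl.mean() / pnl.std() * np.sqrt(252))
        start += test_days
    return np.array(oos_sharpes)

# Toy check: a naive trend rule on pure noise should scatter around zero out of sample.
rng = np.random.default_rng(7)
noise = rng.normal(0, 0.01, 252 * 10)
print(np.round(walk_forward_sharpes(noise, lambda train: np.sign(train.mean())), 2))
```

The allocator question is whether the distribution of out-of-sample windows is consistently positive, not whether one in-sample fit looks spectacular.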

What to ask for (explicitly):

  1. How many strategies were tried before selecting the reported one?

  2. How correlated were those trials (same universe, similar signals, same holding period)?

  3. Can they provide a multiple-testing-adjusted Sharpe or an equivalent “haircut Sharpe” using a documented method? (Harvey–Liu propose one; the goal is transparency.)
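
As a starting point for item 3, a simplified Bonferroni-style version of the haircut takes only a few lines: convert the reported Sharpe to a t-statistic, tighten the significance threshold for an effective number of independent trials, and convert back. This is a rough sketch in the spirit of the Harvey–Liu procedure, not their exact method, and the correlation adjustment below is a crude placeholder:

```python
import numpy as np
from scipy import stats

def haircut_sharpe(sharpe_ann, years, n_trials, avg_corr=0.0):
    """Bonferroni-style haircut of an annualized Sharpe ratio (rough sketch).

    Uses the approximation t-stat = Sharpe * sqrt(years) and deflates the trial count
    to an "effective" number of independent trials via a crude linear interpolation
    between fully independent (avg_corr = 0) and fully redundant (avg_corr = 1) trials.
    Harvey-Liu treat the correlation structure more carefully.
    """
    t_stat = abs(sharpe_ann) * np.sqrt(years)
    p_single = 2 * (1 - stats.norm.cdf(t_stat))            # two-sided p-value for one test
    n_eff = 1.0 + (n_trials - 1.0) * (1.0 - avg_corr)      # crude effective trial count
    p_adj = min(1.0, p_single * n_eff)                     # Bonferroni adjustment
    t_adj = stats.norm.ppf(1 - p_adj / 2) if p_adj < 1 else 0.0
    return t_adj / np.sqrt(years)                          # back to an annualized Sharpe

# Reported Sharpe of 1.5 over 5 years after ~200 fairly correlated trials (all numbers hypothetical).
print(f"haircut Sharpe: {haircut_sharpe(1.5, years=5, n_trials=200, avg_corr=0.4):.2f}")
```

The exact formula matters less than the disclosure: with the trial count and correlation on the table, the haircut can be debated rather than guessed.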

Why this matters in practice:

  • Equity stat arb often has a massive research search space (features, lags, universes, neutralization schemes). That calls for harsher adjustments and higher profit hurdles.

  • Macro trend strategies can be simpler, with fewer degrees of freedom (trend horizon, scaling, portfolio construction). That reduces—but does not eliminate—multiple-testing risk.

Economic robustness
  • A coherent rationale: why the signal should exist and persist after discovery.

  • Attribution: show how much performance is explained by known premia (value, momentum, carry, term, credit, vol selling) versus residual alpha.
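
Attribution can be requested in a very concrete form: regress the strategy's return stream on the candidate premia and look at what is left. A minimal least-squares sketch; the factor names and data below are hypothetical stand-ins for the manager's actual returns and published factor series:

```python
import numpy as np

def attribution(strategy_returns, factor_returns, factor_names):
    """OLS of monthly strategy returns on factor returns; prints loadings and annualized alpha."""
    X = np.column_stack([np.ones(len(strategy_returns)), factor_returns])
    coefs, *_ = np.linalg.lstsq(X, strategy_returns, rcond=None)
    print(f"annualized alpha: {coefs[0] * 12:.2%}")
    for name, beta in zip(factor_names, coefs[1:]):
        print(f"loading on {name:<9}: {beta:+.2f}")

# Hypothetical example: a "market neutral" book that secretly loads on momentum.
rng = np.random.default_rng(1)
factors = rng.normal(0, 0.02, size=(120, 3))                 # stand-ins for value, momentum, carry
strategy = 0.6 * factors[:, 1] + rng.normal(0.001, 0.01, 120)
attribution(strategy, factors, ["value", "momentum", "carry"])
```

If most of the Sharpe disappears into the loadings, the allocator is being asked to pay alpha fees for factor exposure.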

Regime robustness
  • Performance by regime: inflation shocks, rate shocks, volatility spikes, crises; a simple bucketing sketch follows this list.

  • Drawdown mechanics consistent with stated portfolio role.
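
The regime cut does not require anything exotic: label each period, bucket the returns, compare. A minimal pandas sketch; the regime labels and returns are hypothetical, and in practice the regime definitions should be documented before the analysis is run:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly returns with a regime label attached to each month.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ret": rng.normal(0.004, 0.02, 240),
    "regime": rng.choice(["calm", "rate shock", "vol spike", "crisis"], 240),
})

by_regime = df.groupby("regime")["ret"].agg(mean="mean", vol="std", worst="min", months="count")
by_regime["ann_sharpe"] = by_regime["mean"] / by_regime["vol"] * np.sqrt(12)
print(by_regime.round(3))
```

The toy numbers are meaningless; the shape of the table is the point. A strategy sold as a diversifier or crisis alpha should not have its worst row in the bucket it was hired for.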

From Backtest to Investable: The Friction Bridge

Even an honest backtest can be uninvestable.

Gross-to-net is the real battleground

Require a gross-to-net waterfall:
Backtest gross → realistic costs → impact → constraints → fees → expected net
Insist that the assumptions are conservative, not a “best execution” fantasy.
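
The bridge is easy to standardize in diligence materials. A minimal sketch; every number below is illustrative rather than an estimate for any real strategy:

```python
# Illustrative annualized figures only.
gross = 0.12
drags = {
    "spreads & commissions": 0.025,
    "market impact": 0.015,
    "constraints & borrow": 0.010,
    "management & performance fees": 0.030,
}

level = gross
print(f"{'backtest gross':<38}{level:7.1%}")
for label, drag in drags.items():
    level -= drag
    print(f"{'after ' + label:<38}{level:7.1%}")
```

The purpose is not the arithmetic; it is forcing the manager to own every line of the bridge rather than quoting the top number.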

Capacity and crowding
  • Equity stat arb is capacity-limited by liquidity, turnover, and the competitive nature of the edge. If the strategy needs high turnover, assume returns decay with AUM; a rough impact-based sketch follows this list.

  • Macro quant often scales better in liquid futures but still faces crowding around common signals and stop-outs in stress.
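
The “returns decay with AUM” point can be made concrete with the common square-root impact rule of thumb: per-unit trading cost scales with volatility times the square root of participation. This is a generic approximation, not any manager's cost model, and every parameter below is hypothetical:

```python
import numpy as np

def annual_impact_drag(aum, adv_dollars, turnover, daily_vol=0.02, k=0.7):
    """Rough annual return drag from market impact under a square-root cost model.

    `turnover` is annual notional traded as a multiple of AUM; `adv_dollars` is the
    aggregate dollar ADV of the traded book. k and daily_vol are hypothetical inputs.
    """
    participation = (aum * turnover / 252) / adv_dollars    # average daily trading vs. liquidity
    cost_per_dollar = k * daily_vol * np.sqrt(participation)
    return cost_per_dollar * turnover                       # drag as a fraction of AUM per year

for aum in [100e6, 500e6, 2e9]:
    drag = annual_impact_drag(aum, adv_dollars=5e9, turnover=12)
    print(f"AUM ${aum/1e6:>5,.0f}m: impact drag {drag:.1%}/yr")
```

Sub-linear per-trade costs still compound into a material drag once turnover and size are both high, which is why capacity has to be underwritten at the intended allocation size, not at the AUM of the backtest.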

Governance and production discipline

Investable systematic strategies have:

  • Model versioning and change control

  • Monitoring, incident response, and kill-switches

  • Data lineage and vendor contingency

  • Clear rules on when discretion is allowed (and when it is prohibited)

Harvey–Liu’s practical implication: ask for the profit hurdle—the minimum performance required to claim statistical significance given the sample length, volatility, and the implicit number of tests. If the manager cannot articulate their hurdle, they cannot articulate their uncertainty.
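
The hurdle is simple algebra once the inputs are disclosed: volatility enters through the Sharpe normalization, sample length through the square-root-of-time term, and research breadth through the significance threshold. A minimal sketch using a normal approximation and a Bonferroni-style threshold (a simplification consistent with the haircut sketch earlier, not the exact Harvey–Liu hurdle):

```python
import numpy as np
from scipy import stats

def required_sharpe(years, n_trials=1, alpha=0.05):
    """Minimum annualized Sharpe for two-sided significance at level alpha,
    with a Bonferroni-style tightening for the number of strategies tried."""
    t_crit = stats.norm.ppf(1 - (alpha / max(1, n_trials)) / 2)
    return t_crit / np.sqrt(years)          # using t-stat ~ Sharpe * sqrt(years)

for trials in [1, 20, 500]:
    print(f"trials tried = {trials:>3}: required 5-year Sharpe {required_sharpe(5, trials):.2f}")
```

A manager who can state their own hurdle, with their own trial count behind it, has at least quantified the uncertainty the allocator is being asked to underwrite.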

The Allocator Framework: Decision Rules

Use a rule-based underwriting process that forces clarity.

Decision rules

  1. Define the job: diversifier, crisis alpha, return stream, overlay—each requires different evidence.

  2. Tier the evidence: simulated < paper < live. Don’t let a long simulation substitute for operational reality.

  3. Adjust for multiple testing: demand disclosure of research breadth and a disciplined adjustment (start with Harvey–Liu’s framework). 

  4. Bridge to investability: costs, impact, capacity, constraints, governance.

  5. Pre-commit stop rules: define what failure looks like (drawdown, deviation from expected exposures, breakdown of live execution metrics).

Conclusion

Backtests are necessary. Simulated track records are not proof. The allocator edge is not spotting the best equity curve—it is underwriting how the curve was found, then translating it into net, scalable, governed reality.

The Harvey–Liu framework forces the right question: how many shots did it take to produce this Sharpe? If the answer is “a lot,” the required haircut and profit hurdle rise sharply. That is not cynicism. It is statistics. Allocators who enforce this discipline avoid paying fees for noise.
