Backtesting Honestly: How to Avoid the Curve-Fitting Trap
Backtests are easy to make look great by accident, and very easy to make look great on purpose. Honest backtesting is the difference between learning and self-deception.
A backtest tells you how a strategy would have performed on historical data. Done well, it's the cheapest way to estimate whether a hypothesis has edge. Done badly, which is the default, it's a way to confidently lose money on strategies that look great on paper and fail in real markets. The traps are subtle. Avoiding them requires deliberate discipline.
What a backtest actually does
A backtest takes a precisely-defined strategy and applies it mechanically to historical price data. For each historical candle, it asks:
- Would this candle have triggered an entry condition?
- If so, what's the entry price?
- Where's the stop?
- Where does the trade exit (target, trailing stop, time stop)?
- What's the resulting P&L (or R-multiple)?
Aggregated across hundreds of historical trades, you get an expectancy estimate, win rate, max drawdown, and other performance metrics. Useful, if the test is honest.
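The loop above can be sketched directly. Everything strategy-specific here (the breakout condition, the prior bar's low as the stop, the fixed 2R target) is an illustrative assumption, not a recommendation:

```python
# Minimal per-candle backtest loop, results expressed as R-multiples.
# The strategy rules (20-bar breakout, stop at prior low, 2R target)
# are illustrative assumptions only.

def backtest(candles, lookback=20):
    """candles: list of dicts with open/high/low/close keys.
    Goes long at the NEXT candle's open after a close above the prior
    `lookback`-bar high; exits at +2R or at the stop, whichever hits first."""
    results = []  # R-multiple per completed trade
    i = lookback
    while i < len(candles) - 1:
        window_high = max(c["high"] for c in candles[i - lookback:i])
        if candles[i]["close"] > window_high:
            entry = candles[i + 1]["open"]   # next open: no look-ahead (trap 1 below)
            stop = candles[i]["low"]
            risk = entry - stop
            if risk <= 0:
                i += 1
                continue
            target = entry + 2 * risk
            for j in range(i + 1, len(candles)):
                if candles[j]["low"] <= stop:     # stop checked first (pessimistic fill)
                    results.append(-1.0)
                    break
                if candles[j]["high"] >= target:
                    results.append(2.0)
                    break
            i = j  # resume scanning from the exit bar; unresolved trades are dropped
        i += 1
    return results
```

Note the entry fills at the next candle's open, which is exactly the look-ahead fix described under trap 1 below.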
The traps that ruin backtests
Several specific failure modes turn backtests from learning tools into self-deception:
1. Look-ahead bias. Using information in the test that wouldn't have been available at the moment of decision. Common version: using a candle's close to make an entry decision for that same candle, when in real trading you wouldn't know the close until the candle ended. Backtests that "buy the close" or "sell the high of the day" fall into this.
The fix: every decision in the backtest must use only information available before the decision moment. If you need the candle's close to confirm the setup, the entry happens on the next candle's open, not the current candle's close.
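The fix can be made mechanical. In this sketch, `signal_on_close` is a hypothetical stand-in for any close-confirmed setup:

```python
# Look-ahead guard: a signal confirmed on a candle's close can only fill
# at the NEXT candle's open. `signal_on_close` is a hypothetical predicate
# standing in for any close-confirmed setup.

def honest_entry_price(candles, i, signal_on_close):
    """Return the fill price for a signal confirmed on candles[i]'s close,
    or None if there is no signal or no next candle to fill on."""
    if not signal_on_close(candles[i]):
        return None
    if i + 1 >= len(candles):
        return None  # a signal on the last bar can't be filled honestly
    return candles[i + 1]["open"]  # NOT candles[i]["close"]: that's look-ahead
```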
2. Curve fitting / overfitting. Tweaking parameters until the historical results look great. "50 SMA gives me 20% return; 47 SMA gives me 24%; 53 SMA gives me 18%, clearly 47 SMA is optimal." You've now tuned to the noise of past data, not to a robust signal. The 47 SMA's edge will not persist out-of-sample.
The fix: use widely-watched standard parameters (50/200 SMA, 14 RSI, 12/26/9 MACD) that other traders also watch. Resist the urge to optimize specific numbers. If a strategy only works at one specific parameter setting, it's curve-fit.
3. Survivorship bias. Testing on assets that survived to today. If you backtest "buy and hold the top 100 cryptocurrencies of 2024," you're implicitly excluding all the tokens that died (delisted, zeroed out, pulled the rug). The historical sample looks profitable because the failures are gone.
The fix: include delisted assets in the test universe. For crypto, this data is hard to source; most platforms don't include dead tokens. Be aware of the bias even when you can't fully eliminate it.
4. Selection bias. Testing only the time periods where the strategy looks good. "Buying dips in 2020-2021 was profitable" is true, but only because that was a roaring bull market. The same strategy in 2018 or 2022 was disastrous.
The fix: test across multiple market regimes. A strategy that only works in bull markets isn't really an edge; it's a beta exposure dressed up as a strategy.
5. Data-snooping. Looking at past data, finding a pattern, then "testing" the pattern on the same data you saw it in. The pattern will of course replicate; you found it there. The test is circular.
The fix: form your hypothesis on one set of data (e.g., 2018-2022) and test it on a different set (e.g., 2023-2026). Walk-forward validation (next chapter) formalizes this.
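A minimal split helper, assuming bars are tagged with ISO date strings (so plain string comparison orders them correctly); the cutoff in the usage below mirrors the example periods in the text:

```python
# In-sample / out-of-sample split by date. The hypothesis is formed on the
# first slice and may only ever be scored on the second.

def split_sample(bars, cutoff):
    """bars: list of (iso_date, bar) pairs sorted by date.
    cutoff: ISO date string; ISO format makes string comparison date-ordered."""
    in_sample = [bar for date, bar in bars if date < cutoff]
    out_sample = [bar for date, bar in bars if date >= cutoff]
    return in_sample, out_sample
```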
6. Ignoring transaction costs. A backtest that doesn't subtract fees and slippage produces inflated results. A 0.1R/trade strategy can become flat or negative once realistic fees (5-10 bps per trade round trip) and slippage (variable, larger in illiquid pairs) are included.
The fix: always subtract realistic costs. A useful default for crypto: 5 bps per round trip on majors with limit orders, 15 bps on majors with market orders, much more on alts.
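One way to apply the fix, assuming trades are tracked as R-multiples: convert a round-trip cost quoted in basis points of notional into R and subtract it. The 5 bps default mirrors the text; slippage is a separate assumption you must estimate per market:

```python
# Subtract round-trip costs, quoted in basis points of notional, from a
# per-trade R-multiple. fee_bps defaults mirror the text's crypto figures;
# slippage_bps is an assumption you must estimate per market.

def net_r(gross_r, entry_price, stop_price, fee_bps=5.0, slippage_bps=0.0):
    """R is defined against the entry-to-stop distance, so a cost in price
    units divides by that distance to become a cost in R."""
    risk_per_unit = abs(entry_price - stop_price)
    cost_per_unit = entry_price * (fee_bps + slippage_bps) / 10_000
    return gross_r - cost_per_unit / risk_per_unit
```

Note how the text's warning plays out: a 0.1R edge traded with market orders (15 bps) plus 10 bps of slippage on a 1% stop goes negative after costs.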
7. Position-sizing assumptions. Backtests often assume you can take the trade at any size. In reality, your risk-based sizing might cap you at small positions on volatile alts where the stop is wide. The backtest might show 0.5R per trade but in reality you couldn't size to capture much absolute return.
The fix: include a realistic sizing layer in the backtest that computes position size from stop distance and account risk, exactly as you'd do live.
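A sketch of that sizing layer; the 1% account-risk figure and the no-leverage notional cap are illustrative defaults, not rules from the text:

```python
# Risk-based position sizing: units sized so a stop-out loses a fixed
# fraction of equity, capped by notional. The 1% risk and 100% notional
# cap are illustrative assumptions.

def position_size(equity, entry, stop, risk_fraction=0.01, max_notional_fraction=1.0):
    """Units to trade so that a stop-out loses `risk_fraction` of equity,
    capped so notional never exceeds `max_notional_fraction` of equity."""
    risk_per_unit = abs(entry - stop)
    if risk_per_unit == 0:
        return 0.0  # no defined stop distance: no position
    size = equity * risk_fraction / risk_per_unit
    cap = equity * max_notional_fraction / entry  # notional cap in units
    return min(size, cap)
```

The cap is what the text is pointing at: a wide stop on a volatile alt shrinks the size, so a healthy per-trade R can still translate into little absolute return.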
The structure of an honest backtest
A discipline that catches most traps:
1. Specify the strategy completely (per the hypothesis-testing chapter): conditions, entry, exits, sizing.
2. Use only past-available information. Each decision uses only data that would have been visible at the decision moment.
3. Include realistic friction. Fees, slippage, and realistic position sizing.
4. Test on out-of-sample data. Form the hypothesis on one period; test on a different period.
5. Test across multiple regimes. Bull markets, bear markets, chop. Run separate sub-tests for each regime.
6. Use standard parameters. Don't optimize. If the strategy needs custom-tuned parameters to work, it's probably overfit.
7. Run on multiple assets. A strategy that works on BTC but nothing else is suspect, likely curve-fit to BTC's specific history.
8. Examine drawdowns and streaks. Don't just look at total return. Look at max drawdown, longest losing streak, and worst single-day performance. The strategy's risk profile is in those numbers.
A backtest passing all eight of these is genuinely informative. A backtest that does none of them is roughly useless.
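Step 8's numbers fall straight out of the per-trade R-multiple list. A minimal sketch, with the assumption (not fixed by the text) that drawdown is measured in R on the cumulative-R curve:

```python
# Risk metrics from a sequence of per-trade R-multiples: max drawdown on
# the cumulative-R equity curve, and longest losing streak. Measuring
# drawdown in R (not %) is an assumption made for simplicity here.

def risk_metrics(r_multiples):
    """Returns (max_drawdown_in_R, longest_losing_streak)."""
    equity = peak = 0.0
    max_dd = 0.0
    streak = worst_streak = 0
    for r in r_multiples:
        equity += r
        peak = max(peak, equity)
        max_dd = max(max_dd, peak - equity)   # distance below the running peak
        streak = streak + 1 if r < 0 else 0   # consecutive losers
        worst_streak = max(worst_streak, streak)
    return max_dd, worst_streak
```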
Reading the results honestly
Beyond passing the discipline checks, several patterns in the results signal a more or less trustworthy strategy:
Trustworthy signals:
- Consistent expectancy across regimes (works in bull, bear, and chop, even if magnitudes differ)
- Reasonable max drawdown relative to expected return
- Win rate and average R-multiple in plausible ranges (not a 90% win rate with 0.1R winners; that pattern usually signals a hidden bias)
- Performance varies with parameters but doesn't fall off a cliff (47 SMA gives 22%, 50 gives 20%, 53 gives 18%; smooth decay suggests robustness, spiky variance suggests curve-fitting)
Suspicious signals:
- Spectacular returns with low drawdown (real edges have drawdowns)
- Performance only in specific regimes
- Performance highly sensitive to parameter changes
- Backtested on a small number of assets or short windows
- Backtested by the strategy's author ("look at these results from my system!")
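The parameter-sensitivity signal from the lists above can be turned into a crude automated check. The neighbor-retention threshold here is an arbitrary illustrative choice, not a standard:

```python
# Crude curve-fit detector: if the best parameter's immediate neighbors
# keep less than `neighbor_tolerance` of its return, the peak is a spike,
# not a plateau. The 0.5 threshold is an arbitrary illustrative choice.

def looks_curve_fit(returns_by_param, best_param, neighbor_tolerance=0.5):
    """returns_by_param: dict mapping parameter value -> backtested return."""
    params = sorted(returns_by_param)
    idx = params.index(best_param)
    best = returns_by_param[best_param]
    if best <= 0:
        return False  # nothing worth overfitting to
    neighbors = [returns_by_param[params[i]]
                 for i in (idx - 1, idx + 1) if 0 <= i < len(params)]
    return any(n < neighbor_tolerance * best for n in neighbors)
```

On the text's own numbers: 47/50/53 SMA returning 22%/20%/18% passes (smooth decay), while a lone 24% spike at 47 flanked by near-zero neighbors is flagged.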
A common mistake: trusting the backtest
Even an honest backtest is a projection, not a guarantee. Markets evolve. Regimes change. Strategies that worked for years can stop working. Treating the backtest as the truth about future returns is the next-most-dangerous mistake after a dishonest backtest.
The fix: backtest results are upper-bound estimates, typically optimistic by 30-50% even when honest. Plan as if the live strategy will perform meaningfully worse than the backtest. If you can't tolerate that, you've sized too aggressively.
A common mistake: never re-backtesting
A trader runs a backtest in 2024. Validates it. Trades it through 2026. Never re-runs the test with the new data.
By 2026, the regime has shifted twice. The original backtest's conclusions may no longer hold. The trader is now operating on stale evidence.
The fix: re-backtest periodically (quarterly or after any major regime shift). Compare actual live performance to the projected backtest performance. If they diverge materially, investigate whether the strategy still has edge.
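The live-versus-projected comparison can be sketched as a simple staleness check; the 50% tolerance is an illustrative threshold, not a standard:

```python
# Staleness check: flag a strategy for investigation when recent live
# expectancy falls well below the backtest's projection. The 0.5 tolerance
# is an illustrative assumption.

def diverges_materially(live_r, projected_expectancy, tolerance=0.5):
    """live_r: recent live trades as R-multiples.
    projected_expectancy: the (already-haircut) backtest expectancy in R."""
    if not live_r:
        return False  # no evidence yet either way
    live_expectancy = sum(live_r) / len(live_r)
    return live_expectancy < tolerance * projected_expectancy
```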
A common mistake: backtesting on what you wish was true
A trader thinks they're a great pullback trader. They run a backtest of "buy the 38.2% Fib retracement in uptrends." The result: -2% per trade. They're disappointed. Instead of accepting the result, they tweak: add an RSI filter, add a volume filter, exclude certain assets. The result becomes +0.3% per trade.
The trader concludes their strategy is "with the right filters, profitable." But what they actually did was overfit until the historical data agreed with their wish to be a pullback trader. The strategy will not survive a forward test.
The fix: when a clean backtest fails, the honest move is usually to accept that the original idea doesn't work, at least not in this form on this data. Tweaking until it works is curve-fitting. The original disappointment is the correct answer.
Mental model: backtest as a microscope, not a crystal ball
A microscope shows you what's there in the past data. You can use it to verify properties of the historical sample, test hypotheses against past observations, and rule out strategies that don't even work on data you already have.
What it can't do is predict the future. The future has properties (regime, sentiment, structure) that aren't in the past data. A strategy that worked in the past might or might not continue to work; the backtest can't tell you which.
Use the microscope to eliminate bad strategies. The microscope is highly reliable for falsification (if it doesn't work in the past, it probably won't work in the future). It's much less reliable for confirmation (working in the past is necessary but not sufficient for working in the future).
Why this matters for trading
A backtest is the cheapest way to evaluate whether a hypothesis has even a chance of being edge before you commit real (or paper) capital. Done with discipline, it's high-value. Done sloppily, it produces overconfidence in strategies that will fail in live trading. The next chapter on walk-forward validation extends backtesting into a more robust framework that addresses many of the overfitting risks specifically.
Takeaway
Backtesting tells you how a strategy would have performed on historical data, if the test is honest. The traps that kill honest backtesting: look-ahead bias, curve fitting, survivorship bias, selection bias, data snooping, ignored costs, unrealistic sizing. The disciplined backtest uses only past-available information, includes realistic friction, tests on out-of-sample data, runs across regimes and assets, and uses standard parameters. Trust the microscope to falsify, not to predict. Treat backtest returns as optimistic upper bounds; live performance will be meaningfully worse, by design.
Related chapters
- Hypothesis Testing in Trading: How to Turn a Vague Idea Into a Testable Strategy. Most strategy ideas are too vague to be tested. Reframing them as falsifiable hypotheses is what separates strategy development from strategy daydreaming.
- Walk-Forward Validation: The Backtesting Method That Actually Tests Edge. Walk-forward validation tests whether a strategy's edge survives the most realistic challenge, being deployed on data it hasn't seen yet. It's how pros validate strategies.
- Paper Trading vs Live Trading: What Changes When Real Money Is on the Line. Paper trading proves the strategy works mechanically. Live trading proves you can actually execute it. The gap between the two is where most traders quietly fail.