How to Know If Your Backtest Is Real

⚠️ Personal research and trading journal — not investment advice. The author does not provide licensed advisory services.

Most retail backtests are not real.

Not because the trader is dishonest — but because the standard way of backtesting a trading system contains several failure modes that almost nobody checks for. You can run a system across twenty years of data, see strong returns, and still have nothing. The returns can be entirely explained by lucky years, data artifacts, or a regime that no longer exists.

I know this because I ran those backtests. Then I built better tests, and watched most of my "discoveries" disappear.

Here are the four tests that matter. A result that can't pass all four isn't ready to trade with real money.

Test 1: Split by year. Does the median year win?

The most common failure mode in backtesting is what I call the tail-year trap.

You run a system across 20 years. The mean return is impressive. But when you split by year and sort the results, you find that two or three exceptional bull years account for most of the mean, and in the remaining seventeen or eighteen years the system barely breaks even or loses.

This is not a system. This is a lottery ticket that wins big in exceptional conditions and breaks even otherwise.

The fix: after any backtest, split the results by year and look at the median year — not the average. If the median year is near zero or negative while the mean looks good, the edge is concentrated in tail years. You're holding a coin flip with a fat right tail, not a real system.

My assumption audit found exactly this. Several "improvements" to the contracting-base setup produced dramatically higher means. When I split by year, the extra return came almost entirely from 2020 (a year when one group had 6 trades and produced extreme R because of the COVID recovery bounce). The median year was actually worse than the baseline.

The mean told one story. The median year told the truth.
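A minimal sketch of the by-year split, assuming each trade is stored as an (exit date, result) pair with the result in R or percent. The function name, the data layout, and the numbers are illustrative, not taken from my actual tooling.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, median


def yearly_summary(trades):
    """Sum trade results per calendar year, then compare mean and median year.

    `trades` is an iterable of (exit_date, result) pairs. The layout is an
    assumption made for this sketch.
    """
    by_year = defaultdict(float)
    for exit_date, result in trades:
        by_year[exit_date.year] += result
    totals = dict(sorted(by_year.items()))
    return {
        "by_year": totals,
        "mean_year": mean(totals.values()),
        "median_year": median(totals.values()),
    }


# Made-up numbers: one exceptional year carries the whole average.
trades = [
    (date(2017, 3, 1), -0.5),
    (date(2018, 6, 1), 0.2),
    (date(2019, 9, 1), -0.3),
    (date(2020, 5, 1), 6.0),
    (date(2020, 8, 1), 4.0),
    (date(2021, 2, 1), 0.1),
]
summary = yearly_summary(trades)
print(summary["by_year"])
print("mean year:", summary["mean_year"], "median year:", summary["median_year"])
```

In this toy sample the mean year is about 1.9 while the median year is 0.1, which is the tail-year pattern in miniature. Years with no trades do not appear in the dictionary at all; whether to count them as zero is a judgment call that changes the median.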

Test 2: Bootstrap CI. Does the confidence interval exclude zero?

Even if the median year looks good, you need to know whether the result could plausibly be explained by chance.

A bootstrap confidence interval answers this. You take your trade sample, resample it 1,500–10,000 times with replacement, compute the metric (median return, net R, whatever you're measuring) on each resample, and look at the distribution of results. The 95% CI is the range between the 2.5th and 97.5th percentiles of those resampled values, so it contains the middle 95% of them.

If the CI spans zero — if the lower bound is negative and the upper bound is positive — the result is statistically consistent with a coin flip. You can't rule out that your "edge" is random noise.

If the CI excludes zero on the positive side — if even the pessimistic end of the distribution is positive — you have evidence of a real edge.
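The procedure can be sketched in a few lines of standard-library Python. This is a plain percentile bootstrap over individual trades, which assumes the trades are roughly independent; clustered trades may call for resampling in blocks instead. The function names, the default of 2,000 resamples, and the sample values are illustrative assumptions.

```python
import random
from statistics import median


def bootstrap_ci(values, stat=median, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for `stat` over a list of per-trade results."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement, same size as the original sample.
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * (n_resamples - 1))]
    hi = stats[int((1.0 - tail) * (n_resamples - 1))]
    return lo, hi


def excludes_zero_on_positive_side(ci):
    """True when even the lower bound of the interval is above zero."""
    lo, _hi = ci
    return lo > 0


# Made-up samples: one symmetric around zero, one consistently positive.
noise = [-1.0, 1.0] * 20
steady = [0.5, 1.0, 1.5, 2.0, 0.8, 1.2] * 5

noise_ci = bootstrap_ci(noise)
steady_ci = bootstrap_ci(steady)
print("noise:", noise_ci, excludes_zero_on_positive_side(noise_ci))
print("steady:", steady_ci, excludes_zero_on_positive_side(steady_ci))
```

The symmetric sample produces an interval that straddles zero, and the consistently positive one does not. Passing `stat=sum` or another callable swaps the metric without changing the resampling.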

One important trap: a bootstrap CI is only valid if your underlying data is clean. I learned this the hard way. My Information Coefficient on US RS rankings was 0.26, which is very high. After I added a data-freshness filter (requiring each stock's latest data to be no more than 1 day old), the same signal went to −0.09. The CI had been confident in a number that didn't exist. Garbage in, confidence interval out.
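A freshness filter of this kind can be as small as the sketch below. It assumes you have, for each symbol, the date of its most recent price bar. The names and the use of calendar days are assumptions for illustration; with calendar days, a Friday bar counts as stale on Monday, so a trading-day calendar may be the better choice in practice.

```python
from datetime import date, timedelta


def is_fresh(last_bar_date, as_of, max_age_days=1):
    """True if the latest bar is at most `max_age_days` older than `as_of`.

    Bars dated after `as_of` are rejected too, since they point to a data error.
    """
    age = as_of - last_bar_date
    return timedelta(0) <= age <= timedelta(days=max_age_days)


def filter_fresh(last_bars, as_of, max_age_days=1):
    """Keep only symbols whose latest bar passes the freshness check.

    `last_bars` maps symbol -> date of that symbol's most recent bar.
    """
    return {
        symbol: bar_date
        for symbol, bar_date in last_bars.items()
        if is_fresh(bar_date, as_of, max_age_days)
    }


# Hypothetical symbols and dates.
as_of = date(2024, 5, 10)
last_bars = {
    "AAA": date(2024, 5, 10),  # current
    "BBB": date(2024, 5, 9),   # one day old, still accepted
    "CCC": date(2024, 4, 30),  # stale, dropped
    "DDD": date(2024, 5, 11),  # dated in the future, dropped
}
fresh = filter_fresh(last_bars, as_of)
print(sorted(fresh))
```

Running the filter before computing any ranking metric means stale symbols never enter the sample, so the statistics downstream are not measuring outdated prices.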

Test 3: Date-matched baseline. Does it beat random?

This is the test most research skips, and the one that kills the most "discoveries."

The intuition: if you're claiming that a specific setup or signal has edge, you need to compare it to the alternative — what would have happened if you'd picked a random stock from the same qualifying universe on the same date?

Why this matters: bull markets lift all stocks. IBD analysts describe a stock as "in the buy zone" more often in bull markets. So a naive backtest of "stocks mentioned by IBD in the buy zone" will show strong returns, but so will any RS≥80 stock picked on the same date, in the same regime. The phrase doesn't add anything; the market regime does.

The right baseline: for every signal or mention, draw 5 random RS≥80 stocks active on the same date. Compute the same return metric. Compare.

If your signal doesn't beat the date-matched baseline on median return + drop-top-3 robustness + hit-rate threshold, the signal is not adding anything beyond what a random qualified stock on that date would have given you.
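Here is a rough sketch of the date-matched draw, under several assumptions: the qualifying universe per date is already built (for example, all RS≥80 stocks active that day), a `forward_return` callable supplies the return metric, the signal stock itself is excluded from its own peer draw, and "drop-top-3" means recomputing after removing the three best results. All names are hypothetical, and the hit-rate threshold is left out because it depends on the study.

```python
import random
from statistics import median


def date_matched_baseline(signals, universe_by_date, forward_return, k=5, seed=0):
    """Collect signal returns and returns of k random same-date peers.

    signals: list of (day, symbol) pairs for the setup being tested.
    universe_by_date: day -> list of symbols that qualified on that day.
    forward_return: callable (day, symbol) -> return over the test horizon.
    """
    rng = random.Random(seed)
    signal_returns, baseline_returns = [], []
    for day, symbol in signals:
        # Assumption: the signal stock is not allowed to be its own peer.
        peers = [s for s in universe_by_date.get(day, []) if s != symbol]
        if not peers:
            continue  # no comparable stocks that day, so skip the signal too
        picks = rng.sample(peers, min(k, len(peers)))
        signal_returns.append(forward_return(day, symbol))
        baseline_returns.extend(forward_return(day, p) for p in picks)
    return signal_returns, baseline_returns


def drop_top(values, n=3):
    """Values sorted ascending with the n largest removed."""
    ordered = sorted(values)
    return ordered[: max(len(ordered) - n, 0)]


def hit_rate(values):
    """Share of results above zero."""
    return sum(1 for v in values if v > 0) / len(values) if values else 0.0


# Toy data: two signal days with made-up returns.
universe = {
    "d1": ["SIG1", "A", "B", "C", "D", "E", "F"],
    "d2": ["SIG2", "A", "B", "C"],
}
toy_returns = {"SIG1": 0.08, "SIG2": 0.06, "A": 0.05, "B": -0.02,
               "C": 0.03, "D": 0.01, "E": 0.07, "F": -0.04}
sig, base = date_matched_baseline(
    [("d1", "SIG1"), ("d2", "SIG2")], universe, lambda day, sym: toy_returns[sym]
)
print("signal median:", median(sig), "baseline median:", median(base))
print("signal hit rate:", hit_rate(sig), "baseline hit rate:", hit_rate(base))
```

The comparison itself is then the same metric computed on both lists: median, median after `drop_top`, and hit rate. A signal that wins only before the top results are dropped is leaning on a handful of outliers.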

I ran this test on 28 phrases from my IBD transcript corpus. Twenty-six failed. The two that looked like they survived fell apart under bootstrap CI or the by-year split.

Test 4: Walk-forward. Does it hold out-of-sample?

The first three tests can still be gamed — unintentionally — if you've looked at the data while building the system. The researcher effect: you've run enough tests that your brain has pattern-matched to this particular dataset. The rules you wrote encode the past rather than a real mechanism.

The only defense is out-of-sample testing: hold back a portion of the data — the most recent years, or every other year in a rolling window — and run the system on that held-out data with no further adjustments.

If the system holds up in the periods it never saw during construction, you have real evidence of generalization. If it falls apart, you've found overfitting.

For my core systems, I use a multi-window walk-forward: split the full history into training and test windows, roll forward, and look at whether the system performs consistently across held-out periods. The results need to be consistent — not just one or two good windows — to count as validated.
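One way to generate such windows is sketched below, assuming whole calendar years as the unit and a fixed-length training window that rolls forward by one test block at a time. The window sizes are illustrative, not the ones I actually use.

```python
def walk_forward_windows(years, train_size, test_size):
    """Yield (train_years, test_years) pairs that roll forward through time.

    Each test block starts right after its training block, so the test
    years are never part of the data used to set the rules for that window.
    """
    years = sorted(years)
    start = 0
    while start + train_size + test_size <= len(years):
        train = years[start : start + train_size]
        test = years[start + train_size : start + train_size + test_size]
        yield train, test
        start += test_size  # advance by one test block


windows = list(walk_forward_windows(range(2005, 2025), train_size=10, test_size=2))
for train, test in windows:
    print(f"train {train[0]}-{train[-1]} -> test {test[0]}-{test[-1]}")
```

With twenty years of history, ten training years, and two test years, this yields five held-out blocks. Rules are tuned on the training years only and then frozen before the test years are scored; the consistency judgment is made across all the test blocks together.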

A system that looks great in-sample but degrades significantly out-of-sample isn't ready. It needs more data, a simpler rule, or a better mechanism.

Putting it together

Four tests. A result needs to pass all four:

1. Median year positive: edge is consistent, not tail-concentrated
2. Bootstrap CI excludes zero: on clean data
3. Beats date-matched baseline: on median + drop-top-3 + hit-rate
4. Walk-forward holds out-of-sample: across multiple held-out periods

Most retail discoveries fail at Test 1 or Test 2. Most published "strategies" have never been put through Test 3 at all. Test 4 is where the serious work lives.

My core system — contracting-base breakout with RS≥80 + confirmed uptrend — passes all four in Thailand. The edge is smaller than the raw backtest suggests, and it only exists inside certain market conditions. But it passes. That's the level of evidence I need to trade it with real money.

That's the bar. It's a high bar. It should be.

Track. Study. Wait. Strike.

MOEasymmetry