How I Objectively Choose Between Competing Trading Systems

I run three trading systems simultaneously. In November 2026, one will be selected to trade real capital. The question I needed to answer: how do you pick a winner objectively, without your gut lying to you?

The usual answer is backtest performance. The problem with that answer: backtest performance is exactly where over-optimism hides. A system that looks best in-sample often looks average out-of-sample. The more configurations you tested, the more likely your "best" result is luck dressed up as skill.

This is the multiple-testing problem. And there's a principled way to handle it.

The two tools I use

Deflated Sharpe Ratio (DSR) adjusts the Sharpe Ratio downward based on how many strategies you tested before selecting the winner. The more you searched, the larger the adjustment. A system that looks good after testing 16 variants deserves a bigger haircut than one you built once from first principles and never touched.
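To make the haircut concrete, here is a minimal sketch of the Deflated Sharpe calculation as published by Bailey and López de Prado. It is not the code behind the numbers in this post. The function and parameter names are my own, the Sharpe ratio is assumed to be per-period (not annualized), and kurtosis is assumed to be raw kurtosis (3.0 for a normal distribution).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sharpe_variance: float) -> float:
    """Expected best Sharpe among n_trials strategies that have no real edge.

    sharpe_variance is the variance of the Sharpe estimates across the trials.
    """
    if n_trials < 2:
        return 0.0  # a single trial involves no search, so no haircut
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sharpe: float, n_obs: int, skew: float, kurtosis: float,
                    n_trials: int, sharpe_variance: float) -> float:
    """Probability that the true Sharpe exceeds the luck-only benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    radicand = 1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2
    if radicand <= 0.0:
        raise ValueError("skew/kurtosis inputs give a non-positive variance term")
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / math.sqrt(radicand)
    return _STD_NORMAL.cdf(z)
```

Holding everything else fixed, raising n_trials from 2 to 16 raises the benchmark and lowers the output, which is the "bigger haircut" described above.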

Probability of Backtest Overfitting (PBO) asks: given your full set of strategy variants tested, how likely is it that the one that looked best in-sample would land in the bottom half out-of-sample? A PBO near 0 means your selection process is robust. A PBO near 0.5 means picking the in-sample winner was basically no better than picking at random.
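A rough sketch of how PBO can be estimated with combinatorially symmetric cross-validation (CSCV) follows. It assumes a table of per-period results with one column per variant, and it scores each variant by its mean result, which is a simplifying choice of mine (the original paper typically uses Sharpe). The function name and the default block count are illustrative, not taken from my pipeline.

```python
import math
from itertools import combinations


def pbo_cscv(returns, n_blocks=8):
    """Estimate PBO. returns is a list of rows (periods), one value per variant."""
    n_rows, n_variants = len(returns), len(returns[0])
    block_len = n_rows // n_blocks  # trailing rows that do not fill a block are dropped
    if n_blocks % 2 or block_len == 0:
        raise ValueError("n_blocks must be even and no larger than the row count")
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    def mean_per_variant(block_ids):
        rows = [i for b in block_ids for i in blocks[b]]
        return [sum(returns[i][j] for i in rows) / len(rows) for j in range(n_variants)]

    overfit_splits = 0
    n_splits = 0
    # Every way of choosing half the blocks as in-sample; the rest is out-of-sample.
    for in_ids in combinations(range(n_blocks), n_blocks // 2):
        out_ids = [b for b in range(n_blocks) if b not in in_ids]
        in_perf = mean_per_variant(in_ids)
        out_perf = mean_per_variant(out_ids)
        best = max(range(n_variants), key=in_perf.__getitem__)
        # Out-of-sample rank of the in-sample winner: 1 = worst, n_variants = best.
        rank = sum(1 for p in out_perf if p <= out_perf[best])
        omega = rank / (n_variants + 1)
        logit = math.log(omega / (1.0 - omega))
        overfit_splits += logit <= 0.0  # winner landed at or below the OOS median
        n_splits += 1
    return overfit_splits / n_splits
```

With few variants the relative rank can take only a handful of values, so the estimate moves in coarse steps.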

Neither tool is perfect. But together, they force you to account for the search you did — not just the result you found.

What I tested

Across 20 years of clean walk-forward data (2006–2025), I ran 16 variants of the locked pattern branch configuration. The universe: every RS≥80 breakout trade with the regime filter applied, across 212 months of data, 738 total trades.

The two locked components:

Component	Expectancy	Rank / 16 variants	Bootstrap 95% CI	p(overfit)
first_pullback × Webster Power Trend exit	+0.820R	4th	[+0.45, +1.22]	0.002
failed_reentry × Minervini partial-TP	+0.297R	11th	[+0.16, +0.44]	0.331

PBO = 0.475 — borderline. With only 16 variants, PBO resolution is limited. Read as "not damning, not pristine."

What the numbers mean for January 2027

first_pullback × Webster is the validated workhorse. It ranks 4th out of 16, its expectancy clears the bootstrap null (the bar a strategy must clear to be "real" rather than lucky), and it has a p(overfit) of 0.002 — meaning there's only a 0.2% chance that a random search process would have produced a result this good by luck alone.

The fat right-tail skew (+5.96) tells you where the edge lives: not in many small wins, but in a few large winners — trend runners that the Webster Power Trend exit holds until the index itself confirms the trend is over. That's an edge that makes sense mechanically.

failed_reentry × Minervini partial-TP is real but modest. Its own CI excludes zero (genuine positive expectancy), but it ranks 11th out of 16, meaning 10 other variants looked better in-sample. It doesn't clear the bootstrap null, which is a stricter bar than simply beating zero. The p(overfit) of 0.331 means there's a realistic chance the selection was partly luck.

It still earns its place in the ensemble, but as a decorrelated diversifier, not a standalone star. The two systems have correlation −0.139, which means their results are close to uncorrelated, with a slight negative tilt. Low correlation is not proof of independence, but it is what diversification needs. Adding the weaker system improves the portfolio Sharpe even though it underperforms individually.
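The diversification arithmetic can be sketched with the standard two-asset formula. Only the correlation (−0.139) comes from my analysis. The means and volatilities in the usage lines are hypothetical placeholders chosen for illustration, and the equal weighting is an assumption.

```python
import math


def combined_sharpe(mu_a, sigma_a, mu_b, sigma_b, rho, w_a=0.5):
    """Sharpe of a two-stream blend with weight w_a on stream A."""
    w_b = 1.0 - w_a
    blended_mean = w_a * mu_a + w_b * mu_b
    blended_var = ((w_a * sigma_a) ** 2 + (w_b * sigma_b) ** 2
                   + 2.0 * w_a * w_b * rho * sigma_a * sigma_b)
    return blended_mean / math.sqrt(blended_var)


# Hypothetical inputs: stream A is the stronger system, stream B the weaker one.
strong_alone = 0.8 / 3.0
weak_alone = 0.3 / 1.5
blend = combined_sharpe(0.8, 3.0, 0.3, 1.5, rho=-0.139)
blend_if_lockstep = combined_sharpe(0.8, 3.0, 0.3, 1.5, rho=1.0)
```

With these placeholder numbers the blend beats the stronger stream alone, while the same blend at a correlation of 1.0 does not, so the gain comes from the low correlation rather than from the weaker system's own performance.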

The lesson about tool choice

Sharpe ratio is the wrong primary tool for trend-following systems. The edge lives in the fat right tail: a handful of trades that run +5R, +8R, +12R. Sharpe ratio penalizes this (it treats all volatility, including upside, as risk). The Deflated Sharpe Ratio inherits this flaw: it corrects for skew and kurtosis when judging how reliable the Sharpe estimate is, but the number it deflates is still a Sharpe ratio.

The distribution-free bootstrap — compute the expectancy distribution directly from the data, compare against a null of pure chance — is the principled primary test for these systems. DSR is kept for contrast, not as the main verdict.
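A minimal sketch of a percentile bootstrap for the expectancy confidence interval, assuming a flat list of per-trade R-multiples. The function name, the seed handling, and the percentile method are illustrative choices. The comparison against a chance null is not shown here because its exact construction is not spelled out in this post.

```python
import random


def bootstrap_expectancy_ci(r_multiples, n_draws=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean R per trade."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(r_multiples)
    # Resample the trades with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(r_multiples, k=n)) / n for _ in range(n_draws))
    lower = means[int((alpha / 2.0) * n_draws)]
    upper = means[int((1.0 - alpha / 2.0) * n_draws) - 1]
    return lower, upper
```

Because it resamples the trades themselves, the interval makes no normality assumption, which matters when a few large winners dominate the distribution.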

This matters for how you read any backtest. A system that generates occasional large wins will look mediocre on Sharpe and excellent on expectancy. Trust the metric that matches the edge.
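A toy comparison shows the effect. Both trade lists below are invented for illustration and are not drawn from my data: one system loses 1R most of the time and occasionally wins 8R, the other wins or loses 1R with a modest hit-rate edge.

```python
from statistics import mean, pstdev


def expectancy_and_sharpe(r_multiples):
    """Return (mean R per trade, per-trade Sharpe using population std dev)."""
    mu = mean(r_multiples)
    return mu, mu / pstdev(r_multiples)


trend = [-1.0] * 80 + [8.0] * 20   # few large winners, many small losers
steady = [1.0] * 60 + [-1.0] * 40  # frequent small wins and losses

trend_exp, trend_sharpe = expectancy_and_sharpe(trend)
steady_exp, steady_sharpe = expectancy_and_sharpe(steady)
```

The trend list has an expectancy of 0.8R against 0.2R for the steady list, yet their per-trade Sharpe values are close (about 0.22 versus about 0.20), because the large winners inflate the standard deviation as well as the mean.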

The honest summary

After 20 years of clean walk-forward data and a rigorous multiple-testing correction:

first_pullback × Webster: strongly validated. Real edge, survives the search adjustment decisively. High confidence for Jan 2027 deployment.
failed_reentry × Minervini: real edge, borderline on search-robustness. Valuable as a diversifier. Launch at conservative size; let the live record accumulate.

Both systems will trade together, each at 0.25% risk per trade, with an ensemble kill switch at −10% portfolio drawdown.
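The two risk rules can be sketched as follows. This assumes risk is defined as the distance from entry to stop, and that the −10% drawdown is measured from the running peak of combined equity. Both are my reading for illustration and the function names are hypothetical.

```python
import math


def position_size(equity, entry, stop, risk_bps=25):
    """Whole shares such that a stop-out loses about risk_bps of equity (25 bps = 0.25%)."""
    risk_per_share = abs(entry - stop)
    if risk_per_share == 0:
        raise ValueError("entry and stop must differ")
    return math.floor(equity * risk_bps / 10_000 / risk_per_share)


def kill_switch_tripped(equity_curve, max_drawdown=0.10):
    """True once equity has fallen max_drawdown or more below its running peak."""
    peak = equity_curve[0]
    for value in equity_curve:
        peak = max(peak, value)
        if (peak - value) / peak >= max_drawdown:
            return True
    return False
```

For example, with 100,000 in equity, an entry at 50 and a stop at 48, the size comes out to 125 shares, so a stop-out costs 250, or 0.25% of equity.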

The selection method isn't perfect. No method is. But it's honest — it accounts for the search, it separates search luck from genuine edge, and it gives a confidence gradient rather than a single overconfident verdict.

That's more than most traders do before risking real capital.

Analysis: 738 trades across 16 strategy variants, 2006–2025 (212 months). Walk-forward source: 20-window clean freshness-gated family. Tools: Deflated Sharpe (López de Prado), PBO via CSCV (Bailey et al.), distribution-free bootstrap expectancy CI (10,000 draws). Completed 2026-06-10.