How we validate alpha

Every strategy that ships at WOLFX clears the same harness. Every rejection is published next to every pass on the scoreboard. This page is the rule book.

Walk-forward 70/15/15

We split the 2016-2026 historical sample into three slices: train (70 %), validation (15 %), and test (15 %). Strategy parameters are fit on train only. Validation tunes hyperparameters. Test is touched once, at the end. Anything that "works" on train but fails on test is rejected.

The point of this discipline is to catch overfitting. A strategy that beats noise on the in-sample period but collapses on data it has never seen is not edge — it's pattern-matching on training set artifacts.
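As a rough illustration of the split described above, here is a minimal Python sketch. The function name walk_forward_split and its integer-percentage parameters are hypothetical, not taken from the WOLFX harness. It assumes the rows are already sorted oldest to newest and simply cuts them in order, with no shuffling, so the test slice is always the most recent data.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def walk_forward_split(
    rows: Sequence[T], train_pct: int = 70, val_pct: int = 15
) -> Tuple[List[T], List[T], List[T]]:
    """Cut chronologically ordered rows into train / validation / test.

    Percentages are integers to avoid floating-point rounding at the
    slice boundaries. Whatever is left after train and validation
    becomes the test slice.
    """
    if train_pct <= 0 or val_pct <= 0 or train_pct + val_pct >= 100:
        raise ValueError("train_pct and val_pct must leave room for a test slice")
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    # Order is preserved: every train row precedes every val row,
    # and every val row precedes every test row.
    return list(rows[:train_end]), list(rows[train_end:val_end]), list(rows[val_end:])


if __name__ == "__main__":
    days = list(range(200))  # stand-in for 200 chronologically ordered bars
    train, val, test = walk_forward_split(days)
    print(len(train), len(val), len(test))  # 140 30 30
```

Keeping the cut purely positional is the design point: any random shuffling would leak future data into the train slice and defeat the purpose of the test slice.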

Nine hard gates (v7)

Test slice Sharpe ≥ 0.40: the headline edge metric
Test slice MaxDD ≤ 15 %: the worst single drawdown can't bury the account
Test slice profit factor ≥ 1.2: total dollars-up / dollars-down (gross profit over gross loss) has margin over breakeven
Test slice trade count ≥ 50 to 100, depending on strategy class: the sample must be large enough that the Sharpe estimate is not dominated by noise
Full-window Sharpe > 0: the strategy must work across the entire 10-year span, not just one regime. (This gate was added after Round 11, when Treasury Curve Carry passed the test slice but lost 18 % over the full window. We don't ship regime-conditional bets.)
Data source named + verified: the API endpoint and units must be confirmed live before the harness runs. (Added after Round 16, RRP Carry's units bug.)
Backtest data parity with live data: if the backtest uses one source and live trading uses another, the gap is documented and its effect on signal quality is assessed. (Added after Round 18.)
Train → val → test trajectory hypothesis: the proposer must state in advance what the trajectory will look like (monotonic improvement, stable, etc.) and the actual run must match. (Added after Round 19's regime flip and Round 20's monotonic decay.)
Conservative canary sizing alongside spec: every passing strategy ships at sizing tighter than backtest spec until 100+ live trades confirm that the regime persists.
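The five numeric gates in the list above can be sketched as a simple threshold check. This is an illustration only: the names SliceMetrics, failed_gates and profit_factor are hypothetical, the metrics are assumed to be computed elsewhere, and the four process gates (data source, data parity, trajectory hypothesis, canary sizing) are not modelled. The example figures are made up to mimic a strategy that passes the test slice but loses over the full window.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SliceMetrics:
    test_sharpe: float
    test_max_dd: float          # as a fraction: 0.12 means a 12 % drawdown
    test_profit_factor: float
    test_trades: int
    full_window_sharpe: float


def profit_factor(pnl: Sequence[float]) -> float:
    """Gross profit divided by gross loss over a list of per-trade P&L."""
    gains = sum(p for p in pnl if p > 0)
    losses = -sum(p for p in pnl if p < 0)
    return float("inf") if losses == 0 else gains / losses


def failed_gates(m: SliceMetrics, min_trades: int = 50) -> List[str]:
    """Return the names of the numeric gates that the metrics fail.

    min_trades is the per-strategy-class floor (50 to 100 in the text).
    """
    checks = [
        ("test Sharpe >= 0.40", m.test_sharpe >= 0.40),
        ("test MaxDD <= 15 %", m.test_max_dd <= 0.15),
        ("test profit factor >= 1.2", m.test_profit_factor >= 1.2),
        (f"test trades >= {min_trades}", m.test_trades >= min_trades),
        ("full-window Sharpe > 0", m.full_window_sharpe > 0),
    ]
    return [name for name, ok in checks if not ok]


if __name__ == "__main__":
    # Hypothetical numbers: strong test slice, negative full window.
    candidate = SliceMetrics(0.55, 0.10, 1.4, 80, -0.2)
    print(failed_gates(candidate))  # ['full-window Sharpe > 0']
```

Returning the list of failed gates, rather than a single pass/fail flag, makes it easy to publish the reason for a rejection alongside the result.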

What we test for besides the gates

Regime stability: train + val + test should all show same-sign Sharpe, ideally monotonically improving. Flip patterns (e.g. -0.4 / +0.5 / -0.3) are "lucky middle slice" signatures and trigger rejection even when test alone passes.
Sample-size confidence: PF estimates from <20 trades have confidence intervals so wide that a "PF 99.9" is indistinguishable from coin-flip noise. Any strategy with fewer than 50 production trades is run through the harness regardless of how good the live numbers look.
Cost realism: every backtest includes 1-15 bps of round-trip slippage, depending on order type. We don't ship strategies that work only at zero cost.
Out-of-sample-only validation: the train slice never informs the live execution decision. Only val and test do.
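The regime-stability check in the list above can be sketched as a sign comparison across the three slice Sharpes. The function name regime_verdict and its three labels are hypothetical. The sketch treats a Sharpe of exactly zero as a flip, which is an assumption the text does not settle.

```python
from typing import Sequence


def regime_verdict(sharpes: Sequence[float]) -> str:
    """Classify a (train, val, test) Sharpe triple.

    "flip"      -> the slices disagree in sign (or one is exactly zero)
    "improving" -> same sign and non-decreasing from train to test
    "same-sign" -> same sign but not monotonically improving

    An all-negative triple counts as same-sign here; it would still be
    rejected by the separate Sharpe gate.
    """
    if len(sharpes) != 3:
        raise ValueError("expected exactly three Sharpe values: train, val, test")
    all_positive = all(s > 0 for s in sharpes)
    all_negative = all(s < 0 for s in sharpes)
    if not (all_positive or all_negative):
        return "flip"
    train, val, test = sharpes
    if train <= val <= test:
        return "improving"
    return "same-sign"


if __name__ == "__main__":
    print(regime_verdict((-0.4, 0.5, -0.3)))  # flip
    print(regime_verdict((0.3, 0.4, 0.5)))    # improving
    print(regime_verdict((0.6, 0.5, 0.45)))   # same-sign
```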

We gauntlet our own live strategies too

The harness doesn't only run on candidate new strategies — it runs on the strategies already deployed in production. Sniper mean-reversion (Round 21), news alpha (Round 22), and quantum convergence (Round 23) have all gone through the same walk-forward applied to candidate strategies. Scoreboard rows for each document the result.

This is uncomfortable but disciplined: production results on small sample sizes (20-50 trades) are easily noise. Walk-forward on 1,000+ trades over 10 years is harder to fool. When the two disagree, we trust the harness and tighten the live strategy.

What gets shipped

A strategy passing the gauntlet ships as a shadow scaffold: flag off, no live capital. The autonomous shadow scheduler (V170/V174/V185) records signal decisions daily for ~30 sessions. Once the strategy has accumulated ≥ 20 trades with rolling Sharpe ≥ 0.5 and no 2-week drawdown of 3 % or worse, the auto-promote cron (V179b) flips the live flag and the order-submission path (V180a/b/c) starts placing real trades.

Promotion is automatic. Demotion is automatic — the V184 circuit breaker auto-disables any live strategy whose 30-day rolling PF drops below 1.0. No human-in-the-loop required.
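The promotion and demotion rules above reduce to two threshold checks, sketched below. The names ShadowStats, should_promote and should_demote are hypothetical and do not describe the actual V179b or V184 code. The sketch assumes that a 2-week drawdown of exactly 3 % blocks promotion, and that the inputs (rolling Sharpe, worst 2-week return, 30-day rolling PF) are computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ShadowStats:
    rolling_sharpe: float
    worst_two_week_return: float  # as a fraction: -0.03 means -3 %
    trades: int


def should_promote(s: ShadowStats) -> bool:
    """True when a shadow strategy meets all three promotion conditions."""
    return (
        s.rolling_sharpe >= 0.5
        and s.worst_two_week_return > -0.03  # assumption: exactly -3 % blocks
        and s.trades >= 20
    )


def should_demote(rolling_30d_pf: float) -> bool:
    """True when the 30-day rolling profit factor has dropped below 1.0."""
    return rolling_30d_pf < 1.0


if __name__ == "__main__":
    print(should_promote(ShadowStats(0.6, -0.01, 25)))  # True
    print(should_promote(ShadowStats(0.6, -0.04, 25)))  # False
    print(should_demote(0.95))                          # True
```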

What's currently in the pipeline

4 strategies in shadow canary, awaiting promotion: Cross-Asset Trend (R7), VIX Contango Carry (R8), Overnight Drift Reversal (R14), Intraday VWAP Breakout (R19).
Currently 0 gauntlet-validated live strategies — the previous live workhorses (sniper_mr, news_alpha, wolf_quantum) all failed validation. Realised P&L on the books reflects small-sample variance, not validated edge. Discipline says: tighter parameters with kill-switches, wait for shadow data to validate, do not trade unvalidated strategies at scale.

Questions about methodology: hello@wolfx.trade