WOLFX Research · reports/gauntlet-scoreboard.md

The WOLFX Backtest Gauntlet — Live Scoreboard

WOLFX Research · Updated 2026-04-24

Every strategy WOLFX trades has cleared a walk-forward 70/15/15 backtest with hard gates. Every strategy WOLFX considers but rejects is published here — the rejections are how you know the filter is real.

Running score: 3 PASS / 15 rounds (20 %). One documented near-miss (Round 13).

The passes

Round	Strategy	Test Sharpe	Shipped
7	Cross-Asset Futures Trend (ES/NQ/RTY/YM/6E/6J, 12-1 skip-month)	0.895	V167 · 2026-04-24 · whitepaper
8	VIX Contango Carry (short VXX / long VXZ, regime-gated)	1.41	V169 · 2026-04-24 · whitepaper
14	Overnight Drift Reversal (SPY MOC→MOO, 5d-intraday filter)	1.229	V174 · 2026-04-25 · whitepaper

(Round 15 intentionally skipped: FOMC Pre-Announcement Drift fires only 8 times per year, so it cannot clear the v5 ≥50-trade gate. It is proposed as a sizing multiplier on Round 14, not as a standalone strategy.)

Both strategies are in a 30-day paper-shadow canary via the V170 scheduler. The flag flips to live execution only after the rolling Sharpe is ≥ 0.5 with no monthly drawdown > 3 %.
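The promotion rule can be sketched as a single check. This is a minimal illustration, not the V170 scheduler: the function names, the dict-of-months input shape, the zero risk-free rate, and the reading of "rolling Sharpe" as the annualized Sharpe over the whole canary window are all assumptions made for the example.

```python
import math
from statistics import mean, stdev


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe of daily returns; risk-free rate assumed to be zero."""
    if len(daily_returns) < 2:
        return float("nan")
    sd = stdev(daily_returns)
    if sd == 0:
        return float("nan")
    return mean(daily_returns) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def canary_ready(returns_by_month, min_sharpe=0.5, max_monthly_dd=0.03):
    """Hypothetical go-live check.

    returns_by_month maps 'YYYY-MM' to the list of daily returns in that month.
    Assumption: "monthly drawdown" means the drawdown measured inside each
    calendar month. A NaN Sharpe fails the comparison, so the check fails closed.
    """
    all_returns = [r for m in sorted(returns_by_month) for r in returns_by_month[m]]
    sharpe = annualized_sharpe(all_returns)
    worst_month = max((max_drawdown(rs) for rs in returns_by_month.values()), default=0.0)
    return sharpe >= min_sharpe and worst_month <= max_monthly_dd
```

A canary with a single 5 % down day fails this check on the drawdown leg regardless of its Sharpe, which is the point of pairing the two conditions.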

The rejections

Round	Strategy	Test Sharpe	Verdict	Finding
1	Overnight Gap Continuation	-8.11	NO-GO	Signal evaporated when realistic fill costs applied. Infrastructure gap on premarket data.
2	Crypto Funding Arbitrage v1	3.15 IS / -5.68 OOS	NO-GO	Textbook overfit. Great in-sample, dead out-of-sample.
3	Crypto Funding v2 (long-only, z < -3)	7.22 (spurious)	NO-GO	Forensic finding: the claimed 71.4 % WR was implicitly bundled with a "price at 20-day low" filter. The filter, not the funding signal, was doing the work.
4	Cointegration Pairs Trading	-1.51	NO-GO	Mega-cap dispersion in 2025-2026 broke the cointegration assumptions that made this work in 2015.
5	Momentum + VIX Long/Short	-1.23	NO-GO	Benign-VIX regimes produce short-squeeze spikes that shred the short leg. MaxDD 25 %.
6	Momentum long-only (decomposed)	+0.88	NO-GO-drawdown	Sharpe ok, but MaxDD 12.85 % eats the risk budget.
—	PEAD — Post-Earnings Drift	-2.77	HARD NO-GO	THE SIGNAL HAS INVERTED. A positive earnings surprise now predicts -0.61 % forward return. Classical decades-old anomaly is now anti-signal.
9	G10 FX Trend+Carry	0.164	NO-GO	Strategy lost money over 9 years. AUD/USD + USD/JPY profitable legs couldn't offset EUR/USD, GBP/USD, CHF, etc.
10	Commodity Basis Carry	-1.58	NO-GO	Proxy rejected, not the underlying premium. Inverted-momentum-as-basis shorted the 2024-2026 gold/palladium rally. Real term-structure data required.
11	Treasury Curve Carry 2s10s (Yahoo futures ratio)	0.54 test / -1.08 train	NO-GO	Test slice looked fine (3 of 4 gates pass) but strategy lost 18 % over the full 10-year window. Only worked post-QE. A regime bet, not a carry premium.
12	Treasury Curve Carry 2s10s (FRED daily yields, IEF/SHY)	-3.34	NO-GO	Retested Round 11 with real yield data to falsify "maybe the proxy was the problem." Result: real yields were worse than the proxy. Full-window Sharpe -0.99, final NAV down 36 %. One trade alone (Oct 2023 short SHY at 3.78× weight) lost $44.8K when 2Y yields fell into the rate-cut cycle. The underlying signal — not the proxy — is wrong for the 2022-2026 hiking/cutting regime.
13	DXY Regime Switch (long-only, Variant B)	1.20 test / 0.61 full	NEAR-MISS / NO-GO	Sharpe 1.20, PF 2.95, MaxDD -1.47 %, full-window Sharpe positive — every substantive gate clears with margin. Fails only on trade count (2 vs gate 20). The 20-trade gate is miscalibrated for a regime classifier that fires ~3 times per test slice by design. Honest verdict under strict rules: NO-GO. Under the same gate-calibration argument that Round 7 (Trend) accepted, this would be a PASS — that's a calibration decision, not a statistical one. Flagged for re-evaluation if the trade-count gate is recalibrated per signal class.
16	RRP-Driven Treasury Carry Reversal (FRED RRPONTSYD → SHY)	-0.89 test / 0.24 full	NO-GO	Test slice fails 3 of 5 gates (Sharpe -0.89, PF 0.88, only 29 trades). Walk-forward: train -0.03, val +1.81, test -0.89 — textbook in-sample-fit / out-of-sample-collapse. The validation slice caught the late-2023 RRP drain wave during the Fed pivot; the test slice is in a post-RRP-trough regime where the facility sits near zero and meaningful drains stop happening. Two side findings: (1) the v5 proposal had a units bug — RRPONTSYD is in billions, not millions; harness corrected; (2) the premium, if it existed, has likely been arbitraged away in the four years since Copeland-Duffie-Yang published.

Methodology

Every round runs through the same walk-forward harness:
70 / 15 / 15 split — training, validation, test. Parameters fit on training only.
Walk-forward monotonicity check — train ≤ validation ≤ test Sharpe. Monotonic improvement OOS is the single strongest credibility signal.
Full-window sanity — if a strategy needs a specific regime to print positive numbers, the test slice alone doesn't save it. Round 11 was rejected on this exact point.
Gate calibration — the Sharpe gate ranges from 0.30 to 0.50 depending on asset class. Trend in equities needs a higher Sharpe gate than VIX carry.
Cost model — slippage, commission, and execution realism baked in. Round 1 died specifically because we made the fill assumptions honest.
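The split and the monotonicity check listed above can be sketched as follows. This is an illustration under stated assumptions, not the WOLFX harness: the function names are hypothetical, the input is a plain chronological list of daily returns, and Sharpe is annualized with 252 periods and a zero risk-free rate.

```python
import math
from statistics import mean, stdev


def split_70_15_15(returns):
    """Chronological split: first 70 % train, next 15 % validation, last 15 % test."""
    n = len(returns)
    train_end = n * 70 // 100
    val_end = n * 85 // 100
    return returns[:train_end], returns[train_end:val_end], returns[val_end:]


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio; needs at least two observations."""
    sd = stdev(returns)
    if sd == 0:
        return float("nan")
    return mean(returns) / sd * math.sqrt(periods_per_year)


def is_monotonic(train_sharpe, val_sharpe, test_sharpe):
    """True when Sharpe does not degrade from train to validation to test."""
    return train_sharpe <= val_sharpe <= test_sharpe


def walk_forward_report(returns):
    """Per-slice Sharpe, full-window Sharpe, and the monotonicity flag."""
    train, val, test = split_70_15_15(returns)
    s_train, s_val, s_test = sharpe(train), sharpe(val), sharpe(test)
    return {
        "train": s_train,
        "validation": s_val,
        "test": s_test,
        "full_window": sharpe(returns),
        "monotonic": is_monotonic(s_train, s_val, s_test),
    }
```

Applied to the Round 16 figures from the rejections table, `is_monotonic(-0.03, 1.81, -0.89)` is `False`: the validation slice improves on train, but the test slice collapses.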

Meta-insights the swarm has learned

The 2015-2021 equity-factor playbook is upside-down in 2026. Six out of the first eight rejections were classical equity factors. PEAD inverting was the clincher. Mega-cap concentration, passive flows, and retail options gamma have rewired the single-name tape.
The first two passes (Rounds 7 and 8) are macro / commodity / vol with multi-decade academic lineage. The alpha research pivot to this territory produced two consecutive passes after six consecutive equity-factor rejections. That is not coincidence; it is structure.
Data proxies die in out-of-sample testing. Round 10 and Round 11 both had to use Yahoo-price proxies because the true signal requires curve / yield data. Both failed. The lesson: before we backtest another carry signal, procure the data.

Round 14 — first v5 round, first PASS under v5 constraints

Variant B (mean-reversion-filtered overnight long): test Sharpe 1.229, MaxDD -8.36 %, PF 1.25, 163 trades, full-window Sharpe 0.447. All five v5 gates cleared. Train/Val/Test Sharpe 0.51 / 1.35 / 1.23 — clean OOS pattern, no overfit signature. Variant A (always-on overnight long, no filter) NO-GO at PF 1.09 — confirms the filter is doing real work.
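For readers unfamiliar with PF: profit factor is conventionally gross profit divided by gross loss over a set of trades. The sketch below computes it from per-trade P&L and pairs it with a trade-count check. The function names are hypothetical, and the thresholds are left as required parameters because this report does not state the PF gate value.

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss over a list of per-trade P&L values."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        # No losing trades: PF is unbounded if anything was won, undefined otherwise.
        return float("inf") if gross_profit > 0 else float("nan")
    return gross_profit / gross_loss


def clears_trade_gates(trade_pnls, min_pf, min_trades):
    """Hypothetical two-gate check: enough trades, and a high enough profit factor."""
    return len(trade_pnls) >= min_trades and profit_factor(trade_pnls) >= min_pf
```

A PF of 1.25 means the winners collected 1.25 units for every unit the losers gave back, so the gap between Variant A (1.09) and Variant B (1.25) is the gap between barely covering losses and a modest but real margin.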

The strategy fills the high-frequency gap that v4 was structurally unable to test, and it diversifies cleanly from Trend (monthly futures momentum) and VIX Carry (monthly vol selling).

What's next

Alpha v4 final tally: 0 PASS / 4 tested — commodity basis (R10), treasury curve with and without proxy (R11, R12), DXY regime (R13 near-miss). The v4 research batch leaned too hard on macro / carry signals that either need decades of data (to separate premium from regime bet) or would never fire often enough to accumulate a test-slice sample in ten years.
Alpha Researcher v5 is now the next loop. Three hard constraints for v5, learned the hard way from v4:
1. Data source must be named and verified to expose the true signal before a harness is built. R10 and R11 both died because the available data only supported a proxy of the real signal.
2. Prefer signals that naturally fire ≥ 50 times per year — intraday, event-driven, or short-lookback technical. Monthly carry signals are fine in principle but the 20-trade-per-test-slice sample gate starves them.
3. Full-window Sharpe is a hard gate, set after Round 11. A strategy that only works in one half of the 2016-2026 window is a regime bet; no more "only works post-QE" passes.
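The arithmetic behind constraint 2 is worth making explicit. A minimal sketch, assuming a 10-year window (2016-2026), a 15 % test slice, and one trade per signal firing; the function names are hypothetical.

```python
def expected_test_trades(fires_per_year, window_years=10.0, test_fraction=0.15):
    """Trades expected in the test slice if every signal firing becomes one trade."""
    return fires_per_year * window_years * test_fraction


def min_fires_per_year(trade_gate, window_years=10.0, test_fraction=0.15):
    """Firing rate needed for the test slice alone to reach the trade-count gate."""
    return trade_gate / (window_years * test_fraction)
```

Under these assumptions the test slice covers about 1.5 years. A signal that fires 8 times per year, like the FOMC drift in Round 15, yields roughly 12 test-slice trades, far short of a 50-trade gate. Clearing 50 trades needs about 33 firings per year, and clearing 20 needs about 13, which is why monthly signals sit right at the edge of the older gate.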

---

WOLFX publishes every signal and every realized fill. Past performance, including walk-forward backtest performance, is not predictive of future results. Every strategy here is informational — nothing is investment advice.
