When the backtest disagreed with us: a signal-relabeling postmortem

We built 12 trading-robot rules, labelled four of them bearish, and shipped a backtest page that scored every rule honestly. The data said three of the four bearish labels were wrong — so we relabeled them. Here's what happened and why the editorial process matters.

Published May 29, 2026
TL;DR

Our Trading Robot ships 12 rule-based signals. Four were originally tagged 'bearish.' When the year-long backtest scored them honestly, three of the four had 7-day hit rates ABOVE the random-day baseline — they preceded *positive* returns more often than chance. The labels were wrong. We renamed them and re-classified them as neutral. The lesson isn't that the rules failed; it's that the editorial discipline of running the backtest, publishing the numbers, and changing what didn't survive contact with data is the actual product.

The original taxonomy

When we shipped the Trading Robot, we sorted 12 rules into three buckets by intuition: bullish (5), bearish (4), neutral (3). The bearish set was momentum-down (24h < -8% with the 7d trend negative), reversal-down (24h < -5% after a 7d rally), breaking-down (-5% today on top of -20% over 30 days), and underperformer (lagging the cohort median by 5pp+). On the surface these all describe selling pressure. The label felt obvious.
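The four bearish conditions above can be sketched as simple predicates. This is an illustration only, not the robot's actual code: the `Snapshot` type and its field names are hypothetical, "7d rally" is assumed to mean a positive 7-day change, and the horizon used for the cohort gap is not stated in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Percentage changes for one coin on one day (hypothetical field names)."""
    chg_24h: float     # 24-hour change, in percent
    chg_7d: float      # 7-day change, in percent
    chg_30d: float     # 30-day change, in percent
    cohort_gap: float  # coin return minus cohort median, in percentage points


def momentum_down(s: Snapshot) -> bool:
    # 24h < -8% with the 7d trend negative
    return s.chg_24h < -8.0 and s.chg_7d < 0.0


def reversal_down(s: Snapshot) -> bool:
    # 24h < -5% after a 7d rally (assumption: "rally" means chg_7d > 0)
    return s.chg_24h < -5.0 and s.chg_7d > 0.0


def breaking_down(s: Snapshot) -> bool:
    # -5% today on top of -20% over 30 days
    return s.chg_24h <= -5.0 and s.chg_30d <= -20.0


def underperformer(s: Snapshot) -> bool:
    # lagging the cohort median by 5pp or more
    return s.cohort_gap <= -5.0


BEARISH_RULES = {
    "momentum-down": momentum_down,
    "reversal-down": reversal_down,
    "breaking-down": breaking_down,
    "underperformer": underperformer,
}


def fired(s: Snapshot) -> list[str]:
    """Names of the originally-bearish rules that fire on this snapshot."""
    return [name for name, rule in BEARISH_RULES.items() if rule(s)]
```

Note that several rules can fire on the same coin-day, so signal counts across rules are not independent of one another.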

Then we ran the backtest

We backtested every rule against 1 year of daily CoinGecko data on the top 30 coins: 9,660 coin-days, 4,207 signals fired. For each fired signal we computed forward 1-, 7-, and 30-day returns and compared the hit rate (% positive) to a random-day baseline. The 7-day baseline for that window was 47.7%, meaning a randomly chosen day in the dataset was followed by a positive 7-day return 47.7% of the time. A hit rate materially above the baseline is upside edge; one materially below is downside edge, which confirms a bearish label and undermines a bullish one.
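The hit-rate comparison can be sketched for a single price series as follows. This is a minimal sketch under stated assumptions, not the pipeline behind the published numbers: the function names are hypothetical, `signal_days` is assumed to hold the day indices on which a rule fired, and the real backtest pooled results across 30 coins rather than one.

```python
def forward_returns(prices, horizon):
    """r[i] = prices[i + horizon] / prices[i] - 1 for every day with a full forward window."""
    return [prices[i + horizon] / prices[i] - 1.0 for i in range(len(prices) - horizon)]


def hit_rate(returns):
    """Share of strictly positive returns, as a percentage."""
    if not returns:
        raise ValueError("no returns to score")
    return 100.0 * sum(r > 0 for r in returns) / len(returns)


def signal_vs_baseline(prices, signal_days, horizon=7):
    """Return (signal hit rate, random-day baseline) for one price series."""
    fwd = forward_returns(prices, horizon)
    baseline = hit_rate(fwd)  # every day that has a full forward window
    # Signals too close to the end of the series have no forward return yet.
    scored = [fwd[i] for i in signal_days if i < len(fwd)]
    return hit_rate(scored), baseline


# Toy series with a 2-day horizon, purely for illustration.
prices = [100.0, 90.0, 95.0, 99.0, 97.0, 101.0]
signal_rate, baseline_rate = signal_vs_baseline(prices, signal_days=[1, 2], horizon=2)
```

Computing the baseline from the same window as the signals matters: comparing against 50% instead would make every rule look worse than it is in a year where most days were followed by losses.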

  • near-breakout (bullish): 50.5% hit, +4.82% mean 7d return → real edge confirmed
  • deep-discount (neutral): 52.4% hit → mild positive edge
  • momentum-up (bullish): 30.8% hit, -4.63% mean → terrible. Strong upward chases were followed by declines almost 7 times out of 10
  • momentum-down (bearish): 37.0% hit → did underperform the baseline. Genuine downward edge
  • reversal-down (bearish): 46.0% hit, +6.78% mean → hit rate roughly in line with the 47.7% baseline, mean return clearly positive. The 'selling after a rally' rule preceded *higher* prices on average
  • breaking-down (bearish): 57.7% hit, +0.87% mean → far above baseline. The 'multi-week downtrend accelerating' rule was actually catching capitulation lows, not continuation
  • underperformer (bearish): 52.5% hit, +3.27% mean → above baseline. Lagging the cohort more often *bounced* than kept lagging

Why three of the four 'bearish' rules failed as bearish

The intuition behind each was correct as a description of price action: they really do fire on selling pressure. What they don't predict is more selling pressure. Crypto is mean-reverting at short horizons in this dataset (we're looking at the top 30 coins, which are highly liquid and heavily traded). Bouts of capitulation are followed by bounces more often than chance, plausibly because the marginal seller is absorbed by mean-reversion buyers. Identifying capitulation is useful, but framing it as 'bearish continuation' is the opposite of what the data supports.

The relabeling

We renamed three rules to match what they actually do:

  • reversal-down → 'Distribution watch' (neutral). Fires on intraweek pullbacks after weekly gains; often profit-taking inside continuing uptrends.
  • breaking-down → 'Capitulation watch' (neutral). Fires on accelerating multi-week declines; historically more often a low than a continuation.
  • underperformer → 'Mean-reversion candidate' (neutral). Fires on coins lagging the cohort; more often bounces than keeps lagging.

Why momentum-down stayed bearish

The exception. Strong downward momentum with the 7-day trend also negative is the only one of the four with a 7-day hit rate (37.0%) materially below the baseline (47.7%). In this window, those moves did precede further declines. It stayed bearish. One in four was the right call out of the gate; the other three deserved relabeling.

What this means for using the robot

If you see 'Capitulation watch' or 'Distribution watch' fire on a coin you're following, the historical pattern says wait: don't short the move. Either step aside or treat it as a bounce-watch trigger. 'Mean-reversion candidate' is similar: the coin is underperforming, but on a 7-day horizon it's more likely to catch up than lag further. None of this is a buy recommendation. It is directional context drawn from each rule's historical behaviour, not a guarantee of future returns.

Why we wrote this article

Most retail trading-signal products would have buried the inconvenient backtest results, kept the marketing-friendly 'bearish' label, and hoped no one ran the numbers. We did the opposite: published the backtest at /trading-robot/backtest the day the engine shipped, then changed the labels within hours when the data made the original taxonomy untenable. The credibility cost of changing labels in response to data is approximately zero; the credibility cost of refusing to is the entire business.

Worked example

Walk through the breaking-down → Capitulation watch decision

The original rule fired 111 times in the dataset. Bearish framing predicted continued downside. Then the numbers came in. Here's the decision tree we walked through.

  1. 7d hit rate: 57.7% vs 47.7% baseline → 10pp above random
  2. Mean 7d return: +0.87% (positive, not what 'breaking down' should produce)
  3. Sample size: n=111, large enough to trust the signal
  4. Interpretation A (keep label): Defensible if we ignore the data; indefensible if anyone runs the math
  5. Interpretation B (relabel): Match the label to what the data says it actually does
  6. Decision: Relabel to 'Capitulation watch', flip bias to neutral, update reasoning text
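The sample-size step in the list above can be sanity-checked with a one-proportion z-test against the baseline. This is a rough sketch, not a calculation taken from the backtest page: it uses the normal approximation, treats the 47.7% baseline as fixed, and assumes the 111 signals are independent. Overlapping 7-day windows and correlated coins mean they are not fully independent, so the result is likely to overstate the evidence.

```python
import math


def proportion_z(hit_rate, baseline, n):
    """z-score of an observed hit rate against a fixed baseline proportion."""
    standard_error = math.sqrt(baseline * (1.0 - baseline) / n)
    return (hit_rate - baseline) / standard_error


def upper_tail_p(z):
    """One-sided p-value: probability of a z-score at least this large under the null."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Published numbers for the breaking-down rule.
z = proportion_z(hit_rate=0.577, baseline=0.477, n=111)
p = upper_tail_p(z)
print(f"z = {z:.2f}, one-sided p = {p:.3f}")
```

Under those assumptions z comes out at a little over 2, which supports the relabel but is far from overwhelming. It is also one more reason to re-run the backtest on fresh data instead of treating a single window as settled.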
Takeaway

The decision wasn't hard once we had the data. The hard part is committing to that decision-tree BEFORE running the backtest — agreeing in advance that if the data disagrees, the labels change. That commitment is what separates editorial product from marketing product.
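One way to make that advance commitment concrete is to write the relabeling rule down as code before the backtest runs. The sketch below is illustrative: the function name is hypothetical, and the 5-point `MARGIN_PP` threshold for "materially below" is an assumption, since the article does not state a numeric cutoff. It only handles bearish labels.

```python
BASELINE = 47.7  # random-day 7-day hit rate from the backtest, in percent
MARGIN_PP = 5.0  # hypothetical cutoff for "materially below"; not from the article


def final_label(original, hit_rate, baseline=BASELINE, margin=MARGIN_PP):
    """A bearish label survives only if the hit rate sits at least `margin` points below baseline."""
    if original == "bearish" and hit_rate > baseline - margin:
        return "neutral"
    return original


# Published 7-day hit rates for the four originally-bearish rules.
published = {
    "momentum-down": 37.0,
    "reversal-down": 46.0,
    "breaking-down": 57.7,
    "underperformer": 52.5,
}

labels = {name: final_label("bearish", rate) for name, rate in published.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

With this cutoff the rule reproduces the outcome described in the article: momentum-down stays bearish and the other three become neutral. The value of fixing the threshold in advance is that nobody gets to argue about it after seeing which labels it would change.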

Common mistakes

What to avoid

  • Treating signal labels as marketing copy instead of empirical claims. If a rule is labelled bearish, the burden of proof is on the label, not on the user to discover it's wrong.
  • Burying inconvenient backtest results. The signal you can't defend transparently isn't a signal; it's a vibe.
  • Assuming intuition about selling pressure scales to forward returns. In short-horizon crypto, mean reversion often dominates continuation.
  • Backtesting once and never re-running. Markets regime-shift; rules that worked in one window can fail in the next. The robot's backtest is dated and meant to be re-run quarterly.
  • Confusing 'fires often' with 'works often.' A rule firing 2,600 times tells you nothing about whether to act on it. Only the forward-return distribution does.
Self-check

Test yourself

Q1. Why were three of the four bearish labels changed so soon after launch?

Because the backtest contradicted them. Their 7-day hit rates ran above the random-day baseline, meaning the rules historically preceded positive forward returns — the opposite of what a 'bearish' label predicts. Keeping the label would have required ignoring the data.

Q2. Why did momentum-down stay bearish?

It was the only one of the four with a 7-day hit rate (37%) materially below the random-day baseline (47.7%). Strong downward momentum with a confirming weekly trend did historically precede further declines. The bearish label survived contact with data.

Q3. What does 'Capitulation watch' (formerly 'Breaking down') actually predict, based on history?

More often a bottom than a continuation. It fires on a -5% day combined with a -20% month — historically those compound declines were followed by positive 7-day returns 57.7% of the time, well above the 47.7% baseline.

Q4. Why is publishing the backtest more important than the signals themselves?

Because anyone can write a list of trading rules. The credibility comes from showing the rules that didn't work alongside the ones that did, and changing the taxonomy when the data demands it. The discipline of publishing wrong-then-corrected is the actual product.
