Machine learning in trading is mostly overfitting dressed up as alpha. We say this as a team that built, trained, and deployed an XGBoost signal filter, and then watched it fail to generalize beyond the training period. The strategy exists in our codebase (strategy/ml_signal_filter.py), it works mechanically, and it taught us more about ML pitfalls than any textbook. Here is the full walkthrough, including the parts that did not work.
The Premise
The idea behind ml_signal_filter is straightforward: instead of using fixed rules (RSI below 30, price below Bollinger Band), use a gradient-boosted classifier to predict whether the next N candles will be profitable. The model takes in a feature vector derived from technical indicators and outputs a probability of upward movement. High probability means "take the trade." Low probability means "skip."
This is a signal filter, not a signal generator. It does not decide when to look for trades. It decides whether a trade that our rules-based strategies already flagged is worth taking.
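In code, the gate can be as simple as a probability threshold on top of an existing rules-based signal. A minimal sketch; `should_take` and the 0.6 cutoff are hypothetical illustrations, not the strategy's actual names or tuning:

```python
import numpy as np

def should_take(model, feature_row: np.ndarray, threshold: float = 0.6) -> bool:
    """Gate a rules-based signal: take the trade only if the model's
    predicted probability of an upward move clears the threshold."""
    prob_up = model.predict_proba(feature_row.reshape(1, -1))[0, 1]
    return bool(prob_up >= threshold)
```

The rules-based strategy still decides when a setup exists; the model only vetoes or approves it.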
Feature Engineering: 10 Shared Features
Our ML feature module (ml/features.py) computes 10 features from raw OHLCV data. These features are shared across all ML strategies so that feature computation is consistent and testable.
The features are: RSI (14-period), MACD histogram, Bollinger Band width (as percentage of price), ATR (14-period, normalized by price), volume ratio (current volume divided by 20-period average), price position within the Bollinger Band range (0 at lower band, 1 at upper band), close-to-close return over 1 period, close-to-close return over 5 periods, high-low range as percentage of close, and hour-of-day (encoded cyclically using sine/cosine to handle the midnight wrap).
import pandas_ta as ta  # assumed TA library, consistent with the ta.* calls below

# Bollinger Band columns assume pandas_ta defaults (length=20, std=2)
bb = ta.bbands(close, length=20)
upper, lower = bb["BBU_20_2.0"], bb["BBL_20_2.0"]

features = {  # five of the ten features shown
    "rsi_14": ta.rsi(close, length=14),
    "macd_hist": ta.macd(close)["MACDh_12_26_9"],
    "bb_width": (upper - lower) / close,
    "atr_norm": ta.atr(high, low, close, length=14) / close,
    "vol_ratio": volume / volume.rolling(20).mean(),
}
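The cyclical hour-of-day encoding mentioned above maps each hour onto a circle so that 23:00 and 00:00 end up adjacent rather than 23 units apart. A sketch, assuming the candle data carries a pandas DatetimeIndex; `encode_hour` is an illustrative name:

```python
import numpy as np
import pandas as pd

def encode_hour(index: pd.DatetimeIndex) -> dict:
    # The sin/cos pair places each hour on the unit circle, so the
    # midnight wrap (23 -> 0) is a small step rather than a jump.
    hour = index.hour.to_numpy()
    angle = 2 * np.pi * hour / 24
    return {"hour_sin": np.sin(angle), "hour_cos": np.cos(angle)}
```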
We deliberately kept the feature count low. With 10 features, overfitting is possible but manageable. With 50 features (which we tried early on), the model finds spurious patterns in training data that evaporate in production.
Every feature is normalized relative to price or its own history. Raw price levels are never used as features because a model trained when SOL was at $20 would be useless when SOL is at $150. Normalized features (ATR as percentage of price, volume relative to its average) are scale-invariant.
Label Generation
The label is the target variable the model learns to predict. We define a positive label (1) when the close price N candles in the future is higher than the current close by at least the transaction cost (fees plus estimated slippage). A negative label (0) means the future price is flat or lower.
lookahead = 8            # candles to look forward (8 x 15m = 2 hours)
cost_threshold = 0.0015  # 0.15% estimated round-trip cost
future_return = close.shift(-lookahead) / close - 1
labels = (future_return > cost_threshold).astype(int)
The shift(-lookahead) is a forward look, which is perfectly valid during training (we know future prices in historical data) but must never appear in the strategy's live analyze() method. The training script (scripts/train_ml_model.py) computes labels offline. The live strategy only uses the trained model's predict_proba() with current features.
The cost threshold matters enormously. Without it, the model learns to predict tiny upward movements that get eaten by fees. We set the threshold at 0.15% (our estimated round-trip cost of 0.075% maker fee times two plus slippage). This filters the training set so the model only learns to identify moves large enough to be profitable after costs.
The lookahead period is 8 candles for 15-minute timeframes (2 hours forward) and 3 candles for 4-hour timeframes (12 hours forward). Shorter lookaheads produce more labels but noisier ones. Longer lookaheads are smoother but the model cannot reliably predict 24+ hours ahead in crypto.
Train/Test Split
We use a strict temporal split, never a random split. The training set is the first 70% of the data chronologically, and the test set is the remaining 30%. Random splitting would leak future information into the training set (one 2025 candle landing in training while the very next candle sits in test), producing artificially inflated accuracy.
split_idx = int(len(features) * 0.7)
X_train, X_test = features[:split_idx], features[split_idx:]
y_train, y_test = labels[:split_idx], labels[split_idx:]
For our primary symbols (BTC, ETH, SOL), training uses data from 2021-2024 and testing uses 2024-2026. This means the model must generalize across the 2024-2025 consolidation regime, which was structurally different from the 2021-2022 bull/crash and 2022-2023 bear/recovery periods it trained on.
XGBoost Configuration
We use XGBClassifier from the xgboost library with conservative hyperparameters designed to limit overfitting.
The key parameters: max_depth=4 (shallow trees prevent memorizing specific market patterns), n_estimators=200 (moderate ensemble size), learning_rate=0.05 (slow learning to avoid jumping to conclusions), min_child_weight=10 (require at least 10 samples per leaf to prevent overfitting to rare events), and subsample=0.8 (use 80% of data per tree for bagging-style regularization).
We also set scale_pos_weight to balance the classes. In most market conditions, the negative class (price did not rise enough) outnumbers the positive class 60/40. Without class balancing, the model learns to always predict "skip," which is accurate but useless.
Model Persistence with joblib
Trained models are saved to disk using joblib, which handles NumPy arrays and scikit-learn compatible objects efficiently. Each model is saved per symbol because different assets have different statistical properties.
The models live in data/models/ with filenames like ml_signal_filter_SOL_USDT.joblib. The training script handles the save. The live strategy loads the model during initialization and calls predict_proba() on each tick.
If no model file exists for a symbol, the strategy returns None (no signal), effectively disabling itself. This is a deliberate safety mechanism: you cannot accidentally run the ML strategy without having trained a model first.
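The load-or-disable behavior can be sketched as follows. `load_model` and `MODEL_DIR` are illustrative names, but the filename pattern matches the one described above:

```python
from pathlib import Path
import joblib

MODEL_DIR = Path("data/models")

def load_model(symbol: str):
    """Return the trained model for a symbol, or None if no model
    file exists (which effectively disables the ML strategy)."""
    fname = f"ml_signal_filter_{symbol.replace('/', '_')}.joblib"
    path = MODEL_DIR / fname
    if not path.exists():
        return None
    return joblib.load(path)
```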
The Training Script
Training is an offline process, not part of the live system. The script (scripts/train_ml_model.py) accepts a list of symbols, loads historical candle data from the SQLite database, computes features, generates labels, trains the model, evaluates on the test set, and saves the model file.
uv run python scripts/train_ml_model.py \
--symbols BTC/USDT ETH/USDT SOL/USDT
Training takes 30-60 seconds per symbol on a Mac M-series chip. The bottleneck is feature computation over hundreds of thousands of candles, not the XGBoost training itself.
The script prints test set metrics: accuracy, precision, recall, F1, and a confusion matrix. For SOL/USDT with our current parameters, test accuracy is around 54% with precision of 56%. These numbers sound terrible compared to typical ML benchmarks, but in trading, 54% accuracy with appropriate position sizing is profitable if the average win is larger than the average loss.
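The evaluation step can be reproduced with scikit-learn's metric helpers; here is a sketch with small synthetic arrays standing in for the hold-out labels and model predictions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Synthetic stand-ins for the test labels and the model's predictions
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
cm = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted
```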
Honest Assessment: Where It Stands
The ml_signal_filter strategy exists in our codebase and mechanically works. It loads a model, computes features, and produces signals. We included it in our tournament and sweep pipeline.
The results were mixed. In sweep testing, the strategy achieved Sharpe ratios up to 2.1 on SOL/USDT with optimized parameters. That sounds reasonable until you run validation across multiple market regimes.
In our 5-period validation (2021-2022 Bull/Crash, 2022-2023 Bear/Recovery, 2023-2024 Recovery/Highs, 2024-2025 Consolidation, 2025-2026 Recent), the strategy was profitable in 2-3 of the 5 periods depending on the symbol. It did not achieve ROBUST status (4+ of 5 periods profitable) on any symbol.
The core problem is regime sensitivity. An XGBoost model trained on 2021-2024 data has seen one full market cycle. It learns patterns from that cycle. When the next cycle unfolds differently (as cycles always do), the model's pattern recognition degrades. This is not a bug in our implementation. It is a fundamental limitation of supervised learning on non-stationary financial data.
We classify ml_signal_filter as "needs offline training" in our strategy roster, which is an honest way of saying "it works mechanically but is not deploy-ready." The strategy remains in the codebase as a functional example of ML signal filtering and as a baseline for future ML experiments.
Lessons for ML in Trading
Five concrete takeaways from building this system.
Never use random train/test splits on time series data. Always split temporally. This is the single most common mistake in ML trading and it makes every backtest look better than reality.
Keep feature counts low. With 10 features and 200 trees of depth 4, overfitting is possible but detectable. With 100 features, overfitting is invisible until you deploy.
The cost threshold in label generation is critical. Without it, the model optimizes for predicting direction, which is not the same as predicting profitability.
Validate across regimes, not just on a hold-out period. A model that works in a bull market and a bear market but fails in consolidation will lose money when consolidation arrives.
ML in crypto trading is harder than in equities because the market is younger (less data), more volatile (noisier labels), and has fewer stable statistical relationships (feature distributions shift faster). The bar for ML to outperform simple rules-based strategies is higher than most practitioners assume.
Our rules-based mean_reversion_bb strategy achieves Sharpe 9-19 across 13 symbols with fixed parameters. The ML strategy achieved Sharpe 2.1 at best with symbol-specific training. The simpler approach won. We keep the ML strategy because the infrastructure (feature engineering, model persistence, training pipeline) has value for future iterations, but we are candid about the current performance gap.