Clean data is the foundation of every trading decision we make. Our platform holds 20.8 million candles across 25 symbols and 11 timeframes, and every row needs to be correct. A single corrupted candle, a duplicate row, a timestamp gap, or an impossible OHLC relationship can produce a false signal that triggers a real trade. We built a dedicated data integrity module (data/integrity.py) that validates every candle before it reaches a strategy.
Here is the full pipeline from raw exchange data to strategy-ready DataFrames.
Timestamp Normalization
Exchange APIs return timestamps in various formats. Binance uses Unix milliseconds. Some exchanges return ISO 8601 strings. Others return Unix seconds. The first step in our pipeline normalizes everything to UTC datetime objects with no timezone offset.
# Unix-millisecond feeds (e.g. Binance); ISO 8601 strings parse without unit="ms"
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
df["timestamp"] = df["timestamp"].dt.tz_localize(None)
The tz_localize(None) call strips the timezone info, leaving a naive UTC datetime. We store and compute everything in UTC. No timezone conversions, no daylight saving time issues, no ambiguity. Every timestamp in our database is UTC, period.
Normalization also includes sorting by timestamp ascending and resetting the index. Strategies expect candles in chronological order. An unsorted DataFrame would produce incorrect rolling window calculations, wrong crossover signals, and broken equity curves.
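The whole normalization step can be sketched as a single function. This is a minimal illustration, not the platform's actual code; column names are assumed to match the schema described in this article:

```python
import pandas as pd

def normalize_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    """Convert Unix-millisecond timestamps to naive UTC, sort, and reindex."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
    df["timestamp"] = df["timestamp"].dt.tz_localize(None)  # naive UTC
    return df.sort_values("timestamp").reset_index(drop=True)
```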
OHLC Validation
A valid candle must satisfy four constraints: high is greater than or equal to both open and close, low is less than or equal to both open and close, high is greater than or equal to low, and all four price values are positive.
invalid = (
(df["high"] < df["open"]) |
(df["high"] < df["close"]) |
(df["low"] > df["open"]) |
(df["low"] > df["close"]) |
(df["high"] < df["low"]) |
(df[["open", "high", "low", "close"]] <= 0).any(axis=1)  # all prices must be positive
)
These constraints sound obvious, but violations happen more often than you might expect. We have seen candles from exchange APIs where high was less than close (likely a data feed error), candles with negative volume (an API parsing issue), and candles where open equaled close equaled high equaled low equaled zero (exchange maintenance periods).
Our validation marks invalid candles rather than deleting them. A strategy can choose to skip marked candles or interpolate. Deletion would create timestamp gaps, which causes its own problems in rolling window calculations.
In practice, across 20.8 million candles, we have found approximately 0.003% invalid rows. That is about 600 candles total. Most were from low-liquidity periods on newer tokens (WIF and PEPE in their first weeks of trading) where single large trades created momentary OHLC anomalies.
Gap Detection
A gap is a missing candle in a time series. For 15-minute candles, consecutive timestamps should be exactly 900 seconds apart. A gap means the exchange either did not produce a candle (no trades occurred in that interval) or our fetch missed it.
The detection is straightforward: compute the difference between consecutive timestamps and flag any that exceed the expected interval.
expected_interval = pd.Timedelta(minutes=15)
gaps = df["timestamp"].diff() > expected_interval * 1.5
The 1.5x tolerance accounts for minor timestamp jitter. Some exchanges report candle timestamps that are a few seconds off from the exact interval boundary. Without tolerance, you get false positive gap alerts on every data fetch.
When gaps are found, the integrity module reports them with their location and duration. Short gaps (one missing candle) are common and harmless for most strategies since a rolling(20) calculation simply has 19 values instead of 20 for the affected window. Long gaps (hours or days of missing data) are flagged as critical because they can produce misleading indicator values.
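A gap report along these lines can be built directly from the timestamp diffs. The function name and output columns here are illustrative, not the integrity module's actual API:

```python
import pandas as pd

def report_gaps(df, interval=pd.Timedelta(minutes=15), tolerance=1.5):
    """List each gap with its location and duration."""
    deltas = df["timestamp"].diff()
    mask = deltas > interval * tolerance
    report = pd.DataFrame({
        "gap_start": df["timestamp"].shift(1)[mask],  # last candle before the gap
        "gap_end": df["timestamp"][mask],             # first candle after the gap
        "duration": deltas[mask],
    })
    report["missing_candles"] = (report["duration"] / interval).round().astype(int) - 1
    return report.reset_index(drop=True)
```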
We fill short gaps with forward-fill for strategies that require continuous data (like our Kalman filter strategy, which expects evenly spaced observations). For most strategies, we leave gaps as-is because forward-filling prices implies zero volatility during the gap, which is a stronger assumption than simply having fewer data points.
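One way to sketch the forward-fill for evenly spaced strategies: reindex onto a complete grid and synthesize flat candles, with OHLC pinned to the last known close and zero volume (forward-filling volume would double-count it). Column names and the exact fill policy are assumptions:

```python
import pandas as pd

def fill_short_gaps(df, freq="15min"):
    """Synthesize flat, zero-volume candles for missing timestamps."""
    full = pd.date_range(df["timestamp"].iloc[0], df["timestamp"].iloc[-1], freq=freq)
    out = df.set_index("timestamp").reindex(full).rename_axis("timestamp")
    synthetic = out["close"].isna()
    out["close"] = out["close"].ffill()
    # A gap candle is flat at the last known close.
    for col in ("open", "high", "low"):
        out.loc[synthetic, col] = out.loc[synthetic, "close"]
    out.loc[synthetic, "volume"] = 0.0
    return out.reset_index()
```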
Quality Scoring
Each symbol-timeframe combination gets a quality score from 0 to 100 based on four metrics: completeness (percentage of expected candles present), validity (percentage passing OHLC checks), consistency (no duplicate timestamps), and recency (how fresh the latest candle is).
A score of 95+ means the data is production-ready. Between 80 and 95, it is usable for backtesting but should be inspected. Below 80, it is unreliable and should not be used for trading decisions.
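A scoring function combining the four metrics might look like the sketch below. The equal weighting and these exact metric definitions are illustrative assumptions, not the platform's actual formula:

```python
import pandas as pd

def quality_score(df, expected_count, now, max_age=pd.Timedelta(hours=1)):
    """0-100 score from completeness, validity, consistency, and recency."""
    completeness = min(1.0, len(df) / expected_count)
    validity = 1.0 - df.get("invalid", pd.Series(False, index=df.index)).mean()
    consistency = 1.0 - df["timestamp"].duplicated().mean()
    age = now - df["timestamp"].iloc[-1]
    recency = max(0.0, 1.0 - age / max_age)
    return round(100 * (completeness + validity + consistency + recency) / 4, 1)
```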
Our 25 symbols average quality scores above 98. The lowest is MATIC/USDT at 91, because Polygon migrated from MATIC to POL on Binance in September 2024, so the data series terminates abruptly. We still carry the historical data for backtesting but do not run live bots on it.
Quality scores are not just a development tool. Our Data Coverage tab in the dashboard (accessible at /market?tab=coverage) displays quality scores for every symbol and timeframe combination. Before deploying a new bot, we check that the target symbol has a quality score above 95 on the target timeframe.
Deduplication with INSERT OR IGNORE
Duplicate candles are the most insidious data quality issue. They do not fail OHLC validation. They do not show up as gaps. But they double-count volume, produce incorrect rolling averages, and can trigger duplicate signals in a backtest.
Duplicates arise from overlapping fetch windows. When we incrementally update candle data, we fetch the last 500 candles, which overlap with existing stored data. The naive approach of appending all fetched candles to the database creates duplicates at the overlap boundary.
Our solution uses SQLite's INSERT OR IGNORE with a unique constraint on (exchange_id, symbol, timeframe, timestamp). If a row with the same key already exists, the insert is silently skipped.
CREATE UNIQUE INDEX ix_candles_unique
ON candles (exchange_id, symbol, timeframe, timestamp);
This makes incremental updates idempotent. We can re-fetch the same time range ten times and the database will still have exactly one candle per timestamp. The storage module reports how many rows were actually new versus skipped, which is useful for monitoring whether incremental syncs are working correctly.
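In Python, the idempotent insert plus the new-versus-skipped report can be sketched with sqlite3's total_changes counter. The column list is an assumption about the schema; the unique index above is what makes re-runs safe:

```python
import sqlite3

def insert_candles(conn, rows):
    """Idempotent candle insert; returns (new, skipped)."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO candles "
        "(exchange_id, symbol, timeframe, timestamp, open, high, low, close, volume) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    new = conn.total_changes - before
    return new, len(rows) - new
```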
The same pattern applies to funding rates, open interest, long/short ratios, and premium index data. All have unique constraints on their natural keys, and all use INSERT OR IGNORE for deduplication.
The Loading Pipeline
When a strategy needs data, the pipeline runs in this order.
First, the data adapter queries SQLite for the requested symbol and timeframe within the needed time range. The query returns a DataFrame sorted by timestamp ascending.
Second, the integrity module runs validation: OHLC checks, gap detection, and quality scoring. Invalid rows are flagged. Gaps are logged.
Third, if the strategy requires multiple timeframes (our 4h momentum strategy needs both 4h and 1d candles), the adapter fetches each timeframe separately and returns them in a dictionary keyed by timeframe string.
Fourth, if the strategy uses derivatives data, the adapter also loads funding rates, open interest, long/short ratios, and basis data from their respective tables. These are added to the dictionary under keys like "funding", "oi", "lsr", and "basis".
The strategy receives a clean dictionary of DataFrames, each validated and sorted. It never touches the database directly. It never parses raw API responses. The strategy's only job is to analyze clean data and produce a signal.
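The shape of what the adapter hands to a strategy can be sketched as follows. Here query_fn is a hypothetical stand-in for the SQLite adapter's query method, and the validation step is elided:

```python
def load_strategy_data(query_fn, symbol, timeframes, derivatives=()):
    """Assemble the dictionary of DataFrames a strategy receives."""
    data = {tf: query_fn(symbol, tf).sort_values("timestamp").reset_index(drop=True)
            for tf in timeframes}
    for key in derivatives:  # e.g. "funding", "oi", "lsr", "basis"
        data[key] = query_fn(symbol, key)
    return data
```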
Backtest Data Loading
Backtesting adds one more concern: no lookahead. When the backtest engine replays historical data, it must only show the strategy candles that would have been available at each point in time.
The engine loads the full time range upfront for performance, but it slices the DataFrame progressively. At simulated time T, the strategy sees candles from the start of the dataset up to T. It never sees future candles.
This slicing applies to all data types. Funding rates are sliced to the current simulated time. Open interest is sliced. Macro indicators are sliced. The backtest engine handles this uniformly so that strategies do not need to implement their own anti-lookahead logic.
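The uniform slicing reduces to one filter applied to every DataFrame in the dictionary. A minimal sketch with a hypothetical helper name:

```python
def visible_at(data, t):
    """Slice every DataFrame so the strategy sees only rows at or before simulated time t."""
    return {key: df[df["timestamp"] <= t] for key, df in data.items()}
```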
For the tournament, this means processing 175,000 candles per symbol for the full 5-year 15m dataset. The progressive slicing and indicator recalculation at each step is the dominant cost in backtesting. Our distributed worker setup (two remote machines plus the local Mac) parallelizes across strategy-symbol combinations to keep total runtime manageable.
Practical Advice
If you are building a trading data pipeline, here are the lessons that cost us time.
Always validate before computing. Running a rolling average on data with duplicates or OHLC violations produces wrong results silently. Validate first, compute second.
Use unique constraints in your database. Deduplication in application code is fragile. Database constraints are enforced regardless of which code path inserts the data.
Store timestamps in UTC. Every timezone conversion bug we have seen in trading systems traces back to mixing UTC and local time. Pick UTC and never convert.
Forward-fill with caution. Filling a one-candle gap is reasonable. Filling a 24-hour gap with the last known price implies the market was frozen for a day. For most indicators, it is better to have NaN values and let the rolling window handle shorter inputs.
Log everything. Our integrity module logs every gap, every invalid candle, and every quality score drop. When a strategy produces unexpected results, the first thing we check is the data quality log.