Most trading systems fail not because the strategy is wrong but because the data is wrong. A missing candle causes a moving average to shift. A duplicate row inflates volume calculations. A candle where the high is lower than the low (physically impossible) corrupts every indicator that touches it. Data quality is the invisible foundation. When it is solid, you never think about it. When it cracks, everything built on top of it fails in ways that are difficult to diagnose.
We maintain 20.8 million candles across 25 symbols and 11 timeframes, plus funding rates, open interest, long/short ratios, premium index data, and 24 Coinglass-sourced tables covering derivatives, macro, and on-chain metrics. Every record passes through validation before any strategy is allowed to consume it.
The Anatomy of Bad Data
Exchange APIs are not databases. They are distributed systems with eventual-consistency guarantees. Binance's candle API occasionally returns candles with timestamps that do not align with the expected interval: a 15-minute candle might carry a timestamp of 14:57 instead of 15:00 due to server-side clock skew. Less commonly, candles are simply missing: the API returns the 14:45 candle and the 15:15 candle but not the 15:00 candle.
Duplicate candles are another frequent issue. When fetching candles incrementally (requesting candles since the last known timestamp), the API may return the last known candle again as the first result. Without deduplication, this creates a duplicate row in the database that distorts any calculation that sums or averages across the dataset.
OHLC violations are rare but damaging. A candle where the high is 50,000 and the low is 51,000 is physically impossible. A candle where the open is higher than the high or lower than the low violates the basic definition of OHLC data. These typically result from exchange reporting errors during extreme volatility or from partial fills during the candle period.
Our Validation Pipeline
Every candle passes through three validation stages before storage.
Stage one: deduplication. We use INSERT OR IGNORE with a unique constraint on (exchange_id, symbol, timeframe, timestamp). This means if a candle with the same identity already exists in the database, the new record is silently dropped. The deduplication is at the database level, not the application level, which makes it impossible for a duplicate to slip through regardless of how many times the same data is fetched.
This same pattern applies to all supplementary data. Funding rates use a unique constraint on (exchange_id, symbol, timestamp). Open interest, long/short ratios, and premium index data all follow the same pattern. The Coinglass tables use INSERT OR IGNORE across their respective unique keys. No table in the system allows duplicate records for the same data point.
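A minimal sketch of this database-level deduplication using SQLite (the table and column names here are illustrative, not the system's actual schema):

```python
import sqlite3

# Illustrative schema: the UNIQUE table constraint is what provides the
# database-level guarantee, independent of application logic.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE candles (
        exchange_id TEXT, symbol TEXT, timeframe TEXT, timestamp INTEGER,
        open REAL, high REAL, low REAL, close REAL, volume REAL,
        UNIQUE (exchange_id, symbol, timeframe, timestamp)
    )
""")

def store_candles(rows):
    # INSERT OR IGNORE silently drops any row whose identity already
    # exists, so refetching an overlapping window never creates duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO candles VALUES (?,?,?,?,?,?,?,?,?)", rows
    )
    conn.commit()

batch = [("binance", "BTC/USDT", "15m", 1700000000, 1.0, 2.0, 0.5, 1.5, 10.0)]
store_candles(batch)
store_candles(batch)  # refetched overlap: ignored, not duplicated
count = conn.execute("SELECT COUNT(*) FROM candles").fetchone()[0]  # → 1
```

Because the constraint lives in the schema, two fetch processes running concurrently still cannot produce a duplicate row.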
Stage two: OHLC validation. The data integrity module checks four invariants for every candle. High must be greater than or equal to low. Open must lie between low and high (inclusive). Close must lie between low and high (inclusive). Volume must be non-negative. Candles that violate any of these invariants are flagged and excluded from strategy consumption.
In practice, OHLC violations are extremely rare on Binance, occurring in fewer than 0.001 percent of candles. But that rarity is exactly why they are dangerous. A strategy developer who has never seen a violated candle will not build defenses against one. The validation layer catches them universally so that no strategy needs to implement its own checks.
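The four invariants can be expressed as a single check, sketched here as a standalone function (the violation labels are illustrative):

```python
def validate_ohlc(open_, high, low, close, volume):
    """Check the four candle invariants; return the list of violated rules."""
    violations = []
    if high < low:
        violations.append("high < low")
    if not (low <= open_ <= high):
        violations.append("open outside [low, high]")
    if not (low <= close <= high):
        violations.append("close outside [low, high]")
    if volume < 0:
        violations.append("negative volume")
    return violations

# A well-formed candle passes cleanly...
ok = validate_ohlc(100.0, 105.0, 99.0, 104.0, 12.5)   # → []
# ...while an impossible candle (high below low) is flagged.
bad = validate_ohlc(50500.0, 50000.0, 51000.0, 50500.0, 1.0)
```

Returning the list of violations rather than a boolean lets the flagged candle be logged with the specific rule it broke.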
Stage three: gap detection. After storing new candles, the integrity module scans for temporal gaps. For a 15-minute timeframe, it checks that consecutive candles are exactly 900 seconds apart. For a 1-hour timeframe, exactly 3,600 seconds. Gaps wider than the expected interval are logged with the symbol, timeframe, gap start, and gap end.
Gaps are not necessarily errors. Some exchanges close trading briefly during maintenance windows. Some symbols have periods of zero volume where no candles are generated. The gap detection module distinguishes between expected gaps (maintenance windows, low-liquidity periods) and unexpected gaps (missing data that should exist). Unexpected gaps trigger a backfill request that attempts to fetch the missing candles from the exchange.
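The gap scan itself reduces to comparing consecutive timestamps against the expected interval. A minimal sketch, assuming timestamps are sorted Unix seconds:

```python
def find_gaps(timestamps, interval_seconds):
    """Scan sorted candle timestamps and report spans wider than one interval.

    Each gap is returned as (first_missing_ts, last_missing_ts).
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > interval_seconds:
            gaps.append((prev + interval_seconds, curr - interval_seconds))
    return gaps

# 15-minute candles (900 s apart) with the candle at t=2700 missing.
ts = [t for t in range(0, 4500, 900) if t != 2700]
gaps = find_gaps(ts, 900)  # → [(2700, 2700)]
```

Reporting the gap as a (start, end) span gives the backfill step exactly the range of candles it needs to request from the exchange.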
Quality Scoring
Beyond binary validation (pass or fail), we calculate a quality score for each symbol-timeframe combination. The score is a percentage representing data completeness: the number of candles that exist divided by the number of candles that should exist for the date range.
A score of 100 percent means perfect coverage. A score of 99.5 percent means a few candles are missing, likely during exchange maintenance windows. A score below 98 percent triggers an investigation because it suggests a systematic data collection problem.
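The score is a simple ratio of actual to expected candles. A sketch, assuming inclusive endpoints and a fixed interval:

```python
def quality_score(existing_count, start_ts, end_ts, interval_seconds):
    """Completeness as a percentage of the candles the date range should hold."""
    expected = (end_ts - start_ts) // interval_seconds + 1
    return 100.0 * existing_count / expected

# One day of hourly candles with two missing: 22 of 24 present.
score = quality_score(22, start_ts=0, end_ts=23 * 3600, interval_seconds=3600)
# score ≈ 91.67, well below the 98 percent investigation threshold
```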
Our current coverage across the 25-symbol, 11-timeframe universe averages 99.7 percent. The lowest scores are on smaller altcoins (WIF, PEPE) where trading launched more recently, so historical data before the listing date does not exist. The highest scores are on BTC/USDT and ETH/USDT, which have complete coverage back to 2021.
Data Quality for Derivatives Data
Funding rates, open interest, long/short ratios, and premium index data have additional validation requirements beyond OHLC candles.
Funding rates settle every 8 hours on Binance. The validation checks that rates are within plausible bounds (typically between negative 0.01 and positive 0.01 per period). Extreme values outside this range during liquidation cascades are valid but flagged for manual review.
Open interest values must be non-negative (you cannot have negative open interest). The validation also checks for implausible jumps: if open interest doubles or halves within a single period, the data point is flagged. These jumps do occur during extreme events but are rare enough to warrant investigation.
Long/short ratios must be internally consistent: the long_account and short_account fractions should sum to approximately 1. If the long ratio is 0.65, the short ratio should be approximately 0.35. Ratios that deviate significantly from this complementary relationship indicate data corruption.
Premium index data (the basis between spot and futures) uses the same OHLC validation as regular candles. The close value represents the premium at period end and is typically small (under 0.5 percent of spot price). Values that deviate significantly are flagged.
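The derivatives checks above can be sketched as simple predicates. The thresholds are the illustrative figures from the text, not tuned constants:

```python
def funding_rate_plausible(rate, bound=0.01):
    # Rates outside roughly ±1% per period can be valid during
    # liquidation cascades, but they warrant manual review.
    return -bound <= rate <= bound

def open_interest_jump(prev_oi, curr_oi):
    # Flag a doubling or halving of open interest within one period.
    return prev_oi > 0 and (curr_oi >= 2 * prev_oi or curr_oi <= 0.5 * prev_oi)

def ratio_consistent(long_frac, short_frac, tol=0.02):
    # long_account and short_account fractions should sum to ~1.
    return abs((long_frac + short_frac) - 1.0) <= tol

flags = (
    funding_rate_plausible(0.0001),            # normal funding rate: plausible
    open_interest_jump(1_000_000, 2_500_000),  # 2.5x jump: flagged
    ratio_consistent(0.65, 0.35),              # complementary: consistent
)
```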
The Backtest Data Pipeline
Data quality is even more critical for backtesting than for live trading. A live bot processes one candle at a time, so a single bad candle causes at most one bad signal. A backtest replays thousands of candles, and bad data early in the sequence can propagate errors through every subsequent calculation.
Our backtest engine loads candle data through the same validation pipeline that live bots use. Before a backtest begins, the engine checks the quality score for each symbol-timeframe combination in the test period. If coverage is below 98 percent, the backtest aborts with an error message identifying the gap locations. This prevents strategies from being evaluated on incomplete data, which would produce unreliable performance metrics.
The backtest data loader also ensures no lookahead bias. Data is loaded chronologically, and each candle is only available to the strategy at the time it would have been available in live trading. Funding rates are only available after their settlement timestamp. Open interest values are only available after their reporting period. This temporal alignment is a data quality requirement, not just a backtesting methodology requirement.
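The temporal filter behind this can be sketched as follows (the `available_ts` field is a hypothetical name for whichever timestamp marks when a record became observable: candle close, funding settlement, or end of the reporting period):

```python
def available_at(records, as_of):
    """Return only records observable at or before `as_of` in live trading."""
    return [r for r in records if r["available_ts"] <= as_of]

funding = [
    {"available_ts": 1700000000, "rate": 0.0001},
    {"available_ts": 1700028800, "rate": 0.0003},  # settles 8 hours later
]
# At the first settlement time, the later rate must be invisible
# to the strategy, exactly as it would be in live trading.
visible = available_at(funding, as_of=1700000000)
```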
Practical Lessons
Three years of managing trading data across multiple symbols and timeframes has taught us several lessons.
First, deduplication must be at the database level. Application-level deduplication works until someone runs two data fetch processes simultaneously, or a process crashes and restarts mid-fetch. The unique constraint in the database is the only reliable guarantee.
Second, validation must happen on write, not on read. Checking data quality when a strategy requests candles is too late. By that point, the bad data exists in the database and might have been used by other processes. Validating on write prevents bad data from entering the system at all.
Third, automated gap detection pays for itself quickly. The first time a gap causes a strategy to produce an incorrect signal during paper trading, the investigation time to track down the root cause exceeds the time to build the gap detection system. Finding gaps proactively and backfilling them before any strategy is affected is the only approach that scales.