Exchange APIs are the foundation of every automated trading system and the most common point of failure. Binance's REST API handles millions of requests per day, but it has rate limits, experiences outages, and occasionally returns unexpected responses. When you run a single bot, API failures are an inconvenience. When you run 45 bots sharing a single API key, API failures are an engineering problem that requires systematic solutions.
The Rate Limit Landscape
Binance enforces multiple rate limits simultaneously. The primary limit is 1,200 weighted requests per minute per IP address. Different endpoints have different weights: a simple market data query costs 1 weight, a candle fetch costs 1 to 5 depending on the limit parameter, and order placement costs 1. There are also order rate limits (10 orders per second, 200,000 per day) and raw request limits.
With 45 bots, each needing to fetch candle data on every tick, the raw request volume adds up quickly. A 15-minute bot ticks every 900 seconds and needs candles for its primary timeframe plus any secondary timeframes (some strategies use 1-hour or 4-hour confirmation candles alongside 15-minute primary candles). Thirteen mean reversion bots each fetching two timeframes means 26 candle requests per tick cycle, plus another 20 or so from the momentum and macro bots. In a single tick cycle, we might issue 50 to 70 candle requests.
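The arithmetic above can be checked with a quick budget calculation against the 1,200-weight-per-minute cap. The bot counts come from the text; treating every candle fetch as the worst-case weight of 5 is an assumption for illustration:

```python
# Per-tick-cycle request budget, using the bot counts from the text.
mean_reversion_requests = 13 * 2      # 13 bots, two timeframes each
other_bot_requests = 20               # momentum + macro bots, approximate
candle_requests = mean_reversion_requests + other_bot_requests

# Pessimistic assumption: every candle fetch costs weight 5
# against the 1,200-per-minute weight budget.
worst_case_weight = candle_requests * 5
```

Even in the worst case, one cycle's candle fetches consume a meaningful fraction of the per-minute budget, which is why the deduplication and coordination described below matter.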
The ccxt library that we use for exchange connectivity handles rate limit headers automatically. When a response includes a rate limit warning header, ccxt backs off before the next request. But this passive approach is insufficient for 45 bots because the backoff happens per-request, not per-batch. Without coordination, multiple bots can simultaneously hit the rate limit and all back off, creating a thundering herd on the next attempt.
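One way to add batch-level coordination on top of ccxt's per-request throttling is to funnel every exchange call through a shared gate. This is a stdlib-only sketch; the gate size of 5 and the function names are illustrative assumptions, not ccxt features:

```python
import asyncio

# Shared gate: at most N exchange requests in flight across all bots.
# The limit of 5 is an illustrative assumption, not a ccxt setting.
REQUEST_GATE = asyncio.Semaphore(5)

async def throttled(coro):
    """Run an exchange call through the shared gate so backoff pressure
    applies to the batch of bots, not to each request in isolation."""
    async with REQUEST_GATE:
        return await coro

async def demo():
    async def fake_fetch(symbol):
        await asyncio.sleep(0)          # stand-in for a network call
        return symbol

    return await asyncio.gather(*(throttled(fake_fetch(s))
                                  for s in ("SOL/USDT", "AVAX/USDT")))
```

Because every bot awaits the same semaphore, a burst of 45 simultaneous ticks drains through the gate in an orderly stream instead of hitting the rate limit as a herd.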
Request Deduplication
The most effective rate limit mitigation is not making duplicate requests. Multiple bots trading the same symbol on the same timeframe need the same candle data. SOL/USDT 15-minute candles are needed by both the mean reversion bot and the momentum bot on SOL. Fetching them twice wastes one request against the rate limit.
Our DataFetcher implements request deduplication at the symbol-timeframe level. When a bot requests candles for SOL/USDT on the 15-minute timeframe, the fetcher checks if those candles have already been fetched within the current tick cycle. If they have, the cached result is returned immediately without an exchange request. If they have not, the request goes to the exchange and the result is cached for the remainder of the tick cycle.
This deduplication reduces our effective request count by 30 to 40 percent. Symbols like SOL, AVAX, and DOGE that are traded by multiple strategies benefit the most. The cache is cleared at the start of each tick cycle to ensure fresh data.
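A minimal sketch of the deduplication layer follows. The class and method names are illustrative; the real DataFetcher may be structured differently:

```python
class CandleCache:
    """Per-tick-cycle deduplication keyed by (symbol, timeframe)."""

    def __init__(self):
        self._cache = {}
        self.exchange_requests = 0   # calls that actually hit the exchange

    def get_candles(self, symbol, timeframe, fetch_fn):
        """Return cached candles if this (symbol, timeframe) was already
        fetched this cycle; otherwise fetch once and cache the result."""
        key = (symbol, timeframe)
        if key not in self._cache:
            self.exchange_requests += 1
            self._cache[key] = fetch_fn(symbol, timeframe)
        return self._cache[key]

    def start_tick_cycle(self):
        """Clear the cache so each tick cycle sees fresh data."""
        self._cache.clear()
```

Two bots asking for SOL/USDT 15-minute candles in the same cycle trigger one exchange call; the second bot gets the cached result for free.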
Retry Logic for Transient Failures
Exchange APIs fail transiently. A 500 Internal Server Error from Binance typically resolves within seconds. A network timeout might be caused by a brief routing hiccup. These failures should not stop a bot or trigger an alert. They should be retried.
Our retry logic uses exponential backoff with jitter. The first retry waits 1 second. The second waits 2 seconds plus a random jitter of up to 500 milliseconds. The third waits 4 seconds plus jitter. After three failed retries, the request is abandoned and the bot's tick completes without that data. The bot remains in its current state and will attempt the data fetch again on the next tick.
The jitter is important for multi-bot systems. Without jitter, all 45 bots that failed on the same tick would retry at exactly the same time, recreating the thundering herd problem at the retry level. Random jitter spreads the retries across a window, preventing synchronized spikes.
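The backoff schedule can be sketched in a few lines. One simplification relative to the description above: this version applies jitter to every retry, including the first:

```python
import random

def backoff_schedule(max_retries=3, base=1.0, max_jitter=0.5):
    """Exponential backoff delays in seconds (1s, 2s, 4s), each plus
    random jitter of up to max_jitter to desynchronize the bots."""
    return [base * (2 ** attempt) + random.uniform(0, max_jitter)
            for attempt in range(max_retries)]
```

With 45 bots drawing independent jitter, retries that would otherwise land on the same instant are spread across a half-second window.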
Timeout Configuration
The default timeout for HTTP requests to Binance is 30 seconds. This is too long for a trading bot. If the exchange is genuinely down, waiting 30 seconds per request per bot creates a cascading delay where later bots in the tick cycle are still waiting for earlier bots' timeouts to expire.
We set timeouts to 10 seconds for market data requests and 15 seconds for order-related requests. Market data can be retried cheaply. Order requests need slightly more patience because a partial execution on the exchange side followed by a client timeout creates an ambiguous state that requires reconciliation.
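The policy reduces to a small lookup. The function name and request-kind labels here are illustrative, not the system's actual identifiers:

```python
# Timeout policy in milliseconds, matching the values in the text.
TIMEOUTS_MS = {
    "market_data": 10_000,   # cheap to retry, so fail fast
    "order": 15_000,         # ambiguous on timeout, so allow more patience
}

def timeout_for(kind: str) -> int:
    """Default to the stricter market-data timeout for unknown kinds."""
    return TIMEOUTS_MS.get(kind, TIMEOUTS_MS["market_data"])
```

Defaulting unknown request kinds to the shorter timeout is a deliberate choice: it is safer to fail fast and retry than to stall a tick cycle.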
SQLite Under Concurrent Access
All 45 bots write to the same SQLite database. Orders, fills, positions, trades, equity snapshots, and risk events all flow into the same file. SQLite handles concurrent reads well but serializes writes. Without configuration, concurrent writes from multiple bot coroutines can produce "database is locked" errors.
WAL (Write-Ahead Logging) mode is the primary mitigation. With WAL enabled, readers do not block writers and writers do not block readers. Multiple bots can read data simultaneously while one bot writes. Writes are still serialized, but WAL mode makes the serialization fast: the writer appends to the WAL file rather than modifying the main database file, which is significantly faster.
The busy_timeout pragma is set to 5,000 milliseconds. When a write attempt encounters a lock held by another writer, SQLite waits up to 5 seconds for the lock to release before returning an error. With 45 bots, write contention is rare because individual writes are fast (single row inserts). The 5-second timeout provides a generous buffer for the occasional case where two bots write simultaneously.
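Both settings are applied at connection open. This is a minimal sketch using the stdlib sqlite3 module; the function name is illustrative:

```python
import sqlite3

def open_trading_db(path):
    """Open the shared SQLite database with WAL mode and a 5s busy timeout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")    # readers don't block the writer
    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5s on a held lock
    return conn
```

Note that journal_mode=WAL is persistent (stored in the database file), while busy_timeout is per-connection and must be set by every bot that opens the file.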
Graceful Degradation
The system is designed to degrade gracefully when external services are unavailable. If the Binance API is completely down, bots cannot fetch candle data and their tick loops complete early without generating signals. Existing positions remain open. No new positions are entered. When the API recovers, bots resume on their next scheduled tick as if nothing happened.
If the Claude API is unavailable (used for signal enrichment), raw signals pass through unchanged. The AI enrichment layer adjusts confidence scores by plus or minus 0.2, but it is advisory, not gatekeeping. Without AI enrichment, signals are slightly less refined but still valid. The system logs the AI unavailability and continues.
If Telegram is unavailable, alerts queue in memory and are sent when the connection is restored. Trading continues. Alerts are important for operator awareness but not for trade execution.
This layered degradation means the critical path (exchange data fetch, strategy signal, risk check, order execution) depends only on the exchange API. All other services (AI, Telegram, dashboard SSE) can fail independently without affecting trade execution.
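The advisory nature of the AI layer can be sketched as a pass-through on failure. The ai_client object and its adjust() method are hypothetical stand-ins for the real enrichment interface:

```python
def enrich_signal(signal, ai_client=None):
    """Adjust confidence by at most +/-0.2; on any AI failure, return the
    raw signal unchanged. The ai_client.adjust() call is hypothetical."""
    if ai_client is None:
        return signal
    try:
        delta = ai_client.adjust(signal)
        delta = max(-0.2, min(0.2, delta))        # clamp to the advisory range
        enriched = dict(signal)
        enriched["confidence"] = max(0.0, min(1.0, signal["confidence"] + delta))
        return enriched
    except Exception:
        return signal   # AI is advisory, not gatekeeping
```

The key property is that every failure path, including a missing client, returns a valid signal, so the critical path never depends on the AI service.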
Reconciliation on Recovery
The most dangerous failure mode is a network split during order execution. The bot sends an order to Binance. The network connection drops. The bot does not receive the order confirmation. From the bot's perspective, the order may or may not have been filled. From Binance's perspective, the order was received and is being processed.
Our reconciliation module handles this on every bot startup and can be triggered manually. It queries the exchange for all open orders and recent fills associated with our API key. It compares the exchange state with the local database state. Orders that exist on the exchange but not locally are recorded. Fills that occurred on the exchange but are not reflected locally are applied. Orders that exist locally but not on the exchange (canceled by the exchange due to timeout or other reasons) are marked accordingly.
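The comparison step reduces to a set difference over order IDs. This is a simplification for illustration; the real module also reconciles fills and partial executions:

```python
def diff_order_state(exchange_orders, local_orders):
    """Return which order IDs need to be recorded locally and which need
    to be marked canceled. Orders are dicts with at least an 'id' key."""
    exchange_ids = {o["id"] for o in exchange_orders}
    local_ids = {o["id"] for o in local_orders}
    return {
        "record_locally": exchange_ids - local_ids,   # on exchange, not local
        "mark_canceled": local_ids - exchange_ids,    # local, not on exchange
    }
```

Orders present in both sets still need a field-by-field check (status, filled quantity), but the two difference sets capture the dangerous divergence cases.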
This reconciliation runs automatically on server restart as part of the state recovery process. It also runs periodically (every hour) as a background check. The goal is to ensure that the local state never diverges from the exchange state for more than one hour, even in the worst case of a crash during order execution.