Every tutorial for building a crypto trading bot starts the same way. Import ccxt. Fetch some candles. Calculate RSI. If RSI is below 30, buy. If above 70, sell. Print the result. Fifty lines of Python and you have a trading bot.
You do not have a trading bot. You have a script that will lose money. The gap between a tutorial script and a production trading bot is enormous, and it is filled with the components that tutorials skip: exchange connectivity with retry logic, position state management, risk checks, order tracking, crash recovery, and monitoring.
We built a production platform that runs 45 bots simultaneously. Here is what the real architecture looks like.
The Core Loop
A production bot runs a tick loop on a schedule. Every N seconds (900 for 15-minute strategies, 3600 for hourly, 14400 for 4-hour), the bot executes a cycle: fetch candles, update unrealized PnL on open positions, check for existing positions, run strategy analysis, apply risk checks, and if everything passes, place an order.
This loop must be async because you are fetching data from an external API (the exchange) and potentially calling an AI service for signal enrichment. Blocking the event loop while waiting for an HTTP response would prevent other bots from executing their ticks. We use Python's asyncio with aiohttp for non-blocking exchange communication.
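A minimal sketch of one tick cycle, using only asyncio. The names here (Bot, fetch_candles, analyze, passes_risk, tick, run_fleet) are hypothetical stand-ins rather than the platform's actual code. The exchange call is simulated with a short sleep, and the strategy and risk logic are placeholders.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bot:
    name: str
    symbol: str
    has_position: bool = False
    log: List[str] = field(default_factory=list)


async def fetch_candles(symbol: str) -> List[float]:
    # Stand-in for a non-blocking exchange request; returns closing prices.
    await asyncio.sleep(0.01)
    return [100.0, 101.0, 99.0, 98.0]


def analyze(closes: List[float]) -> Optional[str]:
    # Placeholder strategy: signal "buy" when the latest close is the lowest.
    return "buy" if closes[-1] == min(closes) else None


def passes_risk(bot: Bot, signal: str) -> bool:
    # Placeholder for the risk checks: only blocks a duplicate entry.
    return not bot.has_position


async def tick(bot: Bot) -> None:
    closes = await fetch_candles(bot.symbol)   # fetch candles
    bot.log.append("fetched")
    # PnL update and position bookkeeping are omitted from this sketch.
    signal = analyze(closes)                   # run strategy analysis
    if signal and passes_risk(bot, signal):    # apply risk checks
        bot.has_position = True                # simulated order placement
        bot.log.append(f"order:{signal}")


async def run_fleet(bots: List[Bot]) -> None:
    # gather() runs all ticks concurrently, so one bot waiting on I/O
    # does not hold up the others.
    await asyncio.gather(*(tick(b) for b in bots))
```

Calling asyncio.run(run_fleet(bots)) executes one cycle for every bot; scheduling the repeated cycles is a separate concern.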
APScheduler manages the timing. Each bot registers as an interval trigger job. The scheduler handles time drift (ensuring ticks fire at consistent intervals even if the previous tick ran long), missed ticks (coalescing multiple missed ticks into one execution), and graceful shutdown (waiting for in-progress ticks to complete before stopping).
Exchange Connectivity
The ccxt library provides a unified API for 100+ exchanges. Our DataFetcher wraps ccxt with retry logic, rate limit handling, and request deduplication. When 45 bots request SOL/USDT 15-minute candles simultaneously, the DataFetcher makes one API call and shares the result across all requesting bots.
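Request deduplication can be sketched as a "single-flight" wrapper: the first caller for a (symbol, timeframe) pair starts the upstream request, and everyone who asks while it is in flight awaits the same task. DedupFetcher and its fetch callback are hypothetical names for illustration; this version only coalesces concurrent requests and does not cache results afterwards.

```python
import asyncio


class DedupFetcher:
    """Coalesce concurrent requests for the same key into one upstream call."""

    def __init__(self, fetch):
        self._fetch = fetch        # async callable: (symbol, timeframe) -> data
        self._inflight = {}        # (symbol, timeframe) -> asyncio.Task
        self.upstream_calls = 0    # counter kept only to make the effect visible

    async def get(self, symbol: str, timeframe: str):
        key = (symbol, timeframe)
        task = self._inflight.get(key)
        if task is None:
            # First requester: start the real call and register it.
            self.upstream_calls += 1
            task = asyncio.create_task(self._fetch(symbol, timeframe))
            self._inflight[key] = task
            # Forget the task once it finishes so later ticks fetch fresh data.
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared task.
        return await asyncio.shield(task)
```

With 45 concurrent get("SOL/USDT", "15m") calls, the fetch callback runs once and all 45 callers receive the same result.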
Binance caps request weight at 1,200 per minute. With 13 unique symbols across 3 timeframes, a tick cycle in which every timeframe is due requires up to 39 deduplicated API calls (13 × 3). We stagger bot startup by 2 seconds to spread these requests across the tick interval rather than firing them simultaneously.
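The arithmetic behind those numbers, as a small sketch. The constant names are illustrative, and the weight of each individual call is deliberately left out because it depends on the endpoint.

```python
UNIQUE_SYMBOLS = 13
TIMEFRAMES = 3
BOT_COUNT = 45
STAGGER_SECONDS = 2

# One call per deduplicated (symbol, timeframe) pair.
calls_per_cycle = UNIQUE_SYMBOLS * TIMEFRAMES

# Bot i starts i * 2 seconds after the first bot.
start_offsets = [i * STAGGER_SECONDS for i in range(BOT_COUNT)]

# Seconds between the first and the last bot starting.
startup_window = start_offsets[-1]
```

With these inputs the fleet needs 39 calls per full cycle and the last bot starts 88 seconds after the first.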
Error handling is critical. Exchange APIs return errors for rate limits (HTTP 429), maintenance windows (HTTP 503), invalid parameters, and network timeouts. Our fetcher retries transient errors with exponential backoff and jitter, logs persistent errors, and returns the last known data rather than crashing. A single failed API call should never bring down the entire bot fleet.
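A minimal sketch of retrying with exponential backoff and jitter, falling back to the last known data when retries run out. TransientError is a hypothetical marker exception standing in for rate limits, maintenance responses, and timeouts; the parameter names and defaults are assumptions, not the fetcher's real configuration.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for retryable failures (429, 503, timeouts)."""


async def fetch_with_retry(fetch, *, last_known=None, retries=4,
                           base_delay=0.5, max_delay=30.0,
                           sleep=asyncio.sleep):
    """Call fetch(); retry transient failures, then fall back to stale data."""
    for attempt in range(retries + 1):
        try:
            return await fetch()
        except TransientError:
            if attempt == retries:
                break  # out of retries, fall through to the fallback
            # "Full jitter": wait a random time between 0 and the
            # exponentially growing (but capped) backoff ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            await sleep(random.uniform(0, ceiling))
    # Retries exhausted: return stale data instead of raising.
    return last_known
```

Any exception other than TransientError propagates immediately, which keeps invalid-parameter bugs from being silently retried. The sleep argument exists so tests can swap in a no-op.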
Position State Machine
Every position progresses through a state machine: FLAT (no position), OPENING (entry order placed), OPEN (entry order filled), CLOSING (exit order placed), CLOSED (exit order filled). The state machine enforces that transitions are valid — you cannot close a position that is not open, and you cannot open a new position when one is already opening.
This state tracking is essential for crash recovery. If the process crashes while a position is in OPENING state (entry order placed but fill not yet confirmed), the recovery system checks the order status on the exchange, syncs the fill data if it completed, and updates the local state accordingly. Without the state machine, a crash during order execution could result in orphaned exchange orders or duplicate entries.
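The state machine and the OPENING recovery path might look roughly like this. The five states come from the description above; the transition table, the two "order did not fill" fallbacks, and the status strings passed to recover_opening are assumptions made for the sketch.

```python
from enum import Enum


class PositionState(Enum):
    FLAT = "flat"
    OPENING = "opening"
    OPEN = "open"
    CLOSING = "closing"
    CLOSED = "closed"


# Allowed next states for each state.
ALLOWED = {
    PositionState.FLAT: {PositionState.OPENING},
    # Back to FLAT if the entry order never filled (assumption).
    PositionState.OPENING: {PositionState.OPEN, PositionState.FLAT},
    PositionState.OPEN: {PositionState.CLOSING},
    # Back to OPEN if the exit order never filled (assumption).
    PositionState.CLOSING: {PositionState.CLOSED, PositionState.OPEN},
    PositionState.CLOSED: set(),
}


class InvalidTransition(Exception):
    pass


class Position:
    def __init__(self):
        self.state = PositionState.FLAT

    def transition(self, new_state: PositionState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state


def recover_opening(position: Position, order_status: str) -> None:
    """Resolve a position left in OPENING after a crash.

    order_status is a simplified, hypothetical status string obtained by
    querying the exchange for the entry order.
    """
    if position.state is not PositionState.OPENING:
        return
    if order_status == "filled":
        position.transition(PositionState.OPEN)
    elif order_status in ("canceled", "rejected"):
        position.transition(PositionState.FLAT)
    # Any other status: stay in OPENING and check again later.
```

Because every change goes through transition(), a second entry attempt while the first is still OPENING raises instead of creating a duplicate order.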
The Risk Hierarchy
The tutorial bot has no risk management. Our production bot evaluates five sequential per-bot checks before every order: stop-loss enforcement (no entry without a predefined exit), maximum position size (25 percent of bot capital), drawdown circuit breaker (20 percent from peak equity), daily loss limit (5 percent of allocated capital), and consecutive loss cooldown (position halving after 3 losses, floor at 10 percent).
Beyond individual bot risk, three portfolio-level constraints prevent correlated blowups across all 45 bots: total exposure cap (50 percent of aggregate capital), single asset concentration (25 percent), and portfolio drawdown halt (15 percent aggregate decline).
These checks execute synchronously in the tick loop. Every order passes through every check. The first failure blocks the trade. There is no override mechanism.
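A minimal sketch of a sequential check chain where the first failure blocks the trade. It covers four of the five per-bot checks with the thresholds quoted above; the consecutive-loss cooldown is left out because it resizes positions rather than rejecting them. Order, BotState, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    notional: float             # order value in quote currency
    stop_loss: Optional[float]  # predefined exit price, if any


@dataclass
class BotState:
    capital: float       # capital allocated to this bot
    equity: float        # current equity
    peak_equity: float   # highest equity seen so far
    daily_pnl: float     # realized PnL for the current day


# Each check returns a rejection reason, or None when the order passes.
def require_stop_loss(order, state):
    return None if order.stop_loss is not None else "no stop-loss"


def max_position_size(order, state):
    ok = order.notional <= 0.25 * state.capital
    return None if ok else "position exceeds 25% of capital"


def drawdown_breaker(order, state):
    drawdown = (state.peak_equity - state.equity) / state.peak_equity
    return None if drawdown < 0.20 else "drawdown circuit breaker (20%)"


def daily_loss_limit(order, state):
    breached = state.daily_pnl <= -0.05 * state.capital
    return "daily loss limit (5%)" if breached else None


CHECKS = [require_stop_loss, max_position_size, drawdown_breaker, daily_loss_limit]


def evaluate(order: Order, state: BotState, checks=CHECKS) -> Optional[str]:
    """Run checks in order and return the first rejection reason, if any."""
    for check in checks:
        reason = check(order, state)
        if reason is not None:
            return reason
    return None
```

A return value of None means the order may proceed; any string is both the block and the audit-log reason. There is deliberately no flag to skip a check.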
State Persistence
A tutorial bot loses all state when you close the terminal. A production bot persists everything to SQLite with WAL mode. Positions, orders, trades, equity snapshots, risk events, and bot configuration are all stored in the database.
SQLite in WAL mode lets readers (the API serving dashboard queries) proceed while a writer (a bot updating positions) is active, so reads and writes do not block each other. Writes are still serialized: only one writer holds the lock at a time. The busy_timeout of 5000 milliseconds means a write waits up to 5 seconds for that lock rather than failing immediately, which covers the rare case where multiple bots attempt writes simultaneously.
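A minimal sketch of those two settings with the standard-library sqlite3 module. The open_db helper, the positions table, and the demo function are illustrative assumptions; the demo uses a temporary file because WAL mode needs an on-disk database.

```python
import os
import sqlite3
import tempfile


def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # WAL is a persistent property of the database file.
    conn.execute("PRAGMA journal_mode=WAL")
    # busy_timeout is per connection, so set it on every connection.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bots.db")
        writer = open_db(path)   # e.g. a bot updating positions
        reader = open_db(path)   # e.g. the dashboard API
        writer.execute("CREATE TABLE positions (bot TEXT, state TEXT)")
        writer.execute("INSERT INTO positions VALUES ('sol-15m', 'OPEN')")
        writer.commit()
        rows = reader.execute("SELECT bot, state FROM positions").fetchall()
        mode = reader.execute("PRAGMA journal_mode").fetchone()[0]
        timeout = reader.execute("PRAGMA busy_timeout").fetchone()[0]
        reader.close()
        writer.close()
        return rows, mode, timeout
```

demo() returns the row written by one connection as seen by the other, along with the journal mode and timeout the reader connection reports.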
Alembic manages schema migrations. Every database change goes through a migration script that can be applied forward or rolled back. This is critical for a system managing capital — you cannot afford schema inconsistencies from ad-hoc database changes.
Execution Layer
Our execution layer is abstracted behind a BaseExecutor interface. The PaperTrader implements this interface with simulated fills (2-10 basis points slippage, 0.01-0.02 percent fees, 1-5 millisecond latency simulation). The LiveTrader implements the same interface with real exchange orders via ccxt.
This abstraction means the bot code is identical for paper and live trading. The strategy generates a signal, risk checks validate it, and the executor handles the order. Switching from paper to live is a configuration change, not a code change.
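A minimal sketch of the executor abstraction. BaseExecutor and PaperTrader are named in the text, but the execute signature, the Fill type, and make_executor are assumptions for illustration. Slippage and fee ranges follow the figures above; latency simulation and the live executor are omitted.

```python
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Fill:
    symbol: str
    side: str
    qty: float
    price: float
    fee: float


class BaseExecutor(ABC):
    @abstractmethod
    def execute(self, symbol: str, side: str, qty: float, ref_price: float) -> Fill:
        """Place an order and return the resulting fill."""


class PaperTrader(BaseExecutor):
    def __init__(self, rng=None):
        self._rng = rng or random.Random()

    def execute(self, symbol, side, qty, ref_price):
        slippage = self._rng.uniform(2, 10) / 10_000   # 2-10 basis points
        fee_rate = self._rng.uniform(0.0001, 0.0002)   # 0.01-0.02 percent
        # Slippage always works against the trader: buys fill higher,
        # sells fill lower than the reference price.
        direction = 1 if side == "buy" else -1
        price = ref_price * (1 + direction * slippage)
        return Fill(symbol, side, qty, price, fee=price * qty * fee_rate)


def make_executor(mode: str) -> BaseExecutor:
    # The paper/live switch lives in configuration, not in bot code.
    if mode == "paper":
        return PaperTrader()
    raise NotImplementedError("live executor is not part of this sketch")
```

Bot code depends only on BaseExecutor.execute, so it cannot tell which implementation it was handed.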
Monitoring and Alerting
A production bot needs observability. Our system publishes events on an async event bus for every significant action: signal generated, order placed, order filled, position opened, position closed, circuit breaker fired, risk decision rejected. Subscribers include the SSE endpoint (real-time dashboard updates), Telegram notification service, and the risk event persister (database audit trail).
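An async event bus can be sketched in a few lines. EventBus, its method names, and the event-type strings are hypothetical; the one design choice shown is that a failing subscriber (say, a notification service that is down) does not stop the others from receiving the event.

```python
import asyncio
from collections import defaultdict


class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)   # event type -> async handlers

    def subscribe(self, event_type: str, handler) -> None:
        self._subscribers[event_type].append(handler)

    async def publish(self, event_type: str, payload: dict) -> list:
        """Deliver payload to every subscriber; return any handler errors."""
        handlers = self._subscribers.get(event_type, [])
        if not handlers:
            return []
        # return_exceptions=True collects failures instead of raising,
        # so every handler runs even if one of them fails.
        results = await asyncio.gather(
            *(handler(payload) for handler in handlers),
            return_exceptions=True,
        )
        return [r for r in results if isinstance(r, Exception)]
```

The returned list of exceptions gives the publisher something to log without making order handling depend on subscriber health.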
Structlog provides structured JSON logging for every tick, every risk check, and every order. When something goes wrong at 3 AM, the structured logs let you reconstruct exactly what happened: which bot, which signal, which risk check, which order, and which exchange response.
The Dead Man's Switch
The final component that no tutorial mentions: automated shutdown when the operator is unavailable. Our dead man's switch requires a Telegram check-in every 24 hours. Without check-in, the system escalates through warning (83 percent of timeout, about 20 hours in), critical (96 percent, about 23 hours in), and triggered (100 percent, all bots stop). The latch mechanism requires an explicit reset to resume, preventing automated scripts from masking genuine operator absence.
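The escalation and latch logic can be sketched as follows. The thresholds are the ones quoted above; the class, its method names, and the choice to pass timestamps in explicitly (which keeps the logic testable) are assumptions.

```python
class DeadMansSwitch:
    def __init__(self, timeout_s: float = 24 * 3600,
                 warning: float = 0.83, critical: float = 0.96):
        self.timeout_s = timeout_s
        self.warning = warning
        self.critical = critical
        self.last_checkin = 0.0
        self.triggered = False   # the latch

    def check_in(self, now: float) -> bool:
        """Record an operator check-in; ignored once the latch has fired."""
        if self.triggered:
            return False
        self.last_checkin = now
        return True

    def reset(self, now: float) -> None:
        """Explicit operator action required to resume after a trigger."""
        self.triggered = False
        self.last_checkin = now

    def status(self, now: float) -> str:
        if self.triggered:
            return "triggered"
        elapsed_fraction = (now - self.last_checkin) / self.timeout_s
        if elapsed_fraction >= 1.0:
            self.triggered = True   # latch: stays set until reset()
            return "triggered"
        if elapsed_fraction >= self.critical:
            return "critical"
        if elapsed_fraction >= self.warning:
            return "warning"
        return "ok"
```

Once status() has returned "triggered", an ordinary check_in() no longer clears it; only reset() does.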
What 50 Lines Actually Gets You
A 50-line Python script that fetches candles and trades on RSI is a proof of concept. It demonstrates that you can connect to an exchange and place orders. It does not handle errors, manage state, enforce risk, recover from crashes, monitor performance, or protect against your own absence.
The real architecture for a production trading bot is approximately 15,000 lines of Python across data fetching, strategy logic, risk management, execution, monitoring, database persistence, crash recovery, and safety mechanisms. The strategy logic (the part tutorials focus on) is less than 10 percent of the total codebase. The remaining 90-plus percent is the infrastructure that makes the strategy logic safe to run with real capital.