
Crash Recovery: The State Machine Approach

QFQuantForge Team·April 3, 2026·9 min read

Trading bots crash. Servers restart. Network connections drop mid-request. The question is not whether these failures will happen but what state the system is in when they do. A bot that crashes while submitting an order creates an ambiguous state: did the exchange receive the order? Was it filled? Is there an open position that the bot does not know about? Without a systematic approach to crash recovery, these ambiguities accumulate until the local state diverges so far from reality that manual intervention becomes the only option.

Our approach uses two mechanisms: a position state machine that constrains the possible states a position can be in, and a reconciliation protocol that resolves ambiguity by querying the exchange as the source of truth.

The Position State Machine

Every position in the system exists in exactly one of five states: FLAT, OPENING, OPEN, CLOSING, or CLOSED. Transitions between states are strictly defined and enforced.

FLAT is the initial state. No position exists. A strategy signal triggers an order, which transitions the state to OPENING. The OPENING state means an order has been submitted to the exchange but no fill has been confirmed. Once the exchange confirms a fill, the state transitions to OPEN. A position in OPEN state has a confirmed entry and is being marked to market every tick.

When a strategy generates a close signal, a stop-loss triggers, or a take-profit triggers, an exit order is submitted and the state transitions to CLOSING. Once the exit order fills, the state transitions to CLOSED. The PnL is calculated, and the position record is finalized.

The state machine prevents invalid transitions. You cannot go from FLAT to CLOSING (there is nothing to close). You cannot go from OPEN to OPENING (you cannot open a new position while one is already open). You cannot go from CLOSING to OPENING (you must fully close before reopening). These constraints are enforced at the code level, not by convention.
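The transition rules above can be sketched as a small enum plus an allow-list. This is an illustrative sketch, not the production code: the names `PositionState`, `ALLOWED`, and `Position` are assumptions, and the recovery-path transitions (OPENING back to FLAT, CLOSING back to OPEN) are included because the reconciliation protocol described later needs them.

```python
# Minimal sketch of the five-state position machine. Names are illustrative.
from enum import Enum, auto


class PositionState(Enum):
    FLAT = auto()
    OPENING = auto()
    OPEN = auto()
    CLOSING = auto()
    CLOSED = auto()


# The only legal transitions; anything else raises.
ALLOWED = {
    PositionState.FLAT: {PositionState.OPENING},
    PositionState.OPENING: {PositionState.OPEN, PositionState.FLAT},    # fill, or reject/expire
    PositionState.OPEN: {PositionState.CLOSING},
    PositionState.CLOSING: {PositionState.CLOSED, PositionState.OPEN},  # fill, or reject/expire
    PositionState.CLOSED: set(),  # terminal
}


class Position:
    def __init__(self) -> None:
        self.state = PositionState.FLAT

    def transition(self, new_state: PositionState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
```

Because the allow-list is data rather than scattered `if` checks, the constraints are enforced in one place and are trivial to audit against the prose description.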

Where Crashes Create Ambiguity

The dangerous crash windows are the in-flight states that follow the FLAT-to-OPENING and OPEN-to-CLOSING transitions: an order has been submitted, but its fill has not yet been confirmed.

If the server crashes after submitting an entry order but before receiving the fill confirmation, the local state shows OPENING. The order may or may not have been filled on the exchange. If it was filled, there is an open position that the bot does not know about. If it was not filled, there is a pending order on the exchange that might fill later without the bot's knowledge.

If the server crashes after submitting an exit order but before receiving the fill confirmation, the local state shows CLOSING. The exit may or may not have been executed. If it was executed, the position is closed but the bot thinks it is still closing. If it was not executed, the position is still open and the exit order might fill later at a different price than expected.

Crashes during FLAT, OPEN, or CLOSED states are benign. FLAT means no exposure. OPEN means a confirmed position exists and the bot just needs to resume monitoring. CLOSED means the position is finalized. The recovery protocol for these states is simply to reload the state from the database and continue.

The Reconciliation Protocol

On every server startup, the state recovery module executes a reconciliation protocol for each bot that was in a non-terminal state (OPENING, OPEN, or CLOSING) when the server stopped.

Step one: query running bots. The module loads all bots from the database whose status was RUNNING at the time of the crash. These are the bots that need state recovery.

Step two: query the exchange. For each affected bot, the module queries the exchange API for two things: all open orders associated with the bot's symbol, and all recent fills (trades executed in the last 24 hours) for the bot's symbol.

Step three: reconcile OPENING positions. If a position is in OPENING state, the module checks whether the entry order was filled. If a matching fill exists in the exchange's trade history, the position is transitioned to OPEN with the fill price and quantity. If no fill exists and the order is still open on the exchange, the position remains in OPENING state and the bot resumes monitoring the pending order. If no fill exists and the order is not on the exchange (it was rejected or expired), the position is transitioned back to FLAT.

Step four: reconcile CLOSING positions. If a position is in CLOSING state, the module checks whether the exit order was filled. If a matching fill exists, the position is transitioned to CLOSED with the fill details and PnL is calculated. If the exit order is still pending, the position remains in CLOSING state and the bot resumes monitoring. If the exit order is gone (rejected or expired), the position is transitioned back to OPEN and the bot will re-evaluate whether to close on the next tick.
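Steps three and four share the same shape, so they can be condensed into one decision function. This is a hedged sketch: `fills` and `open_orders` stand in for the exchange API responses, and keying everything by a single `order_id` is an assumption for illustration.

```python
# Sketch of reconciliation steps three and four. `fills` maps order id -> fill
# details from the exchange's trade history; `open_orders` is the set of order
# ids still open on the exchange. Both are hypothetical stand-ins.
def reconcile(state: str, order_id: str, fills: dict, open_orders: set) -> str:
    """Return the post-recovery state for a position found after a crash."""
    if state == "OPENING":
        if order_id in fills:        # entry filled while we were down
            return "OPEN"
        if order_id in open_orders:  # still pending: resume monitoring
            return "OPENING"
        return "FLAT"                # rejected or expired: nothing happened
    if state == "CLOSING":
        if order_id in fills:        # exit filled: finalize and compute PnL
            return "CLOSED"
        if order_id in open_orders:  # still pending: resume monitoring
            return "CLOSING"
        return "OPEN"                # exit vanished: re-evaluate next tick
    return state                     # FLAT / OPEN / CLOSED are benign
```

Note that every branch returns a state the state machine permits: the only "backward" moves, OPENING to FLAT and CLOSING to OPEN, are exactly the recovery transitions described above.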

Step five: detect orphaned orders. The module checks for any open orders on the exchange that do not correspond to a known position in the local database. These orphans can result from orders placed just before a crash where the local record was not persisted. Orphaned orders are canceled automatically to prevent unexpected fills with no local state tracking.
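Step five reduces to a set difference between the exchange's view and the local database, followed by a cancel loop. A minimal sketch, assuming `cancel` is a caller-supplied stand-in for the exchange's cancel-order call:

```python
# Step five: cancel exchange orders that have no local position record.
# `cancel` is a hypothetical callable wrapping the exchange API.
def cancel_orphans(exchange_order_ids, local_order_ids, cancel):
    """Cancel and return the set of orders the local database does not know about."""
    orphans = set(exchange_order_ids) - set(local_order_ids)
    for oid in orphans:
        cancel(oid)  # prevent unexpected fills with no local state tracking
    return orphans
```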

Step six: publish recovery event. After reconciliation completes, the module publishes a recovery event on the event bus with counts: how many positions were reconciled, how many orders were canceled, how many bots were successfully restarted, and how many bots could not be recovered (requiring manual investigation). This event streams to the dashboard and to Telegram, giving the operator immediate visibility into the recovery outcome.

Why State Machines Matter

The state machine approach is more rigorous than the common alternative of simply checking whether a position is open or closed. Binary state (open versus closed) cannot represent the in-between states that exist during order execution. Without OPENING and CLOSING as explicit states, the system cannot distinguish between a position that is actively being opened (order submitted, fill pending) and a position that is confirmed open (fill received).

This distinction is critical for crash recovery. An OPENING position requires checking whether the order was filled. An OPEN position does not. A CLOSING position requires checking whether the exit order was filled. A CLOSED position does not. Without the intermediate states, the recovery protocol would need to check every position against the exchange, which is slower and more error-prone.

The state machine also prevents the bot from taking conflicting actions. If a position is in OPENING state, the bot does not generate new entry signals for that symbol. If a position is in CLOSING state, the bot does not generate new close signals. These guardrails prevent the accumulation of duplicate orders that occurs when a bot does not track in-flight order state.
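The guardrails in the previous paragraph amount to a gate in front of the signal handler. A sketch, with the function name and string states chosen for illustration:

```python
# Guardrail sketch: entry signals act only from FLAT, exit signals only from
# OPEN. In-flight states (OPENING, CLOSING) suppress both, which prevents
# duplicate orders while a prior order is still pending.
def should_act_on_signal(signal: str, position_state: str) -> bool:
    if signal == "entry":
        return position_state == "FLAT"
    if signal == "exit":
        return position_state == "OPEN"
    return False
```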

Idempotent Operations

Every state transition in the system is designed to be idempotent: applying the same transition twice produces the same result as applying it once. If the fill notification arrives twice (which can happen with WebSocket reconnections), the second notification is detected as a duplicate and ignored. If the recovery protocol runs twice (for example, if the server restarts during recovery), the second run detects that positions have already been reconciled and takes no action.

Idempotency is enforced through database constraints. Each fill record has a unique identifier from the exchange. Inserting a duplicate fill violates the unique constraint and is silently ignored. Each state transition is recorded with a timestamp. Applying a transition that has already been applied (same state, same timestamp) is a no-op.
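The unique-constraint pattern can be shown with sqlite3 from the standard library. The schema here is illustrative, not the production one; the point is that the exchange's fill identifier is the primary key, so a duplicate insert is a no-op rather than an error.

```python
# Idempotent fill recording via a unique constraint, sketched with sqlite3.
# Schema and names are illustrative; production keys on the exchange fill id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fills (fill_id TEXT PRIMARY KEY, qty REAL, price REAL)")


def record_fill(fill_id: str, qty: float, price: float) -> bool:
    """Insert a fill; return False if it was a duplicate and was ignored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO fills VALUES (?, ?, ?)", (fill_id, qty, price)
    )
    return cur.rowcount == 1
```

With this in place, a WebSocket reconnection that replays a fill notification simply hits the constraint: the second `record_fill` call returns False and the database is unchanged.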

Practical Considerations

The reconciliation protocol depends on the exchange API being available during recovery. If the server crashes and the exchange is also down, recovery must wait until the exchange API returns. Our startup sequence retries the exchange connection with exponential backoff (1 second, 2 seconds, 4 seconds, up to 60 seconds) before giving up and starting the server without recovered bots. Those bots remain in their crash state and are flagged for manual review.
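The retry schedule described above (1s, 2s, 4s, capped at 60s) is a standard exponential backoff. A sketch, with `connect` supplied by the caller and the sleep function injected so the loop is testable; the attempt limit is an assumption:

```python
# Exponential backoff for the exchange connection: 1s, 2s, 4s, ... capped at
# 60s. `connect` is a caller-supplied callable; `sleep` is injectable for tests.
import time


def connect_with_backoff(connect, max_delay=60, max_attempts=8, sleep=time.sleep):
    delay = 1
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up; affected bots stay flagged for manual review
            sleep(delay)
            delay = min(delay * 2, max_delay)
```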

The 24-hour lookback window for recent fills is a trade-off. A longer window catches fills from older pending orders but costs more API requests and processing time. A shorter window is faster but might miss fills from orders that took a long time to execute. Twenty-four hours covers virtually all realistic scenarios: an order placed before a crash that takes more than 24 hours to fill was likely a limit order in a very illiquid market, which is not a scenario we encounter with our current symbol universe.

The recovery module logs every action it takes at the INFO level with structured logging. After a crash, the operator can read the recovery log to understand exactly what happened: which positions were reconciled, which orders were canceled, and which bots were restarted. This audit trail is essential for maintaining confidence that the system's local state accurately reflects reality.