Monitoring 45 Live Bots: Dashboards, Alerts, and What to Watch

QuantForge Team · April 3, 2026 · 9 min read

Building a trading bot is the easy part. Keeping 45 of them running reliably across six strategy types, 25 symbols, and three timeframes is the hard part. Most tutorials stop at the backtest results. They show a beautiful equity curve and imply that deployment is a solved problem. In practice, deployment is where the real engineering begins: monitoring, alerting, dashboarding, and building the operational discipline to catch problems before they become losses.

The Dashboard Architecture

Our React dashboard provides three levels of monitoring: the bot manager page for individual bot status, the risk dashboard for portfolio-level health, and the market data page for data quality verification.

The bot manager page shows all 45 bots in a table with real-time status: running, stopped, paused, or errored. Each bot row expands into a detail panel showing recent trades, open positions, cumulative PnL, and the Monte Carlo risk profile for that bot's strategy. The most important column in the table is the status indicator. A green dot means the bot is running and has ticked successfully within its expected timeframe. A red dot means the bot has thrown an error or has not ticked when it should have. A yellow dot means the bot is paused, either manually or by the risk framework.
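
The status rule above can be sketched in a few lines. This is a minimal illustration, not the production logic: the function name, the grace multiplier, and the choice to show stopped bots as red are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def status_dot(state: str, last_tick: datetime, expected_interval: timedelta,
               now: datetime, grace: float = 1.5) -> str:
    """Map a bot's state and last successful tick to a dot color."""
    if state == "paused":
        return "yellow"  # paused manually or by the risk framework
    if state != "running":
        return "red"  # assumption: errored and stopped bots both show red
    # A running bot is healthy only if it ticked recently enough.
    # The grace multiplier is an assumed tolerance for scheduling jitter.
    overdue = (now - last_tick) > expected_interval * grace
    return "red" if overdue else "green"


now = datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc)
print(status_dot("running", now - timedelta(minutes=30), timedelta(hours=1), now))
```

Tying the green state to tick freshness rather than to the stored state alone is what lets the table catch a bot that is nominally running but has silently stopped ticking.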

Data flows from the FastAPI backend to the React frontend via Server-Sent Events (SSE). When a bot places an order, fills a trade, or hits a risk limit, the event is published on the event bus and streamed to the dashboard within seconds. These event updates involve no polling delay: the SSE connection is persistent, so they are effectively real-time as long as the browser tab is open.
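
For readers unfamiliar with SSE, each message on the wire is a small block of text fields terminated by a blank line. The sketch below shows only that framing; the function name, event names, and payload fields are hypothetical.

```python
import json
from typing import Optional


def format_sse(event_type: str, payload: dict, event_id: Optional[int] = None) -> str:
    """Serialize one event into the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")  # lets the browser resume after a reconnect
    lines.append(f"event: {event_type}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append(f"data: {json.dumps(payload, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


frame = format_sse("order_filled", {"bot_id": 7, "symbol": "BTCUSDT"}, event_id=1)
print(frame, end="")
```

In a FastAPI app, frames like this would typically be yielded from a generator wrapped in a StreamingResponse with the media type text/event-stream, and consumed in the browser with EventSource.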

The Risk Dashboard

The risk dashboard is the primary monitoring interface. It has three tabs that serve three different purposes.

The Live Risk tab displays real-time portfolio health. At the top, a health banner shows the overall status: green for healthy, yellow for caution, red for critical. Below the banner, three gauge charts show portfolio drawdown (current drawdown versus the 15 percent halt threshold), total exposure (current notional versus the 50 percent cap), and daily loss (current daily loss versus the daily limit). These gauges update every 10 seconds by polling the risk status API endpoint.
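
One way to derive the banner from the three gauges is to express each gauge as a fraction of its limit and let the worst one decide the color. The sketch below assumes that rule; the 70 percent caution level and the 3 percent daily loss limit are illustrative values, not our configured ones.

```python
from typing import Sequence

DRAWDOWN_HALT = 0.15     # 15 percent halt threshold
EXPOSURE_CAP = 0.50      # 50 percent exposure cap
DAILY_LOSS_LIMIT = 0.03  # assumed value, for illustration only


def utilization(current: float, limit: float) -> float:
    """Fraction of a risk limit currently used (1.0 means the limit is hit)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(0.0, current / limit)


def health_banner(utilizations: Sequence[float], caution_at: float = 0.7) -> str:
    """Color the banner by the most heavily used limit."""
    worst = max(utilizations)
    if worst >= 1.0:
        return "red"
    if worst >= caution_at:  # assumed caution level
        return "yellow"
    return "green"


gauges = [
    utilization(0.06, DRAWDOWN_HALT),
    utilization(0.20, EXPOSURE_CAP),
    utilization(0.01, DAILY_LOSS_LIMIT),
]
print(health_banner(gauges))
```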

Below the gauges, a per-bot risk table shows each bot's current drawdown, position exposure, consecutive loss count, and risk status. Bots that are approaching risk limits are highlighted in amber. Bots that have breached limits and been paused are highlighted in red. A 30-day drawdown chart at the bottom shows the equity curve with drawdown shading, powered by hourly equity snapshots stored in the database.

The Strategy Risk tab shows risk profiles derived from backtest data. Strategy scorecards display the Sharpe ratio, Sortino ratio, Kelly fraction, and a composite risk score for each strategy type. An asset correlation heatmap shows the 30-day rolling correlation matrix across all symbols. A Monte Carlo panel shows the VaR (Value at Risk) and CVaR (Conditional Value at Risk) for any selected strategy-symbol combination, calculated from historical trade PnL distributions.
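
To make the two tail metrics concrete, here is a minimal sketch that computes them directly from a list of trade PnLs. It is the plain empirical (historical) estimate under one common convention, not a Monte Carlo simulation, and the function name and sample data are hypothetical.

```python
from typing import Sequence, Tuple


def var_cvar(pnls: Sequence[float], confidence: float = 0.95) -> Tuple[float, float]:
    """Empirical VaR and CVaR, both reported as positive loss amounts."""
    if not pnls:
        raise ValueError("need at least one trade")
    losses = sorted((-p for p in pnls), reverse=True)  # largest loss first
    # Number of trades in the worst (1 - confidence) tail, at least one.
    tail_n = max(1, int(round(len(losses) * (1 - confidence))))
    tail = losses[:tail_n]
    var = tail[-1]                 # smallest loss that still falls in the tail
    cvar = sum(tail) / len(tail)   # average loss across the tail
    return var, cvar


sample = [10.0] * 18 + [-50.0, -30.0]  # 18 small wins, 2 losses
print(var_cvar(sample, confidence=0.90))
```

CVaR is always at least as large as VaR because it averages everything beyond the VaR cutoff, which is why it is the more conservative number to size positions against.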

The Risk Log tab is an audit trail of every risk event. When a circuit breaker fires, the event is logged with the bot ID, the trigger reason, the severity level, and the full details. When a trade is rejected by the risk framework, that rejection is logged with the signal that was blocked and the reason it was blocked. The log is paginated, filterable by event type and severity, and each row expands to show the full details JSON.

Telegram Alerts

The dashboard is useful when you are actively watching it. Telegram alerts are useful when you are not. Our Telegram bot sends alerts for three categories of events.

Circuit breaker fires are the highest priority. When a bot hits its maximum drawdown limit, exceeds the daily loss cap, or triggers the portfolio exposure limit, a Telegram message fires immediately with the bot name, the symbol, the trigger reason, and the current values versus the limits. These alerts demand immediate attention because they indicate a bot has been forcibly stopped.

Bot errors are medium priority. When a bot's tick loop throws an exception, whether from a network error, an exchange API timeout, or an unexpected data format, the error is caught, the bot is stopped, and a Telegram alert fires with the error message. Most bot errors are transient (exchange API returned a 500) and resolve on the next tick after a restart. Persistent errors (like an API key expiration) require manual intervention.

Trade notifications are low priority but useful for maintaining awareness. Each order placed and each fill received generates a brief Telegram message with the bot, symbol, direction, size, and price. At 45 bots, these messages can be frequent, so we filter to only send notifications for trades above a minimum notional threshold.
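
The notional filter is simple to sketch. The class, function, message layout, and threshold below are hypothetical; only the message fields (bot, symbol, direction, size, price) come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fill:
    bot: str
    symbol: str
    direction: str
    size: float
    price: float


def trade_message(fill: Fill, min_notional: float) -> Optional[str]:
    """Return the alert text, or None when the trade is too small to report."""
    notional = abs(fill.size) * fill.price
    if notional < min_notional:
        return None  # filtered out to keep the channel readable
    return f"{fill.bot} {fill.symbol} {fill.direction} {fill.size:g} @ {fill.price:g}"


print(trade_message(Fill("bot-12", "BTCUSDT", "long", 0.01, 60000.0), min_notional=500))
```

Returning None for filtered trades keeps the decision in one place, so the sending code only has to skip empty messages.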

Hourly Equity Snapshots

The drawdown chart on the risk dashboard depends on equity snapshots. Every hour, a scheduled job iterates through all running bots and records an equity snapshot for each. The equity formula is: current_capital (cash) plus entry_notional plus unrealized_pnl.

This formula matters because our bots trade on margin. When a bot opens a position, the entry notional is deducted from current_capital (cash). If we only tracked current_capital, the equity would appear to drop every time a position opened, creating false drawdown readings. By adding back the entry_notional and the unrealized PnL, the equity snapshot reflects the true portfolio value at that moment.
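
A small worked example shows why the add-back matters. The field names follow the formula above; the Position class and the numbers are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Position:
    entry_notional: float
    unrealized_pnl: float


def equity(current_capital: float, positions: Sequence[Position]) -> float:
    """Equity = cash + entry notional of open positions + unrealized PnL."""
    return current_capital + sum(p.entry_notional + p.unrealized_pnl for p in positions)


# A bot starts with 10,000 and opens a position with 2,000 entry notional,
# so its cash balance drops to 8,000 even though nothing has been lost.
cash_after_entry = 10_000.0 - 2_000.0
snapshot = equity(cash_after_entry, [Position(entry_notional=2_000.0, unrealized_pnl=50.0)])
print(cash_after_entry, snapshot)
```

Tracking cash alone would record 8,000 and report a 20 percent drawdown that never happened. The full formula records 10,050, which reflects the 50 of unrealized profit.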

A portfolio-level snapshot (with bot_id set to null) aggregates all bots into a single equity figure. This feeds the portfolio drawdown gauge on the Live Risk tab. The per-bot snapshots feed the individual drawdown calculations visible in the bot detail panels.
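
As a rough sketch of how these snapshots turn into a drawdown figure, assume each snapshot is a dictionary and drawdown is measured from the highest equity in the series. The function names and dictionary layout are hypothetical.

```python
from typing import Dict, Sequence


def portfolio_snapshot(bot_equities: Dict[int, float], ts: str) -> dict:
    """Aggregate per-bot equity into one portfolio row (bot_id is None)."""
    return {"bot_id": None, "ts": ts, "equity": sum(bot_equities.values())}


def current_drawdown(equity_series: Sequence[float]) -> float:
    """Fractional drop of the latest equity from the peak of the series."""
    peak = max(equity_series)
    if peak <= 0:
        return 0.0
    return (peak - equity_series[-1]) / peak


row = portfolio_snapshot({1: 1_000.0, 2: 1_500.0}, ts="2026-04-03T12:00:00Z")
print(row["equity"], current_drawdown([100.0, 110.0, 99.0]))
```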

What to Check Daily

Operating 45 bots requires a daily routine. Here is what we check every morning.

First, the bot manager page. Are all 45 bots showing green status? Any red or yellow indicators get investigated immediately. A stopped bot might have encountered a transient error overnight and need a restart. A paused bot has hit a risk limit and needs evaluation: is the pause justified, or was the limit triggered by normal volatility?

Second, the risk dashboard Live Risk tab. What is the current portfolio drawdown? What is the total exposure? Are any bots near their individual risk limits? The gauges provide an instant health check. If portfolio drawdown is above 5 percent, we dig into which bots are contributing and whether the drawdown is correlated (market-wide selloff affecting all bots) or idiosyncratic (one strategy failing).

Third, the risk log. Did any circuit breakers fire overnight? Were there any risk rejections? A single rejection is normal: the risk framework is doing its job. A cluster of rejections on the same bot or the same symbol suggests a market condition that is hostile to that strategy, and we may need to manually pause the bot or adjust its parameters.
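
Spotting a cluster does not have to be done by eye. A minimal sketch, assuming risk log rows are dictionaries with event_type, bot_id, and symbol keys, and treating three or more rejections as a cluster (both are assumptions):

```python
from collections import Counter
from typing import Dict, List


def rejection_clusters(events: List[dict], min_count: int = 3) -> Dict[str, list]:
    """Return the bots and symbols with at least min_count trade rejections."""
    rejected = [e for e in events if e["event_type"] == "trade_rejected"]
    by_bot = Counter(e["bot_id"] for e in rejected)
    by_symbol = Counter(e["symbol"] for e in rejected)
    return {
        "bots": sorted(b for b, n in by_bot.items() if n >= min_count),
        "symbols": sorted(s for s, n in by_symbol.items() if n >= min_count),
    }


overnight = (
    [{"event_type": "trade_rejected", "bot_id": 7, "symbol": "ETHUSDT"}] * 3
    + [{"event_type": "trade_rejected", "bot_id": 2, "symbol": "BTCUSDT"}]
    + [{"event_type": "circuit_breaker", "bot_id": 7, "symbol": "ETHUSDT"}]
)
print(rejection_clusters(overnight))
```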

Fourth, data quality. The market data page has a coverage tab that shows data completeness across all symbols and timeframes. Missing candles or gaps in funding rate data can cause strategies to generate incorrect signals. We verify that the data pipeline has been running cleanly and that no symbol has fallen behind in its data sync.
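
A coverage check of this kind boils down to looking for jumps in candle timestamps. The sketch below assumes sorted candle open times in epoch seconds and a fixed interval; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def find_gaps(timestamps: Sequence[int], interval_s: int) -> List[Tuple[int, int]]:
    """Return (first_missing_timestamp, missing_count) for each gap."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        missing = (cur - prev) // interval_s - 1
        if missing > 0:
            gaps.append((prev + interval_s, missing))
    return gaps


# Hourly candles with two missing between 7200 and 18000.
print(find_gaps([0, 3600, 7200, 18000], interval_s=3600))
```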

Automated Recovery

Not everything requires manual intervention. The bot engine includes automated recovery for common failure modes. When a bot's tick loop fails due to a network error or exchange timeout, the error is logged but the scheduler continues to call the bot's tick method on its normal schedule. Most transient errors self-resolve on the next tick when the network request succeeds.
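
The idea of tolerating transient errors inside the tick loop can be sketched as a thin wrapper around the tick call. The wrapper name, the logger name, the set of exception types treated as transient, and the demo bot are all assumptions for illustration.

```python
import logging

logger = logging.getLogger("bot_engine")

# Assumed set of errors worth retrying on the next scheduled tick.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def safe_tick(bot) -> bool:
    """Run one tick. Log transient failures instead of propagating them."""
    try:
        bot.tick()
        return True
    except TRANSIENT_ERRORS as exc:
        logger.warning("tick failed for %s: %s", bot.name, exc)
        return False  # the scheduler simply calls again on the next interval


class FlakyBot:
    """Demo bot whose first tick times out and whose later ticks succeed."""
    name = "demo-bot"

    def __init__(self) -> None:
        self.calls = 0

    def tick(self) -> None:
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("exchange timeout")


bot = FlakyBot()
results = [safe_tick(bot) for _ in range(2)]
print(results)
```

Catching only a narrow set of exception types matters here: anything outside that set still propagates, so a genuine bug is not silently retried forever.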

On server restart, the state recovery module queries all bots that were in the running state before the shutdown. For each one, it checks the exchange API for pending orders and for trades that filled during the downtime. It reconciles the local state with the exchange state, cancels any orphaned orders, and resumes the bots. The recovery process then publishes a recovery event with counts of reconciled fills, canceled orders, and successfully restarted bots.
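
The reconciliation step can be sketched as a comparison between what the bot recorded locally and what the exchange reports. In this sketch an orphaned order is assumed to mean an order that is open on the exchange but unknown locally; the function name, the dictionary shapes, and the cancel_order callback are hypothetical.

```python
from typing import Callable, Dict, List, Set


def reconcile(local_fill_ids: Set[str], local_order_ids: Set[str],
              exchange_fills: List[dict], exchange_open_orders: List[dict],
              cancel_order: Callable[[str], None]) -> Dict[str, int]:
    """Compare local and exchange state and return counts for the recovery event."""
    # Fills the exchange knows about that were never recorded locally.
    new_fills = [f for f in exchange_fills if f["id"] not in local_fill_ids]
    # Assumption: an orphaned order is open on the exchange but unknown locally.
    orphaned = [o for o in exchange_open_orders if o["id"] not in local_order_ids]
    for order in orphaned:
        cancel_order(order["id"])
    return {"reconciled_fills": len(new_fills), "canceled_orders": len(orphaned)}


canceled: List[str] = []
summary = reconcile(
    local_fill_ids={"f1"},
    local_order_ids={"o1"},
    exchange_fills=[{"id": "f1"}, {"id": "f2"}],
    exchange_open_orders=[{"id": "o1"}, {"id": "o9"}],
    cancel_order=canceled.append,
)
print(summary, canceled)
```

Passing the cancel function in as a parameter keeps the comparison logic testable without touching a real exchange.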

The server itself is managed by a launchd service on macOS that automatically restarts the process if it crashes. Between automatic process restart, state recovery on startup, and transient error tolerance in the tick loop, the system handles most failure modes without operator intervention. The monitoring infrastructure exists for the failures that automation cannot resolve: persistent API issues, risk limit breaches, and strategy performance degradation.