Python & Code

Scheduling 45 Bots with APScheduler: Lessons Learned

QFQuantForge Team·April 3, 2026·9 min read

Running one trading bot on a timer is trivial: set an interval, call a function, repeat. Running 45 bots on different timeframes, on a single machine, within a single async event loop, without any bot's execution interfering with another's, requires more thought. APScheduler (Advanced Python Scheduler) is the library that handles this for us, and getting the configuration right took several iterations.

Here is what our scheduling layer looks like in production, the bugs we hit, and the configuration that works.

APScheduler Basics

APScheduler provides several scheduler backends. We use AsyncIOScheduler, which integrates with Python's asyncio event loop. This is essential because our bots are async coroutines. A threading-based scheduler would create threads that fight with the event loop for control, producing race conditions and deadlocks.

The scheduler is initialized during the FastAPI lifespan and shut down when the server stops.

from contextlib import asynccontextmanager
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from fastapi import FastAPI

scheduler = AsyncIOScheduler()

@asynccontextmanager
async def lifespan(app: FastAPI):
    scheduler.start()
    yield  # ... app runs ...
    scheduler.shutdown(wait=True)

app = FastAPI(lifespan=lifespan)

The wait=True on shutdown lets currently executing jobs finish before the scheduler terminates. Without it, a bot mid-tick could be killed during order execution, potentially leaving a position in an inconsistent state.

Interval Triggers per Timeframe

Each bot gets its own APScheduler job with an interval trigger matched to its strategy's timeframe.

For 15-minute strategies (mean_reversion_bb, momentum_rsi_macd), the interval is 900 seconds. For 1-hour strategies (leverage_composite), it is 3,600 seconds. For 4-hour strategies (momentum_rsi_macd_4h, correlation_regime, nupl_cycle_filter, stablecoin_supply_momentum), it is 14,400 seconds.
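A small lookup helper keeps this mapping in one place. This is a minimal sketch of the timeframe_to_seconds helper referenced in the registration code; the string keys ("15m", "1h", "4h") are an assumption about how timeframes are labeled.

```python
# Interval values come from the timeframes above; raising on an unknown
# key fails fast instead of silently scheduling a bot at a wrong cadence.
TIMEFRAME_SECONDS = {"15m": 900, "1h": 3_600, "4h": 14_400}

def timeframe_to_seconds(timeframe: str) -> int:
    try:
        return TIMEFRAME_SECONDS[timeframe]
    except KeyError:
        raise ValueError(f"unknown timeframe: {timeframe!r}")
```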

scheduler.add_job(
    bot.tick,
    trigger="interval",
    seconds=timeframe_to_seconds(bot.timeframe),
    id=f"bot-{bot.id}",
    replace_existing=True,
)

The id parameter matters. Without it, APScheduler generates a random ID, so with a persistent job store a server restart would leave the old jobs in place while adding new duplicates. With a deterministic ID derived from the bot's database ID, replace_existing=True replaces the old job instead of duplicating it.

In our current deployment: 13 mean_reversion_bb bots tick every 900 seconds, 5 momentum_rsi_macd bots tick every 900 seconds, 3 leverage_composite bots tick every 3,600 seconds, 6 momentum_rsi_macd_4h bots tick every 14,400 seconds, 6 correlation_regime bots tick every 14,400 seconds, 7 nupl_cycle_filter bots tick every 14,400 seconds, and 5 stablecoin_supply_momentum bots tick every 14,400 seconds. That is 45 independent scheduler jobs, each with its own cadence.

Staggered Startup

When the API server starts (or restarts), all 45 bots are initialized and their scheduler jobs are registered. If we registered all 45 jobs at the same instant, their first tick would fire simultaneously. That means 45 concurrent exchange API calls, 45 concurrent strategy computations, and 45 concurrent database writes, all in the same event loop iteration.

We stagger bot startup by 2 seconds each. The first bot starts immediately, the second at +2 seconds, the third at +4 seconds, and so on. The last of the 45 bots starts 88 seconds after the first.

from datetime import datetime, timedelta

for i, bot in enumerate(bots):
    delay = timedelta(seconds=i * 2)  # bot i fires i*2 seconds after startup
    scheduler.add_job(
        bot.tick,
        trigger="interval",
        seconds=timeframe_to_seconds(bot.timeframe),
        id=f"bot-{bot.id}",
        replace_existing=True,
        next_run_time=datetime.now() + delay,
    )

The next_run_time parameter tells APScheduler when to fire the first tick. After that, the interval takes over. The stagger only affects the initial alignment. Over time, the bots' ticks drift naturally because each tick takes a slightly different amount of time depending on exchange response latency and strategy computation cost.

This staggering reduces the peak load on the exchange API. Binance rate limits are per-IP, not per-connection. Even though all 45 bots share one ccxt exchange instance (with its own rate limiter), the stagger ensures we never attempt 45 simultaneous REST calls.

Missed Tick Handling

What happens when a bot's tick takes longer than its interval? If a 15-minute bot's tick takes 20 minutes (unlikely but possible during extreme exchange latency), the next scheduled tick fires while the previous one is still running.

When a run is missed, APScheduler can still fire it late, subject to misfire_grace_time. With coalesce=False, every missed run fires separately once the job is free; with coalesce=True ("coalescing"), the missed runs are rolled into a single catch-up run. For trading bots, even one catch-up on stale data is wrong. If we missed the 9:15 tick and it is now 9:32, we should skip the 9:15 tick entirely and wait for the 9:30 tick. Running on stale 9:15 data at 9:32 could generate a signal that was valid 17 minutes ago but is no longer relevant.

We configure misfire_grace_time and coalesce to handle this.

scheduler.add_job(
    bot.tick,
    trigger="interval",
    seconds=interval,
    misfire_grace_time=60,
    coalesce=True,
    max_instances=1,
)

misfire_grace_time=60 means a tick that is more than 60 seconds late is considered missed and skipped. coalesce=True means if multiple ticks were missed, only one catch-up tick fires (the most recent). max_instances=1 prevents parallel execution of the same bot's tick, so the scheduler never fires a new tick while the previous one is still running.

For 15-minute bots, the 60-second grace time means the tick can be up to 1 minute late and still execute. After that, it is skipped. For 4-hour bots, we could afford a longer grace time, but keeping it consistent at 60 seconds across all bots simplifies reasoning about execution timing.
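The skip-or-run decision the grace time encodes can be modeled in a few lines. This is an illustration of the policy, not APScheduler's internals:

```python
from datetime import datetime, timedelta

GRACE = timedelta(seconds=60)

def should_fire(scheduled: datetime, now: datetime,
                grace: timedelta = GRACE) -> bool:
    """Fire only if the tick is at most `grace` late; otherwise skip it
    and wait for the next scheduled run."""
    return now - scheduled <= grace

# A 9:15 tick attempted at 9:15:30 still fires; at 9:32 it is skipped.
nine_fifteen = datetime(2026, 4, 3, 9, 15)
print(should_fire(nine_fifteen, datetime(2026, 4, 3, 9, 15, 30)))  # True
print(should_fire(nine_fifteen, datetime(2026, 4, 3, 9, 32)))      # False
```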

Integration with FastAPI's Event Loop

The AsyncIOScheduler runs inside FastAPI's event loop. This means bot ticks and API request handlers share the same loop. A long-running tick could delay API responses, and a flood of dashboard requests could delay bot ticks.

In practice, this has not been a problem for two reasons.

First, bot ticks are fast. The typical tick sequence (fetch candles, compute indicators, check signals, apply risk limits) takes 50-200 milliseconds. The async I/O portions (fetching candles from the exchange) yield to the event loop, so API requests are served during the fetch wait time. The CPU-bound portions (indicator computation) take 5-50 milliseconds, which is too short to cause perceptible API latency.

Second, dashboard polling is light. The React UI polls bot status every 10 seconds and risk metrics on the same interval. Each poll is a simple database query that returns in single-digit milliseconds. There is no scenario where dashboard traffic competes meaningfully with bot tick execution.

The one exception is during backtesting. We explicitly run backtest workers in a ProcessPoolExecutor (separate processes) rather than in the event loop, because a backtest that processes 175,000 candles would monopolize the event loop for seconds. All async-incompatible heavy computation is farmed out to processes. The scheduler and bot ticks stay in the event loop.

Job Lifecycle Management

When a bot is started via the API (PATCH /bots/{id}/start), the route handler calls BotManager.start_bot() which initializes the BotInstance, then calls SchedulerService.add_bot_job() which registers the APScheduler job. When a bot is stopped (PATCH /bots/{id}/stop), the reverse happens: the scheduler job is removed and the BotInstance is shut down.

The scheduler maintains the job store in memory (we do not use persistent job stores). This means if the server restarts, all jobs are lost and must be recreated. Our startup logic queries the database for all bots marked as "running" and re-registers their scheduler jobs. This is the correct behavior: we want job state to be derived from database state, not from a separate persistent store that could diverge.

# On server startup: resume all previously running bots
running_bots = db.query(Bot).filter(Bot.status == "running").all()
for bot in running_bots:
    await bot_manager.start_bot(bot.id)

Monitoring and Debugging

APScheduler provides limited built-in monitoring. We augmented it with structured logging that records when each job fires, how long the tick took, and whether the tick produced a signal.

The key log fields are: bot_id, symbol, strategy, tick_duration_ms, signal_generated (boolean), and error (if any). These are emitted as structured JSON via structlog, which makes them searchable and filterable.

When debugging timing issues, the most useful metric is tick_duration_ms. If a bot's average tick duration is 150ms but one tick took 5,000ms, something unusual happened, usually exchange API latency during a high-volatility event. These outliers are normal and handled by the misfire_grace_time configuration. But if tick duration is consistently increasing over time (creeping from 100ms to 500ms over weeks), it indicates a resource leak or growing data volume that needs investigation.

What We Would Change

If we were starting over, we would evaluate APScheduler 4.x (currently in alpha), which has a redesigned async-native architecture. The 3.x version we use works well but has some rough edges around async job stores and event listeners.

We would also consider a simpler approach for bots with the same timeframe. Currently, each bot has its own scheduler job. An alternative is one job per timeframe that ticks all bots of that timeframe in sequence. This reduces scheduler overhead (one 15m job instead of eighteen) and makes staggering unnecessary. The downside is that all 15m bots become coupled: if one bot's tick errors, the tick-all function must handle it without affecting subsequent bots. Our current per-bot isolation is cleaner even if it creates more scheduler jobs.
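The tick-all variant, with the per-bot error isolation it would require, might look like this. It is hypothetical, since we did not adopt it:

```python
import asyncio

async def tick_all(bots: list) -> int:
    """One-job-per-timeframe sketch: tick each bot in sequence,
    containing per-bot errors so one failure cannot block the rest
    of the group. Returns the number of successful ticks."""
    succeeded = 0
    for bot in bots:
        try:
            await bot.tick()
            succeeded += 1
        except Exception:
            # In the real system this would log bot.id and the traceback.
            pass
    return succeeded
```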

But these are refinements, not problems. APScheduler has been running 45 bots for months without a missed trade, a duplicate execution, or a scheduling error. For a library that we configured in about 50 lines of code, that is an excellent return on investment.