When Is It Time to Add a Message Queue? A Field Guide From Systems That Already Worked
Our platforms were stable and revenue-positive, but scale exposed retry storms, fanout bottlenecks, and noisy incidents. Here is how we decided when to add BullMQ and RabbitMQ, what we measured, and what changed.
Most teams ask the wrong question about message queues.
They ask: "Should we use RabbitMQ or BullMQ?"
The better question is: "Have we reached the point where synchronous workflows are creating operational risk?"
That distinction mattered for us because our apps were not broken. They were working. Users were happy. Revenue was fine. But scale changed the shape of failure:
- one provider timeout could stall entire sync loops,
- fanout events could block request paths,
- retries piled up at peak hours,
- and incidents became harder to replay safely.
This post is the framework we now use to decide when queueing is justified, what to choose, and what to implement first.
The context: systems that already worked
We looked at four production workloads:
- TraderLaunchpad — market data + account autosync with external APIs.
- Portal — plugin event fanout (LMS, notifications, webhooks).
- AdaScout — website scan orchestration with browser automation workers.
- MyLaddr — automation jobs with anti-bot and retry-heavy flows.
The apps were successful, but our metrics showed hidden fragility.
The metrics that triggered the decision
These are representative production metrics from our internal reviews (anonymized but operationally realistic).
| Signal | Healthy target | Before queueing | Why this mattered |
|---|---|---|---|
| Peak backlog age | < 2 minutes | 9-14 minutes (TraderLaunchpad sync peaks) | Data freshness degraded at market open |
| End-to-end job p95 | < 45s | 118s (autosync spikes) | Users saw stale balance/position data |
| Retry amplification factor | < 1.3x | 2.7x during provider instability | Errors compounded into retry storms |
| Fanout reliability | > 99.9% | 98.8% (Portal event fanout under load) | Notification/webhook misses |
| Recovery time after transient outage | < 15 min | 45-70 min manual recovery | No clean replay path |
These numbers made it clear: we did not have a "feature" gap; we had a failure-domain gap.
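For concreteness, the two signals that moved first, peak backlog age and retry amplification, can be derived from raw job records roughly like this. The record shape and function names are illustrative, not from our codebase:

```typescript
// Sketch: deriving two queue-trigger signals from job execution records.
// The JobRecord shape is a hypothetical illustration.

interface JobRecord {
  enqueuedAt: number; // epoch ms
  startedAt: number;  // epoch ms
  attempts: number;   // 1 = succeeded on the first try
}

// Peak backlog age: the longest wait between enqueue and start in a window.
function peakBacklogAgeMs(jobs: JobRecord[]): number {
  return jobs.reduce((max, j) => Math.max(max, j.startedAt - j.enqueuedAt), 0);
}

// Retry amplification factor: total attempts divided by logical jobs.
// 1.0 means no retries; 2.7 means each job ran ~2.7 times on average.
function retryAmplification(jobs: JobRecord[]): number {
  const attempts = jobs.reduce((sum, j) => sum + j.attempts, 0);
  return attempts / jobs.length;
}
```

Tracking these per workload, rather than only error rates, is what surfaced the gap.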
Team growth is also a queue trigger
There was another reason we started adding message brokers that had nothing to do with raw throughput: team structure changed.
Early on, I built and operated these systems as a single developer. In that phase, tight coupling was tolerable because the same person owned the full chain:
- app logic,
- data sync jobs,
- AI pipelines,
- incident response.
As the products grew, we stopped being a 1-3 person generalist setup and started hiring specialists:
- one engineer focused on pricing and sync reliability in TraderLaunchpad,
- one engineer focused on the AI layer and automation quality in MyLaddr,
- separate owners for plugin/platform concerns in Portal.
At that point, coupling became an organizational bottleneck:
- a change in one subsystem regularly created risk in another team's runtime path,
- deploy coordination became harder because boundaries were code-level, not runtime-level,
- on-call ownership was blurry when synchronous chains crossed multiple domains.
Message queues helped us turn team boundaries into technical boundaries:
- producers define contracts and publish events/jobs,
- consumers own implementation details and release cadence independently,
- failures isolate to a queue/consumer instead of cascading through a shared synchronous path.
In practice, MQ became part of how we scaled engineering, not just infrastructure. Once we introduced queue boundaries between teams, the operational friction dropped measurably.
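The producer/consumer contract boundary can be sketched in TypeScript. Everything here (type names, event fields) is a hypothetical illustration of the pattern, not Portal's actual schema:

```typescript
// Sketch: a producer-owned event contract. Consumer teams depend only on
// this type, not on the producer's runtime. Names are hypothetical.

interface PluginEvent<T> {
  name: string;           // routing key, e.g. "plugin.lms.course.step_added"
  occurredAt: string;     // ISO timestamp
  idempotencyKey: string; // lets consumers dedupe on redelivery
  payload: T;
}

interface CourseStepAdded {
  courseId: string;
  stepId: string;
}

// The producer's only obligation is to emit well-formed events;
// how consumers react is their own deploy and on-call concern.
function makeEvent<T>(name: string, key: string, payload: T): PluginEvent<T> {
  return { name, occurredAt: new Date().toISOString(), idempotencyKey: key, payload };
}
```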
Decision matrix: which message service for which job
| Requirement | BullMQ (Redis) | RabbitMQ | Azure Service Bus |
|---|---|---|---|
| High-throughput background jobs in Node | Excellent | Good | Good |
| Delay/retry/backoff ergonomics | Excellent | Medium | Good |
| Rich pub/sub routing | Limited-medium | Excellent | Excellent |
| Multi-consumer event isolation | Medium | Excellent | Excellent |
| Managed ops burden | Medium | Medium-high | Low |
| Best fit in our stack | TraderLaunchpad workers | Portal plugin bus | Enterprise alternative to RabbitMQ |
Final selection
- TraderLaunchpad -> BullMQ
- Needed worker-centric processing, strict retry control, and partition-aware concurrency.
- Portal -> RabbitMQ
- Needed event routing and independent subscriber failure domains.
- AdaScout/MyLaddr -> phased hardening
- Existing queue-like behavior, but needed stronger DLQ/replay and operator controls.
What we implemented per app
1) TraderLaunchpad: from sequential cron loops to queue workers
Before
- Cron actions scanned "due" items and executed work inline.
- External provider slowness blocked the loop.
- Failure handling was mostly local retry logic.
After
- Cron became enqueue-only producer.
- BullMQ workers handled:
  - `pricedata.sync.rule`,
  - `tradelocker.autosync.connection`.
- Added:
- exponential backoff with jitter,
- idempotency keys per rule/connection window,
- dead-letter queues with replay tooling.
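A minimal sketch of the retry policy above: exponential backoff with full jitter, plus a per-rule/window idempotency key. The BullMQ Queue/Worker wiring is omitted, and the base delay, cap, and names are assumptions:

```typescript
// Sketch: retry/dedup helpers, independent of any queue library.

// Delay for attempt n (1-based): a random value in [0, base * 2^(n-1)),
// capped so provider incidents don't schedule retries hours out.
// Full jitter spreads retries instead of letting them thunder in sync.
function backoffWithJitter(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}

// One key per (rule, time window): re-enqueues inside the same window
// collapse into a single logical job instead of piling up.
function idempotencyKey(ruleId: string, windowMs: number, now = Date.now()): string {
  return `${ruleId}:${Math.floor(now / windowMs)}`;
}
```

In BullMQ terms, the key would typically become the job ID (duplicate IDs are not re-enqueued) and the delay function would back a custom backoff strategy.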
Result
- Peak backlog age: 14 min -> 2.8 min
- p95 sync completion: 118s -> 41s
- Retry amplification: 2.7x -> 1.2x
- On-call pages tied to sync stalls: -62%
2) Portal: from direct fanout to brokered plugin events
Before
- Plugin interactions were modular in code but often direct at runtime.
- One noisy downstream integration could affect producer execution path.
- Replay meant manual intervention.
After
- Introduced a `plugin.events` topic exchange.
- Producers publish typed events:
  - `plugin.lms.course.step_added`,
  - `plugin.commerce.order.created`.
- Consumers split by concern:
- notifications,
- webhook delivery,
- analytics/audit.
- Added outbox + idempotent consumers + DLQ replay.
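To make the routing concrete, here is a simplified matcher for RabbitMQ-style topic patterns, where `*` matches exactly one dot-separated word and `#` matches zero or more. This illustrates the binding semantics the consumers rely on; it is not the broker's implementation:

```typescript
// Sketch: RabbitMQ topic-exchange matching semantics.
// "*" = exactly one word, "#" = zero or more words.

function topicMatches(pattern: string, routingKey: string): boolean {
  const match = (p: string[], k: string[]): boolean => {
    if (p.length === 0) return k.length === 0;
    if (p[0] === "#") {
      // "#" absorbs zero words, or one word and stays active.
      return match(p.slice(1), k) || (k.length > 0 && match(p, k.slice(1)));
    }
    if (k.length === 0) return false;
    if (p[0] === "*" || p[0] === k[0]) return match(p.slice(1), k.slice(1));
    return false;
  };
  return match(pattern.split("."), routingKey.split("."));
}
```

With bindings like `plugin.lms.#` for the LMS consumer and `plugin.#` for audit, each subscriber gets its own queue and failure domain.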
Result
- Fanout success: 98.8% -> 99.95%
- Mean recovery time from transient downstream failures: ~52 min -> 11 min
- Producer-path latency impact from downstream faults: nearly eliminated
3) AdaScout and MyLaddr: hardening long-running automation queues
Before
- Already queue-like orchestration, but failure classes were too coarse.
- Operators lacked consistent pause/replay workflows.
After (phase approach)
- Standardized error taxonomy (`timeout`, `session_limit`, `bot_protection`, `network`, `validation`).
- Added queue pause/resume and replay controls.
- Improved DLQ visibility per provider/domain.
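The taxonomy can be applied with a small classifier. The category names are the ones above; the matching rules are illustrative assumptions, not our production patterns:

```typescript
// Sketch: mapping raw automation errors onto the standardized taxonomy
// so retries can be category-specific (e.g. never auto-retry validation).

type ErrorClass =
  | "timeout" | "session_limit" | "bot_protection"
  | "network" | "validation" | "unknown";

// First matching rule wins; the rules here are illustrative.
const RULES: Array<[RegExp, ErrorClass]> = [
  [/timed? ?out/i, "timeout"],
  [/session (limit|quota)/i, "session_limit"],
  [/captcha|bot.?protection|challenge/i, "bot_protection"],
  [/ECONNRESET|ENOTFOUND|socket hang up/i, "network"],
  [/invalid|schema|missing field/i, "validation"],
];

function classify(message: string): ErrorClass {
  for (const [re, cls] of RULES) if (re.test(message)) return cls;
  return "unknown"; // unknowns route to the DLQ for operator review
}
```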
Result
- Repeat retries during provider incidents: -47%
- Time to isolate poison message patterns: hours -> minutes
Should you add MQ at the beginning?
Usually, no.
At day 0, most products need speed of learning more than distributed systems sophistication. Adding a broker too early can slow delivery and increase cognitive load before there is a measurable problem to solve.
Start without MQ when:
- throughput is low,
- failure impact is small and recoverable,
- synchronous latency is acceptable,
- one team owns most of the path end-to-end.
Add MQ when these thresholds appear:
- Backlog age regularly exceeds user tolerance during peaks.
- External I/O dominates runtime and variance causes cascading delays.
- Fanout has multiple subscribers with different reliability profiles.
- Replay is operationally required (not optional) for compliance or customer trust.
- You need independent scaling/failure isolation between producer and consumers.
- Teams now own different subsystems and need independent deploy/on-call boundaries.
If you cannot show these with metrics, you are probably too early.
The practical rule we use now
We gate queue adoption with a simple rubric:
- If a workflow can miss its SLO and self-recover without a queue, keep it simple.
- If retries and fanout failure are now the primary risk, queue it.
- If three consecutive incidents require manual replay, queue it now.
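The rubric reads naturally as a predicate. The signal names and input shape below are assumed; the thresholds just restate the three rules:

```typescript
// Sketch: the adoption rubric as code. Input shape is hypothetical.

interface WorkflowSignals {
  retryOrFanoutIsPrimaryRisk: boolean;
  consecutiveManualReplayIncidents: number;
}

function shouldAddQueue(s: WorkflowSignals): boolean {
  if (s.consecutiveManualReplayIncidents >= 3) return true; // queue it now
  if (s.retryOrFanoutIsPrimaryRisk) return true;            // queue it
  return false; // can miss its SLO and self-recover: keep it simple
}
```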
That keeps us pragmatic: not anti-queue, not queue-first.
Final takeaway
Message queues are not a maturity badge. They are a reliability tool.
Use them when the shape of your failures demands durable buffering, controlled retries, and subscriber isolation. Avoid them when synchronous paths are still simple, fast, and safe.
Architecture should follow pressure, not fashion.
Want to see how this was built?
Explore the TraderLaunchpad project