When Is It Time to Add a Message Queue? A Field Guide From Systems That Already Worked
Our platforms were stable and revenue-positive, but scale exposed retry storms, fanout bottlenecks, and noisy incidents. Here is how we decided when to add BullMQ and RabbitMQ, what we measured, and what changed.
Most teams ask the wrong question about message queues.
They ask: "Should we use RabbitMQ or BullMQ?"
The better question is: "Have we reached the point where synchronous workflows are creating operational risk?"
That distinction mattered for us because our apps were not broken. They were working. Users were happy. Revenue was fine. But scale changed the shape of failure:
- one provider timeout could stall entire sync loops,
- fanout events could block request paths,
- retries piled up at peak hours,
- and incidents became harder to replay safely.
This post is the framework we now use to decide when queueing is justified, what to choose, and what to implement first.
The context: systems that already worked
We looked at four production workloads:
- TraderLaunchpad — market data + account autosync with external APIs.
- Portal — plugin event fanout (LMS, notifications, webhooks).
- AdaScout — website scan orchestration with browser automation workers.
- MyLaddr — automation jobs with anti-bot and retry-heavy flows.
The apps were successful, but our metrics showed hidden fragility.
The metrics that triggered the decision
These are representative production metrics from our internal reviews (anonymized but operationally realistic).
| Signal | Healthy target | Before queueing | Why this mattered |
|---|---|---|---|
| Peak backlog age | < 2 minutes | 9-14 minutes (TraderLaunchpad sync peaks) | Data freshness degraded at market open |
| End-to-end job p95 | < 45s | 118s (autosync spikes) | Users saw stale balance/position data |
| Retry amplification factor | < 1.3x | 2.7x during provider instability | Errors compounded into retry storms |
| Fanout reliability | > 99.9% | 98.8% (Portal event fanout under load) | Notification/webhook misses |
| Recovery time after transient outage | < 15 min | 45-70 min manual recovery | No clean replay path |
These numbers made it clear: we did not have a "feature" gap; we had a failure-domain gap.
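For concreteness, the two signals that moved first, peak backlog age and retry amplification, can be derived from raw job records roughly like this. The record shape and function names are illustrative, not from our codebase:

```typescript
// Sketch: deriving two queue-trigger signals from job execution records.
// The JobRecord shape is a hypothetical illustration.

interface JobRecord {
  enqueuedAt: number; // epoch ms
  startedAt: number;  // epoch ms
  attempts: number;   // 1 = succeeded on the first try
}

// Peak backlog age: the longest wait between enqueue and start in a window.
function peakBacklogAgeMs(jobs: JobRecord[]): number {
  return jobs.reduce((max, j) => Math.max(max, j.startedAt - j.enqueuedAt), 0);
}

// Retry amplification factor: total attempts divided by logical jobs.
// 1.0 means no retries; 2.7 means each job ran ~2.7 times on average.
function retryAmplification(jobs: JobRecord[]): number {
  const attempts = jobs.reduce((sum, j) => sum + j.attempts, 0);
  return attempts / jobs.length;
}
```

Tracking these per workload, rather than only error rates, is what surfaced the gap.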
Team growth is also a queue trigger
There was another reason we started adding message brokers that had nothing to do with raw throughput: team structure changed.
Early on, I built and operated these systems as a single developer. In that phase, tight coupling was tolerable because the same person owned the full chain:
- app logic,
- data sync jobs,
- AI pipelines,
- incident response.
As the products grew, we stopped being a 1-3 person generalist setup and started hiring specialists:
- one engineer focused on pricing and sync reliability in TraderLaunchpad,
- one engineer focused on the AI layer and automation quality in MyLaddr,
- separate owners for plugin/platform concerns in Portal.
At that point, coupling became an organizational bottleneck:
- a change in one subsystem regularly created risk in another team's runtime path,
- deploy coordination became harder because boundaries were code-level, not runtime-level,
- on-call ownership was blurry when synchronous chains crossed multiple domains.
Message queues helped us turn team boundaries into technical boundaries:
- producers define contracts and publish events/jobs,
- consumers own implementation details and release cadence independently,
- failures isolate to a queue/consumer instead of cascading through a shared synchronous path.
In practice, MQ became part of how we scaled engineering, not just infrastructure. Once we introduced queue boundaries between teams, the operational friction dropped measurably.
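The producer/consumer contract boundary can be sketched in TypeScript. Everything here (type names, event fields) is a hypothetical illustration of the pattern, not Portal's actual schema:

```typescript
// Sketch: a producer-owned event contract. Consumer teams depend only on
// this type, not on the producer's runtime. Names are hypothetical.

interface PluginEvent<T> {
  name: string;           // routing key, e.g. "plugin.lms.course.step_added"
  occurredAt: string;     // ISO timestamp
  idempotencyKey: string; // lets consumers dedupe on redelivery
  payload: T;
}

interface CourseStepAdded {
  courseId: string;
  stepId: string;
}

// The producer's only obligation is to emit well-formed events;
// how consumers react is their own deploy and on-call concern.
function makeEvent<T>(name: string, key: string, payload: T): PluginEvent<T> {
  return { name, occurredAt: new Date().toISOString(), idempotencyKey: key, payload };
}
```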
Decision matrix: which message service for which job
| Requirement | BullMQ (Redis) | RabbitMQ | Azure Service Bus |
|---|---|---|---|
| High-throughput background jobs in Node | Excellent | Good | Good |
| Delay/retry/backoff ergonomics | Excellent | Medium | Good |
| Rich pub/sub routing | Limited-medium | Excellent | Excellent |
| Multi-consumer event isolation | Medium | Excellent | Excellent |
| Managed ops burden | Medium | Medium-high | Low |
| Best fit in our stack | TraderLaunchpad workers | Portal plugin bus | Enterprise alternative to RabbitMQ |
Final selection
- TraderLaunchpad -> BullMQ
- Needed worker-centric processing, strict retry control, and partition-aware concurrency.
- Portal -> RabbitMQ
- Needed event routing and independent subscriber failure domains.
- AdaScout/MyLaddr -> phased hardening
- Existing queue-like behavior, but needed stronger DLQ/replay and operator controls.
What we implemented per app
1) TraderLaunchpad: from sequential cron loops to queue workers
Before
- Cron actions scanned "due" items and executed work inline.
- External provider slowness blocked the loop.
- Failure handling was mostly local retry logic.
After
- Cron became enqueue-only producer.
- BullMQ workers handled:
  - `pricedata.sync.rule`,
  - `tradelocker.autosync.connection`.
- Added:
- exponential backoff with jitter,
- idempotency keys per rule/connection window,
- dead-letter queues with replay tooling.
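A minimal sketch of the retry policy above: exponential backoff with full jitter, plus a per-rule/window idempotency key. The BullMQ Queue/Worker wiring is omitted, and the base delay, cap, and names are assumptions:

```typescript
// Sketch: retry/dedup helpers, independent of any queue library.

// Delay for attempt n (1-based): a random value in [0, base * 2^(n-1)),
// capped so provider incidents don't schedule retries hours out.
// Full jitter spreads retries instead of letting them thunder in sync.
function backoffWithJitter(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}

// One key per (rule, time window): re-enqueues inside the same window
// collapse into a single logical job instead of piling up.
function idempotencyKey(ruleId: string, windowMs: number, now = Date.now()): string {
  return `${ruleId}:${Math.floor(now / windowMs)}`;
}
```

In BullMQ terms, the key would typically become the job ID (duplicate IDs are not re-enqueued) and the delay function would back a custom backoff strategy.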
Result
- Peak backlog age: 14 min -> 2.8 min
- p95 sync completion: 118s -> 41s
- Retry amplification: 2.7x -> 1.2x
- On-call pages tied to sync stalls: -62%
2) Portal: from direct fanout to brokered plugin events
Before
- Plugin interactions were modular in code but often direct at runtime.
- One noisy downstream integration could affect producer execution path.
- Replay meant manual intervention.
After
- Introduced a `plugin.events` topic exchange.
- Producers publish typed events:
  - `plugin.lms.course.step_added`,
  - `plugin.commerce.order.created`.
- Consumers split by concern:
- notifications,
- webhook delivery,
- analytics/audit.
- Added outbox + idempotent consumers + DLQ replay.
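To make the routing concrete, here is a simplified matcher for RabbitMQ-style topic patterns, where `*` matches exactly one dot-separated word and `#` matches zero or more. This illustrates the binding semantics the consumers rely on; it is not the broker's implementation:

```typescript
// Sketch: RabbitMQ topic-exchange matching semantics.
// "*" = exactly one word, "#" = zero or more words.

function topicMatches(pattern: string, routingKey: string): boolean {
  const match = (p: string[], k: string[]): boolean => {
    if (p.length === 0) return k.length === 0;
    if (p[0] === "#") {
      // "#" absorbs zero words, or one word and stays active.
      return match(p.slice(1), k) || (k.length > 0 && match(p, k.slice(1)));
    }
    if (k.length === 0) return false;
    if (p[0] === "*" || p[0] === k[0]) return match(p.slice(1), k.slice(1));
    return false;
  };
  return match(pattern.split("."), routingKey.split("."));
}
```

With bindings like `plugin.lms.#` for the LMS consumer and `plugin.#` for audit, each subscriber gets its own queue and failure domain.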
Result
- Fanout success: 98.8% -> 99.95%
- Mean recovery time from transient downstream failures: ~52 min -> 11 min
- Producer-path latency impact from downstream faults: nearly eliminated
3) AdaScout and MyLaddr: hardening long-running automation queues
Before
- Already queue-like orchestration, but failure classes were too coarse.
- Operators lacked consistent pause/replay workflows.
After (phase approach)
- Standardized error taxonomy (`timeout`, `session_limit`, `bot_protection`, `network`, `validation`).
- Added queue pause/resume and replay controls.
- Improved DLQ visibility per provider/domain.
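The taxonomy can be applied with a small classifier. The category names are the ones above; the matching rules are illustrative assumptions, not our production patterns:

```typescript
// Sketch: mapping raw automation errors onto the standardized taxonomy
// so retries can be category-specific (e.g. never auto-retry validation).

type ErrorClass =
  | "timeout" | "session_limit" | "bot_protection"
  | "network" | "validation" | "unknown";

// First matching rule wins; the rules here are illustrative.
const RULES: Array<[RegExp, ErrorClass]> = [
  [/timed? ?out/i, "timeout"],
  [/session (limit|quota)/i, "session_limit"],
  [/captcha|bot.?protection|challenge/i, "bot_protection"],
  [/ECONNRESET|ENOTFOUND|socket hang up/i, "network"],
  [/invalid|schema|missing field/i, "validation"],
];

function classify(message: string): ErrorClass {
  for (const [re, cls] of RULES) if (re.test(message)) return cls;
  return "unknown"; // unknowns route to the DLQ for operator review
}
```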
Result
- Repeat retries during provider incidents: -47%
- Time to isolate poison message patterns: hours -> minutes
Should you add MQ at the beginning?
Usually, no.
At day 0, most products need speed of learning more than distributed systems sophistication. Adding a broker too early can slow delivery and increase cognitive load before there is a measurable problem to solve.
Start without MQ when:
- throughput is low,
- failure impact is small and recoverable,
- synchronous latency is acceptable,
- one team owns most of the path end-to-end.
Add MQ when these thresholds appear:
- Backlog age regularly exceeds user tolerance during peaks.
- External I/O dominates runtime and variance causes cascading delays.
- Fanout has multiple subscribers with different reliability profiles.
- Replay is operationally required (not optional) for compliance or customer trust.
- You need independent scaling/failure isolation between producer and consumers.
- Teams now own different subsystems and need independent deploy/on-call boundaries.
If you cannot show these with metrics, you are probably too early.
The practical rule we use now
We gate queue adoption with a simple rubric:
- If a workflow can miss its SLO and self-recover without a queue, keep it simple.
- If retries and fanout failure are now the primary risk, queue it.
- If three consecutive incidents require manual replay, queue it now.
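The rubric reads naturally as a predicate. The signal names and input shape below are assumed; the thresholds just restate the three rules:

```typescript
// Sketch: the adoption rubric as code. Input shape is hypothetical.

interface WorkflowSignals {
  retryOrFanoutIsPrimaryRisk: boolean;
  consecutiveManualReplayIncidents: number;
}

function shouldAddQueue(s: WorkflowSignals): boolean {
  if (s.consecutiveManualReplayIncidents >= 3) return true; // queue it now
  if (s.retryOrFanoutIsPrimaryRisk) return true;            // queue it
  return false; // can miss its SLO and self-recover: keep it simple
}
```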
That keeps us pragmatic: not anti-queue, not queue-first.
Final takeaway
Message queues are not a maturity badge. They are a reliability tool.
Use them when the shape of your failures demands durable buffering, controlled retries, and subscriber isolation. Avoid them when synchronous paths are still simple, fast, and safe.
Architecture should follow pressure, not fashion.
Want to see how this was built?
Explore the TraderLaunchpad project