Observability Across Five Production Systems
Monitoring, tracing, and instrumentation look different in every system we run. Here is what we actually measure, how we measure it, and what those measurements have caught — across Kubernetes infrastructure, integration pipelines, high-performance APIs, browser automation, and AI agents.
The LaunchThat ecosystem runs five distinct production systems — a multi-tenant Kubernetes platform, an integration pipeline, a high-performance API service, a browser automation engine, and an AI agent deployment platform. Each one has different operational characteristics, failure modes, and observability requirements.
This post is not about which monitoring tool is best. It is about what we actually measure, why we chose those measurements, and what they have caught in production.
The three pillars, in practice
Observability literature talks about three pillars: metrics, logs, and traces. In practice, each system leans on a different pillar depending on its primary failure mode.
Metrics dominate when the question is "is the system healthy right now?" — container resource usage, queue depth, success rates. Portal V4 and BrowserLaunch lean on metrics.
Traces dominate when the question is "why did this specific request fail?" — following a single event through multiple services and database calls. RelayFlow and PulseAPI lean on traces.
Structured logs dominate when the question is "what did the AI say and why?" — capturing inputs, outputs, and decision context for post-hoc analysis. The AI production layer leans on logs.
No system uses only one pillar. But knowing which pillar is primary for each system determines where we invest instrumentation effort first.
System 1: Portal V4 — infrastructure metrics with Prometheus and Grafana
Portal V4 runs on a bare-metal k3s cluster. Each tenant gets their own frontend containers and their own Convex backend instance. The infrastructure layer has failure modes that application-level monitoring cannot catch.
What we measure
Container-level metrics via Prometheus node exporter and cAdvisor:
- CPU and memory usage per container, per tenant
- Container restart counts (a leading indicator of memory leaks or crash loops)
- Disk usage on the host — especially important because container logs accumulate
Caddy reverse proxy metrics:
- Request latency by upstream (which tenant is experiencing slow responses)
- 4xx/5xx rates by tenant subdomain
- TLS certificate expiration countdown
Per-tenant Convex instance health:
- Each tenant's Convex backend runs independently. We monitor whether each instance is reachable and responsive
- Sync lag — how far behind a client's state is from the server's latest mutation
k3s cluster health:
- Node readiness and resource pressure conditions
- Pod scheduling latency (how long new deployments take to become ready)
- Persistent volume usage and IOPS
What it caught
A single tenant's container was consuming 2.3GB of memory — in a cluster where each container was allocated 512MB. The OOM killer was restarting it every 40 minutes. Without per-container memory tracking, this looked like "the server is unstable." With it, we traced the issue to a specific tenant's page builder component that loaded every page into memory during preview rendering. The fix was lazy loading and memory limits — but the diagnosis required per-tenant resource metrics.
Alerting rules
We alert on symptoms, not causes:
- Error rate > 5% for any tenant for 5 consecutive minutes
- Container restart count > 3 in a 15-minute window
- Disk usage > 80% on any host volume
- Convex instance unreachable for 60 seconds
We deliberately do not alert on CPU usage. High CPU is fine if latency is fine. Alerting on causes creates noise; alerting on symptoms creates action.
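The "error rate > 5% for 5 consecutive minutes" rule can be sketched as a windowed check. This is an illustrative reduction, not the actual Prometheus rule (the `RequestSample` shape and function name are hypothetical); the key point is that any healthy minute resets the alert:

```typescript
type RequestSample = { timestamp: number; isError: boolean };

// True only when the error rate exceeded `threshold` in every one of the
// last `minutes` one-minute buckets -- the "5% for 5 consecutive minutes" rule.
function errorRateBreached(
  samples: RequestSample[],
  now: number,
  minutes = 5,
  threshold = 0.05,
): boolean {
  for (let i = 0; i < minutes; i++) {
    const end = now - i * 60_000;
    const start = end - 60_000;
    const bucket = samples.filter((s) => s.timestamp > start && s.timestamp <= end);
    if (bucket.length === 0) return false; // no traffic: cannot confirm a breach
    const rate = bucket.filter((s) => s.isError).length / bucket.length;
    if (rate <= threshold) return false; // any healthy minute resets the alert
  }
  return true;
}
```

A monitoring system evaluates the equivalent of this on every scrape; the sustained-window requirement is what keeps a single bad minute from paging anyone.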
System 2: RelayFlow — distributed tracing with OpenTelemetry
RelayFlow processes webhooks from external services (Stripe, Monday.com, Discord, email providers) and routes them through a multi-stage pipeline. A single webhook touches signature verification, normalization, routing, task execution, and potentially dead-letter recovery. When something fails at step 4, you need to trace back to step 1.
What we measure
End-to-end traces with correlation IDs:
Every inbound webhook gets a correlation ID at the ingestion point. That ID propagates through every function call, database write, and external API request in the pipeline. When a task fails, the operator console shows the full trace: what came in, how it was normalized, where it was routed, what the handler did, and where it failed.
const correlationId = `${provider}:${eventType}:${crypto.randomUUID()}`;
Span annotations on critical operations:
- Signature verification duration and result
- Normalization transform time
- Database write latency (raw payload storage, canonical event storage)
- External API call duration and status code
- DLQ enqueue reason and retry count
Pipeline health metrics:
- Events processed per minute, per provider
- Average pipeline latency (ingestion to completion)
- DLQ depth and age of oldest unprocessed event
- Retry rate by provider and event type
What it caught
A Monday.com webhook handler was succeeding on the first attempt 95% of the time — acceptable by most standards. But the trace data showed that the 5% failures were concentrated on item.update events for boards with more than 50 columns. The Monday.com API was returning partial column data under load, and our normalizer was throwing on missing fields.
Without per-event traces, this would have looked like "Monday.com integration has a 5% error rate." With traces, we could see the exact payload shape that triggered the failure, correlate it to board size, and add defensive handling for partial column responses.
Trace propagation through Convex
Convex scheduled functions and actions do not natively carry trace context. We propagate correlation IDs explicitly:
await ctx.scheduler.runAfter(0, internal.pipeline.executeTask, {
  correlationId,
  eventId: event._id,
  taskType: "process_webhook",
});
The correlation ID is a first-class argument. Every function in the chain includes it in structured log output. This is not as elegant as automatic context propagation (like OpenTelemetry's W3C trace context), but it works reliably in a serverless environment where there is no thread-local storage.
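One way to keep that tedium contained is a small factory that binds the correlation ID once at the pipeline entry point. This is a sketch, not RelayFlow's actual logging API; the `scopedLogger` name and field shapes are illustrative:

```typescript
type LogFields = Record<string, unknown>;

// Bind the correlation ID once; every log call in that event's
// lifecycle then carries it automatically.
function scopedLogger(correlationId: string) {
  const emit = (level: string, fields: LogFields) => {
    const entry = { ...fields, level, correlationId };
    console.log(JSON.stringify(entry));
    return entry;
  };
  return {
    info: (fields: LogFields) => emit("info", fields),
    error: (fields: LogFields) => emit("error", fields),
  };
}

const log = scopedLogger("stripe:invoice.paid:123e4567");
log.info({ step: "normalize", result: "success" });
```

Each function in the chain constructs its logger from the correlation ID it received as an argument, so the ID still crosses async boundaries explicitly; only the per-call repetition goes away.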
System 3: PulseAPI — performance instrumentation with SLOs
PulseAPI is a read-heavy API service where latency is the primary quality signal. The instrumentation is built around answering one question: "are we meeting our latency SLOs, and if not, which endpoints are the bottleneck?"
What we measure
Tracing spans on every database call:
[Request] GET /api/search?q=widgets
├─ [Span] authenticate: 12ms
├─ [Span] parse_query: 2ms
├─ [Span] db_search: 89ms (cache: MISS)
│   ├─ [Span] sql_execute: 85ms
│   └─ [Span] result_map: 4ms
└─ [Span] serialize: 8ms
Total: 111ms
Each span records its duration, whether it hit cache, and the query fingerprint (parameterized SQL without values). This lets us identify which queries dominate latency without exposing user data in traces.
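Fingerprinting can be as simple as collapsing literal values into placeholders. This is a rough sketch under that assumption; production fingerprinting typically also normalizes IN-list lengths, comments, and identifier casing:

```typescript
// Collapse literals so "WHERE id = 42" and "WHERE id = 97"
// share a single fingerprint and aggregate into one latency series.
function fingerprint(sql: string): string {
  return sql
    .replace(/'(?:[^'\\]|\\.)*'/g, "?") // string literals
    .replace(/\b\d+(\.\d+)?\b/g, "?")   // numeric literals
    .replace(/\s+/g, " ")
    .trim();
}
```

The fingerprint is what gets attached to the span, so no user-supplied values ever enter the trace store.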
Cache hit/miss ratios per endpoint class:
- Search endpoints: target 60% cache hit rate
- Detail endpoints: target 80% cache hit rate
- Aggregate endpoints: target 90% cache hit rate (because aggregates are expensive and change infrequently)
Cache miss on a search endpoint is expected — queries vary widely. Cache miss on an aggregate endpoint means the background precomputation job is falling behind.
k6 load test baselines:
Every optimization starts with a benchmark. We run k6 load tests that simulate realistic traffic patterns (mixed read/write, concurrent users, varied query complexity) and capture:
- p50, p95, p99 latency per endpoint
- Requests per second at saturation
- Error rate under load
The baseline is versioned with the code. When a PR claims to improve search performance, the CI pipeline runs the k6 suite and compares results against the stored baseline. "It feels faster" is not evidence. "p95 dropped from 180ms to 95ms under 200 concurrent users" is.
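The comparison step in CI reduces to a regression gate over the stored baseline. A minimal sketch (the `Baseline` shape and 10% tolerance are illustrative; parsing the k6 summary output is omitted):

```typescript
type Baseline = { p95Ms: number; errorRate: number };

// Fail the build when p95 regresses more than `tolerance` against the
// stored baseline, or when the error rate worsens at all.
function regressionGate(
  baseline: Baseline,
  current: Baseline,
  tolerance = 0.1,
): boolean {
  const p95Ok = current.p95Ms <= baseline.p95Ms * (1 + tolerance);
  const errOk = current.errorRate <= baseline.errorRate;
  return p95Ok && errOk;
}
```

The tolerance exists because load-test runs are noisy; a hard equality check would fail builds on run-to-run variance rather than real regressions.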
SLO-aware alerting:
We define error budgets per endpoint class:
- Search: p95 < 200ms, error rate < 1%
- Detail: p95 < 100ms, error rate < 0.5%
- Aggregate: p95 < 500ms, error rate < 0.1%
Alerts fire when the error budget burn rate exceeds the threshold for a sustained period. A brief latency spike from a cold cache does not page anyone. A sustained degradation from a missing index does.
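Burn rate means how fast the error budget is being consumed relative to its allowance. A sketch of the idea, using the multiwindow pattern common in SRE practice (the 14.4 multiplier is the conventional fast-burn threshold for a 1-hour window against a 30-day budget; treat the exact numbers here as illustrative):

```typescript
// With an SLO of "error rate < 1%", the budget is 0.01. Burn rate 1.0
// spends the budget exactly as fast as it is allotted; 14.4 sustained
// would exhaust a 30-day budget in roughly two days.
function burnRate(observedErrorRate: number, budget: number): number {
  return observedErrorRate / budget;
}

// Page only when both a long and a short window burn fast. A brief
// cold-cache spike fails the long-window condition and stays silent.
function shouldPage(
  longWindowRate: number,
  shortWindowRate: number,
  budget: number,
  threshold = 14.4,
): boolean {
  return (
    burnRate(longWindowRate, budget) > threshold &&
    burnRate(shortWindowRate, budget) > threshold
  );
}
```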
What it caught
After adding a new filter to the search endpoint, p99 latency jumped from 120ms to 2,400ms. The p50 was unchanged at 45ms. Without percentile tracking, average latency would have shown a modest increase. The p99 spike revealed that the new filter was not covered by an index — most queries did not use the filter (p50 was fine), but the ones that did triggered a sequential scan.
The fix was a composite index. The diagnosis took 5 minutes because the tracing span showed exactly which SQL query was slow and the query plan showed the missing index.
System 4: BrowserLaunch — operational metrics via Convex
BrowserLaunch's monitoring is built on Convex subscriptions — the same real-time system that powers the user-facing application. This is unusual. Most monitoring systems are separate infrastructure. Ours is not.
What we measure
Success rate per domain:
Browser automation against external websites fails for reasons outside our control: layout changes, rate limiting, CAPTCHAs, downtime. Success rate per domain tells us which targets need attention and which are working fine.
indeed.com: 98.2% success (last 24h)
linkedin.com: 94.1% success (last 24h) ← below threshold
glassdoor.com: 97.5% success (last 24h)
When a domain drops below 95%, we investigate. Usually it is a selector change — the site updated its layout and our extraction rules need updating.
Latency percentiles per URL:
- p50: median processing time (typically 3-8 seconds per page)
- p95: the slow tail (typically 15-25 seconds — pages with heavy JavaScript)
- p99: outliers that might indicate hangs or infinite redirects
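For batch-sized sample sets, a nearest-rank percentile needs only a sort. A minimal sketch; high-volume systems usually use streaming estimators (t-digest, HDR histograms) rather than sorting raw samples:

```typescript
// Nearest-rank percentile over a batch of latency samples (ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}
```

The same function serves p50, p95, and p99; only the `p` argument changes.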
Memory tracking per browser instance:
The browser pool recycles instances after 50 pages to bound memory leaks. Memory tracking validates that this threshold is correct. If we see memory climbing linearly through the 50-page cycle, the threshold is working. If we see it spike early, we lower the threshold.
Queue depth and processing rate:
How many URLs are waiting, how many are in-flight, and how many completed per minute. This is a Convex query:
export const getQueueStats = query({
  args: {},
  returns: v.object({
    pending: v.number(),
    running: v.number(),
    completedLastHour: v.number(),
  }),
  handler: async (ctx) => {
    const pending = await ctx.db
      .query("automationTasks")
      .withIndex("by_status_created", (q) => q.eq("status", "pending"))
      .collect();
    const running = await ctx.db
      .query("automationTasks")
      .withIndex("by_status_created", (q) => q.eq("status", "running"))
      .collect();
    // Completed in the last hour, using the index's second field
    const oneHourAgo = Date.now() - 60 * 60 * 1000;
    const completed = await ctx.db
      .query("automationTasks")
      .withIndex("by_status_created", (q) =>
        q.eq("status", "completed").gte("createdAt", oneHourAgo),
      )
      .collect();
    return {
      pending: pending.length,
      running: running.length,
      completedLastHour: completed.length,
    };
  },
});
The dashboard subscribes to this query with useQuery. Updates are instant — when a task moves from pending to running, the dashboard reflects it within milliseconds without polling.
What it caught
Queue depth was climbing steadily over a 3-hour period while processing rate stayed constant. The dashboard showed 340 pending tasks with only 4 in-flight. The problem: the browser pool had two instances stuck on pages with infinite JavaScript redirects. The pool considered them "in use" and refused to allocate new instances, creating a bottleneck.
The fix was adding a per-page timeout (30 seconds) and a pool-level health check that force-recycles stale instances. The real-time dashboard caught it within minutes instead of hours.
System 5: AI in production — structured logging for accountability
AI systems have a unique observability requirement: you need to explain why the system said what it said. Metrics tell you the error rate. Traces tell you the latency. Logs tell you the content — what went in, what came out, and whether the output was reasonable.
What we measure
Every prompt and response:
const logEntry = {
  correlationId,
  model: "gpt-4o",
  promptTokens: usage.prompt_tokens,
  completionTokens: usage.completion_tokens,
  cost: calculateCost(usage),
  latency: endTime - startTime,
  guardrailResult: "pass",
  confidenceScore: extractConfidence(response),
};
Every AI call is logged with its full context: which model, how many tokens, what it cost, how long it took, and whether guardrails flagged anything. This is not optional — when a user reports that the AI said something wrong, we need to reconstruct exactly what happened.
Cost tracking per model per request:
AI API costs are unpredictable without measurement. A single conversational thread with a large context window can cost $0.50 in tokens. Multiply by thousands of users and you have a cost problem you did not see coming.
We track cost per request, per model, per user, and per feature. When the support chat feature's daily cost spikes, we can trace it to a specific user sending unusually long messages or a specific conversation that accumulated a large context window.
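A cost helper is a lookup over per-token rates. This is a sketch of what such a helper might look like, with the signature extended to take the model name; the rates below are placeholders, not current provider pricing, and the `Usage` shape mirrors the OpenAI-style usage object:

```typescript
type Usage = { prompt_tokens: number; completion_tokens: number };

// Placeholder per-1K-token rates -- real pricing varies by model
// and changes over time, so keep this table in config, not code.
const RATES: Record<string, { inPer1K: number; outPer1K: number }> = {
  "gpt-4o": { inPer1K: 0.005, outPer1K: 0.015 },
};

function calculateCost(model: string, usage: Usage): number {
  const r = RATES[model];
  if (!r) throw new Error(`no rate configured for ${model}`);
  return (
    (usage.prompt_tokens / 1000) * r.inPer1K +
    (usage.completion_tokens / 1000) * r.outPer1K
  );
}
```

Attributing the result to a user and feature at log time is what makes the later per-feature cost breakdowns a query instead of a forensic exercise.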
Guardrail trigger rates:
The AI layer runs every response through validation before showing it to users. We track how often guardrails fire and why:
- Content policy violations
- Confidence score below threshold
- Response format validation failures (structured output did not match the expected schema)
- Hallucination indicators (references to nonexistent features or policies)
A spike in guardrail triggers is a leading indicator. It means the model is producing more questionable output than usual — possibly because prompt templates changed, context windows grew too large, or the model was updated by the provider.
Hallucination detection:
We validate AI outputs against known facts when possible. If the AI references a refund policy, we check whether that policy exists in the knowledge base. If it references a product feature, we check whether that feature is enabled for the tenant. False references are logged as hallucination events with the full prompt/response chain for investigation.
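The fact check itself reduces to set membership against the knowledge base. A sketch under the assumption that referenced entities have already been extracted from the response upstream (the function and entity names are illustrative):

```typescript
// Entities the response claims exist, checked against what actually does.
// Anything not in the knowledge base is logged as a hallucination event.
function findHallucinations(
  referenced: string[],
  knownEntities: Set<string>,
): string[] {
  return referenced.filter((name) => !knownEntities.has(name));
}

const known = new Set(["30-day-refund-policy", "bulk-export"]);
findHallucinations(["30-day-refund-policy", "lifetime-warranty"], known);
// flags "lifetime-warranty" as a hallucination event
```

The hard part is the extraction step, not the check; but once references are structured, the validation is cheap enough to run on every response.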
What it caught
The AI support agent was generating responses with an average confidence score of 0.82 — well above our 0.6 threshold. But the structured logs showed that one specific conversation pattern — users asking about pricing followed by requesting a comparison — consistently produced responses below 0.5 confidence. The model was attempting to compare pricing tiers but hallucinating features that did not exist in lower tiers.
Without per-response confidence logging, this would have been invisible in aggregate metrics. With it, we identified the pattern, updated the prompt template to include the actual feature matrix, and the confidence scores for comparison queries jumped to 0.85.
Instrumentation patterns
Structured logging over printf
Every log entry is a structured object, not a formatted string:
// Not this
console.log(`User ${userId} created project ${projectId} in ${duration}ms`);

// This
logger.info({
  event: "project.created",
  userId,
  projectId,
  workspaceId,
  durationMs: duration,
  correlationId,
});
Structured logs are queryable. "Show me all project.created events where durationMs > 5000" is a filter operation, not a regex exercise. When you are debugging an incident at 2am, the difference between structured and unstructured logs is the difference between finding the root cause in 5 minutes and reading log files for an hour.
Metric types matter
Not all numbers are the same kind of metric:
- Counters (monotonically increasing): request count, error count, bytes processed. You measure the rate of change, not the absolute value.
- Gauges (point-in-time values): queue depth, active connections, memory usage. You measure the current value.
- Histograms (distributions): request latency, response size, processing time. You measure percentiles (p50, p95, p99), not averages.
Using the wrong metric type leads to wrong conclusions. Average latency hides tail latency. A counter used as a gauge resets on restart. A gauge used as a counter loses events between scrapes.
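The counter-reset pitfall is concrete: a naive delta between scrapes goes negative after a restart. A sketch of reset-aware rate calculation, which is roughly the behavior monitoring systems like Prometheus implement for counters:

```typescript
// Two successive scrapes of a monotonic counter. If the value went
// down, the process restarted and the counter reset to zero -- treat
// the new value as the increase since the reset.
function counterRate(prev: number, curr: number, intervalSec: number): number {
  const increase = curr >= prev ? curr - prev : curr;
  return increase / intervalSec;
}
```

This is also why a gauge must never be scraped as a counter: a gauge can legitimately decrease, and the reset heuristic would silently fabricate an increase.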
Alert on symptoms, not causes
Our alerting philosophy:
- Alert: error rate > 5% for 5 minutes (symptom — users are affected)
- Do not alert: CPU > 80% (cause — might be fine if latency is fine)
- Alert: p99 latency > SLO threshold for 15 minutes (symptom — tail users are affected)
- Do not alert: memory > 70% (cause — might be a healthy cache)
The distinction matters because cause-based alerts create noise. CPU spikes during deployment are normal. Memory increases with cache warming are healthy. Alerting on these wastes attention. Symptom-based alerts fire when users are actually experiencing degradation — which is the only time you need to act.
Correlation IDs across async boundaries
In a synchronous request/response system, trace context propagates automatically. In an async system with scheduled functions, background jobs, and event-driven pipelines, you have to propagate it manually:
// At ingestion: generate a correlation ID
const correlationId = generateCorrelationId();

// Pass it through every scheduled function
await ctx.scheduler.runAfter(0, internal.pipeline.process, {
  correlationId,
  payload,
});

// Include it in every log entry
logger.info({ correlationId, step: "normalize", result: "success" });

// Include it in every external API call header
headers["X-Correlation-ID"] = correlationId;
This is tedious but essential. Without correlation IDs, debugging a failed webhook in RelayFlow means searching logs by timestamp and hoping you find the right entries. With correlation IDs, you filter by one string and see the entire event lifecycle.
What we learned
Instrument before you need it. Every system described here had instrumentation built in from the start, not added after an incident. The cost of adding metrics, traces, and structured logs during development is small. The cost of adding them during an outage is enormous.
Percentiles reveal what averages hide. PulseAPI's p50 latency was fine. The p99 was 20x worse. Average latency would have shown a modest number that masked the tail. If your SLO says "95% of requests under 200ms," you need to measure the 95th percentile, not the average.
Real-time dashboards change behavior. BrowserLaunch's Convex-powered dashboard was not just monitoring — it changed how we operated. Seeing queue depth climb in real time prompted immediate investigation instead of post-mortem discovery. The observability tool became an operational tool.
Cost tracking is observability. For AI systems, cost per request is as important as latency per request. A system that is fast and correct but costs 10x what you budgeted is not healthy. We treat cost as a first-class metric alongside latency and error rate.
Log the content, not just the metadata. For AI specifically, knowing that a request took 800ms and returned 200 OK is not enough. You need to know what the model said, what it was asked, and whether the answer was right. This is unique to AI systems and requires a different instrumentation approach than traditional request/response services.
The goal of observability is not to have dashboards. It is to reduce the time between "something is wrong" and "here is what is wrong and here is the fix." Every metric, trace, and log entry should serve that goal. If it does not, it is noise.