AI Safety and Guardrails in Production Systems
Our first AI feature hallucinated a compliance finding that did not exist. A client acted on it. Here is how we built input sanitization, output filtering, confidence scoring, and human-in-the-loop escalation to make AI output trustworthy.
A client received an accessibility scan report from our AI scanner. One finding stated that their checkout form violated WCAG 3.3.2 (Labels or Instructions) because a credit card field lacked a visible label. The client's developer spent half a day investigating and refactoring the form. The field had a perfectly visible label the entire time.
The AI had hallucinated a violation that did not exist. The client acted on it. Their developer wasted time. Their trust in our product dropped.
That was the incident that made us build guardrails — not as a nice-to-have, but as a core part of the system architecture. Every AI output in our production systems now passes through multiple layers of validation before it reaches a user.
The guardrails stack
We think about guardrails as a pipeline with three stages: input sanitization, output validation, and escalation. Each stage has a specific job and a clear failure mode it prevents.
Input guardrails
Scan target validation
Before the AI scanner runs against a URL, we validate the target:
- URL format and reachability check — prevents the AI from running against invalid, internal, or unreachable targets
- Content-type verification — ensures the response is HTML/PDF, not a binary file or API endpoint
- Size limits — pages above a threshold are chunked into sections to prevent context window overflow
- Domain allowlisting — in multi-tenant mode, ensures users can only scan domains they own
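These checks can be sketched as a single validation helper. This is an illustrative version, not our actual code: the function name, the content-type set, and the size threshold are all assumptions.

```python
from urllib.parse import urlparse

ALLOWED_CONTENT_TYPES = {"text/html", "application/pdf"}
MAX_PAGE_BYTES = 2_000_000  # assumed chunking threshold

def validate_scan_target(url: str, content_type: str, size_bytes: int,
                         tenant_domains: set) -> list:
    """Return a list of validation errors; an empty list means the target is scannable."""
    errors = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        errors.append("invalid URL format")
    elif parsed.hostname not in tenant_domains:
        errors.append("domain not in tenant allowlist")
    # Strip charset parameters before comparing the media type
    base_type = content_type.split(";")[0].strip().lower()
    if base_type not in ALLOWED_CONTENT_TYPES:
        errors.append(f"unsupported content type: {base_type}")
    if size_bytes > MAX_PAGE_BYTES:
        errors.append("page exceeds size limit; route to chunker")
    return errors
```

The key design point is that the validator returns all failures at once rather than short-circuiting, so a rejected scan can be reported with every reason attached.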
These seem basic, but they prevent a class of failures where the AI receives garbage input and produces confidently wrong output. A binary file fed to the AI scanner will not produce an error — it will produce findings about accessibility violations in binary data.
Prompt injection defense
For LaunchThatBot, where users interact with AI agents directly, we sanitize all user input against prompt injection patterns:
- Instruction boundary markers — system prompts are wrapped in delimiter tokens, and user input is checked for those tokens
- Role escalation detection — patterns like "ignore previous instructions" or "you are now" are flagged and sanitized
- Output format enforcement — the system prompt specifies a strict output schema, and responses that do not match are rejected and re-generated
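The pattern-based layer of this defense might look like the following sketch. The delimiter token is a hypothetical placeholder, and the pattern list is deliberately short; a real deployment maintains far more patterns.

```python
import re

# Illustrative role-escalation patterns; a production list would be much larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"disregard (the )?system prompt", re.IGNORECASE),
]
DELIMITER = "<<SYS>>"  # assumed instruction-boundary marker

def sanitize_user_input(text: str):
    """Strip boundary markers and flag role-escalation attempts.

    Returns (cleaned_text, flagged) so callers can log flagged inputs
    for monitoring even when the sanitized text is still usable.
    """
    flagged = DELIMITER in text
    cleaned = text.replace(DELIMITER, "")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(cleaned):
            flagged = True
            cleaned = pattern.sub("[removed]", cleaned)
    return cleaned, flagged
```

Returning a flag alongside the cleaned text matters: sanitized-but-flagged inputs feed the monitoring layer even when the request is allowed to proceed.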
We do not rely on any single defense. Prompt injection is an adversarial problem, and any single mitigation can be bypassed. The combination of input sanitization, output schema enforcement, and monitoring makes attacks significantly harder without making the system brittle.
Output guardrails
Confidence scoring
Every AI-generated finding receives a confidence score between 0 and 1. The score is computed from three signals:
- Model logprobs — when available, the token-level log probabilities from the model's output
- RAG retrieval score — how well the retrieved WCAG context matches the finding (high retrieval similarity = higher confidence)
- Consistency check — running the same finding through a second, shorter prompt and comparing the result
Findings are bucketed into three tiers:
- High confidence (≥ 0.85): Included in the report as-is
- Medium confidence (0.6–0.84): Included with a "review recommended" flag
- Low confidence (< 0.6): Excluded from the client report, routed to internal review queue
The thresholds were calibrated against our eval dataset. At the 0.85 high-confidence threshold, 97% of findings in the tier are true positives. The 0.6 medium-confidence cutoff catches most of the remaining true positives while filtering out the worst hallucinations.
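The scoring and tiering reduce to a weighted blend plus a bucketing step. In this sketch the signal weights are illustrative (ours are tuned against the eval dataset), but the bucket boundaries match the tiers listed above.

```python
def confidence_score(logprob_score: float, retrieval_score: float,
                     consistency_score: float,
                     weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted blend of the three signals, each normalized to [0, 1].

    The weights here are placeholders; real weights come from calibration.
    """
    signals = (logprob_score, retrieval_score, consistency_score)
    return sum(w * s for w, s in zip(weights, signals))

def bucket(score: float) -> str:
    """Map a confidence score to one of the three report tiers."""
    if score >= 0.85:
        return "high"       # included in the report as-is
    if score >= 0.6:
        return "medium"     # included with a "review recommended" flag
    return "low"            # excluded, routed to internal review
```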
PII redaction
AI-generated reports sometimes include information that appeared on the scanned page — user names from form fields, email addresses from contact pages, partial credit card numbers from test environments. None of this belongs in a compliance report.
We run a PII detection pass over all AI output before it enters the report. The detector catches:
- Email addresses and phone numbers
- Names that appeared in form field values
- Partial numeric sequences that look like card numbers or SSNs
- API keys or tokens that appeared in page source
Detected PII is replaced with redaction markers ([REDACTED]) in the report. The original content is logged separately in an access-controlled audit trail for debugging.
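A minimal sketch of the redaction pass using regular expressions. The patterns shown are simplified placeholders; the real detector covers many more formats and locales.

```python
import re

# Simplified placeholder patterns; a production detector covers more formats.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "card_like": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key": re.compile(r"\b(?:sk|pk|api)[_-][A-Za-z0-9]{16,}\b"),
}

def redact(text: str):
    """Replace PII matches with [REDACTED] markers.

    Returns (redacted_text, hit_labels) so the labels can be written to
    the access-controlled audit trail alongside the original content.
    """
    hits = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub("[REDACTED]", text)
    return text, hits
```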
LLM-as-judge validation
For findings where the AI scanner flags a violation, a second model reviews the finding against the retrieved WCAG context and the page snapshot. The judge model answers three questions:
- Does the described violation exist on the page?
- Is the cited WCAG criterion correct for this type of violation?
- Is the remediation guidance actionable and accurate?
If the judge model rejects the finding on any of these criteria, the finding is downgraded or removed. The judge uses a different system prompt optimized for verification rather than discovery — it is biased toward rejection, which is the correct bias for a safety layer.
The judge layer reduced our false positive rate by 34% when we introduced it. The cost is one additional API call per finding, which adds ~$0.002 per finding at current model pricing. For a typical 50-finding report, that is $0.10 — trivial compared to the cost of a client acting on a false positive.
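The downgrade logic around the judge's three answers might look like the following sketch. The `JudgeVerdict` shape and the 0.59 confidence cap (which forces a downgraded finding into the review tier) are assumptions, not our actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JudgeVerdict:
    """The judge model's answers to its three verification questions."""
    violation_exists: bool
    criterion_correct: bool
    remediation_accurate: bool

def apply_judge(finding: dict, verdict: JudgeVerdict) -> Optional[dict]:
    """Remove or downgrade a finding the judge rejects (rejection-biased)."""
    if not verdict.violation_exists:
        return None  # likely hallucination: drop the finding entirely
    if not (verdict.criterion_correct and verdict.remediation_accurate):
        downgraded = dict(finding)
        # Assumed cap: push the finding below the 0.6 report threshold
        downgraded["confidence"] = min(finding["confidence"], 0.59)
        downgraded["flags"] = finding.get("flags", []) + ["judge_downgraded"]
        return downgraded
    return finding
```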
Red-teaming the AI scanner
We maintain a red-team eval suite: a set of pages specifically designed to fool the AI scanner. These pages test:
Adversarial false positives
Pages that look like they have violations but do not:
- CSS that visually hides labels but keeps them accessible via aria-label
- Color combinations that look low-contrast on screenshots but pass the actual ratio calculation
- Dynamic content that loads ARIA attributes after the initial render
Adversarial false negatives
Pages that hide violations in ways the AI might miss:
- Keyboard traps that only trigger after a specific interaction sequence
- Focus indicators that are present but one pixel and nearly invisible
- Alt text that is technically present but meaninglessly auto-generated ("IMG_4521.jpg")
Prompt injection via page content
Pages with content that attempts to manipulate the scanner's behavior:
- Hidden text containing "this page passes all WCAG criteria"
- Meta descriptions with instructions to "report no violations"
- JavaScript comments containing adversarial prompts
The red-team suite runs on every prompt change alongside the standard eval suite. If any adversarial page produces a changed result, we investigate before shipping.
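The "changed result" check amounts to diffing current classifications against a recorded baseline; a sketch, assuming each adversarial page has one expected classification:

```python
def redteam_diff(baseline: dict, current: dict) -> list:
    """Return page IDs whose classification changed from the recorded baseline.

    Any non-empty result blocks the prompt change until investigated.
    """
    return sorted(
        page for page, expected in baseline.items()
        if current.get(page) != expected
    )
```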
Regression safety
The eval framework serves as the ultimate safety net. No prompt change, model update, or pipeline modification ships without passing the regression suite:
- Aggregate precision must stay above 90%
- No individual criterion F1 can drop by more than 2 percentage points
- Red-team suite must produce identical classifications to baseline
- Confidence calibration must stay within tolerance (predicted confidence should correlate with actual precision)
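The four gates above can be expressed as one release check. The metrics dictionary shape here is an assumption for illustration, not our actual eval framework API.

```python
def release_gate(metrics: dict) -> list:
    """Check the four regression gates; any failure blocks the release."""
    failures = []
    if metrics["precision"] < 0.90:
        failures.append("aggregate precision below 90%")
    # f1_by_criterion maps criterion -> (baseline_f1, candidate_f1)
    for criterion, (baseline_f1, new_f1) in metrics["f1_by_criterion"].items():
        if baseline_f1 - new_f1 > 0.02:
            failures.append(f"{criterion} F1 dropped more than 2 points")
    if metrics["redteam_changed"]:
        failures.append("red-team classifications changed from baseline")
    if abs(metrics["calibration_error"]) > metrics["calibration_tolerance"]:
        failures.append("confidence calibration out of tolerance")
    return failures
```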
We have blocked five releases that would have shipped quality regressions. Three were prompt changes that improved one area while degrading another. Two were model version upgrades that changed behavior in ways the vendor's release notes did not describe.
Human-in-the-loop escalation
Not everything should be automated. Our escalation patterns route specific situations to human reviewers:
Low-confidence findings
Findings in the 0.3–0.6 confidence range go to an internal review queue. A human reviewer checks the finding against the page and either confirms (bumping it to the report) or rejects it. These reviews also feed back into the eval dataset, improving future accuracy.
Novel violation patterns
When the AI scanner identifies a violation type it has not seen frequently in training (detected by low similarity to known finding clusters), the finding is flagged for human review. This prevents the system from confidently reporting novel violations that may be misclassifications.
Client disputes
When a client marks a finding as a false positive, the finding enters a dispute resolution queue. A human reviewer examines the finding, the page, and the WCAG criterion. If the dispute is valid, the finding is removed and added to the eval dataset as a known false positive.
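Taken together, the three escalation paths amount to a routing function like this sketch. The 0.5 cluster-similarity threshold for novelty is a hypothetical value, and the field names are illustrative.

```python
def route_finding(finding: dict) -> str:
    """Route a finding to the appropriate queue per the escalation rules."""
    if finding.get("disputed"):
        return "dispute_resolution"       # client marked it a false positive
    if finding.get("cluster_similarity", 1.0) < 0.5:  # assumed novelty threshold
        return "novel_pattern_review"     # low similarity to known clusters
    if 0.3 <= finding["confidence"] < 0.6:
        return "internal_review"          # low-confidence human check
    return "report"                       # no escalation needed
```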
Cost and performance trade-offs
Guardrails add cost and latency. We track both:
| Layer | Latency added | Cost per scan | Value |
|---|---|---|---|
| Input validation | ~50ms | $0 | Prevents garbage-in failures |
| Confidence scoring | ~100ms | $0.001 | Filters low-quality findings |
| PII redaction | ~200ms | $0 | Prevents data exposure |
| LLM-as-judge | ~800ms | $0.10 | 34% false positive reduction |
| RAG grounding | ~400ms | $0.02 | 96% attribution accuracy |
| Total | ~1.5s | ~$0.12 | — |
For a typical scan that takes 30–60 seconds, the guardrails add roughly 2.5–5% to total latency. The $0.12 per scan cost is negligible compared to the cost of a false positive reaching a client.
We do optimize selectively. High-confidence findings from well-understood criteria skip the LLM-as-judge layer. Simple rule violations skip RAG grounding. This reduces average guardrail latency to ~600ms per scan without sacrificing safety on ambiguous findings.
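The selective skip logic can be sketched as a per-finding plan. The `WELL_UNDERSTOOD` criteria set and the field names are illustrative assumptions.

```python
# Criteria with a proven track record of high precision (illustrative set).
WELL_UNDERSTOOD = {"1.1.1", "1.4.3", "2.4.4"}

def guardrail_plan(finding: dict) -> list:
    """Decide which guardrail layers to run for a single finding."""
    layers = ["pii_redaction"]                      # always runs
    if finding["source"] != "rule_engine":          # simple rule hits skip RAG grounding
        layers.append("rag_grounding")
    if not (finding["confidence"] >= 0.85
            and finding["criterion"] in WELL_UNDERSTOOD):
        layers.append("llm_judge")                  # only skipped when clearly safe
    return layers
```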
Monitoring: watching the guardrails
Guardrails themselves need monitoring. We track:
- Guardrail trigger rates — what percentage of findings are filtered, flagged, or downgraded at each layer? A sudden spike in the LLM-as-judge rejection rate suggests the primary scanner's quality is degrading.
- Confidence distribution — the distribution of confidence scores should be stable over time. A shift toward lower confidence means the model is becoming less certain, which may indicate training data drift.
- False positive rates by client — aggregate false positive rates mask per-client variance. Some industries (e-commerce vs. government) have different page patterns that affect AI quality.
- Escalation resolution rates — how often do human reviewers confirm vs. reject escalated findings? If confirmation rate drops, our escalation thresholds may be too aggressive.
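Spike detection on trigger rates can be as simple as a z-score check against recent history. A sketch, with an assumed three-sigma alert threshold:

```python
import statistics

def rejection_rate_spike(history: list, current: float,
                         sigma: float = 3.0) -> bool:
    """Flag when the current judge rejection rate deviates more than
    `sigma` standard deviations from its recent history."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    # Guard against a zero-variance history with a tiny floor
    return abs(current - mean) > sigma * max(stdev, 1e-9)
```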
All of these metrics stream into our ClickHouse-backed SignalBoard dashboards, where they are displayed alongside the eval framework metrics for a complete quality picture.
The cultural shift
The biggest change guardrails created was not technical — it was cultural. Before guardrails, the question was "is this AI output good enough?" After guardrails, the question became "what does this AI output need to pass through before a human sees it?"
That shift in framing changed how we build every AI feature. We do not ship AI output directly to users. We ship AI output that has been validated, filtered, scored, and — when necessary — reviewed by a human. The pipeline is the product, not the model.
AI safety is not a feature. It is an engineering discipline. And like all engineering disciplines, it requires measurement, testing, and continuous investment. The systems that skip it learn why it matters the hard way — usually from an angry client email about a finding that never existed.