Building an Eval Framework for AI Compliance Scanning
AI scanners find accessibility violations that deterministic tools miss — but how do you know the AI is not hallucinating? We built an evaluation harness that measures precision and recall against axe-core ground truth, gates every prompt change with regression tests, and tracks quality drift in ClickHouse.
Our AI accessibility scanner found a color-contrast violation that axe-core missed. The client's designer had used #777777 on #ffffff — a 4.48:1 ratio that clears WCAG AA's 3:1 large-text threshold but falls just short of the 4.5:1 required for normal text. The text was 13px regular, nowhere near the large-text cutoff of 18pt (or 14pt bold), so the stricter threshold applied. The AI caught the nuance. axe-core did not.
The next week, the same scanner flagged a perfectly accessible nav menu as a keyboard-trap violation. The client spent two hours investigating a non-issue before we realized the AI had hallucinated it.
That was the week we stopped trusting AI output and started measuring it.
The measurement problem
When you ship a rule-based scanner like axe-core, you know exactly what it checks and what it misses. The ruleset is deterministic. Run it twice, get the same results.
AI scanners are different. They can catch violations that no rule set covers — layout issues, reading-order problems, context-dependent alt text quality — but they can also invent violations that do not exist. Worse, their behavior changes with every prompt tweak, model update, or temperature adjustment.
Without measurement, prompt engineering is guesswork. You change the system prompt, run it against a few test pages, eyeball the results, and ship. Maybe it got better. Maybe it got worse in ways you will not notice until a client reports a false positive.
We needed a framework that answered two questions with numbers, not intuition:
- Precision: When the AI says something is a violation, how often is it actually a violation?
- Recall: Of all the real violations that exist, how many does the AI find?
Ground truth: axe-core as the baseline
The first decision was choosing a ground truth. We needed a source of "known correct" findings to measure our AI against.
axe-core was the obvious choice. It is deterministic, well-maintained, and has well-understood coverage. It does not catch everything — that is why we built the AI scanner — but for the WCAG criteria it does cover, its findings are reliable.
We built an eval dataset of 500+ pages spanning different industries, frameworks, and accessibility profiles. For each page, we ran axe-core and stored the findings as ground truth. The dataset includes:
- Clean pages with zero violations (to measure false positive rate)
- Known-violation pages with specific, confirmed issues (to measure recall)
- Edge cases where the correct answer is ambiguous (to measure where AI adds value vs. where it hallucinates)
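A single record in such a dataset needs only a URL, its category, and the ground-truth findings. A minimal sketch (the field names and classes here are illustrative, not our actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class GroundTruthFinding:
    criterion: str   # WCAG criterion, e.g. "1.4.3"
    selector: str    # CSS selector of the offending element

@dataclass
class EvalPage:
    url: str
    category: str    # "clean", "known-violation", or "edge-case"
    findings: list = field(default_factory=list)  # axe-core ground truth

# A clean page contributes only to the false-positive measurement:
clean = EvalPage(url="https://example.com/about", category="clean")

# A known-violation page carries confirmed axe-core findings:
broken = EvalPage(
    url="https://example.com/signup",
    category="known-violation",
    findings=[GroundTruthFinding("1.4.3", "p.disclaimer")],
)
```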
The eval harness
The harness runs the AI scanner against every page in the eval dataset and compares results to the axe-core ground truth. It computes metrics at three levels:
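The comparison itself can be plain set arithmetic. A sketch, assuming findings are keyed on (criterion, selector) pairs — how you identify "the same finding" is the design decision that matters here:

```python
def compare_findings(ai, truth):
    """Classify AI findings against ground truth for one page.

    Both inputs are sets of (criterion, selector) pairs.
    Returns (true_positives, false_positives, false_negatives).
    """
    tp = ai & truth   # AI found it, and it is real
    fp = ai - truth   # AI invented it
    fn = truth - ai   # AI missed it
    return tp, fp, fn

ai = {("1.4.3", "p.fine-print"), ("2.4.7", "a.nav-link")}
truth = {("1.4.3", "p.fine-print"), ("1.1.1", "img.hero")}
tp, fp, fn = compare_findings(ai, truth)
```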
Per-rule metrics
For each WCAG criterion (e.g., 1.4.3 Contrast, 2.4.7 Focus Visible), we track:
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1 score: Harmonic mean of precision and recall
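These three metrics reduce to a few lines once you have the raw counts per rule; the only subtlety is guarding the zero denominators on rules with no findings:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw per-rule counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 45 true positives, 5 false positives, 15 misses:
p, r, f1 = precision_recall_f1(45, 5, 15)
# p = 0.9, r = 0.75, f1 ≈ 0.818
```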
This tells us exactly where the AI is strong (e.g., 98% precision on alt-text violations) and where it is weak (e.g., 71% precision on ARIA role assignments).
Confusion matrix by WCAG criterion
A confusion matrix for each criterion shows exactly how the AI is failing. Is it confusing aria-label issues with aria-labelledby issues? Is it flagging role="navigation" as a landmark violation when it is actually correct?
This level of granularity turned vague "the AI seems worse" feedback into actionable "the AI is misclassifying ARIA landmark roles 29% of the time" data.
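A per-criterion confusion matrix can be as simple as a counter keyed on (expected, predicted) label pairs. A sketch with hypothetical labels — the off-diagonal entries are where the actionable data lives:

```python
from collections import Counter

def confusion_matrix(pairs):
    """Count (expected, predicted) label pairs; off-diagonal
    entries show which criteria are being confused with which."""
    return Counter(pairs)

# Hypothetical eval results: the scanner keeps labeling
# aria-label issues as aria-labelledby issues.
pairs = [
    ("aria-label", "aria-label"),
    ("aria-label", "aria-labelledby"),
    ("aria-label", "aria-labelledby"),
    ("landmark-role", "landmark-role"),
]
cm = confusion_matrix(pairs)
```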
Aggregate scores
Rolled-up precision, recall, and F1 across the entire eval dataset give us a single quality number to track over time. When we started measuring, aggregate precision was 78%. After three rounds of measurement-driven prompt iteration, we reached 94%.
LLM-as-judge for the ambiguous middle
Not every finding maps cleanly to a true/false classification. Some accessibility issues are genuinely ambiguous:
- An image has alt text, but is it good alt text?
- A heading structure is technically valid, but does it make semantic sense?
- A focus indicator exists, but is it visible enough?
For these cases, we added an LLM-as-judge validation layer. A second model reviews the primary scanner's findings and classifies each as confirmed, rejected, or needs-human-review. The judge model uses a different prompt optimized for verification rather than discovery, reducing the risk of correlated errors.
The judge layer reduced our false positive rate by 34% without meaningfully affecting recall. It is especially effective at catching the AI scanner's tendency to over-report ARIA violations on modern component libraries that handle accessibility correctly.
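The routing around the judge is the easy part; the judge itself is just a second model call. A sketch with the model call stubbed out — `judge` here is a placeholder callable, not a real API:

```python
def triage(findings, judge):
    """Route each finding on the judge model's verdict.

    `judge(finding)` returns "confirmed", "rejected", or
    "needs-human-review"; unknown verdicts fall through to review.
    """
    buckets = {"confirmed": [], "rejected": [], "needs-human-review": []}
    for finding in findings:
        verdict = judge(finding)
        buckets.get(verdict, buckets["needs-human-review"]).append(finding)
    return buckets

# Stub judge: reject ARIA findings on pages built with a vetted
# component library (a hypothetical verification rule).
def stub_judge(finding):
    if finding["criterion"].startswith("aria") and finding["vetted_library"]:
        return "rejected"
    return "confirmed"

buckets = triage(
    [{"criterion": "aria-roles", "vetted_library": True},
     {"criterion": "1.4.3", "vetted_library": False}],
    stub_judge,
)
```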
Regression testing: gating prompt changes
Every prompt change, model update, or system configuration change runs through the eval harness before it ships. The CI pipeline:
- Runs the AI scanner against the full eval dataset
- Computes per-rule and aggregate metrics
- Compares against the last known-good baseline
- Blocks the merge if any per-rule F1 drops by more than 2 percentage points or aggregate precision drops below 90%
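The gate itself is a small pure function, which keeps it trivially testable in CI. A sketch of the decision logic with the thresholds from the list above (function and parameter names are illustrative):

```python
def gate(baseline_f1, candidate_f1, aggregate_precision,
         max_f1_drop=0.02, min_precision=0.90):
    """Return blocking reasons; an empty list means the merge may proceed.

    baseline_f1 / candidate_f1: per-rule F1 keyed by WCAG criterion.
    """
    reasons = []
    for rule, old in baseline_f1.items():
        new = candidate_f1.get(rule, 0.0)
        if old - new > max_f1_drop:
            reasons.append(f"{rule}: F1 dropped {old:.2f} -> {new:.2f}")
    if aggregate_precision < min_precision:
        reasons.append(f"aggregate precision {aggregate_precision:.2f} "
                       f"below {min_precision:.2f}")
    return reasons

blockers = gate(
    baseline_f1={"1.4.3": 0.93, "aria-landmarks": 0.91},
    candidate_f1={"1.4.3": 0.94, "aria-landmarks": 0.67},
    aggregate_precision=0.92,
)
# aria-landmarks regressed past the 2-point budget, so the merge is blocked
```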
This is the most important piece. Without regression gating, prompt improvements in one area silently cause regressions in another. We have caught three regressions that would have shipped to production — including one where a "minor" system prompt clarification dropped ARIA landmark precision from 91% to 67%.
Online monitoring: tracking drift in production
Eval datasets are a snapshot. Production traffic is not. Pages change, frameworks update, and AI model behavior drifts over time. We track production quality using two signals:
Confirmation rate
When clients review AI findings in their reports, they can mark each finding as confirmed, false-positive, or needs-investigation. We aggregate confirmation rates per WCAG criterion and compare against our offline eval metrics.
If the production confirmation rate for a criterion drops more than 5 points below the offline precision, we flag it for investigation. This has caught two cases where real-world page patterns diverged from our eval dataset.
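The drift check is a one-liner per criterion: compare the online confirmation rate to the offline precision and flag anything past the gap threshold. A minimal sketch:

```python
def drifting_criteria(offline_precision, confirmation_rate, threshold=0.05):
    """Flag criteria whose production confirmation rate has fallen
    more than `threshold` below the offline eval precision."""
    return sorted(
        criterion
        for criterion, offline in offline_precision.items()
        if offline - confirmation_rate.get(criterion, 0.0) > threshold
    )

flags = drifting_criteria(
    offline_precision={"1.4.3": 0.95, "2.4.7": 0.88},
    confirmation_rate={"1.4.3": 0.93, "2.4.7": 0.79},
)
# 2.4.7 is 9 points below its offline precision: investigate
```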
ClickHouse time-series dashboards
All eval run results — offline and online — stream into ClickHouse for time-windowed aggregation. The dashboard shows:
- Precision/recall trends per WCAG criterion over time
- Model version comparison (side-by-side quality metrics when evaluating model upgrades)
- Latency and cost per scan, correlated with quality metrics
- False positive rate by page category (e-commerce, government, SPA, etc.)
ClickHouse handles the high-cardinality queries well — slicing by criterion, model version, page category, and time window simultaneously without the query performance issues we hit with PostgreSQL.
What changed when we started measuring
Before the eval framework, we had opinions about AI quality. After, we had data.
The numbers surprised us. We assumed the AI scanner was better than axe-core across the board. It was not. For simple, well-defined rules like missing alt text or duplicate IDs, axe-core is faster, more reliable, and does not cost API credits. The AI adds value specifically on context-dependent checks — reading order, semantic structure, focus management — where rules cannot capture the nuance.
Prompt engineering became scientific. Instead of "does this prompt feel better?", we asked "does this prompt improve F1 on ARIA landmarks without regressing color contrast precision?" The answer was usually more nuanced than we expected.
Model upgrades became low-risk. When OpenAI shipped a new model version, we ran it against the eval dataset before switching. One upgrade improved overall precision by 3 points but regressed keyboard navigation recall by 8 points. We would not have caught that without the harness.
Client trust improved. We could tell clients our scanner has 94% precision on the criteria it covers. That is a different conversation than "our AI is really good." Numbers close deals.
The meta-lesson
The eval framework took three weeks to build. It has saved us from shipping quality regressions at least five times. The ROI is not a close call.
If you are building AI features that make claims — compliance findings, content recommendations, diagnostic results — you need to measure those claims against ground truth before you optimize. Otherwise, you are navigating by intuition in a system that changes behavior with every prompt edit.
Build the measurement framework first. Then optimize. The order matters.