
RAG in Practice: Grounding AI Claims in Authoritative Sources

Our AI scanner said a page violated WCAG 2.4.7 — but it cited the wrong success criterion. RAG fixed the hallucination problem by grounding every AI claim in the actual specification, with retrieval metrics that prove the system works.

Apr 11, 2026 · Desmond Tatilian

The AI scanner told a client their page violated WCAG 2.4.7 (Focus Visible). The finding was correct — the page did have a focus visibility issue. But the remediation guidance cited success criterion 2.4.7 while describing the requirements of 2.4.11 (Focus Not Obscured). The client's developer "fixed" the wrong thing based on the AI's recommendation, and the actual violation persisted.

This was not a hallucination in the traditional sense. The AI correctly identified the problem. It just could not accurately attribute its recommendation to the right part of the specification. Without grounding, every AI claim is unverifiable — and unverifiable claims in compliance reports are dangerous.

If you want the production tradeoffs behind embedding size decisions, read Vector Dimensions in Production RAG.

Why grounding matters for compliance

In most AI applications, a slightly inaccurate response is an inconvenience. In compliance scanning, it is a liability. When we tell a client "your page violates WCAG 2.4.7," they need to:

  1. Look up what 2.4.7 requires
  2. Understand how their page fails to meet it
  3. Apply the correct remediation technique

If any of those steps point to the wrong criterion, the fix is wrong. The client wastes development time, the violation persists, and their legal exposure remains. We needed every AI-generated remediation to cite the exact WCAG criterion, the specific success technique, and a clear explanation of why the page fails — all grounded in the actual specification, not the model's training data.

Source corpus: what we index

The RAG pipeline indexes three document sets:

WCAG 2.2 specification

The full W3C WCAG 2.2 specification, including all Level A and AA success criteria. Each criterion is indexed with its criterion ID, title, intent, and normative requirements.

Understanding WCAG 2.2

The W3C "Understanding" documents that explain each criterion in depth — intent, benefits, examples of failures, and relationships to other criteria. These provide the context that makes remediation guidance actionable.

WCAG Techniques

The sufficient and advisory techniques for each criterion (e.g., G18: Ensuring a 4.5:1 contrast ratio, H44: Using label elements). These map directly to "here is how to fix it" guidance.

Chunking strategy

Chunking strategy makes or breaks RAG quality. We tested three approaches:

Fixed-size chunking (abandoned)

Splitting documents into 512-token chunks with 64-token overlap. Simple to implement, terrible for WCAG content. Success criteria would get split mid-sentence, and the model would receive half of a requirement without context.

Document-level chunking (too coarse)

One chunk per success criterion. Retrieval precision was high — when the right criterion was retrieved, the model had everything it needed. But the chunks were too large for the embedding model to represent well, and retrieval recall suffered for queries that touched multiple criteria.

Section-level chunking with criterion metadata (what we use)

Each chunk is a logical section within a success criterion: the normative requirement, the intent section, each example, each technique. Every chunk carries metadata: the parent criterion ID, the section type, and the criterion title.

This gives us fine-grained retrieval (the model gets the specific section it needs) with reliable attribution (every chunk traces back to a specific criterion through its metadata). The metadata also enables hybrid search — vector similarity for semantic matching, plus keyword filtering on criterion IDs when the AI scanner's finding already references a specific criterion.
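The section-level approach can be sketched as a small chunking function. The field names here are illustrative, not the production schema; the point is that every chunk carries its parent criterion ID so attribution survives retrieval.

```python
def chunk_criterion(criterion: dict) -> list[dict]:
    """Split one success criterion into section-level chunks,
    each carrying attribution metadata back to its parent criterion."""
    chunks = []
    for section_type, text in criterion["sections"].items():
        chunks.append({
            "text": text,
            "criterion_id": criterion["id"],        # e.g. "2.4.7"
            "criterion_title": criterion["title"],  # e.g. "Focus Visible"
            "section_type": section_type,           # "normative", "intent", ...
        })
    return chunks

# Toy input standing in for a parsed spec document.
focus_visible = {
    "id": "2.4.7",
    "title": "Focus Visible",
    "sections": {
        "normative": "Any keyboard operable user interface has a mode of "
                     "operation where the keyboard focus indicator is visible.",
        "intent": "Help a person know which element has keyboard focus.",
    },
}

for chunk in chunk_criterion(focus_visible):
    print(chunk["criterion_id"], chunk["section_type"])
```

Because the criterion ID rides along as plain metadata rather than free text, it can later feed both the keyword filter and the citation in the final output.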

Embedding model selection

We evaluated two OpenAI embedding models on our WCAG corpus:

text-embedding-ada-002

The older, cheaper model. It performed adequately on general accessibility queries but struggled with the technical precision needed for WCAG content. When a query asked about "focus visible in 2.4.7," ada-002 would sometimes return chunks from 2.4.11 (Focus Not Obscured) or 2.4.3 (Focus Order) — close, but wrong for compliance purposes.

MRR@5: 0.72 | Recall@10: 0.81

text-embedding-3-small

Significantly better at distinguishing between semantically similar but technically distinct WCAG criteria. The dimensional control (we use 512 dimensions) keeps storage manageable while maintaining the precision we need.

MRR@5: 0.89 | Recall@10: 0.94

The 17-point improvement in MRR@5 translated directly to fewer misattributed remediation suggestions. We switched and did not look back.
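The dimensional control mentioned above works because text-embedding-3 vectors can be shortened: keep a prefix of the components and renormalize to unit length, which is what requesting fewer dimensions from the API effectively gives you. A minimal local sketch of that operation (the 1536-value input is a stand-in, not a real embedding):

```python
import math

def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components, then rescale to unit length
    so cosine similarity remains meaningful."""
    short = vec[:dims]
    norm = math.sqrt(sum(x * x for x in short))
    return [x / norm for x in short]

# Stand-in for a full-size text-embedding-3-small vector (1536 dims).
full = [0.01 * (i % 7 + 1) for i in range(1536)]
small = truncate_and_normalize(full, 512)
print(len(small))  # 512
```

At 512 dimensions, storage and index size drop to a third of the full vector with little loss in retrieval precision on our corpus.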

Hybrid search: vectors plus keywords

Pure vector search works well for semantic queries ("how do I make my focus indicator more visible?") but poorly for structured lookups ("what does WCAG 2.4.7 require?"). Since many AI scanner findings already reference specific criterion IDs, we use hybrid search:

  1. Vector similarity via pgvector <=> operator for semantic matching
  2. Keyword filtering on criterion ID metadata when the query contains a specific criterion reference
  3. Score fusion using Reciprocal Rank Fusion (RRF) to combine both signals

The hybrid approach improved retrieval accuracy on criterion-specific queries from 81% to 96%. For open-ended queries, the keyword component has minimal impact, so there is no downside.
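The score-fusion step is standard Reciprocal Rank Fusion. A minimal sketch, using the common k=60 constant (not a tuned value) and hypothetical chunk IDs:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine multiple ranked lists: each document scores 1/(k + rank)
    per list it appears in, and results are sorted by total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative chunk IDs from the vector and keyword retrievers.
vector_hits = ["2.4.7-intent", "2.4.11-normative", "2.4.7-normative"]
keyword_hits = ["2.4.7-normative", "2.4.7-intent"]
print(rrf_fuse([vector_hits, keyword_hits]))
```

Chunks that both retrievers agree on rise to the top, while a chunk only one signal found still survives with a lower score.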

Reranking: cross-encoder for top-k refinement

Initial retrieval returns the top 20 candidates. A cross-encoder reranker (we use a small model fine-tuned on technical documentation) re-scores these candidates with full query-document attention and selects the top 5 for context injection.

Reranking adds ~200ms of latency per query. We justified this because:

  • It improved precision@5 from 0.82 to 0.91
  • False attribution (citing the wrong WCAG criterion) dropped by 42%
  • The latency is negligible compared to the AI scanner's total scan time
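The rerank step itself is a thin wrapper around a query-document scorer. In production the scorer is a cross-encoder model; here it is any callable, so the flow can be shown with a toy token-overlap scorer (names are illustrative):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Re-score retrieval candidates with full query-document scoring
    and keep the top_k for context injection."""
    return sorted(candidates, key=lambda doc: score(query, doc),
                  reverse=True)[:top_k]

def overlap(query: str, doc: str) -> float:
    """Toy scorer: shared tokens between query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["focus indicator must be visible", "focus order is logical",
        "labels describe purpose"]
print(rerank("visible focus indicator", docs, overlap, top_k=2))
```

Swapping the toy scorer for a cross-encoder's predict call changes nothing about the surrounding pipeline, which is what made the reranker easy to A/B test.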

Retrieval metrics: measuring what matters

We track retrieval quality separately from end-to-end AI quality. This matters because when the AI gives a wrong answer, you need to know whether the problem is retrieval (wrong context) or generation (right context, wrong interpretation).

MRR (Mean Reciprocal Rank)

How high does the correct chunk rank in the retrieval results? MRR@5 of 0.89 means the correct chunk is usually in the top 2 results.

Recall@k

What fraction of relevant chunks appear in the top k results? We track recall@5 and recall@10. For remediation guidance, we need recall@5 above 0.85 — the model should see the relevant criterion in its first 5 context chunks.

Attribution accuracy

End-to-end metric: when the AI cites a specific WCAG criterion in its remediation, is it the correct one? This is measured against human-labeled ground truth on our eval dataset. Current attribution accuracy: 96%.
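The two retrieval metrics are cheap to compute offline against a labeled eval set. A minimal sketch (single relevant chunk per query for MRR, a relevant set per query for recall):

```python
def mrr_at_k(results: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Mean reciprocal rank of the first relevant chunk within the top k."""
    total = 0.0
    for ranking, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break  # queries with no hit in top k contribute 0
    return total / len(results)

def recall_at_k(results: list[list[str]], relevant_sets: list[set[str]],
                k: int = 10) -> float:
    """Fraction of relevant chunks appearing in the top k, averaged per query."""
    total = 0.0
    for ranking, rel in zip(results, relevant_sets):
        total += len(set(ranking[:k]) & rel) / len(rel)
    return total / len(results)

results = [["a", "b", "c"], ["c", "a", "b"]]
print(mrr_at_k(results, ["a", "a"], k=5))  # (1 + 1/2) / 2 = 0.75
```

Tracking these per retriever component is what lets you tell a retrieval failure apart from a generation failure.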

Attribution in output

The RAG pipeline does not just improve AI quality — it enables citation. Every remediation suggestion includes:

  • The specific WCAG criterion (e.g., "2.4.7 Focus Visible")
  • The relevant success technique (e.g., "G195: Using an author-supplied, visible focus indicator")
  • A direct quote from the specification supporting the finding

This transforms AI output from "you should make your focus indicator more visible" to "this page fails WCAG 2.4.7 (Focus Visible) because interactive elements in the navigation have no visible focus indicator when they receive keyboard focus, as required by Success Criterion 2.4.7. Apply technique G195 to provide a visible focus indicator, for example a 2px solid outline."

Clients trust cited output. Their developers can verify the citation. Compliance officers can reference the specific criterion in their reports.

When RAG helps vs. when it adds latency

RAG is not universally beneficial. We found clear patterns in when it adds value:

RAG adds clear value

  • Remediation guidance: Grounding recommendations in specific techniques and criteria
  • Criterion disambiguation: Distinguishing between similar criteria (2.4.7 vs. 2.4.11 vs. 2.4.3)
  • Technique selection: Recommending the right fix pattern from the WCAG Techniques library

RAG adds latency without improving quality

  • Simple rule violations: Missing alt text, duplicate IDs, empty buttons — the model knows these well enough from training data
  • Binary pass/fail checks: When the answer is just "yes this violates X" with no remediation needed
  • High-confidence findings: When the AI scanner's confidence score is above 0.95, RAG rarely changes the output

We conditionally apply RAG based on finding type and confidence score. Simple, high-confidence findings skip the retrieval step entirely, saving ~400ms per finding. Complex or low-confidence findings always go through RAG.
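The gate itself is a few lines. Per the rules above, retrieval is skipped only when a finding is both simple and high-confidence; the finding-type names here are illustrative, the 0.95 threshold is the one from our scanner:

```python
# Finding types the model handles well from training data alone (illustrative names).
SIMPLE_FINDINGS = {"missing_alt_text", "duplicate_id", "empty_button"}

def should_use_rag(finding_type: str, confidence: float) -> bool:
    """Skip retrieval only for simple, high-confidence findings;
    complex or low-confidence findings always go through RAG."""
    simple = finding_type in SIMPLE_FINDINGS
    high_confidence = confidence > 0.95
    return not (simple and high_confidence)

print(should_use_rag("missing_alt_text", 0.99))  # False: skip, saves ~400ms
print(should_use_rag("focus_visible", 0.99))     # True: complex finding
print(should_use_rag("missing_alt_text", 0.70))  # True: low confidence
```

Keeping the gate this dumb was deliberate: it is trivially auditable, and any finding it gets wrong still degrades gracefully to the full RAG path.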

The lesson

RAG is not a magic accuracy booster. It is an attribution system. The value is not that the AI gives better answers — though it does — but that every answer can be traced back to an authoritative source.

For compliance use cases, that traceability is the product. Without it, AI findings are opinions. With it, they are verifiable claims backed by specific, citable standards.
