AdaScout
Sole Developer & Architect
Platform that scans websites for WCAG 2.2 AA compliance using multiple analysis engines. A dedicated scanner worker runs Playwright + axe-core against Browserless Chromium over CDP. A separate Convex action path uses Browserbase Stagehand with Gemini/MiniMax for AI-powered accessibility analysis. PDF documents are analyzed with pdfjs-dist (metadata, tagging, text layer, reading order, tables, images). An offline evaluation harness measures the precision and recall of AI findings against the deterministic axe-core baseline, with per-rule F1 scores and confusion matrices by WCAG criterion. A RAG pipeline over the WCAG 2.2 AA specification grounds remediation suggestions in cited success criteria. Confidence scoring gates AI findings: low-confidence results are escalated for human review, and an LLM-as-judge validation layer filters findings before report inclusion. Reports are exported as PDF (browser print) and Excel (exceljs). The platform mounts the BrowserLaunch Convex component for task orchestration.
Scan engines: axe-core, Stagehand AI, custom policy, PDF analysis
PDF checks: metadata, tagging, text layer, OCR confidence, reading order, tables, images
AI models: OpenAI gpt-4o (default), Google Gemini 2.5 Flash, MiniMax M2-Stable
Finding sources: axe, stagehand, policy, pdf, normalized into a unified model
Measured against axe-core deterministic baseline across 500+ page eval dataset
Per-criterion P/R/F1 tracked with regression gating on prompt changes
The Problem
Businesses face ADA lawsuits when their websites aren't accessible. Manual audits miss issues, existing tools only check one dimension, and PDF accessibility is often ignored entirely. A comprehensive solution needs to scan HTML, analyze PDFs, and provide AI-powered remediation guidance — but AI scanners can hallucinate violations, so the system also needs a way to measure and gate AI quality before findings reach client reports.
The Solution
Built a multi-engine scanning platform: (1) axe-core via Playwright on Browserless Chromium for rule-based HTML checks, (2) Browserbase Stagehand with Gemini for AI-powered WCAG 2.2 AA analysis, (3) pdfjs-dist pipeline for PDF accessibility (metadata, tagging, OCR, reading order). Custom policy checks (image-missing-alt, image-empty-alt) supplement axe. Results normalized from multiple sources (axe, stagehand, policy, pdf) into a unified findings model. An evaluation harness uses axe-core as ground truth to measure AI precision/recall, with LLM-as-judge validation for ambiguous edge cases. RAG over WCAG 2.2 AA guidelines grounds remediation suggestions with specific criterion citations. Confidence scoring prevents low-quality findings from reaching reports.
Technical Decisions
Key architecture decisions and their outcomes
Multi-engine over single-tool scanning
No single tool catches all accessibility issues. axe-core is rule-based and misses context. AI catches nuance but can hallucinate.
Combined axe-core for deterministic rules, Stagehand + Gemini for AI interpretation, custom policy checks for gaps, and pdfjs-dist for document accessibility.
Comprehensive coverage. Each engine's weaknesses are covered by another's strengths.
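As a rough illustration of how findings from four engines can share one model, the sketch below uses a TypeScript discriminated union keyed on `source`. The field names (`ruleId`, `wcagCriterion`, `selector`, `page`, `confidence`) and both helper functions are assumptions made for this example, not the platform's actual schema.

```typescript
// Hypothetical unified findings model; all field names are illustrative.
type FindingSource = "axe" | "stagehand" | "policy" | "pdf";

interface BaseFinding {
  ruleId: string;         // engine-specific rule identifier
  wcagCriterion?: string; // e.g. "1.1.1", when the engine can supply one
  message: string;
}

// The `source` discriminator decides which extra fields are present.
type Finding =
  | (BaseFinding & { source: "axe"; selector: string })
  | (BaseFinding & { source: "policy"; selector: string })
  | (BaseFinding & { source: "stagehand"; selector?: string; confidence: number })
  | (BaseFinding & { source: "pdf"; page?: number });

// Narrowing on `source` gives type-safe access to engine-specific fields.
function describeLocation(f: Finding): string {
  switch (f.source) {
    case "axe":
    case "policy":
      return f.selector;
    case "stagehand":
      return f.selector ?? "(no selector)";
    case "pdf":
      return f.page !== undefined ? `page ${f.page}` : "(document)";
  }
}

// Count findings per engine, e.g. for a report summary.
function countBySource(findings: Finding[]): Record<FindingSource, number> {
  const counts: Record<FindingSource, number> = { axe: 0, stagehand: 0, policy: 0, pdf: 0 };
  for (const f of findings) counts[f.source] += 1;
  return counts;
}
```

Keeping engine-specific fields behind the discriminator means report code can treat all findings uniformly while still reading, say, a PDF page number only where one can exist.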
Separate scanner worker vs. Convex actions
Playwright + axe-core needs long-running browser sessions. Convex actions have execution time limits.
Built a dedicated scanner worker (Node.js process) that connects to Browserless via CDP. Convex actions handle the Stagehand/Browserbase path (managed browser sessions).
Heavy scanning runs without timeout constraints. Lighter AI analysis uses managed Browserbase sessions.
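The worker's last step, turning axe violations into findings, can be sketched without a browser. The types below are a simplified local mirror of the `violations` array in an axe-core result (real `target` entries can be nested for iframes and shadow DOM), and `AxeFinding`, `criteriaFromTags`, and `mapViolations` are hypothetical names for this example. The tag parsing relies on axe-core's convention of tagging rules with success criteria as `wcag111`, `wcag1410`, and so on.

```typescript
// Simplified local mirror of an axe-core violation; not the library's full types.
interface AxeNodeLike {
  target: string[]; // CSS selector path to the offending element
  html: string;     // snippet of the element's markup
}

interface AxeViolationLike {
  id: string;
  impact?: "minor" | "moderate" | "serious" | "critical" | null;
  help: string;
  tags: string[];
  nodes: AxeNodeLike[];
}

// Hypothetical finding shape for the "axe" source.
interface AxeFinding {
  source: "axe";
  ruleId: string;
  wcagCriteria: string[];
  impact: string;
  message: string;
  selector: string;
  html: string;
}

// "wcag111" -> "1.1.1", "wcag1410" -> "1.4.10"; level tags like "wcag2aa" are skipped.
function criteriaFromTags(tags: string[]): string[] {
  const criteria: string[] = [];
  for (const tag of tags) {
    const m = /^wcag(\d)(\d)(\d+)$/.exec(tag);
    if (m) criteria.push(`${m[1]}.${m[2]}.${m[3]}`);
  }
  return criteria;
}

// Emit one finding per offending node so each element can be tracked separately.
function mapViolations(violations: AxeViolationLike[]): AxeFinding[] {
  const findings: AxeFinding[] = [];
  for (const v of violations) {
    const wcagCriteria = criteriaFromTags(v.tags);
    for (const node of v.nodes) {
      findings.push({
        source: "axe",
        ruleId: v.id,
        wcagCriteria,
        impact: v.impact ?? "unknown",
        message: v.help,
        selector: node.target.join(" "),
        html: node.html,
      });
    }
  }
  return findings;
}
```

In the worker, the input would be the `violations` array from the result of `AxeBuilder.analyze()`; emitting one finding per node rather than per rule is a design choice assumed here.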
Eval-first development: measurement before optimization
AI scanners can hallucinate violations. Without measurement, prompt tuning is guesswork.
Built an offline eval harness using axe-core findings as ground truth before optimizing AI prompts. Per-criterion precision/recall metrics gate every prompt change.
Precision of AI findings improved from 78% to 94% through measurement-driven iteration. Regressions are caught before they reach production.
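The per-criterion metrics and the regression gate can be sketched as follows. This is a minimal illustration under assumptions: findings are reduced to string keys (for example page plus selector), grouped by WCAG criterion, and the function names, the 0.01 tolerance, and the rule that a missing criterion counts as a regression are all hypothetical choices for this example.

```typescript
interface Metrics {
  tp: number;
  fp: number;
  fn: number;
  precision: number;
  recall: number;
  f1: number;
}

// Compare AI findings against ground truth for one criterion.
// Ratios with an empty denominator are reported as 0.
function score(truth: Set<string>, predicted: Set<string>): Metrics {
  let tp = 0;
  predicted.forEach((key) => {
    if (truth.has(key)) tp += 1;
  });
  const fp = predicted.size - tp;
  const fn = truth.size - tp;
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, fp, fn, precision, recall, f1 };
}

type KeysByCriterion = Map<string, Set<string>>;

// Score every criterion that appears in either the truth or the predictions.
function scoreByCriterion(truth: KeysByCriterion, predicted: KeysByCriterion): Map<string, Metrics> {
  const criteria = new Set<string>();
  truth.forEach((_keys, criterion) => criteria.add(criterion));
  predicted.forEach((_keys, criterion) => criteria.add(criterion));
  const out = new Map<string, Metrics>();
  criteria.forEach((criterion) => {
    const t = truth.get(criterion) ?? new Set<string>();
    const p = predicted.get(criterion) ?? new Set<string>();
    out.set(criterion, score(t, p));
  });
  return out;
}

// Return criteria where the candidate prompt drops precision or recall
// by more than `tolerance`, or stops reporting the criterion at all.
function regressions(
  baseline: Map<string, Metrics>,
  candidate: Map<string, Metrics>,
  tolerance: number = 0.01,
): string[] {
  const failed: string[] = [];
  baseline.forEach((base, criterion) => {
    const cand = candidate.get(criterion);
    if (
      !cand ||
      cand.precision < base.precision - tolerance ||
      cand.recall < base.recall - tolerance
    ) {
      failed.push(criterion);
    }
  });
  return failed;
}
```

A prompt change would pass the gate only when `regressions(...)` returns an empty list for the eval dataset.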
Engineering Details
- Scanner worker: connects to BROWSERLESS_CDP_URL (ws://), runs AxeBuilder.analyze(), maps violations to findings
- Stagehand path: Convex action → Browserbase session → stagehand.extract() with WCAG 2.2 AA instruction
- PDF pipeline: pdfjs-dist extraction → rule engine (pdf.metadata.*, pdf.tagging.*, pdf.text-layer.*, pdf.images.*)
- Finding normalization: all sources (axe, stagehand, policy, pdf) mapped to unified schema with source discriminator
- Eval harness: axe-core findings as ground truth, per-rule precision/recall/F1, confusion matrix by WCAG criterion
- RAG pipeline: WCAG 2.2 AA spec chunked by success criterion with pgvector hybrid search for remediation grounding
- Confidence scoring: AI findings assigned confidence [0–1], sub-threshold results flagged for human review
- LLM-as-judge: second model validates ambiguous findings before report inclusion
- BrowserLaunch integration: enqueueTask on queue 'adascout_scans' with externalRef linking to scan run pages
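The confidence-scoring and LLM-as-judge bullets above can be illustrated with a small triage function. This is a sketch under assumptions: the 0.7 threshold, the `judgeVerdict` field, the three dispositions, and the order in which the two gates run are hypothetical, chosen only to show the shape of the logic.

```typescript
type Disposition = "include" | "human_review" | "drop";

interface AiFinding {
  ruleId: string;
  confidence: number;                  // model-reported score in [0, 1]
  judgeVerdict?: "valid" | "invalid";  // present only if a judge model reviewed it
}

// Hypothetical threshold; a real value would be tuned against eval results.
const REVIEW_THRESHOLD = 0.7;

// Decide what happens to one AI finding before report generation.
function triage(finding: AiFinding, threshold: number = REVIEW_THRESHOLD): Disposition {
  const c = finding.confidence;
  // Rejects NaN as well as values outside [0, 1].
  if (!(c >= 0 && c <= 1)) throw new RangeError(`confidence out of range: ${c}`);
  if (finding.judgeVerdict === "invalid") return "drop"; // judge rejected the finding
  if (c < threshold) return "human_review";              // sub-threshold: escalate
  return "include";
}

// Group findings by disposition so each bucket can be routed separately.
function partition(
  findings: AiFinding[],
  threshold: number = REVIEW_THRESHOLD,
): Record<Disposition, AiFinding[]> {
  const out: Record<Disposition, AiFinding[]> = { include: [], human_review: [], drop: [] };
  for (const f of findings) out[triage(f, threshold)].push(f);
  return out;
}
```

Only the `include` bucket would flow into client reports; the `human_review` bucket is the escalation queue.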
Key Highlights
- Multi-engine scanning: axe-core + Stagehand AI + custom policy checks + PDF analysis
- Dedicated scanner worker: Playwright + @axe-core/playwright on Browserless Chromium (CDP)
- AI accessibility analysis: Browserbase Stagehand with Google Gemini 2.5 Flash
- PDF pipeline: pdfjs-dist with 20+ rule checks (metadata, tagging, text layer, OCR, reading order, tables)
- Normalized findings model: unified output from axe, stagehand, policy, and pdf sources
- Offline eval harness measuring AI finding precision/recall against deterministic axe-core baseline
- RAG pipeline over WCAG 2.2 AA guidelines for grounded, citation-backed remediation suggestions
- Confidence scoring with human-review escalation for low-confidence findings
- LLM-as-judge validation layer filtering AI findings before report inclusion
- Report exports: PDF (browser print), Excel (exceljs), CSV
- BrowserLaunch component integration for task orchestration and replay