AdaScout

Sole Developer & Architect

Accessibility, AI, Compliance, Browser Automation, LLM Evaluation

Platform that scans websites for WCAG 2.2 AA compliance using multiple analysis engines. A dedicated scanner worker runs Playwright + axe-core against Browserless Chromium via CDP. A separate Convex action path uses Browserbase Stagehand with Gemini/MiniMax for AI-powered accessibility analysis. PDF documents are analyzed with pdfjs-dist (metadata, tagging, text layer, reading order, tables, images).

An offline evaluation harness measures AI finding precision and recall against the deterministic axe-core baseline, with per-rule F1 scores and confusion matrices by WCAG criterion. A RAG pipeline over the WCAG 2.2 AA specification provides grounded, citation-backed remediation suggestions. Confidence scoring gates AI findings: low-confidence results are escalated for human review, and an LLM-as-judge validation layer filters findings before report inclusion.

Reports are exported as PDF (browser print) and Excel (exceljs). The platform mounts the BrowserLaunch Convex component for task orchestration.

Scanning Engines

axe-core, Stagehand AI, custom policy, PDF analysis

20+

PDF Rule Checks

Metadata, tagging, text layer, OCR confidence, reading order, tables, images

AI Models

OpenAI gpt-4o (default), Google Gemini 2.5 Flash, MiniMax M2-Stable

Finding Sources

axe, stagehand, policy, pdf: all normalized into a unified model

94%

AI Precision

Measured against axe-core deterministic baseline across 500+ page eval dataset

38 WCAG criteria

Eval Coverage

Per-criterion P/R/F1 tracked with regression gating on prompt changes

The Problem

Businesses face ADA lawsuits when their websites aren't accessible. Manual audits miss issues, existing tools check only one dimension, and PDF accessibility is often ignored entirely. A comprehensive solution needs to scan HTML, analyze PDFs, and provide AI-powered remediation guidance. But AI scanners can hallucinate violations, so the system also needs a way to measure and gate AI quality before findings reach client reports.

The Solution

Built a multi-engine scanning platform: (1) axe-core via Playwright on Browserless Chromium for rule-based HTML checks, (2) Browserbase Stagehand with Gemini for AI-powered WCAG 2.2 AA analysis, and (3) a pdfjs-dist pipeline for PDF accessibility (metadata, tagging, OCR, reading order). Custom policy checks (image-missing-alt, image-empty-alt) supplement axe. Results from all four sources (axe, stagehand, policy, pdf) are normalized into a unified findings model. An evaluation harness uses axe-core as ground truth to measure AI precision/recall, with LLM-as-judge validation for ambiguous edge cases. RAG over WCAG 2.2 AA guidelines grounds remediation suggestions with specific criterion citations. Confidence scoring prevents low-quality findings from reaching reports.
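One way to picture the unified findings model is a discriminated union keyed on the finding source. The sketch below is illustrative only: the four source names come from this write-up, but every field name and the helper functions (locate, countBySource) are assumptions, not the project's actual schema.

```typescript
// Illustrative only: field names are assumptions, not the project's actual schema.
type FindingSource = "axe" | "stagehand" | "policy" | "pdf";

interface BaseFinding {
  ruleId: string;         // e.g. "color-contrast" or "image-missing-alt"
  wcagCriterion?: string; // e.g. "1.4.3", when the engine reports one
  message: string;
}

// The `source` field is the discriminator; each variant adds engine-specific data.
type Finding =
  | (BaseFinding & { source: "axe"; impact: string; selector: string })
  | (BaseFinding & { source: "stagehand"; confidence: number; selector: string })
  | (BaseFinding & { source: "policy"; selector: string })
  | (BaseFinding & { source: "pdf"; page: number });

// Narrowing on `source` lets the compiler check every engine-specific field access.
function locate(f: Finding): string {
  switch (f.source) {
    case "pdf":
      return `page ${f.page}`;
    case "axe":
    case "stagehand":
    case "policy":
      return f.selector;
  }
}

// Report-level summary: how many findings each engine contributed.
function countBySource(findings: Finding[]): Record<FindingSource, number> {
  const counts: Record<FindingSource, number> = { axe: 0, stagehand: 0, policy: 0, pdf: 0 };
  for (const f of findings) counts[f.source] += 1;
  return counts;
}
```

With a shape like this, report code can treat all findings uniformly while still reaching engine-specific fields (impact, confidence, page) after narrowing on source.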

Technical Decisions

Key architecture decisions and their outcomes

Multi-engine over single-tool scanning

Context

No single tool catches all accessibility issues. axe-core is rule-based and misses context. AI catches nuance but can hallucinate.

Decision

Combined axe-core for deterministic rules, Stagehand + Gemini for AI interpretation, custom policy checks for gaps, and pdfjs-dist for document accessibility.

Outcome

Comprehensive coverage. Each engine's weaknesses are covered by another's strengths.

Separate scanner worker vs. Convex actions

Context

Playwright + axe-core needs long-running browser sessions. Convex actions have execution time limits.

Decision

Built a dedicated scanner worker (Node.js process) that connects to Browserless via CDP. Convex actions handle the Stagehand/Browserbase path (managed browser sessions).

Outcome

Heavy scanning runs without timeout constraints. Lighter AI analysis uses managed Browserbase sessions.
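The worker's "map violations to findings" step could look roughly like the sketch below. It assumes a simplified view of the axe results object (real axe node targets can also be nested arrays for frames and shadow DOM), and the AxeFinding shape plus the helper names tagToCriterion and mapViolations are hypothetical. The Playwright and AxeBuilder calls appear only as comments, since they need a live browser.

```typescript
// Minimal structural types for the parts of an axe result used here.
// Simplified assumption: real axe targets can also be nested (frames, shadow DOM).
interface AxeNodeLike { target: string[]; html: string }
interface AxeViolationLike {
  id: string;
  impact?: string | null;
  tags: string[];
  help: string;
  nodes: AxeNodeLike[];
}

// Hypothetical output shape for axe-sourced findings.
interface AxeFinding {
  source: "axe";
  ruleId: string;
  impact: string;
  wcagCriteria: string[];
  message: string;
  selector: string;
}

// axe tags encode success criteria as e.g. "wcag143" (1.4.3) or "wcag1412" (1.4.12);
// level tags such as "wcag2aa" do not match the pattern and return null.
function tagToCriterion(tag: string): string | null {
  const m = /^wcag(\d)(\d)(\d+)$/.exec(tag);
  return m ? `${m[1]}.${m[2]}.${m[3]}` : null;
}

// Emits one finding per affected node, so each element can be reported separately.
function mapViolations(violations: AxeViolationLike[]): AxeFinding[] {
  const findings: AxeFinding[] = [];
  for (const v of violations) {
    const wcagCriteria: string[] = [];
    for (const tag of v.tags) {
      const criterion = tagToCriterion(tag);
      if (criterion !== null) wcagCriteria.push(criterion);
    }
    for (const node of v.nodes) {
      findings.push({
        source: "axe",
        ruleId: v.id,
        impact: v.impact ?? "unknown",
        wcagCriteria: wcagCriteria.slice(),
        message: v.help,
        selector: node.target.join(" "),
      });
    }
  }
  return findings;
}

// In the worker, the input would come from something like (not run here):
//   const browser = await chromium.connectOverCDP(cdpUrl); // cdpUrl from BROWSERLESS_CDP_URL
//   const results = await new AxeBuilder({ page }).analyze();
//   const findings = mapViolations(results.violations);
```

Keeping the mapper a pure function means it can be unit-tested against recorded axe output without launching a browser.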

Eval-first development: measurement before optimization

Context

AI scanners can hallucinate violations. Without measurement, prompt tuning is guesswork.

Decision

Built an offline eval harness using axe-core findings as ground truth before optimizing AI prompts. Per-criterion precision/recall metrics gate every prompt change.

Outcome

AI finding quality improved from 78% to 94% precision through measurement-driven iteration. Regressions are caught before they reach production.
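The per-criterion metrics and the regression gate can be sketched as follows. This is a minimal illustration under stated assumptions: findings are matched on a hypothetical key of page URL, criterion, and selector; the 0.01 tolerance is an arbitrary example value; and the names scoreByCriterion and findRegressions are invented here. Because axe-core serves as ground truth, any AI finding without a matching axe finding counts as a false positive in this scheme.

```typescript
// Hypothetical matching key: same page, same criterion, same element.
interface LabeledFinding { pageUrl: string; criterion: string; selector: string }
interface Metrics { tp: number; fp: number; fn: number; precision: number; recall: number; f1: number }

const keyOf = (f: LabeledFinding): string => `${f.pageUrl}|${f.criterion}|${f.selector}`;
const has = (o: object, k: string): boolean => Object.prototype.hasOwnProperty.call(o, k);

function indexByKey(findings: LabeledFinding[]): Record<string, LabeledFinding> {
  const index: Record<string, LabeledFinding> = {};
  for (const f of findings) index[keyOf(f)] = f; // duplicates collapse to one entry
  return index;
}

function scoreByCriterion(truth: LabeledFinding[], predicted: LabeledFinding[]): Record<string, Metrics> {
  const truthIdx = indexByKey(truth);
  const predIdx = indexByKey(predicted);
  const out: Record<string, Metrics> = {};
  const bucket = (criterion: string): Metrics => {
    let m = out[criterion];
    if (m === undefined) {
      m = { tp: 0, fp: 0, fn: 0, precision: 0, recall: 0, f1: 0 };
      out[criterion] = m;
    }
    return m;
  };
  // Predicted findings are true positives if ground truth has the same key.
  for (const k of Object.keys(predIdx)) {
    const f = predIdx[k];
    if (f === undefined) continue;
    if (has(truthIdx, k)) bucket(f.criterion).tp += 1;
    else bucket(f.criterion).fp += 1;
  }
  // Ground-truth findings the AI missed are false negatives.
  for (const k of Object.keys(truthIdx)) {
    const f = truthIdx[k];
    if (f === undefined) continue;
    if (!has(predIdx, k)) bucket(f.criterion).fn += 1;
  }
  for (const c of Object.keys(out)) {
    const m = out[c];
    if (m === undefined) continue;
    m.precision = m.tp + m.fp === 0 ? 0 : m.tp / (m.tp + m.fp);
    m.recall = m.tp + m.fn === 0 ? 0 : m.tp / (m.tp + m.fn);
    m.f1 = m.precision + m.recall === 0 ? 0 : (2 * m.precision * m.recall) / (m.precision + m.recall);
  }
  return out;
}

// Regression gate: criteria whose precision or recall dropped by more than `tolerance`.
// Only criteria present in both runs are compared.
function findRegressions(
  baseline: Record<string, Metrics>,
  candidate: Record<string, Metrics>,
  tolerance: number = 0.01,
): string[] {
  const regressed: string[] = [];
  for (const c of Object.keys(baseline)) {
    const before = baseline[c];
    const after = candidate[c];
    if (before === undefined || after === undefined) continue;
    const precisionDrop = before.precision - after.precision;
    const recallDrop = before.recall - after.recall;
    if (precisionDrop > tolerance || recallDrop > tolerance) regressed.push(c);
  }
  return regressed;
}
```

A prompt change would then be blocked whenever findRegressions returns a non-empty list for the candidate run.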

Engineering Details

Scanner worker: connects to BROWSERLESS_CDP_URL (ws://), runs AxeBuilder.analyze(), maps violations to findings
Stagehand path: Convex action → Browserbase session → stagehand.extract() with WCAG 2.2 AA instruction
PDF pipeline: pdfjs-dist extraction → rule engine (pdf.metadata.*, pdf.tagging.*, pdf.text-layer.*, pdf.images.*)
Finding normalization: all sources (axe, stagehand, policy, pdf) mapped to unified schema with source discriminator
Eval harness: axe-core findings as ground truth, per-rule precision/recall/F1, confusion matrix by WCAG criterion
RAG pipeline: WCAG 2.2 AA spec chunked by success criterion with pgvector hybrid search for remediation grounding
Confidence scoring: AI findings assigned confidence [0–1], sub-threshold results flagged for human review
LLM-as-judge: second model validates ambiguous findings before report inclusion
BrowserLaunch integration: enqueueTask on queue 'adascout_scans' with externalRef linking to scan run pages
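The confidence gate described in the list above can be sketched as a small routing function. The 0.7 threshold is a hypothetical default (the write-up does not state the real value), and the AiFinding and GateResult shapes and the name gateByConfidence are assumptions made for illustration.

```typescript
interface AiFinding { ruleId: string; message: string; confidence: number }
interface GateResult { report: AiFinding[]; humanReview: AiFinding[] }

// Hypothetical default; the actual threshold is not stated in this write-up.
const DEFAULT_THRESHOLD = 0.7;

// Findings at or above the threshold go to the report; the rest are escalated
// for human review. Confidence values outside [0, 1] (including NaN) are rejected.
function gateByConfidence(findings: AiFinding[], threshold: number = DEFAULT_THRESHOLD): GateResult {
  const result: GateResult = { report: [], humanReview: [] };
  for (const f of findings) {
    if (!(f.confidence >= 0 && f.confidence <= 1)) {
      throw new RangeError(`confidence out of range for ${f.ruleId}: ${f.confidence}`);
    }
    if (f.confidence >= threshold) result.report.push(f);
    else result.humanReview.push(f);
  }
  return result;
}
```

Rejecting out-of-range scores outright, instead of clamping them, is a deliberate choice in this sketch: a malformed model response should surface as an error, not silently land in a report.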

Key Highlights

Multi-engine scanning: axe-core + Stagehand AI + custom policy checks + PDF analysis
Dedicated scanner worker: Playwright + @axe-core/playwright on Browserless Chromium (CDP)
AI accessibility analysis: Browserbase Stagehand with Google Gemini 2.5 Flash
PDF pipeline: pdfjs-dist with 20+ rule checks (metadata, tagging, text layer, OCR, reading order, tables)
Normalized findings model: unified output from axe, stagehand, policy, and pdf sources
Offline eval harness measuring AI finding precision/recall against deterministic axe-core baseline
RAG pipeline over WCAG 2.2 AA guidelines for grounded, citation-backed remediation suggestions
Confidence scoring with human-review escalation for low-confidence findings
LLM-as-judge validation layer filtering AI findings before report inclusion
Report exports: PDF (browser print), Excel (exceljs), CSV
BrowserLaunch component integration for task orchestration and replay
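The PDF rule engine mentioned above can be pictured as a list of small rule functions run over facts extracted from the document. In this sketch the PdfFacts shape stands in for whatever the pdfjs-dist extraction step produces, and the specific rule IDs (pdf.metadata.title, pdf.metadata.language, pdf.tagging.tagged, pdf.text-layer.missing) are hypothetical names under the namespaces listed in Engineering Details, not the project's actual rules.

```typescript
// Assumed output of the extraction step; field names are illustrative.
interface PdfPageFacts { pageNumber: number; textItemCount: number; imageCount: number }
interface PdfFacts { title?: string; language?: string; isTagged: boolean; pages: PdfPageFacts[] }
interface PdfFinding { source: "pdf"; ruleId: string; message: string; page?: number }
type PdfRule = (facts: PdfFacts) => PdfFinding[];

const isBlank = (s: string | undefined): boolean => s === undefined || s.trim() === "";

// Hypothetical rule IDs under the pdf.metadata.*, pdf.tagging.*, pdf.text-layer.* namespaces.
const pdfRules: PdfRule[] = [
  (f) => (isBlank(f.title)
    ? [{ source: "pdf", ruleId: "pdf.metadata.title", message: "Document has no title" }]
    : []),
  (f) => (isBlank(f.language)
    ? [{ source: "pdf", ruleId: "pdf.metadata.language", message: "Document language is not set" }]
    : []),
  (f) => (f.isTagged
    ? []
    : [{ source: "pdf", ruleId: "pdf.tagging.tagged", message: "Document is not tagged" }]),
  (f) => {
    // A page with images but no text items is likely a scan without a text layer.
    const findings: PdfFinding[] = [];
    for (const p of f.pages) {
      if (p.textItemCount === 0 && p.imageCount > 0) {
        findings.push({
          source: "pdf",
          ruleId: "pdf.text-layer.missing",
          message: "Page has images but no text layer",
          page: p.pageNumber,
        });
      }
    }
    return findings;
  },
];

// Runs every rule in order and concatenates the findings.
function runPdfRules(facts: PdfFacts, rules: PdfRule[] = pdfRules): PdfFinding[] {
  const all: PdfFinding[] = [];
  for (const rule of rules) {
    for (const finding of rule(facts)) all.push(finding);
  }
  return all;
}
```

Separating extraction from rule evaluation keeps each rule a pure function, so new checks can be added and tested without parsing a real PDF.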

Tech Stack

Next.js 16, React 19, Playwright + axe-core, Browserbase Stagehand, Google Gemini / MiniMax, pdfjs-dist, pgvector, OpenAI Embeddings, exceljs, Convex, Docker

Skills & Technologies

Next.js, React, TypeScript, AI, OpenAI, Automation, Browser Automation, Docker, Security, OAuth, RAG, LLM Evaluation, Vector Search

Related Articles

AI in Production: Lessons From Shipping to Real Users

Our first AI feature hallucinated a refund policy that did not exist. A customer followed it. Here is what we learned about putting language models in front of real people.

Real-Time Everything: Why We Stopped Polling and Never Went Back

Our trading dashboard polled every 5 seconds and users complained about stale data. We rebuilt on Convex with real-time subscriptions and the difference was not incremental — it was a different product.

Building an Eval Framework for AI Compliance Scanning

AI scanners find accessibility violations that deterministic tools miss — but how do you know the AI is not hallucinating? We built an evaluation harness that measures precision and recall against axe-core ground truth, gates every prompt change with regression tests, and tracks quality drift in ClickHouse.

RAG in Practice: Grounding AI Claims in Authoritative Sources

Our AI scanner said a page violated WCAG 2.4.7 — but it cited the wrong success criterion. RAG fixed the hallucination problem by grounding every AI claim in the actual specification, with retrieval metrics that prove the system works.

AI Safety and Guardrails in Production Systems

Our first AI feature hallucinated a compliance finding that did not exist. A client acted on it. Here is how we built input sanitization, output filtering, confidence scoring, and human-in-the-loop escalation to make AI output trustworthy.

Related Projects

BrowserLaunch

Portal