
From Custom ClickHouse Pipelines to Langfuse: Migrating AI Eval Infrastructure

I built an AI evaluation framework from scratch to understand every piece — ground truth datasets, precision/recall harnesses, LLM-as-judge layers, ClickHouse dashboards. Then I migrated it to Langfuse. Here is why, and what I gained.

Apr 19, 2026 · Desmond Tatilian

I spent three weeks building a custom evaluation framework for our AI accessibility scanner. ClickHouse for time-series storage. A bespoke harness for precision/recall per WCAG criterion. An LLM-as-judge layer. CI regression gating. Custom dashboards.

It worked. It caught real regressions. It made prompt engineering scientific instead of intuitive.

Then I migrated the entire thing to Langfuse and deleted most of that code.

This is not a story about wasted effort. Building it from scratch was the right first step. Migrating to Langfuse was the right second step. The reasoning behind both decisions is worth explaining.

Why I built it from scratch first

There is a category of infrastructure where you should not adopt a platform until you understand what the platform is doing for you. Evaluation frameworks for AI systems fall squarely in that category.

When I started measuring our AI scanner's output against axe-core ground truth, I did not know what metrics mattered. I did not know whether per-rule F1 was more useful than aggregate precision. I did not know that confusion matrices per WCAG criterion would reveal misclassification patterns that aggregate scores completely hide. I did not know that the LLM-as-judge layer needed a fundamentally different prompt than the scanner itself — optimized for verification, not discovery.

I learned all of this by building each piece. The ground truth dataset taught me which page categories expose different failure modes. The harness taught me that precision and recall trade off differently per WCAG criterion — the AI is precise on color contrast but imprecise on ARIA landmarks. The ClickHouse dashboards taught me which time-series queries actually matter for production drift detection versus which ones look impressive but never get checked.

If I had started with Langfuse, I would have configured it wrong. I would have set up generic scoring instead of per-criterion scoring. I would have missed the confusion matrix insight entirely. I would have built dashboards that tracked the wrong things.

Understanding the primitives made me a better consumer of the platform.

What the custom stack looked like

The evaluation system had four layers:

Ground truth storage. A dataset of 500+ pages with axe-core findings stored as the baseline. Clean pages, known-violation pages, and edge cases. Stored as structured JSON in ClickHouse tables with page metadata — industry, framework, accessibility profile.

The eval harness. A TypeScript pipeline that ran the AI scanner against every page in the dataset, matched findings to ground truth by element selector and WCAG criterion, and computed true positives, false positives, and false negatives. Output: per-rule precision, recall, F1, and a confusion matrix showing which violation types the AI confuses.

LLM-as-judge. A second model reviewed ambiguous findings — cases where the correct classification was genuinely unclear. Alt text quality, heading semantic structure, focus indicator visibility. The judge classified each as confirmed, rejected, or needs-human-review.

ClickHouse dashboards. Time-series aggregation of all eval results — offline and online. Precision/recall trends per WCAG criterion over time, model version comparisons, cost/latency correlation, false positive rates by page category.

Each layer required its own schema design, ingestion pipeline, and query logic. The ClickHouse tables alone had five different materialized views for different aggregation windows.

It was solid. It was also a lot of code to maintain.
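For a flavor of what that code looked like, the core of the harness — matching AI findings to axe-core ground truth and computing per-criterion metrics — can be sketched roughly like this (a simplified illustration; the types and function names are not the actual AdaScout code):

```typescript
// Simplified sketch of the harness's matching + scoring step.
// The Finding shape is illustrative, not the real AdaScout type.
interface Finding {
  selector: string;  // CSS selector of the offending element
  criterion: string; // WCAG success criterion, e.g. "1.4.3"
}

interface CriterionMetrics {
  tp: number; fp: number; fn: number;
  precision: number; recall: number; f1: number;
}

const key = (f: Finding) => `${f.selector}::${f.criterion}`;

function evaluate(
  aiFindings: Finding[],
  groundTruth: Finding[],
): Map<string, CriterionMetrics> {
  const truthKeys = new Set(groundTruth.map(key));
  const aiKeys = new Set(aiFindings.map(key));
  const metrics = new Map<string, CriterionMetrics>();

  const bump = (criterion: string, field: "tp" | "fp" | "fn") => {
    const m = metrics.get(criterion) ??
      { tp: 0, fp: 0, fn: 0, precision: 0, recall: 0, f1: 0 };
    m[field]++;
    metrics.set(criterion, m);
  };

  // AI findings present in ground truth are true positives; the rest are false positives.
  for (const f of aiFindings) bump(f.criterion, truthKeys.has(key(f)) ? "tp" : "fp");
  // Ground-truth findings the AI missed are false negatives.
  for (const f of groundTruth) if (!aiKeys.has(key(f))) bump(f.criterion, "fn");

  for (const m of metrics.values()) {
    m.precision = m.tp + m.fp === 0 ? 0 : m.tp / (m.tp + m.fp);
    m.recall = m.tp + m.fn === 0 ? 0 : m.tp / (m.tp + m.fn);
    m.f1 = m.precision + m.recall === 0
      ? 0
      : (2 * m.precision * m.recall) / (m.precision + m.recall);
  }
  return metrics;
}
```

The per-criterion grouping is the important part: a single aggregate F1 would flatten exactly the misclassification patterns the confusion matrix revealed.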

What made me look at Langfuse

Two things happened in the same month.

First, I was reviewing job descriptions for AI engineering roles and noticed Langfuse appearing repeatedly. Not as a nice-to-have — as a named requirement. Companies building AI products were standardizing on it for LLM observability and evaluation. The same way Prometheus became the default for infrastructure metrics or Sentry for error tracking, Langfuse was becoming the default for LLM eval.

Second, ClickHouse acquired Langfuse. That got my attention for a specific reason: I was already running ClickHouse for AdaScout's eval harness and TraderLaunchpad's candle data pipeline. Langfuse running on ClickHouse internally meant the analytical engine I trusted for high-cardinality time-series queries was now powering the eval platform too. It was not a new dependency — it was a convergence.

What Langfuse gives me that I was building by hand

Tracing as a first-class primitive

Every AI call in the scanner — every Stagehand extract, every judge model invocation — now automatically captures input, output, latency, token usage, and cost. With the custom stack, I was manually instrumenting each call site and writing structured logs to ClickHouse. With Langfuse, it is an SDK integration:

import { Langfuse } from "langfuse";

// The client reads LANGFUSE_SECRET_KEY / LANGFUSE_PUBLIC_KEY from the environment
const langfuse = new Langfuse();

const trace = langfuse.trace({
  name: "wcag-scan",
  metadata: { pageUrl, wcagProfile, scanId },
});

const generation = trace.generation({
  name: "stagehand-extract",
  model: "gpt-4o",
  input: auditPrompt,
});

// ... scanner runs ...

generation.end({ output: findings, usage: tokenUsage });

The trace captures the full lifecycle without me building the ingestion pipeline, storage schema, or retention logic. That alone eliminated roughly 400 lines of custom code.

Scoring infrastructure I was reimplementing

Langfuse has built-in scoring — numeric, categorical, and boolean — that attaches to any trace or generation. My custom harness computed precision and recall by joining AI findings against axe-core ground truth in ClickHouse. Now I attach scores directly:

trace.score({ name: "precision", value: 0.94 });
trace.score({ name: "recall", value: 0.87 });
trace.score({ name: "wcag-criterion", value: "1.4.3", comment: "Contrast" });

The scores are queryable, filterable, and visualized in dashboards I did not build. The per-criterion breakdown I needed required a custom ClickHouse materialized view. In Langfuse, it is a filter.

LLM-as-judge without orchestration code

My custom judge layer required orchestrating a second model call, parsing the classification, storing the result, and correlating it with the original finding. Langfuse has eval templates where a judge model reviews traces automatically. I configure the evaluation criteria in the UI, point it at the right traces, and the judge runs on its own schedule.

The judge prompt still needs to be accessibility-domain-specific — Langfuse does not know what a WCAG violation is. But the orchestration, storage, and correlation are handled. I write the prompt, not the plumbing.

Prompt versioning I was not doing at all

The Stagehand WCAG audit prompt in scanRunner.ts was hardcoded. When I changed it, I changed it in code, deployed, and hoped the eval harness would catch regressions before they hit production. There was no side-by-side comparison of prompt versions.

Langfuse manages prompt versions natively. I can deploy version A to 50% of scans and version B to the other 50%, then compare precision and recall between them in the dashboard. A/B testing prompts went from "something I should build eventually" to "something that works today."
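The split itself needs to be deterministic so a retried scan sees the same prompt version. A minimal sketch of how I assign versions (illustrative, not the actual AdaScout code — the hash is a throwaway rolling hash, not a library call):

```typescript
// Deterministic 50/50 prompt-version assignment, keyed on scanId.
// Hashing the id (rather than random assignment) keeps a scan pinned
// to one version across retries, so A/B comparisons stay clean.
function promptVersionFor(scanId: string, versions: [string, string]): string {
  let hash = 0;
  for (const ch of scanId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple rolling hash
  }
  return versions[hash % 2];
}
```

The chosen version name is attached to the trace metadata, so the Langfuse dashboard can filter precision and recall by prompt version.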

Dashboards I do not maintain

The ClickHouse dashboards were the most time-consuming part of the custom stack. Five materialized views, custom Grafana panels, retention policies, backup configuration. They worked well but required ongoing maintenance — schema migrations when I added new metrics, view rebuilds when aggregation windows changed.

Langfuse dashboards show precision/recall trends, model version comparisons, cost tracking, and latency distributions out of the box. The data is already in ClickHouse internally — Langfuse just runs the queries I was writing by hand.

What I still own

Langfuse is infrastructure, not domain expertise. The pieces that make this eval framework specific to accessibility compliance are still mine:

The ground truth dataset. 500+ pages with axe-core baselines. Langfuse does not know what a correct accessibility finding looks like. The dataset, the page categorization, and the axe-core comparison logic are custom code that feeds scores into Langfuse.

Per-WCAG-criterion metrics. The logic that maps a finding to a specific WCAG success criterion and computes precision/recall per criterion is a thin TypeScript layer. It reads from Langfuse's scoring API and produces the per-criterion breakdown. Maybe 200 lines of code — down from 1,200 in the custom stack.

CI regression gating. Langfuse does not block merges. The CI step that queries Langfuse's API for aggregate and per-criterion scores, compares against the baseline, and fails the build if thresholds are violated — that is a GitHub Action I maintain. But it queries an API instead of running ClickHouse queries directly.

Finding normalization. Matching AI findings to axe-core findings by element selector, WCAG criterion, and violation type is domain logic. It runs before scores are written to Langfuse.

The split is clean: Langfuse handles storage, visualization, tracing, and judge orchestration. I handle the accessibility-specific evaluation logic.
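The threshold comparison inside that CI step is deliberately simple. A minimal sketch of the gate logic (metric names and the tolerance value are illustrative; the real scores come from Langfuse's API):

```typescript
// Sketch of the CI regression gate's comparison step. Scores are fetched
// from Langfuse elsewhere; this is only the pass/fail decision.
interface GateResult { passed: boolean; failures: string[] }

function checkRegression(
  baseline: Record<string, number>, // metric name -> baseline score
  current: Record<string, number>,  // metric name -> current score
  tolerance = 0.02,                 // allowed absolute drop before failing
): GateResult {
  const failures: string[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const now = current[metric];
    // A missing metric fails the gate, same as a regression.
    if (now === undefined || now < base - tolerance) {
      failures.push(`${metric}: ${now ?? "missing"} below baseline ${base} (tolerance ${tolerance})`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```

If any per-criterion score drops below baseline minus tolerance, the GitHub Action exits nonzero and the merge is blocked.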

The migration

The actual data migration was straightforward because both systems use ClickHouse underneath. The historical eval results — precision/recall scores, model version metadata, per-criterion breakdowns — mapped to Langfuse's scoring model with minor schema adjustments.

The harder part was rearchitecting the ingestion flow. In the custom stack, the eval harness wrote directly to ClickHouse tables. In the new architecture, the scanner instruments its calls with the Langfuse SDK, the comparison logic runs as a post-processing step, and scores are attached to traces via the API. The data flows through Langfuse instead of around it.

The LLM-as-judge migration required rewriting the judge prompt as a Langfuse eval template. The prompt itself was identical — the accessibility verification criteria did not change. The orchestration code that managed the second model call, parsed the result, and stored the classification was replaced entirely by Langfuse's built-in eval runner.

Total migration time: about a week. Most of that was testing that the new flow produced identical scores to the old flow on the same eval dataset.

Self-hosting and the Kubernetes path

Langfuse runs self-hosted via Docker Compose for development and Kubernetes with Helm for production. The Docker Compose stack bundles everything — Langfuse Web, Worker, Postgres, Redis, ClickHouse, and MinIO for blob storage. One docker compose up and the full platform is running.

For production, I am running it on the same Hetzner infrastructure that hosts the AdaScout scanner worker. The Ansible roles that manage the scanner deployment now also manage the Langfuse stack. When the eval volume justifies it, the path to high availability is a Helm chart on Kubernetes with ClickHouse broken out as a managed service — the same progression every stateful service follows.

Self-hosting matters here for two reasons. First, the eval traces contain client page content, AI findings, and scan metadata. Keeping that data on infrastructure I control is a compliance requirement, not a preference. Second, the latency between the scanner worker and the eval platform is negligible when they share a network — traces are written in milliseconds instead of crossing the public internet.

The meta-lesson, revised

In the original eval framework post, I wrote: "Build the measurement framework first. Then optimize." I still believe that. But I would add a corollary.

Build it yourself first to understand the pieces. Then migrate to a platform that maintains those pieces for you.

Building the custom stack taught me what matters in AI evaluation — per-criterion granularity, confusion matrices, regression gating, the judge layer. Those insights are portable. They informed how I configured Langfuse, which scores I track, and which dashboards I actually check.

If I had started with Langfuse, I would have set up generic "quality" scores and missed the WCAG-specific insights that make the framework valuable. If I had stayed with the custom stack, I would be maintaining ClickHouse schemas and Grafana panels instead of improving the eval criteria.

The industry is converging on Langfuse for a reason. ClickHouse acquiring them validates the technical foundation. Companies listing it in job descriptions validates the adoption trajectory. And the self-hosted model means I am not trading control for convenience.

I deleted about 2,000 lines of infrastructure code and kept the 200 lines of domain logic that actually differentiates the evaluation. That is the right ratio.
