How I Built a Custom AI for My Portfolio (Part 2): Monitoring and Optimization

Part 2 of my portfolio AI series: concrete eval results, ClickHouse monitoring signals, optimization decisions, recursion guardrails, and the measured before/after impact.

Apr 15, 2026 · Desmond Tatilian

This is Part 2 of a two-part series on my portfolio AI system.

  • Part 1: architecture, tooling, RAG setup, and implementation details
  • Part 2 (this post): monitoring, optimization, and measured outcomes

If you want the build and architecture walkthrough first, start with Part 1.

The first version of my portfolio AI looked good in a demo and failed in real usage.

It could answer "what projects has Desmond built?" but then miss direct asks like "tell me about TraderLaunchpad." It sounded confident even when retrieval was weak, and when I changed embedding settings, quality drifted in ways I could not explain from logs alone.

That is when I stopped treating the portfolio assistant like a chatbot demo and started treating it like a production system with hard quality gates.

This article is the full system view that connects RAG in Practice, AI Eval Framework, and AI Safety and Guardrails into one real implementation.

What I was optimizing for

I set five non-negotiables:

  1. Grounded output: responses must come from real portfolio data.
  2. Entity accuracy: named asks like "TraderLaunchpad" should resolve reliably.
  3. Measurable quality: no model/prompt/index change ships without eval data.
  4. Operational visibility: retrieval quality, latency, cost, and recursion must be queryable.
  5. Safe failure modes: if evidence is weak, the system should say so instead of hallucinating.

Corpus and indexing strategy

The assistant indexes four source types:

  • Projects (slug, title, summary, stack, tags, featured content)
  • Skills (category, tools, proficiency context)
  • Experience (role scope, outcomes, architecture choices)
  • Blog (long-form technical content)

In my latest indexing run, that was approximately:

  • 22 projects
  • 47 skills
  • 19 experience sections
  • 24 published blog posts

After chunking, I ended up with roughly 1,300 searchable chunks.

Each chunk carries metadata:

  • sourceType (project, skill, experience, blog)
  • sourceSlug
  • topicTags
  • importance
  • indexedAt

This metadata is why hybrid retrieval works in practice: vector similarity does semantic match, while metadata and lexical fallbacks recover exact entities when needed.
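
For reference, here is a minimal sketch of what one indexed chunk looks like. The field names match the metadata list above; the interface itself is illustrative, not the production schema.

```typescript
// Illustrative chunk shape; metadata fields match the list above.
type SourceType = "project" | "skill" | "experience" | "blog";

interface PortfolioChunk {
  id: string;
  text: string;
  embedding: number[];   // vector used for semantic similarity
  sourceType: SourceType;
  sourceSlug: string;    // canonical entity key, e.g. a project slug
  topicTags: string[];
  importance: number;    // normalized to 0..1 before indexing
  indexedAt: string;     // ISO timestamp
}
```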

What failed first (and what the metrics showed)

Before improvements, I tracked 150 real portfolio prompts over two weeks.

The failure pattern was clear:

  • Project-specific asks often got generic answers.
  • Blog-heavy queries over-indexed on long-form chunks and missed project facts.
  • Ambiguous follow-ups sometimes triggered redundant retrieval loops.

Baseline metrics looked like this:

| Metric | Baseline |
| --- | --- |
| Known entity hit rate (project slug/title prompts) | 72% |
| Groundedness score (human review) | 0.79 |
| Wrong-attribution rate | 11.4% |
| Fallback retrieval trigger rate | 34% |
| p95 end-to-end latency | 4.8s |
| Recursion depth > 3 turns | 9.7% of sessions |

The first big insight: retrieval was not "broken," it was inconsistent under specific prompt classes.

Embedding model and dimension experiments

I tested three configurations on an offline eval set (180 prompts) before promoting changes online:

  • 60 direct entity prompts (project name/slug asks)
  • 50 architecture prompts
  • 40 comparison prompts
  • 30 adversarial or ambiguous prompts

Candidate configurations

| Config | Recall@5 | MRR@5 | Known entity hit | p95 retrieval latency | Relative index size |
| --- | --- | --- | --- | --- | --- |
| OpenAI 1536 | 0.86 | 0.79 | 88% | 94ms | 1.00x |
| Google 1536 (constrained) | 0.89 | 0.82 | 91% | 108ms | 1.31x |
| Google 3072 (native) | 0.92 | 0.85 | 95% | 143ms | 1.94x |

Why I chose the final setup

I ended up with Google native 3072 for the portfolio corpus because it gave the best retrieval quality on nuanced architecture prompts and the highest entity hit rate.

The latency increase was real, but acceptable once I fixed recursion and reduced duplicate retrieval attempts.

Important operational detail: dimension changes are a hard retrieval contract. I had to re-index when changing dimension strategy, otherwise results were noisy and misleading.
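
One way to enforce that contract is a startup check that compares the embedding config the index was built with against the config the query path is about to use. This is a hedged sketch; the types and function name are hypothetical, not part of my actual serving code.

```typescript
// Hypothetical guard: refuse to serve queries if the embedding config that
// built the index does not match the config the query path would use.
interface EmbeddingConfig {
  provider: "openai" | "google";
  model: string;
  dimensions: number; // e.g. 1536 or 3072
}

function assertIndexCompatible(indexCfg: EmbeddingConfig, queryCfg: EmbeddingConfig): void {
  if (indexCfg.model !== queryCfg.model || indexCfg.dimensions !== queryCfg.dimensions) {
    throw new Error(
      `Embedding contract mismatch: index built with ${indexCfg.model}/${indexCfg.dimensions}, ` +
        `query path configured for ${queryCfg.model}/${queryCfg.dimensions}. Re-index before serving.`
    );
  }
}
```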

If you want more on this decision surface, read Vector Dimensions in Production RAG.

Specific changes that moved quality

These were the most impactful implementation changes:

  1. Entity-aware fallback retrieval

    • Added a lexical project lookup fallback for explicit names/slugs (items 1 and 2 are sketched in code after this list).
    • Impact: known-entity hit rate improved from 72% to 96%.
  2. Importance normalization

    • Normalized RAG importance values to a safe 0..1 range before indexing.
    • Impact: eliminated null/invalid score edge cases and stabilized ranking.
  3. Source balancing

    • Prevented long blog chunks from drowning project chunks on project asks.
    • Impact: wrong-attribution rate dropped from 11.4% to 4.1%.
  4. Prompt constraints for low-evidence responses

    • Updated prompt instructions to explicitly return "insufficient evidence" instead of extrapolating.
    • Impact: hallucination-style responses dropped by more than half in adversarial prompts.
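
To make the first two items concrete, here is a rough sketch of the entity-first fallback and the importance clamp. The chunk type and search functions are illustrative stand-ins for the real retrieval layer, not its actual API.

```typescript
// Minimal local chunk shape for this sketch.
type Chunk = { sourceSlug: string; text: string; score: number };
type SearchFn = (query: string, k: number) => Promise<Chunk[]>;

// 1. Entity-aware fallback: explicit project names/slugs get a deterministic
//    lexical lookup before any vector search runs. The search functions are
//    injected because this is a sketch, not the production retrieval layer.
async function retrieveForProjectAsk(
  query: string,
  lexicalProjectLookup: SearchFn,
  semanticSearch: SearchFn
): Promise<Chunk[]> {
  const exact = await lexicalProjectLookup(query, 3);
  if (exact.length > 0) return exact; // deterministic hit, skip semantic search
  return semanticSearch(query, 8);    // otherwise fall back to vector retrieval
}

// 2. Importance normalization: clamp to 0..1 and replace null/NaN before indexing.
function normalizeImportance(raw: number | null | undefined): number {
  if (raw == null || Number.isNaN(raw)) return 0.5; // neutral default
  return Math.min(1, Math.max(0, raw));
}
```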

Why this got more complex than "just use RAG"

One thing this project made clear: the complexity did not come from indexing itself but from retrieval reliability requirements.

Projects were easier because many requests map to stable canonical entities (slug/title), so deterministic fallbacks work well. Blog queries were harder because:

  • long-form articles are semantically dense and overlap in vocabulary
  • short follow-ups like "what about part 1" are under-specified
  • part-based series titles require disambiguation, not just nearest-neighbor similarity

So the production solution became hybrid by design:

  • semantic RAG for broad relevance
  • lexical/title matching for specific article targeting
  • deterministic retrieval paths for exact entity intent
  • conversation-aware handling for short follow-ups

This is a common production pattern in RAG systems: retrieval quality improves when you combine probabilistic semantic search with deterministic entity resolution.
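
A simplified shape of that routing decision is sketched below. The intent labels, regex, and thresholds are illustrative; the real system uses richer signals, but the structure is the same: deterministic paths first, semantic search as the general case.

```typescript
// Illustrative routing: pick a retrieval path based on how specific the ask is.
type RetrievalPath = "entity-exact" | "title-lexical" | "conversation-context" | "semantic";

function chooseRetrievalPath(query: string, knownSlugs: Set<string>): RetrievalPath {
  const normalized = query.toLowerCase();

  // Exact entity intent: the query names a known project slug or title.
  for (const slug of knownSlugs) {
    if (normalized.includes(slug.toLowerCase())) return "entity-exact";
  }

  // Short follow-ups ("what about part 1") lean on conversation context.
  if (normalized.split(/\s+/).length <= 4) return "conversation-context";

  // Article-style asks get title/lexical matching before semantic search.
  if (/part\s*\d+|series|article|post/.test(normalized)) return "title-lexical";

  return "semantic";
}
```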

Eval framework and release gates

I treat prompt/model/index changes like code releases. Every meaningful change runs through a fixed eval harness.

What I score

  • groundedness (0-1)
  • attribution correctness
  • known-entity resolution
  • completeness
  • hallucination rate
  • latency/cost deltas

Release gates

  • block if groundedness drops > 0.03 from baseline
  • block if known-entity hit drops > 2%
  • block if wrong-attribution rises > 2 points
  • block if p95 latency rises > 20% without a quality gain that justifies it

This is the same measurement discipline I use in AI Eval Framework, just tuned for portfolio assistant behavior instead of compliance findings.
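
Here is a sketch of how those gates can be encoded as a single pass/fail check per candidate change. The metric names mirror the lists above; the structure is illustrative rather than my exact harness code.

```typescript
// Illustrative release gate: compare a candidate eval run against the current baseline.
interface EvalSummary {
  groundedness: number;         // 0..1
  knownEntityHitRate: number;   // 0..1
  wrongAttributionRate: number; // 0..1
  p95LatencyMs: number;
}

function releaseBlockers(baseline: EvalSummary, candidate: EvalSummary, hasQualityGain: boolean): string[] {
  const blockers: string[] = [];
  if (baseline.groundedness - candidate.groundedness > 0.03)
    blockers.push("groundedness dropped more than 0.03 from baseline");
  if (baseline.knownEntityHitRate - candidate.knownEntityHitRate > 0.02)
    blockers.push("known-entity hit rate dropped more than 2%");
  if (candidate.wrongAttributionRate - baseline.wrongAttributionRate > 0.02)
    blockers.push("wrong-attribution rose more than 2 points");
  if (candidate.p95LatencyMs > baseline.p95LatencyMs * 1.2 && !hasQualityGain)
    blockers.push("p95 latency rose more than 20% without a justifying quality gain");
  return blockers; // empty array means the change can be promoted
}
```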

ClickHouse monitoring: what I tracked and what it changed

I use ClickHouse because I need high-cardinality slices across model config, prompt version, retrieval mode, and session behavior.

Per-response telemetry

For every response I log:

  • prompt class + session metadata
  • model ID + embedding config snapshot
  • retrieval traces (top-k chunks, score spread, fallback usage)
  • tokens, cost, generation latency
  • confidence and guardrail flags
  • recursion metrics (depth, duplicate retrieval count, forced cutoff reason)
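
Roughly, each logged row looks like the sketch below before it lands in ClickHouse. I am showing it with the official @clickhouse/client Node package as an assumption about the write path; the table and column names are illustrative, not a fixed schema.

```typescript
import { createClient } from "@clickhouse/client";

// Shape of one response record; column names are illustrative.
interface ResponseTelemetry {
  sessionId: string;
  promptClass: string;
  modelId: string;
  embeddingConfig: string;          // serialized snapshot of the embedding setup
  retrievalMode: string;
  topKScores: number[];
  usedFallback: boolean;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  latencyMs: number;
  recursionDepth: number;
  duplicateRetrievals: number;
  forcedCutoffReason: string | null;
  ts: string;
}

const clickhouse = createClient({ url: process.env.CLICKHOUSE_URL });

async function logResponse(row: ResponseTelemetry): Promise<void> {
  await clickhouse.insert({
    table: "assistant_responses", // illustrative table name
    values: [row],
    format: "JSONEachRow",
  });
}
```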

Concrete monitoring insights

  1. Recursion cluster on ambiguous follow-ups

    • Signal: a spike where recursion depth exceeded 3 and duplicate retrieval count > 2.
    • Change: added stricter retry budgets and dedupe checks.
    • Result: depth > 3 sessions dropped from 9.7% to 1.8%.
  2. One prompt template had high token burn

    • Signal: one prompt version averaged 38% more output tokens with no quality gain.
    • Change: tightened instruction hierarchy and removed repetitive safety boilerplate from response-facing sections.
    • Result: average cost per response dropped from about $0.016 to $0.011.
  3. Project asks still over-triggered fallback

    • Signal: fallback trigger rate stayed high for exact-name asks despite good recall.
    • Change: boosted entity-first retrieval path before generic semantic retrieval.
    • Result: fallback usage for project prompts dropped from 41% to 18%, with better latency.
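
The first insight above came out of a slice roughly like the one below. It assumes the same @clickhouse/client setup and the illustrative table/columns from the telemetry sketch; the exact query shape is a reconstruction, not a copy of my dashboard.

```typescript
import { createClient } from "@clickhouse/client";

const client = createClient({ url: process.env.CLICKHOUSE_URL });

// Illustrative slice: share of sessions blowing past the recursion budget, by prompt class.
async function recursionHotspots() {
  const result = await client.query({
    query: `
      SELECT
        promptClass,
        countIf(recursionDepth > 3 AND duplicateRetrievals > 2) AS runaway,
        count() AS total,
        round(countIf(recursionDepth > 3 AND duplicateRetrievals > 2) / count(), 3) AS runaway_share
      FROM assistant_responses
      WHERE ts >= now() - INTERVAL 14 DAY
      GROUP BY promptClass
      ORDER BY runaway_share DESC
    `,
    format: "JSONEachRow",
  });
  return result.json();
}
```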

Recursion guardrails and safety behavior

The most subtle reliability issue was recursion creep: the model kept trying to fetch "just one more context set" when confidence was low.

I added these controls:

  • hard recursion depth cap
  • duplicate-tool-call suppression
  • bounded retries with backoff
  • explicit "insufficient evidence" fallback response path
  • postmortem logging for every forced cutoff
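
A compressed sketch of how those controls sit in the agent loop is below. The budget shape, cutoff reasons, and fallback string are illustrative; the point is that every stop condition produces an explicit, loggable reason.

```typescript
// Illustrative guardrail state carried through one assistant turn.
interface GuardrailBudget {
  maxDepth: number;            // hard recursion depth cap
  maxRetries: number;          // bounded retries with backoff handled elsewhere
  seenToolCalls: Set<string>;  // duplicate-tool-call suppression
}

// Returns a cutoff reason when the loop must stop, or null to continue.
function cutoffReason(budget: GuardrailBudget, depth: number, retries: number, toolCallKey: string): string | null {
  if (depth >= budget.maxDepth) return "depth_cap";
  if (retries >= budget.maxRetries) return "retry_budget_exhausted";
  if (budget.seenToolCalls.has(toolCallKey)) return "duplicate_tool_call";
  budget.seenToolCalls.add(toolCallKey);
  return null;
}

// When a cutoff fires, the turn ends with the explicit low-evidence response
// and the reason is logged for postmortem review.
const INSUFFICIENT_EVIDENCE =
  "I don't have enough grounded portfolio context to answer that reliably.";
```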

This pattern mirrors the production mindset in AI Safety and Guardrails: failures are expected, so the system must fail visibly and safely.

Before vs after: final outcome

After six iteration cycles (offline eval + limited online rollout + ClickHouse review), results looked like this:

| Metric | Before | After |
| --- | --- | --- |
| Known entity hit rate | 72% | 96% |
| Groundedness score | 0.79 | 0.93 |
| Wrong-attribution rate | 11.4% | 4.1% |
| Hallucination rate (adversarial set) | 13.2% | 5.0% |
| p95 end-to-end latency | 4.8s | 3.3s |
| Avg cost per response | $0.016 | $0.011 |
| Recursion depth > 3 sessions | 9.7% | 1.8% |

The core lesson: the quality jump did not come from one magic prompt. It came from the loop:

  1. measure
  2. isolate failure class
  3. change one variable
  4. evaluate offline
  5. verify online
  6. promote or rollback

Final architecture takeaway

The portfolio assistant became reliable only after I combined:

  • source-aware indexing across projects/skills/experience/blog
  • embedding and dimension choices validated by evals
  • hard regression gates for model/prompt/index updates
  • ClickHouse telemetry for operational decisions
  • recursion and low-confidence guardrails

That is what turned it from "cool chat feature" into an explainable production system.

Want to see how this was built?

Read the full eval methodology
