How I Built a Custom AI for My Portfolio (Part 2): Monitoring and Optimization

Part 2 of my portfolio AI series: concrete eval results, ClickHouse monitoring signals, optimization decisions, recursion guardrails, and the measured before/after impact.

Apr 15, 2026 · Desmond Tatilian

This is Part 2 of a two-part series on my portfolio AI system.

  • Part 1: architecture, tooling, RAG setup, and implementation details
  • Part 2 (this post): monitoring, optimization, and measured outcomes

If you want the build and architecture walkthrough first, start with Part 1.

The first version of my portfolio AI looked good in a demo and failed in real usage.

It could answer "what projects has Desmond built?" but then miss direct asks like "tell me about TraderLaunchpad." It sounded confident even when retrieval was weak, and when I changed embedding settings, quality drifted in ways I could not explain from logs alone.

That is when I stopped treating the portfolio assistant like a chatbot demo and started treating it like a production system with hard quality gates.

This article is the full system view that connects RAG in Practice, AI Eval Framework, and AI Safety and Guardrails into one real implementation.

What I was optimizing for

I set five non-negotiables:

  1. Grounded output: responses must come from real portfolio data.
  2. Entity accuracy: named asks like "TraderLaunchpad" should resolve reliably.
  3. Measurable quality: no model/prompt/index change ships without eval data.
  4. Operational visibility: retrieval quality, latency, cost, and recursion must be queryable.
  5. Safe failure modes: if evidence is weak, the system should say so instead of hallucinating.

Corpus and indexing strategy

The assistant indexes four source types:

  • Projects (slug, title, summary, stack, tags, featured content)
  • Skills (category, tools, proficiency context)
  • Experience (role scope, outcomes, architecture choices)
  • Blog (long-form technical content)

In my latest indexing run, that was approximately:

  • 22 projects
  • 47 skills
  • 19 experience sections
  • 24 published blog posts

After chunking, I ended up with roughly 1,300 searchable chunks.

Each chunk carries metadata:

  • sourceType (project, skill, experience, blog)
  • sourceSlug
  • topicTags
  • importance
  • indexedAt

This metadata is why hybrid retrieval works in practice: vector similarity does semantic match, while metadata and lexical fallbacks recover exact entities when needed.
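
For reference, here is a minimal sketch of what one indexed chunk looks like. The field names match the metadata list above; the interface itself is illustrative, not the production schema.

```typescript
// Illustrative chunk shape; metadata fields match the list above.
type SourceType = "project" | "skill" | "experience" | "blog";

interface PortfolioChunk {
  id: string;
  text: string;
  embedding: number[];   // vector used for semantic similarity
  sourceType: SourceType;
  sourceSlug: string;    // canonical entity key, e.g. a project slug
  topicTags: string[];
  importance: number;    // normalized to 0..1 before indexing
  indexedAt: string;     // ISO timestamp
}
```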

What failed first (and what the metrics showed)

Before improvements, I tracked 150 real portfolio prompts over two weeks.

The failure pattern was clear:

  • Project-specific asks often got generic answers.
  • Blog-heavy queries over-indexed on long-form chunks and missed project facts.
  • Ambiguous follow-ups sometimes triggered redundant retrieval loops.

Baseline metrics looked like this:

| Metric | Baseline |
| --- | --- |
| Known entity hit rate (project slug/title prompts) | 72% |
| Groundedness score (human review) | 0.79 |
| Wrong-attribution rate | 11.4% |
| Fallback retrieval trigger rate | 34% |
| p95 end-to-end latency | 4.8s |
| Recursion depth > 3 turns | 9.7% of sessions |

The first big insight: retrieval was not "broken," it was inconsistent under specific prompt classes.

Embedding model and dimension experiments

I tested three configurations on an offline eval set (180 prompts) before promoting changes online:

  • 60 direct entity prompts (project name/slug asks)
  • 50 architecture prompts
  • 40 comparison prompts
  • 30 adversarial or ambiguous prompts

Candidate configurations

| Config | Recall@5 | MRR@5 | Known entity hit | p95 retrieval latency | Relative index size |
| --- | --- | --- | --- | --- | --- |
| OpenAI 1536 | 0.86 | 0.79 | 88% | 94ms | 1.00x |
| Google 1536 (constrained) | 0.89 | 0.82 | 91% | 108ms | 1.31x |
| Google 3072 (native) | 0.92 | 0.85 | 95% | 143ms | 1.94x |

Why I chose the final setup

I ended up with Google native 3072 for the portfolio corpus because it gave the best retrieval quality on nuanced architecture prompts and the highest entity hit rate.

The latency increase was real, but acceptable once I fixed recursion and reduced duplicate retrieval attempts.

Important operational detail: dimension changes are a hard retrieval contract. I had to re-index when changing dimension strategy, otherwise results were noisy and misleading.
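
One way to enforce that contract is a startup check that compares the embedding config the index was built with against the config the query path is about to use. This is a hedged sketch; the types and function name are hypothetical, not part of my actual serving code.

```typescript
// Hypothetical guard: refuse to serve queries if the embedding config that
// built the index does not match the config the query path would use.
interface EmbeddingConfig {
  provider: "openai" | "google";
  model: string;
  dimensions: number; // e.g. 1536 or 3072
}

function assertIndexCompatible(indexCfg: EmbeddingConfig, queryCfg: EmbeddingConfig): void {
  if (indexCfg.model !== queryCfg.model || indexCfg.dimensions !== queryCfg.dimensions) {
    throw new Error(
      `Embedding contract mismatch: index built with ${indexCfg.model}/${indexCfg.dimensions}, ` +
        `query path configured for ${queryCfg.model}/${queryCfg.dimensions}. Re-index before serving.`
    );
  }
}
```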

If you want more on this decision surface, read Vector Dimensions in Production RAG.

Specific changes that moved quality

These were the most impactful implementation changes:

  1. Entity-aware fallback retrieval

    • Added a lexical project lookup fallback for explicit names/slugs (items 1 and 2 are sketched in code after this list).
    • Impact: known-entity hit rate improved from 72% to 96%.
  2. Importance normalization

    • Normalized RAG importance values to a safe 0..1 range before indexing.
    • Impact: eliminated null/invalid score edge cases and stabilized ranking.
  3. Source balancing

    • Prevented long blog chunks from drowning project chunks on project asks.
    • Impact: wrong-attribution rate dropped from 11.4% to 4.1%.
  4. Prompt constraints for low-evidence responses

    • Updated prompt instructions to explicitly return "insufficient evidence" instead of extrapolating.
    • Impact: hallucination-style responses dropped by more than half in adversarial prompts.
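
To make the first two items concrete, here is a rough sketch of the entity-first fallback and the importance clamp. The chunk type and search functions are illustrative stand-ins for the real retrieval layer, not its actual API.

```typescript
// Minimal local chunk shape for this sketch.
type Chunk = { sourceSlug: string; text: string; score: number };
type SearchFn = (query: string, k: number) => Promise<Chunk[]>;

// 1. Entity-aware fallback: explicit project names/slugs get a deterministic
//    lexical lookup before any vector search runs. The search functions are
//    injected because this is a sketch, not the production retrieval layer.
async function retrieveForProjectAsk(
  query: string,
  lexicalProjectLookup: SearchFn,
  semanticSearch: SearchFn
): Promise<Chunk[]> {
  const exact = await lexicalProjectLookup(query, 3);
  if (exact.length > 0) return exact; // deterministic hit, skip semantic search
  return semanticSearch(query, 8);    // otherwise fall back to vector retrieval
}

// 2. Importance normalization: clamp to 0..1 and replace null/NaN before indexing.
function normalizeImportance(raw: number | null | undefined): number {
  if (raw == null || Number.isNaN(raw)) return 0.5; // neutral default
  return Math.min(1, Math.max(0, raw));
}
```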

Why this got more complex than "just use RAG"

One thing this project made clear: the complexity did not come from indexing itself but from retrieval reliability requirements.

Projects were easier because many requests map to stable canonical entities (slug/title), so deterministic fallbacks work well. Blog queries were harder because:

  • long-form articles are semantically dense and overlap in vocabulary
  • short follow-ups like "what about part 1" are under-specified
  • part-based series titles require disambiguation, not just nearest-neighbor similarity

So the production solution became hybrid by design:

  • semantic RAG for broad relevance
  • lexical/title matching for specific article targeting
  • deterministic retrieval paths for exact entity intent
  • conversation-aware handling for short follow-ups

This is a common production pattern in RAG systems: retrieval quality improves when you combine probabilistic semantic search with deterministic entity resolution.
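
A simplified shape of that routing decision is sketched below. The intent labels, regex, and thresholds are illustrative; the real system uses richer signals, but the structure is the same: deterministic paths first, semantic search as the general case.

```typescript
// Illustrative routing: pick a retrieval path based on how specific the ask is.
type RetrievalPath = "entity-exact" | "title-lexical" | "conversation-context" | "semantic";

function chooseRetrievalPath(query: string, knownSlugs: Set<string>): RetrievalPath {
  const normalized = query.toLowerCase();

  // Exact entity intent: the query names a known project slug or title.
  for (const slug of knownSlugs) {
    if (normalized.includes(slug.toLowerCase())) return "entity-exact";
  }

  // Short follow-ups ("what about part 1") lean on conversation context.
  if (normalized.split(/\s+/).length <= 4) return "conversation-context";

  // Article-style asks get title/lexical matching before semantic search.
  if (/part\s*\d+|series|article|post/.test(normalized)) return "title-lexical";

  return "semantic";
}
```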

Eval framework and release gates

I treat prompt/model/index changes like code releases. Every meaningful change runs through a fixed eval harness.

What I score

  • groundedness (0-1)
  • attribution correctness
  • known-entity resolution
  • completeness
  • hallucination rate
  • latency/cost deltas

Release gates

  • block if groundedness drops > 0.03 from baseline
  • block if known-entity hit drops > 2%
  • block if wrong-attribution rises > 2 points
  • block if p95 latency rises > 20% without a quality gain that justifies it

This is the same measurement discipline I use in AI Eval Framework, just tuned for portfolio assistant behavior instead of compliance findings.
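
Here is a sketch of how those gates can be encoded as a single pass/fail check per candidate change. The metric names mirror the lists above; the structure is illustrative rather than my exact harness code.

```typescript
// Illustrative release gate: compare a candidate eval run against the current baseline.
interface EvalSummary {
  groundedness: number;         // 0..1
  knownEntityHitRate: number;   // 0..1
  wrongAttributionRate: number; // 0..1
  p95LatencyMs: number;
}

function releaseBlockers(baseline: EvalSummary, candidate: EvalSummary, hasQualityGain: boolean): string[] {
  const blockers: string[] = [];
  if (baseline.groundedness - candidate.groundedness > 0.03)
    blockers.push("groundedness dropped more than 0.03 from baseline");
  if (baseline.knownEntityHitRate - candidate.knownEntityHitRate > 0.02)
    blockers.push("known-entity hit rate dropped more than 2%");
  if (candidate.wrongAttributionRate - baseline.wrongAttributionRate > 0.02)
    blockers.push("wrong-attribution rose more than 2 points");
  if (candidate.p95LatencyMs > baseline.p95LatencyMs * 1.2 && !hasQualityGain)
    blockers.push("p95 latency rose more than 20% without a justifying quality gain");
  return blockers; // empty array means the change can be promoted
}
```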

ClickHouse monitoring: what I tracked and what it changed

I use ClickHouse because I need high-cardinality slices across model config, prompt version, retrieval mode, and session behavior.

Per-response telemetry

For every response I log:

  • prompt class + session metadata
  • model ID + embedding config snapshot
  • retrieval traces (top-k chunks, score spread, fallback usage)
  • tokens, cost, generation latency
  • confidence and guardrail flags
  • recursion metrics (depth, duplicate retrieval count, forced cutoff reason)
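
Roughly, each logged row looks like the sketch below before it lands in ClickHouse. I am showing it with the official @clickhouse/client Node package as an assumption about the write path; the table and column names are illustrative, not a fixed schema.

```typescript
import { createClient } from "@clickhouse/client";

// Shape of one response record; column names are illustrative.
interface ResponseTelemetry {
  sessionId: string;
  promptClass: string;
  modelId: string;
  embeddingConfig: string;          // serialized snapshot of the embedding setup
  retrievalMode: string;
  topKScores: number[];
  usedFallback: boolean;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  latencyMs: number;
  recursionDepth: number;
  duplicateRetrievals: number;
  forcedCutoffReason: string | null;
  ts: string;
}

const clickhouse = createClient({ url: process.env.CLICKHOUSE_URL });

async function logResponse(row: ResponseTelemetry): Promise<void> {
  await clickhouse.insert({
    table: "assistant_responses", // illustrative table name
    values: [row],
    format: "JSONEachRow",
  });
}
```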

Concrete monitoring insights

  1. Recursion cluster on ambiguous follow-ups

    • Signal: a spike where recursion depth exceeded 3 and duplicate retrieval count > 2.
    • Change: added stricter retry budgets and dedupe checks.
    • Result: depth > 3 sessions dropped from 9.7% to 1.8%.
  2. One prompt template had high token burn

    • Signal: one prompt version averaged 38% more output tokens with no quality gain.
    • Change: tightened instruction hierarchy and removed repetitive safety boilerplate from response-facing sections.
    • Result: average cost per response dropped from about $0.016 to $0.011.
  3. Project asks still over-triggered fallback

    • Signal: fallback trigger rate stayed high for exact-name asks despite good recall.
    • Change: boosted entity-first retrieval path before generic semantic retrieval.
    • Result: fallback usage for project prompts dropped from 41% to 18%, with better latency.
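
The first insight above came out of a slice roughly like the one below. It assumes the same @clickhouse/client setup and the illustrative table/columns from the telemetry sketch; the exact query shape is a reconstruction, not a copy of my dashboard.

```typescript
import { createClient } from "@clickhouse/client";

const client = createClient({ url: process.env.CLICKHOUSE_URL });

// Illustrative slice: share of sessions blowing past the recursion budget, by prompt class.
async function recursionHotspots() {
  const result = await client.query({
    query: `
      SELECT
        promptClass,
        countIf(recursionDepth > 3 AND duplicateRetrievals > 2) AS runaway,
        count() AS total,
        round(countIf(recursionDepth > 3 AND duplicateRetrievals > 2) / count(), 3) AS runaway_share
      FROM assistant_responses
      WHERE ts >= now() - INTERVAL 14 DAY
      GROUP BY promptClass
      ORDER BY runaway_share DESC
    `,
    format: "JSONEachRow",
  });
  return result.json();
}
```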

Recursion guardrails and safety behavior

The most subtle reliability issue was recursion creep: the model kept trying to fetch "just one more context set" when confidence was low.

I added these controls:

  • hard recursion depth cap
  • duplicate-tool-call suppression
  • bounded retries with backoff
  • explicit "insufficient evidence" fallback response path
  • postmortem logging for every forced cutoff
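
A compressed sketch of how those controls sit in the agent loop is below. The budget shape, cutoff reasons, and fallback string are illustrative; the point is that every stop condition produces an explicit, loggable reason.

```typescript
// Illustrative guardrail state carried through one assistant turn.
interface GuardrailBudget {
  maxDepth: number;            // hard recursion depth cap
  maxRetries: number;          // bounded retries with backoff handled elsewhere
  seenToolCalls: Set<string>;  // duplicate-tool-call suppression
}

// Returns a cutoff reason when the loop must stop, or null to continue.
function cutoffReason(budget: GuardrailBudget, depth: number, retries: number, toolCallKey: string): string | null {
  if (depth >= budget.maxDepth) return "depth_cap";
  if (retries >= budget.maxRetries) return "retry_budget_exhausted";
  if (budget.seenToolCalls.has(toolCallKey)) return "duplicate_tool_call";
  budget.seenToolCalls.add(toolCallKey);
  return null;
}

// When a cutoff fires, the turn ends with the explicit low-evidence response
// and the reason is logged for postmortem review.
const INSUFFICIENT_EVIDENCE =
  "I don't have enough grounded portfolio context to answer that reliably.";
```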

This pattern mirrors the production mindset in AI Safety and Guardrails: failures are expected, so the system must fail visibly and safely.

Before vs after: final outcome

After six iteration cycles (offline eval + limited online rollout + ClickHouse review), results looked like this:

| Metric | Before | After |
| --- | --- | --- |
| Known entity hit rate | 72% | 96% |
| Groundedness score | 0.79 | 0.93 |
| Wrong-attribution rate | 11.4% | 4.1% |
| Hallucination rate (adversarial set) | 13.2% | 5.0% |
| p95 end-to-end latency | 4.8s | 3.3s |
| Avg cost per response | $0.016 | $0.011 |
| Recursion depth > 3 sessions | 9.7% | 1.8% |

The core lesson: the quality jump did not come from one magic prompt. It came from the loop:

  1. measure
  2. isolate failure class
  3. change one variable
  4. evaluate offline
  5. verify online
  6. promote or rollback

Final architecture takeaway

The portfolio assistant became reliable only after I combined:

  • source-aware indexing across projects/skills/experience/blog
  • embedding and dimension choices validated by evals
  • hard regression gates for model/prompt/index updates
  • ClickHouse telemetry for operational decisions
  • recursion and low-confidence guardrails

That is what turned it from "cool chat feature" into an explainable production system.

Want to see how this was built?

Read the full eval methodology
