
AI in Production: Lessons From Shipping to Real Users

Our first AI feature hallucinated a refund policy that did not exist. A customer followed it. Here is what we learned about putting language models in front of real people.

Mar 15, 2026 · Desmond Tatilian

Our first AI-powered feature was a customer support agent. It answered questions about our SaaS platform using documentation as context. On paper, the architecture was solid: RAG pipeline, grounded responses, citation links.

In practice, the agent told a customer they could get a full refund within 90 days. Our refund policy was 30 days. The customer asked for a refund on day 45, quoted the AI, and was understandably upset when we could not honor what our own system had told them.

That incident cost us a customer and taught us more about production AI than six months of prompt engineering ever did.

[Image: LaunchThatBot system architecture]

The gap between demo and production

Every AI feature works in demos. You cherry-pick the inputs, you show the impressive outputs, and you skip the edge cases. The gap between "works in a demo" and "works in production" is where most AI projects fail.

Here is what the gap actually looks like:

  • Hallucination under pressure. When the model cannot find a direct answer in context, it improvises. In a demo, you pick questions with clear answers. In production, users ask ambiguous questions and expect precise answers.
  • Context window limits. Your documentation fits in the context window today. In six months, it has grown 3x and you are silently dropping critical sections.
  • Latency expectations. Users tolerate a 2-second response in a chat interface. They do not tolerate it in a search bar. Same model, different UX contract.
  • Failure modes nobody tested. What happens when the API is slow? When the model returns an empty response? When the user sends the same message 50 times? Every edge case becomes a production incident.

What we built differently the second time

After the refund incident, we rebuilt our AI integration layer with a different philosophy: the model is a tool, not an authority.

Structured output, not free text

Instead of letting the model generate arbitrary responses, we define output schemas:

import { z } from "zod";

const supportResponseSchema = z.object({
  answer: z.string(),
  confidence: z.enum(["high", "medium", "low"]),
  sources: z.array(z.object({
    title: z.string(),
    url: z.string(),
    relevance: z.number(),
  })),
  requiresHumanReview: z.boolean(),
  suggestedActions: z.array(z.string()),
});

The model fills in structured fields. A confidence score below "high" triggers human review before the response is sent. Every answer includes source citations that the user can verify.
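The parse step itself is the first guardrail: raw model output that does not match the schema never becomes a response. In production, zod's safeParse does this for us; here is a dependency-free sketch of the same gate (the helper name and trimmed-down field set are illustrative):

```typescript
type Confidence = "high" | "medium" | "low";

interface SupportResponse {
  answer: string;
  confidence: Confidence;
  requiresHumanReview: boolean;
}

// Reject anything that does not match the schema before it reaches users.
function parseSupportResponse(raw: string): SupportResponse | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // model returned non-JSON: treat as a failure, not a response
  }
  const r = data as Partial<SupportResponse>;
  const confidenceOk =
    r.confidence === "high" || r.confidence === "medium" || r.confidence === "low";
  if (
    typeof r.answer !== "string" ||
    !confidenceOk ||
    typeof r.requiresHumanReview !== "boolean"
  ) {
    return null;
  }
  return r as SupportResponse;
}
```

A null here is not an error to hide: it routes to the same human-escalation path as a low-confidence answer.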

Guardrails at the application layer

We do not trust the model to stay in bounds. We verify:

async function processAIResponse(response: AIResponse) {
  // Anything below high confidence goes to a person first.
  if (response.confidence !== "high") {
    return escalateToHuman(response);
  }

  const policyCheck = await validateAgainstPolicies(response.answer);
  if (policyCheck.violations.length > 0) {
    return escalateToHuman(response, policyCheck.violations);
  }

  return response;
}

The policy validation layer checks the response against known business rules. If the AI mentions refund timelines, pricing, or legal terms, the system verifies those claims against a source-of-truth database before sending. Any mismatch routes to a human.
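A simplified sketch of what that check can look like. The rules table and regex triggers here are illustrative stand-ins for our real source-of-truth database:

```typescript
interface PolicyRule {
  name: string;
  pattern: RegExp;  // topic the model might make claims about
  expected: RegExp; // what the source of truth actually says
}

// Illustrative rules; production loads these from a database.
const rules: PolicyRule[] = [
  {
    name: "refund-window",
    pattern: /refund/i,
    expected: /\b30[- ]day/i, // the real policy: 30 days
  },
];

function validateAgainstPolicies(answer: string): { violations: string[] } {
  const violations: string[] = [];
  for (const rule of rules) {
    // If the answer touches a sensitive topic but does not match the
    // source-of-truth wording, flag it for human review.
    if (rule.pattern.test(answer) && !rule.expected.test(answer)) {
      violations.push(rule.name);
    }
  }
  return { violations };
}
```

The key property: an answer that never mentions refunds sails through, while "full refund within 90 days" gets caught before a customer ever sees it.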

Streaming with checkpoints

For long responses, we stream tokens to the user but run validation on complete sentences:

async function* streamWithCheckpoints(stream: AsyncIterable<string>) {
  let buffer = "";
  for await (const chunk of stream) {
    buffer += chunk;
    if (isSentenceBoundary(buffer)) {
      const validated = await quickValidate(buffer);
      if (!validated.safe) {
        yield "[Checking this response...]";
        return escalateToHuman(buffer);
      }
      yield buffer;
      buffer = "";
    }
  }
  if (buffer) yield buffer; // flush any trailing partial sentence
}

Users see the response building in real-time, but the system catches problematic claims before they are fully rendered.
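The boundary check does not need to be clever, just conservative. A minimal sketch (real sentence segmentation would also handle abbreviations, decimals, and markdown):

```typescript
// Flush on terminal punctuation, optionally followed by a closing
// quote or bracket, at the end of the buffer.
function isSentenceBoundary(buffer: string): boolean {
  return /[.!?]["')\]]?$/.test(buffer.trim());
}
```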

Multi-model strategy

We do not use one model for everything. Different tasks have different requirements:

  • GPT-4o for complex reasoning tasks where accuracy matters more than speed
  • Claude for long-context analysis where the full conversation history matters
  • Smaller models for classification, routing, and simple extraction where latency matters most

The routing layer decides which model handles each request based on the task type, not the price:

function selectModel(task: AITask): ModelConfig {
  if (task.requiresReasoning) return models.gpt4o;
  if (task.contextLength > 50000) return models.claude;
  if (task.latencySensitive) return models.fast;
  return models.default;
}

Cost management

AI costs scale with usage, and usage is unpredictable. We learned this when a single user triggered 400 API calls in one session by rapidly asking follow-up questions.

Our cost controls:

  • Per-user rate limiting — not just API-level, but per-feature
  • Response caching — identical questions within a time window return cached answers
  • Token budgets — each workspace has a monthly token allocation that prevents runaway costs
  • Fallback chains — if the primary model is slow or expensive, downgrade gracefully
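The per-feature limiter is the piece that would have stopped the 400-call session. A sketch of a sliding-window limiter kept in memory; the key shape and limits are illustrative, and production would back this with Redis or similar:

```typescript
// Sliding-window limiter keyed by user + feature, so one chatty
// feature cannot exhaust a user's entire budget.
class RateLimiter {
  private hits = new Map<string, number[]>();

  constructor(
    private maxCalls: number,
    private windowMs: number,
  ) {}

  allow(userId: string, feature: string, now = Date.now()): boolean {
    const key = `${userId}:${feature}`;
    const recent = (this.hits.get(key) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    if (recent.length >= this.maxCalls) {
      this.hits.set(key, recent);
      return false; // over budget: serve a cached answer or queue the call
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```

A rejected call does not have to mean an error page; it can fall through to the response cache or a cheaper model in the fallback chain.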

What we tell every team shipping AI

  1. Ship the guardrails before the feature. Policy validation, human escalation, and structured output are not post-launch improvements. They are prerequisites.
  2. Log everything. Every prompt, every response, every confidence score. You cannot improve what you cannot measure, and you cannot debug production incidents without logs.
  3. Set user expectations. "AI-assisted" is better framing than "AI-powered." Users who understand the model is a tool — not an oracle — handle edge cases more gracefully.
  4. Plan for the model to be wrong. Not occasionally wrong. Regularly wrong in ways you did not predict. Your architecture should assume failure and handle it gracefully.
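The cheapest way to enforce point 2 is a wrapper that every model call must go through. A sketch, with an in-memory array standing in for a real log sink:

```typescript
interface AICallLog {
  model: string;
  prompt: string;
  response: string;
  confidence: string;
  latencyMs: number;
  timestamp: number;
}

const aiLogs: AICallLog[] = []; // stand-in for a real log sink

// Wrap every model call so nothing ships unlogged.
async function loggedCall(
  model: string,
  prompt: string,
  call: (prompt: string) => Promise<{ text: string; confidence: string }>,
): Promise<{ text: string; confidence: string }> {
  const start = Date.now();
  const result = await call(prompt);
  aiLogs.push({
    model,
    prompt,
    response: result.text,
    confidence: result.confidence,
    latencyMs: Date.now() - start,
    timestamp: start,
  });
  return result;
}
```

Because the wrapper is the only sanctioned entry point, debugging an incident starts with a query over these records instead of guesswork.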

The refund incident was expensive. But it taught us that production AI is not about having the best model. It is about building the system around the model that catches its mistakes before they reach users.

Want to see how this was built?

See LaunchThatBot
