
From Prompt to Production: Shipping LLM Features Reliably

Most LLM demos die in staging. The gap between 'it works in the notebook' and 'it handles real users' is bigger than people expect. Here is how I close that gap.

Miraz · 15 June 2026 · 2 min read

Getting an LLM to produce a convincing demo takes an afternoon. Getting it to work reliably in production for real users — with edge cases, rate limits, malformed outputs, and unexpected inputs — takes weeks. Here is what I have learned.

The demo-to-production gap

In a demo, you control the inputs. In production, users will ask the model to do things you never imagined, in languages you did not plan for, with context that breaks your carefully crafted prompt.

The four failure modes I see most:

Output format breaks — the model returns JSON with extra prose, or wraps it in markdown code blocks
Context length exceeded — your prompt plus the user input plus the history blows past the limit
Latency spikes — a single slow API call kills perceived performance
Hallucinations at scale — edge cases that never appeared in testing appear constantly at volume
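The first failure mode is the easiest to defend against in code. Below is a minimal sketch, not a definitive implementation: `extractJson` is a hypothetical helper name, and it assumes the reply contains a single JSON object, possibly wrapped in a markdown code fence or surrounded by prose.

```typescript
// Three backticks, built at runtime so this snippet can itself sit inside a code fence.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE, 'i');

// Hypothetical helper: pull one JSON object out of a raw model reply.
function extractJson(raw: string): unknown {
  // Prefer the contents of a markdown code fence if the model added one.
  const fenced = raw.match(FENCED_BLOCK);
  const candidate = fenced ? fenced[1] : raw;

  // Then keep only the span between the first "{" and the last "}".
  const start = candidate.indexOf('{');
  const end = candidate.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model reply');
  }

  // JSON.parse still throws on malformed JSON, so callers should catch.
  return JSON.parse(candidate.slice(start, end + 1));
}

// Extra prose around the object is ignored.
const cleaned = extractJson('Sure! Here you go: {"summary": "ok"} Let me know if you need more.');
console.log(cleaned);
```

This only recovers the JSON; it says nothing about whether the fields are the ones you expect, which is why schema validation still has to follow.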

Structured output first

The single most impactful change: use structured output. Whether that is OpenAI's response_format, Anthropic's tool use, or a Zod schema with retry logic — enforce the shape of the response before you do anything with it.

import { z } from 'zod';

const ReplySchema = z.object({
  summary: z.string().max(200),
  tags: z.array(z.string()).max(5),
  confidence: z.number().min(0).max(1),
});

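// callLLM is a placeholder for your own client; it resolves to the raw text of the reply
declare function callLLM(prompt: string): Promise<string>;

// If the model returns something that does not match, retry once with the error
async function getStructuredReply(prompt: string): Promise<z.infer<typeof ReplySchema>> {
  let attemptPrompt = prompt;
  let lastError = '';
  for (let attempt = 0; attempt < 2; attempt++) {
    const raw = await callLLM(attemptPrompt);
    try {
      // JSON.parse throws on malformed JSON, so it has to sit inside the try
      const parsed = ReplySchema.safeParse(JSON.parse(raw));
      if (parsed.success) return parsed.data;
      lastError = parsed.error.message;
    } catch (err) {
      lastError = 'Invalid JSON: ' + (err as Error).message;
    }
    // Feed the validation error back so the model can correct itself
    attemptPrompt = `${prompt}\n\nYour previous reply was rejected (${lastError}). Return only valid JSON.`;
  }
  throw new Error('Schema mismatch: ' + lastError);
}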

Streaming for perceived performance

If your response takes more than 2 seconds, stream it. Users tolerate watching text arrive gradually far better than staring at a blank screen. In Next.js, the Vercel AI SDK or a native ReadableStream makes this straightforward.
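As a rough illustration of the native route, here is a sketch that wraps an async iterable of text chunks in a web-standard ReadableStream and returns it as a Response. The names `fakeTokens`, `toReadableStream`, and `handleRequest` are hypothetical, and the token source is a stand-in for whatever streaming call your provider exposes.

```typescript
// Hypothetical token source; replace with your provider's streaming call.
async function* fakeTokens(): AsyncGenerator<string> {
  for (const token of ['Streaming ', 'keeps ', 'users ', 'engaged.']) {
    yield token;
  }
}

// Turn any async iterable of text chunks into a byte stream.
function toReadableStream(chunks: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    // pull() runs only when the consumer wants more data, so a slow client
    // does not force us to buffer the whole reply.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
    // If the client disconnects, stop pulling tokens from the model.
    async cancel() {
      await iterator.return?.();
    },
  });
}

// In a Next.js App Router route handler, a function like this would be exported as GET.
function handleRequest(): Response {
  return new Response(toReadableStream(fakeTokens()), {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}
```

The client can then read the body incrementally and append each chunk to the UI as it arrives.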

Cost and rate limit budgets

Set a per-user, per-session, and per-day token budget before you go live. Not as a nicety — as an architectural requirement. Build the counter before you need it.
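To make the counter concrete, here is a minimal in-memory sketch. The `TokenBudget` class, the scope names, and the limits are all illustrative assumptions; a real deployment with more than one server would keep the counts in a shared store rather than a local Map.

```typescript
type Scope = 'user' | 'session' | 'day';

class TokenBudget {
  private readonly limits: Record<Scope, number>;
  private readonly used = new Map<string, number>();

  constructor(limits: Record<Scope, number>) {
    this.limits = limits;
  }

  // Records the spend and returns true only if every scope stays within its limit.
  // Nothing is recorded when any scope would be exceeded.
  tryConsume(keys: Record<Scope, string>, tokens: number): boolean {
    const scopes = Object.keys(this.limits) as Scope[];
    const overLimit = scopes.some(
      (scope) => (this.used.get(`${scope}:${keys[scope]}`) ?? 0) + tokens > this.limits[scope],
    );
    if (overLimit) return false;

    for (const scope of scopes) {
      const key = `${scope}:${keys[scope]}`;
      this.used.set(key, (this.used.get(key) ?? 0) + tokens);
    }
    return true;
  }
}

// Illustrative limits only, not recommendations.
const budget = new TokenBudget({ user: 100_000, session: 8_000, day: 20_000 });
const keys = { user: 'u1', session: 's1', day: 'u1:2026-06-15' };

console.log(budget.tryConsume(keys, 5_000)); // true
console.log(budget.tryConsume(keys, 5_000)); // false: the session would reach 10,000 of 8,000
```

Checking all scopes before recording anything keeps a rejected request from eating into the budgets it did fit within.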

Eval before deploy

A production LLM feature needs an eval suite the same way a service needs integration tests. Even 20 hand-crafted cases (input, expected output, pass/fail criteria) catch regressions before users do.
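An eval suite does not need a framework to start with. The sketch below is one possible shape, with hypothetical names throughout (`EvalCase`, `runEvals`, `stubModel`); the stub stands in for the real feature so the example runs on its own, and the three cases are made up for illustration.

```typescript
interface EvalCase {
  name: string;
  input: string;
  // Pass/fail criterion: returns true when the output is acceptable.
  passes: (output: string) => boolean;
}

interface EvalReport {
  passed: number;
  failed: string[];
}

async function runEvals(
  model: (input: string) => Promise<string>,
  cases: EvalCase[],
): Promise<EvalReport> {
  const report: EvalReport = { passed: 0, failed: [] };
  for (const evalCase of cases) {
    let ok = false;
    try {
      ok = evalCase.passes(await model(evalCase.input));
    } catch {
      ok = false; // a thrown error counts as a failed case, not a crashed run
    }
    if (ok) report.passed++;
    else report.failed.push(evalCase.name);
  }
  return report;
}

// Stand-in for the feature under test; swap in the real LLM call.
async function stubModel(input: string): Promise<string> {
  return JSON.stringify({ summary: input.slice(0, 200), tags: [], confidence: 0.5 });
}

const cases: EvalCase[] = [
  { name: 'returns valid JSON', input: 'Refund request', passes: (out) => { JSON.parse(out); return true; } },
  { name: 'summary stays within 200 chars', input: 'x'.repeat(500), passes: (out) => JSON.parse(out).summary.length <= 200 },
  { name: 'low confidence on empty input', input: '', passes: (out) => JSON.parse(out).confidence < 0.3 },
];

// With the stub above, the third case fails, which is what a regression would look like.
runEvals(stubModel, cases).then((report) => console.log(report));
```

Running this in CI and failing the build when `failed` is non-empty gives prompt changes the same gate as code changes.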

The LLM is not the hard part. The infrastructure around it — observability, schema validation, fallbacks, cost controls, evals — is what separates demos from products.
