Air Automations
RAG · May 21, 2026

Your RAG Hits 0.9 Recall in Eval and Dies in Production. Here's Why.

Benchmark recall lies. Here's how we diagnose retrieval quality in production RAG systems before users start filing tickets at 2am.

Eval sets lie because real users don't write like your eval set

Your benchmark queries were probably generated from your documentation. You chunked a support guide, embedded it, then wrote queries that mirror the structure and language of those chunks. The model nails it. Recall at 5 is 0.92. You ship.

Then on day three of production, a user types 'i bought this last tuesday and it broke can i get my money back' and your system retrieves three unrelated chunks about warranty coverage before it finds the refund policy section. The LLM hallucinates a restocking fee that doesn't exist. The user gets mad. You get a Slack message at 2am.

The problem isn't your chunking strategy or your embedding model. It's distribution shift. Synthetic eval queries are clean, specific, and lexically close to the source documents. Real users write fragments, pronouns, misspellings, and questions that tangle multiple intents. A query like 'refund policy' is nothing like 'i bought this last tuesday and it broke can i get my money back,' but both should retrieve the same chunk. Your 0.9 recall on the eval set was measuring performance on a distribution that doesn't exist in production.

The worst part: this gap is nearly invisible until you look at your logs. A 0.9 eval score can mask a 0.35 real-world recall rate, and you won't know it until customers tell you.

The five failure modes we see in every production RAG audit

When we audit a failing RAG system, we find the same patterns again and again. Knowing them helps you spot the failures before they cascade.

  • Embedding mismatch. Your query embeddings come from a model trained on short, structured queries (like those in MTEB benchmarks). Your document embeddings come from a model trained on longer, messier text. Similarity scores between the two are noisy, so a production query in a slightly different dialect or style lands in a different region of vector space than you'd expect.
  • Chunk boundary failures. The answer spans two chunks. Neither chunk alone scores high enough to be retrieved. The LLM sees partial context, guesses the rest, and sounds confident about it.
  • Stale index. Docs were updated in your CMS last week. Embeddings weren't recomputed. The retriever serves outdated context, the LLM's training knowledge contradicts it, and the answer blends the two and presents the result as fact.
  • Lost-in-the-middle. You retrieve 10 chunks by semantic similarity, but the answer is in chunk 7. Transformer-based LLMs tend to use context at the start and end of the prompt more reliably than context in the middle. The model sees the right chunk but ignores it in favor of noisier chunks at the edges.
  • Metadata filter bugs. Your tenant_id filter is silently mismatched between the indexing pipeline and the query path. A user gets zero results, the system returns an empty context list, and the LLM hallucinates a helpful answer instead of saying 'I don't have that information.'

Each of these is a separate failure mode. Each produces silent degradation until users notice. Most production RAG systems have at least two of these running at once.
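
The empty-context case from the metadata filter bullet is the easiest of these to guard against in code. Here is a minimal sketch, assuming retrieval hands you a list of scored chunks. The `Chunk` type, the `build_prompt_or_refuse` helper, and the `min_score` cutoff of 0.3 are all illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

REFUSAL = "I don't have that information."


@dataclass
class Chunk:
    chunk_id: str
    text: str
    score: float  # similarity score from the retriever


def build_prompt_or_refuse(
    query: str, chunks: List[Chunk], min_score: float = 0.3
) -> Tuple[Optional[str], Optional[str]]:
    """Return (prompt, None) when usable context exists, else (None, refusal).

    min_score is a hypothetical cutoff; tune it against your own score logs.
    """
    usable = [c for c in chunks if c.score >= min_score]
    if not usable:
        # Never call the LLM with an empty context list.
        return None, REFUSAL
    context = "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in usable)
    prompt = (
        "Answer using only the context below. "
        f"If the answer is not there, reply: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return prompt, None
```

Logging every refusal alongside the filters that were applied makes a silently mismatched tenant_id show up as a spike in refusals rather than as confident wrong answers.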

Instrument retrieval before you instrument the LLM

You can't diagnose what you don't measure. Start by logging the retrieval step independently from generation, so you can see which part is failing.

Log per-request: the original query, retrieved chunk IDs, similarity scores, and rerank scores if you're using a reranker. Tag every generation with whether the retrieved context actually contained the answer — use an LLM-as-judge on a sample of requests to spot when retrieval is correct but generation is hallucinating. Track score distributions over time. A dropping mean similarity score is your early warning signal for index staleness or query distribution drift.
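
One way to structure that per-request log, plus a crude drift check on mean top similarity score. This is a sketch under stated assumptions: the field names, the `RetrievalLog` class, and the 0.05 tolerance are placeholders to adapt, not a standard schema.

```python
import json
import statistics
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RetrievalLog:
    query: str
    chunk_ids: List[str]
    similarity_scores: List[float]
    rerank_scores: Optional[List[float]] = None  # None when no reranker ran
    context_had_answer: Optional[bool] = None  # filled in later by a judge
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def mean_top_score(logs: List[RetrievalLog]) -> float:
    """Average of the best similarity score per request."""
    tops = [max(log.similarity_scores) for log in logs if log.similarity_scores]
    return statistics.fmean(tops) if tops else 0.0


def score_drift(
    baseline: List[RetrievalLog],
    current: List[RetrievalLog],
    tolerance: float = 0.05,
) -> bool:
    """True when the mean top score dropped by more than `tolerance`."""
    return mean_top_score(baseline) - mean_top_score(current) > tolerance
```

Comparing this week's window against a fixed baseline window is the simplest version. A rolling comparison works too, but it can hide slow drift.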

Set a separate latency budget for each stage: query embedding ~50ms, vector search ~100ms, reranking ~200ms, LLM generation 1–3 seconds. If retrieval latency creeps into the generation budget, users are waiting longer on search without getting a better answer for it.
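
A small timer is enough to enforce per-stage budgets. The sketch below uses the numbers above, taking 3 seconds as the generation ceiling. The `StageTimer` class and the stage names are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict

# Budgets in milliseconds, taken from the figures in the text.
BUDGET_MS = {"embed": 50, "search": 100, "rerank": 200, "generate": 3000}


class StageTimer:
    def __init__(self) -> None:
        self.elapsed_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        """Time one pipeline stage, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.elapsed_ms[name] = (time.perf_counter() - start) * 1000

    def over_budget(self, budgets: Dict[str, float] = BUDGET_MS) -> Dict[str, float]:
        """Stages whose measured time exceeded their budget."""
        return {
            name: ms
            for name, ms in self.elapsed_ms.items()
            if ms > budgets.get(name, float("inf"))
        }
```

Wrap each step as `with timer.stage("search"): ...` and log `timer.over_budget()` per request, so a slow stage shows up by name instead of as a vague end-to-end regression.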

Watch cost carefully. A cross-encoder reranker can roughly double your retrieval cost. Gate it: only rerank the top 20 results, or only rerank queries with low embedding confidence. Costs you don't monitor tend to resurface later as churn, so measure retrieval costs explicitly and trade accuracy for efficiency on purpose, not by accident.
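
Both gates fit in a few lines. In this sketch, `rerank_fn` stands in for whatever cross-encoder you call, and the 0.75 confidence threshold is an assumed value you would tune from your own score distributions.

```python
from typing import Callable, List, Tuple


def maybe_rerank(
    query: str,
    candidates: List[Tuple[str, float]],
    rerank_fn: Callable[[str, List[str]], List[float]],
    confidence_threshold: float = 0.75,
    max_rerank: int = 20,
) -> List[str]:
    """Return chunk IDs in final order.

    candidates: (chunk_id, embedding_score) pairs, sorted best-first.
    rerank_fn: hypothetical callable returning one score per chunk ID.
    """
    if not candidates:
        return []
    ids = [chunk_id for chunk_id, _ in candidates]
    # Gate 1: skip the reranker when the top embedding score is already high.
    if candidates[0][1] >= confidence_threshold:
        return ids
    # Gate 2: only pay to rerank the head of the list.
    head, tail = ids[:max_rerank], ids[max_rerank:]
    scores = rerank_fn(query, head)
    reranked = [
        chunk_id
        for chunk_id, _ in sorted(zip(head, scores), key=lambda p: p[1], reverse=True)
    ]
    return reranked + tail
```

Counting how often the first gate skips the reranker gives you a direct estimate of the cost you are saving.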

Build an eval set from your production logs, not your imagination

Close the eval/production gap by grounding your evals in real traffic. Weekly, sample 200–500 queries from your logs, stratified by intent (support questions, product lookups, policy questions, etc.). Have a human label the canonical chunks that should be retrieved for each. Run that set nightly against your current index and alert on regression.
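
The weekly sample can be drawn with a proportional stratified sampler. This sketch assumes each logged row is a dict that already carries an `intent` label (from a classifier or a routing tag). That field name and the one-per-intent floor are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(
    logged_queries: List[Dict[str, str]], total: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Sample roughly `total` rows, proportional to each intent's share.

    Every intent gets at least one row so rare intents are not dropped.
    """
    rng = random.Random(seed)  # fixed seed keeps the weekly sample reproducible
    by_intent = defaultdict(list)
    for row in logged_queries:
        by_intent[row["intent"]].append(row)

    sample = []
    n = len(logged_queries)
    for rows in by_intent.values():
        quota = max(1, round(total * len(rows) / n))
        sample.extend(rng.sample(rows, min(quota, len(rows))))
    return sample
```

Because of rounding and the floor, the result can be a few rows above or below `total`, which is fine for a 200–500 query sample.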

The first 100 queries should be labeled by hand. This is non-negotiable. It teaches you the failure modes specific to your domain and grounds your judgment. After that, you can use an LLM-as-judge with spot checks, but those first hundred are your ground truth.

Track hard negatives: queries that look answerable from your corpus but aren't, or queries where the system should refuse. 'What's the CEO's personal phone number?' should retrieve zero chunks and the LLM should decline to answer. If your eval set doesn't have these, you aren't measuring the failure your users will actually hit.

Re-run nightly. When your eval recall drops from 0.88 to 0.84 overnight, something changed — a bug in the indexing pipeline, stale embeddings, or a metadata filter that broke. You'll catch it before users do.
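
The nightly check itself is small. Below is a sketch of mean recall at k over a labeled set, with a regression test against the previous night's score. The `retrieve` callable is a stand-in for your own search function, and the 0.02 alert threshold is an assumption.

```python
from typing import Callable, List, Set, Tuple


def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the labeled chunks that appear in the top k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall(
    eval_set: List[Tuple[str, Set[str]]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Average recall@k; queries with no labeled chunks are skipped here
    and should be scored separately as hard negatives."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in eval_set if rel]
    return sum(scores) / len(scores) if scores else 0.0


def regressed(previous: float, current: float, max_drop: float = 0.02) -> bool:
    """True when recall fell by more than `max_drop` since the last run."""
    return previous - current > max_drop
```

With these defaults, a drop from 0.88 to 0.84 trips the alert, while a drop from 0.88 to 0.87 does not.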

When retrieval isn't the problem — it's the architecture

Sometimes you optimize retrieval and it's still fragile because the problem isn't semantic search. Queries that need joins, aggregations, or real-time freshness don't belong in a vector store. A user asking 'how much will shipping cost for this order to my address?' needs a SQL tool call or an API lookup, not a semantic search over past docs.

Hybrid retrieval — combining BM25 lexical search, dense embeddings, and metadata filters, then fusing results with reciprocal rank fusion — works better than pure semantic search for many production questions. Structured questions land better with a tool call than with a vector index.
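
Reciprocal rank fusion needs only the ranked ID lists from each retriever, not their raw scores, which is why it works across BM25 and dense results. A minimal sketch follows. The constant `k=60` is the commonly used default, and the function name is ours.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several best-first ID lists into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])  # ["b", "a", "c"]
```

Apply metadata filters before fusion, so that every list being fused was drawn from the same candidate set.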

When the answer lives across 50 documents and requires aggregation or reasoning over multiple sources, an agent loop with structured tool calls often retrieves and reasons better than a RAG system with static top-k. An agent can iteratively search, validate, and refine. A RAG system commits to the top-k and hopes the LLM can stitch it together.

The harder diagnosis: knowing when to stop tuning retrieval and change the system shape. If your metrics plateau despite weeks of reranking tweaks and chunk optimization, ask whether you've picked the wrong architecture. That's worth a week of architecture exploration before it's worth another week of parameter tuning.

One concrete diagnostic you can run tonight

Pull 100 queries from last week's logs. For each one, manually label the chunk or chunks that should have been retrieved. Run those queries against your current index. Count how many of the 100 retrieved a correct chunk in the top 5 and divide by 100. That fraction is your real production recall at 5. If it's below 0.85, your retrieval is a bottleneck whether or not your eval score says so.
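
Scoring the diagnostic takes a few lines once the labels exist. This sketch assumes two dicts keyed by query text: `results` maps each query to its ranked chunk IDs, and `labels` maps each query to the set of correct chunk IDs. Both names are illustrative.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    results: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int = 5
) -> float:
    """Fraction of labeled queries with at least one correct chunk in the top k."""
    hits = sum(
        1
        for query, relevant in labels.items()
        if set(results.get(query, [])[:k]) & relevant
    )
    return hits / len(labels)
```

A query missing from `results` counts as a miss, which is the right default: a query that returned nothing did not retrieve the correct chunk.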

The gap between that number and your eval score is your real production risk. Close it before the next user files a ticket at 2am.