
Building Production RAG Systems

Published on January 15, 2026

Retrieval-augmented generation (RAG) has become the go-to pattern for grounding LLMs in domain-specific knowledge. But moving from a prototype to a production system requires careful attention to embedding quality, retrieval relevance, and evaluation practices.

The Core RAG Loop

At its simplest, RAG is:

  1. Indexing: Chunk documents, embed them, store in a vector database
  2. Retrieval: User query → embed → find similar chunks
  3. Generation: Combine retrieved chunks + query → LLM → response

The quality of your final response is bounded by the quality of what you retrieve.
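
The three steps can be sketched end to end in a few lines. This is a toy illustration, not a production design: the `embed` function is a bag-of-words stand-in for a real embedding model, the "vector database" is an in-memory list, and the final LLM call is left out, so the sketch stops at building the prompt. All function names and sample chunks here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing: embed every chunk and keep (chunk, vector) pairs in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Retrieval: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Generation input: combine retrieved chunks and the query into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Made-up sample chunks for illustration only.
chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Shipping is free for orders over 50 dollars.",
]
index = build_index(chunks)
top = retrieve(index, "How long do refunds take?", k=1)
prompt = build_prompt("How long do refunds take?", top)  # would be sent to an LLM
```

If `retrieve` returns the wrong chunk, no prompt wording can recover the missing fact, which is why the retrieval step sets the ceiling on answer quality.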

Embedding Models Matter

Your choice of embedding model directly impacts retrieval quality. Some considerations:

Evaluating Embeddings

Don't just assume your embedding model is good. Measure it on your own data with retrieval metrics such as MRR and Hit Rate.

Set baselines and iterate. A 10% improvement in retrieval quality often translates to a 20% or greater improvement in final answer quality.
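
Two metrics that suit this kind of baseline are hit rate at k and mean reciprocal rank (MRR), both named under Next Steps. The sketch below assumes each query has a ranked list of retrieved document IDs and a set of IDs judged relevant. The function names and the sample data are illustrative, not taken from any particular library.

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    if not results:
        return 0.0
    hits = sum(
        1
        for ranked, rel in zip(results, relevant)
        if any(doc in rel for doc in ranked[:k])
    )
    return hits / len(results)


def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant doc (0 when none is retrieved)."""
    if not results:
        return 0.0
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)


# Made-up example: three queries, ranked retrieval output, and relevance labels.
results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"e"}, {"z"}]

mrr = mean_reciprocal_rank(results, relevant)   # (1 + 1/2 + 0) / 3
hit3 = hit_rate_at_k(results, relevant, k=3)    # 2 of 3 queries hit
```

Hit rate only asks whether a relevant chunk appeared at all, while MRR also rewards ranking it near the top, so tracking both separates recall problems from ranking problems.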

The Chunking Problem

How you split documents drastically affects what you can retrieve:

Most teams end up with overlap-aware chunking: chunks of 512-1024 tokens with 20% overlap.
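
A minimal sketch of overlap-aware chunking is shown below. It assumes the document is already tokenized into a list; the example uses made-up string tokens, whereas a real pipeline would count tokens with the embedding model's own tokenizer. The function name and defaults are illustrative.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap_ratio: float = 0.2
) -> list[list[str]]:
    """Split tokens into fixed-size windows that share `overlap_ratio` of their length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Small numbers so the overlap is easy to see: windows of 10 tokens, 2 shared.
tokens = [str(i) for i in range(25)]
chunks = chunk_tokens(tokens, chunk_size=10, overlap_ratio=0.2)
```

With these settings the second chunk begins with the last two tokens of the first, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.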

Evaluation at Scale

Production RAG needs continuous evaluation:

Without this, you ship broken retrievals silently.
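
One way to make regressions loud is to compare each evaluation run against a stored baseline and fail when a metric drops. The sketch below assumes metrics are kept as plain name-to-score dictionaries. The metric names, scores, and tolerance are invented for illustration.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return metrics that are missing or fell more than `tolerance` below baseline."""
    regressions = {}
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or value < base_value - tolerance:
            regressions[name] = (base_value, value)
    return regressions


# Hypothetical numbers: MRR moved within tolerance, hit rate dropped noticeably.
baseline = {"mrr": 0.62, "hit_rate@5": 0.85}
current = {"mrr": 0.61, "hit_rate@5": 0.78}

regressions = find_regressions(baseline, current)
if regressions:
    for name, (old, new) in regressions.items():
        print(f"REGRESSION {name}: baseline={old} current={new}")
```

Running a check like this on a schedule, or on every index rebuild, turns a silent retrieval failure into a visible alert or a failed build.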

Common Pitfalls

  1. Cache staleness: Updated docs aren't reflected in the vector DB
  2. Embedding drift: The embedding model changes, and old embeddings become misaligned with new ones
  3. Context window overflow: Retrieved chunks exceed the LLM's context limit
  4. No relevance filtering: Bad retrievals make the LLM hallucinate more
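
Pitfalls 3 and 4 can be handled in one selection step between retrieval and generation. The sketch below assumes the retriever returns (chunk, similarity score) pairs. It drops chunks below a score threshold and then packs the rest into a token budget. The threshold, the budget, and the whitespace-based token count are placeholder assumptions that would need tuning against a real tokenizer and model.

```python
def select_context(
    scored_chunks: list[tuple[str, float]], min_score: float, max_tokens: int
) -> list[str]:
    """Keep the highest-scoring chunks that pass the threshold and fit the budget."""
    selected = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    for chunk, score in ranked:
        if score < min_score:
            break  # sorted descending, so every later chunk is also too weak
        cost = len(chunk.split())  # crude token estimate (assumption)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a smaller chunk may fit
        selected.append(chunk)
        used += cost
    return selected


# Made-up retrieval output: (chunk text, similarity score).
scored = [
    ("alpha beta gamma", 0.9),
    ("delta epsilon zeta eta theta", 0.8),
    ("iota kappa", 0.7),
    ("lambda mu", 0.2),
]
context = select_context(scored, min_score=0.5, max_tokens=5)
```

If nothing passes the threshold, the function returns an empty list, which gives the application a clear signal to answer "not found" instead of prompting the LLM with irrelevant text.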

Next Steps

Start with a small golden dataset (50-100 examples), measure your MRR and Hit Rate, then iterate on the embedding model and chunking strategy. Don't scale to millions of documents until retrieval quality is locked in.