
Building RAG Pipelines That Actually Work in Production

A practical walkthrough of Retrieval-Augmented Generation, from naive vector search to production-grade pipelines with re-ranking, chunking strategies, and evaluation.

AI
LLMs
LangChain
Python

Everyone is building RAG pipelines. Very few are building ones that work reliably. After shipping RAG-powered features at two companies, I've learned what it takes to bridge the gap between a demo and a system you can trust.

Why RAG Over Fine-Tuning?

Fine-tuning bakes knowledge into model weights. It's expensive, slow to update, and prone to hallucination about specifics. RAG keeps your knowledge external in a retrievable store: cheaper to update, auditable, and you can point to exactly which documents informed an answer. For most enterprise use cases, RAG wins.

The Chunking Problem

Your retrieval is only as good as your chunks. I've tried fixed-size (512 tokens), sentence-level, and recursive character splitting. The winner? Semantic chunking: splitting on topic boundaries using embedding similarity. It preserves context and dramatically improves retrieval precision. When a dedicated semantic splitter is overkill, recursive splitting with structure-aware separators is a solid baseline.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Recursive chunking with structure-aware separators
# (a practical approximation of topic-boundary splitting);
# `documents` is assumed to be loaded earlier via a document loader
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(documents)

# Embed and index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)

# Retrieve with MMR to diversify the candidate set
# (re-ranking proper happens in a later stage)
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance
    search_kwargs={"k": 8, "fetch_k": 20},
)
```
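
The semantic-chunking idea itself can be sketched in plain Python: embed consecutive sentences and start a new chunk wherever similarity to the previous sentence drops. This is a toy illustration, not production code: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the `0.2` threshold is an arbitrary example value you would tune on your corpus.

```python
import math

def toy_embed(text: str) -> dict:
    # Bag-of-words "embedding" standing in for a real embedding model
    vec: dict = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    # Cosine similarity over sparse word-count vectors
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.2):
    """Group consecutive sentences into chunks; open a new chunk
    when similarity to the previous sentence falls below threshold."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

Swap `toy_embed` for a real embedding model and this becomes the topic-boundary splitter described above; libraries like LangChain ship similar logic, but it's worth understanding the ten lines underneath.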

Re-Ranking Changes Everything

Vector similarity is a blunt instrument. Adding a cross-encoder re-ranker (like Cohere Rerank or a local cross-encoder model) on top of initial retrieval improved answer quality by roughly 30% in my evaluations. The pattern: retrieve broadly (top-20), re-rank precisely (top-5), then feed to the LLM.
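
The retrieve-broadly-then-rerank-precisely pattern is a few lines of glue. In this sketch, `retriever` and `score` are hypothetical interfaces: in practice `retriever` would wrap your vector store and `score` would call a cross-encoder (e.g. a sentence-transformers `CrossEncoder` or the Cohere Rerank API).

```python
def retrieve_then_rerank(query, retriever, score, fetch_k=20, top_k=5):
    """Retrieve a broad candidate set, then re-rank it with a
    precise (but slower) scorer and keep only the best few.

    retriever(query, k) -> list of candidate documents
    score(query, doc)   -> relevance score (cross-encoder stand-in)
    """
    candidates = retriever(query, fetch_k)          # broad: top-20
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]                           # precise: top-5
```

The asymmetry is the point: the cheap bi-encoder casts a wide net so recall stays high, and the expensive cross-encoder only ever scores `fetch_k` documents, so latency stays bounded.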

Evaluation: The Part Everyone Skips

You can't improve what you don't measure. My evaluation stack:

  • Retrieval precision: are the right documents being pulled? Measured with labeled query-document pairs
  • Answer faithfulness: does the response stay grounded in retrieved context? I use LLM-as-judge with citation checking
  • Latency budgets: P95 retrieval under 200ms, full pipeline under 3 seconds
  • Regression testing: a golden set of 50 queries where I know the expected answers, run on every pipeline change
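
The golden-set regression check is the easiest piece to automate. A minimal sketch, assuming `pipeline` is any callable from query to answer string and the golden set pairs each query with a substring the answer must contain (real checks would use LLM-as-judge or labeled retrieval hits instead of substring matching):

```python
def run_regression(pipeline, golden_set):
    """Run every golden query through the pipeline.

    golden_set: list of (query, must_contain) pairs.
    Returns the list of failing queries; empty means all pass.
    """
    failures = []
    for query, expected in golden_set:
        answer = pipeline(query)
        if expected.lower() not in answer.lower():
            failures.append(query)
    return failures
```

Wire this into CI so any chunking or prompt change that breaks a known-good answer fails the build before it ships.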

Key Takeaway

RAG is not a solved problem. It's an engineering discipline. The difference between a demo and production is chunking strategy, re-ranking, evaluation, and relentless iteration on edge cases. Start simple, measure everything, and optimize the retrieval layer before touching the generation prompt.