RAG — Retrieval-Augmented Generation
- Understand why RAG exists and when to use it vs fine-tuning
- Build a complete RAG pipeline from scratch
- Choose the right chunking strategy for different document types
- Identify and fix the most common RAG failure modes
The Problem RAG Solves
LLMs have a knowledge problem with three distinct dimensions, and RAG addresses all three.
Problem 1: Knowledge cutoff. Every LLM's training data ends at a specific date. Claude's training data has a cutoff; GPT-4's has a cutoff. Anything that happened after that date — a new product release, a regulatory change, a competitor announcement — simply doesn't exist to the model. Asking Claude about yesterday's earnings report will get you either a hallucinated answer or an honest "I don't know."
Problem 2: Private data. Your company's internal documentation, your codebase, your customer contracts, your product database: none of this is in any model's training data. The model has never seen it, and no prompt wording can make it aware of content it was never given.
Problem 3: Hallucination. When a model doesn't know something, it doesn't always say so. It generates plausible-sounding text regardless. Grounding a model's response in retrieved documents dramatically reduces hallucination because the model is synthesizing provided context rather than generating from memory.
Why not just fine-tune? Fine-tuning trains a new version of the model on your data. It's expensive (GPU time), slow (days to weeks), and static — once trained, the model doesn't know about documents added after the fine-tune. It's the right answer for teaching the model how to behave, not what to know. For knowledge-intensive applications, RAG is almost always the right architecture.
The RAG approach: at inference time, retrieve the most relevant documents from your data store and include them in the context window as part of the prompt. The model then synthesizes an answer grounded in those specific documents. The documents are always fresh because retrieval happens at query time, not training time.
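The "include them in the context window" step is plain string assembly. A minimal sketch, assuming the chunks have already been retrieved; `build_prompt` and the two sample sentences are hypothetical names and data chosen for illustration:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that places retrieved chunks ahead of the question."""
    # Number each chunk so the model can cite it in its answer.
    context = "\n\n".join(
        f"[Source {i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up chunks standing in for the output of a retrieval step.
retrieved = [
    "Employees accrue 1.5 vacation days per month.",
    "Unused vacation days expire on March 31.",
]
prompt = build_prompt("How many vacation days do I earn per month?", retrieved)
print(prompt)
```

Because the chunks are looked up at query time, updating the data store changes the next prompt without touching the model.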
The RAG Pipeline
RAG has two distinct phases. Understanding the separation between them is critical for debugging.
INDEXING PHASE (run once, or on document update)
══════════════════════════════════════════════════
Documents (PDF, HTML, TXT, code)
        │
        ▼
[Load & Parse]        ← extract raw text
        │
        ▼
[Chunk]               ← split into overlapping segments
        │
        ▼
[Embed]               ← convert each chunk to a vector
        │
        ▼
[Vector Store]        ← persist embeddings + metadata

QUERY PHASE (run on every user question)
══════════════════════════════════════════════════
User Question
        │
        ▼
[Embed]               ← embed the question using same model
        │
        ▼
[Similarity Search]   ← find top-k closest chunks
        │
        ▼
[Assemble Prompt]     ← question + retrieved chunks
        │
        ▼
[LLM]                 ← generate answer grounded in chunks
        │
        ▼
Answer
Every RAG failure traces back to one of these steps. The most common failures are in the chunking and retrieval steps — and they're invisible to users who just see "I don't know."
Document Loading and Chunking
Before you can embed anything, you need to split your documents into chunks that are:
- Small enough to be semantically focused (not mixing unrelated topics)
- Large enough to contain a complete thought or fact
- Overlapping enough that answers don't fall between chunk boundaries
Three Chunking Strategies
Fixed-size chunking: Split every N characters, regardless of content structure. Simple and predictable. Works poorly when sentences or paragraphs span chunk boundaries.
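A minimal sketch of fixed-size chunking with overlap, written from scratch for illustration; `fixed_size_chunks` is a hypothetical helper, not a library function:

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text every chunk_size characters, repeating `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text, so the
        # last chunk is never just a repeat of the previous overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "The refund window is 30 days. Items must be unused."
for chunk in fixed_size_chunks(sample, chunk_size=20, overlap=5):
    print(repr(chunk))
```

Printing the chunks shows the weakness described above: the cut points ignore word and sentence boundaries.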
Recursive character splitting: Try to split on natural boundaries in order: \n\n (paragraph), then \n (line), then ". " (sentence), then " " (word). Falls back to character splitting only when needed. This is the recommended default.
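The recursive idea can be sketched in plain Python. This is a simplified illustration, not the implementation any library uses: it has no overlap and drops the separator wherever a chunk boundary falls, and `recursive_split` is a name chosen for this example.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        if not part:
            continue  # skip empty parts produced by repeated separators
        if len(part) > chunk_size:
            # Flush what we have, then split the oversized part more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging neighbours
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first. It has two sentences."
)
for chunk in recursive_split(text, chunk_size=45):
    print(repr(chunk))
```

The first paragraph survives intact, and only the oversized second paragraph is broken at a sentence boundary.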
Semantic chunking: Embed each sentence, group sentences by embedding similarity. More expensive but produces the most coherent chunks. Worth it for heterogeneous document collections.
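One simple variant of semantic chunking starts a new chunk whenever two adjacent sentences are dissimilar. The sketch below assumes that variant and uses a bag-of-words vector as a stand-in for a real embedding model, so that it runs without dependencies; `toy_embed`, `semantic_chunks`, and the threshold value are illustrative choices.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk when adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(curr)) >= threshold:
            groups[-1].append(curr)   # same topic: extend the current chunk
        else:
            groups.append([curr])     # topic shift: open a new chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Vacation days accrue monthly.",
    "Unused vacation days expire in March.",
    "Deploy the server with Docker.",
    "The server restarts after each deploy.",
]
for chunk in semantic_chunks(sentences):
    print(repr(chunk))
```

With a real embedding model, each sentence costs one embedding call, which is where the extra expense comes from.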
Chunk size determines how much context surrounds each retrieved fact. Too small: chunks lack context, answers are incomplete. Too large: chunks contain multiple topics, retrieval becomes imprecise. Start with 500–1000 characters and tune empirically. For code, use 200–500 characters. For long-form prose, try 1000–1500.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # Target chunk size in characters
    chunk_overlap=160,     # 20% overlap, so an answer is less likely to be split at a boundary
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Tried in order, coarsest first
)

with open("employee_handbook.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

chunks = splitter.create_documents([raw_text])
print(f"Split into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")

# Inspect a boundary to verify overlap works (assumes the file yields at least 4 chunks)
print("\nEnd of chunk 3:")
print(repr(chunks[2].page_content[-100:]))
print("\nStart of chunk 4:")
print(repr(chunks[3].page_content[:100]))
# You should see these share content

Embeddings and Vector Databases
An embedding is a vector (list of numbers) that represents the semantic meaning of a piece of text. Two chunks about "vacation policy" will have similar embeddings; a chunk about "vacation policy" and one about "server deployment" will have very different embeddings.
At query time, you embed the user's question using the same embedding model, then find the chunks with the most similar embeddings. This is semantic search — it finds relevant content even when the query uses different words than the document.
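"Most similar" is usually measured with cosine similarity. A small sketch with hand-made three-dimensional vectors (a real model produces hundreds of dimensions); the vectors, chunk names, and the `top_k` helper are invented for illustration:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int = 2,
) -> list[tuple[str, float]]:
    """Return the k chunks whose vectors are closest to the query vector."""
    scored = [
        (name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Pretend embeddings: the first axis loosely means "time off",
# the last axis loosely means "infrastructure".
chunk_vecs = {
    "vacation policy": [0.9, 0.1, 0.0],
    "sick leave": [0.7, 0.3, 0.1],
    "server deployment": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of "How much time off do I get?"

results = top_k(query_vec, chunk_vecs, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A vector database performs the same ranking, but with an index that avoids comparing the query against every stored vector.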
Embedding model options:
| Model | Provider | Dimensions | Notes |
|-------|----------|-----------|-------|
| text-embedding-3-small | OpenAI | 1536 | Fast, cheap, good quality |
| text-embedding-3-large | OpenAI | 3072 | Higher quality, more expensive |
| all-MiniLM-L6-v2 | HuggingFace | 384 | Free, runs locally, good baseline |
| nomic-embed-text | Nomic | 768 | Free, runs locally, strong performance |
Vector database options:
| Database | Best For | Notes |
|----------|---------|-------|
| Chroma | Local development | In-process, no setup required |
| Pinecone | Production cloud | Managed, scales automatically |
| Weaviate | Production cloud | Supports hybrid search natively |
| pgvector | Postgres shops | RAG in your existing DB |
| FAISS | High-performance local | Meta's library, for large datasets |
Building a Complete RAG Pipeline
Here is a complete, working RAG pipeline using LangChain + Chroma + HuggingFace embeddings + Claude:
import anthropic
# In newer LangChain releases these classes are imported from
# langchain_text_splitters and langchain_community instead.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings


# ── INDEXING PHASE ──────────────────────────────────────────────
def build_index(filepath: str, collection_name: str = "documents") -> Chroma:
    """Load a text file, chunk it, embed it, and store in Chroma."""
    # 1. Load
    with open(filepath, "r", encoding="utf-8") as f:
        raw_text = f.read()

    # 2. Chunk
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=160,
    )
    chunks = splitter.create_documents(
        [raw_text],
        metadatas=[{"source": filepath}],
    )
    print(f"Created {len(chunks)} chunks from {filepath}")

    # 3. Embed and store
    embedding_model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_model,
        collection_name=collection_name,
        persist_directory="./chroma_db",  # Re-running adds the chunks again to this store
        # Use cosine distance so that (1 - score) at query time is a cosine similarity.
        # Chroma's default metric is squared L2, for which that conversion does not hold.
        collection_metadata={"hnsw:space": "cosine"},
    )
    print(f"Stored {vectorstore._collection.count()} chunks in vector DB")
    return vectorstore

# ── QUERY PHASE ──────────────────────────────────────────────────
def query_rag(
    question: str,
    vectorstore: Chroma,
    k: int = 4,                    # Number of chunks to retrieve
    score_threshold: float = 0.3,  # Minimum similarity score
) -> str:
    """Retrieve relevant chunks and generate a grounded answer."""
    # 1. Retrieve top-k similar chunks
    results = vectorstore.similarity_search_with_score(question, k=k)

    # Chroma returns a distance (lower = more similar). Treating 1 - distance
    # as a similarity is only meaningful if the collection uses cosine distance.
    relevant = [(doc, score) for doc, score in results if score < (1 - score_threshold)]
    if not relevant:
        return "I don't have enough relevant information in my documents to answer that question."

    # 2. Build context from retrieved chunks
    context_parts = []
    for i, (doc, score) in enumerate(relevant, 1):
        source = doc.metadata.get("source", "unknown")
        context_parts.append(
            f"[Source {i} | {source} | similarity: {1 - score:.2f}] {doc.page_content}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # 3. Call LLM with retrieved context
    client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=(
            "You are a helpful assistant that answers questions based solely on the provided context.\n"
            "Rules:\n"
            "- Only use information from the provided context\n"
            "- If the context doesn't contain the answer, say \"I don't have that information\"\n"
            "- Cite which source(s) you used (e.g., \"According to Source 2...\")\n"
            "- Do not add information from your training data"
        ),
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }
        ],
    )
    return response.content[0].text

# ── MAIN ─────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Index a document
    vectorstore = build_index("your_document.txt")

    # Interactive Q&A
    while True:
        question = input("\nQuestion (or 'quit'): ").strip()
        if question.lower() == "quit":
            break
        answer = query_rag(question, vectorstore)
        print(f"\nAnswer: {answer}")

Common RAG Failure Modes
Understanding why RAG fails is as important as knowing how to build it. Here are the four most common failure modes:
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| "I don't have that information" but answer is in docs | Retrieval miss — chunk not ranked in top-k | Increase k, adjust chunk size, try hybrid search |
| Correct retrieval but wrong answer | Context stuffing — too many irrelevant chunks dilute the relevant ones | Decrease k, add score filtering, use reranking |
| Answer contradicts the document | Answer drift — model "knew" something from training that conflicts with context | Strengthen system prompt: "Use ONLY the provided context" |
| Answer cuts off mid-thought | Chunk boundary issue — the answer spans two chunks but only one was retrieved | Increase chunk overlap, increase k |
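The first two rows of the table differ in whether the needed chunk was retrieved at all. A rough triage sketch, assuming you know a literal string that the correct answer must contain; substring matching is crude, and `triage` is a hypothetical helper rather than part of any library:

```python
def triage(expected_fact: str, retrieved_chunks: list[str], answer: str) -> str:
    """Classify one query as ok, a retrieval miss, or a generation problem."""
    fact = expected_fact.lower()
    if fact in answer.lower():
        return "ok"
    # The answer is wrong or missing: was the fact available to the model?
    was_retrieved = any(fact in chunk.lower() for chunk in retrieved_chunks)
    return "generation problem" if was_retrieved else "retrieval miss"


# Made-up examples for illustration.
print(triage(
    "30 days",
    ["Refunds are accepted within 30 days of purchase."],
    "I don't have that information",
))
print(triage(
    "30 days",
    ["Shipping usually takes one week."],
    "I don't have that information",
))
```

Running a check like this over a small set of question and expected-fact pairs shows which half of the pipeline to tune first.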
Diagnosing retrieval misses: Before blaming the LLM, always verify what was retrieved:
def debug_retrieval(question: str, vectorstore: Chroma, k: int = 6):
    """Print retrieved chunks and their similarity scores for debugging."""
    results = vectorstore.similarity_search_with_score(question, k=k)
    print(f"\n=== Retrieved chunks for: '{question}' ===\n")
    for i, (doc, score) in enumerate(results, 1):
        # score is a distance; 1 - score is a similarity only under cosine distance
        print(f"Chunk {i} | Similarity: {1 - score:.3f}")
        print(f"Content: {doc.page_content[:200]}...")
        print()
    print("If the answer isn't in any of these chunks, it's a retrieval problem.")
    print("If it is here but the LLM got it wrong, it's a generation problem.")

Hybrid Search and Reranking
Pure semantic search has a blind spot: it misses exact keyword matches and proper nouns. Search for "GDPR Article 17" semantically and you might not find the document that uses those exact terms if the embedding model doesn't capture their specificity.
Hybrid search combines two signals:
- BM25 (keyword/TF-IDF): exact term matching, great for product names, codes, proper nouns
- Semantic (embedding): meaning-based matching, great for paraphrasing and conceptual queries
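The keyword signal can be illustrated with a from-scratch BM25 scorer. This is a sketch of the standard formula with the commonly used defaults k1=1.5 and b=0.75 and a simple regex tokenizer; a production system would normally rely on a search engine or library. The sample documents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: rare terms weigh more.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalised by length with b.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "GDPR Article 17 grants the right to erasure.",
    "Users may ask us to delete their personal data.",
    "Article 5 lists the data processing principles.",
]
scores = bm25_scores("GDPR Article 17", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second document is about the same topic as the first but shares no query terms, so BM25 scores it zero; that is the case the semantic signal is meant to cover.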
A common pattern: retrieve 20 candidates from each method, merge and deduplicate them, then rerank using a cross-encoder model (which reads the query and each document together, producing a more accurate relevance score than embedding similarity alone).
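The merge-and-deduplicate step needs no library. A minimal sketch, assuming each retriever returns a ranked list of chunk IDs; the interleaving order is one reasonable choice among several, and `merge_candidates` and the IDs are hypothetical:

```python
def merge_candidates(keyword_hits: list[str], semantic_hits: list[str]) -> list[str]:
    """Interleave two ranked lists and drop duplicates, keeping first occurrence."""
    merged: list[str] = []
    seen: set[str] = set()
    for rank in range(max(len(keyword_hits), len(semantic_hits))):
        # Take the rank-th result from each list in turn so that neither
        # retriever dominates the front of the merged list.
        for hits in (keyword_hits, semantic_hits):
            if rank < len(hits) and hits[rank] not in seen:
                seen.add(hits[rank])
                merged.append(hits[rank])
    return merged


keyword_hits = ["chunk-1", "chunk-2", "chunk-3"]   # e.g. from BM25
semantic_hits = ["chunk-2", "chunk-4"]             # e.g. from the vector store
candidates = merge_candidates(keyword_hits, semantic_hits)
print(candidates)
```

The order of the merged list matters little when a reranker follows, since the cross-encoder rescoring decides the final ranking.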
from langchain.vectorstores import Chroma
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def hybrid_retrieve_and_rerank(
    question: str,
    vectorstore: Chroma,
    initial_k: int = 20,
    final_k: int = 4,
) -> list:
    """Retrieve more candidates than needed, then rerank to get top final_k.

    Only the semantic half of hybrid retrieval is shown here; BM25 candidates
    would be merged into `candidates` before the reranking step.
    """
    # Step 1: Semantic retrieval (get more candidates than we'll use)
    candidates = vectorstore.similarity_search(question, k=initial_k)

    # Step 2: Score each candidate with the cross-encoder.
    # It reads (query, document) together, which is more accurate than comparing embeddings.
    pairs = [(question, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)

    # Step 3: Sort by reranker score and keep top final_k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

Adding hybrid search and a reranker typically improves RAG answer quality by 15–25% over pure semantic search, though the gain depends on the corpus and on how quality is measured. The cost is the extra latency of scoring each candidate with the cross-encoder, which is usually small for 20 candidates. For production systems serving real users, this tradeoff is almost always worth it. Start with pure semantic search during development, and add reranking when you're optimizing for quality.
Goal: Build a RAG system over a document, deliberately cause a retrieval miss, then fix it.
Setup:
pip install langchain chromadb sentence-transformers anthropic

Part 1: Build the basic RAG pipeline
- Download any text document (a Wikipedia article, a product manual, this module's content exported to .txt)
- Implement the indexing pipeline from the code above
- Write 5 questions that are answerable from the document
- Write 2 questions that are deliberately NOT answerable
Expected behavior: 5/5 answerable questions should get correct answers; 2/2 unanswerable should get "I don't have that information."
Part 2: Cause and fix a retrieval miss
- Find a specific fact deep in your document
- Ask a question that paraphrases that fact using very different words
- Run debug_retrieval() and verify the relevant chunk is NOT in the top-4
- Fix it by: (a) increasing k, and/or (b) adjusting chunk size
Part 3: Measure the fix
Run the same question 3 ways and compare:
- k=2, chunk_size=200
- k=4, chunk_size=800
- k=8, chunk_size=1200
Print the retrieved chunks and the final answer for each. Which configuration gives the best answers?
Deliverable: A Python script that:
- Takes a file path and a question as CLI arguments
- Runs the RAG pipeline
- Prints retrieved chunks (with similarity scores) before the answer
- Prints the final answer
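One possible starting point for the command-line interface, using the standard library's argparse. Only the argument parsing is sketched; the `-k` and `--chunk-size` flags are optional extras assumed here so the Part 3 configurations can be tried without editing the script.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the file path, the question, and optional tuning flags."""
    parser = argparse.ArgumentParser(
        description="Ask one question against one document."
    )
    parser.add_argument("filepath", help="Path to the text file to index")
    parser.add_argument("question", help="Question to answer from the document")
    parser.add_argument("-k", type=int, default=4, help="Number of chunks to retrieve")
    parser.add_argument(
        "--chunk-size", type=int, default=800, help="Chunk size in characters"
    )
    # argv=None makes argparse read sys.argv; passing a list helps with testing.
    return parser.parse_args(argv)


# Example invocation with a hypothetical file name and question.
args = parse_args(["handbook.txt", "How many vacation days do I get?", "-k", "8"])
print(args.filepath, args.question, args.k, args.chunk_size)
```

The parsed values would then be passed to your own indexing and query functions.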
Q1. Why does RAG exist?
Q2. What is 'chunk overlap' in document splitting?
Q3. Your RAG system returns 'I don't have that information' but the answer is in your docs. What's most likely wrong?