How to Build a RAG Pipeline
A RAG pipeline has 5 steps: chunk documents → generate embeddings → store in vector DB → retrieve top-K chunks → generate a grounded answer with citations. The full pipeline can be built in Python using LangChain or raw API calls in under 100 lines of code.
Steps
Chunk Your Documents
Split source documents into chunks before embedding. Use 256–512 tokens with 10–20% overlap so context doesn't get cut off at chunk boundaries.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=['\n\n', '\n', '. ', ' '],
)
chunks = splitter.split_documents(documents)

Generate Embeddings
Convert each chunk to a dense vector. OpenAI's text-embedding-3-small is cost-efficient (1536 dimensions). For local/private setups, use sentence-transformers.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model='text-embedding-3-small',
    )
    return response.data[0].embedding

Store in a Vector Database
Index embeddings + metadata (source file, page, chunk index) in a vector database. Chroma is simplest for local dev; Pinecone for managed prod.
import chromadb

# Use a distinct name so the OpenAI client from the previous step is not shadowed
chroma_client = chromadb.PersistentClient(path='./chroma_db')
collection = chroma_client.get_or_create_collection('docs')
collection.add(
    ids=[chunk.id for chunk in chunks],
    documents=[chunk.text for chunk in chunks],
    embeddings=[embed(chunk.text) for chunk in chunks],
    metadatas=[chunk.metadata for chunk in chunks],
)

Retrieve Top-K Chunks
Embed the user query and search for the most similar chunks. For higher precision, retrieve the top 20 candidates and rerank them down to the top 4 with a cross-encoder.
query = 'What is our refund policy?'
query_embedding = embed(query)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=4,  # retrieve more (e.g. 20) here if you add a reranking step
    include=['documents', 'metadatas', 'distances'],
)
chunks = results['documents'][0]

Generate a Grounded Answer
Build a prompt with retrieved chunks as context. Send to the LLM and return the answer with source citations.
context = '\n\n'.join(chunks)
prompt = f"""Answer based ONLY on the context below.

Context:
{context}

Question: {query}
"""
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': prompt}],
)
print(response.choices[0].message.content)

Common Issues
Retrieved chunks don't contain the answer
Increase top-K from 4 to 8–12 and add hybrid BM25 search. Check your chunk size — chunks may be too large or too small for your query patterns.
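One common way to combine BM25 and vector results is reciprocal rank fusion (RRF), which merges ranked lists without needing to normalize their scores. A minimal sketch, with illustrative chunk IDs standing in for real retrieval output:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank + 1); k dampens head-of-list dominance.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 pass and a vector-search pass:
bm25_hits = ['c3', 'c1', 'c7']
vector_hits = ['c1', 'c3', 'c9']
fused = rrf_fuse([bm25_hits, vector_hits])
```

Chunks ranked highly by both retrievers float to the top of the fused list, even when their raw BM25 and cosine scores are on incomparable scales.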
LLM ignores the retrieved context
The prompt must instruct the model to answer ONLY from the context. Add "If the context doesn't contain the answer, say so" so the model admits gaps instead of falling back on hallucinated knowledge.
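A prompt template with that escape hatch might look like this (the exact wording is illustrative; tune it for your model):

```python
GROUNDED_PROMPT = """Answer based ONLY on the context below.
If the context does not contain the answer, say "I don't know based on the provided documents."

Context:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    # Fill the template with the retrieved chunks and the user's question.
    return GROUNDED_PROMPT.format(context=context, question=question)
```

Keeping the refusal instruction adjacent to the grounding instruction makes it harder for the model to silently answer from its pretraining data.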
High latency on retrieval
Add Redis caching for frequent queries. Use batch embedding for ingestion. Reduce vector index size with approximate nearest neighbor (ANN) search.
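Even before reaching for Redis, an in-process cache keyed by a hash of the input text avoids re-embedding repeated queries. A minimal sketch, where `embed_fn` stands in for whatever embedding call you use:

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings keyed by a SHA-256 hash of the input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.hits = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode('utf-8')).hexdigest()
        if key in self._store:
            self.hits += 1          # served from cache, no API call made
        else:
            self._store[key] = self.embed_fn(text)
        return self._store[key]

# Usage with a stand-in embedding function:
cache = EmbeddingCache(lambda t: [float(len(t))])
cache.embed('refund policy')
cache.embed('refund policy')   # second call is a cache hit
```

The same keying scheme carries over directly to Redis: use the hash as the Redis key and serialize the vector as the value.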
Citation sources are wrong
Store chunk metadata (file name, page number, chunk index) at ingest time and pass it through retrieval. Never construct citations from LLM output alone.
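Citations can then be rendered mechanically from the retrieved metadata. A sketch, assuming metadata fields named `source`, `page`, and `chunk_index` (adjust to whatever schema you store at ingest time):

```python
def format_citations(metadatas: list[dict]) -> str:
    """Render one numbered citation line per retrieved chunk from its stored metadata."""
    lines = []
    for i, meta in enumerate(metadatas, start=1):
        lines.append(f"[{i}] {meta['source']}, p. {meta['page']} (chunk {meta['chunk_index']})")
    return '\n'.join(lines)

# Metadata as returned alongside a vector-store query (illustrative values):
metas = [
    {'source': 'handbook.pdf', 'page': 12, 'chunk_index': 3},
    {'source': 'faq.md', 'page': 1, 'chunk_index': 0},
]
citations = format_citations(metas)
```

Because the citation text never passes through the LLM, it cannot be hallucinated; the model only needs to reference the bracketed numbers.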
FAQ
- What is the best chunk size for RAG?
- 256–512 tokens with 10–20% overlap is a good default. Smaller chunks (128 tokens) improve precision for factual Q&A; larger chunks (1024 tokens) preserve context for complex reasoning. Evaluate on your specific corpus.
- Which vector database should I use for RAG?
- Chroma for local dev, Pinecone for managed prod, pgvector if you already use PostgreSQL. Weaviate and Qdrant for self-hosted open-source deployments.
- How do I improve RAG retrieval accuracy?
- Biggest gains: (1) hybrid BM25 + vector search, (2) cross-encoder reranking, (3) semantic chunking, (4) metadata filtering. Measure hit rate and MRR to track improvement.
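Both metrics can be computed over a small labeled set of queries, each paired with the ID of the chunk known to answer it (the evaluation data below is illustrative):

```python
def hit_rate_and_mrr(results: list[tuple[list[str], str]]) -> tuple[float, float]:
    """results: (retrieved chunk ids in rank order, relevant chunk id) per query."""
    hits, rr_sum = 0, 0.0
    for retrieved, relevant in results:
        if relevant in retrieved:
            hits += 1
            rr_sum += 1.0 / (retrieved.index(relevant) + 1)  # reciprocal rank
    n = len(results)
    return hits / n, rr_sum / n

# Three evaluation queries: relevant chunk found at rank 1, at rank 2, and missed.
evals = [(['a', 'b'], 'a'), (['c', 'd'], 'd'), (['e', 'f'], 'z')]
hit, mrr = hit_rate_and_mrr(evals)  # hit rate 2/3, MRR (1 + 0.5)/3 = 0.5
```

Re-running this after each change (chunk size, hybrid search, reranking) shows whether the change actually moved retrieval quality.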