How to Build a RAG Pipeline

A RAG pipeline has 5 steps: chunk documents → generate embeddings → store in vector DB → retrieve top-K chunks → generate a grounded answer with citations. The full pipeline can be built in Python using LangChain or raw API calls in under 100 lines of code.

Steps

1. Chunk Your Documents

Split source documents into chunks before embedding. Use 256–512 tokens with 10–20% overlap so context doesn't get cut off at chunk boundaries.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap count characters by default; pass a
# token-based length_function if you want the 256–512 token sizing above.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=['\n\n', '\n', '. ', ' '],
)
chunks = splitter.split_documents(documents)
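To see why the overlap matters, the sliding-window mechanics can be sketched in plain Python (a word-level stand-in for tokens; `chunk_words` is an illustrative helper, not part of LangChain):

```python
def chunk_words(words: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Slide a window of `size` words forward by `size - overlap` each step,
    so consecutive chunks share `overlap` words of boundary context."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
    return chunks
```

Each chunk's last `overlap` words reappear at the start of the next chunk, so a sentence straddling a boundary is still seen whole by at least one chunk.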

2. Generate Embeddings

Convert each chunk to a dense vector. OpenAI's text-embedding-3-small is cost-efficient (1536 dimensions). For local/private setups, use sentence-transformers.

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model='text-embedding-3-small'
    )
    return response.data[0].embedding
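Retrieval in step 4 ranks chunks by similarity between these vectors. As a self-contained sketch (not part of the OpenAI API), cosine similarity can be computed with nothing but the standard library:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the norms:
    1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice the vector database computes this (or an equivalent distance) for you; the sketch only shows what "most similar" means.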

3. Store in a Vector Database

Index embeddings + metadata (source file, page, chunk index) in a vector database. Chroma is simplest for local dev; Pinecone for managed prod.

import chromadb

chroma_client = chromadb.PersistentClient(path='./chroma_db')
collection = chroma_client.get_or_create_collection('docs')

# LangChain Documents expose .page_content and .metadata; Chroma needs
# explicit string ids, so derive them from the chunk index.
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=[chunk.page_content for chunk in chunks],
    embeddings=[embed(chunk.page_content) for chunk in chunks],
    metadatas=[chunk.metadata for chunk in chunks],
)

4. Retrieve Top-K Chunks

Embed the user query and search for the most similar chunks. For higher precision, retrieve the top 20 candidates and rerank them down to the top 4 with a cross-encoder; the snippet below skips reranking and retrieves 4 directly.

query = 'What is our refund policy?'
query_embedding = embed(query)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=4,
    include=['documents', 'metadatas', 'distances'],
)
chunks = results['documents'][0]  # rebinds `chunks` to the retrieved texts
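The cross-encoder reranking step is omitted from the snippet; its shape can be sketched with a stand-in scorer (`rerank` and the word-overlap scorer are illustrative only; a real setup would score pairs with a cross-encoder model, e.g. from sentence-transformers):

```python
def rerank(query: str, candidates: list[str], scorer, top_n: int = 4) -> list[str]:
    """Re-order candidate passages by a (query, passage) relevance score
    and keep the best top_n."""
    return sorted(candidates, key=lambda passage: scorer(query, passage), reverse=True)[:top_n]

def word_overlap(query: str, passage: str) -> float:
    """Toy scorer: fraction of query words appearing in the passage.
    A real cross-encoder would replace this function."""
    query_words = set(query.lower().split())
    return len(query_words & set(passage.lower().split())) / len(query_words)
```

The retrieve-wide-then-rerank pattern works because the bi-encoder (vector search) is fast but coarse, while the cross-encoder is slow but reads the query and passage together.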

5. Generate a Grounded Answer

Build a prompt with retrieved chunks as context. Send to the LLM and return the answer with source citations.

context = '\n\n'.join(chunks)
prompt = f"""Answer based ONLY on the context below.

Context:
{context}

Question: {query}
"""

llm = OpenAI()  # a dedicated OpenAI client for chat completion
response = llm.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': prompt}],
)
print(response.choices[0].message.content)
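Citations can then be assembled from the retrieved metadata rather than from the model's text (`format_citations` is a hypothetical helper; the `source` and `page` keys assume the metadata shape suggested in step 3):

```python
def format_citations(metadatas: list[dict]) -> str:
    """Render one numbered citation line per retrieved chunk, using
    metadata stored at ingest time (never the LLM's own output)."""
    lines = []
    for i, meta in enumerate(metadatas, start=1):
        lines.append(f"[{i}] {meta.get('source', 'unknown')}, p. {meta.get('page', '?')}")
    return '\n'.join(lines)
```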

Common Issues

Retrieved chunks don't contain the answer

Increase top-K from 4 to 8–12 and add hybrid BM25 + vector search. Check your chunk size — chunks may be too large or too small for your query patterns.
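One standard way to fuse BM25 and vector rankings is reciprocal rank fusion; a minimal sketch (the `rrf` helper is illustrative; `k=60` is the conventional constant):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists: each document earns 1/(k + rank) per
    list, so items ranked well by either retriever rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it needs no score normalization between the keyword and vector retrievers.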

LLM ignores the retrieved context

The prompt must instruct the model to answer ONLY from the context. Add "If the context doesn't contain the answer, say so" to prevent hallucination fallback.

High latency on retrieval

Add Redis caching for frequent queries. Use batch embedding for ingestion. Reduce vector index size with approximate nearest neighbor (ANN) search.
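Before reaching for Redis, an in-process embedding cache along the same lines can be sketched (`cached_embed` and the module-level dict are illustrative; `embed_fn` stands in for the `embed()` helper from step 2):

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed_fn) -> list[float]:
    """Return a cached embedding when this exact text was seen before,
    calling embed_fn only on a cache miss."""
    key = hashlib.sha256(text.encode('utf-8')).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]
```

Hashing the text keeps keys fixed-length; a shared Redis instance replaces the dict when multiple workers serve queries.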

Citation sources are wrong

Store chunk metadata (file name, page number, chunk index) at ingest time and pass it through retrieval. Never construct citations from LLM output alone.

FAQ

What is the best chunk size for RAG?
256–512 tokens with 10–20% overlap is a good default. Smaller chunks (128 tokens) improve precision for factual Q&A; larger chunks (1024 tokens) preserve context for complex reasoning. Evaluate on your specific corpus.
Which vector database should I use for RAG?
Chroma for local dev, Pinecone for managed prod, pgvector if you already use PostgreSQL. Weaviate and Qdrant for self-hosted open-source deployments.
How do I improve RAG retrieval accuracy?
Biggest gains: (1) hybrid BM25 + vector search, (2) cross-encoder reranking, (3) semantic chunking, (4) metadata filtering. Measure hit rate and MRR to track improvement.
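The two metrics above are easy to compute once you have, per query, the ranked retrieved ids and the known-relevant id (`hit_rate_and_mrr` is a hypothetical helper for a single-relevant-document evaluation set):

```python
def hit_rate_and_mrr(results: list[tuple[list[str], str]], k: int = 4) -> tuple[float, float]:
    """results pairs each query's ranked retrieved ids with its relevant id.
    Hit rate: fraction of queries whose relevant id appears in the top-k.
    MRR: mean of 1/rank of the relevant id (0 when it is missing)."""
    hits, rr_total = 0, 0.0
    for retrieved, relevant in results:
        top_k = retrieved[:k]
        if relevant in top_k:
            hits += 1
            rr_total += 1.0 / (top_k.index(relevant) + 1)
    n = len(results)
    return hits / n, rr_total / n
```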
