How to Build a RAG Pipeline
A RAG pipeline has 5 steps: chunk documents → generate embeddings → store in vector DB → retrieve top-K chunks → generate a grounded answer with citations. The full pipeline can be built in Python using LangChain or raw API calls in under 100 lines of code.
Steps
Chunk Your Documents
Split source documents into chunks before embedding. Use 256–512 tokens with 10–20% overlap so context doesn't get cut off at chunk boundaries.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=['\n\n', '\n', '. ', ' '],
)
chunks = splitter.split_documents(documents)

Generate Embeddings
Convert each chunk to a dense vector. OpenAI's text-embedding-3-small is cost-efficient (1536 dimensions). For local/private setups, use sentence-transformers.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model='text-embedding-3-small',
    )
    return response.data[0].embedding

Store in a Vector Database
Index embeddings + metadata (source file, page, chunk index) in a vector database. Chroma is simplest for local dev; Pinecone for managed prod.
import chromadb

# Use a distinct name so the OpenAI client from the previous step is not shadowed
chroma_client = chromadb.PersistentClient(path='./chroma_db')
collection = chroma_client.get_or_create_collection('docs')
collection.add(
    ids=[chunk.id for chunk in chunks],
    documents=[chunk.text for chunk in chunks],
    embeddings=[embed(chunk.text) for chunk in chunks],
    metadatas=[chunk.metadata for chunk in chunks],
)

Retrieve Top-K Chunks
Embed the user query and search for the most similar chunks. For higher precision, retrieve the top 20 candidates and rerank them down to the top 4 with a cross-encoder.
query = 'What is our refund policy?'
query_embedding = embed(query)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=4,  # retrieve more (e.g. 20) here if you add a reranking step
    include=['documents', 'metadatas', 'distances'],
)
chunks = results['documents'][0]

Generate a Grounded Answer
Build a prompt with retrieved chunks as context. Send to the LLM and return the answer with source citations.
context = '\n\n'.join(chunks)
prompt = f"""Answer based ONLY on the context below.

Context:
{context}

Question: {query}
"""
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': prompt}],
)
print(response.choices[0].message.content)

Common Issues
Retrieved chunks don't contain the answer
Increase top-K from 4 to 8–12 and add hybrid BM25 search. Check your chunk size — chunks may be too large or too small for your query patterns.
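One common way to combine BM25 and vector results is reciprocal rank fusion (RRF), which merges ranked lists without needing to normalize their scores. A minimal sketch, with illustrative chunk IDs standing in for real retrieval output:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank + 1); k dampens head-of-list dominance.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 pass and a vector-search pass:
bm25_hits = ['c3', 'c1', 'c7']
vector_hits = ['c1', 'c3', 'c9']
fused = rrf_fuse([bm25_hits, vector_hits])
```

Chunks ranked highly by both retrievers float to the top of the fused list, even when their raw BM25 and cosine scores are on incomparable scales.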
LLM ignores the retrieved context
The prompt must instruct the model to answer ONLY from the context. Add "If the context doesn't contain the answer, say so" so the model admits gaps instead of falling back on hallucinated knowledge.
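A prompt template with that escape hatch might look like this (the exact wording is illustrative; tune it for your model):

```python
GROUNDED_PROMPT = """Answer based ONLY on the context below.
If the context does not contain the answer, say "I don't know based on the provided documents."

Context:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    # Fill the template with the retrieved chunks and the user's question.
    return GROUNDED_PROMPT.format(context=context, question=question)
```

Keeping the refusal instruction adjacent to the grounding instruction makes it harder for the model to silently answer from its pretraining data.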
High latency on retrieval
Add Redis caching for frequent queries. Use batch embedding for ingestion. Reduce vector index size with approximate nearest neighbor (ANN) search.
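Even before reaching for Redis, an in-process cache keyed by a hash of the input text avoids re-embedding repeated queries. A minimal sketch, where `embed_fn` stands in for whatever embedding call you use:

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings keyed by a SHA-256 hash of the input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.hits = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode('utf-8')).hexdigest()
        if key in self._store:
            self.hits += 1          # served from cache, no API call made
        else:
            self._store[key] = self.embed_fn(text)
        return self._store[key]

# Usage with a stand-in embedding function:
cache = EmbeddingCache(lambda t: [float(len(t))])
cache.embed('refund policy')
cache.embed('refund policy')   # second call is a cache hit
```

The same keying scheme carries over directly to Redis: use the hash as the Redis key and serialize the vector as the value.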
Citation sources are wrong
Store chunk metadata (file name, page number, chunk index) at ingest time and pass it through retrieval. Never construct citations from LLM output alone.
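Citations can then be rendered mechanically from the retrieved metadata. A sketch, assuming metadata fields named `source`, `page`, and `chunk_index` (adjust to whatever schema you store at ingest time):

```python
def format_citations(metadatas: list[dict]) -> str:
    """Render one numbered citation line per retrieved chunk from its stored metadata."""
    lines = []
    for i, meta in enumerate(metadatas, start=1):
        lines.append(f"[{i}] {meta['source']}, p. {meta['page']} (chunk {meta['chunk_index']})")
    return '\n'.join(lines)

# Metadata as returned alongside a vector-store query (illustrative values):
metas = [
    {'source': 'handbook.pdf', 'page': 12, 'chunk_index': 3},
    {'source': 'faq.md', 'page': 1, 'chunk_index': 0},
]
citations = format_citations(metas)
```

Because the citation text never passes through the LLM, it cannot be hallucinated; the model only needs to reference the bracketed numbers.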
FAQ
- What is the best chunk size for RAG?
- 256–512 tokens with 10–20% overlap is a good default. Smaller chunks (128 tokens) improve precision for factual Q&A; larger chunks (1024 tokens) preserve context for complex reasoning. Evaluate on your specific corpus.
- Which vector database should I use for RAG?
- Chroma for local dev, Pinecone for managed prod, pgvector if you already use PostgreSQL. Weaviate and Qdrant for self-hosted open-source deployments.
- How do I improve RAG retrieval accuracy?
- Biggest gains: (1) hybrid BM25 + vector search, (2) cross-encoder reranking, (3) semantic chunking, (4) metadata filtering. Measure hit rate and MRR to track improvement.
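Both metrics can be computed over a small labeled set of queries, each paired with the ID of the chunk known to answer it (the evaluation data below is illustrative):

```python
def hit_rate_and_mrr(results: list[tuple[list[str], str]]) -> tuple[float, float]:
    """results: (retrieved chunk ids in rank order, relevant chunk id) per query."""
    hits, rr_sum = 0, 0.0
    for retrieved, relevant in results:
        if relevant in retrieved:
            hits += 1
            rr_sum += 1.0 / (retrieved.index(relevant) + 1)  # reciprocal rank
    n = len(results)
    return hits / n, rr_sum / n

# Three evaluation queries: relevant chunk found at rank 1, at rank 2, and missed.
evals = [(['a', 'b'], 'a'), (['c', 'd'], 'd'), (['e', 'f'], 'z')]
hit, mrr = hit_rate_and_mrr(evals)  # hit rate 2/3, MRR (1 + 0.5)/3 = 0.5
```

Re-running this after each change (chunk size, hybrid search, reranking) shows whether the change actually moved retrieval quality.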