RAG Enterprise Project
Step-by-Step Walkthrough: Build a Production RAG System
Total Time: ~2 hours
Difficulty: Advanced
Tools: LangChain, ChromaDB, OpenAI
What You'll Build
In this walkthrough, you'll build a production RAG (Retrieval-Augmented Generation) system that enhances LLM responses with your own documents:
- Parse and chunk documents (PDF, TXT, Markdown)
- Generate embeddings and store in vector database
- Implement semantic search with similarity scoring
- Build question-answering with retrieved context
- Test retrieval quality and answer accuracy
Prerequisites
Python 3.8+ installed
OpenAI API key (or other LLM provider)
Understanding of embeddings and vector search
Basic knowledge of LLMs and prompting
1
Set Up RAG Environment
25 min
1.1 Create Project Structure
# Create project directory
mkdir rag-enterprise
cd rag-enterprise
# Create subdirectories
mkdir -p documents/raw documents/processed src vectorstore
1.2 Install RAG Stack
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install langchain==0.1.0 chromadb==0.4.18 \
    openai==1.6.1 tiktoken==0.5.2 \
    pypdf==3.17.4 python-dotenv==1.0.0
# Save requirements
pip freeze > requirements.txt
1.3 Configure API Keys
# Create .env file
cat > .env <<EOF
OPENAI_API_KEY=your-api-key-here
CHROMA_PERSIST_DIRECTORY=./vectorstore
EMBEDDING_MODEL=text-embedding-3-small
LLM_MODEL=gpt-3.5-turbo
EOF
API Key Security
Never commit .env files to Git! Add .env to your .gitignore file immediately.
1.4 Create Sample Documents
Add a sample document to test with:
# Create sample document
cat > documents/raw/company_policy.txt <<EOF
Company Remote Work Policy
1. Eligibility: All full-time employees are eligible for remote work
after 90 days of employment.
2. Schedule: Remote employees must maintain core hours of 10am-3pm EST
for team collaboration.
3. Equipment: Company provides laptop, monitor, and $500 home office
stipend for eligible employees.
4. Communication: Daily standup at 10am via Zoom. Slack response
within 2 hours during core hours.
5. Performance: Quarterly reviews assess output quality and
collaboration effectiveness, not hours worked.
EOF
1.5 Test OpenAI Connection
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Test embedding generation
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello, world!"
)
print(f"Embedding dimension: {len(response.data[0].embedding)}")
Expected Output
Embedding dimension: 1536
2
Build Document Parser and Chunker
40 min
2.1 Create Document Loader
# src/document_loader.py
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path
def load_documents(directory):
    """Load all documents from a directory."""
    documents = []
    path = Path(directory)
    for file in path.glob("**/*"):
        if not file.is_file():
            continue
        if file.suffix == ".txt":
            loader = TextLoader(str(file))
        elif file.suffix == ".pdf":
            loader = PyPDFLoader(str(file))
        else:
            continue
        documents.extend(loader.load())
    print(f"Loaded {len(documents)} documents")
    return documents
2.2 Implement Smart Chunking
Chunk documents with overlap for better retrieval:
def chunk_documents(documents, chunk_size=500, chunk_overlap=50):
    """Split documents into chunks with overlap."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")
    return chunks
Why Chunk Overlap?
Overlap ensures context isn't lost at chunk boundaries. A 50-character overlap prevents splitting important information across chunks.
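The effect of overlap can be seen without LangChain. This toy splitter is a simplified character-window sketch, not `RecursiveCharacterTextSplitter`'s actual algorithm (which prefers splitting at the separators listed above); it advances by `chunk_size - chunk_overlap` characters each step:

```python
def naive_chunk(text, chunk_size=20, chunk_overlap=5):
    """Toy splitter: fixed-size character windows, each sharing
    chunk_overlap characters with the previous window."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_chunk("Remote employees must maintain core hours.")
for prev, nxt in zip(chunks, chunks[1:]):
    # Each chunk begins with the tail of the previous one
    print(repr(prev[-5:]), "==", repr(nxt[:5]))
```

With `chunk_size=500, chunk_overlap=50` as in `chunk_documents` above, the same principle applies: roughly the last 50 characters of one chunk reappear at the start of the next, modulo separator-aware splitting.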
2.3 Test Loading and Chunking
# Test the loader
docs = load_documents("documents/raw")
chunks = chunk_documents(docs)
# Inspect first chunk
print(f"\nFirst chunk:")
print(chunks[0].page_content)
print(f"\nMetadata:")
print(chunks[0].metadata)
Expected Output
Loaded 1 documents
Created 3 chunks
First chunk:
Company Remote Work Policy
1. Eligibility: All full-time employees...
3
Create Vector Store with Embeddings
30 min
3.1 Build Vector Store
# src/vector_store.py
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import os
def create_vector_store(chunks):
    """Create vector store from chunks."""
    embeddings = OpenAIEmbeddings(
        model=os.getenv("EMBEDDING_MODEL")
    )
    # Create vector store with persistence
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=os.getenv("CHROMA_PERSIST_DIRECTORY")
    )
    print(f"Created vector store with {len(chunks)} chunks")
    return vector_store
3.2 Test Semantic Search
# Create vector store
vector_store = create_vector_store(chunks)
# Test similarity search
query = "What are the core hours for remote work?"
results = vector_store.similarity_search(query, k=2)
print(f"\nQuery: {query}")
print(f"\nTop {len(results)} results:")
for i, doc in enumerate(results, 1):
    print(f"\n{i}. {doc.page_content[:200]}...")
Expected Behavior
The search should return chunks mentioning "10am-3pm EST" core hours, demonstrating semantic understanding beyond keyword matching.
3.3 Search with Similarity Scores
# Search with scores
results_with_scores = vector_store.similarity_search_with_score(query, k=2)
for doc, score in results_with_scores:
    print(f"\nScore (distance): {score:.4f}")
    print(f"Content: {doc.page_content[:150]}...")
Similarity Scores
Lower scores = higher similarity (distance metric). Typical good matches have scores < 0.5. Use scores to filter low-quality retrievals.
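Assuming a cosine-distance metric (what Chroma actually returns depends on the collection's configured distance function, so treat this as a sketch), the score math and a threshold filter look like this in plain Python; the toy vectors and the 0.5 cutoff are illustrative:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, higher = less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def filter_by_score(results_with_scores, max_distance=0.5):
    """Drop (doc, score) pairs whose distance exceeds the threshold."""
    return [(doc, s) for doc, s in results_with_scores if s < max_distance]

# Toy 3-dimensional "embeddings" (real ones are 1536-dimensional)
query = [0.9, 0.1, 0.0]
on_topic = [0.8, 0.2, 0.1]
off_topic = [0.0, 0.1, 0.9]
print(cosine_distance(query, on_topic))   # small: good match
print(cosine_distance(query, off_topic))  # near 1.0: poor match
```

In practice you would pass the output of `similarity_search_with_score` through a filter like `filter_by_score` before handing chunks to the LLM.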
4
Test Retrieval and Generation
25 min
4.1 Create RAG Chain
# src/rag_chain.py
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os

# Custom prompt template
template = """Use the following context to answer the question.
If you cannot answer from the context, say "I don't have that information."
Context: {context}
Question: {question}
Answer:"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# Create QA chain (vector_store comes from Step 3)
llm = ChatOpenAI(model_name=os.getenv("LLM_MODEL"), temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": prompt}
)
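The `chain_type="stuff"` setting means retrieved chunks are simply concatenated ("stuffed") into the prompt's `{context}` slot. A rough pure-Python sketch of that assembly, using the same template; the chunk texts here are illustrative:

```python
TEMPLATE = """Use the following context to answer the question.
If you cannot answer from the context, say "I don't have that information."
Context: {context}
Question: {question}
Answer:"""

def build_stuff_prompt(chunk_texts, question):
    """Join retrieved chunk texts and fill the prompt template."""
    context = "\n\n".join(chunk_texts)
    return TEMPLATE.format(context=context, question=question)

prompt_text = build_stuff_prompt(
    ["2. Schedule: Remote employees must maintain core hours of 10am-3pm EST",
     "3. Equipment: Company provides laptop, monitor, and stipend"],
    "What are the core hours for remote employees?",
)
print(prompt_text)
```

Because all k=3 chunks go into a single prompt, "stuff" is the simplest chain type but can overflow the context window when chunks are many or large; LangChain's "map_reduce" and "refine" chain types exist for that case.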
4.2 Ask Questions
# Test questions
questions = [
"What equipment does the company provide for remote work?",
"How long must employees work before being eligible for remote work?",
"What are the core hours for remote employees?"
]
for question in questions:
    result = qa_chain({"query": question})
    print(f"\nQ: {question}")
    print(f"A: {result['result']}")
Expected Answers
1. "Laptop, monitor, and $500 home office stipend"
2. "90 days of employment"
3. "10am-3pm EST"
4.3 Evaluate Answer Quality
# Test with out-of-context question
question = "What is the company's vacation policy?"
result = qa_chain({"query": question})
print(f"\nQ: {question}")
print(f"A: {result['result']}")
Quality Check
The system should respond "I don't have that information" for questions not covered in the documents. This prevents hallucination!
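This check can be automated by scanning answers for the refusal phrase from the prompt template. A minimal sketch, where `SAMPLE_ANSWERS` stands in for real `qa_chain` output:

```python
REFUSAL = "I don't have that information"

def is_refusal(answer):
    """True if the model declined to answer instead of guessing."""
    return REFUSAL.lower() in answer.lower()

# Stand-in answers; in practice, collect these from qa_chain
SAMPLE_ANSWERS = {
    "What is the company's vacation policy?": "I don't have that information.",
    "What are the core hours?": "Core hours are 10am-3pm EST.",
}
for question, answer in SAMPLE_ANSWERS.items():
    status = "refused (good)" if is_refusal(answer) else "answered"
    print(f"{question} -> {status}")
```

A substring check like this is crude (it misses paraphrased refusals), but it is a cheap first regression test for out-of-context questions.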
Troubleshooting
- Poor retrieval: Adjust chunk_size and chunk_overlap parameters
- Hallucinations: Strengthen prompt instructions to stick to context
- API errors: Check OpenAI API key and rate limits
See the RAG Troubleshooting Guide for more solutions.
Walkthrough Complete!
You've built a production RAG system with document parsing, embeddings, vector search, and LLM-powered question answering. You're ready for Part 2!
What You've Learned:
Document loading and parsing
Smart chunking with overlap
Embedding generation with OpenAI
Vector store with ChromaDB
Semantic search and similarity scoring
RAG chain with custom prompts
Question answering with context
Hallucination prevention techniques