
RAG Systems in Production

How Notion and Intercom built context-aware AI with Retrieval-Augmented Generation

Why These Case Studies Matter

RAG (Retrieval-Augmented Generation) has become the dominant pattern for building production LLM applications. Instead of fine-tuning models on your data, RAG retrieves relevant context and includes it in the prompt. This approach is faster, cheaper, and more maintainable.
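The core loop is simple: embed the query, rank stored documents by similarity, and splice the top matches into the prompt. The sketch below illustrates that flow with a toy term-frequency embedding and cosine similarity; a production system like those described here would use a learned embedding model and a vector database instead, and all names are illustrative.

```python
import math
from collections import Counter

# Toy embedding: term-frequency vectors. Production systems use
# learned embedding models (e.g., 1536-dim OpenAI embeddings).
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank all documents by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    # Splice retrieved context into the prompt sent to the LLM.
    context = "\n---\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Notion AI searches your workspace pages before answering.",
    "Fine-tuning bakes knowledge into model weights.",
    "RAG retrieves relevant context at query time.",
]
print(build_prompt("How does RAG get its context?", docs))
```

Because the model's knowledge lives in the document store rather than its weights, updating the system is an index refresh, not a retraining run.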

These case studies reveal how Notion and Intercom built RAG systems that serve millions of users daily. You'll learn retrieval strategies, chunking techniques, permission handling, and production considerations that apply to any RAG application.
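Chunking is the first of those techniques: documents are split into retrieval-sized pieces before indexing. A minimal sketch of fixed-size chunking with overlap is below; the sizes are illustrative, and real pipelines often chunk by tokens or document structure (headings, blocks) rather than raw characters.

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    The overlap keeps a sentence that straddles a chunk boundary
    retrievable from at least one chunk. Sizes are illustrative.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

For example, a 500-character document with `size=200, overlap=50` yields three chunks, each sharing its last 50 characters with the next chunk's first 50.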

Learning Path: After reading these case studies, build your own RAG system with the RAG Enterprise Project, then follow the step-by-step walkthrough.

Note on Metrics: These case studies are based on publicly available information from engineering blogs, conference talks, and open-source documentation. While we've verified core architectural patterns and technologies, some specific numbers (especially cost figures and exact scale metrics) are estimates for educational purposes. Where possible, we've updated unverified claims to reflect documented information or general ranges.

Featured Case Studies

Deep dives into Notion AI and Intercom's RAG architectures

Notion AI

Case Study #1


The Problem

Users wanted AI that understood their workspace context (documents, databases, wikis), not generic responses. The system had to search across millions of pages in real time while enforcing permission boundaries and generating contextually relevant answers.
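The permission-boundary requirement is usually met by filtering candidates by access control *before* ranking, so an AI answer can never surface a page the asking user cannot open. A minimal sketch of that pattern (not Notion's documented implementation; the `Page` structure and ACL model are assumptions) looks like this:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    title: str
    text: str
    allowed_users: set[str] = field(default_factory=set)  # simplistic ACL

def visible_pages(pages: list[Page], user: str) -> list[Page]:
    # Filter by permissions BEFORE retrieval/ranking, so restricted
    # content never enters the candidate set or the LLM prompt.
    return [p for p in pages if user in p.allowed_users]

pages = [
    Page("Public roadmap", "Q3 plans ...", {"alice", "bob"}),
    Page("Salary bands", "Confidential ...", {"alice"}),
]
candidates = visible_pages(pages, "bob")  # only "Public roadmap"
```

Filtering pre-retrieval (rather than post-generation) is the safer design: a leak cannot be prompted out of context that was never retrieved.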

Scale

- Active Users: 30 million+
- Documents Indexed: 1 billion+
- AI Queries/Day: 10 million+
- Vector Dimensions: 1536 (OpenAI)
- Latency SLA: <2 seconds
- Accuracy Target: >90% relevance

Intercom

Case Study #2


The Problem

Support teams needed AI that could answer customer questions using knowledge base articles, past conversations, and product documentation. The system required real-time retrieval across 100M+ messages while maintaining conversation context and handling multi-turn dialogues.
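Multi-turn dialogue complicates retrieval because follow-ups like "how do I cancel it?" are meaningless in isolation. A common fix is to rewrite the follow-up into a standalone query before embedding it. The sketch below just concatenates recent user turns; production systems (Intercom's included, presumably) typically use an LLM rewrite step instead, so treat this as an assumption-laden simplification:

```python
def standalone_query(history: list[tuple[str, str]], followup: str) -> str:
    """Fold recent user turns into the retrieval query so that
    pronoun-heavy follow-ups still retrieve the right articles.

    `history` is a list of (role, message) pairs. Only the last few
    user turns are kept, to bound query length.
    """
    recent = [msg for role, msg in history[-4:] if role == "user"]
    return " ".join(recent + [followup])

history = [
    ("user", "I upgraded to the Pro plan"),
    ("assistant", "Great, your Pro plan is active."),
]
query = standalone_query(history, "how do I cancel it?")
# The combined query now mentions "Pro plan", so retrieval can find
# the cancellation article for that plan.
```

The rewritten query, not the raw follow-up, is what gets embedded and sent to the vector store.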

Scale

- Messages/Day: 100 million+
- Knowledge Articles: 500,000+
- Conversations Indexed: 1 billion+
- Languages Supported: 45+
- Response Time: <3 seconds
- Resolution Rate: 60% automated