Large language models have demonstrated remarkable capabilities in generating fluent, coherent text. Yet they suffer from a fundamental limitation: their knowledge is frozen at training time, and they cannot reliably access or cite specific sources. Retrieval-Augmented Generation (RAG) addresses this by combining the generative power of neural networks with the precision of information retrieval systems.
The Knowledge Problem in Language Models
Traditional language models encode knowledge implicitly in their parameters during training. This creates several challenges:
| Challenge | Description | Impact |
|---|---|---|
| Knowledge Cutoff | Parameters only reflect data available before the training cutoff | Cannot answer questions about recent events |
| Hallucination | Model generates plausible but false information | Unreliable for factual queries |
| Opacity | No clear source attribution | Difficult to verify claims |
| Update Cost | Retraining required for new knowledge | Expensive and time-consuming |
The foundational RAG paper by Lewis et al. (2020) introduced an elegant solution: rather than storing all knowledge in model parameters, retrieve relevant documents at inference time and condition generation on this retrieved context.
"RAG models combine the best of both worlds: the parametric memory of pre-trained seq2seq models and the non-parametric memory of retrieval-based models, accessing knowledge from an external corpus." - Lewis et al., 2020
RAG Architecture Overview
The RAG architecture consists of three primary components working in concert:
RAG System Architecture:
----------------------------------------------------------------
    User Query
         |
         v
+------------------+
|  Query Encoder   |   E_q: Transforms query into dense vector
|   (Bi-Encoder)   |
+--------+---------+
         |
         | query embedding
         v
+------------------+      +----------------------+
| Dense Retrieval  |----->|    Document Index    |
|      (MIPS)      |      |   (FAISS / ScaNN /   |
+--------+---------+      |   Pinecone / etc.)   |
         |                +----------------------+
         | top-k documents
         v
+------------------+
|  Fusion Module   |   Combines query + retrieved docs
+--------+---------+
         |
         | augmented context
         v
+------------------+
|    Generator     |   Seq2seq model (BART, T5, GPT, etc.)
|  (LLM Decoder)   |
+--------+---------+
         |
         v
  Generated Answer
The Retrieval Score
The core retrieval mechanism uses dense vector similarity. Given a query $q$ and document $d$, the retrieval score is computed as the inner product of their embeddings:

$$s(q, d) = E_q(q)^\top E_d(d)$$

Where:
- $E_q(\cdot)$ is the query encoder
- $E_d(\cdot)$ is the document encoder
- Both encoders map text to dense vectors in the same embedding space
This formulation enables efficient Maximum Inner Product Search (MIPS) over millions of documents using approximate nearest neighbor algorithms.
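As a concrete sketch of MIPS in practice, the snippet below builds an exact inner-product index with FAISS over randomly generated vectors standing in for document embeddings, then retrieves the top-k matches for a query vector. The corpus size, dimensionality, and data are illustrative assumptions, not values from the papers discussed here.

```python
# Minimal MIPS sketch with FAISS (`pip install faiss-cpu numpy`).
# Random embeddings stand in for real encoder outputs.
import numpy as np
import faiss

dim = 768                                   # embedding dimensionality
rng = np.random.default_rng(0)

# Pretend these came from E_d applied to 10,000 document chunks.
doc_embeddings = rng.normal(size=(10_000, dim)).astype("float32")

# Exact inner-product index; for very large corpora an approximate index
# such as faiss.IndexHNSWFlat or faiss.IndexIVFFlat is used instead.
index = faiss.IndexFlatIP(dim)
index.add(doc_embeddings)

# Pretend this came from E_q applied to the user query.
query_embedding = rng.normal(size=(1, dim)).astype("float32")

scores, doc_ids = index.search(query_embedding, 5)   # top-5 MIPS
print(doc_ids[0], scores[0])                         # ids and s(q, d) values
```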
Dense Retrieval: The Bi-Encoder Architecture
How Bi-Encoders Work
The bi-encoder architecture processes queries and documents independently through separate (or shared) transformer encoders:
Bi-Encoder Architecture:
----------------------------------------------------------------
Query: "What is the capital of France?"
+------------------+
| Query Text |
+--------+---------+
|
v
+------------------+
| Query Encoder | BERT-based transformer
| (E_q) |
+--------+---------+
|
v
+------------------+
| Query Embedding | 768-dim dense vector
| q = [0.2, ...] |
+------------------+
Document: "Paris is the capital and largest city of France..."
+------------------+
| Document Text |
+--------+---------+
|
v
+------------------+
| Document Encoder | BERT-based transformer
| (E_d) |
+--------+---------+
|
v
+------------------+
| Doc Embedding | 768-dim dense vector
| d = [0.3, ...] |
+------------------+
Similarity: s(q, d) = q^T d = 0.87
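A minimal bi-encoder sketch using the sentence-transformers library is shown below. The model name is one example choice (it produces 384-dimensional vectors rather than the 768 shown in the diagram), and a single shared encoder plays the role of both E_q and E_d, which is a common simplification.

```python
# Bi-encoder similarity sketch (`pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # example model, 384-dim

query = "What is the capital of France?"
documents = [
    "Paris is the capital and largest city of France.",
    "Mount Everest is the highest mountain above sea level.",
]

# Encode query and documents independently, as a bi-encoder does; normalizing
# the embeddings makes the inner product equal to cosine similarity.
q = encoder.encode([query], normalize_embeddings=True)      # shape (1, 384)
d = encoder.encode(documents, normalize_embeddings=True)    # shape (2, 384)

scores = q @ d.T          # s(q, d) = q^T d for every document
print(scores)             # the Paris passage should score highest
```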
Training Dense Retrievers
Dense retrievers are typically trained with contrastive learning. For each query $q$, we have:
- One positive document $d^+$ (relevant)
- Multiple negative documents $d^-_1, \dots, d^-_n$ (irrelevant)

The training objective maximizes the score for positive pairs while minimizing it for negatives. The loss function uses the negative log-likelihood of the positive document:

$$\mathcal{L}(q, d^+, d^-_1, \dots, d^-_n) = -\log \frac{\exp\big(s(q, d^+)\big)}{\exp\big(s(q, d^+)\big) + \sum_{j=1}^{n} \exp\big(s(q, d^-_j)\big)}$$
This is equivalent to the cross-entropy loss over a softmax distribution of document scores.
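The sketch below implements this objective with in-batch negatives in PyTorch: each query's positive document is the matching row of the batch, and every other document in the batch acts as a negative. The encoders are stubbed out with random embeddings, and the temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

batch_size, dim = 8, 768
query_emb = torch.randn(batch_size, dim, requires_grad=True)   # stands in for E_q(q_i)
doc_emb = torch.randn(batch_size, dim, requires_grad=True)     # stands in for E_d(d_i^+)

# Score matrix: entry (i, j) = s(q_i, d_j); the diagonal holds the positive
# pairs, every off-diagonal entry is an in-batch negative.
temperature = 0.05                                             # assumed value
scores = query_emb @ doc_emb.T / temperature

# Cross-entropy over each row is exactly the negative log-likelihood of the
# positive document, i.e. the contrastive loss described above.
labels = torch.arange(batch_size)
loss = F.cross_entropy(scores, labels)
loss.backward()            # in a real setup, gradients flow into both encoders
print(loss.item())
```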
Hard Negative Mining
The quality of negative examples significantly impacts retrieval performance. Strategies include:
| Strategy | Description | Effectiveness |
|---|---|---|
| Random Negatives | Sample random documents from corpus | Baseline |
| BM25 Negatives | Top BM25 results that aren't relevant | Good |
| In-Batch Negatives | Other queries' positives in same batch | Efficient |
| Hard Negatives | High-scoring but incorrect documents | Best |
Document Chunking Strategies
Real-world documents are often too long for transformer context windows. Effective chunking is critical for RAG performance.
Common Chunking Approaches
Document Chunking Strategies:
----------------------------------------------------------------
Original Document (10,000 tokens)
|
+-- Fixed-Size Chunking
| |
| +-- Chunk 1: tokens [0:512]
| +-- Chunk 2: tokens [512:1024]
| +-- Chunk 3: tokens [1024:1536]
| ...
| (Simple but may break semantic units)
|
+-- Sentence-Based Chunking
| |
| +-- Chunk 1: sentences 1-5 (~400 tokens)
| +-- Chunk 2: sentences 6-12 (~500 tokens)
| ...
| (Preserves sentence boundaries)
|
+-- Semantic Chunking
| |
| +-- Chunk 1: "Introduction" section
| +-- Chunk 2: "Methods" section
| +-- Chunk 3: "Results" section
| ...
| (Preserves document structure)
|
+-- Recursive Chunking
|
+-- Split on paragraphs
+-- If too large, split on sentences
+-- If still too large, split on tokens
(Hierarchical, respects structure)
Chunk Size Trade-offs
| Chunk Size | Advantages | Disadvantages |
|---|---|---|
| Small (100-200 tokens) | Precise retrieval, fine-grained | May lose context |
| Medium (300-500 tokens) | Balanced | General purpose |
| Large (500-1000 tokens) | More context per chunk | Less precise retrieval |
Overlapping Chunks
To prevent information loss at chunk boundaries, many systems use overlapping windows:
Overlapping Chunk Strategy:
----------------------------------------------------------------
Document:  [====================================]
Chunk 1:   [==========]
                   |-- overlap (50-100 tokens)
Chunk 2:           [==========]
                           |-- overlap
Chunk 3:                   [==========]
                                   |-- overlap
Chunk 4:                           [==========]
The overlap ensures that information spanning chunk boundaries is captured in at least one chunk.
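A minimal sliding-window chunker with overlap might look like the following; it counts whitespace tokens for brevity, whereas a production system would typically count model tokens using the embedding model's tokenizer. The default sizes are illustrative.

```python
def chunk_with_overlap(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split `text` into chunks of `chunk_size` tokens, with `overlap` tokens shared
    between consecutive chunks so boundary-spanning content lands in at least one chunk."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

document = "word " * 1000
print(len(chunk_with_overlap(document)))   # 3 chunks covering 0-400, 320-720, 640-1000
```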
The RAG Probability Distribution
The mathematical formulation of RAG defines a marginalized probability over retrieved documents. For input $x$ (query) and output $y$ (generated answer):

$$p(y \mid x) \approx \sum_{z \in \text{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$

Where:
- $z$ represents a retrieved document
- $p_\eta(z \mid x)$ is the retrieval probability (from dense retrieval scores)
- $p_\theta(y \mid x, z)$ is the generation probability conditioned on query and document
RAG-Sequence vs RAG-Token
The Lewis et al. paper introduced two variants:
RAG-Sequence: uses the same retrieved document for generating the entire output sequence, marginalizing over complete sequences:

$$p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z} p_\eta(z \mid x) \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1})$$

RAG-Token: can draw on a different document for each output token, marginalizing at every generation step:

$$p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i} \sum_{z} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$$
RAG-Sequence vs RAG-Token:
----------------------------------------------------------------
RAG-Sequence:
Query --> Retrieve [Doc A, Doc B, Doc C]
|
+--> Generate with Doc A: "Paris is the capital..."
+--> Generate with Doc B: "The capital of France..."
+--> Generate with Doc C: "France's capital city..."
|
+--> Marginalize: weighted sum of complete sequences
RAG-Token:
Query --> Retrieve [Doc A, Doc B, Doc C]
|
Token 1: marginalize over docs --> "The"
Token 2: marginalize over docs --> "capital"
Token 3: marginalize over docs --> "is"
...
(Each token can draw from different document)
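The toy calculation below walks through the RAG-Sequence marginalization numerically: retrieval probabilities weight the per-document likelihoods of the same answer. All numbers are made up purely to show the arithmetic.

```python
import numpy as np

# Softmax over retrieval scores for three retrieved documents (Doc A, B, C)
# gives the retrieval distribution p(z | x).
retrieval_scores = np.array([2.1, 1.3, 0.4])
p_z_given_x = np.exp(retrieval_scores) / np.exp(retrieval_scores).sum()

# Likelihood the generator assigns to the same answer y under each document,
# i.e. p(y | x, z) for z = Doc A, B, C.
p_y_given_x_z = np.array([0.62, 0.35, 0.08])

# RAG-Sequence: marginalize complete-sequence likelihoods over documents.
p_y_given_x = float(p_z_given_x @ p_y_given_x_z)
print(p_z_given_x.round(3), round(p_y_given_x, 3))
```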
Fusion Mechanisms
Fusion mechanisms determine how retrieved documents are combined with the query for generation.
Concatenation-Based Fusion
The simplest approach concatenates retrieved documents with the query:
Concatenation Fusion:
----------------------------------------------------------------
Input to Generator:
[QUERY] What is photosynthesis?
[DOC 1] Photosynthesis is a process used by plants to convert
light energy into chemical energy...
[DOC 2] The light-dependent reactions occur in the thylakoid
membranes and require sunlight...
[DOC 3] Carbon fixation occurs in the Calvin cycle, where CO2
is converted into glucose...
--> Generator produces answer conditioned on full context
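A simple prompt-assembly function for concatenation-based fusion could look like this; the [QUERY]/[DOC i] tag format mirrors the illustration above and is an arbitrary convention rather than a requirement of any particular model.

```python
def build_prompt(query: str, documents: list[str], max_chars: int = 6000) -> str:
    parts = [f"[QUERY] {query}"]
    for i, doc in enumerate(documents, start=1):
        parts.append(f"[DOC {i}] {doc}")
    prompt = "\n".join(parts)
    # Crude guard against overflowing the generator's context window;
    # real systems count tokens rather than characters.
    return prompt[:max_chars]

prompt = build_prompt(
    "What is photosynthesis?",
    [
        "Photosynthesis is a process used by plants to convert light energy...",
        "The light-dependent reactions occur in the thylakoid membranes...",
    ],
)
print(prompt)
```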
Cross-Attention Fusion
More sophisticated approaches use cross-attention between query and documents:
Cross-Attention Fusion:
----------------------------------------------------------------
Query Tokens
|
v
+---------------+
| Self-Attention|
+-------+-------+
|
v
+---------------+
| Cross-Attention|<---- Document Representations
| (Query attends |
| to Documents) |
+-------+-------+
|
v
+---------------+
| Feed-Forward |
+---------------+
|
v
Output Tokens
Fusion-in-Decoder (FiD)
The Fusion-in-Decoder approach, introduced for open-domain QA, processes each document independently with the query, then fuses representations in the decoder:
Fusion-in-Decoder Architecture:
----------------------------------------------------------------
Query + Doc 1 --> Encoder --> Representation 1 --+
Query + Doc 2 --> Encoder --> Representation 2 --+
Query + Doc 3 --> Encoder --> Representation 3 --+--> Concatenate --> Decoder --> Answer
      ...         ...             ...            |
Query + Doc k --> Encoder --> Representation k --+
This allows the model to process many documents efficiently while still attending to all of them during generation.
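The shape-level sketch below captures the FiD idea: each (query, passage) pair is encoded independently, the encoder outputs are concatenated along the sequence axis, and a single decoder cross-attends to the combined memory. The encoder and decoder are generic stand-in modules with made-up dimensions, not a pretrained FiD model.

```python
import torch
import torch.nn as nn

k, seq_len, hidden = 4, 256, 512              # k retrieved passages

encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)

# Encode each "query + passage i" sequence independently (batched over k).
pair_embeddings = torch.randn(k, seq_len, hidden)      # token embeddings per pair
encoded = encoder_layer(pair_embeddings)               # (k, seq_len, hidden)

# Fuse: flatten the k encoded passages into one long memory sequence.
memory = encoded.reshape(1, k * seq_len, hidden)       # (1, k*seq_len, hidden)

# The decoder attends over all passages at once while generating the answer.
answer_prefix = torch.randn(1, 10, hidden)             # embeddings of the decoded prefix
decoder_out = decoder_layer(answer_prefix, memory)     # (1, 10, hidden)
print(decoder_out.shape)
```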
Training RAG Systems
End-to-End Training
The original RAG approach trains the retriever and generator jointly, minimizing the negative marginal log-likelihood over training pairs $(x, y)$:

$$\mathcal{L} = -\sum_{(x, y)} \log p(y \mid x) = -\sum_{(x, y)} \log \sum_{z} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$
However, this is computationally expensive because:
- Retrieval requires searching the entire corpus
- Gradients must flow through the retrieval operation
- Document embeddings must be periodically updated
Practical Training Strategies
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Frozen Retriever | Train only generator | Fast, stable | Suboptimal retrieval |
| Separate Training | Train retriever, then generator | Modular | No joint optimization |
| Periodic Updates | Update doc embeddings periodically | Balanced | Complexity |
| End-to-End | Joint training | Optimal | Expensive |
The Comprehensive Survey Perspective
According to the 2023 survey on RAG for Large Language Models, the field has evolved significantly since the original formulation:
"RAG has evolved from a simple retrieve-then-read pipeline to sophisticated systems incorporating query rewriting, iterative retrieval, and self-reflection mechanisms."
The survey identifies three paradigms:
- Naive RAG: Simple retrieve-then-generate pipeline
- Advanced RAG: Incorporates pre-retrieval optimization (query rewriting) and post-retrieval processing (re-ranking)
- Modular RAG: Flexible composition of retrieval, generation, and reasoning modules
Advanced RAG Techniques
Query Expansion and Rewriting
Poor queries lead to poor retrieval. Query expansion improves retrieval by reformulating the original query:
Query Rewriting Pipeline:
----------------------------------------------------------------
Original Query: "Why is the sky blue?"
|
v
+---------------+
| Query Rewriter| (LLM-based)
+---------------+
|
v
Expanded Queries:
- "Rayleigh scattering atmosphere blue light"
- "Physics of sky color wavelength"
- "Why does the sky appear blue during daytime"
|
v
Retrieve using all queries
|
v
Merge and re-rank results
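The orchestration can be sketched as follows; llm_rewrite and retrieve are hypothetical hooks for an LLM call and a vector-search call, and only the expand-retrieve-merge logic is the point.

```python
def expand_and_retrieve(query: str, llm_rewrite, retrieve, k: int = 5) -> list[str]:
    # Ask the LLM for alternative phrasings of the query.
    rewrites = llm_rewrite(query)            # e.g. ["Rayleigh scattering ...", ...]
    all_queries = [query] + rewrites

    # Retrieve for every variant, then merge while dropping duplicates.
    seen, merged = set(), []
    for q in all_queries:
        for doc_id, score in retrieve(q, k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, score))

    # Re-rank the merged candidates; here simply by retrieval score, though a
    # cross-encoder re-ranker is common in practice.
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in merged[:k]]
```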
Hypothetical Document Embeddings (HyDE)
HyDE generates a hypothetical answer first, then uses it for retrieval:
HyDE Process:
----------------------------------------------------------------
Query: "What causes aurora borealis?"
|
v
+------------------+
| LLM generates |
| hypothetical |
| answer |
+------------------+
|
v
Hypothetical: "The aurora borealis is caused by charged
particles from the sun interacting with gases in Earth's
atmosphere, creating colorful light displays..."
|
v
+------------------+
| Embed hypothetical|
| document |
+------------------+
|
v
Retrieve similar real documents
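A HyDE-style retrieval function might be structured like this; llm_generate, encoder, and index are hypothetical placeholders, and the key point is that the hypothetical answer, not the raw query, is embedded for nearest-neighbor search.

```python
def hyde_retrieve(query: str, llm_generate, encoder, index, k: int = 5):
    # 1. Generate a hypothetical answer to the query.
    hypothetical_doc = llm_generate(
        f"Write a short passage that answers the question: {query}"
    )
    # 2. Embed the hypothetical document instead of the raw query.
    vector = encoder.encode([hypothetical_doc])
    # 3. Retrieve real documents whose embeddings are closest to it.
    scores, doc_ids = index.search(vector, k)
    return doc_ids[0], scores[0]
```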
Iterative Retrieval
Complex queries may require multiple retrieval steps:
Iterative Retrieval:
----------------------------------------------------------------
Query: "How did Einstein's work influence quantum computing?"
Round 1: Retrieve docs about Einstein's contributions
--> Learn about photoelectric effect, EPR paradox
Round 2: Retrieve docs about quantum computing foundations
--> Learn about qubits, entanglement
Round 3: Retrieve docs connecting these concepts
--> Find connections between EPR and quantum gates
Final: Synthesize information from all rounds
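A bare-bones version of this loop is sketched below; retrieve and llm_next_query are hypothetical hooks, and each round's findings seed the next query until a fixed retrieval budget is exhausted.

```python
def iterative_retrieve(question: str, retrieve, llm_next_query, rounds: int = 3):
    collected, query = [], question
    for _ in range(rounds):
        docs = retrieve(query, k=5)
        collected.extend(docs)
        # Ask the LLM what to look up next, given what has been found so far.
        query = llm_next_query(question, collected)
    return collected
```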
Evaluation Metrics
Retrieval Metrics
| Metric | Formula | Description |
|---|---|---|
| Recall@k | Relevant in top-k / Total relevant | Coverage of relevant docs |
| Precision@k | Relevant in top-k / k | Accuracy of top-k |
| MRR | Mean of 1/rank of first relevant | Rank of first hit |
| NDCG | Normalized discounted cumulative gain | Rank-weighted relevance |
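Plain-Python implementations of Recall@k, Precision@k, and MRR for a single query, matching the definitions above; the example ranking and relevance judgments are made up.

```python
def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & set(relevant)) / k

def mrr(ranked, relevant):
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

ranked = ["d7", "d2", "d9", "d4", "d1"]      # retrieved order
relevant = {"d2", "d4"}                      # ground-truth relevant docs
print(recall_at_k(ranked, relevant, 3))      # 0.5   (d2 found, d4 missed)
print(precision_at_k(ranked, relevant, 3))   # 0.333...
print(mrr(ranked, relevant))                 # 0.5   (first relevant at rank 2)
```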
End-to-End Metrics
| Metric | Application | Notes |
|---|---|---|
| Exact Match | QA tasks | Strict string matching |
| F1 Score | QA tasks | Token overlap |
| BLEU/ROUGE | Generation | N-gram overlap |
| Faithfulness | Attribution | Does answer match sources? |
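Exact match and token-level F1 can be computed as below, in the style used for extractive QA evaluation; the normalization here is deliberately simplified (lowercasing and punctuation stripping only).

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                        # 1.0
print(round(token_f1("The capital is Paris", "Paris"), 2))  # 0.4
```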
Practical Implementation Considerations
Vector Database Selection
| Database | Scalability | Speed | Features |
|---|---|---|---|
| FAISS | Billions | Very fast | Open source, GPU support |
| Pinecone | Billions | Fast | Managed, easy API |
| Weaviate | Millions | Fast | Hybrid search |
| Milvus | Billions | Fast | Open source |
| Chroma | Millions | Moderate | Simple, local-first |
| Qdrant | Billions | Fast | Rust-based, filtering |
Latency Breakdown
A typical RAG query involves multiple steps:
| Step | Typical Latency | Notes |
|---|---|---|
| Query encoding | ~10-50 ms | Transformer forward pass |
| Vector search | ~10-100 ms | Depends on index size |
| Document fetch | ~5-50 ms | Database read |
| Context assembly | ~1-5 ms | String concatenation |
| LLM generation | ~500-2000 ms | Depends on model and output length |
| Total | ~550-2200 ms | |
Optimization targets:
- Caching frequent queries
- Smaller embedding models
- Approximate search algorithms
- Streaming generation
Challenges and Future Directions
Current Limitations
- Attribution Accuracy: Retrieved documents may not actually support generated claims
- Retrieval Failures: Dense retrieval struggles with certain query types
- Context Length: Limited ability to use many retrieved documents
- Latency: Additional retrieval step adds latency
- Maintenance: Document indices require updates as knowledge changes
Emerging Solutions
The 2023 RAG survey highlights several promising directions:
- Self-RAG: Models that decide when retrieval is needed
- Corrective RAG: Self-correction mechanisms for retrieval errors
- Graph RAG: Using knowledge graphs alongside dense retrieval
- Long-Context RAG: Leveraging extended context windows (100k+ tokens)
Conclusion
Retrieval-Augmented Generation represents a fundamental shift in how we build knowledge-intensive AI systems. By combining the fluency of neural text generation with the precision of information retrieval, RAG systems can provide accurate, attributable, and up-to-date responses.
The architecture's elegance lies in its modularity: the retriever and generator can be independently improved, and the knowledge base can be updated without retraining the model. As the comprehensive 2023 survey notes, RAG has evolved from a research technique to a foundational pattern for production AI systems.
The mathematical framework - from the bi-encoder retrieval score to the marginalized generation probability - provides a principled foundation for continued innovation. As language models grow more capable and retrieval systems more efficient, the synergy between these components will only deepen.
This article draws on peer-reviewed research including the foundational RAG paper by Lewis et al. (2020) and the comprehensive 2023 survey on RAG systems. For complete technical details, consult the original publications.
