AI/ML/NLP · December 5, 2025 · 15 min read

RAG Tutorial: Build Knowledge-Grounded AI with Retrieval-Augmented Generation

Learn RAG from scratch - chunking strategies, embeddings, vector databases, and fusion mechanisms. Build AI that cites its sources and grounds its answers in retrieved evidence rather than hallucinated facts.


Large language models have demonstrated remarkable capabilities in generating fluent, coherent text. Yet they suffer from a fundamental limitation: their knowledge is frozen at training time, and they cannot reliably access or cite specific sources. Retrieval-Augmented Generation (RAG) addresses this by combining the generative power of neural networks with the precision of information retrieval systems.


The Knowledge Problem in Language Models

Traditional language models encode knowledge implicitly in their parameters during training. This creates several challenges:

Challenge        | Description                                     | Impact
-----------------|-------------------------------------------------|----------------------------------
Knowledge Cutoff | Parameters reflect the training-data date       | Cannot answer about recent events
Hallucination    | Model generates plausible but false information | Unreliable for factual queries
Opacity          | No clear source attribution                     | Difficult to verify claims
Update Cost      | Retraining required for new knowledge           | Expensive and time-consuming

The foundational RAG paper by Lewis et al. (2020) introduced an elegant solution: rather than storing all knowledge in model parameters, retrieve relevant documents at inference time and condition generation on this retrieved context.

"RAG models combine the best of both worlds: the parametric memory of pre-trained seq2seq models and the non-parametric memory of retrieval-based models, accessing knowledge from an external corpus." - Lewis et al., 2020


RAG Architecture Overview

The RAG architecture consists of three primary components working in concert:

RAG System Architecture:
----------------------------------------------------------------

    User Query
        |
        v
+------------------+
|   Query Encoder  |  E_q: Transforms query into dense vector
|   (Bi-Encoder)   |
+--------+---------+
         |
         | query embedding
         v
+------------------+      +----------------------+
|  Dense Retrieval |----->|   Document Index     |
|     (MIPS)       |      | (FAISS / ScaNN /     |
+--------+---------+      |  Pinecone / etc.)    |
         |                +----------------------+
         | top-k documents
         v
+------------------+
|  Fusion Module   |  Combines query + retrieved docs
+--------+---------+
         |
         | augmented context
         v
+------------------+
|    Generator     |  Seq2seq model (BART, T5, GPT, etc.)
|   (LLM Decoder)  |
+--------+---------+
         |
         v
    Generated Answer

The Retrieval Score

The core retrieval mechanism uses dense vector similarity. Given a query q and document d, the retrieval score is computed as the inner product of their embeddings:

s(q, d) = E_q(q)^T E_d(d)

Where:

  • E_q is the query encoder
  • E_d is the document encoder
  • Both encoders map text to dense vectors in the same embedding space

This formulation enables efficient Maximum Inner Product Search (MIPS) over millions of documents using approximate nearest neighbor algorithms.
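
As a concrete illustration, here is a minimal Python sketch of the inner-product score and a top-k MIPS lookup. It assumes the numpy and faiss packages are available, and the random vectors stand in for real encoder outputs.

Dense Retrieval Score (Python sketch):
----------------------------------------------------------------

import numpy as np
import faiss  # assumed installed; any ANN library with inner-product search would do

dim = 768
rng = np.random.default_rng(0)

# Toy stand-ins for E_d(d): one row per document embedding
doc_embeddings = rng.standard_normal((10_000, dim)).astype("float32")

# Exact inner-product index; IVF/HNSW variants trade accuracy for speed
index = faiss.IndexFlatIP(dim)
index.add(doc_embeddings)

# Toy stand-in for E_q(q)
query_embedding = rng.standard_normal((1, dim)).astype("float32")

# s(q, d) = E_q(q)^T E_d(d), returned for the top-5 documents
scores, doc_ids = index.search(query_embedding, 5)
print(doc_ids[0], scores[0])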


Dense Retrieval: The Bi-Encoder Architecture

How Bi-Encoders Work

The bi-encoder architecture processes queries and documents independently through separate (or shared) transformer encoders:

Bi-Encoder Architecture:
----------------------------------------------------------------

Query: "What is the capital of France?"

        +------------------+
        |   Query Text     |
        +--------+---------+
                 |
                 v
        +------------------+
        |  Query Encoder   |  BERT-based transformer
        |   (E_q)          |
        +--------+---------+
                 |
                 v
        +------------------+
        | Query Embedding  |  768-dim dense vector
        |   q = [0.2, ...] |
        +------------------+


Document: "Paris is the capital and largest city of France..."

        +------------------+
        |  Document Text   |
        +--------+---------+
                 |
                 v
        +------------------+
        | Document Encoder |  BERT-based transformer
        |   (E_d)          |
        +--------+---------+
                 |
                 v
        +------------------+
        |  Doc Embedding   |  768-dim dense vector
        |   d = [0.3, ...] |
        +------------------+


Similarity: s(q, d) = q^T d = 0.87
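
The same flow can be reproduced with an off-the-shelf bi-encoder. The sketch below assumes the sentence-transformers package and uses one shared encoder for both roles, a common simplification of separate E_q and E_d; the model name is just a popular default.

Bi-Encoder Similarity (Python sketch):
----------------------------------------------------------------

from sentence_transformers import SentenceTransformer

# One shared encoder plays the role of both E_q and E_d
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

query = "What is the capital of France?"
documents = [
    "Paris is the capital and largest city of France.",
    "The Great Barrier Reef is the largest coral reef system in the world.",
]

q = encoder.encode(query)        # shape: (384,)
d = encoder.encode(documents)    # shape: (2, 384)

# s(q, d) = q^T d for each document; the Paris sentence should score highest
scores = d @ q
print(scores)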

Training Dense Retrievers

Dense retrievers are typically trained with contrastive learning. For each query q, we have:

  • One positive document d^+ (relevant)
  • Multiple negative documents d^-_1, d^-_2, ..., d^-_n (irrelevant)

The training objective maximizes the score for positive pairs while minimizing it for negatives. The loss function uses the negative log-likelihood of the positive document:

\mathcal{L} = -\log \frac{\exp(s(q, d^+))}{\exp(s(q, d^+)) + \sum_{i=1}^{n} \exp(s(q, d^-_i))}

This is equivalent to the cross-entropy loss over a softmax distribution of document scores.
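
A minimal PyTorch sketch of this objective, using in-batch negatives so that every other document in the batch serves as a d^- for a given query; the embeddings here are random placeholders.

Contrastive Loss (Python sketch):
----------------------------------------------------------------

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """In-batch negative log-likelihood of the positive document.

    query_emb: (B, dim) query embeddings E_q(q)
    doc_emb:   (B, dim) positive document embeddings E_d(d^+); row i pairs with
               query i, and every other row acts as a negative for query i.
    """
    scores = query_emb @ doc_emb.T           # (B, B) matrix of s(q_i, d_j)
    targets = torch.arange(scores.size(0))   # positives sit on the diagonal
    return F.cross_entropy(scores, targets)  # softmax over documents, -log p(d^+ | q)

# Toy usage with random embeddings
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)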

Hard Negative Mining

The quality of negative examples significantly impacts retrieval performance. Strategies include:

Strategy           | Description                                | Effectiveness
-------------------|--------------------------------------------|--------------
Random Negatives   | Sample random documents from the corpus    | Baseline
BM25 Negatives     | Top BM25 results that are not relevant     | Good
In-Batch Negatives | Other queries' positives in the same batch | Efficient
Hard Negatives     | High-scoring but incorrect documents       | Best

Document Chunking Strategies

Real-world documents are often too long for transformer context windows. Effective chunking is critical for RAG performance.

Common Chunking Approaches

Document Chunking Strategies:
----------------------------------------------------------------

Original Document (10,000 tokens)
|
+-- Fixed-Size Chunking
|   |
|   +-- Chunk 1: tokens [0:512]
|   +-- Chunk 2: tokens [512:1024]
|   +-- Chunk 3: tokens [1024:1536]
|   ...
|   (Simple but may break semantic units)
|
+-- Sentence-Based Chunking
|   |
|   +-- Chunk 1: sentences 1-5 (~400 tokens)
|   +-- Chunk 2: sentences 6-12 (~500 tokens)
|   ...
|   (Preserves sentence boundaries)
|
+-- Semantic Chunking
|   |
|   +-- Chunk 1: "Introduction" section
|   +-- Chunk 2: "Methods" section
|   +-- Chunk 3: "Results" section
|   ...
|   (Preserves document structure)
|
+-- Recursive Chunking
    |
    +-- Split on paragraphs
        +-- If too large, split on sentences
            +-- If still too large, split on tokens
    (Hierarchical, respects structure)

Chunk Size Trade-offs

Chunk Size              | Advantages                      | Disadvantages
------------------------|---------------------------------|------------------------
Small (100-200 tokens)  | Precise, fine-grained retrieval | May lose context
Medium (300-500 tokens) | Balanced, general purpose       | -
Large (500-1000 tokens) | More context per chunk          | Less precise retrieval

Overlapping Chunks

To prevent information loss at chunk boundaries, many systems use overlapping windows:

Overlapping Chunk Strategy:
----------------------------------------------------------------

Document: [==========================================]

Chunk 1:  [==========]
                   |-- overlap (50-100 tokens)
Chunk 2:       [==========]
                        |-- overlap
Chunk 3:            [==========]
                             |-- overlap
Chunk 4:                 [==========]

The overlap ensures that information spanning chunk boundaries is captured in at least one chunk.
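
A minimal sketch of fixed-size chunking with overlap. It splits on whitespace tokens for simplicity; a production pipeline would typically count tokens with the embedding model's own tokenizer.

Overlapping Chunker (Python sketch):
----------------------------------------------------------------

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into whitespace-token chunks using a sliding window with overlap."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

document = "word " * 1200
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk.split()))   # 512, 512, 304 tokens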


The RAG Probability Distribution

The mathematical formulation of RAG defines a marginalized probability over retrieved documents. For input x (query) and output y (generated answer):

p(y|x) = \sum_{z \in \text{top-k}} p(z|x) p(y|x,z)

Where:

  • z represents a retrieved document
  • p(z|x) is the retrieval probability (from dense retrieval scores)
  • p(y|x,z) is the generation probability conditioned on query and document

RAG-Sequence vs RAG-Token

The Lewis et al. paper introduced two variants:

RAG-Sequence: Uses the same document for generating the entire sequence

p_{\text{RAG-Sequence}}(y|x) = \sum_{z \in \text{top-k}} p(z|x) \prod_{i=1}^{N} p(y_i|x, z, y_{1:i-1})

RAG-Token: Can use different documents for each output token

p_{\text{RAG-Token}}(y|x) = \prod_{i=1}^{N} \sum_{z \in \text{top-k}} p(z|x) p(y_i|x, z, y_{1:i-1})

RAG-Sequence vs RAG-Token:
----------------------------------------------------------------

RAG-Sequence:
Query --> Retrieve [Doc A, Doc B, Doc C]
          |
          +--> Generate with Doc A: "Paris is the capital..."
          +--> Generate with Doc B: "The capital of France..."
          +--> Generate with Doc C: "France's capital city..."
          |
          +--> Marginalize: weighted sum of complete sequences

RAG-Token:
Query --> Retrieve [Doc A, Doc B, Doc C]
          |
          Token 1: marginalize over docs --> "The"
          Token 2: marginalize over docs --> "capital"
          Token 3: marginalize over docs --> "is"
          ...
          (Each token can draw from different document)
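
The difference is easiest to see numerically. The numpy sketch below marginalizes hand-picked retrieval and per-token generation probabilities both ways; the numbers are illustrative, not taken from any real model.

RAG-Sequence vs RAG-Token Marginalization (Python sketch):
----------------------------------------------------------------

import numpy as np

# Toy setup: 3 retrieved documents, a 4-token answer
p_z_given_x = np.array([0.5, 0.3, 0.2])   # retrieval probabilities p(z|x)
p_tok = np.array([                        # p(y_i | x, z, y_<i): rows = docs, cols = tokens
    [0.9, 0.8, 0.7, 0.9],
    [0.6, 0.7, 0.8, 0.5],
    [0.4, 0.5, 0.6, 0.4],
])

# RAG-Sequence: score the whole sequence under each document, then mix
p_sequence = np.sum(p_z_given_x * np.prod(p_tok, axis=1))

# RAG-Token: mix over documents independently at every token position
p_token = np.prod(p_z_given_x @ p_tok)

print(f"RAG-Sequence p(y|x) = {p_sequence:.4f}")
print(f"RAG-Token    p(y|x) = {p_token:.4f}")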

Fusion Mechanisms

Fusion mechanisms determine how retrieved documents are combined with the query for generation.

Concatenation-Based Fusion

The simplest approach concatenates retrieved documents with the query:

Concatenation Fusion:
----------------------------------------------------------------

Input to Generator:

[QUERY] What is photosynthesis?
[DOC 1] Photosynthesis is a process used by plants to convert
        light energy into chemical energy...
[DOC 2] The light-dependent reactions occur in the thylakoid
        membranes and require sunlight...
[DOC 3] Carbon fixation occurs in the Calvin cycle, where CO2
        is converted into glucose...

--> Generator produces answer conditioned on full context
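
In code, this fusion is just prompt assembly. A minimal sketch, assuming retrieved chunks arrive as plain strings; the bracketed template mirrors the diagram above and is a convention, not a fixed standard.

Concatenation Fusion (Python sketch):
----------------------------------------------------------------

def build_prompt(query: str, retrieved_docs: list[str], max_docs: int = 3) -> str:
    """Concatenate the query with the top retrieved chunks into one generator prompt."""
    lines = [f"[QUERY] {query}"]
    for i, doc in enumerate(retrieved_docs[:max_docs], start=1):
        lines.append(f"[DOC {i}] {doc}")
    lines.append("Answer the query using only the documents above, citing document numbers.")
    return "\n".join(lines)

prompt = build_prompt(
    "What is photosynthesis?",
    ["Photosynthesis is a process used by plants to convert light energy into chemical energy.",
     "The light-dependent reactions occur in the thylakoid membranes and require sunlight."],
)
print(prompt)  # feed this string to any instruction-tuned LLM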

Cross-Attention Fusion

More sophisticated approaches use cross-attention between query and documents:

Cross-Attention Fusion:
----------------------------------------------------------------

        Query Tokens
             |
             v
    +-----------------+
    |  Self-Attention |
    +--------+--------+
             |
             v
    +-----------------+
    | Cross-Attention |<---- Document Representations
    | (Query attends  |
    |  to Documents)  |
    +--------+--------+
             |
             v
    +-----------------+
    |  Feed-Forward   |
    +-----------------+
             |
             v
       Output Tokens

Fusion-in-Decoder (FiD)

The Fusion-in-Decoder approach, introduced for open-domain QA, processes each document independently with the query, then fuses representations in the decoder:

Fusion-in-Decoder Architecture:
----------------------------------------------------------------

Query + Doc 1 --> Encoder --> Representation 1 --|
Query + Doc 2 --> Encoder --> Representation 2 --|
Query + Doc 3 --> Encoder --> Representation 3 --|---> Concatenate
        ...                         ...          --|      |
Query + Doc k --> Encoder --> Representation k --|      |
                                                        v
                                                   +--------+
                                                   | Decoder|
                                                   +--------+
                                                        |
                                                        v
                                                    Answer

This allows the model to process many documents efficiently while still attending to all of them during generation.
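
A rough sketch of the idea with Hugging Face transformers and a stock t5-small (not fine-tuned for QA): each query+passage pair is encoded separately, the encoder states are concatenated, and the decoder attends over all of them. This mirrors how FiD is commonly implemented, but treat it as an illustration rather than the reference implementation.

Fusion-in-Decoder (Python sketch):
----------------------------------------------------------------

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

query = "What is photosynthesis?"
passages = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Calvin cycle fixes carbon dioxide into glucose.",
]

# Encode each (query, passage) pair independently
enc = tok([f"question: {query} context: {p}" for p in passages],
          return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    encoder_states = model.encoder(**enc).last_hidden_state   # (k, seq_len, hidden)

# Fuse: flatten all passage encodings into one long sequence for the decoder
fused = BaseModelOutput(
    last_hidden_state=encoder_states.reshape(1, -1, encoder_states.size(-1)))
mask = enc["attention_mask"].reshape(1, -1)

answer_ids = model.generate(encoder_outputs=fused, attention_mask=mask, max_new_tokens=32)
print(tok.decode(answer_ids[0], skip_special_tokens=True))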


Training RAG Systems

End-to-End Training

The original RAG approach trains the retriever and generator jointly. The total loss is:

\mathcal{L}_{\text{RAG}} = -\log p(y|x) = -\log \sum_{z \in \text{top-k}} p(z|x) p(y|x,z)

However, this is computationally expensive because:

  1. Retrieval requires searching the entire corpus
  2. Gradients must flow through the retrieval operation
  3. Document embeddings must be periodically updated

Practical Training Strategies

Strategy          | Description                        | Pros         | Cons
------------------|------------------------------------|--------------|----------------------
Frozen Retriever  | Train only the generator           | Fast, stable | Suboptimal retrieval
Separate Training | Train retriever, then generator    | Modular      | No joint optimization
Periodic Updates  | Update doc embeddings periodically | Balanced     | Complexity
End-to-End        | Joint training                     | Optimal      | Expensive

The Comprehensive Survey Perspective

According to the 2023 survey on RAG for Large Language Models, the field has evolved significantly since the original formulation:

"RAG has evolved from a simple retrieve-then-read pipeline to sophisticated systems incorporating query rewriting, iterative retrieval, and self-reflection mechanisms."

The survey identifies three paradigms:

  1. Naive RAG: Simple retrieve-then-generate pipeline
  2. Advanced RAG: Incorporates pre-retrieval optimization (query rewriting) and post-retrieval processing (re-ranking)
  3. Modular RAG: Flexible composition of retrieval, generation, and reasoning modules

Advanced RAG Techniques

Query Expansion and Rewriting

Poor queries lead to poor retrieval. Query expansion improves retrieval by reformulating the original query:

Query Rewriting Pipeline:
----------------------------------------------------------------

Original Query: "Why is the sky blue?"
                    |
                    v
            +---------------+
            | Query Rewriter|  (LLM-based)
            +---------------+
                    |
                    v
Expanded Queries:
  - "Rayleigh scattering atmosphere blue light"
  - "Physics of sky color wavelength"
  - "Why does the sky appear blue during daytime"
                    |
                    v
        Retrieve using all queries
                    |
                    v
        Merge and re-rank results
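
A minimal sketch of this pipeline; llm and retrieve are placeholder callables standing in for whatever model client and vector store you use, so the prompt and function names are assumptions rather than a fixed API.

Query Expansion (Python sketch):
----------------------------------------------------------------

from typing import Callable

def expanded_retrieve(query: str,
                      llm: Callable[[str], str],
                      retrieve: Callable[[str, int], list[str]],
                      n_rewrites: int = 3,
                      k: int = 5) -> list[str]:
    """Rewrite the query with an LLM, retrieve for every variant, then merge."""
    prompt = (f"Rewrite the following question as {n_rewrites} short search queries, "
              f"one per line:\n{query}")
    rewrites = [line.strip() for line in llm(prompt).splitlines() if line.strip()]

    seen, merged = set(), []
    for q in [query, *rewrites]:          # always keep the original query as well
        for doc in retrieve(q, k):
            if doc not in seen:           # naive dedup; a re-ranking step would go here
                seen.add(doc)
                merged.append(doc)
    return merged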

Hypothetical Document Embeddings (HyDE)

HyDE generates a hypothetical answer first, then uses it for retrieval:

HyDE Process:
----------------------------------------------------------------

Query: "What causes aurora borealis?"
            |
            v
    +------------------+
    | LLM generates    |
    | hypothetical     |
    | answer           |
    +------------------+
            |
            v
Hypothetical: "The aurora borealis is caused by charged
particles from the sun interacting with gases in Earth's
atmosphere, creating colorful light displays..."
            |
            v
    +------------------+
    | Embed hypothetical|
    | document         |
    +------------------+
            |
            v
    Retrieve similar real documents
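
A compact sketch of HyDE using the same placeholder llm callable plus hypothetical encoder and index objects; only the two-step structure (draft a hypothetical answer, then embed and search with it) is the essential part.

HyDE (Python sketch):
----------------------------------------------------------------

def hyde_retrieve(query: str, llm, encoder, index, k: int = 5):
    """HyDE: retrieve with the embedding of a hypothetical answer, not the raw query."""
    # Step 1: let the LLM draft a plausible (possibly inaccurate) answer
    hypothetical = llm(f"Write a short passage that answers: {query}")

    # Step 2: embed the hypothetical passage and search the real corpus with it
    vector = encoder.encode(hypothetical)
    return index.search(vector, k)   # the k real documents most similar to the draft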

Iterative Retrieval

Complex queries may require multiple retrieval steps:

Iterative Retrieval:
----------------------------------------------------------------

Query: "How did Einstein's work influence quantum computing?"

Round 1: Retrieve docs about Einstein's contributions
         --> Learn about photoelectric effect, EPR paradox

Round 2: Retrieve docs about quantum computing foundations
         --> Learn about qubits, entanglement

Round 3: Retrieve docs connecting these concepts
         --> Find connections between EPR and quantum gates

Final: Synthesize information from all rounds
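
One way to sketch such a loop, again with placeholder llm and retrieve callables; the stopping rule and prompts are assumptions chosen for illustration.

Iterative Retrieval (Python sketch):
----------------------------------------------------------------

def iterative_rag(question: str, llm, retrieve, max_rounds: int = 3, k: int = 3) -> str:
    """Alternate retrieval and follow-up question generation, then synthesize an answer."""
    notes, query = [], question
    for _ in range(max_rounds):
        notes.extend(retrieve(query, k))           # gather evidence for this round
        followup = llm(
            "Given the question and the notes so far, state one missing sub-question, "
            f"or reply DONE.\nQuestion: {question}\nNotes: {notes}"
        )
        if followup.strip().upper().startswith("DONE"):
            break
        query = followup                           # retrieve for the follow-up next round

    return llm(f"Answer the question using these notes.\nQuestion: {question}\nNotes: {notes}")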

Evaluation Metrics

Retrieval Metrics

Metric      | Formula                                 | Description
------------|-----------------------------------------|---------------------------
Recall@k    | (relevant in top-k) / (total relevant)  | Coverage of relevant docs
Precision@k | (relevant in top-k) / k                 | Accuracy of top-k
MRR         | mean of 1 / rank of first relevant doc  | Rank of first hit
NDCG        | normalized discounted cumulative gain   | Rank-weighted relevance
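
The first three can be implemented in a few lines; a minimal sketch over ranked document-id lists (NDCG is left to a library such as scikit-learn):

Retrieval Metrics (Python sketch):
----------------------------------------------------------------

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

print(recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))     # 0.5
print(precision_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))  # 0.33...
print(mrr([["d3", "d1"]], [{"d1"}]))                          # 0.5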

End-to-End Metrics

Metric       | Application | Notes
-------------|-------------|------------------------------------
Exact Match  | QA tasks    | Strict string matching
F1 Score     | QA tasks    | Token overlap
BLEU/ROUGE   | Generation  | N-gram overlap
Faithfulness | Attribution | Does the answer match the sources?

Practical Implementation Considerations

Vector Database Selection

Vector Database Comparison:
----------------------------------------------------------------

Database     | Scalability | Speed    | Features
-------------|-------------|----------|------------------
FAISS        | Billions    | Very Fast| Open source, GPU
Pinecone     | Billions    | Fast     | Managed, easy API
Weaviate     | Millions    | Fast     | Hybrid search
Milvus       | Billions    | Fast     | Open source
Chroma       | Millions    | Moderate | Simple, local-first
Qdrant       | Billions    | Fast     | Rust-based, filtering

Latency Breakdown

A typical RAG query involves multiple steps:

RAG Latency Analysis:
----------------------------------------------------------------

Query Encoding:     ~10-50ms   (transformer forward pass)
Vector Search:      ~10-100ms  (depends on index size)
Document Fetch:     ~5-50ms    (database read)
Context Assembly:   ~1-5ms     (string concatenation)
LLM Generation:     ~500-2000ms (depends on model/length)
                    ----------
Total:              ~550-2200ms

Optimization targets:
- Caching frequent queries
- Smaller embedding models
- Approximate search algorithms
- Streaming generation

Challenges and Future Directions

Current Limitations

  1. Attribution Accuracy: Retrieved documents may not actually support generated claims
  2. Retrieval Failures: Dense retrieval struggles with certain query types
  3. Context Length: Limited ability to use many retrieved documents
  4. Latency: Additional retrieval step adds latency
  5. Maintenance: Document indices require updates as knowledge changes

Emerging Solutions

The 2023 RAG survey highlights several promising directions:

  • Self-RAG: Models that decide when retrieval is needed
  • Corrective RAG: Self-correction mechanisms for retrieval errors
  • Graph RAG: Using knowledge graphs alongside dense retrieval
  • Long-Context RAG: Leveraging extended context windows (100k+ tokens)

Conclusion

Retrieval-Augmented Generation represents a fundamental shift in how we build knowledge-intensive AI systems. By combining the fluency of neural text generation with the precision of information retrieval, RAG systems can provide accurate, attributable, and up-to-date responses.

The architecture's elegance lies in its modularity: the retriever and generator can be independently improved, and the knowledge base can be updated without retraining the model. As the comprehensive 2023 survey notes, RAG has evolved from a research technique to a foundational pattern for production AI systems.

The mathematical framework - from the bi-encoder retrieval score s(q, d) = E_q(q)^T E_d(d) to the marginalized generation probability p(y|x) = \sum_{z} p(z|x) p(y|x,z) - provides a principled foundation for continued innovation. As language models grow more capable and retrieval systems more efficient, the synergy between these components will only deepen.


This article draws on peer-reviewed research including the foundational RAG paper by Lewis et al. (2020) and the comprehensive 2023 survey on RAG systems. For complete technical details, consult the original publications.
