A comprehensive guide to Retrieval-Augmented Generation (RAG), covering Vector Databases, Semantic Search, Embeddings, Chunking strategies, and the RAG workflow.

RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by retrieving relevant data from external sources to ground the model's responses in factual information. This guide covers the essential components of building a RAG system.

1. Introduction to Vector Databases

A Vector Database (Vector DB) is a specialized database designed to store and manage high-dimensional vector embeddings. Unlike traditional relational databases that store rows and columns, or document databases that store JSON objects, vector databases are optimized for storing and querying vectors—lists of numbers that represent data (text, images, audio) in a multi-dimensional space.
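
To make "lists of numbers" concrete, the short sketch below embeds one sentence and inspects the result. It assumes the langchain-openai package and an OPENAI_API_KEY in the environment; the same model appears later in this guide.

from langchain_openai import OpenAIEmbeddings

# Embed a single sentence and inspect the resulting vector
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector = embeddings.embed_query("Vector databases store embeddings.")

print(type(vector))   # <class 'list'>
print(len(vector))    # 1536 dimensions for text-embedding-3-small
print(vector[:5])     # the first few floats, e.g. [0.012, -0.034, ...]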

Key Characteristics

  • High-Dimensional Indexing: Uses algorithms to index vectors for fast retrieval.
  • Similarity Search: Optimized to find vectors that are "closest" to a query vector, rather than exact matches.
  • Scalability: Built to handle millions or billions of embeddings.

Why use a Vector DB?

Traditional databases are great for exact keyword matching (e.g., WHERE name = 'John'). However, they struggle with "fuzzy" concepts. Vector DBs allow you to ask, "Find me items similar to this one," or "Find documents that talk about climate change," even if they don't use those exact words.

2. Semantic Search

Semantic Search goes beyond keyword matching (lexical search) to understand the meaning and context behind a query. It uses vector embeddings to represent the semantic meaning of text.

How it Works

  1. Embedding: Both the search query and the documents in the database are converted into vector embeddings using the same embedding model (e.g., OpenAI text-embedding-3, HuggingFace models).
  2. Distance Calculation: The search engine calculates the distance/similarity between the query vector and document vectors in the vector space.
  3. Ranking: Documents are ranked based on their closeness to the query vector. Closer vectors = higher semantic similarity.

Feature     | Keyword Search (Lexical)             | Semantic Search (Vector)
Mechanism   | Matches exact words/tokens           | Matches meaning/concepts
Context     | Ignores context; sensitive to typos  | Understands context and intent
Synonyms    | Fails unless explicitly mapped       | Handles synonyms naturally (e.g., "car" ~ "vehicle")
Use Case    | Specific terms (IDs, names)          | Conceptual queries, Q&A
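
The embed → compare → rank flow can be shown end to end with plain NumPy. The embed() function below is a toy stand-in for a real embedding model (it just hashes words into a small vector), so it illustrates the mechanics rather than true semantic quality.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model: hash words into a 64-dim vector
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = [
    "Electric cars reduce emissions.",
    "The stock market fell sharply today.",
    "Vehicles powered by batteries are better for the climate.",
]

# 1. Embed the query and the documents with the same model
query_vec = embed("environmentally friendly cars")
doc_vecs = [embed(d) for d in documents]

# 2. Distance calculation: cosine similarity (vectors are already normalized)
scores = [float(np.dot(query_vec, dv)) for dv in doc_vecs]

# 3. Ranking: higher similarity first
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")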

3. Storing Embeddings

To build a RAG system, you first need to generate embeddings for your data and store them in a Vector DB.

Process

  1. Load Data: Read your source documents.
  2. Chunking: Split text into smaller, manageable pieces (see Chunking section).
  3. Embed: Pass chunks through an embedding model.
  4. Upsert: Save the chunk text, metadata, and generated vector to the DB.

Code Example (Python)

Here is an example using LangChain and ChromaDB (a popular open-source vector store).

import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# 1. Load Data
loader = TextLoader("./knowledge_base.txt")
documents = loader.load()

# 2. Chunk Data
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(documents)

# 3. Initialize Embedding Model
# Ensure OPENAI_API_KEY is set in environment variables
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# 4. Store Embeddings (ChromaDB)
# This creates a local vector store, generates embeddings, and saves them
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    collection_name="my_rag_collection",
    persist_directory="./chroma_db"
)

print(f"Stored {len(splits)} chunks in the vector database.")

Popular Vector Databases

  • Chroma: Open-source, easy to run locally.
  • Pinecone: Managed cloud service, highly scalable.
  • Weaviate: Open-source, hybrid search (keyword + vector).
  • Qdrant: High-performance, open-source.
  • pgvector: Vector similarity search extension for PostgreSQL.

4. Querying Embeddings (Similarity Search)

Once data is indexed, you can query it. This involves finding vectors in your database that are similar to your query vector.

Similarity Algorithms (Distance Metrics)

The "closeness" of vectors is measured using distance metrics. Choosing the right one depends on your embedding model and use case.

  1. Cosine Similarity: Measures the cosine of the angle between two vectors.
    • Range: -1 (opposite) to 1 (identical).
    • Use case: Most common for text embeddings; focuses on orientation, not magnitude.
  2. Euclidean Distance (L2): Measures the straight-line distance between two points.
    • Use case: When vector magnitude matters. Lower value = more similar.
  3. Dot Product: Measures the product of vector magnitudes and the cosine of the angle.
    • Use case: Optimized for normalized vectors (where it equals Cosine Similarity).
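
A quick NumPy sketch comparing the three metrics on two small vectors that point in the same direction but differ in magnitude:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

# Cosine similarity: angle only, ignores magnitude -> 1.0 here
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean (L2) distance: straight-line distance -> nonzero despite same direction
euclidean = np.linalg.norm(a - b)

# Dot product: grows with magnitude; equals cosine similarity only for unit-length vectors
dot = np.dot(a, b)

print(f"cosine={cosine:.3f}, euclidean={euclidean:.3f}, dot={dot:.3f}")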

Search Algorithms (Indexing)

To search efficiently without comparing the query to every vector (Exact kNN), Vector DBs use Approximate Nearest Neighbor (ANN) algorithms.

  • HNSW (Hierarchical Navigable Small World): The gold standard for vector search. It creates a multi-layered graph structure for extremely fast search with high accuracy.
  • IVF (Inverted File Index): Partitions the vector space into clusters (Voronoi cells). Search is restricted to the nearest clusters.
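
For a feel of how ANN indexing works outside of a full vector DB, here is a hedged sketch using the FAISS library (assuming faiss-cpu and numpy are installed); it builds an exact index, an IVF index, and an HNSW index over random vectors.

import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 10_000
xb = np.random.random((n, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")   # query vectors

# Exact kNN baseline: compares the query against every stored vector
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 3)

# IVF: partition the space into clusters, then search only the nearest ones
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)   # 100 clusters
ivf.train(xb)                                 # learn cluster centroids
ivf.add(xb)
ivf.nprobe = 10                               # clusters visited per query
D_ivf, I_ivf = ivf.search(xq, 3)

# HNSW: multi-layered graph, no training step required
hnsw = faiss.IndexHNSWFlat(d, 32)             # 32 = graph connectivity (M)
hnsw.add(xb)
D_hnsw, I_hnsw = hnsw.search(xq, 3)

print(I_exact[0], I_ivf[0], I_hnsw[0])        # approximate results usually agree with exact search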

Querying Example (LangChain)

# Continue from previous example...

query = "How does photosynthesis work?"

# Perform Similarity Search
# k=3 returns the top 3 most similar chunks
results = vectorstore.similarity_search(query, k=3)

print("Top results:")
for doc in results:
    print(f"---\n{doc.page_content}\n---")

Stored Procedure Implementation (PostgreSQL + pgvector)

In a relational DB like Postgres, you can encapsulate vector search logic in a function or stored procedure.

-- Prerequisite: Enable pgvector extension
-- CREATE EXTENSION vector;

-- Assume we have a table: items (id, content, embedding vector(1536))

CREATE OR REPLACE FUNCTION search_knowledge_base(
    query_embedding vector(1536), 
    match_threshold float, 
    match_count int
)
RETURNS TABLE (
    id int,
    content text,
    similarity float
)
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT
        items.id,
        items.content,
        1 - (items.embedding <=> query_embedding) as similarity -- Cosine similarity
    FROM
        items
    WHERE
        1 - (items.embedding <=> query_embedding) > match_threshold
    ORDER BY
        items.embedding <=> query_embedding -- Order by distance (closest first)
    LIMIT
        match_count;
END;
$$;

-- Usage:
-- SELECT * FROM search_knowledge_base('[0.1, -0.2, ...]', 0.7, 5);
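
Calling the function from application code is then a normal SQL query. Below is a hedged sketch using psycopg2; the connection string is a placeholder, and the embedding is serialized as a pgvector string literal.

import psycopg2
from langchain_openai import OpenAIEmbeddings

# Embed the question with the same model used to populate the table (1536 dims)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
query_embedding = embeddings.embed_query("How does photosynthesis work?")

# pgvector accepts a string literal such as '[0.1,0.2,...]' cast to vector
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

conn = psycopg2.connect("dbname=rag user=postgres")  # placeholder connection details
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT id, content, similarity "
        "FROM search_knowledge_base(%s::vector, %s, %s)",
        (vector_literal, 0.7, 5),
    )
    for row_id, content, similarity in cur.fetchall():
        print(f"{similarity:.3f}  {content[:80]}")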

5. Chunking Strategies

Chunking is the process of breaking down large texts into smaller segments. It is critical for RAG performance because:

  1. Context Window: LLMs have a limit on how much text they can process.
  2. Semantic Granularity: A whole book has "mixed" meaning. A single paragraph focuses on a specific topic, making it easier to match precisely.

Common Chunking Methods

1. Fixed-Size Chunking

Splits text based on a fixed number of characters or tokens.

  • Pros: Simple, computationally cheap.
  • Cons: Can cut sentences in half, losing context.
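
A minimal, dependency-free sketch of fixed-size chunking with character overlap (the function name is illustrative):

def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Slide a window of `chunk_size` characters; each step moves forward by
    # (chunk_size - overlap) so neighbouring chunks share some context.
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# A 2,500-character document becomes 4 overlapping chunks
print(len(fixed_size_chunks("x" * 2500)))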

2. Recursive Character Chunking

Splits text on a prioritized list of separators (e.g., "\n\n", "\n", " ", "") applied recursively. It tries to keep related text (like paragraphs) together.

  • Pros: Respects document structure (paragraphs, sentences).
  • Cons: Slightly more complex logic.

3. Semantic Chunking

Uses an LLM or embedding model to identify "topic shifts" in the text and splits there.

  • Pros: Highly coherent chunks.
  • Cons: Slower and more expensive.
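
LangChain ships an experimental splitter built on this idea; a brief sketch, assuming the langchain_experimental package is installed:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Splits where the embedding similarity between adjacent sentence groups drops
semantic_splitter = SemanticChunker(OpenAIEmbeddings(model="text-embedding-3-small"))

text = (
    "Photosynthesis converts light into chemical energy. Plants rely on chlorophyll. "
    "Meanwhile, the stock market had a volatile week and investors were cautious."
)
for chunk in semantic_splitter.split_text(text):
    print("---", chunk)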

LangChain Chunking Examples

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter
)

# Assume `large_text` holds the raw document text to split
large_text = open("./knowledge_base.txt").read()

# Method 1: Recursive Character Splitter (Most Common)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200, # Keep some overlap to maintain context between chunks
    separators=["\n\n", "\n", " ", ""]
)
chunks_recursive = recursive_splitter.split_text(large_text)

# Method 2: Token-based Splitter (Good for LLM limits)
from langchain_text_splitters import TokenTextSplitter

token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base", # OpenAI encoding
    chunk_size=500,
    chunk_overlap=50
)
chunks_token = token_splitter.split_text(large_text)

# Method 3: Markdown Header Splitter (Structure-aware)
# Great for documentation to keep sections together
markdown_document = "# Section 1\nContent...\n## Subsection 1.1\nDetails..."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

6. RAG: Retrieval & Generation

The RAG pipeline combines all previous steps.

The RAG Workflow

  1. Retrieval:
    • User asks a question.
    • System converts question to a vector.
    • System performs similarity search to find top k relevant text chunks.
  2. Augmentation:
    • The retrieved text chunks are inserted into a "Context" block in the system prompt.
  3. Generation:
    • The prompt (Context + User Question) is sent to the LLM.
    • LLM generates an answer based only on the provided context.

RAG Code Example (LangChain Chain)

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# 1. Setup LLM and Retriever
llm = ChatOpenAI(model="gpt-4o")
retriever = vectorstore.as_retriever() # From previous ChromaDB example

# 2. Create Prompt Template
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# 3. Create the Chain
# 'stuff' chain puts all docs into the context variable
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

# 4. Run RAG
response = rag_chain.invoke({"input": "What is semantic search?"})
print(response["answer"])
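
The response dictionary returned by create_retrieval_chain also carries the retrieved documents under the "context" key, which is what enables source citation:

# Inspect which chunks were retrieved and used to ground the answer
for doc in response["context"]:
    source = doc.metadata.get("source", "unknown")
    print(f"[{source}] {doc.page_content[:80]}")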

Summary of RAG Benefits

  • Up-to-date Knowledge: Can answer questions about recent data without re-training the model.
  • Reduced Hallucination: Grounding the model in retrieved facts makes it less likely to make things up.
  • Source Citation: You can show exactly which documents were used to generate the answer.