Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by retrieving relevant data from external sources to ground the model's responses in factual information. This guide covers the essential components of building a RAG system.
A Vector Database (Vector DB) is a specialized database designed to store and manage high-dimensional vector embeddings. Unlike traditional relational databases that store rows and columns, or document databases that store JSON objects, vector databases are optimized for storing and querying vectors—lists of numbers that represent data (text, images, audio) in a multi-dimensional space.
Traditional databases are great for exact keyword matching (e.g., WHERE name = 'John'). However, they struggle with "fuzzy" concepts. Vector DBs allow you to ask, "Find me items similar to this one," or "Find documents that talk about climate change," even if they don't use those exact words.
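As a minimal illustration (using NumPy and made-up 4-dimensional vectors; real embeddings have hundreds or thousands of dimensions), this is conceptually what a vector DB does: store vectors and return the ones closest to a query.

```python
import numpy as np

# Hypothetical embeddings for three documents (illustrative values only)
store = {
    "doc_about_cars":    np.array([0.9, 0.1, 0.0, 0.2]),
    "doc_about_climate": np.array([0.1, 0.8, 0.6, 0.0]),
    "doc_about_cooking": np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend this is the embedding of the query "effects of climate change"
query = np.array([0.2, 0.7, 0.7, 0.1])

# Brute-force nearest neighbor: compare the query against every stored vector
best = max(store, key=lambda name: cosine_similarity(query, store[name]))
print(best)  # -> doc_about_climate
```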
Semantic Search goes beyond keyword matching (lexical search) to understand the meaning and context behind a query. It uses vector embeddings to represent the semantic meaning of text.
| Feature | Keyword Search (Lexical) | Semantic Search (Vector) |
|---|---|---|
| Mechanism | Matches exact words/tokens | Matches meaning/concepts |
| Context | Ignores context; sensitive to typos | Understands context and intent |
| Synonyms | Fails unless explicitly mapped | Handles synonyms naturally (e.g., "car" ~ "vehicle") |
| Use Case | Specific terms (IDs, names) | Conceptual queries, Q&A |
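As a rough sketch of the difference (reusing the OpenAI embedding model that appears later in this guide; any embedding model would work), a keyword check misses a synonym while embedding similarity still ranks the document as relevant:

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

document = "Our fleet of electric vehicles reduces urban emissions."
query = "electric car"

# Lexical search: the literal token "car" never appears, so keyword matching fails
print("keyword match:", "car" in document.lower().split())  # False

# Semantic search: the embeddings of query and document are still close
doc_vec = np.array(embedding_model.embed_documents([document])[0])
query_vec = np.array(embedding_model.embed_query(query))
# OpenAI embeddings are unit-normalized, so the dot product equals cosine similarity
similarity = float(np.dot(doc_vec, query_vec))
print("cosine similarity:", round(similarity, 3))
```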
To build a RAG system, you first need to generate embeddings for your data and store them in a Vector DB.
Here is an example using LangChain and ChromaDB (a popular open-source vector store).
import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# 1. Load Data
loader = TextLoader("./knowledge_base.txt")
documents = loader.load()
# 2. Chunk Data
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(documents)
# 3. Initialize Embedding Model
# Ensure OPENAI_API_KEY is set in environment variables
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# 4. Store Embeddings (ChromaDB)
# This creates a local vector store, generates embeddings, and saves them
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    collection_name="my_rag_collection",
    persist_directory="./chroma_db"
)
print(f"Stored {len(splits)} chunks in the vector database.")
Once data is indexed, you can query it. This involves finding vectors in your database that are similar to your query vector.
The "closeness" of vectors is measured using distance metrics. Choosing the right one depends on your embedding model and use case.
To search efficiently without comparing the query to every vector (exact kNN), vector DBs use Approximate Nearest Neighbor (ANN) algorithms such as HNSW (graph-based) or IVF (cluster-based), trading a small amount of recall for much faster queries.
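Here is a minimal sketch of building an ANN index with FAISS (an assumed extra dependency, not used elsewhere in this guide); HNSW is one of the most widely used ANN index types:

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

dim = 1536                                               # embedding dimensionality
vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in for real embeddings

index = faiss.IndexHNSWFlat(dim, 32)  # HNSW graph with 32 links per node
index.add(vectors)                    # index the corpus

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 3)  # approximate top-3 neighbors
print(ids[0], distances[0])
```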
# Continue from previous example...
query = "How does photosynthesis work?"
# Perform Similarity Search
# k=3 returns the top 3 most similar chunks
results = vectorstore.similarity_search(query, k=3)
print("Top results:")
for doc in results:
    print(f"---\n{doc.page_content}\n---")
In a relational DB like Postgres, you can encapsulate vector search logic in a function or stored procedure.
-- Prerequisite: Enable pgvector extension
-- CREATE EXTENSION vector;
-- Assume we have a table: items (id, content, embedding vector(1536))
CREATE OR REPLACE FUNCTION search_knowledge_base(
    query_embedding vector(1536),
    match_threshold float,
    match_count int
)
RETURNS TABLE (
    id int,
    content text,
    similarity float
)
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT
        items.id,
        items.content,
        1 - (items.embedding <=> query_embedding) AS similarity -- Cosine similarity
    FROM
        items
    WHERE
        1 - (items.embedding <=> query_embedding) > match_threshold
    ORDER BY
        items.embedding <=> query_embedding -- Order by distance (closest first)
    LIMIT
        match_count;
END;
$$;
-- Usage:
-- SELECT * FROM search_knowledge_base('[0.1, -0.2, ...]', 0.7, 5);
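From application code, the function can be called like any other query. Below is a minimal sketch using `psycopg2` (the connection string is a placeholder), passing the embedding as a pgvector string literal:

```python
import psycopg2
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
query_vec = embeddings.embed_query("How does photosynthesis work?")

# pgvector accepts a '[x,y,...]' string literal for vector parameters
vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder credentials
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT id, content, similarity FROM search_knowledge_base(%s::vector, %s, %s)",
        (vec_literal, 0.7, 5),
    )
    for row in cur.fetchall():
        print(row)
```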
Chunking is the process of breaking down large texts into smaller segments. It is critical for RAG performance: chunks must be small enough to be retrieved precisely and to fit in the LLM's context window, yet large enough to preserve the context needed to answer a question. Common strategies include:
Fixed-size chunking splits text based on a fixed number of characters or tokens.
Recursive chunking splits text on a prioritized list of separators (e.g., "\n\n", "\n", " ", "") recursively, trying to keep related text (like paragraphs) together.
Semantic chunking uses an LLM or embedding model to identify "topic shifts" in the text and splits there (a sketch appears after the code examples below).
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
# Text to split -- reusing the knowledge base file from the indexing example
with open("./knowledge_base.txt", encoding="utf-8") as f:
    large_text = f.read()
# Method 1: Recursive Character Splitter (Most Common)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # Keep some overlap to maintain context between chunks
    separators=["\n\n", "\n", " ", ""]
)
chunks_recursive = recursive_splitter.split_text(large_text)
# Method 2: Token-based Splitter (Good for LLM limits)
from langchain_text_splitters import TokenTextSplitter
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # OpenAI encoding
    chunk_size=500,
    chunk_overlap=50
)
chunks_token = token_splitter.split_text(large_text)
# Method 3: Markdown Header Splitter (Structure-aware)
# Great for documentation to keep sections together
markdown_document = "# Section 1\nContent...\n## Subsection 1.1\nDetails..."
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)
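The semantic strategy can be sketched with LangChain's experimental `SemanticChunker` (assumes the `langchain_experimental` package is installed), which splits wherever the embedding distance between adjacent sentences jumps sharply:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Method 4 (sketch): Semantic splitter -- breaks at large embedding-distance jumps
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
)
chunks_semantic = semantic_splitter.split_text(large_text)
```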
The RAG pipeline combines all previous steps: the retriever fetches the top k relevant text chunks, they are injected into the prompt as context, and the LLM generates a grounded answer.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
# 1. Setup LLM and Retriever
llm = ChatOpenAI(model="gpt-4o")
retriever = vectorstore.as_retriever() # From previous ChromaDB example
# 2. Create Prompt Template
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
# 3. Create the Chain
# 'stuff' chain puts all docs into the context variable
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
# 4. Run RAG
response = rag_chain.invoke({"input": "What is semantic search?"})
print(response["answer"])
ReAct Agents
ReAct (Reasoning and Acting) is a powerful agent pattern that combines the reasoning capabilities of Large Language Models with the ability to take actions through tools. This approach allows agents to break down complex problems, gather necessary information, and provide accurate responses.
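As a minimal sketch of the pattern applied to RAG (assuming the `langgraph` package; the tool name and description below are made up), the vector store from earlier can be exposed to a ReAct agent as a tool it may choose to call:

```python
from langgraph.prebuilt import create_react_agent
from langchain.tools.retriever import create_retriever_tool
from langchain_openai import ChatOpenAI

# Wrap the retriever as a tool the agent can decide to call
retriever_tool = create_retriever_tool(
    vectorstore.as_retriever(),
    name="search_knowledge_base",
    description="Searches the RAG knowledge base for relevant passages.",
)

agent = create_react_agent(ChatOpenAI(model="gpt-4o"), tools=[retriever_tool])

result = agent.invoke({"messages": [("user", "What is semantic search?")]})
print(result["messages"][-1].content)
```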