RAG Explained — Connect Your Data to Any LLM
Core concepts of Retrieval-Augmented Generation: how it works, embeddings, vector databases, and practical implementation tips.

Ask ChatGPT about your company's internal documentation and it draws a blank. That data wasn't in its training set. Fine-tuning the model to learn it is expensive and slow. RAG is the practical alternative.
What RAG Is
RAG stands for Retrieval-Augmented Generation. The name sounds complex but the idea is simple.
- User asks a question
- Relevant documents get retrieved from a knowledge base
- Those documents get stuffed into the LLM prompt as context
- The LLM answers using that context
Think of it as an open-book exam. The LLM isn't pulling answers from memory — it's reading reference material that you placed next to it.
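Under that framing, the whole loop fits in a few lines. This is an illustrative sketch only: `retrieve` and `llm` are placeholder callables standing in for a real vector search and a real model call, not actual APIs.

```python
def answer_with_rag(question, retrieve, llm):
    """RAG in miniature: retrieve -> build prompt -> generate.
    `retrieve` and `llm` are placeholders, not real library calls."""
    chunks = retrieve(question)                        # steps 1-2: fetch relevant docs
    context = "\n\n".join(chunks)                      # step 3: stuff into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                                 # step 4: answer from context
```

Everything that follows in this article is about making each of those placeholders work well.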
Why RAG Instead of Fine-Tuning
Fine-tuning and RAG both aim to give an LLM knowledge it doesn't have, but the approaches are completely different.
Fine-tuning modifies the model's weights. You prepare training data, run GPU-intensive retraining, and the model internalizes the knowledge. It's expensive. When data changes, you retrain. Hallucinations are harder to control because the model "knows" the information (or thinks it does).
RAG leaves the model untouched. You just attach a search system. When data changes, update the documents. You can cite sources, making answers traceable. The model answers based on what it reads, not what it memorized.
Some systems use both. But for the majority of "I want my LLM to know about my data" scenarios, RAG wins on cost-effectiveness. It's almost always the right first step.
The Three Core Components
A RAG pipeline breaks into three parts.
1. Document Preprocessing
Raw documents need to be broken into chunks the LLM can digest. PDFs, Word docs, web pages — first convert to plain text, then split into chunks of appropriate size.
Chunk size is typically 500-1000 tokens. Too large and search accuracy drops. Too small and chunks lose context. Overlapping chunks (where the end of one overlaps the start of the next) help prevent information loss at paragraph boundaries.
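A sliding-window splitter illustrates the overlap idea. This is a character-based sketch (characters as a stand-in for tokens); production chunkers count real tokens and prefer paragraph or section boundaries:

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    # Slide a window of chunk_size characters, advancing by
    # chunk_size - overlap so adjacent chunks share a boundary region.
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The shared 50-character tail/head means a sentence straddling a chunk boundary still appears whole in at least one chunk.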
Raw document → Text extraction → Chunking → Embedding → Vector DB storage
2. Embeddings and Vector DBs
This is the heart of RAG. Converting text into numerical vectors is called "embedding."
"Cat" and "dog" look nothing alike as strings, but in embedding space they're close together — they're semantically similar. "Automobile" is far from both. Embeddings let you calculate semantic similarity as a number.
Common embedding models include OpenAI's text-embedding-3-small and open-source options like bge-m3. Every chunk gets converted to an embedding and stored in a vector database.
Vector databases are specialized for efficiently searching these vectors. Key options:
- Pinecone — Managed service, easy setup
- Weaviate — Supports hybrid search (keyword + vector)
- Chroma — Lightweight, great for local development
- pgvector — PostgreSQL extension, leverages existing DB infrastructure
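Conceptually, what all of these do is nearest-neighbor search over stored vectors. A brute-force version fits in a few lines; the databases above make the same operation fast at scale with approximate indexes such as HNSW:

```python
import math

def top_k(query_vec, index, k=3):
    """index: list of (doc_id, vector) pairs.
    Exact nearest-neighbor search by cosine similarity --
    the operation a vector DB optimizes."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    scored = sorted(index, key=lambda item: cos(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```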
3. Retrieval and Generation
When a user question comes in:
- The question gets embedded using the same embedding model
- The vector DB returns the k most similar chunks
- Those chunks go into the prompt as "context"
- The LLM generates an answer referencing that context
[System prompt]
Answer the question based on the context below.
If the answer isn't in the context, say "I don't know."
[Context]
{retrieved chunk 1}
{retrieved chunk 2}
{retrieved chunk 3}
[Question]
{user question}
That's the full RAG flow.
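Assembling that prompt is plain string formatting, nothing more. A minimal builder following the template above (not a library API, just a sketch):

```python
def build_prompt(question, chunks):
    # Join retrieved chunks into one context block, then fill
    # the [Context] / [Question] template from above.
    context = "\n\n".join(chunks)
    return (
        "Answer the question based on the context below.\n"
        'If the answer isn\'t in the context, say "I don\'t know."\n\n'
        f"[Context]\n{context}\n\n"
        f"[Question]\n{question}"
    )
```

The "say I don't know" instruction is what keeps the model from falling back on its training data when retrieval comes up empty.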
Practical Considerations
The concept is straightforward. Making it work well is another story.
Chunking strategy determines answer quality. The same document chunked differently produces very different search results. Splitting by paragraph or section boundaries beats splitting by raw character count. Including titles and metadata in chunks improves search accuracy.
Consider hybrid search. Vector search (semantic) alone struggles with exact keywords and proper nouns. BM25-style keyword search fills the gap. In practice, combining both and merging results (Reciprocal Rank Fusion, etc.) is common.
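Reciprocal Rank Fusion itself is only a few lines: each result list contributes 1/(k + rank) per document, so documents that rank well in both lists float to the top. The two hit lists below are hypothetical; k=60 is the constant commonly used in practice:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of doc IDs, e.g. one from vector
    # search and one from BM25. Higher fused score = better.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical result lists
keyword_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Note that `doc_a` (ranks 1 and 2) beats `doc_c` (ranks 3 and 1): RRF rewards consistent placement across both searches.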
Re-ranking improves precision. Pull a generous set of candidates first (say 20), then re-rank with a dedicated model to keep only the top few for context. Cohere Rerank and bge-reranker are popular choices here.
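The two-stage shape is easy to sketch. Here `search` and `score` stand in for a vector DB query and a cross-encoder reranker such as bge-reranker; both are hypothetical callables, not real APIs:

```python
def retrieve_then_rerank(query, search, score, n_candidates=20, final_k=3):
    # Stage 1: cheap, broad recall from the vector index.
    candidates = search(query, n_candidates)
    # Stage 2: slower but more precise scoring of each (query, doc) pair.
    reranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return reranked[:final_k]
```

The design point: the expensive scorer only ever sees `n_candidates` documents, not the whole corpus.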
Evaluation is hard. Systematically measuring "are the answers good?" is tricky. Frameworks like RAGAS exist, but domain experts reviewing outputs remains necessary. Automated metrics only go so far.
Getting Started
The fastest path to a working RAG prototype is LangChain + Chroma. In Python, you can have something functional in a few dozen lines.
# Minimal LangChain + Chroma sketch. Package names and APIs are an
# assumption (recent LangChain split packages; pip install
# langchain-community langchain-openai langchain-chroma) and
# OPENAI_API_KEY must be set.
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma

documents = DirectoryLoader("./docs").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vector_store = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))

# At query time
query = "What were quarterly revenue trends?"
relevant_chunks = vector_store.similarity_search(query, k=3)
context = "\n\n".join(c.page_content for c in relevant_chunks)
answer = ChatOpenAI().invoke(f"Answer from this context:\n{context}\n\nQuestion: {query}")
Production-grade RAG requires experimenting with chunk strategies, comparing embedding models, adding re-ranking, and building evaluation pipelines. But getting a working prototype first is what matters most. It's the only way to see where the bottlenecks are and what needs improving.
RAG has become the default pattern for LLM applications. Internal search systems, customer support chatbots, document-based Q&A — it shows up everywhere. Nail the fundamentals and the applications follow naturally.