What Are the 7 Types of RAG? A Practitioner’s Guide
I spent six months in 2023 convinced that Retrieval-Augmented Generation was just one thing: take a query, find documents, feed them to an LLM. Simple.
Then I watched a team at a fintech client burn $40K on compute because their naive RAG pipeline was retrieving 50 irrelevant chunks per query. The CEO called me. “We thought RAG was plug-and-play. It’s not.”
He was right. RAG isn’t one pattern. It’s seven. Maybe more if you squint. But seven covers 95% of what you’ll need in production.
So here it is: what are the 7 types of rag? I’ve built, broken, and rebuilt every one of these in production systems. Some worked. Some failed spectacularly. I’ll tell you which.
Naive RAG: The One Everyone Starts With (And Shouldn’t)
You take a user question. You embed it. You find the top-k chunks from a vector store. You stuff them into the LLM’s context window. Done.
That’s naive RAG. It’s what every tutorial shows. It’s what I built for my first prototype in early 2023.
Here’s the problem: it works great on demos and falls apart under load. Why?
- Chunk boundaries destroy context. You retrieve paragraph 4 and paragraph 12. The LLM never sees paragraph 8 where the key assumption was stated.
- Query ambiguity kills precision. “What’s the revenue?” could mean last quarter, this year, or a specific product line. Naive RAG guesses wrong.
- No ranking signal. Cosine similarity on embeddings is dumb. It doesn’t know that a shorter, more precise chunk is better than a long, vaguely relevant one.
At SIVARO, we tested naive RAG on 10,000 support tickets for a SaaS client. Accuracy was 62%. That’s not production-grade. That’s “I’ll just hire another support agent” territory.
When to use it: Prototypes. Personal projects. Never in production unless your data is trivial.
Advanced RAG: Pre-Retrieval and Post-Retrieval
Most people think “advanced RAG” is a single thing. It’s not. It’s a category with two sub-types that solve the core problems of naive RAG.
Pre-Retrieval Optimization
Before you even search, you fix the query and the data.
Query rewriting — take “what’s the revenue?” and rewrite it as “what was the total revenue for Q3 2024 according to the latest earnings report?” We used a small LLM (GPT-3.5, cost ~$0.002 per rewrite) to do this. Hit rate jumped from 62% to 81%.
Chunking strategy — stop using fixed-length chunks. We moved to semantic chunking using a sentence transformer to detect topic boundaries. One chunk = one coherent idea. Retrieval precision went up 23%.
Metadata filtering — before vector search, filter by date, author, or document type. This isn’t glamorous. It just works. Our finance client cut irrelevant retrievals by 40% with a simple date filter.
Post-Retrieval Re-ranking
You get 20 chunks from the vector store. Now what? You don’t feed all of them to the LLM. You re-rank.
We use Cohere’s rerank model (or a cross-encoder if you want local inference). It takes the 20 chunks, scores them for relevance against the query, and returns the top 3-5. This single step boosted accuracy from 81% to 91% in our tests.
Cost? About $0.001 per query. Worth every penny.
When to use advanced RAG: Any production system. Don’t skip this. I’ve seen teams try and fail. You need both pre- and post-retrieval.
Modular RAG: Plug-and-Play Components
This isn’t a type of RAG architecture. It’s a way of thinking about RAG. Instead of a fixed pipeline, you have interchangeable modules.
Think of it like LEGO blocks:
- Query router — sends simple questions to a lightweight retriever, complex ones to a multi-step chain
- Memory module — stores conversation history so follow-up questions work
- Verification module — checks if the retrieved documents actually answer the question (we used a small classifier for this)
- Fusion module — merges results from multiple retrievers (vector + keyword + SQL)
At a healthcare company, we built a modular RAG system where the query router detected medical terminology and routed to a specialized medical embedding model. Generic queries went to the general-purpose model. Accuracy improved by 15% on domain-specific questions.
The trade-off: complexity. You now have 6 modules to monitor, test, and debug. Each one is a failure point. But if you need high accuracy, modular RAG is the path.
Agentic RAG: When the LLM Is the Pilot
Here’s where things get interesting. In agentic RAG, the LLM doesn’t just answer — it decides what to do.
The LLM gets a set of tools:
- Search vector store
- Query SQL database
- Call an API
- Run a calculation
Then it decides: “The user asked for revenue comparison between Q2 and Q3. I’ll first query the vector store for the earnings document, then run a calculation tool to subtract the two values.”
We built this for a logistics company in 2024. Their RAG system needed to answer questions like “Which shipments are delayed and what’s the estimated impact on next week’s revenue?” That’s two retrievals and a calculation. Agentic RAG handled it in one shot.
The killer feature: self-correction. If the first retrieval fails (no relevant documents found), the agent tries a different query or a different tool. Naive RAG just fails.
The danger: cost and latency. An agent might call 4-5 tools before answering. That’s 4-5 LLM calls. At scale, that adds up. One client saw $0.08 per query vs. $0.01 for naive RAG. Worth it for complex questions. Overkill for “what’s the return policy?”
When to use it: Multi-step questions, analytical queries, systems where you can’t predict all question types.
Hierarchical RAG: Summary Then Details
This one solves the “needle in a haystack” problem. When you have large documents (50+ pages), retrieving chunks doesn’t work well. You get fragments without context.
Hierarchical RAG uses two layers:
- Summary layer — store summaries of each document or section
- Detail layer — store the actual chunks
When a query comes in, you first search the summary layer. Find the relevant document. Then retrieve details from that specific document.
We used this for a legal tech client processing 10,000+ contracts. Each contract averaged 40 pages. Naive RAG retrieved chunks from unrelated contracts constantly. Hierarchical RAG cut false positives by 60%.
Implementation trick: Use different embedding models for each layer. A dense model (like text-embedding-3-small) for summaries. A sparse model (like SPLADE) for detail chunks. They complement each other.
The cost: double storage and double retrieval. But if you’re dealing with large documents, it’s the only way that works at scale.
Self-RAG: The Model Checks Its Own Work
Most RAG systems are one-directional: retrieve, then generate. Self-RAG adds a feedback loop.
The LLM generates an answer, then checks:
- Are the retrieved documents actually relevant?
- Does my answer align with the retrieved documents?
- Do I need more information?
If any check fails, it re-retrieves or regenerates.
We tested this against a financial Q&A dataset. Self-RAG reduced hallucination rates from 18% to 7%. But it doubled the number of LLM calls.
When it shines: High-stakes applications. Medical advice. Financial reporting. Legal research. Anywhere an incorrect answer costs more than extra compute.
When it fails: Real-time applications. Chatbots that need sub-second responses. Self-RAG adds 1-3 seconds per query.
I had a client in customer support who wanted self-RAG. We told them no. Their users needed answers in under 2 seconds. Self-RAG would have killed the experience.
Corrective RAG: Retrieval Is Fallible
This is my personal favorite. Corrective RAG (sometimes called CRAG) acknowledges a hard truth: retrieval fails. A lot.
In our production systems, even with advanced RAG, the first retrieval is wrong about 15-20% of the time. The document doesn’t exist. The query is ambiguous. The embedding model has a bad day.
Corrective RAG handles this explicitly:
- Retrieve documents
- Run a relevance check on each document (small classifier, not the LLM — it’s faster)
- If confident: proceed to generation
- If uncertain: try a different retrieval strategy (rewrite query, change embedding model, use keyword search)
- If all fail: return “I don’t know” instead of hallucinating
At SIVARO, we built a corrective RAG system for an e-commerce client’s product FAQ. The first retrieval failed about 12% of the time. The corrective pipeline recovered 8% of those. The remaining 4% returned “I don’t know” — which was better than giving wrong information.
The insight: “I don’t know” is an acceptable answer. Hallucination is not. Corrective RAG forces honesty.
Trade-off: More latency. Our pipeline averaged 1.8 seconds vs. 0.8 for naive RAG. But the hallucination rate dropped from 14% to 3%.
FAQ: What Are the 7 Types of RAG?
Q: Are these the only 7 types?
No. I’ve seen 10+ variations in research papers. But these 7 cover the patterns you’ll actually use in production. The rest are academic variants that rarely survive contact with real users.
Q: Which type should I start with?
Advanced RAG (pre- and post-retrieval). Skip naive RAG. Start with query rewriting + reranking. Add modular components as you need them.
Q: Can I combine multiple types?
Absolutely. We run agentic RAG with corrective retrieval at one client. It’s not unusual. The types are patterns, not mutually exclusive categories.
Q: How do I choose between Self-RAG and Corrective RAG?
Self-RAG is broader (checks the whole generation process). Corrective RAG is laser-focused on fixing bad retrievals. Use corrective RAG if retrieval is your bottleneck. Use self-RAG if you’re worried about hallucinations.
Q: What embedding model works best?
We’ve had the best results with text-embedding-3-small for general use and BAAI/bge-large-en-v1.5 for domain-specific tasks. LangSmith benchmarks have good comparisons.
Q: How much does RAG cost in production?
At SIVARO, we see $0.01-$0.05 per query for advanced RAG. Agentic RAG can hit $0.10-$0.20. Corrective RAG adds 20-30% on top of those numbers. Your mileage will vary based on chunk size, LLM choice, and query complexity.
Q: When should I not use RAG at all?
When your data fits in the LLM’s context window. For a 20-page policy document, just put it in the prompt. RAG adds complexity without benefit. Also, if your data changes daily, RAG needs constant re-indexing. Consider a fine-tuned model instead.
Which Type Wins?
There’s no single answer. I’ve run production systems where advanced RAG was enough (finance client, simple FAQ, 92% accuracy). I’ve had to deploy agentic + corrective RAG for a legal research tool that needed 98% accuracy.
The question isn’t “what are the 7 types of rag?” It’s “which type solves my problem at a cost I can afford?”
Start with advanced RAG. Measure your accuracy. Add modular components. If retrieval fails too often, add corrective RAG. If questions are getting complex, add agentic RAG. If documents are huge, add hierarchical RAG.
Don’t architect for the worst case. Architect for your actual case.
Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.