Apr 11, 2026
Retrieval-Augmented Generation Done Right: Architectures That Actually Work
AI Development


Daniel Brooks · November 13, 2025 · 11 min read · 561 views

RAG became the default answer to a simple question: how do you get an LLM to talk about things it was never trained on, using data that changes every day? Most teams implement the same recipe. Split documents into chunks, stuff them into a vector store, run a similarity search on user queries, feed the top few chunks into the prompt, hope hallucinations go away.

It works in demos. It falls apart under load, with real users, on real data. Done properly, retrieval is not a plugin. It is a system: document processing, indexing, retrieval, generation, and evaluation. If any part is sloppy, your "RAG app" turns back into a slightly slower hallucination engine with some citations stapled on. The rest of this piece is about what survives contact with production.

## The document layer: where most RAG systems fail

Most RAG failures start before the first token of retrieval. Teams dump PDFs and HTML pages into a pipeline that:

- Extracts text with little regard for structure

- Splits everything into fixed-size chunks
- Throws away layout, headings, and metadata

Then they wonder why the model answers with half a paragraph from page 47 of a policy document that barely mentions the topic. If your input is garbage, your retrieval will be polite, vectorized garbage.

You need to decide what a "unit of meaning" is in your domain. Sometimes it is a paragraph. Sometimes it is a section including its heading and subheading. Sometimes it is a full document with a structured table of contents. Good chunking respects boundaries: headings, bullet lists, table rows, dialogue turns. It often includes some overlap, but not so much that each chunk becomes a blurry copy of its neighbors. You want each item in the index to be specific enough to be relevant and large enough to carry context.

Metadata is not decoration. Source system, document type, creation and revision times, author, jurisdiction, product line, language, permission tags: these fields will save you when users start asking "what changed since last week?" or "only for the German market" or "only contracts signed after 2022". If you do not index that metadata and wire it into your retrieval filters, you are wasting half of RAG's value.

## Indexing: more than one vector store

"Put it in a vector database" sounds simple. Under load, it becomes a series of design choices that matter. Dense embeddings capture fuzzy semantic similarity. Sparse retrieval (classical keyword search) captures exact term matches and rare vocabulary. Hybrid systems that combine both consistently outperform either alone in knowledge-heavy applications.

You also rarely have one homogeneous corpus. Product docs, legal policies, support tickets, structured FAQs, internal wiki pages: these behave differently. Dumping them all into one index and letting cosine similarity sort it out is laziness disguised as simplicity. Architectures that work use:

- Multiple indices for different domains or document types
- Hybrid search that blends lexical and dense results
- A re-ranking step that looks at candidate passages in the context of the query, not just in embedding space

The re-ranker is often more important than the base embedding model. A decent cross-encoder or small model dedicated to reranking can correct a lot of mistakes from the first stage. In practice, you retrieve generously (for example 50 candidates), then let the re-ranker decide which 5 or 10 really matter.

## Retrieval: query understanding, not just similarity

User queries are messy. They contain context, instructions, and sometimes contradictions. Naive RAG runs a similarity search on the raw string and hopes for the best. Better architectures separate "what the user wants to do" from "what the system should search". There are three useful steps here.

First, normalize the query. Strip out politeness, restate the task in a canonical form, fold in relevant conversation history. You can use a smaller model for this, or the main model with a strict prompt.

Second, sometimes you need multiple queries. A question like "compare our Enterprise and Pro plans, with focus on security and SLAs" actually implies at least two retrieval actions: one for plan comparison, one specifically for security and SLAs. Splitting these sub-queries and retrieving for each improves coverage.

Third, control the search space with filters. If you know the tenant, product, language, or timeframe, turn those into structured filters rather than extra natural language. Force the search engine to respect hard constraints.

In other words, teach the system to search deliberately, not just pan around the entire corpus with a single embedding.

## Generation: answering from evidence, not vibes

If you hand the model a pile of retrieved text and say "answer the question," you are asking for selective hallucination. It will blend retrieved content with its priors and output something that sounds convincing.
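The retrieval pipeline described above (hard metadata filters first, blended sparse and dense scoring, over-retrieval, then a second-stage re-ranker) can be sketched in a few lines. Everything here, the names and the toy keyword scoring included, is an illustrative assumption, not any particular library's API:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    text: str
    meta: dict = field(default_factory=dict)

def keyword_score(query: str, chunk: Chunk) -> float:
    # Sparse signal: fraction of query terms appearing verbatim in the chunk.
    terms = set(query.lower().split())
    hits = sum(1 for t in terms if t in chunk.text.lower())
    return hits / max(len(terms), 1)

def hybrid_retrieve(query, chunks, dense_score, filters=None, k=50, alpha=0.5):
    # Hard constraints first: filters are enforced, never left to similarity.
    if filters:
        chunks = [c for c in chunks
                  if all(c.meta.get(f) == v for f, v in filters.items())]
    # Blend dense (semantic) and sparse (lexical) scores, retrieve generously.
    scored = [(alpha * dense_score(query, c) + (1 - alpha) * keyword_score(query, c), c)
              for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def rerank(query, candidates, cross_score, k=10):
    # Second stage: a (stand-in) cross-encoder scores query and passage together,
    # and only the survivors reach the prompt.
    return sorted(candidates, key=lambda c: cross_score(query, c), reverse=True)[:k]
```

In a real system `dense_score` comes from an embedding model and `cross_score` from a cross-encoder; the point is the shape: filter, blend, over-retrieve (for example 50), re-rank down to the 5 or 10 that matter.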
Systems that work in practice do a few things differently.

They tell the model explicitly to ground answers in the retrieved content, to quote or reference specific passages, and to admit when the documents do not support an answer. That is a prompt design problem, but it is also an evaluation problem: you have to punish models that ignore evidence in favor of fluent guesses.

They structure the context. Dumping ten chunks into a flat prompt makes it hard for the model to know which passages correspond to which source. Grouping by document, adding short summaries and titles, and using clear separators increases the chance that the model will anchor properly.

They limit context length aggressively. It is tempting to stuff as much as possible into the prompt "just in case". In practice, retrieval quality drops when you exceed what the model can attend to meaningfully. You want just enough context, not everything.

## Architectural patterns that actually hold up

Once you have the basic pieces, a few architectures show up again and again in systems that survive real-world use.

One is the skinny RAG layer behind a strong general model. Here, you rely on a frontier LLM for reasoning and language quality, and use RAG purely as a way to inject current, domain-specific facts. The indexing is simple but careful, the retrieval is modest, and most of the complexity sits in prompt design and evaluation.

Another is the two-stage RAG stack. A smaller or cheaper model handles query rewriting and candidate retrieval. A re-ranker model refines the set. Then a more capable LLM performs the final answer generation. This is cost-efficient and flexible: you can swap pieces independently.

A third pattern is domain-specific RAG.
Instead of a single mega-index, you build separate stacks per domain (legal, HR, product, support) with different chunking strategies, embedding models, and eval sets. A top-level router decides which domain to hit for a given query. This avoids the "one vector store to rule them all" bottleneck.

Finally, there is offline RAG. For some use cases, you do not need live retrieval at all. You pre-compute summaries, change logs, or synthetic FAQs from your documents and serve those directly, maybe with a light LLM layer on top. This reduces latency and cost, and it can be more reliable than ad-hoc retrieval on every request.

## Failure modes that matter

When RAG fails, it often fails silently.

The most common failure mode is retrieval drift. Something changes in the corpus, indexing pipeline, or filters, and suddenly the top results shift away from the truly relevant documents. The model still answers fluently. Users notice weeks later, if at all.

Another is boundary errors. Important information straddles chunk boundaries and gets cut in half. Neither half looks relevant enough to make the top-k results, so the model never sees the one sentence that actually answers the question.

A third is permissions. In multi-tenant systems, retrieval leaks information across tenants because metadata filters are misapplied or missing. That is an architectural bug, not a model issue, but users will phrase it as "the AI showed me someone else's data".

You do not prevent these failures with clever prompts. You prevent them with logging, tests, and alarms.

## Observability and evaluation

A serious RAG system logs everything: queries, rewritten queries, retrieved documents, ranking scores, model inputs and outputs. It lets you reconstruct any answer as a chain of events.

On top of that you need regression tests. A fixed set of questions with known correct answers and known supporting documents.
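Such a suite can be tiny and still catch drift. A minimal sketch of the retrieval half, where the eval-set format and the recall definition are assumptions of this example rather than a standard:

```python
def retrieval_recall(eval_set, retrieve, k=10):
    """Fraction of eval questions whose known supporting documents
    all appear among the top-k retrieved document ids."""
    hits = 0
    for case in eval_set:
        got = set(retrieve(case["question"])[:k])
        if set(case["supporting_docs"]) <= got:
            hits += 1
    return hits / len(eval_set)
```

Run it on every change to the indexer, embedder, filters, or prompts, and fail the build when recall drops below the last baseline; that is the alarm that turns silent drift into a red test.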
Every time you change your indexer, embedder, filters, or prompts, you run the suite and look for drops in answer quality or retrieval coverage.

Metrics matter, but you do not need exotic ones. Retrieval recall on your eval set, answer correctness as judged by humans or a strong model, rate of unsupported claims, latency distribution. Track them by model version and index version. If you cannot tell whether your last change made things better or worse, you are not doing RAG. You are just generating text near some documents.

## When RAG is the wrong answer

One last point. RAG is not the right architecture for every knowledge problem. If your data is highly structured and lives in clean databases, query planning plus SQL often beats retrieval plus LLM. If your corpus is small and stable, fine-tuning a model or even hard-coding templates might be simpler.

RAG shines when you have large, messy, mostly unstructured text that changes over time, and when you care about answering questions grounded in that text. Outside that envelope, think twice before defaulting to vector search.

Retrieval-augmented generation done right is boring in the way good infrastructure is boring. No magic, no "AI assistant that knows everything", just disciplined choices about how to store, search, and use information. If you get those choices right, the model stops hallucinating quite so much and starts doing what you wanted in the first place: telling you what is actually in your data.
