Houtini.
Explainer · 31 May 2026

What is RAG, and what could it do in your company?


Your AI confidently answers questions about your business with public-internet knowledge. RAG is the architecture that gets your actual contracts, customer list and operating playbook into the conversation, with citations. Here is what it is, where it sits next to long context and MCP, and what it changes if you sponsor the work this quarter.

The 2026 RAG stack: four-layer architectural diagram from query through hybrid retrieval, embedding model, vector database and re-ranker to a grounded cited answer.

Ask the standard frontier model a question about your company and it will answer with confidence. It will get the named-customer story slightly wrong, miss the policy you updated last quarter, and cite no source you can verify. That is the failure mode RAG is designed to fix - and over the last eighteen months it has become the architecture every credible enterprise AI deployment is built on. Here is what it is, where it sits next to the 1M-token context window and MCP, and what it changes if you sponsor the work this quarter.

The shape of the problem

Marina Danilevsky at IBM Research tells a useful story about her kids asking her which planet has the most moons. She answered Jupiter, with 88. She was confident. She had read about it. She also had no source to hand, and her answer was thirty years out of date. The right answer is Saturn, with 146. The model in your CEO's hand has the same two problems, applied to your business.

It will confidently answer a question about your sales pipeline with knowledge it absorbed from public internet text that does not contain your sales pipeline. It will tell a customer-service agent the wrong refund policy with the same fluency it would use to recite the correct one. And it cannot tell you why it answered the way it did, because it cannot produce the source.

For a personal conversation about moons, the cost is mild embarrassment. For an AI deployed at scale across customer service, sales, finance or compliance, the cost is the kind of incident that ends up on a risk register. The model's confidence is not the bug. The fact that the confidence is unanchored is.

What RAG is

RAG stands for Retrieval-Augmented Generation. In plain language: before the model answers, it goes and reads the relevant document, and then it answers from what the document says. The model becomes a reader and a writer instead of a guesser.

The shape of it is straightforward. A user asks a question. The system retrieves the chunks of your documents most relevant to that question. The model receives the user's question, plus the retrieved chunks, plus an instruction that says "answer the question using these chunks, and cite them, and if the chunks do not contain the answer, say so". The response now has three properties the unmodified model could not give you: it is grounded in your actual content, it is current as of whenever you last updated your content, and it can show its work.
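For readers who want to see that shape in code, here is a minimal sketch of the prompt-assembly step only. The `Chunk` type, the `build_grounded_prompt` function and the sample clause text are all hypothetical, invented for illustration; retrieval itself and the model call are left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical citation label, e.g. "MSA section 4.2"
    text: str


def build_grounded_prompt(question: str, chunks: list[Chunk]) -> str:
    """Combine the question, the retrieved chunks and the grounding instruction."""
    numbered = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered chunks below. "
        "Cite the chunk numbers you relied on. "
        "If the chunks do not contain the answer, say so.\n\n"
        f"Chunks:\n{numbered}\n\n"
        f"Question: {question}"
    )


# Illustrative data only: the clause wording is made up for this sketch.
chunks = [Chunk("MSA section 4.2", "Refunds are pro-rated within 60 days of cancellation.")]
prompt = build_grounded_prompt("What is the Acme refund policy?", chunks)
print(prompt)
```

Numbering the chunks is what makes the citation checkable: the answer can point at `[1]`, and the application can turn that back into a link to the source.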

Two-panel comparison. Left: a frontier model answering an Acme refund-clause question with an invented '30 days' policy, no source. Right: the same model with a retrieval step that fetches the actual MSA section 4.2, producing the correct 60-day pro-rated answer with a clickable source.

It is, honestly, the same discipline a careful analyst would apply to a board paper. Read the source first, then write. The novelty is that the model can do it at the speed and scale of an API.

Why the standard model is not enough on its own

The frontier models are trained on something close to the public internet plus a curated slice of code and licensed content. They are remarkable at general reasoning, language tasks and common knowledge. They are not trained on your contracts, your customer list, your supplier database, your meeting notes, your Confluence wiki, your SharePoint, or your operating playbook. Most of what makes your business specific lives in places the model has never read.

This is why "we just need to give our staff access to ChatGPT" is the most common failure mode of enterprise AI deployment that I see at the moment. The staff who try it discover within a week that the model is brilliant at general writing and unreliable on anything that involves the company's own situation. They lose confidence. The pilot stalls. The same pattern plays out on the customer-facing side: a chatbot built on the unmodified model gives plausibly worded answers that the support team has to walk back, and the experiment is shelved.

RAG is the architecture that fixes the gap. The model still does the reasoning and the writing. Your documents do the grounding.

What the 2026 RAG stack looks like

The version of RAG most people picture is the 2023 version: chunk the documents, embed each chunk as a vector, store the vectors in a database, retrieve the top-K matches by cosine similarity, paste them in front of the model. Meta's CRAG benchmark, published in mid-2024, found that adding retrieval in this straightforward way lifted accuracy to only about 44%, and that even state-of-the-art industry RAG systems answered only about 63% of questions without hallucinating. The current state of the art has moved on, and it matters because the gap between the 63% pipeline and the 90%-plus pipeline is the difference between an embarrassing demo and a production system.
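To make the 2023-style pipeline concrete, here is a toy sketch of chunk, embed, and top-K by cosine similarity. The `embed` function is a bag-of-words stand-in, not a real embedding model, and the sample document is invented. Only the overall shape carries over to a production system.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document: str, size: int = 8) -> list[str]:
    """Naive fixed-size chunking by word count."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Illustrative text only.
document = (
    "Refunds are pro-rated within 60 days of cancellation. "
    "Invoices are issued monthly in arrears. "
    "Either party may terminate with 90 days written notice."
)
index = chunk(document)
results = top_k("refunds after cancellation", index, k=1)
print(results[0])
```

Notice that the fixed-size chunker cuts straight across sentence boundaries in the second and third chunks. That is one of the ways the naive pipeline loses meaning before retrieval even starts.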

Three of the things that changed in the last 18 months are worth a CEO understanding by name:

Contextual Retrieval. Anthropic shipped this technique in September 2024. The idea is simple: before each chunk is embedded, a model reads the whole document and writes a short, chunk-specific note that situates that particular chunk in the broader picture. So a paragraph from page 47 of a contract gets stamped with something like "this is from the supplier-renewal clause of the 2024 Acme renewal where the auto-extension terms are defined" rather than floating in isolation. Anthropic's reported numbers had retrieval failure rates dropping by 49%, and by 67% once a re-ranking step was added. The technique is now widely adopted as a baseline.
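A rough sketch of why the stamp helps, under heavy assumptions: `situate` is a hypothetical placeholder for the model call that writes the note (here it just returns a fixed summary so the example runs offline), and word overlap stands in for embedding similarity. This is not Anthropic's implementation.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def situate(document_summary: str, chunk: str) -> str:
    """Hypothetical placeholder for the model-written, chunk-specific note.
    In the real technique a model reads the whole document and the chunk."""
    return f"Context: {document_summary}"


def contextualise(document_summary: str, chunk: str) -> str:
    """Prepend the note so it is embedded and indexed together with the chunk."""
    return f"{situate(document_summary, chunk)}\n{chunk}"


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared words."""
    return len(tokens(query) & tokens(text))


# Illustrative clause and summary, invented for this sketch.
raw_chunk = "The term extends automatically for twelve months unless notice is given."
summary = "supplier-renewal clause of the 2024 Acme renewal, defining auto-extension terms"
stamped = contextualise(summary, raw_chunk)

query = "Acme renewal auto-extension"
raw_score = overlap(query, raw_chunk)
stamped_score = overlap(query, stamped)
print(raw_score, stamped_score)
```

The raw chunk never mentions Acme or the word "renewal", so a query phrased that way has nothing to match against until the note is attached.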

GraphRAG. Microsoft open-sourced this in 2024. Standard RAG can answer "What does clause 4 of the Acme contract say?" because the relevant chunk is one retrieval away. It struggles with "What are the common termination patterns across all our supplier contracts?" because that question requires reasoning over the whole corpus at once. GraphRAG builds an entity-relationship graph alongside the vector database, so the model can ask the graph "list all contracts that mention Acme" and then read the chunks. It is the difference between a search box and an analyst.
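The "ask the graph, then read the chunks" idea can be sketched in a few lines. This is a toy entity index, not the API of Microsoft's GraphRAG project; the contract texts, the `KNOWN_ENTITIES` list and the function names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical mini-corpus: contract id -> text.
contracts = {
    "acme-2024": "Acme may terminate for convenience with 90 days notice.",
    "globex-2023": "Globex may terminate only for material breach.",
    "acme-2022": "Acme renewal extends automatically unless notice is given.",
}

# Assumed to come from an upstream entity-extraction step.
KNOWN_ENTITIES = ["Acme", "Globex"]


def build_entity_graph(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each entity to the set of contracts that mention it."""
    graph: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in KNOWN_ENTITIES:
            if entity.lower() in text.lower():
                graph[entity].add(doc_id)
    return graph


def contracts_mentioning(graph: dict[str, set[str]], entity: str) -> list[str]:
    return sorted(graph.get(entity, set()))


graph = build_entity_graph(contracts)
acme_ids = contracts_mentioning(graph, "Acme")  # step 1: ask the graph
acme_chunks = [contracts[i] for i in acme_ids]  # step 2: read the chunks
print(acme_ids)
```

A vector search for "Acme" would return whichever chunks happen to score highest. The graph lookup returns every contract that mentions the entity, which is what a corpus-wide question needs.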

Agentic RAG. Rather than one retrieval pass, the model runs a loop. It searches, looks at what came back, decides whether the evidence is sufficient, rewrites the query if it is not, and iterates. This is the pattern most production deployments in 2026 use, because it handles the messy real-world case where the first query is too vague or the documents are split across multiple sources.
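The loop itself is simple to sketch. In the version below, `is_sufficient` and `rewrite` are hard-coded stand-ins for decisions a model would make, and `search` is a toy word-overlap lookup, so treat this as the control flow only.

```python
def search(query: str, corpus: dict[str, str]) -> list[str]:
    """Toy retrieval: return documents sharing any word with the query."""
    q = set(query.lower().split())
    return [text for text in corpus.values() if q & set(text.lower().split())]


def is_sufficient(evidence: list[str]) -> bool:
    """Stand-in for the model judging the evidence; here, any hit counts."""
    return len(evidence) > 0


def rewrite(query: str, attempt: int) -> str:
    """Stand-in for the model rewriting a vague query; uses fixed fallbacks."""
    fallbacks = ["refund policy", "refunds cancellation"]
    return fallbacks[min(attempt, len(fallbacks) - 1)]


def agentic_retrieve(query: str, corpus: dict[str, str], max_rounds: int = 3):
    """Search, check, rewrite, repeat. Returns the evidence and the query trail."""
    trail = []
    for attempt in range(max_rounds):
        trail.append(query)
        evidence = search(query, corpus)
        if is_sufficient(evidence):
            return evidence, trail
        query = rewrite(query, attempt)
    return [], trail


# Illustrative corpus only.
corpus = {"msa": "refunds are pro-rated within 60 days of cancellation"}
evidence, trail = agentic_retrieve("money back", corpus)
print(trail)
```

The `max_rounds` cap matters in practice: each round costs a retrieval call and a model call, so an unbounded loop is a cost and latency risk.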

There is also the question of how you search. The 2026 default is "hybrid search" - combining classic keyword search with semantic vector search, and merging the results. Dense vectors alone miss the cases where the user uses a specific term that exists verbatim in the document. Keyword alone misses synonyms and paraphrases. Combining them outperforms either, by a wide margin, on every benchmark I have seen.
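One common way to do the merging is reciprocal rank fusion (RRF), which combines ranked lists using only each item's position. The sketch below assumes the keyword and vector searches have already run; the clause ids are invented, and `k=60` is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two searches for the same query.
keyword_ranking = ["clause-4", "clause-9", "clause-2"]
vector_ranking = ["clause-7", "clause-4", "clause-2"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Items that both searches agree on rise to the top, even if neither search ranked them first. RRF also sidesteps the problem that keyword scores and vector similarities sit on different scales.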

I know this is technical for a strategy article. The point is not that you need to understand the implementation. The point is that "RAG" is no longer one thing. When a vendor says they offer it, the right question is which version, and on what stack.

The honest question: do you even need RAG in 2026?

This is the bit where the conversation has shifted in the last twelve months, and where most pieces written before May 2026 are now out of date.

Claude Sonnet 4.6, Opus 4.6 and Gemini 3.1 Pro now support context windows of 1M tokens; the GPT-5 family sits at 400K. That means the model can hold a small library of documents in working memory in a single prompt. For some use cases - cross-document synthesis, comparative analysis, summarising a small enough corpus - you can simply pass the documents in and skip the retrieval layer entirely.
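A back-of-envelope check shows where the line falls. The figures here are rough assumptions, not measurements: about 500 words per page, about 130 tokens per 100 words of English, and 20% of the window held back for the question and the answer.

```python
def estimated_tokens(pages: int, words_per_page: int = 500,
                     tokens_per_100_words: int = 130) -> int:
    """Rough token estimate for a document; all ratios are assumptions."""
    return pages * words_per_page * tokens_per_100_words // 100


def fits_in_context(pages_per_doc: int, doc_count: int, context_window: int,
                    reserve_percent: int = 20) -> bool:
    """True if the documents fit after reserving room for prompt and answer."""
    budget = context_window * (100 - reserve_percent) // 100
    return estimated_tokens(pages_per_doc) * doc_count <= budget


per_doc = estimated_tokens(100)  # one 100-page document
print(per_doc)
print(fits_in_context(100, 5, 1_000_000))  # five documents, 1M-token window
print(fits_in_context(100, 5, 400_000))    # same documents, 400K-token window
```

On these assumptions, five 100-page documents fit comfortably in a 1M-token window and narrowly miss a 400K one. Real token counts vary with language, formatting and the tokenizer, so measure before relying on an estimate like this.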

At the same time, MCP - the Model Context Protocol that Anthropic open-sourced in November 2024 - has become the standard way for AI agents to talk to live systems. If you want the model to know your current inventory, your live Jira tickets, today's bank balance, you do not vectorise those - you give the model an MCP server that queries the live API. That is closer to "agent does a database lookup" than to "agent retrieves a vector chunk".

So the question in 2026 is not "should we do RAG?". It is "for each kind of question my staff or my customers might ask the model, what is the right way to ground it?". The decision is roughly:

  • RAG when the source is unstructured text that changes slowly. Policies, contracts, wikis, historical reports, training material, marketing copy.
  • MCP / tool use when the source is a live system. CRM, ERP, ticketing, finance, anything where "current" matters.
  • Long context when the corpus is small enough to fit, and you need the model to reason across the whole thing simultaneously. A handful of 100-page documents you want compared side-by-side.

Three-column decision card matrix comparing RAG, MCP and Long Context. Each card lists when to use the approach and gives concrete examples. RAG for slow-moving unstructured text (wikis, contracts, policies). MCP for live systems (CRM, inventory, today's Jira tickets). Long context for small corpora needing cross-document synthesis. Footer note: most enterprises will need all three.
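The three-way decision can be written down as a small routing rule. This is a deliberately crude sketch: the `Source` fields, the return labels and the 1M-token default are assumptions for illustration, and a real design would weigh cost, latency and freshness as well.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    is_live_system: bool      # CRM, ERP, ticketing, finance
    approx_tokens: int        # rough size of the corpus
    needs_whole_corpus: bool  # must the model reason across all of it at once?


def choose_grounding(source: Source, context_window: int = 1_000_000) -> str:
    """Pick a grounding mechanism using the rough rules in the list above."""
    if source.is_live_system:
        return "mcp"
    if source.needs_whole_corpus and source.approx_tokens <= context_window:
        return "long_context"
    return "rag"


# Hypothetical sources, sized by guesswork.
examples = [
    Source("Jira tickets", True, 0, False),
    Source("Policy wiki", False, 50_000_000, False),
    Source("Five supplier contracts", False, 300_000, True),
]
choices = {s.name: choose_grounding(s) for s in examples}
print(choices)
```

The order of the checks encodes a priority: if "current" matters, go to the live system first, whatever the size of the data.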

Most enterprises will need all three. The skill is matching the question to the right grounding mechanism, which is a design choice more than a technology one. I keep finding that the senior teams who internalise this stop having unproductive "RAG vs fine-tuning" conversations and start asking the better question, which is "what is the source of truth for this answer, and how do we connect the model to it".

The previous Houtini piece on building a small content research index for Claude makes the case that for very small corpora, the lightest-weight solution beats a heavy RAG infrastructure. Both can be true. Small-corpus questions can skip RAG; enterprise-scale questions need it.

What it costs to ship, in your company

A proper RAG deployment has four layers, and a senior buyer should know what sits in each before signing off on a build:

The vector database is where the document chunks live in their embedded form. The 2026 enterprise defaults are Pinecone for managed serverless scale, Weaviate for self-hosted, pgvector if your team wants to keep the embeddings alongside your existing Postgres, and Turbopuffer for ultra-low-latency. None of them are exotic. None of them lock you in.

The embedding model is what turns a chunk of text into a vector. OpenAI's text-embedding-3-large is the baseline. Voyage and Cohere often win in enterprise procurement because they have stronger results in legal, medical and multilingual contexts.

The retrieval layer in 2026 is hybrid by default - dense vector search plus BM25 keyword search, fused. This is not a complicated piece of engineering. It is, again, not a place to take on risk.

The re-ranker is a small model that runs after the initial retrieval and re-orders the top 50 candidates to find the best 5. Cohere's Rerank-v3 and a handful of open alternatives win here. The lift in quality from adding a re-ranker is consistently large enough that I would not ship a production RAG pipeline without one.
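The "top 50 in, best 5 out" step looks like this in outline. The `relevance` function is a toy word-overlap score standing in for a real re-ranker model, and the candidate passages are invented.

```python
import re


def relevance(query: str, passage: str) -> float:
    """Toy stand-in for a re-ranker: share of query terms found in the passage."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    p = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score every candidate against the query and keep the best few."""
    ordered = sorted(candidates, key=lambda c: relevance(query, c), reverse=True)
    return ordered[:keep]


# 50 hypothetical candidates from first-stage retrieval; the useful one is last.
candidates = [f"Filler passage number {i} about invoicing." for i in range(49)]
candidates.append("Refunds are pro-rated within 60 days of cancellation.")

best = rerank("refunds after cancellation", candidates, keep=5)
print(best[0])
```

The design point is the two-stage split: first-stage retrieval is cheap and broad, and the re-ranker spends more compute per candidate on a short list, reading the query and passage together.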

The thing that catches teams off guard is not any of the four. It is the parsing step before the documents ever reach the pipeline. PDFs with multi-column tables, scanned documents, Word files with embedded objects, contracts with footnotes - all of them break naive chunkers. The single biggest reason RAG pipelines hallucinate in production is that the documents were destroyed in the parsing step and the model is reading garbage. Get the parsing right and you have already solved most of your problem.

The second thing is access control. If your vector database contains your HR records, and an employee asks the chatbot for a colleague's salary, you need row-level access control on the vector store. This sounds obvious. It is the kind of obvious that gets overlooked when a small team ships an internal demo and "small internal demo" becomes "the thing the whole company uses". Build the access model into the architecture from week one.
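In code, the principle is to filter by permission before anything is ranked or shown to the model. The sketch below uses hypothetical group labels and a toy word-overlap match. Production vector stores usually express the same thing as a metadata filter on the query.

```python
from dataclasses import dataclass


@dataclass
class StoredChunk:
    text: str
    allowed_groups: frozenset[str]  # hypothetical access labels stored per chunk


def retrieve_for_user(query: str, store: list[StoredChunk],
                      user_groups: frozenset[str]) -> list[str]:
    """Apply the access filter first, then match only within visible chunks."""
    visible = [c for c in store if c.allowed_groups & user_groups]
    q = set(query.lower().split())
    return [c.text for c in visible if q & set(c.text.lower().split())]


# Illustrative records only.
store = [
    StoredChunk("salary for J. Doe is confidential band 7", frozenset({"hr"})),
    StoredChunk("salary review policy runs every April", frozenset({"all-staff", "hr"})),
]

staff_view = retrieve_for_user("salary policy", store, frozenset({"all-staff"}))
hr_view = retrieve_for_user("salary policy", store, frozenset({"hr"}))
print(staff_view)
```

Filtering after generation is too late: once a restricted chunk has reached the model's prompt, it can leak into the answer.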

Where it delivers, and where it does not

Gartner reckons that by 2026 over 70% of enterprise generative AI initiatives will require a structured RAG pipeline to be defensible against the EU AI Act and similar frameworks. The companies who treat it as infrastructure rather than as a magic trick are seeing real numbers.

Vanguard and Morgan Stanley are the named operators most often cited in this space. Vanguard's RAG-backed generative AI platform is embedded in the financial-advisor workflow and is used to produce customised content summaries for clients, with retrieval quality benchmarked against human-annotated relevance judgments rather than a raw API-cost metric. Morgan Stanley's knowledge-management system, built on OpenAI infrastructure with a RAG layer over more than 100,000 internal research documents, launched in March 2023, is in daily use by tens of thousands of wealth managers, and has since extended into AskResearchGPT for the Institutional Securities division. The pattern that lets both work is that the use cases are narrow - one well-defined kind of question, asked at high volume, against a well-curated source set. RAG rewards focus.

The unfocused deployments are why up to 95% of generative AI pilots, per MIT Project NANDA's "GenAI Divide" study published in July 2025, fail to deliver measurable ROI six months in. Not because the technology does not work. Because the question "what do we want this model to answer for our company" was never asked. You cannot build a useful retrieval system for an undefined corpus on behalf of an undefined user. The model is the easy part.

The way I describe this to clients is that RAG is not what you should buy. RAG is the thing you build because you have decided that a specific question, asked at specific volume, against a specific source set, is worth answering at machine speed. Get the question right and the architecture follows. Get the architecture right without the question and you have shipped infrastructure that does nothing.

What this means for your company

If you are sponsoring this work and you want a single rule that survives translation back to the operating team, here it is. The model is a brilliant reader and a brilliant writer. It has never read your company. RAG is the architecture that closes that gap, and as of 2026 it is mature enough that the engineering is not the hard part - the hard part is choosing the right question to apply it to.

That choice is a senior team decision. Which simple-but-expensive question across customer service, finance, sales, compliance or operations is your company answering badly at the moment because the right answer is locked inside documents nobody can search at speed? That is the question RAG is for. Pick one. Build it properly. Measure it against a business number, not a token count.

If you want a structured pass through the business that surfaces those questions and ranks them by where the value capture is most achievable this quarter, that is what the Houtini AI Audit is for. It is the same pattern we covered in the piece on what AI looks like from the CEO's chair, applied specifically to the data-access layer. The Audit produces a brief the senior team can sign off on, and the build engagement that follows ships the RAG pipeline against the briefed problem, with your people learning to operate it on the live system.

The technology is no longer the bottleneck. It has not been for a while. The bottleneck is the senior team deciding which question the model should be allowed to answer for the company - and then giving the small team the cover to ship it.

Educational only, not financial advice. Built for Houtini's senior buyer audience. For the practitioner build narrative on a small-corpus alternative, see RAG Without RAG: How to Make a Content Research Index for Claude.

Want a structured pass through your company's questions? Talk to us about the AI Audit.
