Houtini.
Explainer ·31 May 2026

AI hallucination, and the boring discipline that stops it being a problem


You have read about lawyers citing made-up cases and chatbots inventing refund policies. The fix is not a new platform or a clever prompt. It is the most boring discipline in software: read what came out and check the bits that matter against a second source. Here is what hallucination is, how we run a Houtini-grade check on every claim, and what would change if your team did the same.

Figure: the boring fix for AI hallucination, shown as a two-step flow. Every model claim is read first, then any load-bearing claim is cross-checked against Brave web search and Gemini grounded search before it ships.

You have read the stories. Lawyers filing briefs that cited made-up case law. Chatbots inventing refund policies their company never had. The LinkedIn thought-leaders calling the technology "fundamentally unreliable" and the CEOs nodding along because the word "hallucination" sounds like a defect that ought to be fixable. At Houtini we barely ever ship a hallucination into client work, and the reason is more boring than you would hope. There is no platform, no clever prompt, no detection layer. There is a discipline. Here is what hallucination is, the two-step check we run on every claim that matters, and what would change in your company if your team ran the same check on theirs.

What hallucination is, mechanically

A large language model does not have a fact store. It has a probability distribution over what word should come next given the words that came before. When the model writes "Saturn has 146 moons", it is not retrieving a number from a table. It is predicting the most likely token sequence given the question. When the model is right, that prediction matches reality because reality was in the training data and the pattern is well-rehearsed. When the model is wrong, the prediction is still confidently fluent because fluency and truth are produced by different machinery.

That is the whole mechanism. The model is not lying. The model has no internal signal for the difference between "I know this" and "I am completing the sentence in a way that sounds right". Every output it produces feels equally confident to it because the confidence is in the fluency, not in the truth.
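
A toy sketch of that mechanism may help. The scores below are invented for illustration and do not come from any real model; the point is only that the calculation picks whichever continuation scores highest and never consults anything resembling a fact table.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical scores for the token that follows "Saturn has ... moons".
# These numbers are made up; a real model scores tens of thousands of tokens.
toy_logits = {"146": 2.1, "82": 1.9, "62": 1.2, "many": 0.3}

probs = softmax(toy_logits)

# The "answer" is simply the highest-probability continuation.
# Nothing in this step checks whether that continuation is true.
next_token = max(probs, key=probs.get)

print("chosen token:", next_token)
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>5}  {p:.2f}")
```

Change the invented scores and a different number comes out with exactly the same fluency. That is why the fix has to live outside the model.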

Knowing this changes the kind of fix you look for. You stop looking for a feature that makes the model "stop hallucinating". You start looking for a process that catches the wrong outputs before they ship.

The four shapes it takes

The CEOs I work with find the conversation easier when we name what we are looking at. Most "hallucination" reports fall into one of four shapes:

Factual. The model states a number, date, name or claim that is wrong. "Sonnet 4.5 supports a 1M-token context window" when 4.5 is 200K. This is the textbook case and the one the catch-it-early process handles best.

Attribution. The model cites a source that does not exist, or attributes a real quote to the wrong person. "According to the 2024 Gartner report on retrieval pipelines..." when no such report was issued. The model has produced the most likely-sounding citation, not the real one.

Capability. The model agrees that it can do something it cannot. "Yes, I can read your CRM data directly." It cannot. This one bites teams that take the model's claim about its own behaviour at face value.

Agreement. The model accepts a flawed premise in the user's question and produces a fluent answer inside that flawed frame. "Given the well-known correlation between X and Y, can you brief me on..." - the model proceeds without pushing back on whether X and Y correlate at all.

Each shape needs a different muscle. The first two are caught by the check we are about to describe. The third needs scepticism towards the model's claims about itself. The fourth needs not leading with your conclusion when you ask.

The honest read: we barely hit it. Here is the unimpressive method.

I use the word "eval" sometimes when I am talking to other people who work in this. The word is doing a lot of social work. In the AI-engineering world it means automated test suites that score model outputs against reference behaviour - RAGAS, Braintrust, the lm-eval-harness, a whole stack of frameworks. At Houtini we run those when the work justifies them, which is roughly never on the editorial side and sometimes on the agent-shipping side.

When I say "eval" in a Houtini context I mostly mean: read what came out and check the bits that matter. That is the whole method. The reason the method works is that it is paired with a second step that takes thirty seconds. Together they catch ninety-something percent of what the model gets wrong.

The two steps

Step one - read it like a careful analyst would. Before any AI-produced content moves out of draft, somebody reads it on the page. The questions you ask while reading are not exotic:

  • Does this claim cite anything? If yes, is the citation a real thing that exists?
  • Is the number current? Models trained six months ago cite figures eighteen months old.
  • Does the model's confidence match what I already know to be true? A fluent paragraph about something I have no knowledge of is the most dangerous shape an output can take.
  • Is this what I would have written if I had to argue it? If the model wrote something I would not say, the model has either taught me something or made something up. Both are worth a check.

This step is unromantic. It is also the step the industry keeps trying to skip with promises of automated detection layers. The catch is that automated layers cannot tell you whether the output is true; they can only tell you whether it pattern-matches things flagged as untrue. That is not the same problem.
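
If it helps the reader decide where to slow down, a few lines of script can mark the sentences that contain a number or a citation cue. This is a minimal sketch, not something the method depends on: the cue patterns and the function name are hypothetical, and the script only points at sentences. It says nothing about whether they are true, which is still the reader's job.

```python
import re

# Hypothetical cue patterns; tune them to your own house style.
CUES = {
    "number": re.compile(r"\d"),
    "citation": re.compile(r"\b(according to|reported by|study|report)\b", re.I),
}


def flag_for_reading(draft):
    """Return (sentence, cues) pairs for sentences a reader should slow down on."""
    sentences = re.split(r"(?<=[.!?])\s+", draft.strip())
    flagged = []
    for sentence in sentences:
        hits = [name for name, pattern in CUES.items() if pattern.search(sentence)]
        if hits:
            flagged.append((sentence, hits))
    return flagged


# Invented draft text, for illustration only.
draft = (
    "Retrieval helps ground answers. "
    "According to the vendor report, adoption doubled. "
    "The model supports a 200K context window."
)

flagged = flag_for_reading(draft)
for sentence, hits in flagged:
    print(hits, sentence)
```

A sentence with no number and no citation cue can still be wrong, so this narrows attention; it does not replace reading the whole piece.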

Step two - for any claim that carries weight, run it through two grounded systems and see if they agree. The two we use most at Houtini are Brave Search (for "is this thing even real?") and Gemini grounded search (for "and is this thing the way the model just described it?"). Either one alone is useful. The two together produce a consensus signal that is the closest thing the field has to a reliable truth check.

If both grounded systems return the same answer with sources, you have a verified claim. Ship it.

If they disagree, you have a question. Dig until you know which one is right.

If they both return the same wrong answer, you have a shared misconception in the public web - rare but real. This is the limit of the method, and we will come back to it.

This is what I mean when I say "consensus search". The engineering vocabulary makes it sound impressive. It is also exactly what a competent journalist does before filing.
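
The decision rule above can be written down as a short sketch. The type and function names here are hypothetical, the answers are ones a person has already read off each system and normalised by hand, and the "no sources" outcome is our assumption about what "the same answer with sources" implies when sources are missing. Note what the code cannot do: it has no way to notice the third case, where both systems agree on the same wrong answer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Verdict(Enum):
    VERIFIED = "same answer, with sources: ship it"
    DISAGREE = "sources disagree: dig until you know which is right"
    UNSOURCED = "an answer came back without sources: treat as unverified"


@dataclass(frozen=True)
class GroundedAnswer:
    system: str               # label for the grounded system that was asked
    answer: str               # the answer as recorded by the person checking
    sources: Tuple[str, ...]  # links or titles the system returned


def consensus(a, b):
    """Apply the two-system rule to a pair of recorded answers."""
    if not a.sources or not b.sources:
        return Verdict.UNSOURCED
    if a.answer.strip().lower() == b.answer.strip().lower():
        return Verdict.VERIFIED
    return Verdict.DISAGREE


# Placeholder sources; the answers are illustrative values typed in by hand.
first = GroundedAnswer("search", "400K", ("https://example.com/a",))
second = GroundedAnswer("grounded-llm", "400k", ("https://example.com/b",))
conflicting = GroundedAnswer("grounded-llm", "1M", ("https://example.com/c",))

print(consensus(first, second).value)
print(consensus(first, conflicting).value)
```

Comparing normalised strings is deliberately crude. In practice the person doing the check decides whether two answers agree; the sketch only records what happens next.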

The worked example: this article's older sibling caught three errors

The piece we shipped earlier today about what RAG is and what it could do in your company had a fact-check pass on the named claims before it went live. The Houtini-grade check caught three things the confident model had written into the draft:

The first draft said Claude Sonnet 4.5 supports a 1M-token context window. Gemini grounded - primed with my question and instructed to ground every answer - pointed me at the Anthropic release notes. Sonnet 4.5 supports 200K, expanded to 400K on the API. The model with the 1M context is Sonnet 4.6. The piece shipped corrected.

The same draft said OpenAI's GPT-5 family supports 1M tokens. Brave Search returned a 400K maximum across the released GPT-5 sizes. Gemini grounded agreed. Two confident sources, one verdict. The piece shipped with the right number.

The same draft described Anthropic's Contextual Retrieval as "prepending a document-level summary to every chunk before embedding". Gemini grounded came back with the Anthropic engineering post and the implementation detail: the technique prepends chunk-specific context that situates each chunk in the broader document, not a document-level summary. Different mechanism, different reason it works. The piece shipped correctly described.

That is three real errors caught on one short piece. Each one would have been published as fluent, confident, plausible prose. Each one would have damaged the brand. The discipline is unromantic and it does the job.

What this could be in your company

You will know whether your team needs this method by whether it has ever been bitten. If a colleague has shipped AI-produced work that turned out to contain a fluent untruth, your company has a hallucination problem. The fix is not a tool you buy. It is a habit you install.

The two questions to embed in any workflow that uses an LLM to produce something that matters - a customer email, a regulatory filing, a slide for the board, a code commit:

  • Did anyone read this output before it shipped?
  • For the claims in it that carry weight, did anyone check those claims against a second source?

Most teams answer "no" to the second question and assume the first one is enough. It is not. A careful reading catches the obvious things and misses everything the reader does not happen to know. The second check is what catches what the reader does not know they do not know.
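
One way to make the two questions hard to skip is to record the answers next to the work itself. The sketch below is an assumption about how a team might do that, with hypothetical field names, not a description of any particular tool: a piece is blocked until someone is named as the reader and every load-bearing claim has a second source written against it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Claim:
    text: str
    second_source: Optional[str] = None  # where the claim was checked, if anywhere


@dataclass
class ReviewRecord:
    read_by: Optional[str] = None        # who read the output before it shipped
    load_bearing_claims: List[Claim] = field(default_factory=list)


def blockers(record):
    """List the reasons this piece of work should not ship yet."""
    problems = []
    if not record.read_by:
        problems.append("nobody has read the output")
    for claim in record.load_bearing_claims:
        if not claim.second_source:
            problems.append(f"unchecked claim: {claim.text}")
    return problems


# Illustrative record: read by someone, but one claim still unchecked.
record = ReviewRecord(
    read_by="editor",
    load_bearing_claims=[
        Claim("context window is 200K", second_source="vendor release notes"),
        Claim("adoption doubled last year"),
    ],
)

for problem in blockers(record):
    print(problem)
```

An empty list means both questions have been answered "yes". Deciding which claims count as load-bearing is still a human judgement.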

You do not need a vendor for any of this. You need a small amount of time built into the workflow and a couple of grounded-search tools available at the point of writing. Most teams already have them - the gap is procedural, not technological.

Where the method falls down

The check is unromantic, and it is not perfect either. Three failure modes worth a CEO knowing:

Shared misconceptions. When the public web is wrong, the grounded sources will be wrong together. The method gives you consensus, not truth. Where the stakes are very high - medical, legal, regulatory - consensus should be the lower bar, with the higher bar being a human with subject-matter authority who reads the work. The grounded systems get you 95% of the way there. The last 5% needs a person.

Recency. Brave and Gemini are both fast at indexing, but a claim about a product launched last week may sit in the gap. Where the claim is recency-sensitive, go to the primary source - the company's announcement page, the regulator's filing, the GitHub release.

Anchoring. If you frame your query so it leads the answer, both grounded systems will lead you back the same way. "Why does X cause Y" gets you a confident answer even when X does not cause Y. Phrase the question neutrally and the consensus is meaningful. Phrase it with your conclusion baked in and the consensus is theatre.

Figure: four-quadrant taxonomy of hallucination types: factual (wrong number or claim), attribution (made-up citation), capability (model overstating its own abilities), and agreement (model accepting the user's flawed premise).

The single rule

If you take one thing from this piece, take this. The model is not lying when it hallucinates. The model has no concept of lying. The fix is not a feature of the model. The fix is a habit of the team that ships the work. Read the output. Check the load-bearing claims against a second grounded source. Repeat on every piece of work that matters.

That is the entire discipline. It is not impressive. It also works.

If you want a structured pass through the workflows in your company that already use LLM output - what is being checked, what is not, where the catch-it-before-it-ships habit is missing - that is what the Houtini AI Audit is for. The Audit produces a brief the senior team can sign off on, and the engagement that follows installs the discipline alongside whatever tools you choose to ship around it.

For the conceptual companion piece on the architecture that grounds AI outputs in your own documents, see what RAG is and what it could do in your company. For the executive framing this sits inside, see AI for the managing director and the CEO.

Educational only, not financial or legal advice. Tested in the Houtini studio. Built for senior teams who would rather install a habit than buy a platform.
