AI Quality Control: How to Catch What AI Gets Wrong

Felix Lenhard

Last quarter, an AI I was using confidently cited a study showing that Austrian startups have a seventy-two percent survival rate after five years. It sounded great. It was completely fabricated. No such study exists. If I had published that number without checking, my credibility with every Austrian founder who reads my blog would have taken a hit.

AI does not lie in the way humans lie. It does not intend to deceive. It generates plausible-sounding output based on patterns, and sometimes those patterns produce output that sounds factual but is not. The technical term is hallucination. The practical term is: AI makes stuff up, and it does it with complete confidence.

If you are using AI in your business, quality control is not optional. It is the system that separates professionals who use AI effectively from amateurs who publish AI errors.

The Types of AI Errors

Not all AI errors are the same. Understanding the categories helps you build targeted quality checks rather than reviewing everything with equal paranoia.

Factual hallucinations. The AI generates specific facts, statistics, or references that do not exist. “According to a 2024 McKinsey study…” when no such study exists. “The Austrian startup ecosystem has grown 340% since 2019…” when the real number is different or the claim has no source. These are the most dangerous errors because they sound authoritative.

Logical inconsistencies. The AI contradicts itself within the same document. It says “always prioritize quality over speed” in section two and “speed is more important than polish” in section five. In long documents, these contradictions can be subtle and hard to catch.

Tone drift. The AI starts in your brand voice and gradually shifts to a more generic or different tone. This happens especially in long content where the AI loses its grip on the initial voice instructions. The output is not wrong, but it does not sound like you.

Context hallucinations. The AI generates information it claims comes from context you provided, but the details are wrong or invented. You gave it client data showing revenue of EUR 50,000, and it references “the client’s EUR 75,000 revenue.” This happens when the AI interpolates rather than quoting.

Omission errors. The AI was supposed to include something and did not. A required section is missing. A key point from the brief is not addressed. An internal link was requested but not placed. These errors are easy to miss because you are checking what is there, not what is not there.

Over-confidence. The AI presents uncertain or debatable points as established fact. “The best approach is X” when there are legitimate alternatives. “Research shows Y” when the research is inconclusive. This type of error is insidious because the claim is usually plausible enough that nothing triggers your skepticism.

Each type requires a different checking approach. Factual claims need source verification. Logic needs consistency review. Tone needs comparative reading. Context needs data matching. Omissions need checklist verification. Over-confidence needs nuance checking.

The Three-Layer Review System

I use a three-layer review system that catches errors at different stages. No single layer catches everything, but together they catch virtually all significant errors.

Layer 1: AI self-review using a self-correction chain. This is not a single “review your work” instruction tacked onto the end of a prompt. It is three separate steps, each producing inspectable output:

Step 1: Generate the draft.

Step 2: A separate prompt reviews the draft against structured evaluation criteria:

<evaluation_criteria>
  <criterion name="factual_accuracy">Are all factual claims verifiable? 
  Mark anything uncertain with [VERIFY].</criterion>
  <criterion name="tone_consistency">Is the tone consistent throughout, 
  or does it drift toward generic prose in later sections?</criterion>
  <criterion name="brief_compliance">Does the output address every point 
  in the original brief?</criterion>
  <criterion name="internal_consistency">Are there any contradictions 
  between sections?</criterion>
  <criterion name="voice_match">Does this sound like [brand] or like 
  a committee?</criterion>
</evaluation_criteria>

Step 3: A third prompt refines the draft based on the review findings.
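
The shape of that refinement prompt matters less than its inputs: it needs the original draft, the review findings, and an instruction to change only what the review flagged. A minimal sketch of Step 3 (the tag names are placeholders, not a fixed schema):

<draft>
  [Full text of the Step 1 draft]
</draft>
<review_findings>
  [Full output of the Step 2 review, including every [VERIFY] flag]
</review_findings>
<task>
  Revise the draft to address each finding in the review.
  Do not rewrite sections the review did not flag.
  Leave every [VERIFY] marker in place so unresolved claims
  stay visible for human fact-checking.
</task>

Keeping the [VERIFY] markers in the refined draft means unresolved claims arrive at Layer 2 already flagged.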

Why three separate steps instead of one “generate and self-review” prompt? Because each step produces visible output. I can read the review in Step 2 and see whether the AI identified a real problem or missed it. I can read the refinement in Step 3 and see whether the fix actually worked. Bundling everything into one prompt hides the reasoning. Splitting it exposes it — and that transparency is the entire point of quality control.

This catches roughly thirty to forty percent of errors. The AI is surprisingly good at identifying its own factual uncertainties when the review criteria are explicit and structured. It will flag claims with [VERIFY] that it is not confident about. It is less good at catching tone drift or logical inconsistencies, which is why Layer 2 exists.

Layer 2: Systematic human review. This is my structured editing pass, and it follows a checklist:

  • Every factual claim verified against a source
  • All numbers and statistics confirmed
  • No internal contradictions
  • Tone consistent with brand voice throughout
  • All brief requirements addressed
  • Internal links correct and working
  • No banned words or phrases
  • Quality meets publishing standard

I print this checklist (figuratively; it is a template in my notes app) for every piece of content that will be published. Working through it systematically takes fifteen to twenty minutes per article but catches errors that a casual read-through misses.

Layer 3: Time-delayed review. I never publish AI-assisted content the same day it is edited. There is always at least an overnight gap between editing and publishing. Reading with fresh eyes catches things that in-the-moment editing misses. This is not AI-specific advice; it applies to all writing. But it is especially important with AI content because the AI’s confident tone can lull you into accepting claims that deserve scrutiny.

Together, these three layers (AI self-review, systematic human review, and time-delayed review) catch approximately ninety-five percent of significant errors. The remaining five percent are usually minor issues that readers may not even notice. And the broader question of what AI makes possible versus what it merely makes faster depends entirely on whether quality systems like these are in place.

Fact-Checking AI Output

Fact-checking is the most time-consuming quality control step and the most important. Here is my specific process.

Step 1: Flag all claims. During editing, I highlight every specific factual claim: statistics, study references, company information, historical dates, and any statement presented as fact rather than opinion. In a 2,000-word article, there are typically eight to fifteen flagged items.

Step 2: Categorize by risk. Each claim falls into one of three categories:

  • High risk: Specific statistics, named studies, company performance data. These need verification from primary sources.
  • Medium risk: General industry trends, broadly accepted best practices, commonly cited frameworks. These need a quick confirmation search.
  • Low risk: Logical statements, self-evident observations, clearly marked opinions. These need only a plausibility check.
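
If you want to speed up Steps 1 and 2, the extraction itself can be delegated to the AI, in the spirit of the [VERIFY] flags from Layer 1. A sketch of what that prompt can look like (the categories mirror the three tiers above; this is an illustration, not my exact workflow):

<article>
  [Full text of the draft]
</article>
<task>
  List every specific factual claim in the article: statistics,
  named studies, company data, dates, and definitive statements.
  For each claim, assign a risk tier: high (specific statistics,
  named studies, performance data), medium (industry trends,
  accepted best practices), or low (logical statements, clearly
  marked opinions). Do not assess whether the claims are true;
  only extract and categorize them.
</task>

The output is a starting list, not a verdict. The verification itself stays human.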

Step 3: Verify high-risk claims. For each high-risk claim, I find the primary source. Not another blog post that cites the same number. The actual study, report, or dataset. If I cannot find the source, the claim gets removed or rewritten as a general statement rather than a specific one.

Step 4: Spot-check medium-risk claims. I verify roughly half of the medium-risk claims, chosen randomly. If any fail verification, I verify all of them.

Step 5: Remove unverifiable claims. If a claim cannot be verified, it does not get published. Period. I would rather say “many Austrian startups report that…” than cite a specific percentage that might be wrong.

This process takes ten to twenty minutes per article. That is a small investment to protect years of reputation building. When I write about building businesses in Austria, the Austrian founders reading those articles will know if a statistic is wrong. Getting facts right is baseline credibility.

Building Quality Into the Process, Not Around It

The most effective quality control happens during content production, not after it. Here is how I build quality into the AI workflow rather than bolting it on as a review step.

Constrained inputs using XML structure. The more specific and structured the input, the less room the AI has to hallucinate:

<source_material>
  [All reference documents placed at TOP of prompt]
</source_material>
<constraints>
  Only reference information from the source material above. 
  Do not add statistics, studies, or data points from training data.
  Quote relevant passages from the source material before making claims.
</constraints>
<task>
  Write about FFG Basisprogramme grants, specifically the EUR 5,000-10,000 
  feasibility study funding, using the details provided.
</task>

Three techniques at work here. First, source material at the top of the prompt with the task at the bottom. With long reference documents, this ordering produces up to thirty percent better accuracy because the AI reads the material before encountering the task. Second, the explicit constraint against adding training data prevents the most common hallucination: the AI supplementing your context with fabricated supporting evidence. Third, the “quote before claiming” instruction anchors output in the actual source material.

Structured output with traceability. I request that the AI cite which input document each claim comes from. “Based on the client data provided, revenue grew 15% [Source: Q3 report].” This traceability makes verification faster because you can check the AI’s work against specific documents rather than searching broadly.
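
The request itself can be a single block appended to any generation prompt. A sketch (the tag name and the [Source: NONE] convention are illustrative, not a standard):

<output_format>
  After every factual claim, cite the input document it comes from
  in square brackets, e.g. [Source: Q3 report].
  If a claim has no source in the provided material,
  mark it [Source: NONE] instead of inventing one.
</output_format>

The [Source: NONE] marker turns silent hallucinations into visible flags: anything the AI could not trace gets removed or rewritten, consistent with Step 5 of the fact-checking process.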

Self-correction chains over one-shot generation. Instead of generating and reviewing in one pass, I use three separate steps: generate, review against structured criteria, refine. Each step produces output I can inspect. When errors occur, I know exactly which step failed. This is more effective than single-pass generation with “review your work” appended, because the review has its own dedicated context and criteria.

These process-level quality measures reduce the error rate of AI output significantly. Combined with the three-layer review system, they produce content that meets professional publication standards.

Quality Control for Different Content Types

Different content types have different risk profiles and require different quality emphasis.

Client-facing deliverables (proposals, reports, recommendations): Highest quality standard. Every claim verified. Tone checked against client-specific guidelines. Numbers triple-checked. One error in a client deliverable costs far more than the time spent checking.

Published content (blog posts, articles, social media): High quality standard. All facts verified. Voice consistent. Internal links working. Published content is permanent and represents your brand to everyone who reads it.

Internal content (meeting notes, planning documents, internal reports): Medium quality standard. Facts in key decision sections verified. Tone less critical. Accepting some imperfection is appropriate when the audience is internal and the content is temporary.

Draft and brainstorming content (idea generation, rough plans, exploration): Low quality standard. The AI is generating options, not final output. Quality checking happens when the content moves to a higher-stakes context.

Applying the same quality standard to everything is wasteful. Applying no quality standard to anything is dangerous. Calibrating your review effort to the risk level of each content type is the efficient approach.

Training Your Team on AI Quality Control

If you have team members using AI, quality control training is essential. Here is what I cover:

The trust calibration. New AI users tend toward one extreme: either trusting everything the AI produces or trusting nothing. The productive middle ground is: trust the structure and general content, verify every specific claim, and always check the tone.

The verification habit. Make it automatic. Any time an AI output includes a specific number, a named source, or a definitive statement, the team member verifies it before using it. This becomes habitual within two to three weeks if reinforced.

The “sounds right” trap. AI is extraordinarily good at producing output that sounds correct. The most dangerous AI errors are the ones that sound plausible. Train team members to be especially skeptical of claims that sound too perfect or too convenient.

The output comparison. Periodically have team members produce the same deliverable with and without AI, then compare. This builds intuition about where AI adds value and where it introduces risk.

For founders building AI-native businesses, quality control culture is a strategic investment. The reputation cost of publishing AI errors far exceeds the time cost of preventing them.

Measuring Quality Control Effectiveness

I track three metrics for my quality control system:

Error catch rate: What percentage of errors identified by the three-layer system were caught by each layer? This tells me if any layer is underperforming. Currently, AI self-review catches about thirty-five percent, human review catches about fifty-five percent, and time-delayed review catches about ten percent.

Escaped error rate: How many errors make it through all three layers and are found after publication? I track these through reader feedback and my own post-publication reviews. My current escaped error rate is approximately one significant error per twenty published articles.

Review time per piece: How long does the complete quality review take? If this number creeps up, it may indicate that the AI’s base output quality is declining or that my review process needs streamlining. Currently, the full review adds twenty to twenty-five minutes per article.

These metrics keep the quality system itself accountable and help me identify where to invest improvement efforts. For a more comprehensive look at building permanent quality infrastructure rather than per-piece review, see the guide on AI quality control systems.

Takeaways

  1. Build a three-layer review system: AI self-review, systematic human review, and time-delayed review. No single layer catches everything. Together, they catch approximately ninety-five percent of significant errors.

  2. Fact-check every specific claim before publishing. Statistics, studies, company data, and definitive statements all need verification against primary sources. Remove what you cannot verify.

  3. Build quality into the process, not around it. Constrained inputs, structured traceability, and self-correction chains reduce errors before the review stage.

  4. Calibrate review effort to content risk level. Client-facing deliverables get maximum scrutiny. Internal brainstorming documents get minimal checking. Match the effort to the stakes.

  5. Track error catch rates and escaped errors. Measuring your quality system’s performance tells you where to invest improvement efforts and provides evidence that your AI-assisted content meets professional standards.
