Calibrating Scrutiny
The Review Burnout Problem
A customer success manager adopted AI enthusiastically. She used it for email drafts, meeting summaries, client reports, and process documentation. To protect herself, she reviewed everything carefully—reading each output word by word, verifying claims, checking formatting.
Six months later, she was spending more time reviewing AI outputs than she had spent writing before AI. Review fatigue set in. She started skimming outputs that should have been checked carefully. Her attention flagged. She caught fewer errors.
She had achieved the worst of both worlds: too much effort for too little protection.
The problem wasn’t review itself—review is essential. The problem was undifferentiated review. She applied the same level of scrutiny to internal meeting notes and client-facing reports. She verified trivial claims with the same rigor as regulatory statements. She burned out because she treated all AI outputs as equally important.
The solution isn’t abandoning review. It’s calibrating review—matching scrutiny intensity to actual stakes. This chapter introduces the Review Spectrum: a framework for sustainable, effective AI oversight.
The Review Paradox
Two failure modes exist with AI review, and they’re both costly.
Too Much Review Defeats the Purpose
If you review every AI output as if your career depends on catching every possible error, several things happen:
You burn out. Human attention is a finite resource. Sustained vigilance degrades over time—the “vigilance decrement” is a well-established finding in psychology. Eventually, even highly motivated reviewers lose effectiveness.
You lose efficiency benefits. If review takes as long as the original task, AI provides no productivity gain. You’ve added a step without adding value.
Quality actually suffers. Paradoxically, reviewing everything intensely means reviewing nothing well. When everything is urgent, nothing is. Your attention spreads thin instead of concentrating where it matters.
Too Little Review Creates Risk
The opposite failure mode is equally problematic:
Errors slip through. AI outputs can be subtly wrong—plausible but inaccurate, confident but mistaken. Without appropriate review, these errors reach their destination.
Accountability remains. As Chapter 13 established, AI errors carry your name. Insufficient review doesn’t reduce accountability; it just means you’ll face consequences for errors you didn’t catch.
Trust erodes. Once errors emerge from AI-assisted work, credibility suffers. Rebuilding trust takes far longer than maintaining it.
The Calibration Solution
The answer isn’t choosing between over-review and under-review. It’s calibrating review intensity to match actual stakes.
High-stakes outputs deserve deep review—they warrant the investment because the consequences of errors justify the effort. Low-stakes outputs need lighter review—they don’t warrant intensive scrutiny because errors are less consequential and more easily corrected.
When you calibrate appropriately, total effort becomes sustainable. You protect what matters most while maintaining efficiency where stakes are lower.
The Review Spectrum
Four distinct review levels exist, each appropriate for different situations.
Level 1: Scan
What it involves: A quick visual pass for obvious problems. You’re glancing at the output to confirm it looks reasonable, not examining it in detail.
Time investment: Seconds
What you’re checking:
- Basic format correct (right sections, appropriate length)
- No obvious errors (glaring mistakes, wrong names, bizarre content)
- Tone appropriate (matches context and audience)
Appropriate for: Low-stakes internal content. Well-tested workflow outputs. Routine tasks where AI reliability is established.
Example: Internal meeting summary for your own reference. AI has generated these reliably before. A quick scan confirms it looks right.
Level 2: Spot-Check
What it involves: Sample verification of key points. You’re not checking everything, but you’re verifying enough to have reasonable confidence.
Time investment: 2-5 minutes
What you’re checking:
- Random facts accurate (pick 2-3 claims and verify)
- Key claims supported (check the important assertions)
- No red flags (nothing that triggers concern)
- Logic flows reasonably (conclusion follows from evidence)
Appropriate for: Medium-stakes outputs. Moderately reliable AI tasks. Standard business communications.
Example: Client email summary. You verify that the key action items are accurate, check that dates are correct, and confirm the tone is appropriate for this client.
Level 3: Deep Review
What it involves: Comprehensive verification. You’re examining the entire output systematically, verifying claims, checking logic, and ensuring quality throughout.
Time investment: 15-30 minutes or more
What you’re checking:
- All claims verified against sources
- Logic sound throughout
- Format complete and professional
- No issues of any kind
- Would be comfortable defending every statement
Appropriate for: High-stakes outputs. Novel tasks. First uses of new workflows. Anything external-facing with significant consequences.
Example: Financial analysis for leadership. Every number verified. Every claim sourced. Every recommendation defensible. You could confidently answer detailed questions about any element of this document.
Level 4: Rewrite
What it involves: AI output serves as input or outline only. You’re not reviewing the output—you’re using it as raw material and producing your own version.
Time investment: Similar to producing from scratch
What you’re doing:
- Taking AI ideas and structure as starting point
- Rewriting in your voice with your verification
- Treating AI contribution as brainstorming input
- Accepting full ownership of the final product
Appropriate for: Highest stakes situations. Cases where AI limitations are significant. When your personal voice and judgment must dominate.
Example: Board presentation on strategic direction. AI may suggest structure and talking points, but the final product is entirely yours. Every word reflects your judgment; AI’s contribution is invisible in the final output.
Stakes Assessment
Stakes determine your baseline review level. Higher stakes demand more scrutiny; lower stakes permit lighter review.
The Stakes Matrix
Low stakes: Internal, reversible, limited audience. Examples include meeting notes for yourself, internal brainstorming documents, first drafts that will receive further editing, and process documentation for your own use.
Medium stakes: Some external visibility, correctable errors. Examples include routine client communications, standard reports, internal presentations, and team documentation.
High stakes: External visibility, difficult to correct, reputational implications. Examples include published content, financial analysis shared externally, formal proposals, and presentations to senior leadership.
Critical stakes: Regulatory implications, legal exposure, high-value decisions. Examples include SEC filings, legal documents, safety-critical content, and board materials.
Stakes Determine Minimum Review
- Low stakes → Scan is usually sufficient
- Medium stakes → Spot-check minimum
- High stakes → Deep review minimum
- Critical stakes → Deep review or rewrite
Note these are minimums. You might choose deeper review based on other factors, but you shouldn’t go lighter than stakes warrant.
Reliability Assessment
Stakes set the floor. Reliability adjusts from there. How reliable is AI for this specific task?
Task-Based Reliability
AI isn’t uniformly reliable across all tasks. Some tasks consistently produce high-quality outputs; others frequently require correction.
Generally reliable tasks:
- Summarizing provided content (AI can accurately condense what you give it)
- Drafting based on clear templates (structure is defined, AI fills in)
- Formatting and restructuring (mechanical tasks AI handles well)
- Grammar and style improvement (AI excels at polishing)
Less reliable tasks:
- Factual claims (especially recent or specific—AI may hallucinate)
- Numeric calculations (AI can make arithmetic errors)
- Creative interpretation of ambiguous requirements (AI may miss nuance)
- Synthesis of complex multi-factor situations (AI may oversimplify)
Experience-Based Reliability
Beyond general patterns, your specific experience matters:
Track your error history. Has this particular workflow produced errors before? What types of errors have you caught? How often does AI get this specific task right?
Note patterns. Some AI tasks work perfectly in your context; others consistently need correction. Your calibration should reflect your actual experience, not general assumptions.
Update continuously. AI models change. Your workflows evolve. Reliability that held last month might not hold today. Stay attentive to shifts.
Adjusting for Reliability
If a task is highly reliable based on your experience, you might review one level lighter than stakes alone would suggest. If a task has shown problems, review one level deeper.
But reliability never overrides critical stakes. A highly reliable workflow producing critical outputs still requires deep review—the consequences of rare failures are too significant.
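Read as a procedure, the rules so far fit in a few lines: stakes set a baseline level, reliability shifts it one step, and critical stakes keep a deep-review floor regardless. A minimal sketch; the numeric encoding and function name are my own, not the chapter’s:

```python
# Review levels, ordered from lightest to deepest scrutiny.
LEVELS = ["scan", "spot-check", "deep review", "rewrite"]

# Stakes set the baseline (minimum) review level.
BASELINE = {"low": 0, "medium": 1, "high": 2, "critical": 2}

def review_level(stakes: str, reliability: str) -> str:
    """Pick a review level: stakes set the floor, reliability adjusts one step."""
    level = BASELINE[stakes]
    if reliability == "high":
        level -= 1   # a proven task may step one level lighter...
    elif reliability == "low":
        level += 1   # ...and a problem-prone task one level deeper
    # Reliability never overrides critical stakes: keep the deep-review floor.
    if stakes == "critical":
        level = max(level, BASELINE["critical"])
    return LEVELS[max(0, min(level, len(LEVELS) - 1))]

print(review_level("low", "high"))       # scan
print(review_level("critical", "high"))  # deep review
print(review_level("high", "low"))       # rewrite
```

The clamp on the last line simply keeps the adjustment inside the four-level spectrum.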
The Reliability Trap
A common mistake: assuming because AI has been reliable, it will always be reliable. This is “automation complacency”—the tendency to reduce vigilance as systems prove reliable, creating vulnerability when rare failures occur.
Guard against this by:
- Maintaining minimum review levels regardless of reliability history
- Periodically increasing scrutiny to re-verify reliability
- Treating model updates or changes as reliability resets
- Never assuming perfect reliability even for proven workflows
The goal is sustainable vigilance, not false confidence.
Novelty and Context
Two additional factors affect calibration: novelty and current context.
Novelty Increases Scrutiny
New task types deserve deeper initial review. You don’t yet know AI reliability for this specific task. Deep review on early iterations builds calibration data for later.
New workflows similarly warrant caution. Even if the underlying task is familiar, a new workflow might introduce errors. Start conservative and adjust as you gain experience.
First uses after changes should trigger increased scrutiny. Model updates, prompt changes, or input variations can shift reliability. Re-verify before resuming lighter review.
Context Matters
Paradoxically, time pressure should increase scrutiny, not decrease it. When rushed, errors become more likely and their consequences worse. Fight the instinct to skim when time is short.
Your current state affects review quality. If you’re tired, distracted, or stressed, acknowledge reduced effectiveness. Either delay review or increase its intensity to compensate.
Environmental factors like interruptions degrade attention. If you can’t focus, don’t pretend you’re reviewing effectively.
Calibration in Practice
Building calibration skill takes deliberate effort. Here’s how to develop it.
Start Higher, Adjust Down
When beginning any AI task, start with more scrutiny than you think you need. Deep review on initial iterations serves two purposes:
First, it protects you during the highest-risk period—when you don’t yet know how AI handles this task.
Second, it builds calibration data. As you track what errors you catch (or don’t), you develop informed judgment about appropriate review levels.
Track Your Patterns
Keep a simple record of AI review outcomes:
- What errors did you catch?
- At what review level?
- What type of task?
- What would have happened if the error escaped?
This tracking doesn’t need to be elaborate—mental notes or a simple log suffices. The goal is building personal calibration data.
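If you prefer a file to mental notes, the log can be as small as one CSV row per review. A sketch only; the filename and column names are hypothetical choices that mirror the four questions above:

```python
import csv
import os
from datetime import date

# Columns mirror the four tracking questions; names are illustrative.
FIELDS = ["date", "task_type", "review_level", "errors_caught", "impact_if_missed"]

def log_review(path, task_type, review_level, errors_caught, impact_if_missed):
    """Append one review outcome to a simple CSV log, creating it if needed."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "date": date.today().isoformat(),
            "task_type": task_type,
            "review_level": review_level,
            "errors_caught": errors_caught,
            "impact_if_missed": impact_if_missed,
        })

log_review("review_log.csv", "client status email", "spot-check",
           1, "embarrassing but correctable")
```

A spreadsheet or notes app works just as well; the point is that each entry answers the same few questions so patterns become visible.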
Adjust Based on Experience
After sufficient experience (perhaps 10-20 iterations), you can calibrate more precisely. If deep review consistently catches nothing, step down to spot-check. If spot-check catches significant issues, step up to deep review.
Calibration is personal. It depends on your AI tools, your task types, and your quality standards. Generic guidance helps, but your experience determines final calibration.
Recalibrate When Things Change
Don’t assume calibration persists forever. Recalibrate when:
- The AI model updates (announced or noticed)
- Your workflow changes significantly
- Error patterns shift
- The stakes of your outputs change
- You return after an extended absence
A few deep reviews after any significant change protects against calibration drift.
Making Calibration Sustainable
The goal is review that’s effective and maintainable over time.
The Time Budget Approach
A useful heuristic: review time as a percentage of AI time savings.
10% review time (relative to time saved) = highly sustainable. AI provides major efficiency with minimal overhead.
50% review time = marginal efficiency. AI still helps, but review is substantial.
100% review time = no net efficiency. Reconsider whether AI is adding value.
If review consistently exceeds time savings, either reduce review level (if stakes permit) or reconsider whether AI is appropriate for this task.
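The heuristic reduces to a quick ratio check. Treating the 10%, 50%, and 100% reference points as band boundaries is my own reading, and the function name is illustrative:

```python
def review_overhead(time_saved_min: float, review_min: float) -> str:
    """Classify review overhead as a fraction of the time AI saved.

    Bands are one interpretation of the chapter's 10% / 50% / 100% markers.
    """
    ratio = review_min / time_saved_min
    if ratio <= 0.10:
        return "highly sustainable"
    if ratio <= 0.50:
        return "marginal efficiency"
    if ratio < 1.0:
        return "diminishing returns"
    return "no net efficiency"

# E.g. AI saved an hour of drafting; review took 5 minutes.
print(review_overhead(60, 5))   # highly sustainable
print(review_overhead(60, 60))  # no net efficiency
```

Run against your timing notes from a typical week, this makes the “reconsider whether AI is appropriate” threshold concrete rather than a feeling.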
Batch Similar Reviews
Review similar outputs together rather than interspersed:
Efficiency: Less context-switching means faster, more effective review.
Consistency: Reviewing similar items together reveals patterns and inconsistencies.
Natural calibration: Batching creates decision points—you can assess “how did this batch of reviews go?” and adjust.
Protect Your Review Capacity
Deep review is a finite resource. Spending it on low-stakes outputs depletes capacity for high-stakes ones. Calibration protects your ability to review deeply when it matters most.
Think of review capacity like a budget. You have a limited daily allotment of focused attention. Spending it on meeting notes means less for client deliverables. Calibration is attention budgeting—allocating your finite review resources where they create the most protection.
The Weekly Review Reset
Consider a weekly review reset ritual:
- End of week: Review your calibration decisions
- What worked? Where did you catch errors? Where did errors slip through?
- Adjust next week’s calibration based on this experience
- Reset your attention budget for the coming week
This prevents calibration drift and builds continuous improvement into your practice. Many professionals find Friday afternoon ideal for this reflection.
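One way to run that Friday reflection over whatever log you keep: group the week’s outcomes by review level and compare reviews done against errors caught. The field names below are hypothetical and should match however you record entries:

```python
from collections import defaultdict

def weekly_summary(entries):
    """Return {review_level: (reviews done, errors caught)} for a week of entries."""
    stats = defaultdict(lambda: [0, 0])
    for e in entries:
        stats[e["review_level"]][0] += 1
        stats[e["review_level"]][1] += e["errors_caught"]
    return {level: tuple(counts) for level, counts in stats.items()}

week = [
    {"review_level": "scan", "errors_caught": 0},
    {"review_level": "scan", "errors_caught": 0},
    {"review_level": "deep review", "errors_caught": 2},
]
print(weekly_summary(week))  # {'scan': (2, 0), 'deep review': (1, 2)}
```

A level that keeps catching errors is a candidate for deeper review next week; a level that catches nothing across many entries may safely step lighter, within the stakes minimums.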
Common Objections
“Won’t lighter review miss errors?”
Yes, sometimes. That’s the tradeoff. The question is whether the error rate at each scrutiny level is acceptable for those stakes. A 2% error rate on internal meeting notes is acceptable. A 2% error rate on SEC filings is not. Calibration matches acceptable error rates to review intensity.
“How do I know what level is right?”
You won’t know precisely at first. Start conservative, track what you catch, and adjust based on experience. Calibration improves over time. Perfect calibration isn’t the goal—reasonable calibration is achievable quickly.
“My stakes are always high.”
Probably not. Even in high-stakes roles, most outputs are routine. The CEO signs important documents, but they also write internal emails. Reserve deep review capacity for what truly matters; don’t exhaust yourself on trivial outputs.
“This seems complicated.”
The framework takes seconds to apply once internalized. Ask yourself: “What are the stakes?” and “How reliable is AI for this?” The answer suggests the review level. With practice, calibration becomes automatic—you’ll naturally scan meeting notes and deeply review external reports without conscious deliberation.
“What if I calibrate wrong and something slips through?”
This will happen eventually—perfect calibration doesn’t exist. The question is whether your calibration is reasonable, not perfect. If you’re reviewing appropriately for stated stakes and documented reliability, you’ve met your professional obligation. The alternative—reviewing everything at maximum intensity—isn’t sustainable and actually increases error risk through burnout and attention depletion.
Your Monday Morning Action Item
Build a personal review calibration guide:
Step 1: List 5-10 AI outputs you create regularly. Be specific: not “emails” but “client status update emails” or “internal meeting summaries.”
Step 2: For each output, assess two factors:
- What are the stakes? (Low, Medium, High, Critical)
- How reliable is AI for this specific task? (High, Medium, Low)
Step 3: Assign a review level: Scan, Spot-Check, Deep Review, or Rewrite.
Step 4: Time yourself this week. How long does each review level actually take for each output type?
Step 5: Follow your calibration for one week. Note what errors you catch at each level.
Most people discover they’ve been over-reviewing low-stakes outputs and under-reviewing high-stakes ones. Explicit calibration redistributes effort to where it matters most.
Chapter 15 shows how to build review into your workflows structurally—so calibrated review happens automatically rather than requiring constant judgment.