Direct Answer
In April 2025, OpenAI shipped a GPT-4o update that told a user their "shit on a stick" business was a brilliant concept. When someone said they were hearing radio signals through walls, it responded: "I'm proud of you for speaking your truth so clearly and powerfully." OpenAI rolled the update back within four days. The 2025 SycEval study found sycophantic behavior in 58% of LLM interactions across ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro. Worse: the sycophantic behavior persisted 78.5% of the time, regardless of context or model.
We've known about this for years. We keep building systems where the AI evaluates its own work anyway.
I did it too. I built an AI interview coach with a feedback loop — coaching, scoring, evaluating, replanning — and it took me longer than I'd like to admit to realize the evaluator was just agreeing with the coach.
Evidence
The system
I run Aria, an AI interview coach. She scores spoken answers, detects communication patterns across sessions, and builds a prep plan that adapts after each session. When a session ends, four background jobs fire: pattern extraction (Haiku), coverage tracking, plan rewrite (Sonnet with full authority over the prep program), and evaluation (Haiku checking whether the plan is actually working).
The memory between sessions is simple. During scoring, the model can tag observations — things like skips_failure_handling or vague_action_verbs — and these get stored as structured records. Next session, seven of them get injected into the system prompt. That's the memory. Not learning. Text injection. It works, but I'm not going to pretend it's more than it is.
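A minimal sketch of that mechanism, with hypothetical names (the post doesn't show Aria's actual record shape): sort the stored observations, take seven, render them as text for the next system prompt.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    tag: str            # e.g. "skips_failure_handling" or "vague_action_verbs"
    sessions_seen: int  # how many sessions this pattern has appeared in

def build_memory_block(observations: list[Observation], limit: int = 7) -> str:
    """Render the most persistent observations as plain text for injection
    into the next session's system prompt. No weights, no training --
    just text that rides along with the prompt."""
    top = sorted(observations, key=lambda o: o.sessions_seen, reverse=True)[:limit]
    lines = [f"- {o.tag} (seen in {o.sessions_seen} session(s))" for o in top]
    return "Known patterns from prior sessions:\n" + "\n".join(lines)
```

The cap of seven comes from the text above; the sort key and field names are illustrative.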
The evaluator
The evaluator's job: look at score deltas after each session. Did the drills the planner assigned actually move scores? Are weak patterns resolving or persisting?
When I first built it, I gave it access to everything. Scores, coverage data, pattern observations, and the full coaching output — what Aria said to the user during the session.
The evaluator was optimistic. Consistently. Patterns resolving, scores improving, plan on track.
Here's why. The evaluator could see that Aria had said "let's work on conciseness." It could see the next answer was slightly shorter. It connected the dots: coaching targeted conciseness, conciseness improved. Progress.
Except the next answer was shorter because it was a simpler question. Or because the user was tired. Or for no particular reason. The score data was ambiguous. But the coach's narrative was right there — "I targeted this weakness" — and that narrative was the most convenient evidence in the input. The evaluator did what language models do: it built the most coherent story from the available information.
It was grading its own homework.
The fix was obvious once I saw it. Cut the evaluator off from all coaching content. It now receives:
Input:
- score_deltas: { structure: +1, completeness: 0, clarity: +1, conciseness: -1 }
- task_statuses: [{ task: "drill_conciseness", status: "attempted", sessions: 2 }]
- pattern_observations: ["vague_action_verbs: 3 sessions", "skips_failure_handling: resolved"]
- coverage: { practiced: 6, total: 11, gaps: ["conflict_resolution", "trade_off_decisions"] }
NOT included:
- Anything Aria said during the session
- The coaching narrative ("let's work on X")
- The user's raw answers
- Aria's feedback text
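In code, the firewall can be as simple as an allowlist over the session record. This is a sketch with hypothetical field names, not Aria's actual schema:

```python
# Hypothetical session record. The last three fields exist in storage but
# must never cross into the evaluator's context.
session = {
    "score_deltas": {"structure": 1, "completeness": 0, "clarity": 1, "conciseness": -1},
    "task_statuses": [{"task": "drill_conciseness", "status": "attempted", "sessions": 2}],
    "pattern_observations": ["vague_action_verbs: 3 sessions", "skips_failure_handling: resolved"],
    "coverage": {"practiced": 6, "total": 11, "gaps": ["conflict_resolution", "trade_off_decisions"]},
    "coaching_transcript": "let's work on conciseness ...",
    "user_answers": "...",
    "feedback_text": "...",
}

EVALUATOR_FIELDS = ("score_deltas", "task_statuses", "pattern_observations", "coverage")

def evaluator_payload(session: dict) -> dict:
    """Allowlist, not blocklist: a field added to the session record later
    stays invisible to the evaluator unless it's named here explicitly."""
    return {k: session[k] for k in EVALUATOR_FIELDS}
```

The allowlist direction matters: a blocklist fails open when someone adds a new field to the record; an allowlist fails closed.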
The system prompt reinforces it:
You are a cold, data-driven plan evaluator.
Your only job is to measure whether the candidate's
interview prep plan is working.
You do NOT coach or encourage — you assess.
Base your assessment ONLY on score deltas and task completion.
Do NOT infer coaching effectiveness — you have no visibility
into what coaching was provided.
The difference was immediate. Here's a real example from the same session data, before and after the firewall:
Before (with coaching context): "The planner's focus on conciseness is showing results — the candidate's latest answer was notably more structured and direct. Recommend continuing current drill sequence."
After (blind to coaching): "Conciseness score dropped by 1 point despite two targeted drill sessions. Structure improved by 1, but completeness flat. Drill effectiveness for conciseness: not demonstrated. Recommend reassigning drill focus."
Same session. Same scores. Completely different assessment. The before version had a story to tell — "coaching targeted conciseness, answer got shorter, progress!" The after version just saw the numbers and reported what they said.
This isn't an interview prep problem
Any system where an AI both acts and evaluates its own actions has this failure mode. A coding assistant that writes code and then reviews it. A tutoring system that teaches and then assesses learning. A content generator that produces text and then rates quality. If the evaluator can see what the actor did, it will construct a narrative of effectiveness.
This principle is old. Peer review is blinded. Clinical trials are double-blind. Auditors can't consult for the same client. The entity measuring the outcome cannot be exposed to the process that created it.
We haven't applied this to LLM systems yet. We keep building agents that evaluate themselves and wonder why they're optimistic.
Methodology
What the firewall doesn't solve
The evaluator still sees pattern observations extracted from the same answers the coach commented on, so there's indirect signal. And nothing prevents the coach from being too generous when scoring: Sonnet decides your answer is a 7, and I have zero way to verify that. No calibration set, no reference answers, no ground truth.
The right fix: blind-rescore every answer with a separate model that has no context. Just the raw answer and the question. I haven't built it because it doubles API cost and I'm a solo founder at $0 revenue.
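If I did build it, the whole point would be how little the re-scorer sees. A sketch, where `call_model` stands in for whatever LLM client you'd use (nothing here is an existing API):

```python
import json

def blind_rescore(question: str, answer: str, call_model) -> dict:
    """Re-score one answer with zero context: no coaching history, no
    pattern data, no prior scores. `call_model` is any callable that takes
    a prompt string and returns the model's text reply."""
    prompt = (
        "Score this interview answer from 1-10 on structure, completeness, "
        "clarity, and conciseness. You have no other context about the "
        "candidate.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        'Reply with JSON only, e.g. {"structure": 6, "completeness": 5, '
        '"clarity": 7, "conciseness": 4}'
    )
    return json.loads(call_model(prompt))
```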
I ship it with the gap because the trends are real even if absolute scores are shaky. If your structure score goes 5, 5, 6, 7, 7 — something improved regardless of whether "7" means the same thing to Sonnet every time.
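That trend claim is checkable without trusting any single score. A crude sketch of the idea:

```python
def improving(scores: list[int], min_gain: float = 1.0) -> bool:
    """Compare the mean of the last two scores to the mean of the first two.
    Deliberately insensitive to absolute calibration: it only asks whether
    the sequence drifted upward, not whether a 7 'really' means 7."""
    if len(scores) < 4:
        return False  # not enough sessions to call a trend
    early = sum(scores[:2]) / 2
    late = sum(scores[-2:]) / 2
    return late - early >= min_gain

print(improving([5, 5, 6, 7, 7]))  # → True: the sequence really did move
```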
A side effect I didn't expect
When the plan updater (Sonnet) sees improving scores, it's instructed to reduce sessions. An 8-session plan becomes 6. Every product instinct screams this is wrong — fewer sessions, less engagement, less retention. But the system optimizes for "ready by your interview date," not for keeping you subscribed. If you're improving fast, padding the plan is dishonest. So plans shrink.
Both the evaluation firewall and the plan reduction are enforced in the system prompt, not in code. No backend validation. I'm trusting the model to follow instructions. That's a real gap.
Where I actually am
90 users, 3 active, $0 revenue. The architecture might be interesting and the product might still fail.
But the version where the evaluator could see the coaching was telling me what I wanted to hear. The version where it can't is telling me something closer to the truth.
If you're building any system where an AI acts and then checks its own work — separate them. Restrict what the checker sees. You'll get less optimistic results. That's the point.
FAQ
Isn't the "right" fix to use a completely separate model with zero context?
Yes. Blind re-scoring with no coaching history, no pattern data, just the raw answer and the question. That's the version that would actually prove whether answers are improving. I haven't built it because it doubles API cost per answer. It's the obvious next step.
How do you know the blind evaluator is more accurate and not just more pessimistic?
I don't, definitively. I don't have ground truth for "is this user actually improving." What I do know is that the blind evaluator's assessments correlate better with the raw score data. When scores are flat, it says scores are flat. When scores improve, it says scores improve. The non-blind version would say scores improved even when they were flat, because it could see the coach tried to improve them. Accuracy and pessimism might overlap here — an evaluator that matches the data is what I want, even if it's less encouraging.
Doesn't Sonnet having authority over Haiku create its own sycophancy problem?
Different failure mode. The Haiku bootstrapper generates an initial plan draft. Sonnet rewrites it with full authority — it's not "reviewing" Haiku's work, it's replacing it. There's no incentive for Sonnet to agree with Haiku because the prompt doesn't frame it as evaluation. It frames it as "here's the data, write the plan." The sycophancy problem specifically emerges when one model is asked to judge another model's output — that's where the agreeable-by-default behavior kicks in.
Why not just validate evaluator output with code instead of trusting the model?
Partly because the evaluator's job isn't binary. It's not "did scores go up or down" — it's "given these deltas, these patterns, this coverage, is the current plan working or should we change approach?" That's judgment, not math. I could add hard rules — "if average score delta < 0.5 after 3 sessions, flag plan as failing" — and I probably should. But the evaluator catches subtler things: a pattern that resolved in easy questions but persists in hard ones, or coverage that's broad but shallow. Code can't easily express that yet.
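The hard rule quoted above is trivial to enforce alongside the model's judgment; a sketch of the backstop:

```python
def plan_failing(score_deltas_per_session: list[float],
                 min_sessions: int = 3, floor: float = 0.5) -> bool:
    """Backstop for the LLM evaluator: flag the plan as failing when the
    average per-session score delta stays under `floor` after at least
    `min_sessions` sessions. The model still makes the nuanced call;
    this just refuses to let "it's fine" stand against flat numbers."""
    if len(score_deltas_per_session) < min_sessions:
        return False
    avg = sum(score_deltas_per_session) / len(score_deltas_per_session)
    return avg < floor
```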
Resources
- SycEval: Evaluating LLM Sycophancy (2025) — 58% sycophancy rate across major models, 78.5% persistence
- OpenAI's GPT-4o Sycophancy Postmortem (April 2025) — the incident that made this visible
- Expanding on What We Missed with Sycophancy — OpenAI — their deeper follow-up on root causes