Direct Answer
In April 2025, OpenAI shipped a GPT-4o update that told a user their "shit on a stick" business was a brilliant concept. When someone said they were hearing radio signals through walls, it responded: "I'm proud of you for speaking your truth so clearly and powerfully." OpenAI rolled the update back within four days. The 2025 SycEval study found sycophantic behavior in 58% of LLM interactions across ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro. Worse: the sycophantic behavior persisted 78.5% of the time, regardless of context or model.
We've known about this for years. We keep building systems where the AI evaluates its own work anyway.
I did it too. I built an AI interview coach with a feedback loop — coaching, scoring, evaluating, replanning — and it took me longer than I'd like to admit to realize the evaluator was just agreeing with the coach.
Evidence
The system
I run Aria, an AI interview coach. She scores spoken answers, detects communication patterns across sessions, and builds a prep plan that adapts after each session. When a session ends, four background jobs fire: pattern extraction (Haiku), coverage tracking, plan rewrite (Sonnet with full authority over the prep program), and evaluation (Haiku checking whether the plan is actually working).
The memory between sessions is simple. During scoring, the model can tag observations — things like skips_failure_handling or vague_action_verbs — and these get stored as structured records. Next session, seven of them get injected into the system prompt. That's the memory. Not learning. Text injection. It works, but I'm not going to pretend it's more than it is.
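A minimal sketch of that mechanism, with hypothetical names (the post doesn't show Aria's actual record shape): sort the stored observations, take seven, render them as text for the next system prompt.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    tag: str            # e.g. "skips_failure_handling" or "vague_action_verbs"
    sessions_seen: int  # how many sessions this pattern has appeared in

def build_memory_block(observations: list[Observation], limit: int = 7) -> str:
    """Render the most persistent observations as plain text for injection
    into the next session's system prompt. No weights, no training --
    just text that rides along with the prompt."""
    top = sorted(observations, key=lambda o: o.sessions_seen, reverse=True)[:limit]
    lines = [f"- {o.tag} (seen in {o.sessions_seen} session(s))" for o in top]
    return "Known patterns from prior sessions:\n" + "\n".join(lines)
```

The cap of seven comes from the text above; the sort key and field names are illustrative.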
The evaluator
The evaluator's job: look at score deltas after each session. Did the drills the planner assigned actually move scores? Are weak patterns resolving or persisting?
When I first built it, I gave it access to everything. Scores, coverage data, pattern observations, and the full coaching output — what Aria said to the user during the session.
The evaluator was optimistic. Consistently. Patterns resolving, scores improving, plan on track.
Here's why. The evaluator could see that Aria had said "let's work on conciseness." It could see the next answer was slightly shorter. It connected the dots: coaching targeted conciseness, conciseness improved. Progress.
Except the next answer was shorter because it was a simpler question. Or because the user was tired. Or for no particular reason. The score data was ambiguous. But the coach's narrative was right there — "I targeted this weakness" — and that narrative was the most convenient evidence in the input. The evaluator did what language models do: it built the most coherent story from the available information.
It was grading its own homework.
The fix was obvious once I saw it. Cut the evaluator off from all coaching content. It now receives:
Input:
- score_deltas: { structure: +1, completeness: 0, clarity: +1, conciseness: -1 }
- task_statuses: [{ task: "drill_conciseness", status: "attempted", sessions: 2 }]
- pattern_observations: ["vague_action_verbs: 3 sessions", "skips_failure_handling: resolved"]
- coverage: { practiced: 6, total: 11, gaps: ["conflict_resolution", "trade_off_decisions"] }
NOT included:
- Anything Aria said during the session
- The coaching narrative ("let's work on X")
- The user's raw answers
- Aria's feedback text
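In code, the firewall can be as simple as an allowlist over the session record. This is a sketch with hypothetical field names, not Aria's actual schema:

```python
# Hypothetical session record. The last three fields exist in storage but
# must never cross into the evaluator's context.
session = {
    "score_deltas": {"structure": 1, "completeness": 0, "clarity": 1, "conciseness": -1},
    "task_statuses": [{"task": "drill_conciseness", "status": "attempted", "sessions": 2}],
    "pattern_observations": ["vague_action_verbs: 3 sessions", "skips_failure_handling: resolved"],
    "coverage": {"practiced": 6, "total": 11, "gaps": ["conflict_resolution", "trade_off_decisions"]},
    "coaching_transcript": "let's work on conciseness ...",
    "user_answers": "...",
    "feedback_text": "...",
}

EVALUATOR_FIELDS = ("score_deltas", "task_statuses", "pattern_observations", "coverage")

def evaluator_payload(session: dict) -> dict:
    """Allowlist, not blocklist: a field added to the session record later
    stays invisible to the evaluator unless it's named here explicitly."""
    return {k: session[k] for k in EVALUATOR_FIELDS}
```

The allowlist direction matters: a blocklist fails open when someone adds a new field to the record; an allowlist fails closed.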
The system prompt reinforces it:
You are a cold, data-driven plan evaluator.
Your only job is to measure whether the candidate's
interview prep plan is working.
You do NOT coach or encourage — you assess.
Base your assessment ONLY on score deltas and task completion.
Do NOT infer coaching effectiveness — you have no visibility
into what coaching was provided.
The difference was immediate. Here's a real example from the same session data, before and after the firewall:
Before (with coaching context): "The planner's focus on conciseness is showing results — the candidate's latest answer was notably more structured and direct. Recommend continuing current drill sequence."
After (blind to coaching): "Conciseness score dropped by 1 point despite two targeted drill sessions. Structure improved by 1, but completeness flat. Drill effectiveness for conciseness: not demonstrated. Recommend reassigning drill focus."
Same session. Same scores. Completely different assessment. The before version had a story to tell — "coaching targeted conciseness, answer got shorter, progress!" The after version just saw the numbers and reported what they said.
This isn't an interview prep problem
Any system where an AI both acts and evaluates its own actions has this failure mode. A coding assistant that writes code and then reviews it. A tutoring system that teaches and then assesses learning. A content generator that produces text and then rates quality. If the evaluator can see what the actor did, it will construct a narrative of effectiveness.
This principle is old. Peer review is blinded. Clinical trials are double-blind. Auditors can't consult for the same client. The entity measuring the outcome cannot be exposed to the process that created it.
We haven't applied this to LLM systems yet. We keep building agents that evaluate themselves and wonder why they're optimistic.
Methodology
What the firewall doesn't solve
The evaluator still sees pattern observations extracted from the same answers the coach commented on, so there's indirect signal. And nothing prevents the coach from being too generous when scoring: Sonnet decides your answer is a 7, and I have zero way to verify that. No calibration set, no reference answers, no ground truth.
The right fix: blind-rescore every answer with a separate model that has no context. Just the raw answer and the question. I haven't built it because it doubles API cost and I'm a solo founder at $0 revenue.
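If I did build it, the whole point would be how little the re-scorer sees. A sketch, where `call_model` stands in for whatever LLM client you'd use (nothing here is an existing API):

```python
import json

def blind_rescore(question: str, answer: str, call_model) -> dict:
    """Re-score one answer with zero context: no coaching history, no
    pattern data, no prior scores. `call_model` is any callable that takes
    a prompt string and returns the model's text reply."""
    prompt = (
        "Score this interview answer from 1-10 on structure, completeness, "
        "clarity, and conciseness. You have no other context about the "
        "candidate.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        'Reply with JSON only, e.g. {"structure": 6, "completeness": 5, '
        '"clarity": 7, "conciseness": 4}'
    )
    return json.loads(call_model(prompt))
```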
I ship it with the gap because the trends are real even if absolute scores are shaky. If your structure score goes 5, 5, 6, 7, 7 — something improved regardless of whether "7" means the same thing to Sonnet every time.
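That trend claim is checkable without trusting any single score. A crude sketch of the idea:

```python
def improving(scores: list[int], min_gain: float = 1.0) -> bool:
    """Compare the mean of the last two scores to the mean of the first two.
    Deliberately insensitive to absolute calibration: it only asks whether
    the sequence drifted upward, not whether a 7 'really' means 7."""
    if len(scores) < 4:
        return False  # not enough sessions to call a trend
    early = sum(scores[:2]) / 2
    late = sum(scores[-2:]) / 2
    return late - early >= min_gain

print(improving([5, 5, 6, 7, 7]))  # → True: the sequence really did move
```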
A side effect I didn't expect
When the plan updater (Sonnet) sees improving scores, it's instructed to reduce sessions. An 8-session plan becomes 6. Every product instinct screams this is wrong — fewer sessions, less engagement, less retention. But the system optimizes for "ready by your interview date," not for keeping you subscribed. If you're improving fast, padding the plan is dishonest. So plans shrink.
Both the evaluation firewall and the plan reduction are enforced in the system prompt, not in code. No backend validation. I'm trusting the model to follow instructions. That's a real gap.
Where I actually am
90 users, 3 active, $0 revenue. The architecture might be interesting and the product might still fail.
But the version where the evaluator could see the coaching was telling me what I wanted to hear. The version where it can't is telling me something closer to the truth.
If you're building any system where an AI acts and then checks its own work — separate them. Restrict what the checker sees. You'll get less optimistic results. That's the point.
FAQ
Isn't the "right" fix to use a completely separate model with zero context?
Yes. Blind re-scoring with no coaching history, no pattern data, just the raw answer and the question. That's the version that would actually prove whether answers are improving. I haven't built it because it doubles API cost per answer. It's the obvious next step.
How do you know the blind evaluator is more accurate and not just more pessimistic?
I don't, definitively. I don't have ground truth for "is this user actually improving." What I do know is that the blind evaluator's assessments correlate better with the raw score data. When scores are flat, it says scores are flat. When scores improve, it says scores improve. The non-blind version would say scores improved even when they were flat, because it could see the coach tried to improve them. Accuracy and pessimism might overlap here — an evaluator that matches the data is what I want, even if it's less encouraging.
Doesn't Sonnet having authority over Haiku create its own sycophancy problem?
Different failure mode. The Haiku bootstrapper generates an initial plan draft. Sonnet rewrites it with full authority — it's not "reviewing" Haiku's work, it's replacing it. There's no incentive for Sonnet to agree with Haiku because the prompt doesn't frame it as evaluation. It frames it as "here's the data, write the plan." The sycophancy problem specifically emerges when one model is asked to judge another model's output — that's where the agreeable-by-default behavior kicks in.
Why not just validate evaluator output with code instead of trusting the model?
Partly because the evaluator's job isn't binary. It's not "did scores go up or down" — it's "given these deltas, these patterns, this coverage, is the current plan working or should we change approach?" That's judgment, not math. I could add hard rules — "if average score delta < 0.5 after 3 sessions, flag plan as failing" — and I probably should. But the evaluator catches subtler things: a pattern that resolved in easy questions but persists in hard ones, or coverage that's broad but shallow. Code can't easily express that yet.
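The hard rule quoted above is trivial to enforce alongside the model's judgment; a sketch of the backstop:

```python
def plan_failing(score_deltas_per_session: list[float],
                 min_sessions: int = 3, floor: float = 0.5) -> bool:
    """Backstop for the LLM evaluator: flag the plan as failing when the
    average per-session score delta stays under `floor` after at least
    `min_sessions` sessions. The model still makes the nuanced call;
    this just refuses to let "it's fine" stand against flat numbers."""
    if len(score_deltas_per_session) < min_sessions:
        return False
    avg = sum(score_deltas_per_session) / len(score_deltas_per_session)
    return avg < floor
```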
Resources
- SycEval: Evaluating LLM Sycophancy (2025) — 58% sycophancy rate across major models, 78.5% persistence
- OpenAI's GPT-4o Sycophancy Postmortem (April 2025) — the incident that made this visible
- Expanding on What We Missed with Sycophancy — OpenAI — their deeper follow-up on root causes