I check the dashboard on a Wednesday morning. The most recent session is from a user — call her M. She'd given a behavioral answer: three minutes of voice, transcribed, scored on Structure, Completeness, Clarity, Conciseness. Aria rated her Conciseness a 4 out of 10.
The feedback quoted her own words back: "You spent 47 seconds on context before getting to the action. Try opening with the decision." Calibrated. Specific. The kind of feedback you'd expect from a hiring manager who'd done this thousands of times.
She didn't come back.
I noticed because the next session never arrived. Then the same thing happened with the user before her, and the one before that. Of the 14 users who got Aria's full four-dimension scoring on their first session, 12 completed only one session. Of the 31 people who started a session with Aria at all in the last four months, only 3 returned for a second one with a real answer.
This is the post about that. Too small to call a study. But fourteen people showed me something I think the research already knew.
What I built
Aria scores spoken interview answers on four dimensions:
- Structure — did the answer follow a clear arc (situation, action, result, or something equivalent)?
- Completeness — did you say what an evaluator actually grades?
- Clarity — was it understandable on first listen?
- Conciseness — was it tight, or did it sprawl?
Each gets a 1-10 score. The total is shown. The dimensions are where the feedback lives.
The rubric isn't invented. It's adapted from how hiring managers I'd talked to actually grade — the same four dimensions show up in every interviewer-training document I'd seen, just under different names. The bet I made was that calibration to a real evaluator was the differentiator.
Every other AI interview tool I'd tried — Yoodli, Interviewing.io, Final Round AI, the rest — praised everything. They'd say "Great job!" to answers I knew were rambly. I wanted to build the one that didn't.
The other half of the bet: feedback that quotes the user's own answer text back. Not generic ("be more concise") but specific ("you spent 47 seconds on context — try opening with the decision"). The hypothesis was that quoted-text feedback would feel less algorithmic, more like a coach.
I thought honest, calibrated, specific was the moat. The opposite of every "Great job!" tool.
It worked. And of the first 14 people who got that full treatment, 12 never came back.
What I saw in the data
A note on size before I show you anything: this is not a study. The numbers are small. Read what's below as "what I'm seeing in my own database," not "what's true about AI interview feedback." I'll show every number so you can call it what you want.
The funnel
Between 5 January and 23 April 2026 — about sixteen weeks — 31 real users started a session with Aria (test and internal accounts excluded). Of those:
- 31 tried Aria. A "try" means starting a session that recorded at least one conversation turn.
- 15 gave a substantial answer. I'm defining "substantial" as a user turn with more than 50 characters of content. This excludes "ok", "hi", "yes", and the people who opened a session and never spoke. The threshold is arbitrary — pick a different one and the number shifts.
- 14 received Aria's full 4-dimension feedback. The gap from 15 to 14 is one user who joined before formal scoring went live in mid-January.
- 3 came back for a second session with another substantial answer.
Across those sessions, 88 questions got scored, with sessions typically running about 30 minutes — long enough to get the full feedback loop and react to it.
12 of those 14 never came back.
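The funnel arithmetic can be sketched in a few lines. This is a minimal illustration using only the four counts reported above; the names `FUNNEL` and `funnel_shares` are made up for the example and are not part of Aria.

```python
# Funnel counts as reported in this post (5 January to 23 April 2026).
FUNNEL = [
    ("tried Aria", 31),
    ("gave a substantial answer", 15),
    ("received full 4-dimension feedback", 14),
    ("came back for a second session", 3),
]


def funnel_shares(stages):
    """Return (label, count, share of the first stage) for each stage."""
    top = stages[0][1]
    return [(label, count, count / top) for label, count in stages]


if __name__ == "__main__":
    # Print each stage as a count and as a share of everyone who started.
    for label, count, share in funnel_shares(FUNNEL):
        print(f"{count:>3}  {share:6.1%}  {label}")
```

Expressing every stage as a share of the 31 starters keeps the comparison on one denominator, which is easier to reason about when the stage definitions (such as the 50-character threshold) are themselves judgment calls.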
The retry asymmetry
Of the answers that were scored, 21 attempts were retries — a user got a score, then re-recorded the same answer to try to improve it. The average score change on retry, by dimension:
| Dimension | Avg change on retry |
|---|---|
| Completeness | +0.86 |
| Structure | +0.62 |
| Clarity | 0.00 |
| Conciseness | −0.05 |
Retries improve what people said. They don't improve how they said it. Conciseness actually drifts the wrong way — on the second attempt, people add more, not less.
The split is clean. Structure and Completeness are content dimensions: did you include the right pieces, in a sensible order? Clarity and Conciseness are delivery dimensions: did the words come out cleanly and tightly? Corrective feedback seems to work on the first kind and not on the second.
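To make the content-versus-delivery split concrete, here is a small sketch that averages the table's four numbers by group. The values are copied from the table; the grouping dictionary and function name are illustrative assumptions, and this averages the published per-dimension means rather than the underlying 21 rows.

```python
from statistics import mean

# Average score change on retry, copied from the table above.
RETRY_DELTA = {
    "Completeness": 0.86,
    "Structure": 0.62,
    "Clarity": 0.00,
    "Conciseness": -0.05,
}

# Grouping used in this post: content dimensions vs. delivery dimensions.
GROUPS = {
    "content": ("Structure", "Completeness"),
    "delivery": ("Clarity", "Conciseness"),
}


def group_means(deltas, groups):
    """Unweighted mean of the per-dimension averages within each group."""
    return {name: mean(deltas[d] for d in dims) for name, dims in groups.items()}


if __name__ == "__main__":
    for name, value in group_means(RETRY_DELTA, GROUPS).items():
        print(f"{name:>8}: {value:+.3f}")
```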
Caveats
- N=14 (formal scoring) / N=31 (any engagement). Small either way.
- The 21 retry attempts cluster heavily in a few users. One user did nine sessions on their own and probably contributes a meaningful share. In the 60-day follow-up I'll re-run this excluding them and report whether the asymmetry holds.
- The 50-character "substantial answer" threshold is a judgment call. The funnel ratios shift if I move it.
- 14 internal/test user IDs were excluded from the counts.
- Date range: 2026-01-05 to 2026-04-23.
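The heavy-user caveat suggests a simple sensitivity check: recompute the per-dimension averages with the most active user left out and see whether the pattern changes. The sketch below shows the shape of that check. The rows are made-up placeholders, not Aria data, and the row layout (a user ID plus a dict of per-dimension deltas) is an assumption about how `improvement_delta` might be unpacked.

```python
from collections import Counter
from statistics import mean

DIMENSIONS = ("structure", "completeness", "clarity", "conciseness")

# Made-up placeholder rows, NOT Aria data: (user_id, per-dimension retry delta).
ROWS = [
    ("u1", {"structure": 1, "completeness": 2, "clarity": 0, "conciseness": 0}),
    ("u1", {"structure": 1, "completeness": 1, "clarity": 0, "conciseness": -1}),
    ("u1", {"structure": 0, "completeness": 1, "clarity": 1, "conciseness": 0}),
    ("u2", {"structure": 1, "completeness": 0, "clarity": 0, "conciseness": 0}),
    ("u3", {"structure": 0, "completeness": 1, "clarity": -1, "conciseness": 1}),
]


def heaviest_user(rows):
    """Return the user_id that contributes the most retry rows."""
    return Counter(uid for uid, _ in rows).most_common(1)[0][0]


def mean_deltas(rows, exclude_user=None):
    """Unweighted mean delta per dimension, optionally leaving one user out."""
    kept = [delta for uid, delta in rows if uid != exclude_user]
    if not kept:
        return {}
    return {dim: mean(delta[dim] for delta in kept) for dim in DIMENSIONS}


if __name__ == "__main__":
    heavy = heaviest_user(ROWS)
    print("all rows:      ", mean_deltas(ROWS))
    print(f"without {heavy}:   ", mean_deltas(ROWS, exclude_user=heavy))
```

If the two printed lines tell the same story, the asymmetry is not an artifact of one person; if they diverge, the pooled average was mostly describing that user.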
Why this might be happening
Why might honest feedback drive users away? There's a real literature on whether corrective feedback works at all. The short version: not as often as people assume.
In 1996, Avraham Kluger and Angelo DeNisi published a meta-analysis covering 607 effect sizes from controlled feedback intervention studies, pulling together the experiments they could find on whether telling someone how they did actually made them do it better next time. The headline finding:
In about 38% of cases, feedback decreased performance (Psychological Bulletin 119(2)).
The other 62% improved on average less than people expected.
The pattern they extracted is what they called Feedback Intervention Theory: feedback works when it draws attention to the task and away from the self.
When it draws attention to the self ("you're bad at this dimension"), the receiver disengages, withdraws, or tries to defend their identity instead of changing behavior.
The closer feedback gets to the person's identity, the worse it works.
John Hattie and Helen Timperley's 2007 paper The Power of Feedback refined this further across the education literature:
Feedback effect sizes range from d=−0.05 to d=+1.13 for the same kind of intervention, depending on whether it's task-level, process-level, self-regulation-level, or person-level (Review of Educational Research 77(1)).
Same intervention, opposite effect, depending on framing.
Carol Dweck's mindset work showed the receiving end of this: when feedback feels like a judgment of who you are, you withdraw. When it feels like information about a specific thing you did, you try harder.
Try the same lens on my data. The dimensions where retries improve scores — Structure and Completeness — are about the content of an answer. They're task-level. "You forgot to mention what the outcome was" is information; the user can add it.
The dimensions where retries don't help — Clarity and Conciseness — are about how someone speaks. They're closer to the self. "You spend too long on context" doesn't tell someone how to compress their thinking — it tells them their natural way of speaking is wrong.
If the literature is right, the dimensions where my feedback fails are exactly the dimensions where corrective feedback was always going to fail. The retention drop may be the same thing one level up: a person sees four scores and reads two of them — Clarity, Conciseness — as judgments of who they are, not what they did. Then they don't come back.
Three citations and N=14 don't prove this. But the pattern is consistent enough with the literature that my next experiment is built around it.
What I'm going to try
Three changes. None guaranteed. I'll publish the numbers in sixty days.
1. Show the modeled answer before the scores. Aria already creates a "better version" of each user's answer — a short artifact rendered as an "Aria Suggestion" card. Today it appears alongside the dimension scores.
By the time a user reads "here's how this could sound," they've already absorbed "Conciseness: 4/10" and started withdrawing. The fix: artifact first, scores collapsed below. If the literature's right, the modeled answer is the actual intervention for delivery dimensions. The number was never going to do that work.
Measure: percentage of users completing 2+ sessions. Baseline 21% (3 of 14). Threshold for "this worked": above 40%.
2. Retry before showing any score. Today the user answers, sees four dimension scores, and is invited to retry. First score, on a raw first try.
New flow: answer → one qualitative suggestion ("try opening with the decision") → retry → then the scores, on the revised attempt. The first number is on a polished version, not a stumble. No fake praise, no yes-man — every feedback step is constructive, just sequenced differently.
Measure: percentage of answers where the user attempts a retry. Baseline 23.9% (21 of 88). Threshold: above 40%.
3. A 24-hour email after the first session. The dashboard is rich — that's not the problem. The problem is the outside-the-app moment that brings them back.
One email, twenty-four hours after session one: a single specific observation from their answer, a single concrete suggestion, a link to the dashboard. Then nothing else.
Measure: percentage of users who open the email and return to the dashboard within seven days. Baseline 0% (the email doesn't exist yet). Threshold: above 30%.
On 15 July 2026 — sixty days from publishing this — I'll publish a follow-up at /blog/honest-ai-feedback-60-day-update. Same queries, same caveats, whatever the numbers say. If the changes didn't move the metric, the post will say that.
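The three measures above share one shape: a baseline rate, a threshold, and a later observed rate that either clears the threshold or does not. A minimal sketch of that check follows. The baselines and thresholds are the ones stated in this post; the `Metric` class and the observed numbers in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    baseline_hits: int
    baseline_total: int
    threshold: float  # "this worked" means strictly above this rate

    @property
    def baseline_rate(self):
        """Baseline as a fraction; 0.0 when there is no baseline data yet."""
        if self.baseline_total == 0:
            return 0.0
        return self.baseline_hits / self.baseline_total

    def passed(self, hits, total):
        """True when the observed rate is strictly above the threshold."""
        return total > 0 and hits / total > self.threshold


METRICS = [
    Metric("users completing 2+ sessions", 3, 14, 0.40),
    Metric("answers with a retry attempt", 21, 88, 0.40),
    Metric("email openers returning within 7 days", 0, 0, 0.30),
]

if __name__ == "__main__":
    for m in METRICS:
        print(f"{m.name}: baseline {m.baseline_rate:.1%}, threshold {m.threshold:.0%}")
    # Hypothetical follow-up numbers, only to show the call shape.
    print(METRICS[0].passed(hits=6, total=14))
```

Writing the thresholds down as data before the follow-up makes it harder to move the goalposts once the numbers come in.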
What I'd want to test at N=100
There's a separate version of this piece I can't write yet. At N=14, every observation is "I'm seeing something consistent with the literature." At N=100, with enough engaged users to do real retention math, things become testable:
- Does the retry asymmetry survive when the one heavy user is excluded? Right now I can't say with confidence it isn't driven by one person.
- Does the retention drop concentrate in users who scored low on Clarity or Conciseness specifically? If yes, that is much stronger evidence that those two scores are the trigger, though still correlational. If no, the harshness is more diffuse: maybe anyone who got a hard truth leaves.
- Does scaffolding-style feedback on the delivery dimensions move retention? I can't answer this until I've shipped the alternative and measured against the current cohort. That's the next post.
- Are there question types where the asymmetry doesn't hold? System-design answers might behave differently from behavioral. Right now I don't have enough data per question type to know.
I'll publish a follow-up at N=100. If this post lands, that one will too — even if the answer is "we tried the fix and it didn't work." A null result on a public commitment is more useful than a hidden retreat.
FAQ
Q1. Is N=14 really enough to publish anything?
No, not for statistical claims. It's enough to publish a field note — what I'm seeing in my own database, paired with the published literature that suggests why. The post says this in several places. Read it as "what I'm seeing in my own database," not "what's true about AI interview feedback."
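One way to see why N=14 cannot carry statistical claims is to put an interval around a small-sample proportion. The sketch below uses the Wilson score interval, a standard choice for small samples; it is an illustration added here, not an analysis the post ran, and `z=1.96` assumes a roughly 95% interval.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # The 2+ session baseline stated in this post: 3 of 14.
    low, high = wilson_interval(3, 14)
    print(f"3 of 14 -> {low:.1%} to {high:.1%}")
```

Running it on the 3-of-14 baseline shows how wide the plausible range is at this sample size, which is the practical meaning of "field note, not study."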
Q2. Could the retry asymmetry just be one heavy user?
Yes, possibly. One user did nine sessions and probably contributes a meaningful share of the 21 retry attempts. The 60-day follow-up will re-run the analysis excluding that user and report whether the asymmetry holds. If it doesn't, the post will say that.
Q3. Why publish before you've fixed it?
The alternative was waiting for "enough" data, which could be six months out. Publishing now is a commitment device for the three changes listed under "What I'm going to try". The academic grounding lets the thesis stand on its own even if my data turns out to be noise. If the changes don't work, the follow-up post will say that too.
Q4. Are you sharing the raw data?
Aggregate counts, yes — the queries are documented in the Methodology section and anyone with the same schema can reproduce them. Individual scores or answer text, no. That's user data and stays private.
Q5. How does this compare to retention numbers other AI coaches have shipped?
Most AI interview coaches don't share retention numbers. The few public datapoints I could find report similar one-session drop-offs but rarely break down by feedback type or scoring calibration. If anyone has compared "calibrated honest" vs "friendly affirming" retention curves rigorously, I haven't found that work yet. Pointers welcome.
Q6. Did users know they were being studied?
Aria's terms of service permit aggregate analysis of session data; individual user content stays private. Nothing in this post identifies a user. M in the opening is a stand-in for the pattern, not a real first name.
Methodology
The numbers come from two database tables: aria_conversation_turns (created 23 December 2025) and aria_answer_scores (created 15 January 2026). All queries cover the 5 January – 23 April 2026 window and exclude 14 internal/test user IDs.
- 31 users tried Aria = distinct user_id values with at least one row in aria_conversation_turns during the window.
- 15 gave a substantial answer = users with at least one role = 'user' turn whose content is longer than 50 characters.
- 14 received full 4-dimension feedback = users with at least one row in aria_answer_scores where all four dimension scores are non-null.
- 3 came back for a second session = users with at least one substantial answer in two or more distinct aria_session_id values.
- 88 questions scored = total rows in aria_answer_scores.
- 21 retry attempts = rows in aria_answer_scores where is_retry = true.
Retry improvements use the improvement_delta JSON column on aria_answer_scores, populated by the scoring service when a retry is recorded. I report unweighted averages across the 21 rows. The 50-character "substantial answer" threshold is a judgment call: different thresholds shift the funnel ratios, as the funnel section and the caveats both note.
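The funnel definitions above can be sketched as queries. This is a hedged illustration, not Aria's actual SQL: the table names and the columns user_id, role, content, aria_session_id and is_retry come from this post, while created_at and the four score column names are assumptions. SQLite is used only so the example runs standalone, the demo rows are placeholders, and the exclusion of internal/test user IDs is left out for brevity.

```python
import sqlite3

# Assumed schema: created_at and the score column names are hypothetical.
SCHEMA = """
CREATE TABLE aria_conversation_turns (
    user_id TEXT, aria_session_id TEXT, role TEXT, content TEXT, created_at TEXT
);
CREATE TABLE aria_answer_scores (
    user_id TEXT, is_retry INTEGER,
    structure REAL, completeness REAL, clarity REAL, conciseness REAL
);
"""

# Half-open window: 2026-01-05 inclusive up to, but not including, 2026-04-24.
WINDOW = ("2026-01-05", "2026-04-24")
SUBSTANTIAL = "role = 'user' AND LENGTH(content) > 50"
IN_WINDOW = "created_at >= ? AND created_at < ?"

WINDOWED_QUERIES = {
    "tried": f"""SELECT COUNT(DISTINCT user_id) FROM aria_conversation_turns
                 WHERE {IN_WINDOW}""",
    "substantial": f"""SELECT COUNT(DISTINCT user_id) FROM aria_conversation_turns
                       WHERE {SUBSTANTIAL} AND {IN_WINDOW}""",
    "returned": f"""SELECT COUNT(*) FROM (
                        SELECT user_id FROM aria_conversation_turns
                        WHERE {SUBSTANTIAL} AND {IN_WINDOW}
                        GROUP BY user_id
                        HAVING COUNT(DISTINCT aria_session_id) >= 2)""",
}

SCORED_QUERY = """SELECT COUNT(DISTINCT user_id) FROM aria_answer_scores
                  WHERE structure IS NOT NULL AND completeness IS NOT NULL
                    AND clarity IS NOT NULL AND conciseness IS NOT NULL"""


def funnel_counts(conn, window=WINDOW):
    """Run each funnel query and return {stage name: user count}."""
    counts = {name: conn.execute(sql, window).fetchone()[0]
              for name, sql in WINDOWED_QUERIES.items()}
    counts["scored"] = conn.execute(SCORED_QUERY).fetchone()[0]
    return counts


def demo_connection():
    """In-memory database with placeholder rows (not Aria data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO aria_conversation_turns VALUES (?, ?, ?, ?, ?)",
        [
            ("u1", "s1", "user", "x" * 60, "2026-02-01"),
            ("u1", "s2", "user", "y" * 60, "2026-02-03"),
            ("u2", "s3", "user", "ok", "2026-02-05"),
            ("u3", "s4", "assistant", "z" * 80, "2026-03-01"),
            ("u4", "s5", "user", "w" * 60, "2025-12-30"),  # outside the window
        ],
    )
    conn.executemany(
        "INSERT INTO aria_answer_scores VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("u1", 0, 6, 7, 5, 4),
            ("u1", 1, 7, 8, 5, 4),
            ("u2", 0, 5, None, 5, 5),  # incomplete scoring, not counted
        ],
    )
    return conn


if __name__ == "__main__":
    print(funnel_counts(demo_connection()))
```

Keeping the "substantial" predicate in one constant means that moving the 50-character threshold changes every dependent count at once, which makes the threshold sensitivity easy to test.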
Evidence
The argument rests on the funnel and retry-asymmetry visualizations above plus three citations: two peer-reviewed papers and one book.
- Funnel chart: generated from the queries in the Methodology section. Source: /images/blog/honest-ai-feedback-14-users/funnel.svg.
- Retry-asymmetry chart: same source. Generated from the improvement_delta aggregates.
- Kluger & DeNisi (1996): Psychological Bulletin 119(2), meta-analysis of 607 effect sizes from feedback intervention studies. Key claim cited: ~38% of feedback interventions decreased performance.
- Hattie & Timperley (2007): Review of Educational Research 77(1). Key claim cited: feedback effect sizes range d=−0.05 to d=+1.13 depending on whether the intervention is task-level, process-level, self-regulation-level, or person-level.
- Dweck (2006): Mindset: The New Psychology of Success (Random House). Key claim cited: corrective feedback without process scaffolding triggers fixed-mindset withdrawal.
Aggregate counts can be reproduced by running the queries in the Methodology section against any Aria instance with the same schema.
Sources
- Kluger & DeNisi 1996 — The Effects of Feedback Interventions on Performance
- Hattie & Timperley 2007 — The Power of Feedback
- Dweck, C. S. (2006). Mindset: The New Psychology of Success. Random House.