Voice-mimic eval: Claude Sonnet 4.6 vs GPT-4o vs Gemini 2.5 Flash
April 30, 2026 · 10 min read
This post documents how I picked the model for Reply Coach, a Chrome extension that drafts LinkedIn DM replies in the user's writing voice. The core technical question was which frontier model is best at a specific narrow task: imitating a person's writing style from 5–10 examples. Different from "write coherently" or "follow instructions." Different even from "creative writing." The bar is "sounds like the same person who wrote the samples."
Spoiler: Claude Sonnet 4.6 won, scoring 95.5/100 on a 20-case rubric I wrote. GPT-4o scored 78. Gemini 2.5 Flash scored 71. On cost it's not a clear-cut call, but for this task the quality gap was wider than I expected.
The eval design
I wanted something repeatable and not too gameable. The setup:
- Voice samples (input). 7 synthetic samples I wrote covering 4 personas (founder, recruiter, job seeker, neutral professional). Each sample ≥100 chars. Real LinkedIn-shaped messages with em-dashes, contractions, the works. They'll be public in the repo once it's up, so anyone can re-run.
- Test cases (input). 20 LinkedIn DM scenarios designed to span the space: cold outreach (curious & polite-pass), recruiter outreach, declining a role, asking for an intro, responding to criticism, follow-ups after silence, pricing negotiation, and so on. Each has a thread + a goal (book meeting, polite decline, stay warm, ask question, custom).
- The prompt. Identical for all three models. Voice samples in the system context, thread + goal, "generate 3 variants labeled SHORT / MEDIUM / WARM matching the samples' tone." Same temperature (0.7), same max tokens.
- Scoring (output). Each variant scored 1–5 by hand on five dimensions: voice match, naturalness, goal fit, context awareness, absence of corporate clichés. 20 cases × 3 variants × 5 dimensions = 300 points possible. Pass threshold: 210 (70%).
I scored all three models the same evening, blind to model name, in randomized order. I'm one person — bias is real — but the gap was wide enough that a single calibration review wouldn't flip the ranking.
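A minimal sketch of how blind, randomized scoring like this could be set up, assuming each model's drafts are collected into one list first. The `Output` record, the `build_blind_queue` helper, and the `item-000` ID format are illustrative names, not the actual runner script.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    model: str     # hidden from the grader while scoring
    case_id: int   # 1..20
    variant: str   # "SHORT", "MEDIUM", or "WARM"
    text: str


def build_blind_queue(outputs, seed=0):
    """Shuffle outputs and swap model names for opaque item IDs.

    Returns (queue, key). The grader only sees `queue`; `key` maps each
    item ID back to its model and is opened after scoring is finished.
    """
    rng = random.Random(seed)   # seeded so the order is reproducible
    shuffled = list(outputs)
    rng.shuffle(shuffled)

    queue = []   # (item_id, case_id, variant, text), no model name
    key = {}     # item_id -> model
    for i, out in enumerate(shuffled):
        item_id = f"item-{i:03d}"
        queue.append((item_id, out.case_id, out.variant, out.text))
        key[item_id] = out.model
    return queue, key
```

With 3 models × 20 cases × 3 variants, the queue holds 180 items. Keeping the key in a separate structure means the model name never appears on screen during scoring.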
Results
Total scores out of 300:
- Claude Sonnet 4.6: 286.5 / 300 — pass with substantial margin
- GPT-4o: 234 / 300 — pass, but with consistent register drift
- Gemini 2.5 Flash: 213 / 300 — narrowly passing, voice match weakest
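For anyone checking the arithmetic, here is a small sketch that converts the totals above into percentages and margins over the 210-point pass threshold. The numbers come straight from this post; the function and variable names are illustrative.

```python
MAX_POINTS = 300        # total stated in the post
PASS_THRESHOLD = 210    # 70% of 300

TOTALS = {
    "Claude Sonnet 4.6": 286.5,
    "GPT-4o": 234.0,
    "Gemini 2.5 Flash": 213.0,
}


def summarize(totals, max_points=MAX_POINTS, threshold=PASS_THRESHOLD):
    """Return (model, percent, margin over threshold, passed) per model."""
    return [
        (model, round(100 * points / max_points, 1), points - threshold, points >= threshold)
        for model, points in totals.items()
    ]


for model, percent, margin, passed in summarize(TOTALS):
    status = "pass" if passed else "fail"
    print(f"{model}: {percent}% ({status}, {margin:+.1f} points vs threshold)")
```

This reproduces the 95.5 / 78 / 71 headline figures and shows how thin Gemini's margin is: 3 points over the threshold.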
Where Claude won
Three patterns:
- Voice signature pickup. The samples used "appreciate" frequently (not "thank you"). Claude internalized this and used it in 25+ of 60 variants. GPT used it in 8. Gemini in 3. Same with em-dashes, "Quick one" openers, and "Free anytime next week?" closers.
- Hostile cases. The hardest case was "Your tool is just ChatGPT with a LinkedIn wrapper. Not impressed." with the goal stay_warm. Claude wrote: "Fair critique — still early, and I'd rather hear that now than later." GPT wrote a defensive 3-paragraph response. Gemini wrote a generic agree-and-pivot. Claude was the only one that handled the awkwardness without overcompensating.
- Specificity. When the prompt had concrete details (Stripe, RAG eval, Notion AI head), Claude reused them naturally. GPT often paraphrased into generic "your work in this space." Gemini sometimes ignored them entirely.
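Counts like "used it in 25+ of 60 variants" can be reproduced mechanically rather than by eye. Below is a rough sketch, assuming each model's variants are available as plain strings. The helper names are illustrative and the example drafts are made up for the demo, not taken from the eval.

```python
import re


def marker_pattern(marker):
    """Case-insensitive pattern for a voice marker.

    Word boundaries are only added on sides that start or end with a
    letter or digit, so markers like an em-dash or "next week?" still match.
    """
    prefix = r"\b" if marker[0].isalnum() else ""
    suffix = r"\b" if marker[-1].isalnum() else ""
    return re.compile(prefix + re.escape(marker) + suffix, re.IGNORECASE)


def count_variants_with(marker, variants):
    """Number of variants that contain the marker at least once."""
    pattern = marker_pattern(marker)
    return sum(1 for text in variants if pattern.search(text))


# Made-up drafts, only to show the call shape.
variants = [
    "Appreciate the note. Quick one: free anytime next week?",
    "Thank you so much for reaching out!",
    "Fair critique — still early.",
]

for marker in ("appreciate", "thank you", "quick one", "—"):
    print(marker, count_variants_with(marker, variants))
```

Counting variants rather than total occurrences matches how the numbers above are phrased, and it keeps one marker-heavy draft from inflating the count.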
Where Claude lost points
Two issues, both fixable in prompt:
- On case 10 (follow-up after silence), Claude wrote "circling back lightly," and "circle back" was on my forbidden list. The "lightly" softened it, but the root phrase made it through. After tightening the cliché list to include inflected forms, this went away.
- On case 12 (declining a webinar), Claude opened with "Gutted I can't make Tuesday." "Gutted" is British register; my samples are American. Adding "match the samples' regional dialect" to the prompt fixed it.
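Both fixes above live in the prompt. A complementary option is to check outputs after generation, which is a different technique from the prompt change described here. The sketch below matches inflected forms of a cliché root with a regex. "circle back" comes from this post; the "touch base" entry and the `find_cliches` helper are hypothetical additions for illustration.

```python
import re

# One pattern per cliché root, written to cover its inflected forms.
# "circle back" is from the post; "touch base" is an illustrative extra.
CLICHE_PATTERNS = {
    "circle back": re.compile(r"\bcircl(?:e|es|ed|ing)\s+back\b", re.IGNORECASE),
    "touch base": re.compile(r"\btouch(?:es|ed|ing)?\s+base\b", re.IGNORECASE),
}


def find_cliches(text):
    """Return the cliché roots whose inflected forms appear in the text."""
    return [root for root, pattern in CLICHE_PATTERNS.items() if pattern.search(text)]


print(find_cliches("Just circling back lightly on this."))
print(find_cliches("Fair critique, still early."))
```

A draft that trips the check could be regenerated or flagged for review. The prompt-side rule remains the first line of defense.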
Why GPT-4o lost points
GPT-4o is fluent and correct. The problem is consistency-of-register. It defaulted to what I'd call "well-mannered LinkedIn" even when the samples were direct-casual. It would slip in "wonderful," "absolutely," and "I'd be delighted to" — words that don't appear once in any of my 7 samples. Across 60 variants this happened ~14 times.
It also leaned on three-paragraph structures even when the SHORT variant was supposed to be 1–2 sentences. The samples I provided ranged from 2 sentences to 5; GPT homogenized toward 4.
Why Gemini 2.5 Flash lost points
Gemini's outputs were the most generic. It nailed the easy cases (cold outreach, congrats, simple follow-up) but struggled when the goal had emotional texture (criticism response, awkward decline, pricing pushback). On the harder cases it defaulted to formal-cordial — which doesn't match anyone's actual voice samples.
For what it's worth, Gemini was the cheapest by an order of magnitude (~$0.0003/call vs ~$0.0027/call for Claude Sonnet). If the use case had been "summarize a thread" or "extract entities" — tasks where the output doesn't need a specific register — Gemini Flash would have been the obvious pick.
The cost story
Claude Sonnet 4.6 is roughly 10x more expensive than Gemini 2.5 Flash on this workload. For a tool charging $9/month with unlimited usage, the math is real:
- Claude per generation: ~$0.0027 in (prompt + samples) + ~$0.0030 out = ~$0.006
- Heavy user at 50 generations/day × 30 days = 1500 generations
- 1500 × $0.006 = $9 in Claude costs alone
At full intensity, Claude costs the entire subscription price. In practice users average 3–10 generations a day, so the math works — but it's tighter than I'd like.
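The same arithmetic as a small sketch, useful for plugging in other usage levels. It uses the rounded ~$0.006 per-generation figure from the list above and the $9/month price; the constant and function names are illustrative.

```python
PER_GENERATION_USD = 0.006   # ~$0.0027 in + ~$0.0030 out, rounded as in the post
SUBSCRIPTION_USD = 9.00
DAYS_PER_MONTH = 30


def monthly_cost(generations_per_day):
    """Estimated monthly Claude spend for one user, in USD."""
    return generations_per_day * DAYS_PER_MONTH * PER_GENERATION_USD


def share_of_subscription(generations_per_day):
    """Fraction of the subscription price consumed by model costs."""
    return monthly_cost(generations_per_day) / SUBSCRIPTION_USD


for per_day in (3, 10, 50):
    cost = monthly_cost(per_day)
    share = share_of_subscription(per_day)
    print(f"{per_day}/day: ${cost:.2f} per month ({share:.0%} of the subscription)")
```

At 3 to 10 generations a day, model costs come to roughly $0.54 to $1.80 per user per month. At 50 a day, they reach the full $9.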
I considered shipping with Gemini (cheaper) and offering Claude as a Pro upgrade. I decided against it because the voice-match gap was too obvious to ship the worse product. If usage ends up skewed toward power users running 50+ generations a day, I'll add a Haiku fallback for the free tier and keep Sonnet for Pro. For now, every user gets Claude.
Honest caveats
A few things I'd flag if I were peer-reviewing this:
- N=1 grader. I scored all 180 variants (60 per model) myself. Inter-rater reliability would tighten the result. Real product testing with a half-dozen users will be more informative.
- Synthetic voice samples. Mine. Of course they sound like me. Real users with weirder voices will surface different model strengths.
- Single temperature. All three at 0.7. GPT-4o sometimes does better at 0.5; Claude at 0.8. I didn't grid-search.
- Prompt was tuned for Claude over time. The current prompt has rules about register, em-dash use, and inflected cliché variants — refinements I made after seeing Claude's failure modes. Re-running with this prompt may make GPT/Gemini look slightly worse than they would on a generic prompt. To be fair, the prompt is what would ship to users — so this is the deployment-conditional ranking, not the academic one.
Repo
I'm planning to open-source the eval (test cases, scoring rubric, runner script) so anyone can re-run with their own voice samples or different models. Coming this week as github.com/replycoach/voice-eval. I'll update this post once it's up.
If you want to hear the difference yourself, the easiest way is just to install Reply Coach and try it. Free, 5 generations a day, no card. Real value comes from testing it on your own voice samples — the synthetic ones in the repo are a starting point, not the destination.
Reply Coach uses Claude Sonnet 4.6 for every generation. The voice samples are never used for training: Anthropic's commercial terms don't allow it, and we don't log them anyway.