Mistral vs. Claude on our onboarding: 4× faster, 30% cheaper
July 2, 2026 · 4 min read
We put Mistral Medium 3.5 up against Claude Sonnet 4.6, the model behind our onboarding agent, on our own workload. Mistral came out about 4× faster and 30% cheaper, at near-equal quality. For this particular job it turned out to be a surprisingly good fit, for reasons specific to how onboarding actually runs.
We're based in Malmö, and we've always wanted an all-European tech stack. It's part of why we run on Hetzner. But the drive to build an awesome product is equally strong, so where a European option can't yet carry the job, we'll take the best model available and make sure it runs under EU legislation. Many of our potential customers won't even consider us if we don't handle their data under EU rules.
But with AI, most of the frontier is American, and our early investigations landed us on Anthropic's Claude for most of the models in our stack. It has taken us to where we are today with extreme speed, but now it's time to slow down and see how we can reach that all-European goal.
The first candidate
The first conversation you have when you hire a Squidler specialist is driven by an onboarding agent that works out what you actually need. As it runs fairly isolated from the rest of the system, it feels like a good candidate to start with.
Onboarding runs on Claude Sonnet 4.6 today, as it strikes a good balance between quality, speed, and price for the task. Initially we also evaluated Haiku 4.5, Anthropic's cheaper tier, but didn't get the quality we wanted. To find a European model that could stand next to it, we pulled the published benchmarks (Artificial Analysis for capability and speed, the vendors for price). One candidate stood out.
| Metric | Mistral Medium 3.5 🇫🇷 | Claude Sonnet 4.6 🇺🇸 | Claude Haiku 4.5 🇺🇸 |
|---|---|---|---|
| Intelligence Index | 30 | 36 | 24 |
| Output speed | 92 tok/s | 44.8 tok/s | 92 tok/s |
| Time to first token | 2.06s | 1.26s | 0.93s |
| Price (in / out per 1M) | $1.50 / $7.50 | $3 / $15 | $1 / $5 |
Numbers as of July 2, 2026, for each vendor's own API. They drift as new benchmark runs land.
Mistral Medium 3.5 sits six points behind Sonnet on intelligence, at half the price. Its time to first token, the wait before anything shows on screen, is about 0.8s slower than Sonnet's, but its output speed matches Haiku's and is twice Sonnet's, so it should finish a complete reply sooner. On paper, Mistral is the obvious one to test: built in France, and served through Mistral's own API with data hosted in the EU. The all-European option, if the quality holds.
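For a feel of how those spec-sheet numbers trade off, here is a rough back-of-envelope sketch. It assumes a reply takes time to first token plus output tokens divided by output speed, which ignores network overhead and load. The function names are ours for illustration, and the 500-token reply length is an arbitrary example, not a measurement.

```python
# Back-of-envelope reply-time estimate from the published spec numbers.
# Assumption: total time = time to first token + output tokens / output speed.

SPECS = {
    "mistral-medium-3.5": {"ttft_s": 2.06, "tok_per_s": 92.0},
    "claude-sonnet-4.6": {"ttft_s": 1.26, "tok_per_s": 44.8},
}


def estimated_reply_seconds(model: str, output_tokens: int) -> float:
    """Estimated seconds until a reply of the given length is complete."""
    spec = SPECS[model]
    return spec["ttft_s"] + output_tokens / spec["tok_per_s"]


def break_even_tokens(slow_start: str, fast_start: str) -> float:
    """Reply length at which the faster generator makes up its slower start."""
    a, b = SPECS[slow_start], SPECS[fast_start]
    ttft_gap = a["ttft_s"] - b["ttft_s"]
    per_token_gain = 1 / b["tok_per_s"] - 1 / a["tok_per_s"]
    return ttft_gap / per_token_gain


if __name__ == "__main__":
    for name in SPECS:
        seconds = estimated_reply_seconds(name, 500)
        print(f"{name}: {seconds:.1f}s for a 500-token reply")
    tokens = break_even_tokens("mistral-medium-3.5", "claude-sonnet-4.6")
    print(f"break-even reply length: {tokens:.0f} tokens")
```

Under these assumptions, any reply longer than roughly 70 tokens completes sooner on Mistral despite the slower start.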
The test
We don't judge models on vibes. Onboarding has its own eval suite: 17 scripted conversation fixtures, each run three times, scored with promptfoo on the things that matter. Does the agent surface a concrete problem, hand off cleanly, handle multiple languages, and stay grounded instead of making things up?
| Result | Mistral Medium 3.5 🇫🇷 | Claude Sonnet 4.6 🇺🇸 | Claude Haiku 4.5 🇺🇸 |
|---|---|---|---|
| Quality (checks passed) | 46 / 51 (90%) | 48 / 51 (94%) | 39 / 51 (76%) |
| Speed (median per turn) | ~1.9s | ~7.6s | ~1.9s |
| Cost (per session) | ~$0.044 | ~$0.063 | ~$0.035 |
Each check is one fixture run, graded pass/fail as a whole. The subjective checks are graded by a Claude model, a bias that favors the incumbent if anything. A two-check gap on 51 runs is within run-to-run noise. Speed and cost come from replayed eight-turn sessions for each model.
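One rough way to sanity-check the noise claim is a two-proportion z-test on the pass counts from the table. This is a sketch, not the method we used to reach that conclusion: it treats the 51 runs as independent, which they are not (17 fixtures, three repeats each), so read the result as indicative only. The function name is ours.

```python
import math


def two_proportion_z(passes_a: int, passes_b: int, runs: int) -> float:
    """z-statistic for the difference between two pass rates over equal run counts."""
    rate_a = passes_a / runs
    rate_b = passes_b / runs
    pooled = (passes_a + passes_b) / (2 * runs)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / runs))
    return (rate_b - rate_a) / std_err


if __name__ == "__main__":
    runs = 51  # 17 fixtures x 3 repeats
    # Assumption: runs are treated as independent samples.
    print(f"Mistral vs Sonnet: z = {two_proportion_z(46, 48, runs):.2f}")
    print(f"Haiku vs Sonnet:   z = {two_proportion_z(39, 48, runs):.2f}")
```

The Mistral-Sonnet gap gives a z of about 0.7, well under the usual 1.96 threshold, while the Haiku-Sonnet gap lands around 2.5.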
On quality, Mistral Medium 3.5 came out almost as good as Sonnet in our tests, in line with the published numbers. The surprise is speed: about 4× as quick per turn, double what the spec sheet promised. Pulling the raw replies from the eval explains it: once the conversation gets going, Mistral averaged roughly 170 output tokens per turn where Sonnet used about 500. Same job, a third of the words. As in the example below:
Each model's next reply at the same point in the same conversation, Sonnet's first, then Mistral's:
That's a real problem — missing something important because it's buried is stressful. Is the main pain that you're missing things you need to act on (reply, follow up, do something), or more that you can't find information when you need it later?
That sounds like a real pain. Do you use Gmail or another email service?
Some readers will prefer Sonnet's fuller reply here, and that's fair. Both pass the same checks.
Worth noting: the per-turn figure is total time, and measured latency is not the same as felt latency. What you feel first is time to first token. The spec sheet has Mistral slower to the first word (2.06s against Sonnet's 1.26s), but we never felt it: a median Mistral turn finished in about 1.9s, less than the spec sheet's quoted time to first token. Take published latency numbers as a starting point, not an answer.
Cost is not as simple as reading the price list. Two other factors weigh in: caching and output tokens. Both Mistral and Anthropic price cached input at a tenth of the normal rate. The catch is that a cache only helps if it hits. Claude's has been outstanding at this in our experience, while Mistral's, best-effort by its own docs, never returned a single cached token when we tested it directly. For onboarding, the miss doesn't flip the result: caching only discounts input, Sonnet's bill is mostly its long replies, and Mistral lands about 30% cheaper per session even while paying full price for every input token. If Mistral's cache ever starts delivering, that number only improves.
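The shape of that calculation can be sketched as below. The cost function is a simplified model of our own, assuming a flat per-token price and a single cache hit rate per session. The 10,000 input and 1,000 output tokens in the example are made-up round numbers to show the mechanics, not figures from our sessions. Only the two per-session costs come from the results table.

```python
def session_cost_usd(
    input_tokens: int,
    output_tokens: int,
    in_price_per_m: float,
    out_price_per_m: float,
    cache_hit_rate: float = 0.0,
    cached_discount: float = 0.1,
) -> float:
    """Session cost in USD; cached input is billed at a fraction of the input price."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * cached_discount) * in_price_per_m / 1_000_000
    output_cost = output_tokens * out_price_per_m / 1_000_000
    return input_cost + output_cost


# Measured per-session costs from the results table.
MISTRAL_SESSION_USD = 0.044
SONNET_SESSION_USD = 0.063
measured_saving = 1 - MISTRAL_SESSION_USD / SONNET_SESSION_USD

if __name__ == "__main__":
    print(f"measured saving: {measured_saving:.0%}")

    # Hypothetical token counts at Sonnet's list prices ($3 in / $15 out per 1M).
    no_cache = session_cost_usd(10_000, 1_000, 3.0, 15.0)
    full_cache = session_cost_usd(10_000, 1_000, 3.0, 15.0, cache_hit_rate=1.0)
    print(f"no cache hits: ${no_cache:.3f}, all cache hits: ${full_cache:.3f}")
```

Caching moves only the input term. The output term is untouched, which is why a model that writes shorter replies can come out cheaper even with no cache hits at all.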
Where it stands
It turned out surprisingly well. On our own onboarding workload, a European model held its own with Claude on quality, came in at a lower cost, and answered with noticeably lower latency. For a job we run every day, that's a genuine option we didn't expect to have.
Almost as good, 4× faster, 30% cheaper, and it's European.
A note on Sonnet 5: Anthropic shipped it the same week we ran this, and we didn't rerun the eval against it. We compare against what actually runs onboarding today.