Mistral vs. Claude on our onboarding: 4× faster, 30% cheaper

July 2, 2026 · 4 min read

We put Mistral Medium 3.5 up against Claude Sonnet 4.6, the model behind our onboarding agent, on our own workload. Mistral came out about 4× faster and 30% cheaper, at near-equal quality. For this particular job it turned out to be a surprisingly good fit, for reasons specific to how onboarding actually runs.

Onboarding · new hire

Hi, I'm your new agent. What are you hoping to hand off?

Writing. A blog post about swapping our AI models for European ones.

Got it. One-off, or the start of a series you'll regret announcing?

…the second one.

Understood. I'll need the benchmarks, the eval numbers, and your tolerance for em dashes.

Zero. It is exactly zero.

We're based in Malmö, and we've always wanted a European tech stack. It's part of why we run on Hetzner, for one. The goal is an all-European stack, and we'll push for it. But the drive to build an awesome product is equally strong, so where a European option can't yet carry the job, we'll take the best model available and make sure it runs under EU legislation. We are not even an option for many of our potential customers if we don't handle their data under EU rules.

But with AI, most of the frontier is American, and our early investigations landed us on Anthropic's Claude for most of the models in our stack. It has taken us to where we are today with extreme speed, but now it's time to slow down and see how we can reach that all-European goal.

The first candidate

The first conversation you have when you hire a Squidler specialist is driven by an onboarding agent that works out what you actually need. Because it runs fairly isolated from the rest of the system, it is a good candidate to start with.

Onboarding runs on Claude Sonnet 4.6 today, as it strikes a good balance between quality, speed, and price for the task. Initially we also evaluated Haiku 4.5, Anthropic's cheaper tier, but didn't get the quality we wanted. To find a European model that could stand next to it, we pulled the published benchmarks (Artificial Analysis for capability and speed, the vendors for price). One candidate stood out.

Metric	Mistral Medium 3.5 🇫🇷	Claude Sonnet 4.6 🇺🇸	Claude Haiku 4.5 🇺🇸
Intelligence Index	30	36	24
Output speed	92 tok/s	44.8 tok/s	92 tok/s
Time to first token	2.06s	1.26s	0.93s
Price (in / out per 1M)	$1.50 / $7.50	$3 / $15	$1 / $5

Numbers as of July 2, 2026, for each vendor's own API. They drift as new benchmark runs land.

Mistral Medium 3.5 sits six points behind Sonnet on the Intelligence Index, at half the price. Its time to first token, the wait before anything shows on screen, is about 0.8s longer than Sonnet's, but its output speed matches Haiku's and is roughly twice Sonnet's, so it should still finish a complete reply sooner. On paper, Mistral is the obvious one to test: created in France and served through Mistral's own API, with data hosted in the EU. The all-European option, if the quality holds.
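A back-of-the-envelope way to read the spec sheet is to add time to first token to generation time at the quoted output speed. The sketch below does that with the table's figures. The helper names and the simple additive model are assumptions for illustration, not how either vendor or Artificial Analysis measures latency.

```python
# Spec-sheet figures from the table above: time to first token in seconds,
# output speed in tokens per second.
SPECS = {
    "mistral-medium-3.5": {"ttft_s": 2.06, "tok_per_s": 92.0},
    "claude-sonnet-4.6": {"ttft_s": 1.26, "tok_per_s": 44.8},
}


def estimated_reply_seconds(model: str, output_tokens: int) -> float:
    """Assumed model: total time = time to first token + tokens / output speed."""
    spec = SPECS[model]
    return spec["ttft_s"] + output_tokens / spec["tok_per_s"]


def break_even_tokens(fast: str, slow: str) -> float:
    """Reply length at which the higher-throughput model catches up."""
    a, b = SPECS[fast], SPECS[slow]
    ttft_gap = a["ttft_s"] - b["ttft_s"]  # extra wait before the first token
    per_token_gain = 1 / b["tok_per_s"] - 1 / a["tok_per_s"]  # seconds saved per token
    return ttft_gap / per_token_gain


if __name__ == "__main__":
    n = break_even_tokens("mistral-medium-3.5", "claude-sonnet-4.6")
    print(f"break-even at about {n:.0f} output tokens")
    for tokens in (50, 200, 500):  # hypothetical reply lengths
        for model in SPECS:
            print(model, tokens, f"{estimated_reply_seconds(model, tokens):.2f}s")
```

Under this simplified model, the break-even lands at roughly 70 output tokens: shorter replies favor Sonnet on paper, longer ones favor Mistral.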

The test

We don't judge models on vibes. Onboarding has its own eval suite: 17 scripted conversation fixtures, each run three times, scored with promptfoo on the things that matter. Does the agent surface a concrete problem, hand off cleanly, handle multiple languages, and stay grounded instead of making things up?

Result	Mistral Medium 3.5 🇫🇷	Claude Sonnet 4.6 🇺🇸	Claude Haiku 4.5 🇺🇸
Quality (checks passed)	46 / 51 (90%)	48 / 51 (94%)	39 / 51 (76%)
Speed (median per turn)	~1.9s	~7.6s	~1.9s
Cost (per session)	~$0.044	~$0.063	~$0.035

Each check is one fixture run, graded pass/fail as a whole. The subjective checks are graded by a Claude model, so any grader bias favors the incumbent. A two-check gap on 51 runs is within run-to-run noise. Speed and cost come from replayed eight-turn sessions per model.
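For a rough sense of why a two-check gap counts as noise, the pass counts can be treated as binomial samples. This is a simplification: it assumes all 51 runs are independent, which three repeats of the same fixture are not strictly, and the function names are illustrative.

```python
import math

RUNS = 17 * 3  # 17 fixtures, each run three times

# Checks passed, from the results table.
PASSED = {
    "Mistral Medium 3.5": 46,
    "Claude Sonnet 4.6": 48,
    "Claude Haiku 4.5": 39,
}


def pass_rate(passed: int, runs: int = RUNS) -> float:
    return passed / runs


def gap_in_standard_errors(passed_a: int, passed_b: int, runs: int = RUNS) -> float:
    """Difference in pass rate divided by its binomial standard error."""
    p_a, p_b = pass_rate(passed_a, runs), pass_rate(passed_b, runs)
    se = math.sqrt(p_a * (1 - p_a) / runs + p_b * (1 - p_b) / runs)
    return abs(p_a - p_b) / se


if __name__ == "__main__":
    sonnet = PASSED["Claude Sonnet 4.6"]
    for name, passed in PASSED.items():
        if name != "Claude Sonnet 4.6":
            print(name, f"{gap_in_standard_errors(sonnet, passed):.2f} standard errors from Sonnet")
```

Under these assumptions, the Mistral-to-Sonnet gap comes out under one standard error, while the Haiku-to-Sonnet gap is more than two.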

On quality, Mistral Medium 3.5 came out almost as good as Sonnet in our tests, in line with the published numbers. The surprise is speed: about 4× as quick per turn, double what the spec sheet promised. Pulling the raw replies from the eval explains it: once the conversation gets going, Mistral averaged roughly 170 output tokens per turn where Sonnet used about 500. Same job, a third of the words. The example below shows the difference:

Eval fixture · email-vague · one of 17

Hey! Shall we talk in English, Swedish, or another language?

English

Anything on your mind you'd like help with?

Honestly I'm drowning in email. Half of it is junk but the important stuff is buried in there somewhere and I keep missing things.

The next reply is the model's, same conversation for both

Claude Sonnet 4.6 · 42 words · passes 3/3 runs

That's a real problem — missing something important because it's buried is stressful. Is the main pain that you're missing things you need to act on (reply, follow up, do something), or more that you can't find information when you need it later?

Mistral Medium 3.5 · 14 words · passes 3/3 runs

That sounds like a real pain. Do you use Gmail or another email service?

Some readers will prefer Sonnet's fuller reply here, and that's fair. Both pass the same checks.

Worth noting: the per-turn figure is total time, and felt latency is not the same as measured latency. What you feel first is time to first token. The spec sheet has Mistral slower to the first word, but we never felt it: a median Mistral turn finished in less time than the spec sheet's quoted time to first token. Take published latency numbers as a starting point, not an answer.

Cost is not as simple as reading the model card. Two other factors weigh in: caching and output tokens. Both Mistral and Anthropic price cached input at a tenth of the normal rate. The catch is that a cache only helps if it hits. Claude's has been outstanding in our experience, while Mistral's, which its own docs describe as best-effort, never returned a single cached token when we tested it directly. For onboarding, the miss doesn't flip the result: caching only discounts input, Sonnet's bill is mostly its long replies, and Mistral lands about 30% cheaper per session even paying full price for every input token. If Mistral's cache ever starts delivering, that number only improves.
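The interplay of caching and reply length can be sketched with a small cost function. Prices are the list prices from the first table, plus the tenth-of-list cached rate. The input token count and the cache hit rates are hypothetical placeholders, not measurements from our sessions, and the sketch ignores any cache-write surcharge, so the totals will not match the measured per-session figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens
    cached_factor: float = 0.1  # cached input billed at a tenth of list price


MISTRAL = Pricing(input_per_m=1.50, output_per_m=7.50)
SONNET = Pricing(input_per_m=3.00, output_per_m=15.00)


def session_cost(p: Pricing, input_tokens: int, output_tokens: int,
                 cache_hit_rate: float) -> float:
    """USD cost of one session; cache_hit_rate is the share of input served from cache."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * p.cached_factor) * p.input_per_m / 1e6
    output_cost = output_tokens * p.output_per_m / 1e6
    return input_cost + output_cost


# Hypothetical session: 8 turns and 20,000 input tokens in total (placeholder).
# Output is sized from the rough per-turn averages of 170 and 500 tokens.
TURNS = 8
INPUT_TOKENS = 20_000
mistral = session_cost(MISTRAL, INPUT_TOKENS, TURNS * 170, cache_hit_rate=0.0)
sonnet = session_cost(SONNET, INPUT_TOKENS, TURNS * 500, cache_hit_rate=0.9)

if __name__ == "__main__":
    print(f"Mistral, no cache hits:  ${mistral:.4f}")
    print(f"Sonnet, 90% cache hits:  ${sonnet:.4f}")
```

With these placeholder inputs, output tokens dominate Sonnet's total, so Mistral stays cheaper even with no cache hits at all. Changing `INPUT_TOKENS` or the hit rates shows how input-heavy a workload would need to be for that to change.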

Where it stands

It turned out surprisingly well. On our own onboarding workload, a European model held its own against Claude on quality, came in at a lower cost, and finished each turn in about a quarter of the time. For a job we run every day, that's a genuine option we didn't expect to have.

Almost as good, 4× faster, 30% cheaper, and it's European.

A note on Sonnet 5: Anthropic shipped it the same week we ran this, and we didn't rerun the eval against it. We compare against what actually runs onboarding today.