Local LLM Benchmark: Gemma 4 vs Qwen 3.5
Head-to-head on a Mac Mini. 26 prompts, 6 categories, and one surprising finding about thinking mode overhead.
TL;DR
Gemma 4 31B (bf16) wins overall: better quality (3.77 vs 2.79), 9 prompt wins vs Qwen's 4, and ~8x faster end-to-end.
Qwen 3.5's thinking mode eats the max_tokens budget. Smaller model, but much slower in practice. Counterintuitive.
The Setup
I wanted to pick a daily-driver model for my Hermes agent. Two candidates on disk: Google's new Gemma 4 31B at full bf16 precision, and Alibaba's Qwen 3.5 27B at Q8_0 quantization.
Both served locally via LM Studio on port 1234, both loaded with a 32K context window. Hardware: Mac Mini M-series, 192GB unified memory.
- Gemma 4 31B — 61.4 GB in RAM, quadratic attention, vision-capable
- Qwen 3.5 27B — 28.6 GB in RAM, hybrid DeltaNet linear attention, thinking mode
The Benchmark
26 prompts across six categories — reasoning, coding, math, instruction following, creative writing, and tool use. Each prompt was scored programmatically first (exact match, unit tests, constraint checks, tool trace validation), with Claude-as-judge used only for genuinely subjective outputs.
Deterministic prompts ran at temperature=0, seed=42. Creative ones at temperature=0.8 with three seeds. Two warmup calls discarded, three scored repeats per model per prompt — 156 total generations.
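For reference, each call boils down to a single request against LM Studio's OpenAI-compatible endpoint. This is a minimal sketch, not the actual runner: the port-1234 endpoint and the sampling settings come from the setup above, while the 4096-token cap and the `build_request`/`run` helper names are illustrative.

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's OpenAI-compatible server

def build_request(model: str, prompt: str, creative: bool = False) -> dict:
    """Deterministic settings for scored prompts; sampled settings for creative ones."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096,  # illustrative cap; see the thinking-mode caveat below
    }
    if creative:
        body.update(temperature=0.8)  # the three seeds are swept by the caller
    else:
        body.update(temperature=0, seed=42)
    return body

def run(model: str, prompt: str, **kw) -> dict:
    """POST one prompt and return the parsed JSON response."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_request(model, prompt, **kw)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```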
Quality Scores
| Category | Gemma4 bf16 | Qwen3.5 Q8 | Winner |
|---|---|---|---|
| reasoning | 3.60 | 2.60 | Gemma4 |
| coding | 5.00 | 5.00 | tie |
| math | 1.25 | 0.75 | Gemma4 |
| instruction | 5.00 | 3.00 | Gemma4 |
| creative | 5.00 | 3.00 | Gemma4 |
| tool_use | 3.12 | 2.50 | Gemma4 |
| Overall | 3.77 | 2.79 | Gemma4 |
Scores are on a 0–5 scale, computed programmatically from exact-match checks, unit tests, keyword coverage, and constraint checks.
Head-to-Head Wins
Taking the best-of-3 score per prompt per model and comparing:
| Category | Gemma | Qwen | Tie |
|---|---|---|---|
| reasoning | 2 | 2 | 1 |
| coding | 0 | 0 | 4 |
| math | 1 | 1 | 2 |
| instruction | 4 | 0 | 0 |
| creative | 1 | 0 | 2 |
| tool_use | 1 | 1 | 2 |
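The tally above is simple to reproduce from per-prompt scores. A sketch, assuming scores keyed by prompt ID (the actual scoring code lives in the repo linked at the end):

```python
from collections import Counter

def head_to_head(scores: dict) -> Counter:
    """scores: {prompt_id: {"gemma": [s1, s2, s3], "qwen": [s1, s2, s3]}}.
    Take the best-of-3 score per prompt per model and tally wins and ties."""
    tally = Counter()
    for runs in scores.values():
        g, q = max(runs["gemma"]), max(runs["qwen"])
        tally["gemma" if g > q else "qwen" if q > g else "tie"] += 1
    return tally
```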
Performance: The Surprise
On paper, Qwen 3.5 should be faster. It's smaller (28.6 GB vs 61.4 GB) and uses linear-attention layers. On Apple Silicon, where inference is memory-bandwidth-bound, a smaller model should mean faster tokens.
In practice, Gemma 4 was ~8x faster end-to-end:
| Metric | Gemma4 | Qwen3.5 |
|---|---|---|
| Median TTFT (s) | 1.84 | 0.66 |
| Median tok/s | 8.2 | 3.4 |
| Median total (s) | 8.0 | 65.6 |
The culprit: thinking mode. Qwen 3.5 emits reasoning tokens in a separate reasoning_content stream before producing the final answer. Those reasoning tokens share the max_tokens budget with the answer.
My first run accidentally capped Qwen at 512 tokens. The model burned through its entire budget on reasoning and never output a final answer — finish_reason: length with empty content. I had to multiply the budget by 8x to get answers at all.
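A guard for this failure mode is cheap to add to a runner. The sketch below assumes the OpenAI-style response shape described above (`reasoning_content` alongside `content`, `finish_reason: length`); exact field names may vary by server:

```python
def budget_exhausted(choice: dict) -> bool:
    """True when the model spent its max_tokens budget on reasoning and
    produced no final answer: finish_reason == "length" with empty content
    but non-empty reasoning_content."""
    msg = choice.get("message", {})
    return (
        choice.get("finish_reason") == "length"
        and not (msg.get("content") or "").strip()
        and bool((msg.get("reasoning_content") or "").strip())
    )
```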
Even with the fix, each Qwen prompt took ~65 seconds. Gemma averaged 8. For an agent loop where you're making dozens of calls, this is the whole ballgame.
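Back-of-the-envelope, using the median totals from the table and a hypothetical 30-call agent session:

```python
calls = 30  # hypothetical agent session length
qwen_minutes = calls * 65.6 / 60   # median total per call, from the table above
gemma_minutes = calls * 8.0 / 60
print(round(qwen_minutes, 1), round(gemma_minutes, 1))  # 32.8 4.0
```

Half an hour versus four minutes for the same session.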
What Else Stood Out
Gemma dominated instruction following (4–0)
Prompts like "answer in exactly 3 bullet points, 4 words each, no punctuation" — Gemma nailed them. Qwen's thinking added explanatory preamble that broke strict formats.
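Constraints like that are exactly what programmatic scoring handles well. A plausible sketch of such a checker, not the suite's actual implementation:

```python
import string

def check_strict_format(text: str, bullets: int = 3, words: int = 4) -> bool:
    """Verify "exactly N bullet points, M words each, no punctuation"."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != bullets:
        return False
    for ln in lines:
        if not ln.startswith("- "):
            return False
        body = ln[2:]
        if any(ch in string.punctuation for ch in body):
            return False
        if len(body.split()) != words:
            return False
    return True
```

Any preamble, trailing explanation, or stray comma fails the whole prompt, which is how Qwen's thinking-flavored answers lost these.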
Coding was a wash (0–0, 4 ties)
Both models passed all 4 coding unit tests. LRU cache, nginx parser, range compactor, SQL query — everything 5/5. Need harder coding prompts to differentiate.
Qwen's only speed win: TTFT
First token arrived in 0.66s (vs Gemma's 1.84s) — because it immediately streams thinking tokens. Feels responsive. Total time still dominates.
Caveats
This is a deployment comparison, not a pure architecture comparison. bf16 vs Q8_0 folds together model quality, quantization precision, memory bandwidth, and runtime backend.
Sample size is small: 26 prompts, 4–5 per category. Broad claims would be premature. "Gemma 4 is better than Qwen 3.5" is not what this tells us.
What it does tell me: for my specific hardware, my specific workload, and my specific budget of patience, Gemma 4 31B is the better daily driver for Hermes.
Code
The full suite (prompts, runner, and scoring code) is open source:
github.com/omar16100/llm-benchmark