Local LLM Benchmark: Gemma 4 vs Qwen 3.5
Head-to-head on a Mac Mini. 26 prompts, 6 categories, and one surprising finding about thinking mode overhead.
TL;DR
Gemma 4 31B (bf16) wins overall: better quality (3.77 vs 2.79), 9 prompt wins vs Qwen's 4, and ~8x faster end-to-end.
Qwen 3.5's thinking mode eats the max_tokens budget. Smaller model, but much slower in practice. Counterintuitive.
The Setup
I wanted to pick a daily-driver model for my Hermes agent. Two candidates on disk: Google's new Gemma 4 31B at full bf16 precision, and Alibaba's Qwen 3.5 27B at Q8_0 quantization.
Both served locally via LM Studio on port 1234, both loaded with a 32K context window. Hardware: Mac Mini M-series, 192GB unified memory.
- Gemma 4 31B — 61.4 GB in RAM, quadratic attention, vision-capable
- Qwen 3.5 27B — 28.6 GB in RAM, hybrid DeltaNet linear attention, thinking mode
The Benchmark
26 prompts across six categories — reasoning, coding, math, instruction following, creative writing, and tool use. Each prompt was scored programmatically first (exact match, unit tests, constraint checks, tool trace validation), with Claude-as-judge used only for genuinely subjective outputs.
Deterministic prompts ran at temperature=0, seed=42. Creative ones at temperature=0.8 with three seeds. Two warmup calls discarded, three scored repeats per model per prompt — 156 total generations.
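For reference, each call boils down to a single request against LM Studio's OpenAI-compatible endpoint. This is a minimal sketch, not the actual runner: the port-1234 endpoint and the sampling settings come from the setup above, while the 4096-token cap and the `build_request`/`run` helper names are illustrative.

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's OpenAI-compatible server

def build_request(model: str, prompt: str, creative: bool = False) -> dict:
    """Deterministic settings for scored prompts; sampled settings for creative ones."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096,  # illustrative cap; see the thinking-mode caveat below
    }
    if creative:
        body.update(temperature=0.8)  # the three seeds are swept by the caller
    else:
        body.update(temperature=0, seed=42)
    return body

def run(model: str, prompt: str, **kw) -> dict:
    """POST one prompt and return the parsed JSON response."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_request(model, prompt, **kw)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```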
Quality Scores
| Category | Gemma4 bf16 | Qwen3.5 Q8 | Winner |
|---|---|---|---|
| reasoning | 3.60 | 2.60 | Gemma4 |
| coding | 5.00 | 5.00 | tie |
| math | 1.25 | 0.75 | Gemma4 |
| instruction | 5.00 | 3.00 | Gemma4 |
| creative | 5.00 | 3.00 | Gemma4 |
| tool_use | 3.12 | 2.50 | Gemma4 |
| Overall | 3.77 | 2.79 | Gemma4 |
Scores are on a 0–5 scale, computed programmatically from exact-match checks, unit tests, keyword coverage, and constraint checks.
Head-to-Head Wins
Taking the best-of-3 score per prompt per model and comparing:
| Category | Gemma | Qwen | Tie |
|---|---|---|---|
| reasoning | 2 | 2 | 1 |
| coding | 0 | 0 | 4 |
| math | 1 | 1 | 2 |
| instruction | 4 | 0 | 0 |
| creative | 1 | 0 | 2 |
| tool_use | 1 | 1 | 2 |
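The tally above is simple to reproduce from per-prompt scores. A sketch, assuming scores keyed by prompt ID (the actual scoring code lives in the repo linked at the end):

```python
from collections import Counter

def head_to_head(scores: dict) -> Counter:
    """scores: {prompt_id: {"gemma": [s1, s2, s3], "qwen": [s1, s2, s3]}}.
    Take the best-of-3 score per prompt per model and tally wins and ties."""
    tally = Counter()
    for runs in scores.values():
        g, q = max(runs["gemma"]), max(runs["qwen"])
        tally["gemma" if g > q else "qwen" if q > g else "tie"] += 1
    return tally
```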
Performance: The Surprise
On paper, Qwen 3.5 should be faster. It's smaller (28.6 GB vs 61.4 GB) and uses linear-attention layers. On Apple Silicon, where inference is memory-bandwidth-bound, a smaller model should mean faster tokens.
In practice, Gemma 4 was ~8x faster end-to-end:
| Metric | Gemma4 | Qwen3.5 |
|---|---|---|
| Median TTFT (s) | 1.84 | 0.66 |
| Median tok/s | 8.2 | 3.4 |
| Median total (s) | 8.0 | 65.6 |
The culprit: thinking mode. Qwen 3.5 emits reasoning tokens in a separate reasoning_content stream before producing the final answer. Those reasoning tokens share the max_tokens budget with the answer.
My first run accidentally capped Qwen at 512 tokens. The model burned through its entire budget on reasoning and never output a final answer — finish_reason: length with empty content. I had to multiply the budget by 8x to get answers at all.
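A guard for this failure mode is cheap to add to a runner. The sketch below assumes the OpenAI-style response shape described above (`reasoning_content` alongside `content`, `finish_reason: length`); exact field names may vary by server:

```python
def budget_exhausted(choice: dict) -> bool:
    """True when the model spent its max_tokens budget on reasoning and
    produced no final answer: finish_reason == "length" with empty content
    but non-empty reasoning_content."""
    msg = choice.get("message", {})
    return (
        choice.get("finish_reason") == "length"
        and not (msg.get("content") or "").strip()
        and bool((msg.get("reasoning_content") or "").strip())
    )
```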
Even with the fix, each Qwen prompt took ~65 seconds. Gemma averaged 8. For an agent loop where you're making dozens of calls, this is the whole ballgame.
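Back-of-the-envelope, using the median totals from the table and a hypothetical 30-call agent session:

```python
calls = 30  # hypothetical agent session length
qwen_minutes = calls * 65.6 / 60   # median total per call, from the table above
gemma_minutes = calls * 8.0 / 60
print(round(qwen_minutes, 1), round(gemma_minutes, 1))  # 32.8 4.0
```

Half an hour versus four minutes for the same session.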
What Else Stood Out
Gemma dominated instruction following (4–0)
Prompts like "answer in exactly 3 bullet points, 4 words each, no punctuation" — Gemma nailed them. Qwen's thinking added explanatory preamble that broke strict formats.
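Constraints like that are exactly what programmatic scoring handles well. A plausible sketch of such a checker, not the suite's actual implementation:

```python
import string

def check_strict_format(text: str, bullets: int = 3, words: int = 4) -> bool:
    """Verify "exactly N bullet points, M words each, no punctuation"."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != bullets:
        return False
    for ln in lines:
        if not ln.startswith("- "):
            return False
        body = ln[2:]
        if any(ch in string.punctuation for ch in body):
            return False
        if len(body.split()) != words:
            return False
    return True
```

Any preamble, trailing explanation, or stray comma fails the whole prompt, which is how Qwen's thinking-flavored answers lost these.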
Coding was a wash (0–0, 4 ties)
Both models passed all 4 coding unit tests. LRU cache, nginx parser, range compactor, SQL query — everything 5/5. Need harder coding prompts to differentiate.
Qwen's only speed win: TTFT
First token arrived in 0.66s (vs Gemma's 1.84s) — because it immediately streams thinking tokens. Feels responsive. Total time still dominates.
Caveats
This is a deployment comparison, not a pure architecture comparison. bf16 vs Q8_0 folds together model quality, quantization precision, memory bandwidth, and runtime backend.
Sample size is small: 26 prompts, 4–5 per category. Broad claims would be premature. "Gemma 4 is better than Qwen 3.5" is not what this tells us.
What it does tell me: for my specific hardware, my specific workload, and my specific budget of patience, Gemma 4 31B is the better daily driver for Hermes.
Code
The full suite (prompts, runner, and scoring code) is open source:
github.com/omar16100/llm-benchmark