Project Case Study

Apple Silicon LLM Inference Benchmarking Suite

Tokens per second / Time to first token / Peak RAM delta / WikiText-2 perplexity / llama.cpp + Metal / Apple MLX

What this project does

When deploying a quantized LLM locally, you're making a tradeoff you can't see. A 3-bit model uses half the RAM of an 8-bit model but may hallucinate more. MLX might generate 47 tokens per second while llama.cpp delivers the first token in 30ms — even if its overall throughput is lower. The right choice depends entirely on whether you're building a batch processor or a chat interface.

This suite answers that question with statistical rigor: warmup runs are discarded, 5 timed runs are averaged with standard deviation reported, and output quality is measured via WikiText-2 perplexity so speed comparisons never ignore model degradation. CUDA benchmarks are everywhere; Metal benchmarks are not. This fills that gap with real measurements on real hardware.

System architecture

The system has three completely decoupled layers. No layer knows the internal details of another; they communicate only through well-defined interfaces: a shared dataclass, a REST API, and a Server-Sent Events stream.

Image: Fig. 1 — Three-layer decoupled architecture: benchmarking engine, FastAPI + SSE layer, Next.js dashboard

The key architectural decision is that the benchmarking layer is pure Python with no web dependencies. It runs standalone from the CLI or as a background task called by the API. This separation means the measurement code can be tested independently of the dashboard, and the dashboard can render any run stored on disk without re-running benchmarks.
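As a rough illustration of the shared dataclass, here is a minimal sketch. The class name `BenchmarkResult` and every field name are assumptions chosen to mirror the metrics this suite reports; they are not taken from the project's code. The JSON round trip shows one way a stored run could be reloaded for rendering without re-running the benchmark.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical record shared between layers; all field names are assumptions."""
    backend: str              # e.g. "llama.cpp" or "mlx"
    quant: str                # quantization label, e.g. "Q4_K_M"
    prompt_length: str        # "short", "medium", or "long"
    tps_mean: float           # mean tokens per second over the timed runs
    tps_std: float            # standard deviation of tokens per second
    ttft_ms: float            # time to first token, in milliseconds
    peak_ram_delta_mb: float  # peak RAM increase during the run
    perplexity: float         # WikiText-2 perplexity

    def to_json(self) -> str:
        """Serialize to a JSON string that can be written to disk."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "BenchmarkResult":
        """Rebuild a record from a JSON string produced by to_json."""
        return cls(**json.loads(payload))
```

Because the record depends only on the standard library, both the CLI and the API layer could import it without pulling in web dependencies.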

End-to-end data flow

The most interesting engineering challenge is getting progress events from a blocking Python benchmark thread into an async Server-Sent Events stream in real time. The flow below traces a single benchmark run from button click to dashboard render.

Image: Fig. 2 — Data flow: browser → FastAPI → thread pool → asyncio.Queue → SSE → dashboard

Key challenge: The benchmark thread cannot call await queue.put() directly. The fix is asyncio.run_coroutine_threadsafe(queue.put(event), loop) — scheduling a coroutine on the event loop from the thread. The loop reference must be captured in the async context before the thread starts, or get_running_loop() raises RuntimeError inside the thread.
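The pattern can be sketched with the standard library alone. In this sketch `blocking_benchmark` and `collect_events` are hypothetical stand-ins for the real benchmark and the SSE endpoint, and `None` is an assumed end-of-stream sentinel; only `asyncio.run_coroutine_threadsafe`, `asyncio.get_running_loop`, and `loop.run_in_executor` are real API calls.

```python
import asyncio


def blocking_benchmark(loop, queue, n_events):
    """Runs in a worker thread; stands in for the real blocking benchmark."""
    for i in range(n_events):
        event = {"step": i}
        # Schedule queue.put() on the event loop from this thread, then block
        # until the loop has actually executed it.
        asyncio.run_coroutine_threadsafe(queue.put(event), loop).result()
    # Assumed sentinel telling the consumer that the run has finished.
    asyncio.run_coroutine_threadsafe(queue.put(None), loop).result()


async def collect_events(n_events=3):
    """Consumes events on the event loop, as an SSE generator would."""
    # Capture the loop here, in the async context, before the thread starts.
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    worker = loop.run_in_executor(None, blocking_benchmark, loop, queue, n_events)
    events = []
    while (event := await queue.get()) is not None:
        events.append(event)
    await worker  # awaiting (not blocking) lets the thread's last put complete
    return events
```

Running `asyncio.run(collect_events())` returns the events in the order the thread produced them. The worker is awaited rather than joined because a blocking join on the event loop thread could stall the loop while the worker is still waiting on it.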

Inference backends

Two fundamentally different inference strategies are compared. Their architectural differences, not just implementation choices, explain why they behave differently on latency, throughput, and first-token time.

Image: Fig. 3 — Backend memory models: llama.cpp offloads layers to a separate GPU device (with copies); MLX uses shared unified memory

Measurement methodology

Statistical rigor separates useful benchmarks from noise. Every number in the results went through the same pipeline.

Image: Fig. 4 — Measurement pipeline per configuration: load → warmup → timed runs → statistics → perplexity

Three prompt lengths are tested per configuration: short (50 expected tokens), medium (120), and long (200). Prompts cover ML topics to keep the task domain consistent. TPS often drops on longer prompts as the KV cache fills: each new token attends over a longer context, and prefill attention cost grows quadratically with prompt length.

Without warmup, run-to-run variation was ±15 tok/s. With proper warmup and 5-run averaging, it dropped to ±3 tok/s. The error bars in the dashboard charts are real information: a configuration with high run-to-run variation is unreliable even if its mean looks competitive.
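The warmup-then-average procedure can be sketched as follows. This is an illustration under stated assumptions, not the suite's code: `benchmark`, `summarize`, and the `generate` callable are hypothetical names, the default of two warmup passes is assumed, and the sample standard deviation is assumed as the spread measure.

```python
import statistics
import time


def summarize(rates):
    """Return (mean, sample standard deviation) of per-run tokens/sec."""
    return statistics.mean(rates), statistics.stdev(rates)


def benchmark(generate, warmup_runs=2, timed_runs=5):
    """Time `generate`, a callable that runs one generation and returns
    the number of tokens it produced. Warmup runs are executed but discarded."""
    for _ in range(warmup_runs):
        generate()  # result and timing are thrown away
    rates = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return summarize(rates)
```

Keeping `summarize` separate from the timing loop means the statistics can be checked against known inputs without depending on wall-clock behavior.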

Benchmark results

All results come from the final benchmark run on Apple Silicon (M-series). Pareto-optimal configurations are highlighted.

Q4_K_M insight: Achieves PPL 9.02 vs Q8_0's 8.91 — a difference of only 0.11 perplexity units — while using 40% less RAM. For most local deployment scenarios, Q4_K_M is the practical choice even though it falls off the strict Pareto frontier.

Pareto frontier analysis

A configuration is Pareto-optimal if no other configuration simultaneously beats it on both throughput (TPS, higher better) and quality (PPL, lower better). Configurations behind the frontier are dominated — you can do strictly better by switching.
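The definition above translates directly into a small filter. This is a minimal sketch: `pareto_frontier` is a hypothetical helper, and the labels and numbers in the usage note are placeholders, not measured results. It follows the definition as stated (another configuration must be strictly better on both axes); a common textbook variant also counts a tie on one axis as domination.

```python
def pareto_frontier(configs):
    """Return the sorted names of non-dominated configurations.

    `configs` maps a name to a (tps, ppl) pair, where higher tps is better
    and lower ppl is better.
    """
    def beats_on_both(a, b):
        # a has strictly higher throughput AND strictly lower perplexity than b
        return a[0] > b[0] and a[1] < b[1]

    return sorted(
        name
        for name, point in configs.items()
        if not any(
            beats_on_both(other, point)
            for other_name, other in configs.items()
            if other_name != name
        )
    )
```

For placeholder points `{"A": (10, 5.0), "B": (8, 4.0), "C": (7, 6.0)}`, C is dominated by A (slower and higher perplexity), while A and B trade speed against quality and both stay on the frontier.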

Image: Fig. 5 — Pareto frontier: TPS vs PPL. Configurations below and to the right of the frontier line are dominated.

The three Pareto-optimal configurations represent meaningfully different tradeoffs: MLX 4-bit maximizes raw throughput for batch workloads; IQ4_XS balances speed and quality with the fastest first token; Q8_0 minimizes perplexity for quality-sensitive applications. Everything else — Q3_K_L, Q5_K_M, Q6_K, MLX 8-bit — is strictly dominated.

Complete results table

The sortable table below shows all 27 runs — every configuration tested across all three prompt lengths, ordered by throughput. Standard deviations confirm measurement stability.

Key engineering challenges

Thread-safe SSE streaming

FastAPI runs on asyncio; the benchmark runs in a thread pool. Pushing events from a background thread into an asyncio Queue requires asyncio.run_coroutine_threadsafe(queue.put(event), loop). The event loop reference must be captured in the async context before the thread starts — calling get_running_loop() inside the thread raises RuntimeError: no running event loop.

MLX lazy evaluation

MLX records operations into a computation graph rather than executing them immediately. Calling mx.eval() before timing starts forces all pending ops to resolve. Without this, the first timed run includes graph compilation and the numbers are incorrect. Two warmup passes are also required to fully compile the JIT graph before timed runs begin.

llama.cpp logits_all restriction

The default benchmark model loads with logits_all=False — it only computes logits for the most recent token during generation. Perplexity requires logits for every token in the context window simultaneously. The fix: load a second Llama instance with logits_all=True just for perplexity, compute PPL, then explicitly del it before benchmarking begins to avoid doubling memory footprint.
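To see why logits are needed at every position, it helps to write out the standard perplexity formula, PPL = exp(-(1/N) Σ log p(token_i | preceding tokens)). The sketch below is a plain-Python illustration of that formula, not the suite's implementation; `log_softmax` and `perplexity` are hypothetical helpers, and it assumes the per-position logits have already been extracted from the model as lists of floats.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(logits_per_position, token_ids):
    """Perplexity of `token_ids`, where logits_per_position[i] are the
    logits produced after seeing token_ids[: i + 1], i.e. the prediction
    for token_ids[i + 1]."""
    nll = 0.0
    count = 0
    # zip stops at the shorter sequence, so the final row (which predicts a
    # token beyond the context) is ignored if present.
    for logits, target in zip(logits_per_position, token_ids[1:]):
        nll -= log_softmax(logits)[target]
        count += 1
    return math.exp(nll / count)
```

A model that assigns equal probability to every entry of a four-token vocabulary yields a perplexity of 4, which is a useful sanity check. With only the last position's logits available, the sum has a single term per forward pass, which is why the default generation configuration is unsuitable for this measurement.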

Next.js 15.3 localStorage bug

Next.js 15.3 passes --localstorage-file to Node.js for Web Storage APIs server-side. Without a valid file path, a sealed Proxy object is created that exists but has no methods — localStorage.getItem throws TypeError: not a function. Since the Proxy is sealed it cannot be patched in place. The fix is an instrumentation.ts file that runs before any rendering and replaces global.localStorage entirely with a plain object stub.

MLX new sampler API

mlx-lm updated its generation API to use a sampler parameter instead of temperature/temp. Passing sampler=None selects greedy decoding (deterministic, equivalent to temperature=0). Call sites that still passed the old keyword raised TypeError for an unexpected keyword argument, so every call site had to be updated.

Key findings

The most significant insight is the TTFT split between backends. For a chatbot or interactive application, MLX's 250ms before the first word appears feels sluggish even if its overall token rate is higher. llama.cpp's 30ms TTFT feels instantaneous. The right backend choice depends entirely on use case: batch processing favors MLX throughput; interactive applications favor llama.cpp responsiveness.

For most local deployment scenarios, Q4_K_M (if responsiveness matters) or Q8_0 (if quality matters) are the defensible choices. Q8_0 sits on the Pareto frontier; Q4_K_M falls just off it but stays within 0.11 perplexity units of Q8_0 while using 40% less RAM. The remaining "middle" configurations (Q5_K_M, Q6_K, MLX 8-bit) are each beaten on both speed and quality by a frontier option, so they offer no advantage.