Sovereign Inference

Your hardware. Your code. No CUDA. No cloud. No lock-in.
9x faster inference on the EPYC fleet you already have.

  • 9x speedup on 70B Q4 (7 → 60-75 tok/s)
  • 60-75 tok/s on EPYC, 70B Q4
  • 671B Q8 on one socket
  • 6 TB max memory per socket
  • 49-78 tok/s on M3 Ultra, 70B

EPYC Sovereign Inference

AMD EPYC 9754 — 128 cores, 768 GB RAM, dedicated-CCD speculative decoding. A 9x performance gain, running 671B models that GPUs can't touch. Full comparison matrix.

Apple Silicon Performance

Three-architecture comparison: macOS native, Phase 2 EL2 VM, bare metal M3. Memory hierarchy analysis, LLM inference deep-dive, M3 vs M5 Ultra projections.


The Performance Story

Stock llama.cpp on EPYC 9754: 7-8 tok/s for 70B. Disappointing. But that number is the floor, not the ceiling. The EPYC's NUMA topology — which normally hurts inference — becomes the weapon.

Dedicated-CCD speculative decoding partitions the 128 cores by function: a small draft model runs entirely from L3 cache at 200 tok/s, while the 112 verification cores consume the full 460.8 GB/s of memory bandwidth. Result: 60-75 tok/s. Nine times faster.
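The control flow above can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: `draft_step` and `verify_batch` are hypothetical stand-ins for the cache-resident draft model and the batched target-model verify pass.

```python
# Toy speculative decoding loop: the fast draft proposes k tokens, the
# verifier checks all k in one batched pass, and the longest agreeing
# prefix is committed (plus one corrected token on a mismatch).

def speculative_decode(draft_step, verify_batch, prompt, k=4, n_tokens=8):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft phase: k cheap autoregressive steps (L3-resident model).
        ctx, draft = list(out), []
        for _ in range(k):
            tok = draft_step(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. Verify phase: the target model scores all k positions at once,
        #    amortizing one memory-bandwidth-bound pass over k tokens.
        target = verify_batch(out, draft)
        # 3. Accept the longest prefix where draft and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == target[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        if n_ok < k:                  # first disagreement: take the
            out.append(target[n_ok])  # target model's token instead
    return out[len(prompt):][:n_tokens]
```

The speedup comes from step 2: when the draft's acceptance rate is high, each bandwidth-bound verify pass commits several tokens instead of one.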

EPYC Performance Gains

  • 9x speedup on 70B Q4 (7-8 → 60-75 tok/s)
  • 671B Q8 on one socket — no GPU can do this
  • 768 GB installed, up to 6 TB per socket — scales with model growth
  • 300-400 concurrent users from 100 servers
  • L3-speed drafting at 200 tok/s inside 32 MB cache
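The 16/112 core split behind these numbers can be sketched as a partition plus CPU pinning. The exact core numbering is an assumption for illustration; on Linux, `os.sched_setaffinity` is one way to pin a worker to its pool.

```python
import os

# Assumed layout for illustration: cores 0-15 host the L3-resident draft
# model, cores 16-127 run the bandwidth-bound verification workers.
TOTAL_CORES = 128                               # EPYC 9754
DRAFT_CORES = 16
DRAFT_POOL  = set(range(DRAFT_CORES))           # draft model's cores
VERIFY_POOL = set(range(DRAFT_CORES, TOTAL_CORES))

def pin_to_pool(pool):
    """Pin the calling process to a core pool (Linux only)."""
    os.sched_setaffinity(0, pool)               # 0 = current process
```

Keeping the draft pool disjoint from the verify pool is the point: the draft never evicts verification data from the shared caches, and the verifiers never steal draft cycles.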

Apple Silicon Bare Metal

  • 49-78 tok/s on M3 Ultra 70B (sovereign)
  • 50 ns context switches (vs 3 µs on macOS)
  • 38 KV cache pools, zero-prefill switching
  • Tree speculation at 60x coordination speed
  • Zero Apple code in inference path