Sovereign Inference

Your hardware. Your code. No CUDA. No cloud. No lock-in.
9x faster inference on the EPYC fleet you already have.

  • 9x speedup on 70B Q4 (7 → 60-75 tok/s)
  • 60-75 tok/s on EPYC, 70B Q4
  • 671B Q8 on one socket
  • 6 TB max memory per socket
  • 49-78 tok/s on M3 Ultra, 70B

EPYC Sovereign Inference

AMD EPYC 9754 — 128 cores, 768 GB RAM, dedicated-CCD speculative decoding. A 9x performance gain, running 671B models that GPUs can't touch. Full comparison matrix.

Apple Silicon Performance

Three-architecture comparison: macOS native, Phase 2 EL2 VM, bare metal M3. Memory hierarchy analysis, LLM inference deep-dive, M3 vs M5 Ultra projections.


The Performance Story

Stock llama.cpp on EPYC 9754: 7-8 tok/s for 70B. Disappointing. But that number is the floor, not the ceiling. The EPYC's NUMA topology — which normally hurts inference — becomes the weapon.

Dedicated-CCD speculative decoding partitions the 128 cores by function: a small draft model runs entirely from L3 cache at 200 tok/s, while the 112 verification cores consume the full 460.8 GB/s of memory bandwidth. Result: 60-75 tok/s. Nine times faster.
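The control flow above can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: `draft_step` and `verify_batch` are hypothetical stand-ins for the cache-resident draft model and the batched target-model verify pass.

```python
# Toy speculative decoding loop: the fast draft proposes k tokens, the
# verifier checks all k in one batched pass, and the longest agreeing
# prefix is committed (plus one corrected token on a mismatch).

def speculative_decode(draft_step, verify_batch, prompt, k=4, n_tokens=8):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft phase: k cheap autoregressive steps (L3-resident model).
        ctx, draft = list(out), []
        for _ in range(k):
            tok = draft_step(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. Verify phase: the target model scores all k positions at once,
        #    amortizing one memory-bandwidth-bound pass over k tokens.
        target = verify_batch(out, draft)
        # 3. Accept the longest prefix where draft and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == target[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        if n_ok < k:                  # first disagreement: take the
            out.append(target[n_ok])  # target model's token instead
    return out[len(prompt):][:n_tokens]
```

The speedup comes from step 2: when the draft's acceptance rate is high, each bandwidth-bound verify pass commits several tokens instead of one.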

EPYC Performance Gains

  • 9x speedup on 70B Q4 (7-8 → 60-75 tok/s)
  • 671B Q8 on one socket — no GPU can do this
  • 768 GB installed, up to 6 TB per socket — scales with model growth
  • 300-400 concurrent users from 100 servers
  • L3-speed drafting at 200 tok/s inside 32 MB cache
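The 16/112 core split behind these numbers can be sketched as a partition plus CPU pinning. The exact core numbering is an assumption for illustration; on Linux, `os.sched_setaffinity` is one way to pin a worker to its pool.

```python
import os

# Assumed layout for illustration: cores 0-15 host the L3-resident draft
# model, cores 16-127 run the bandwidth-bound verification workers.
TOTAL_CORES = 128                               # EPYC 9754
DRAFT_CORES = 16
DRAFT_POOL  = set(range(DRAFT_CORES))           # draft model's cores
VERIFY_POOL = set(range(DRAFT_CORES, TOTAL_CORES))

def pin_to_pool(pool):
    """Pin the calling process to a core pool (Linux only)."""
    os.sched_setaffinity(0, pool)               # 0 = current process
```

Keeping the draft pool disjoint from the verify pool is the point: the draft never evicts verification data from the shared caches, and the verifiers never steal draft cycles.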

Apple Silicon Bare Metal

  • 49-78 tok/s on M3 Ultra 70B (sovereign)
  • 50 ns context switches (vs 3 µs on macOS)
  • 38 KV cache pools, zero-prefill switching
  • Tree speculation at 60x coordination speed
  • Zero Apple code in inference path