Your hardware. Your code. No CUDA. No cloud. No lock-in.
9x faster inference on the EPYC fleet you already have.
AMD EPYC 9754 — 128 cores, 768 GB RAM, dedicated-CCD speculative decoding: a 9x speedup over stock llama.cpp. Runs 671B models that won't fit on any single GPU. Full comparison matrix.
→ Three-architecture comparison: macOS native, Phase 2 EL2 VM, bare-metal M3. Memory-hierarchy analysis, LLM inference deep-dive, M3 vs M5 Ultra projections.
→ Stock llama.cpp on the EPYC 9754: 7-8 tok/s on a 70B model. Disappointing. But that number is the floor, not the ceiling. The EPYC's NUMA topology, which normally hurts inference, becomes the weapon.
Dedicated-CCD speculative decoding partitions the 128 cores by function: 16 draft cores run a small model from L3 cache at 200 tok/s, while the remaining 112 verification cores run the big model and consume the full 460.8 GB/s of memory bandwidth. The draft proposes tokens cheaply; the big model checks whole batches of them at once. Result: 60-75 tok/s. Nine times faster than stock.
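The draft/verify split above can be sketched in a few lines. This is an illustrative sketch of greedy speculative decoding, not the project's actual implementation: `draft_next` and `target_next` are hypothetical stand-ins for the two models, and the 16-core/112-core CCD partitioning is an OS-level thread-affinity detail that is deliberately left out.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One draft/verify round; returns the tokens actually emitted.

    draft_next / target_next are callables mapping a token sequence to
    the next token -- hypothetical stand-ins for the small and large
    models. Greedy acceptance: a drafted token is kept only if the
    target model would have produced the same token.
    """
    # Draft phase: propose k tokens autoregressively.
    # (In the real system this runs on the cache-resident draft cores.)
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: the target model checks every proposed position.
    # In a real system this is one batched forward pass over all k
    # tokens, which is what lets the wide verify partition saturate
    # memory bandwidth instead of serializing token by token.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: keep the target's token and stop.
            accepted.append(want)
            break
    else:
        # All k drafts accepted: the target's pass yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

When the draft agrees with the target, each round emits k+1 tokens for one large-model pass; when it diverges, the round still makes progress with the target's own token, so output quality is identical to running the big model alone.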