Three EPYC configurations, three performance profiles. From 32-core efficiency to 128-core dominance. From DDR4 dual-socket to DDR5 flagship. All sovereign. All transformative.
| What | Stock | Sovereign | Gain |
|---|---|---|---|
| 70B Q4 tok/s | 7-8 | 60-75 | 9x |
| Bandwidth utilization | ~60% | ~85-90% | +50% |
| Draft speed (1B in L3) | N/A | 200 tok/s | L3 cache speed |
| Coordination latency | 1-3 µs (OS) | 200-400 ns (IF) | 5-15x |
| 671B Q8 (single socket) | Not attempted | 6.2 tok/s | Runs. Period. |
| Concurrent 70B users (100 servers) | ~30 | 300-400 | 10x |
128 Zen 4c cores organized into 8 CCDs, each with 16 cores and 32 MB L3 cache. All connected to a central I/O die via Infinity Fabric.
| Spec | EPYC 9754 | H100 | EPYC Advantage |
|---|---|---|---|
| Cores | 128 (Zen 4c) | 16,896 CUDA | General purpose, sovereign control |
| Memory | 768 GB DDR5 | 80 GB HBM3 | 9.6x more memory |
| Max Memory | 6 TB | 80 GB | 75x more memory |
| Memory BW | 460.8 GB/s | 3,350 GB/s | CCD spec decode closes the gap |
| TDP | ~360W | ~700W | Half the power |
| Sovereignty | Full. Your code, period. | CUDA + proprietary firmware | Permanent |
The key architectural insight that transforms EPYC from 7-8 tok/s to 60-75 tok/s. Partition 128 cores by function, not by model.
The problem with stock inference: Standard llama.cpp treats all 128 cores as one pool. 460.8 GB/s / 35 GB = 13.2 tok/s theoretical max. Linux overhead + NUMA penalties = 7-8 tok/s actual. Only 60% bandwidth utilization. This is what everyone benchmarks. This is where the story starts.
1B Q4 is ~500 MB. Hot path fits in 32 MB L3. Draft tokens at L3 speed (~20 ns) instead of DRAM (~100 ns). 200+ tok/s drafting speed.
Spec decode coordination transfers ~1-2 KB per round. At 64 GB/s, each round costs 200-400 ns. Fast enough for 8-12 token drafts per round.
7 CCDs x 16 cores = 112 cores streaming weights. NUMA-aware placement ensures local channel reads. 460.8 GB/s fully consumed.
Draft 12 tokens in 60 µs. Verify in ~76 ms at full bandwidth. Accept 7-10 tokens per pass. Effective: 60-75 tok/s combined cycle.
| Parameter | Standard (Linux) | CCD-Partitioned | Improvement |
|---|---|---|---|
| Draft model location | Competes for DRAM BW | Runs from L3 cache | 5x faster draft |
| Draft speed | ~50 tok/s (BW-shared) | ~200 tok/s (L3-speed) | 4x |
| Coordination latency | ~1-3 µs (OS scheduler) | ~200-400 ns (IF direct) | 5-15x |
| Tokens drafted/round | 4-6 | 8-12 | 2x |
| Verify cores | 128 (shared, contended) | 112 (dedicated) | Clean partition |
| Effective tok/s (70B Q4) | ~12-15 | ~60-75 | 5x (vs spec decode stock) |
| vs stock llama.cpp | 7-8 | 60-75 | 9x |
MSR-level and OS-bypass techniques that extract maximum performance. Each one is a measurable gain. All require ring 0.
Pin model layer weights to specific CCDs' local memory. Layer N's weights live on channels closest to the processing CCD. Eliminates cross-CCD memory traffic for sequential layer processing.
Partition L3 via MSRs. CCD 0: 24 MB draft weights + 4 MB KV + 4 MB coordination. CCDs 1-7: 24 MB prefetch buffer + 8 MB KV partition each. Keeps draft model at L3 speed, prevents weight-streaming from evicting KV cache.
CCD 0 locked at max boost (3.1 GHz) for fastest drafting. CCDs 1-7 at base (2.25 GHz) — bandwidth-bound anyway. SMT disabled to eliminate contention.
Tune stride prefetcher for sequential weight reads on verification CCDs. Reduce aggressiveness on CCD 0 where the draft model is cache-resident.
Default Linux interleaves pages across NUMA nodes for "fairness." For inference, this guarantees every access is cross-CCD. Disable and pin allocations to local nodes. Community benchmarks confirm 30-50% improvement.
Raw tok/s, memory capacity, and what actually fits. No asterisks.
| Platform | Memory | BW (GB/s) | tok/s | Fits? | GPUs Required |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | 1,792 | Can't fit | No | — |
| A100 80GB | 80 GB | 2,039 | ~42-58 | Barely | 1 |
| H100 80GB | 80 GB | 3,350 | ~65-82 | Barely | 1 |
| GH200 | 576 GB | 4,000 | ~114 | Yes | 1 |
| B200 | 192 GB | 8,000 | ~228 | Yes | 1 |
| EPYC 9754 (stock) | 768 GB | 460.8 | ~7-8 | Yes | 0 |
| EPYC 9754 (CCD sovereign) | 768 GB | 460.8 | ~60-75 | Yes | 0 |
60-75 tok/s competes with H100. Not "for a CPU." Competes. The CCD spec decode architecture closes a gap that everyone assumed was permanent. And EPYC does it with 768 GB of memory headroom for KV cache — something 80 GB H100 can only dream of.
| Platform | Memory | BW (GB/s) | tok/s | Fits? |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 1,792 | ~230 | Yes |
| H100 80GB | 80 GB | 3,350 | ~400 | Yes |
| B200 | 192 GB | 8,000 | ~900 | Yes |
| EPYC 9754 (CCD sovereign) | 768 GB | 460.8 | ~100+ | Yes |
Small models: GPUs win on raw tok/s. But you're not buying GPUs for 7B. This is the warm-up act.
| Platform | Memory | tok/s | Fits? | GPUs Required |
|---|---|---|---|---|
| RTX 5090 / A100 / H100 | 32-80 GB | — | No | 3+ GPUs |
| 2x H100 NVLink | 160 GB | ~27 | Barely | 2 |
| GH200 | 576 GB | ~20 | Yes | 1 |
| 2x B200 NVLink | 384 GB | ~70 | Yes | 2 |
| EPYC 9754 | 768 GB | ~2.5-3 | Yes | 0 |
| 2x EPYC (2-socket) | 1.5 TB | ~5-6 | Yes | 0 |
405B: EPYC runs it on one socket. Slow (3 tok/s), but it fits. GPUs need NVLink pairs and multi-card setups. EPYC just loads the model and goes.
| Platform | Memory | tok/s | Fits? | GPUs Required |
|---|---|---|---|---|
| RTX 5090 / A100 / H100 | 32-80 GB | — | No | 9+ GPUs |
| 8x H100 (DGX) | 640 GB | ~19 | Barely | 8 |
| 4x B200 | 768 GB | ~36 | Barely | 4 |
| EPYC 9754 (768 GB) | 768 GB | 6.2 | Yes | 0 |
| EPYC 9754 (6 TB) | 6 TB | 6.2 | Yes, easily | 0 |
| Model | Weights | EPYC 9754 | Single GPU (best) |
|---|---|---|---|
| 70B Q4 | 35 GB | 60-75 tok/s | B200: 228 tok/s |
| 405B Q4 | 200 GB | 2.5-3 tok/s | Doesn't fit (B200: 192 GB) |
| 405B F16 | 810 GB | Runs (2-socket, 1.5 TB) | Doesn't fit anywhere |
| 671B Q8 | 670 GB | 6.2 tok/s (single socket) | Doesn't fit. Need 8+ GPUs. |
| 671B Q4 | 335 GB | Runs easily | Doesn't fit. Need 4+ GPUs. |
| Future 1T+ | 1+ TB | 6 TB available | No roadmap |
768 GB per socket. 6 TB with large DIMMs. As models grow past 200 GB, the GPU world needs multi-card NVLink setups with complex model parallelism. EPYC just loads the model into memory and runs it. One socket. One process. Done.
What a fleet of EPYC servers actually delivers.
Per server: 1x EPYC 9754, 768 GB DDR5
Workload: 70B Q4, dedicated-CCD spec decode
Throughput: ~60-75 tok/s per server
Density: ~3-4 concurrent users per server (at 20 tok/s per user)
100 servers = 300-400 concurrent 70B users
Per server: 1x EPYC 9754, 768 GB DDR5
Workload: 671B Q8 (~670 GB weights)
Throughput: ~6.2 tok/s per server
Density: 1 user per server (batch-interactive)
50 servers = 50 concurrent 671B users
No GPU cluster on earth does this at this density.
Per server: 2x EPYC 9754 (2-socket board), 1.5 TB DDR5
Workload: 405B F16 (810 GB) or 671B Q8 with large KV cache
Throughput: ~5-6 tok/s (405B), ~8-10 tok/s (671B with CCD spec)
Model sharding across sockets via Infinity Fabric.
| Metric | 100 EPYC Servers | Equivalent GPU Cluster |
|---|---|---|
| 70B concurrent users | 300-400 | ~300 (requires ~75 H100s) |
| 671B instances | 100 | ~12 (requires 8 GPUs each) |
| Power | ~36 kW | ~52-70 kW |
| Rack space | ~10 racks | ~8 racks |
| Multi-tenant per server | 3x 70B + 1x 7B | 1 model per GPU |
| Dependencies | Linux + your code | CUDA + NVIDIA drivers + firmware |
Multi-tenant advantage: 768 GB fits three 70B instances (35 GB each = 105 GB) plus KV caches plus a 7B assistant. Weight deduplication via shared mmap saves another 70 GB. GPU memory (80-192 GB) runs one model. EPYC runs a fleet in a single box.
The EPYC fleet you already have is a sovereign inference platform. CCD spec decode transforms it from a benchmarking footnote into a competitive system. 60-75 tok/s for 70B. 6.2 tok/s for 671B. Zero GPUs. Zero NVIDIA. Zero dependency you don't control.
| What | Stock | Sovereign | Gain |
|---|---|---|---|
| 70B Q4 tok/s | 5-6 | 25-32 | 5x |
| Bandwidth utilization | ~55% | ~80-85% | +55% |
| Draft speed (1B in L3) | N/A | 120 tok/s | L3 cache speed |
| Coordination latency | 1-3 μs (OS) | 200-400 ns (IF) | 5-15x |
| 405B Q4 (single socket) | Not attempted | 1.8 tok/s | Runs efficiently |
| Concurrent 70B users (50 servers) | ~15 | 100-125 | 7x |
32 Zen 4c cores organized into 4 CCDs, each with 8 cores and 32 MB L3 cache. Optimized for single-socket efficiency.
| Spec | EPYC 9354 | RTX 5090 | 9354 Advantage |
|---|---|---|---|
| Cores | 32 (Zen 4c) | 21,760 CUDA | General purpose, efficient |
| Memory | 512 GB DDR5 | 32 GB GDDR7 | 16x more memory |
| Max Memory | 3 TB | 32 GB | 100x more memory |
| Memory BW | 230.4 GB/s | 1,792 GB/s | CCD spec decode optimized |
| TDP | ~280W | ~575W | Half the power |
| Sovereignty | Full. Your code, period. | CUDA + proprietary firmware | Permanent |
With 4 CCDs and 32 cores, the 9354 runs a 1-CCD draft + 3-CCD verify pattern for maximum efficiency.
The 32-core challenge: Standard llama.cpp treats all 32 cores as one pool. 230.4 GB/s / 35 GB = 6.6 tok/s theoretical max. Linux overhead + NUMA penalties = 5-6 tok/s actual. The 9354 sovereign approach changes everything.
1B Q4 fits in CCD 0's 32 MB L3. Draft tokens at ~120 tok/s from L3 cache. Single CCD dedicated to high-speed drafting.
3 CCDs (24 cores) handle 70B verification. Lower core count means each verification round is faster, improving overall throughput.
6 memory channels fully utilized by 3 verification CCDs. 230.4 GB/s efficiently consumed for weight streaming.
Faster verification rounds mean shorter draft sequences. Higher acceptance rate (5-7 tokens/round) improves efficiency.
| Parameter | Standard (Linux) | CCD-Partitioned 9354 | Improvement |
|---|---|---|---|
| Draft speed | ~40 tok/s (BW-shared) | ~120 tok/s (L3-speed) | 3x |
| Verify cores | 32 (shared, contended) | 24 (dedicated) | Clean partition |
| Memory channels | 6 (contended) | 6 (verification-dedicated) | Full utilization |
| Tokens drafted/round | 3-5 | 6-8 | ~60% more |
| Effective tok/s (70B Q4) | ~8-10 | ~25-32 | 3x (vs spec decode stock) |
| vs stock llama.cpp | 5-6 | 25-32 | 5x |
Efficient inference for mid-range workloads.
| Platform | Memory | BW (GB/s) | tok/s | Efficiency |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 1,792 | Can't fit | No |
| A100 80GB | 80 GB | 2,039 | ~42-58 | High power |
| EPYC 9354 (sovereign) | 512 GB | 230.4 | ~25-32 | Efficient |
25-32 tok/s from 32 cores. The 9354 proves that sovereign inference doesn't require flagship hardware. Efficient spec decode + sovereignty optimizations deliver competitive performance at a fraction of the power and cost.
Per server: 1x EPYC 9354, 512 GB DDR5
Workload: 70B Q4, 4-CCD spec decode
Throughput: ~25-32 tok/s per server
Density: ~2 concurrent users per server (at 15 tok/s per user)
Power: ~280W per server vs ~700W for GPU equivalents
50 servers = 100 concurrent 70B users
Cost-efficient, power-efficient, sovereign.
| What | Stock | Sovereign | Gain |
|---|---|---|---|
| 70B Q4 tok/s | 4-5 | 28-38 | 7x |
| Bandwidth utilization | ~50% | ~75-85% | +70% |
| Draft speed (1B in L3) | N/A | 110 tok/s | L3 cache speed |
| Socket coordination | 5-10 μs (OS) | 800-1200 ns (Infinity Fabric) | 5-12x |
| 405B Q4 (dual socket) | Not attempted | 3.2 tok/s | Runs smoothly |
| Concurrent 70B users (30 servers) | ~8 | 90-120 | 12x |
64 total cores across two sockets. 16 CCDs total (8 per socket), each with 4 cores and 16 MB L3 cache. Connected via Infinity Fabric.
| Spec | 2x EPYC 7543 | 2x A100 | 7543 Advantage |
|---|---|---|---|
| Cores | 64 (Zen 3) | 13,824 CUDA | Proven architecture |
| Memory | 1 TB DDR4 | 160 GB HBM2e | 6.25x more memory |
| Memory BW | 409.6 GB/s | 3,078 GB/s | Dual-socket optimization |
| TDP | ~450W | ~800W | Lower power |
| Legacy Support | DDR4, existing infrastructure | Requires new infrastructure | Use existing servers |
| Sovereignty | Full control, no vendor lock | CUDA dependency | Permanent |
Cross-socket coordination requires careful partitioning. Socket 0 handles drafting, Socket 1 focuses on verification.
The dual-socket challenge: Cross-socket communication adds latency. Standard approaches treat both sockets as one NUMA domain, leading to 50% cross-socket traffic and 4-5 tok/s actual performance. Sovereign partitioning eliminates this penalty.
CCD 0 (4 cores) handles 1B draft model. Fits in 16 MB L3. Socket 0's remaining 28 cores assist with coordination and KV management.
All 32 cores of Socket 1 dedicated to 70B verification. Full 204.8 GB/s local bandwidth. No cross-socket memory access during verification.
32 GB/s cross-socket IF handles draft token transfer (~1-2 KB per round). 800-1200 ns latency vs 5-10 μs for OS coordination.
Longer verification latency (dual socket) balanced by drafting 6-9 tokens per round. Higher accept rates due to dedicated socket resources.
| Parameter | Standard (NUMA) | Socket-Partitioned | Improvement |
|---|---|---|---|
| Cross-socket traffic | ~50% (all memory) | ~1% (coordination only) | 50x reduction |
| Draft speed | ~35 tok/s (contended) | ~110 tok/s (local L3) | 3x |
| Socket coordination | ~5-10 μs (OS/NUMA) | ~800-1200 ns (IF direct) | 5-12x |
| Verification bandwidth | ~200 GB/s (NUMA penalty) | ~204.8 GB/s (local) | Clean socket isolation |
| Effective tok/s (70B Q4) | ~6-8 | ~28-38 | 5x (vs dual-socket standard) |
| vs single-socket stock | 4-5 | 28-38 | 7x |
Legacy hardware, modern performance.
| Platform | Memory | BW (GB/s) | tok/s | Infrastructure |
|---|---|---|---|---|
| 2x A100 NVLink | 160 GB | 3,078 | ~65-85 | Requires NVLink |
| A100 80GB | 80 GB | 2,039 | ~42-58 | Single GPU |
| 2x EPYC 7543 (sovereign) | 1 TB | 409.6 | ~28-38 | Existing servers |
| Model | Weights | 2x EPYC 7543 | GPU Requirements |
|---|---|---|---|
| 70B Q4 | 35 GB | 28-38 tok/s | 1x A100+ required |
| 405B Q4 | 200 GB | 3.2 tok/s | 2+ A100s required |
| 405B F16 | 810 GB | 2.1 tok/s | Doesn't fit in any single-GPU |
| 671B Q8 | 670 GB | 4.5 tok/s | 8+ GPUs required |
1 TB of DDR4 memory. The 2x EPYC 7543 configuration runs models that require GPU clusters. 671B Q8 at 4.5 tok/s from servers you probably already have.
Per server: 2x EPYC 7543, 1 TB DDR4
Workload: 70B Q4, cross-socket spec decode
Throughput: ~28-38 tok/s per server
Density: ~2-3 concurrent users per server
Power: ~450W per server
30 servers = 60-90 concurrent 70B users
Legacy infrastructure, modern performance.