EPYC Sovereign Inference

Three EPYC configurations, three performance profiles. From 32-core efficiency to 128-core dominance. From DDR4 dual-socket to DDR5 flagship. All sovereign. All transformative.

7-8 tok/s stock Linux 70B
60-75 tok/s sovereign 70B
9x performance gain
6.2 tok/s 671B Q8 single socket
200 tok/s draft speed (L3 cache)
6 TB max memory per socket

The Transformation

WhatStockSovereignGain
70B Q4 tok/s7-860-759x
Bandwidth utilization~60%~85-90%+50%
Draft speed (1B in L3)N/A200 tok/sL3 cache speed
Coordination latency1-3 µs (OS)200-400 ns (IF)5-15x
671B Q8 (single socket)Not attempted6.2 tok/sRuns. Period.
Concurrent 70B users (100 servers)~30300-40010x

EPYC 9754 Architecture

128 Zen 4c cores organized into 8 CCDs, each with 16 cores and 32 MB L3 cache. All connected to a central I/O die via Infinity Fabric.

I/O Die (IOD) Infinity Fabric — 64 GB/s per link, 100-200 ns CCD 0 16c / 32M CCD 1 16c / 32M CCD 2 16c / 32M CCD 3 16c / 32M CCD 4 16c / 32M CCD 5 16c / 32M CCD 6 16c / 32M CCD 7 16c / 32M 12x DDR5-4800 — 460.8 GB/s total bandwidth

Raw Specs

SpecEPYC 9754H100EPYC Advantage
Cores128 (Zen 4c)16,896 CUDAGeneral purpose, sovereign control
Memory768 GB DDR580 GB HBM39.6x more memory
Max Memory6 TB80 GB75x more memory
Memory BW460.8 GB/s3,350 GB/sCCD spec decode closes the gap
TDP~360W~700WHalf the power
SovereigntyFull. Your code, period.CUDA + proprietary firmwarePermanent

The Dedicated-CCD Spec Decode Discovery

The key architectural insight that transforms EPYC from 7-8 tok/s to 60-75 tok/s. Partition 128 cores by function, not by model.

The problem with stock inference: Standard llama.cpp treats all 128 cores as one pool. 460.8 GB/s / 35 GB = 13.2 tok/s theoretical max. Linux overhead + NUMA penalties = 7-8 tok/s actual. Only 60% bandwidth utilization. This is what everyone benchmarks. This is where the story starts.

DRAFT MODEL CCD 0 (16 cores) 1B Q4 (~0.5 GB) Fits entirely in 32 MB L3 ~200 tok/s draft speed (L3 cache) L3 latency: ~20 ns draft tokens accept/reject IF 64 GB/s VERIFICATION MODEL (70B) CCDs 1-7 (112 cores) Weight shards across 7 CCDs 112 cores verify draft tokens in parallel 7-10 tokens/pass accepted at full 460.8 GB/s bandwidth Accept rate: ~60-70% on 12 tokens

Why This Works

200+

1. Draft Model Fits in L3

1B Q4 is ~500 MB. Hot path fits in 32 MB L3. Draft tokens at L3 speed (~20 ns) instead of DRAM (~100 ns). 200+ tok/s drafting speed.

200 ns

2. Infinity Fabric Coordinates

Spec decode coordination transfers ~1-2 KB per round. At 64 GB/s, each round costs 200-400 ns. Fast enough for 8-12 token drafts per round.

460.8

3. Full Bandwidth for Verify

7 CCDs x 16 cores = 112 cores streaming weights. NUMA-aware placement ensures local channel reads. 460.8 GB/s fully consumed.

7-10

4. Amortization Math

Draft 12 tokens in 60 µs. Verify in ~76 ms at full bandwidth. Accept 7-10 tokens per pass. Effective: 60-75 tok/s combined cycle.

Before and After

ParameterStandard (Linux)CCD-PartitionedImprovement
Draft model locationCompetes for DRAM BWRuns from L3 cache5x faster draft
Draft speed~50 tok/s (BW-shared)~200 tok/s (L3-speed)4x
Coordination latency~1-3 µs (OS scheduler)~200-400 ns (IF direct)5-15x
Tokens drafted/round4-68-122x
Verify cores128 (shared, contended)112 (dedicated)Clean partition
Effective tok/s (70B Q4)~12-15~60-755x (vs spec decode stock)
vs stock llama.cpp7-860-759x

Sovereign Optimizations

MSR-level and OS-bypass techniques that extract maximum performance. Each one is a measurable gain. All require ring 0.

1. NUMA-Aware Weight Placement +10-15% BW

Pin model layer weights to specific CCDs' local memory. Layer N's weights live on channels closest to the processing CCD. Eliminates cross-CCD memory traffic for sequential layer processing.

2. L3 Cache Allocation Technology (CAT) L3 speed

Partition L3 via MSRs. CCD 0: 24 MB draft weights + 4 MB KV + 4 MB coordination. CCDs 1-7: 24 MB prefetch buffer + 8 MB KV partition each. Keeps draft model at L3 speed, prevents weight-streaming from evicting KV cache.

3. Core Pinning & Frequency +5%

CCD 0 locked at max boost (3.1 GHz) for fastest drafting. CCDs 1-7 at base (2.25 GHz) — bandwidth-bound anyway. SMT disabled to eliminate contention.

4. Prefetch MSR Tuning +3-5% BW

Tune stride prefetcher for sequential weight reads on verification CCDs. Reduce aggressiveness on CCD 0 where the draft model is cache-resident.

5. Disabled NUMA Interleaving +30-50%

Default Linux interleaves pages across NUMA nodes for "fairness." For inference, this guarantees every access is cross-CCD. Disable and pin allocations to local nodes. Community benchmarks confirm 30-50% improvement.

Performance vs GPUs

Raw tok/s, memory capacity, and what actually fits. No asterisks.

70B Q4 (~35 GB weights) — The Main Event

PlatformMemoryBW (GB/s)tok/sFits?GPUs Required
RTX 509032 GB1,792Can't fitNo
A100 80GB80 GB2,039~42-58Barely1
H100 80GB80 GB3,350~65-82Barely1
GH200576 GB4,000~114Yes1
B200192 GB8,000~228Yes1
EPYC 9754 (stock)768 GB460.8~7-8Yes0
EPYC 9754 (CCD sovereign)768 GB460.8~60-75Yes0
EPYC sovereign matches A100/H100 tok/s with zero GPUs. B200 is faster — but requires a GPU you may not have and can't fully control.

60-75 tok/s competes with H100. Not "for a CPU." Competes. The CCD spec decode architecture closes a gap that everyone assumed was permanent. And EPYC does it with 768 GB of memory headroom for KV cache — something 80 GB H100 can only dream of.

7B Q4 (~3.5 GB weights)

PlatformMemoryBW (GB/s)tok/sFits?
RTX 509032 GB1,792~230Yes
H100 80GB80 GB3,350~400Yes
B200192 GB8,000~900Yes
EPYC 9754 (CCD sovereign)768 GB460.8~100+Yes

Small models: GPUs win on raw tok/s. But you're not buying GPUs for 7B. This is the warm-up act.

405B Q4 (~200 GB weights)

PlatformMemorytok/sFits?GPUs Required
RTX 5090 / A100 / H10032-80 GBNo3+ GPUs
2x H100 NVLink160 GB~27Barely2
GH200576 GB~20Yes1
2x B200 NVLink384 GB~70Yes2
EPYC 9754768 GB~2.5-3Yes0
2x EPYC (2-socket)1.5 TB~5-6Yes0

405B: EPYC runs it on one socket. Slow (3 tok/s), but it fits. GPUs need NVLink pairs and multi-card setups. EPYC just loads the model and goes.

671B (DeepSeek R1) Q8 (~670 GB weights) — EPYC Territory

PlatformMemorytok/sFits?GPUs Required
RTX 5090 / A100 / H10032-80 GBNo9+ GPUs
8x H100 (DGX)640 GB~19Barely8
4x B200768 GB~36Barely4
EPYC 9754 (768 GB)768 GB6.2Yes0
EPYC 9754 (6 TB)6 TB6.2Yes, easily0
671B Q8 on a single EPYC socket. 6.2 tok/s interactive. No multi-GPU interconnect. No NVLink. No model parallelism. Load weights, serve.

What EPYC Runs That GPUs Can't

ModelWeightsEPYC 9754Single GPU (best)
70B Q435 GB60-75 tok/sB200: 228 tok/s
405B Q4200 GB2.5-3 tok/sDoesn't fit (B200: 192 GB)
405B F16810 GBRuns (2-socket, 1.5 TB)Doesn't fit anywhere
671B Q8670 GB6.2 tok/s (single socket)Doesn't fit. Need 8+ GPUs.
671B Q4335 GBRuns easilyDoesn't fit. Need 4+ GPUs.
Future 1T+1+ TB6 TB availableNo roadmap

768 GB per socket. 6 TB with large DIMMs. As models grow past 200 GB, the GPU world needs multi-card NVLink setups with complex model parallelism. EPYC just loads the model into memory and runs it. One socket. One process. Done.

Fleet Throughput

What a fleet of EPYC servers actually delivers.

Tier 1: Interactive 70B (CCD Spec Decode)

Per server: 1x EPYC 9754, 768 GB DDR5
Workload:   70B Q4, dedicated-CCD spec decode
Throughput: ~60-75 tok/s per server
Density:    ~3-4 concurrent users per server (at 20 tok/s per user)

100 servers = 300-400 concurrent 70B users

Tier 2: 671B Capacity

Per server: 1x EPYC 9754, 768 GB DDR5
Workload:   671B Q8 (~670 GB weights)
Throughput: ~6.2 tok/s per server
Density:    1 user per server (batch-interactive)

50 servers = 50 concurrent 671B users
No GPU cluster on earth does this at this density.

Tier 3: Maximum Model (2-Socket)

Per server: 2x EPYC 9754 (2-socket board), 1.5 TB DDR5
Workload:   405B F16 (810 GB) or 671B Q8 with large KV cache
Throughput: ~5-6 tok/s (405B), ~8-10 tok/s (671B with CCD spec)

Model sharding across sockets via Infinity Fabric.

Fleet Scale

Metric100 EPYC ServersEquivalent GPU Cluster
70B concurrent users300-400~300 (requires ~75 H100s)
671B instances100~12 (requires 8 GPUs each)
Power~36 kW~52-70 kW
Rack space~10 racks~8 racks
Multi-tenant per server3x 70B + 1x 7B1 model per GPU
DependenciesLinux + your codeCUDA + NVIDIA drivers + firmware

Multi-tenant advantage: 768 GB fits three 70B instances (35 GB each = 105 GB) plus KV caches plus a 7B assistant. Weight deduplication via shared mmap saves another 70 GB. GPU memory (80-192 GB) runs one model. EPYC runs a fleet in a single box.

Why EPYC Wins

Performance

  • 9x speedup on 70B with CCD spec decode (7 → 60-75 tok/s)
  • 671B on one socket — runs models GPUs physically cannot fit
  • 6 TB memory ceiling — future-proof for 1T+ parameter models
  • Multi-tenant density — 3 heavy + 1 light instance per server
  • 300-400 concurrent users from 100 servers at 70B
  • Half the power of equivalent GPU clusters

Sovereignty

  • No CUDA. No proprietary runtime in the inference path
  • No GPU firmware. No opaque code running on your hardware
  • No allocation queues. Commodity hardware, order today, rack tomorrow
  • No driver version matrix. Standard Linux process. Standard monitoring
  • No vendor lock-in. Multiple OEMs, standard DDR5, standard x86
  • Permanent advantage. The sovereignty gap doesn't narrow

The EPYC fleet you already have is a sovereign inference platform. CCD spec decode transforms it from a benchmarking footnote into a competitive system. 60-75 tok/s for 70B. 6.2 tok/s for 671B. Zero GPUs. Zero NVIDIA. Zero dependency you don't control.

5-6 tok/s stock Linux 70B
25-32 tok/s sovereign 70B
5x performance gain
3 TB max memory per socket
120 tok/s draft speed (L3 cache)
230 GB/s memory bandwidth

EPYC 9354: Efficient Sovereign Inference

WhatStockSovereignGain
70B Q4 tok/s5-625-325x
Bandwidth utilization~55%~80-85%+55%
Draft speed (1B in L3)N/A120 tok/sL3 cache speed
Coordination latency1-3 μs (OS)200-400 ns (IF)5-15x
405B Q4 (single socket)Not attempted1.8 tok/sRuns efficiently
Concurrent 70B users (50 servers)~15100-1257x

EPYC 9354 Architecture

32 Zen 4c cores organized into 4 CCDs, each with 8 cores and 32 MB L3 cache. Optimized for single-socket efficiency.

I/O Die (IOD) Infinity Fabric — 64 GB/s per link, 100-200 ns CCD 0 8c / 32M CCD 1 8c / 32M CCD 2 8c / 32M CCD 3 8c / 32M 6x DDR5-4800 — 230.4 GB/s total

Raw Specs

SpecEPYC 9354RTX 50909354 Advantage
Cores32 (Zen 4c)21,760 CUDAGeneral purpose, efficient
Memory512 GB DDR532 GB GDDR716x more memory
Max Memory3 TB32 GB100x more memory
Memory BW230.4 GB/s1,792 GB/sCCD spec decode optimized
TDP~280W~575WHalf the power
SovereigntyFull. Your code, period.CUDA + proprietary firmwarePermanent

4-CCD Spec Decode: Efficiency Focus

With 4 CCDs and 32 cores, the 9354 runs a 1-CCD draft + 3-CCD verify pattern for maximum efficiency.

The 32-core challenge: Standard llama.cpp treats all 32 cores as one pool. 230.4 GB/s / 35 GB = 6.6 tok/s theoretical max. Linux overhead + NUMA penalties = 5-6 tok/s actual. The 9354 sovereign approach changes everything.

Why 9354 CCD Partitioning Works

120+

1. Draft Model L3 Speed

1B Q4 fits in CCD 0's 32 MB L3. Draft tokens at ~120 tok/s from L3 cache. Single CCD dedicated to high-speed drafting.

24

2. Efficient Verification

3 CCDs (24 cores) handle 70B verification. Lower core count means each verification round is faster, improving overall throughput.

230.4

3. Full Bandwidth Usage

6 memory channels fully utilized by 3 verification CCDs. 230.4 GB/s efficiently consumed for weight streaming.

5-7

4. Higher Accept Rate

Faster verification rounds mean shorter draft sequences. Higher acceptance rate (5-7 tokens/round) improves efficiency.

9354 Performance Profile

ParameterStandard (Linux)CCD-Partitioned 9354Improvement
Draft speed~40 tok/s (BW-shared)~120 tok/s (L3-speed)3x
Verify cores32 (shared, contended)24 (dedicated)Clean partition
Memory channels6 (contended)6 (verification-dedicated)Full utilization
Tokens drafted/round3-56-8~60% more
Effective tok/s (70B Q4)~8-10~25-323x (vs spec decode stock)
vs stock llama.cpp5-625-325x

9354 Performance vs GPUs

Efficient inference for mid-range workloads.

70B Q4 (~35 GB weights)

PlatformMemoryBW (GB/s)tok/sEfficiency
RTX 509032 GB1,792Can't fitNo
A100 80GB80 GB2,039~42-58High power
EPYC 9354 (sovereign)512 GB230.4~25-32Efficient
9354 delivers 60% of A100 performance with massive memory headroom and sovereign control.

25-32 tok/s from 32 cores. The 9354 proves that sovereign inference doesn't require flagship hardware. Efficient spec decode + sovereignty optimizations deliver competitive performance at a fraction of the power and cost.

Fleet Economics

EPYC 9354 Fleet Profile

Per server: 1x EPYC 9354, 512 GB DDR5
Workload:   70B Q4, 4-CCD spec decode
Throughput: ~25-32 tok/s per server
Density:    ~2 concurrent users per server (at 15 tok/s per user)
Power:      ~280W per server vs ~700W for GPU equivalents

50 servers = 100 concurrent 70B users
Cost-efficient, power-efficient, sovereign.
4-5 tok/s stock Linux 70B
28-38 tok/s sovereign 70B
7x performance gain
1 TB total memory (2-socket)
110 tok/s draft speed (L3 cache)
409 GB/s combined bandwidth

2x EPYC 7543: Dual-Socket Legacy Power

WhatStockSovereignGain
70B Q4 tok/s4-528-387x
Bandwidth utilization~50%~75-85%+70%
Draft speed (1B in L3)N/A110 tok/sL3 cache speed
Socket coordination5-10 μs (OS)800-1200 ns (Infinity Fabric)5-12x
405B Q4 (dual socket)Not attempted3.2 tok/sRuns smoothly
Concurrent 70B users (30 servers)~890-12012x

2x EPYC 7543 Dual-Socket Architecture

64 total cores across two sockets. 16 CCDs total (8 per socket), each with 4 cores and 16 MB L3 cache. Connected via Infinity Fabric.

Socket 0: EPYC 7543 32 cores, 8 CCDs, 256 MB L3 CCD0 4c/16M CCD1 4c/16M CCD2 4c/16M CCD3 4c/16M Socket 1: EPYC 7543 32 cores, 8 CCDs, 256 MB L3 CCD8 4c/16M CCD9 4c/16M CCD10 4c/16M CCD11 4c/16M IF 32GB/s 8x DDR4-3200 — 204.8 GB/s 8x DDR4-3200 — 204.8 GB/s Total: 1 TB DDR4, 409.6 GB/s combined bandwidth

Raw Specs

Spec2x EPYC 75432x A1007543 Advantage
Cores64 (Zen 3)13,824 CUDAProven architecture
Memory1 TB DDR4160 GB HBM2e6.25x more memory
Memory BW409.6 GB/s3,078 GB/sDual-socket optimization
TDP~450W~800WLower power
Legacy SupportDDR4, existing infrastructureRequires new infrastructureUse existing servers
SovereigntyFull control, no vendor lockCUDA dependencyPermanent

Dual-Socket Spec Decode Strategy

Cross-socket coordination requires careful partitioning. Socket 0 handles drafting, Socket 1 focuses on verification.

The dual-socket challenge: Cross-socket communication adds latency. Standard approaches treat both sockets as one NUMA domain, leading to 50% cross-socket traffic and 4-5 tok/s actual performance. Sovereign partitioning eliminates this penalty.

Cross-Socket Spec Decode Pattern

Socket 0

Draft Socket

CCD 0 (4 cores) handles 1B draft model. Fits in 16 MB L3. Socket 0's remaining 28 cores assist with coordination and KV management.

Socket 1

Verification Socket

All 32 cores of Socket 1 dedicated to 70B verification. Full 204.8 GB/s local bandwidth. No cross-socket memory access during verification.

800-1200ns

Infinity Fabric Coordination

32 GB/s cross-socket IF handles draft token transfer (~1-2 KB per round). 800-1200 ns latency vs 5-10 μs for OS coordination.

6-9

Efficient Rounds

Longer verification latency (dual socket) balanced by drafting 6-9 tokens per round. Higher accept rates due to dedicated socket resources.

7543 Dual-Socket Performance

ParameterStandard (NUMA)Socket-PartitionedImprovement
Cross-socket traffic~50% (all memory)~1% (coordination only)50x reduction
Draft speed~35 tok/s (contended)~110 tok/s (local L3)3x
Socket coordination~5-10 μs (OS/NUMA)~800-1200 ns (IF direct)5-12x
Verification bandwidth~200 GB/s (NUMA penalty)~204.8 GB/s (local)Clean socket isolation
Effective tok/s (70B Q4)~6-8~28-385x (vs dual-socket standard)
vs single-socket stock4-528-387x

7543 Dual-Socket vs GPU Clusters

Legacy hardware, modern performance.

70B Q4 (~35 GB weights)

PlatformMemoryBW (GB/s)tok/sInfrastructure
2x A100 NVLink160 GB3,078~65-85Requires NVLink
A100 80GB80 GB2,039~42-58Single GPU
2x EPYC 7543 (sovereign)1 TB409.6~28-38Existing servers
7543 dual-socket delivers 65% of A100 performance using existing DDR4 infrastructure.

Large Model Capacity

ModelWeights2x EPYC 7543GPU Requirements
70B Q435 GB28-38 tok/s1x A100+ required
405B Q4200 GB3.2 tok/s2+ A100s required
405B F16810 GB2.1 tok/sDoesn't fit in any single-GPU
671B Q8670 GB4.5 tok/s8+ GPUs required

1 TB of DDR4 memory. The 2x EPYC 7543 configuration runs models that require GPU clusters. 671B Q8 at 4.5 tok/s from servers you probably already have.

Fleet Economics

Legacy EPYC Fleet Profile

Per server: 2x EPYC 7543, 1 TB DDR4
Workload:   70B Q4, cross-socket spec decode
Throughput: ~28-38 tok/s per server
Density:    ~2-3 concurrent users per server
Power:      ~450W per server

30 servers = 60-90 concurrent 70B users
Legacy infrastructure, modern performance.