EPYC Sovereign Inference

Three EPYC configurations, three performance profiles. From 32-core efficiency to 128-core dominance. From DDR4 dual-socket to DDR5 flagship. All sovereign. All transformative.

7-8 tok/s stock Linux 70B

60-75 tok/s sovereign 70B

9x performance gain

6.2 tok/s 671B Q8 single socket

200 tok/s draft speed (L3 cache)

6 TB max memory per socket

The Transformation

What	Stock	Sovereign	Gain
70B Q4 tok/s	7-8	60-75	9x
Bandwidth utilization	~60%	~85-90%	+50%
Draft speed (1B in L3)	N/A	200 tok/s	L3 cache speed
Coordination latency	1-3 µs (OS)	200-400 ns (IF)	5-15x
671B Q8 (single socket)	Not attempted	6.2 tok/s	Runs. Period.
Concurrent 70B users (100 servers)	~30	300-400	10x

EPYC 9754 Architecture

128 Zen 4c cores organized into 8 CCDs, each with 16 cores and 32 MB L3 cache. All connected to a central I/O die via Infinity Fabric.

Raw Specs

Spec	EPYC 9754	H100	EPYC Advantage
Cores	128 (Zen 4c)	16,896 CUDA	General purpose, sovereign control
Memory	768 GB DDR5	80 GB HBM3	9.6x more memory
Max Memory	6 TB	80 GB	75x more memory
Memory BW	460.8 GB/s	3,350 GB/s	CCD spec decode closes the gap
TDP	~360W	~700W	Half the power
Sovereignty	Full. Your code, period.	CUDA + proprietary firmware	Permanent

The Dedicated-CCD Spec Decode Discovery

The key architectural insight that transforms EPYC from 7-8 tok/s to 60-75 tok/s. Partition 128 cores by function, not by model.

The problem with stock inference: Standard llama.cpp treats all 128 cores as one pool. 460.8 GB/s / 35 GB = 13.2 tok/s theoretical max. Linux overhead + NUMA penalties = 7-8 tok/s actual. Only 60% bandwidth utilization. This is what everyone benchmarks. This is where the story starts.

Why This Works

200+

1. Draft Model Fits in L3

1B Q4 is ~500 MB. Hot path fits in 32 MB L3. Draft tokens at L3 speed (~20 ns) instead of DRAM (~100 ns). 200+ tok/s drafting speed.

200 ns

2. Infinity Fabric Coordinates

Spec decode coordination transfers ~1-2 KB per round. At 64 GB/s, each round costs 200-400 ns. Fast enough for 8-12 token drafts per round.

460.8

3. Full Bandwidth for Verify

7 CCDs x 16 cores = 112 cores streaming weights. NUMA-aware placement ensures local channel reads. 460.8 GB/s fully consumed.

7-10

4. Amortization Math

Draft 12 tokens in 60 µs. Verify in ~76 ms at full bandwidth. Accept 7-10 tokens per pass. Effective: 60-75 tok/s combined cycle.

Before and After

Parameter	Standard (Linux)	CCD-Partitioned	Improvement
Draft model location	Competes for DRAM BW	Runs from L3 cache	5x faster draft
Draft speed	~50 tok/s (BW-shared)	~200 tok/s (L3-speed)	4x
Coordination latency	~1-3 µs (OS scheduler)	~200-400 ns (IF direct)	5-15x
Tokens drafted/round	4-6	8-12	2x
Verify cores	128 (shared, contended)	112 (dedicated)	Clean partition
Effective tok/s (70B Q4)	~12-15	~60-75	5x (vs spec decode stock)
vs stock llama.cpp	7-8	60-75	9x

Sovereign Optimizations

MSR-level and OS-bypass techniques that extract maximum performance. Each one is a measurable gain. All require ring 0.

1. NUMA-Aware Weight Placement +10-15% BW

Pin model layer weights to specific CCDs' local memory. Layer N's weights live on channels closest to the processing CCD. Eliminates cross-CCD memory traffic for sequential layer processing.

2. L3 Cache Allocation Technology (CAT) L3 speed

Partition L3 via MSRs. CCD 0: 24 MB draft weights + 4 MB KV + 4 MB coordination. CCDs 1-7: 24 MB prefetch buffer + 8 MB KV partition each. Keeps draft model at L3 speed, prevents weight-streaming from evicting KV cache.

3. Core Pinning & Frequency +5%

CCD 0 locked at max boost (3.1 GHz) for fastest drafting. CCDs 1-7 at base (2.25 GHz) — bandwidth-bound anyway. SMT disabled to eliminate contention.

4. Prefetch MSR Tuning +3-5% BW

Tune stride prefetcher for sequential weight reads on verification CCDs. Reduce aggressiveness on CCD 0 where the draft model is cache-resident.

5. Disabled NUMA Interleaving +30-50%

Default Linux interleaves pages across NUMA nodes for "fairness." For inference, this guarantees every access is cross-CCD. Disable and pin allocations to local nodes. Community benchmarks confirm 30-50% improvement.

Performance vs GPUs

Raw tok/s, memory capacity, and what actually fits. No asterisks.

70B Q4 (~35 GB weights) — The Main Event

Platform	Memory	BW (GB/s)	tok/s	Fits?	GPUs Required
RTX 5090	32 GB	1,792	Can't fit	No	—
A100 80GB	80 GB	2,039	~42-58	Barely	1
H100 80GB	80 GB	3,350	~65-82	Barely	1
GH200	576 GB	4,000	~114	Yes	1
B200	192 GB	8,000	~228	Yes	1
EPYC 9754 (stock)	768 GB	460.8	~7-8	Yes	0
EPYC 9754 (CCD sovereign)	768 GB	460.8	~60-75	Yes	0

EPYC sovereign matches A100/H100 tok/s with zero GPUs. B200 is faster — but requires a GPU you may not have and can't fully control.

60-75 tok/s competes with H100. Not "for a CPU." Competes. The CCD spec decode architecture closes a gap that everyone assumed was permanent. And EPYC does it with 768 GB of memory headroom for KV cache — something 80 GB H100 can only dream of.

7B Q4 (~3.5 GB weights)

Platform	Memory	BW (GB/s)	tok/s	Fits?
RTX 5090	32 GB	1,792	~230	Yes
H100 80GB	80 GB	3,350	~400	Yes
B200	192 GB	8,000	~900	Yes
EPYC 9754 (CCD sovereign)	768 GB	460.8	~100+	Yes

Small models: GPUs win on raw tok/s. But you're not buying GPUs for 7B. This is the warm-up act.

405B Q4 (~200 GB weights)

Platform	Memory	tok/s	Fits?	GPUs Required
RTX 5090 / A100 / H100	32-80 GB	—	No	3+ GPUs
2x H100 NVLink	160 GB	~27	Barely	2
GH200	576 GB	~20	Yes	1
2x B200 NVLink	384 GB	~70	Yes	2
EPYC 9754	768 GB	~2.5-3	Yes	0
2x EPYC (2-socket)	1.5 TB	~5-6	Yes	0

405B: EPYC runs it on one socket. Slow (3 tok/s), but it fits. GPUs need NVLink pairs and multi-card setups. EPYC just loads the model and goes.

671B (DeepSeek R1) Q8 (~670 GB weights) — EPYC Territory

Platform	Memory	tok/s	Fits?	GPUs Required
RTX 5090 / A100 / H100	32-80 GB	—	No	9+ GPUs
8x H100 (DGX)	640 GB	~19	Barely	8
4x B200	768 GB	~36	Barely	4
EPYC 9754 (768 GB)	768 GB	6.2	Yes	0
EPYC 9754 (6 TB)	6 TB	6.2	Yes, easily	0

671B Q8 on a single EPYC socket. 6.2 tok/s interactive. No multi-GPU interconnect. No NVLink. No model parallelism. Load weights, serve.

What EPYC Runs That GPUs Can't

Model	Weights	EPYC 9754	Single GPU (best)
70B Q4	35 GB	60-75 tok/s	B200: 228 tok/s
405B Q4	200 GB	2.5-3 tok/s	Doesn't fit (B200: 192 GB)
405B F16	810 GB	Runs (2-socket, 1.5 TB)	Doesn't fit anywhere
671B Q8	670 GB	6.2 tok/s (single socket)	Doesn't fit. Need 8+ GPUs.
671B Q4	335 GB	Runs easily	Doesn't fit. Need 4+ GPUs.
Future 1T+	1+ TB	6 TB available	No roadmap

768 GB per socket. 6 TB with large DIMMs. As models grow past 200 GB, the GPU world needs multi-card NVLink setups with complex model parallelism. EPYC just loads the model into memory and runs it. One socket. One process. Done.

Fleet Throughput

What a fleet of EPYC servers actually delivers.

Tier 1: Interactive 70B (CCD Spec Decode)

Per server: 1x EPYC 9754, 768 GB DDR5
Workload:   70B Q4, dedicated-CCD spec decode
Throughput: ~60-75 tok/s per server
Density:    ~3-4 concurrent users per server (at 20 tok/s per user)

100 servers = 300-400 concurrent 70B users

Tier 2: 671B Capacity

Per server: 1x EPYC 9754, 768 GB DDR5
Workload:   671B Q8 (~670 GB weights)
Throughput: ~6.2 tok/s per server
Density:    1 user per server (batch-interactive)

50 servers = 50 concurrent 671B users
No GPU cluster on earth does this at this density.

Tier 3: Maximum Model (2-Socket)

Per server: 2x EPYC 9754 (2-socket board), 1.5 TB DDR5
Workload:   405B F16 (810 GB) or 671B Q8 with large KV cache
Throughput: ~5-6 tok/s (405B), ~8-10 tok/s (671B with CCD spec)

Model sharding across sockets via Infinity Fabric.

Fleet Scale

Metric	100 EPYC Servers	Equivalent GPU Cluster
70B concurrent users	300-400	~300 (requires ~75 H100s)
671B instances	100	~12 (requires 8 GPUs each)
Power	~36 kW	~52-70 kW
Rack space	~10 racks	~8 racks
Multi-tenant per server	3x 70B + 1x 7B	1 model per GPU
Dependencies	Linux + your code	CUDA + NVIDIA drivers + firmware

Multi-tenant advantage: 768 GB fits three 70B instances (35 GB each = 105 GB) plus KV caches plus a 7B assistant. Weight deduplication via shared mmap saves another 70 GB. GPU memory (80-192 GB) runs one model. EPYC runs a fleet in a single box.

Why EPYC Wins

Performance

9x speedup on 70B with CCD spec decode (7 → 60-75 tok/s)
671B on one socket — runs models GPUs physically cannot fit
6 TB memory ceiling — future-proof for 1T+ parameter models
Multi-tenant density — 3 heavy + 1 light instance per server
300-400 concurrent users from 100 servers at 70B
Half the power of equivalent GPU clusters

Sovereignty

No CUDA. No proprietary runtime in the inference path
No GPU firmware. No opaque code running on your hardware
No allocation queues. Commodity hardware, order today, rack tomorrow
No driver version matrix. Standard Linux process. Standard monitoring
No vendor lock-in. Multiple OEMs, standard DDR5, standard x86
Permanent advantage. The sovereignty gap doesn't narrow

The EPYC fleet you already have is a sovereign inference platform. CCD spec decode transforms it from a benchmarking footnote into a competitive system. 60-75 tok/s for 70B. 6.2 tok/s for 671B. Zero GPUs. Zero NVIDIA. Zero dependency you don't control.

5-6 tok/s stock Linux 70B

25-32 tok/s sovereign 70B

5x performance gain

3 TB max memory per socket

120 tok/s draft speed (L3 cache)

230 GB/s memory bandwidth

EPYC 9354: Efficient Sovereign Inference

What	Stock	Sovereign	Gain
70B Q4 tok/s	5-6	25-32	5x
Bandwidth utilization	~55%	~80-85%	+55%
Draft speed (1B in L3)	N/A	120 tok/s	L3 cache speed
Coordination latency	1-3 μs (OS)	200-400 ns (IF)	5-15x
405B Q4 (single socket)	Not attempted	1.8 tok/s	Runs efficiently
Concurrent 70B users (50 servers)	~15	100-125	7x

EPYC 9354 Architecture

32 Zen 4c cores organized into 4 CCDs, each with 8 cores and 32 MB L3 cache. Optimized for single-socket efficiency.

Raw Specs

Spec	EPYC 9354	RTX 5090	9354 Advantage
Cores	32 (Zen 4c)	21,760 CUDA	General purpose, efficient
Memory	512 GB DDR5	32 GB GDDR7	16x more memory
Max Memory	3 TB	32 GB	100x more memory
Memory BW	230.4 GB/s	1,792 GB/s	CCD spec decode optimized
TDP	~280W	~575W	Half the power
Sovereignty	Full. Your code, period.	CUDA + proprietary firmware	Permanent

4-CCD Spec Decode: Efficiency Focus

With 4 CCDs and 32 cores, the 9354 runs a 1-CCD draft + 3-CCD verify pattern for maximum efficiency.

The 32-core challenge: Standard llama.cpp treats all 32 cores as one pool. 230.4 GB/s / 35 GB = 6.6 tok/s theoretical max. Linux overhead + NUMA penalties = 5-6 tok/s actual. The 9354 sovereign approach changes everything.

Why 9354 CCD Partitioning Works

120+

1. Draft Model L3 Speed

1B Q4 fits in CCD 0's 32 MB L3. Draft tokens at ~120 tok/s from L3 cache. Single CCD dedicated to high-speed drafting.

2. Efficient Verification

3 CCDs (24 cores) handle 70B verification. Lower core count means each verification round is faster, improving overall throughput.

230.4

3. Full Bandwidth Usage

6 memory channels fully utilized by 3 verification CCDs. 230.4 GB/s efficiently consumed for weight streaming.

5-7

4. Higher Accept Rate

Faster verification rounds mean shorter draft sequences. Higher acceptance rate (5-7 tokens/round) improves efficiency.

9354 Performance Profile

Parameter	Standard (Linux)	CCD-Partitioned 9354	Improvement
Draft speed	~40 tok/s (BW-shared)	~120 tok/s (L3-speed)	3x
Verify cores	32 (shared, contended)	24 (dedicated)	Clean partition
Memory channels	6 (contended)	6 (verification-dedicated)	Full utilization
Tokens drafted/round	3-5	6-8	~60% more
Effective tok/s (70B Q4)	~8-10	~25-32	3x (vs spec decode stock)
vs stock llama.cpp	5-6	25-32	5x

9354 Performance vs GPUs

Efficient inference for mid-range workloads.

70B Q4 (~35 GB weights)

Platform	Memory	BW (GB/s)	tok/s	Efficiency
RTX 5090	32 GB	1,792	Can't fit	No
A100 80GB	80 GB	2,039	~42-58	High power
EPYC 9354 (sovereign)	512 GB	230.4	~25-32	Efficient

9354 delivers 60% of A100 performance with massive memory headroom and sovereign control.

25-32 tok/s from 32 cores. The 9354 proves that sovereign inference doesn't require flagship hardware. Efficient spec decode + sovereignty optimizations deliver competitive performance at a fraction of the power and cost.

Fleet Economics

EPYC 9354 Fleet Profile

Per server: 1x EPYC 9354, 512 GB DDR5
Workload:   70B Q4, 4-CCD spec decode
Throughput: ~25-32 tok/s per server
Density:    ~2 concurrent users per server (at 15 tok/s per user)
Power:      ~280W per server vs ~700W for GPU equivalents

50 servers = 100 concurrent 70B users
Cost-efficient, power-efficient, sovereign.

4-5 tok/s stock Linux 70B

28-38 tok/s sovereign 70B

7x performance gain

1 TB total memory (2-socket)

110 tok/s draft speed (L3 cache)

409 GB/s combined bandwidth

2x EPYC 7543: Dual-Socket Legacy Power

What	Stock	Sovereign	Gain
70B Q4 tok/s	4-5	28-38	7x
Bandwidth utilization	~50%	~75-85%	+70%
Draft speed (1B in L3)	N/A	110 tok/s	L3 cache speed
Socket coordination	5-10 μs (OS)	800-1200 ns (Infinity Fabric)	5-12x
405B Q4 (dual socket)	Not attempted	3.2 tok/s	Runs smoothly
Concurrent 70B users (30 servers)	~8	90-120	12x

2x EPYC 7543 Dual-Socket Architecture

64 total cores across two sockets. 16 CCDs total (8 per socket), each with 4 cores and 16 MB L3 cache. Connected via Infinity Fabric.

Raw Specs

Spec	2x EPYC 7543	2x A100	7543 Advantage
Cores	64 (Zen 3)	13,824 CUDA	Proven architecture
Memory	1 TB DDR4	160 GB HBM2e	6.25x more memory
Memory BW	409.6 GB/s	3,078 GB/s	Dual-socket optimization
TDP	~450W	~800W	Lower power
Legacy Support	DDR4, existing infrastructure	Requires new infrastructure	Use existing servers
Sovereignty	Full control, no vendor lock	CUDA dependency	Permanent

Dual-Socket Spec Decode Strategy

Cross-socket coordination requires careful partitioning. Socket 0 handles drafting, Socket 1 focuses on verification.

The dual-socket challenge: Cross-socket communication adds latency. Standard approaches treat both sockets as one NUMA domain, leading to 50% cross-socket traffic and 4-5 tok/s actual performance. Sovereign partitioning eliminates this penalty.

Cross-Socket Spec Decode Pattern

Socket 0

Draft Socket

CCD 0 (4 cores) handles 1B draft model. Fits in 16 MB L3. Socket 0's remaining 28 cores assist with coordination and KV management.

Socket 1

Verification Socket

All 32 cores of Socket 1 dedicated to 70B verification. Full 204.8 GB/s local bandwidth. No cross-socket memory access during verification.

800-1200ns

Infinity Fabric Coordination

32 GB/s cross-socket IF handles draft token transfer (~1-2 KB per round). 800-1200 ns latency vs 5-10 μs for OS coordination.

6-9

Efficient Rounds

Longer verification latency (dual socket) balanced by drafting 6-9 tokens per round. Higher accept rates due to dedicated socket resources.

7543 Dual-Socket Performance

Parameter	Standard (NUMA)	Socket-Partitioned	Improvement
Cross-socket traffic	~50% (all memory)	~1% (coordination only)	50x reduction
Draft speed	~35 tok/s (contended)	~110 tok/s (local L3)	3x
Socket coordination	~5-10 μs (OS/NUMA)	~800-1200 ns (IF direct)	5-12x
Verification bandwidth	~200 GB/s (NUMA penalty)	~204.8 GB/s (local)	Clean socket isolation
Effective tok/s (70B Q4)	~6-8	~28-38	5x (vs dual-socket standard)
vs single-socket stock	4-5	28-38	7x

7543 Dual-Socket vs GPU Clusters

Legacy hardware, modern performance.

70B Q4 (~35 GB weights)

Platform	Memory	BW (GB/s)	tok/s	Infrastructure
2x A100 NVLink	160 GB	3,078	~65-85	Requires NVLink
A100 80GB	80 GB	2,039	~42-58	Single GPU
2x EPYC 7543 (sovereign)	1 TB	409.6	~28-38	Existing servers

7543 dual-socket delivers 65% of A100 performance using existing DDR4 infrastructure.

Large Model Capacity

Model	Weights	2x EPYC 7543	GPU Requirements
70B Q4	35 GB	28-38 tok/s	1x A100+ required
405B Q4	200 GB	3.2 tok/s	2+ A100s required
405B F16	810 GB	2.1 tok/s	Doesn't fit in any single-GPU
671B Q8	670 GB	4.5 tok/s	8+ GPUs required

1 TB of DDR4 memory. The 2x EPYC 7543 configuration runs models that require GPU clusters. 671B Q8 at 4.5 tok/s from servers you probably already have.

Fleet Economics

Legacy EPYC Fleet Profile

Per server: 2x EPYC 7543, 1 TB DDR4
Workload:   70B Q4, cross-socket spec decode
Throughput: ~28-38 tok/s per server
Density:    ~2-3 concurrent users per server
Power:      ~450W per server

30 servers = 60-90 concurrent 70B users
Legacy infrastructure, modern performance.