Apple Silicon Performance

Three architectures compared: macOS native, Phase 2 (EL2 VM on macOS via SPTM cooperation), and bare metal (M3, no macOS). All numbers for Apple Silicon at ~3.5 GHz.

Why SPTM Forces macOS

On M4, SPTM runs at GL2 — a hardware-isolated lateral privilege level enforced by GXF. The causal chain:

1. SPTM owns all page tables (hardware-enforced, silicon-fused boot chain)
2. Can't boot at real EL2 (no page tables = no code execution)
3. Must use Hypervisor.framework for EL2 registers
4. HVF is a macOS userspace framework
5. Must run as a macOS process
6. macOS must be running
7. ALL I/O routes through the macOS kernel

M3 vs M4: The Split

M3 is the last generation where you can run at real EL2 without Apple's permission.

| Property | M3 (PPL) | M4 (SPTM) |
| --- | --- | --- |
| Page table guard | PPL at EL2 (software, standard ARM) | SPTM at GL2 (hardware GXF) |
| Asahi / m1n1 | Boots bare metal (confirmed Jan 2026) | Blocked, no timeline |
| EL2 access | Full, direct | Only inside HVF VM |
| Known CVEs | CVE-2023-38606, CVE-2024-23296 | Zero (unbroken) |
| Our sovereign code | Runs unchanged, bare metal | Runs unchanged, but inside VM |

Three-Way Performance Comparison

All times are typical worst-case figures. Percentages are relative to macOS native (100%).

System Operations

| Operation | macOS Native | Phase 2 (EL2 VM) | Bare Metal (M3) |
| --- | --- | --- | --- |
| Keystroke to screen | 5 ms (100%) | 5 ms (100%) | 15 µs (0.3%) |
| Context switch | 3 µs (100%) | 50 ns (1.7%) | 50 ns (1.7%) |
| Page fault (soft) | 5 µs (100%) | 100 ns (2%) | 100 ns (2%) |
| IPC message | 2 µs (100%) | 80 ns (4%) | 30 ns (1.5%) |
| Thread spawn | 15 µs (100%) | 200 ns (1.3%) | 200 ns (1.3%) |
| malloc (4KB) | 300 ns (100%) | 25 ns (8%) | 25 ns (8%) |
| Scheduling decision | 1.5 µs (100%) | 30 ns (2%) | 30 ns (2%) |

I/O Operations

| Operation | macOS Native | Phase 2 (EL2 VM) | Bare Metal (M3) |
| --- | --- | --- | --- |
| File open | 8 µs (100%) | 10 µs (125%) | 2 µs (25%) |
| File read (4KB) | 5 µs (100%) | 7 µs (140%) | 1.5 µs (30%) |
| File write (4KB) | 6 µs (100%) | 8 µs (133%) | 2 µs (33%) |
| App launch | 200 ms (100%) | 200 ms (100%) | 10 µs (0.005%) |
| Window create | 2 ms (100%) | 2 ms (100%) | 5 µs (0.25%) |
| Window resize | 1 ms (100%) | 1 ms (100%) | 3 µs (0.3%) |
| Mouse click to response | 4 ms (100%) | 4 ms (100%) | 12 µs (0.3%) |
| Scroll event | 3 ms (100%) | 3 ms (100%) | 10 µs (0.3%) |
| Network packet TX | 15 µs (100%) | 17 µs (113%) | 3 µs (20%) |
| Network packet RX | 15 µs (100%) | 17 µs (113%) | 3 µs (20%) |
| Timer interrupt | 2 µs (100%) | 2 µs (100%) | 300 ns (15%) |
| Disk flush | 1 ms (100%) | 1 ms (100%) | 80 µs (8%) |
| Screen blit (1080p) | 2 ms (100%) | 2 ms (100%) | 300 µs (15%) |

Phase 2 operations fall into two categories. Internal ops (context switch, IPC, page fault) run at 1.3-8% of macOS cost — they never leave the VM. I/O ops run at 100-140% — each one pays a ~2.2 µs VM exit/enter tax on top of the macOS path.
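
The 100-140% column follows a simple additive model: each Phase 2 I/O operation costs the macOS time plus a fixed exit/enter tax. A sketch in Python, using 2 µs for the tax (the ~2.2 µs figure rounded to the table's precision):

```python
# Additive model for Phase 2 I/O: macOS time plus a fixed VM exit/enter
# tax (~2 us per operation; the table's ~2.2 us rounded).
VM_TAX_US = 2.0

macos_us = {
    "file_open": 8.0,
    "file_read_4k": 5.0,
    "file_write_4k": 6.0,
    "net_tx": 15.0,
}

def phase2_cost(op: str):
    """Return (Phase 2 time in us, percentage of the macOS baseline)."""
    t = macos_us[op] + VM_TAX_US
    return t, round(100 * t / macos_us[op])

for op in macos_us:
    t, pct = phase2_cost(op)
    print(f"{op}: {t:.0f} us ({pct}%)")
```

This reproduces the table's 125% / 140% / 133% / 113% figures; the fixed tax matters most for the cheapest operations.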

Bare metal eliminates macOS entirely. App launch: 200 ms → 10 µs. Keystroke: 5 ms → 15 µs. No IOKit, no HID, no WindowServer, no event queue chain.

Memory Hierarchy

Raw DRAM speed is a hardware constant. Effective memory performance is ~2x better on bare metal because you own the entire cache hierarchy.

| Level | Size | Latency | Typical contents |
| --- | --- | --- | --- |
| L1 | 192 KB | ~3 ns | hot loop variables |
| L2 | 16 MB | ~10 ns | active working set |
| SLC | 48 MB | ~20 ns | recently used across cores |
| DRAM | 192 GB | ~100 ns | everything else |

Cache Pollution

| | macOS | Bare Metal |
| --- | --- | --- |
| Processes competing for L2/SLC | ~400-600 | 1 |
| Cache evictions from OS | Constant | Zero |
| Effective L2 hit rate | ~85-92% | ~98%+ |

TLB (Translation Lookaside Buffer)

| | macOS | Bare Metal |
| --- | --- | --- |
| Page table levels | 4 (TTBR0 + TTBR1) | 2 (stage 2) |
| TLB entries consumed by kernel | ~30-40% | 0% |
| TLB miss cost | ~20-30 ns (4-level walk) | ~10 ns (2-level) |
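
The miss-cost row is consistent with a walk model of one dependent load per page-table level. The per-level latencies below are assumptions for illustration (walk descriptors mostly hitting cache), not measurements:

```python
# Rough TLB-miss model: a miss costs one dependent page-table load per
# level of the walk. The ~5-6 ns per-level figures are hypothetical.
def tlb_miss_ns(levels: int, per_level_ns: float) -> float:
    """Cost of a TLB miss as a serial chain of table-walk loads."""
    return levels * per_level_ns

macos_walk = tlb_miss_ns(4, 6.0)  # 24.0 ns, inside the ~20-30 ns range
bare_walk = tlb_miss_ns(2, 5.0)   # 10.0 ns, matching the ~10 ns figure
```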

Effective Memory Latency

| Access Pattern | macOS | Bare Metal |
| --- | --- | --- |
| L1 hit (hot loop) | 3 ns | 3 ns (same) |
| L2 hit | 10 ns | ~8 ns |
| SLC hit | 25 ns | ~18 ns |
| DRAM miss | 100 ns | 100 ns (same) |
| Typical working set mix | ~15-20 ns avg | ~8-10 ns avg |
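
The "typical working set mix" row is a hit-rate-weighted average of the per-level latencies. The hit-rate mixes below are illustrative assumptions chosen to land inside the stated ranges, not measured distributions:

```python
# Effective latency = sum over levels of (fraction served there x latency).
# The hit-rate mixes are assumptions: a polluted hierarchy under macOS
# vs an owned hierarchy on bare metal.
def effective_ns(mix: dict, latency_ns: dict) -> float:
    assert abs(sum(mix.values()) - 1.0) < 1e-9  # mix must cover all accesses
    return sum(mix[level] * latency_ns[level] for level in mix)

macos = effective_ns(
    {"L1": 0.60, "L2": 0.20, "SLC": 0.10, "DRAM": 0.10},  # polluted caches
    {"L1": 3, "L2": 10, "SLC": 25, "DRAM": 100},
)
bare = effective_ns(
    {"L1": 0.75, "L2": 0.15, "SLC": 0.06, "DRAM": 0.04},  # owned hierarchy
    {"L1": 3, "L2": 8, "SLC": 18, "DRAM": 100},
)
# macos ~16.3 ns, bare ~8.5 ns: roughly the 2x claimed above
```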

GPU Analysis

CPU bare metal is almost free — boot at EL2, write registers, 50x faster. GPU bare metal requires ~20,000 lines of driver.

GPU Driver Effort

| Component | Effort | Notes |
| --- | --- | --- |
| GPU firmware loading (RTKit) | ~2,000 lines | Asahi has this for M1/M2 |
| DART (GPU IOMMU) setup | ~1,500 lines | Asahi has this |
| Command buffer format | ~3,000 lines | Reverse-engineered for M1/M2 |
| Power domain management | ~1,000 lines | Asahi has this |
| Shader compiler | ~10,000+ lines | Asahi has Mesa/Gallium |
| M3-specific changes | Unknown | Dynamic Caching, ray tracing |

GPU Performance Comparison

| Operation | macOS Metal | Bare Metal |
| --- | --- | --- |
| Submit draw call | ~5 µs | ~200 ns |
| Buffer upload (CPU→GPU) | ~3 µs (copy) | 0 ns (already there) |
| Texture bind | ~1 µs | ~100 ns |
| Pipeline state switch | ~8 µs | ~500 ns |
| Actual shader execution | Same | Same |
| DRAM bandwidth | Same | Same |
| Full frame render (complex) | 8 ms | ~7 ms |

Summary: CPU bare metal = 50-15,000x faster (easy). GPU bare metal = 10-30% faster (hard, 20K lines). GPU compute = identical.

M3 Ultra vs M5 Ultra

Apple skipped M4 Ultra. The next Ultra is M5, expected June-September 2026. M5 may use TSMC SoIC chiplets.

| Spec | M3 Ultra | M5 Ultra (projected) |
| --- | --- | --- |
| Architecture | 2x M3 Max (UltraFusion) | SoIC chiplets or UltraFusion 2.0 |
| CPU cores | 24 (16P + 8E) | 32-36 (28P + 8E) |
| Memory | 192 GB unified | 256 GB unified |
| Memory bandwidth | 800 GB/s | 800-1,100 GB/s |
| GPU cores | 76 | 80-84 |
| Neural Engine | 31 TOPS | ~76 TOPS |
| GPU Neural Accelerators | None | ~500-700 TOPS (Metal 4 only) |
| Combined AI TOPS | 31 | 600-800 (marketing) |
| Process | TSMC N3B | TSMC N3P |
| EL2 access | Full bare metal (PPL) | Blocked by SPTM |

600-800 TOPS decomposition: the Neural Engine doubles to ~76 TOPS. The explosion is the GPU Neural Accelerators — ~500 TOPS of tensor silicon inside the GPU cores. For token generation (bandwidth-bound) it is irrelevant: 800 TOPS sits idle waiting for memory. For prefill (compute-bound) it is a massive 4-5x improvement.

M5 Ultra macOS vs M3 Ultra Bare Metal

| Operation | M5 Ultra (macOS) | M3 Ultra (bare metal) | M3 wins by |
| --- | --- | --- | --- |
| Context switch | ~2.2 µs | 50 ns | 44x |
| IPC message | ~1.5 µs | 30 ns | 50x |
| Page fault | ~3.5 µs | 100 ns | 35x |
| Keystroke to screen | ~3.5 ms | 15 µs | 233x |
| App launch | ~150 ms | 10 µs | 15,000x |
| Scheduling decision | ~1.1 µs | 30 ns | 37x |
| malloc (4KB) | ~220 ns | 25 ns | 9x |
| File read (4KB) | ~3.5 µs | 1.5 µs | 2.3x |
| Thread spawn | ~11 µs | 200 ns | 55x |
| Screen blit (1080p) | ~1.5 ms | 300 µs | 5x |

A 30% faster CPU under 50 layers of macOS abstraction cannot compete with a slightly slower CPU talking directly to hardware.

LLM Inference

Token generation is memory-bandwidth bound. M3 Ultra does 800 GB/s regardless of software. Bare metal cannot change DRAM physics — but it changes everything else.
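
The bandwidth bound can be made concrete with roofline arithmetic. Assuming a 70B model at Q4 occupies roughly 35 GB (the working figure in this section), every generated token must stream those weights from DRAM once:

```python
# Roofline for token generation: each token streams the full weight set
# from DRAM, so tok/s <= bandwidth / weight size. The 35 GB figure for
# 70B Q4 is the section's working assumption.
def tokens_per_s(bw_gb_s: float, weights_gb: float, utilization: float) -> float:
    return bw_gb_s * utilization / weights_gb

ceiling = tokens_per_s(800, 35, 1.00)     # ~22.9 tok/s hard physical ceiling
macos_base = tokens_per_s(800, 35, 0.60)  # ~13.7 tok/s at 60% utilization
```

No software stack can exceed the ceiling; the only lever is moving utilization toward 100%, which is what the later tables quantify.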

Current Performance (7B Q4)

| Configuration | Token Gen | Prefill | Notes |
| --- | --- | --- | --- |
| macOS Metal (M2 Max) | ~61 tok/s | ~580 tok/s | Best. Uses simdgroup_matrix. |
| Asahi Vulkan GPU (M2 Max) | ~22 tok/s | ~92 tok/s | 2.8x / 6.3x slower. Missing cooperative matrix. |
| CPU-only NEON (M2 Max) | ~25 tok/s | ~40 tok/s | No GPU driver needed. |
| CPU-only NEON (M1 8-core) | ~14 tok/s | ~20 tok/s | Baseline. |

Striking finding: CPU-only is nearly as fast as Asahi's GPU for token generation. Both have the same unified memory bandwidth on Apple Silicon.

The simdgroup_matrix Lock-In Chain

1. LLM inference needs matrix multiply
2. Matrix multiply needs the GPU (or accept 2.8x slower CPU-only)
3. The GPU needs the simdgroup_matrix instruction
4. Only Metal's shader compiler emits it
5. Metal is macOS-only
6. Therefore bare metal LLM inference loses GPU acceleration

The Apple Stack Lock-In

| Layer | What It Does | Open Alternative | TOPS |
| --- | --- | --- | --- |
| MLX | Apple's PyTorch equivalent | llama.cpp, PyTorch | N/A |
| Metal 4 | GPU API, shader compiler | Vulkan (Asahi) | N/A |
| simdgroup_matrix | Hardware 8x8 matmul | Not yet exposed via Vulkan | ~180 TOPS |
| AGX firmware (RTKit) | GPU command scheduling | Asahi kernel driver (Rust) | N/A |
| GPU Neural Accelerators (M5) | Per-core tensor units | Nothing. Metal 4 only. | ~500-700 TOPS |
| Neural Engine (NPU) | Dedicated ML accelerator | CoreML only | ~76 TOPS |

What Bare Metal Wins for LLMs

| Factor | macOS | Bare Metal | Impact |
| --- | --- | --- | --- |
| DRAM bandwidth | 800 GB/s | 800 GB/s | None (hardware constant) |
| Usable memory for weights | ~160 GB of 192 | 192 GB (all of it) | +20% capacity |
| Scheduling jitter | 50-200 µs spikes | Zero | Consistent latency |
| OS memory reservation | ~32 GB | 0 | More KV cache room |
| AMX matrix coprocessor | Via Accelerate.framework | Direct (undocumented) | Possible, unquantified |
| GPU matrix multiply | simdgroup_matrix via Metal | Not yet (RE plan exists) | Currently a loss |
| Neural Engine | CoreML only | Limited driver | Not useful for LLMs |
| M5 Neural Accelerators | Metal 4 TensorOps | No path exists | Permanent loss |

What Sovereign Can Do That Apple Can't

System-level advantages that no general-purpose framework can exploit.

| Gain | Capability | Detail |
| --- | --- | --- |
| 60x | Spec Decode Coordination | 50 ns bare metal vs 3 µs macOS. More draft candidates per round (8-16 vs 4-8), tree speculation, faster rejection recovery. |
| 38 | KV Cache Pools | Zero-prefill conversation switching. 192 GB = 38 simultaneous 70B KV caches. SLC pinning, physical layout, DRAM bank-aligned. |
| 2x | Memory Latency | Identity-mapped physical memory, 2-level page tables, zero TLB waste. Effective working set latency ~8-10 ns vs ~15-20 ns. |
| 24 | AMX Saturation | 24 cores x AMX with 50 ns dispatch. Exploit DRAM bank parallelism — CPU+AMX attention during GPU FFN idle bubbles. |
| 0 ns | Fused Pipeline | LLM generates token → next instruction processes it. Zero kernel crossings. Compounds over 100 agentic steps per second. |

Phase 2 Hybrid: Best of Both Worlds

On M4/M5 (SPTM blocks bare metal), Phase 2 combines sovereign scheduling with Apple's GPU stack.

[Diagram: Phase 2 hybrid. VM side (sovereign speed): 50 ns context switches, 50 ns spec decode dispatch, clean SLC for KV cache, AMX on all CPU cores, zero-boundary pipeline. Host side (macOS): Metal GPU at full speed, simdgroup_matrix, Neural Accelerators (M5), full bandwidth. The two sides exchange HVC "multiply these" / result-back calls: 1.1 µs round trip, negligible vs milliseconds of GPU compute.]

The 1.1 µs HVC round trip is negligible compared to the milliseconds the GPU takes for a 70B matrix multiply. You get sovereign scheduling AND Apple's GPU stack.
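
The "negligible" claim is quick arithmetic, again assuming ~35 GB of Q4 weights for the 70B model:

```python
# The HVC tax in context: one streaming pass over 35 GB of Q4 weights
# at 800 GB/s keeps the GPU busy for ~44 ms, so a 1.1 us hypercall
# round trip per dispatch is on the order of 0.003% overhead.
hvc_round_trip_s = 1.1e-6
gpu_pass_s = 35 / 800                               # ~0.044 s per weight pass
overhead_pct = 100 * hvc_round_trip_s / gpu_pass_s  # ~0.0025%
```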

Aggregate Optimization: 70B + 1B Speculative Decoding

M4 Studio, 587 GB/s, Llama 3.3 70B Q4. Starting from 10 tok/s native, 20 tok/s with spec decode on macOS.

Lever 1: Base Throughput (Bandwidth Utilization)

| Source of waste | macOS overhead | Sovereign | Improvement |
| --- | --- | --- | --- |
| Scheduling jitter | 50-200 µs spikes | Zero | ~8-10% |
| Memory allocator | Virtual indirection | Identity-mapped physical | ~5-8% |
| TLB walks | 20-30 ns per miss | 10 ns (2-level) | ~3-5% |
| SLC pollution | Constant eviction | Zero | ~5-8% |
| Framework overhead | Per-token MLX cost | Bare Forth, zero | ~5-8% |

Lever 2: Speculative Decoding Multiplier

| Parameter | macOS (3 µs dispatch) | Sovereign (50 ns dispatch) |
| --- | --- | --- |
| Tokens drafted per round | ~5 | 12-16 |
| Speculation strategy | Linear (single sequence) | Tree (branch 2-3 paths) |
| Failed verification cost | ~15 µs | ~600 ns |
| Acceptance rate | ~70% (5 tokens) | ~55-65% (longer seqs, tree covers) |
| Effective tokens per verify | ~3.5 | 7-10 |
| Effective multiplier | 2x | 2.5-4x |
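
The effective-tokens rows are the product of tokens drafted and acceptance rate. The specific draft counts and acceptance values below are picked from inside the table's own ranges, purely to show the arithmetic:

```python
# Effective tokens per verification round = drafted x fraction accepted.
# 0.58 and 0.63 are sample points from the table's 55-65% range.
def effective_tokens(drafted: int, acceptance: float) -> float:
    return drafted * acceptance

macos_eff = effective_tokens(5, 0.70)      # ~3.5 tokens per verify
sovereign_lo = effective_tokens(12, 0.58)  # ~7 tokens per verify
sovereign_hi = effective_tokens(16, 0.63)  # ~10 tokens per verify
```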

Combined: M4 Studio Phase 2

| Step | Conservative | Liberal | Mechanism |
| --- | --- | --- | --- |
| Base (no spec decode) | 10 tok/s | 10 tok/s | Your M4 Studio today |
| Sovereign base improvement | 12.5 tok/s (1.25x) | 14.2 tok/s (1.42x) | Utilization: 60% → 75-85% |
| Sovereign spec decode | 31.3 tok/s (2.5x) | 49.7 tok/s (3.5x) | Draft 8-16 tokens, tree |
| AMX parallel (+8%) | 33.8 tok/s | 53.7 tok/s | DRAM bank parallelism |
| Fused pipeline (+3%) | 34.8 tok/s | 55.3 tok/s | Zero-boundary chaining |

M4 Studio bottom line: ~35 tok/s conservative (1.74x vs macOS 20 tok/s), ~55 tok/s liberal (2.77x).
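
Each row multiplies the previous one, so the columns are straight multiplicative stacks over the 10 tok/s base. A check of the final figures:

```python
# Multiplicative stack: base x utilization gain x spec-decode multiplier
# x AMX bonus x fused-pipeline bonus.
def stack(base: float, multipliers: list) -> float:
    for m in multipliers:
        base *= m
    return base

conservative = stack(10, [1.25, 2.5, 1.08, 1.03])  # ~34.8 tok/s
liberal = stack(10, [1.42, 3.5, 1.08, 1.03])       # ~55.3 tok/s
```

The same helper reproduces the M3 Ultra table below from its 13.7 tok/s macOS-equivalent base.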

M3 Ultra Bare Metal

| Step | Conservative | Liberal | Mechanism |
| --- | --- | --- | --- |
| macOS-equivalent base | 13.7 tok/s | 13.7 tok/s | 800 GB/s x 60% |
| Sovereign base | 17.1 tok/s (75%) | 19.4 tok/s (85%) | Full bare metal |
| Sovereign spec decode | 42.8 tok/s (2.5x) | 67.8 tok/s (3.5x) | Zero VM exits |
| AMX parallel (+10%) | 47.0 tok/s | 74.6 tok/s | 24 cores x AMX direct |
| Fused pipeline (+4%) | 48.9 tok/s | 77.6 tok/s | True bare metal |

M3 Ultra bottom line: ~49 tok/s conservative (2.45x vs macOS 20 tok/s), ~78 tok/s liberal (3.88x).

Head-to-Head Summary

All configurations compared for 70B Q4 inference.

| Configuration | Token Gen | Prefill (8K) | KV Pools | Sovereignty |
| --- | --- | --- | --- | --- |
| M4 Studio macOS (current) | 20 tok/s | ~120 tok/s | No | None |
| M4 Studio Phase 2 (conservative) | ~35 tok/s | ~120 tok/s | Yes (38+) | CPU sovereign |
| M4 Studio Phase 2 (liberal) | ~55 tok/s | ~120 tok/s | Yes (38+) | CPU sovereign |
| M3 Ultra bare metal (conservative) | ~49 tok/s | ~160 tok/s | Yes (38) | Full sovereign |
| M3 Ultra bare metal (liberal) | ~78 tok/s | ~160 tok/s | Yes (38) | Full sovereign |
| M3 Ultra + cracked simdgroup | ~78-90 tok/s | ~350 tok/s | Yes (38) | Full sovereign |
| M5 Ultra macOS (800 GB/s) | ~27-30 tok/s | ~960 tok/s | No | None |
| M5 Ultra macOS (1,100 GB/s) | ~38-42 tok/s | ~960 tok/s | No | None |
Even at 1,100 GB/s, M3 Ultra liberal beats M5 Ultra (~78 vs ~42 tok/s). M5's 600-800 TOPS is irrelevant for token gen. KV pools and sovereignty are exclusive to Sixth.

Alternative Architecture: AMD EPYC 9754

Full register access, 128 cores, 768 GB RAM — but NUMA topology instead of flat unified memory.

EPYC vs M3 Ultra

| Spec | EPYC 9754 | M3 Ultra |
| --- | --- | --- |
| Cores | 128 (Zen 4c) | 24 |
| Memory BW | 460.8 GB/s | 800 GB/s |
| Max RAM | 768 GB (up to 6 TB) | 192-512 GB |
| Matrix Accel | AVX-512 VNNI | AMX (per-core) |
| TDP | ~360W | ~150W |
| Cost | ~$8K | ~$5-8K |
[Diagram: Apple M3 Ultra is a flat unified SoC: 24 CPU + 76 GPU + AMX, every core sees the same latency, 800 GB/s unified LPDDR5. "Flat. Uniform. Fast." AMD EPYC 9754 is NUMA multi-CCD: eight CCDs (C0-C7) around an IOD with 100-200 ns cross-CCD hops, 460.8 GB/s of DDR5 across 12 channels. "NUMA. Non-uniform. Deep."]

Capacity: Where EPYC Wins Absolutely

| Model | EPYC 9754 (768 GB) | M3 Ultra (512 GB) |
| --- | --- | --- |
| 70B Q4 (35 GB) | Runs | Runs |
| 405B Q4 (200 GB) | Runs | Runs |
| 405B F16 (810 GB) | Runs (1.5 TB DIMMs) | Can't |
| 671B Q8 (670 GB) | Runs | Can't |
| 671B Q4 (335 GB) | Runs easily | Can't |

DeepSeek R1 671B at Q8 gets 6.2 tok/s on a single EPYC socket. That's a model 10x larger than 70B, at interactive speed, on one CPU. No GPU. No Apple. No framework that can be revoked.
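
The 6.2 tok/s figure only pencils out because DeepSeek R1 is a mixture-of-experts model: each token activates roughly 37B of the 671B parameters (~37 GB at Q8), not the full 670 GB. The ~50% bandwidth utilization below is an assumption chosen to match the observed rate:

```python
# MoE bandwidth math: per token, only the active expert parameters
# stream from memory. DeepSeek R1 has ~37B active parameters; the 50%
# utilization figure is an assumption, not a measurement.
def moe_tokens_per_s(bw_gb_s: float, active_gb: float, utilization: float) -> float:
    return bw_gb_s * utilization / active_gb

dense_bound = moe_tokens_per_s(460.8, 670, 1.00)  # ~0.7 tok/s if it were dense
observed = moe_tokens_per_s(460.8, 37, 0.50)      # ~6.2 tok/s, as reported
```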