Three architectures compared: macOS native, Phase 2 (EL2 VM on macOS via SPTM cooperation), and bare metal (M3, no macOS). All numbers for Apple Silicon at ~3.5 GHz.
On M4, SPTM runs at GL2, a hardware-isolated lateral privilege level enforced by GXF. The consequence: M3 is the last generation where you can run at real EL2 without Apple's permission.
| Property | M3 (PPL) | M4 (SPTM) |
|---|---|---|
| Page table guard | PPL at EL2 (software, standard ARM) | SPTM at GL2 (hardware GXF) |
| Asahi / m1n1 | Boots bare metal (confirmed Jan 2026) | Blocked, no timeline |
| EL2 access | Full, direct | Only inside HVF VM |
| Known CVEs | CVE-2023-38606, CVE-2024-23296 | Zero (unbroken) |
| Our sovereign code | Runs unchanged, bare metal | Runs unchanged, but inside VM |
All times are typical worst-case latencies. Percentages are relative to macOS native (100%).
| Operation | macOS Native | Phase 2 (EL2 VM) | Bare Metal (M3) |
|---|---|---|---|
| Keystroke to screen | 5 ms (100%) | 5 ms (100%) | 15 µs (0.3%) |
| Context switch | 3 µs (100%) | 50 ns (1.7%) | 50 ns (1.7%) |
| Page fault (soft) | 5 µs (100%) | 100 ns (2%) | 100 ns (2%) |
| IPC message | 2 µs (100%) | 80 ns (4%) | 30 ns (1.5%) |
| Thread spawn | 15 µs (100%) | 200 ns (1.3%) | 200 ns (1.3%) |
| malloc (4KB) | 300 ns (100%) | 25 ns (8%) | 25 ns (8%) |
| Scheduling decision | 1.5 µs (100%) | 30 ns (2%) | 30 ns (2%) |
| Operation | macOS Native | Phase 2 (EL2 VM) | Bare Metal (M3) |
|---|---|---|---|
| File open | 8 µs (100%) | 10 µs (125%) | 2 µs (25%) |
| File read (4KB) | 5 µs (100%) | 7 µs (140%) | 1.5 µs (30%) |
| File write (4KB) | 6 µs (100%) | 8 µs (133%) | 2 µs (33%) |
| App launch | 200 ms (100%) | 200 ms (100%) | 10 µs (0.005%) |
| Window create | 2 ms (100%) | 2 ms (100%) | 5 µs (0.25%) |
| Window resize | 1 ms (100%) | 1 ms (100%) | 3 µs (0.3%) |
| Mouse click to response | 4 ms (100%) | 4 ms (100%) | 12 µs (0.3%) |
| Scroll event | 3 ms (100%) | 3 ms (100%) | 10 µs (0.3%) |
| Network packet TX | 15 µs (100%) | 17 µs (113%) | 3 µs (20%) |
| Network packet RX | 15 µs (100%) | 17 µs (113%) | 3 µs (20%) |
| Timer interrupt | 2 µs (100%) | 2 µs (100%) | 300 ns (15%) |
| Disk flush | 1 ms (100%) | 1 ms (100%) | 80 µs (8%) |
| Screen blit (1080p) | 2 ms (100%) | 2 ms (100%) | 300 µs (15%) |
Phase 2 splits into two categories. Internal ops (context switch, IPC, page fault) run at 1.3-8% of macOS cost because they never leave the VM. I/O ops run at 100-140%, paying a ~2.2 µs VM exit/enter tax on top of macOS.
Bare metal eliminates macOS entirely. App launch: 200 ms → 10 µs. Keystroke: 5 ms → 15 µs. No IOKit, no HID, no WindowServer, no event queue chain.
Raw DRAM speed is a hardware constant. Effective memory performance is ~2x better on bare metal because you own the entire cache hierarchy.
| Metric | macOS | Bare Metal |
|---|---|---|
| Processes competing for L2/SLC | ~400-600 | 1 |
| Cache evictions from OS | Constant | Zero |
| Effective L2 hit rate | ~85-92% | ~98%+ |
| Metric | macOS | Bare Metal |
|---|---|---|
| Page table levels | 4 (TTBR0 + TTBR1) | 2 (stage2) |
| TLB entries consumed by kernel | ~30-40% | 0% |
| TLB miss cost | ~20-30 ns (4-level walk) | ~10 ns (2-level) |
| Access Pattern | macOS | Bare Metal |
|---|---|---|
| L1 hit (hot loop) | 3 ns | 3 ns (same) |
| L2 hit | 10 ns | ~8 ns |
| SLC hit | 25 ns | ~18 ns |
| DRAM miss | 100 ns | 100 ns (same) |
| Typical working set mix | ~15-20 ns avg | ~8-10 ns avg |
CPU bare metal is almost free: boot at EL2, program the system registers, and kernel primitives get ~50x faster. GPU bare metal requires ~20,000 lines of driver code.
| Component | Effort | Notes |
|---|---|---|
| GPU firmware loading (RTKit) | ~2,000 lines | Asahi has this for M1/M2 |
| DART (GPU IOMMU) setup | ~1,500 lines | Asahi has this |
| Command buffer format | ~3,000 lines | Reverse-engineered for M1/M2 |
| Power domain management | ~1,000 lines | Asahi has this |
| Shader compiler | ~10,000+ lines | Asahi has Mesa/Gallium |
| M3-specific changes | Unknown | Dynamic Caching, ray tracing |
| Operation | macOS Metal | Bare Metal |
|---|---|---|
| Submit draw call | ~5 µs | ~200 ns |
| Buffer upload (CPU→GPU) | ~3 µs (copy) | 0 ns (already there) |
| Texture bind | ~1 µs | ~100 ns |
| Pipeline state switch | ~8 µs | ~500 ns |
| Actual shader execution | Same | Same |
| DRAM bandwidth | Same | Same |
| Full frame render (complex) | 8 ms | ~7 ms |
Summary: CPU bare metal = 50-15,000x faster (easy). GPU bare metal = 10-30% faster (hard, 20K lines). GPU compute = identical.
Apple skipped M4 Ultra. The next Ultra is M5, expected June-September 2026. M5 may use TSMC SoIC chiplets.
| Spec | M3 Ultra | M5 Ultra (projected) |
|---|---|---|
| Architecture | 2x M3 Max (UltraFusion) | SoIC chiplets or UltraFusion 2.0 |
| CPU cores | 24 (16P + 8E) | 32-36 (28P + 8E) |
| Memory | 192 GB unified | 256 GB unified |
| Memory bandwidth | 800 GB/s | 800-1,100 GB/s |
| GPU cores | 76 | 80-84 |
| Neural Engine | 31 TOPS | ~76 TOPS |
| GPU Neural Accelerators | None | ~500-700 TOPS (Metal 4 only) |
| Combined AI TOPS | 31 | 600-800 (marketing) |
| Process | TSMC N3B | TSMC N3P |
| EL2 access | Full bare metal (PPL) | Blocked by SPTM |
Decomposing the 600-800 TOPS: the Neural Engine grows from 31 to ~76 TOPS. The explosion is the GPU Neural Accelerators: ~500 TOPS of tensor silicon inside the GPU cores. For token generation (bandwidth-bound) this is irrelevant; 800 TOPS sits idle waiting for memory. For prefill (compute-bound) it is a massive 4-5x improvement.
| Operation | M5 Ultra (macOS) | M3 Ultra (bare metal) | M3 wins by |
|---|---|---|---|
| Context switch | ~2.2 µs | 50 ns | 44x |
| IPC message | ~1.5 µs | 30 ns | 50x |
| Page fault | ~3.5 µs | 100 ns | 35x |
| Keystroke to screen | ~3.5 ms | 15 µs | 233x |
| App launch | ~150 ms | 10 µs | 15,000x |
| Scheduling decision | ~1.1 µs | 30 ns | 37x |
| malloc (4KB) | ~220 ns | 25 ns | 9x |
| File read (4KB) | ~3.5 µs | 1.5 µs | 2.3x |
| Thread spawn | ~11 µs | 200 ns | 55x |
| Screen blit (1080p) | ~1.5 ms | 300 µs | 5x |
Token generation is memory-bandwidth bound. M3 Ultra does 800 GB/s regardless of software. Bare metal cannot change DRAM physics — but it changes everything else.
| Configuration | Token Gen | Prefill | Notes |
|---|---|---|---|
| macOS Metal (M2 Max) | ~61 tok/s | ~580 tok/s | Best. Uses simdgroup_matrix. |
| Asahi Vulkan GPU (M2 Max) | ~22 tok/s | ~92 tok/s | 2.8x / 6.3x slower. Missing cooperative matrix. |
| CPU-only NEON (M2 Max) | ~25 tok/s | ~40 tok/s | No GPU driver needed. |
| CPU-only NEON (M1 8-core) | ~14 tok/s | ~20 tok/s | Baseline. |
The striking finding: CPU-only NEON (25 tok/s) actually beats Asahi's GPU path (22 tok/s) for token generation, because both are limited by the same unified memory bandwidth on Apple Silicon.
| Layer | What It Does | Open Alternative | TOPS |
|---|---|---|---|
| MLX | Apple's PyTorch equivalent | llama.cpp, PyTorch | N/A |
| Metal 4 | GPU API, shader compiler | Vulkan (Asahi) | N/A |
| simdgroup_matrix | Hardware 8x8 matmul | Not yet exposed via Vulkan | ~180 TOPS |
| AGX firmware (RTKit) | GPU command scheduling | Asahi kernel driver (Rust) | N/A |
| GPU Neural Accelerators (M5) | Per-core tensor units | Nothing. Metal 4 only. | ~500-700 TOPS |
| Neural Engine (NPU) | Dedicated ML accelerator | CoreML only | ~76 TOPS |
| Factor | macOS | Bare Metal | Impact |
|---|---|---|---|
| DRAM bandwidth | 800 GB/s | 800 GB/s | None (hardware constant) |
| Usable memory for weights | ~160 GB of 192 | 192 GB (all of it) | +20% capacity |
| Scheduling jitter | 50-200 µs spikes | Zero | Consistent latency |
| OS memory reservation | ~32 GB | 0 | More KV cache room |
| AMX matrix coprocessor | Via Accelerate.framework | Direct (undocumented) | Possible, unquantified |
| GPU matrix multiply | simdgroup_matrix via Metal | Not yet (RE plan exists) | Currently a loss |
| Neural Engine | CoreML only | Limited driver | Not useful for LLMs |
| M5 Neural Accelerators | Metal 4 TensorOps | No path exists | Permanent loss |
Five system-level advantages that no general-purpose framework can exploit:

- **Speculative decoding dispatch:** 50 ns bare metal vs 3 µs macOS. More draft candidates per round (8-16 vs 4-8), tree speculation, faster rejection recovery.
- **KV cache pools:** zero-prefill conversation switching. 192 GB = 38 simultaneous 70B KV caches. SLC pinning, physical layout, DRAM bank-aligned.
- **Memory mapping:** identity-mapped physical memory, 2-level page tables, zero TLB waste. Effective working-set latency ~8-10 ns vs ~15-20 ns.
- **AMX parallelism:** 24 cores x AMX with 50 ns dispatch. Exploit DRAM bank parallelism: CPU+AMX attention during GPU FFN idle bubbles.
- **Fused generate-execute pipeline:** LLM generates token → next instruction processes it. Zero kernel crossings. Compounds over 100 agentic steps per second.
On M4/M5, where SPTM blocks bare metal, Phase 2 still captures most of these advantages: the 1.1 µs HVC round trip is negligible next to the milliseconds the GPU spends on a 70B matrix multiply, so you get sovereign scheduling and Apple's GPU stack at once.
M4 Studio, 587 GB/s, Llama 3.3 70B Q4. Starting from 10 tok/s native, 20 tok/s with spec decode on macOS.
| Source of waste | macOS overhead | Sovereign | Improvement |
|---|---|---|---|
| Scheduling jitter | 50-200 µs spikes | Zero | ~8-10% |
| Memory allocator | Virtual indirection | Identity-mapped physical | ~5-8% |
| TLB walks | 20-30 ns per miss | 10 ns (2-level) | ~3-5% |
| SLC pollution | Constant eviction | Zero | ~5-8% |
| Framework overhead | Per-token MLX cost | Bare Forth, zero | ~5-8% |
| Parameter | macOS (3 µs dispatch) | Sovereign (50 ns dispatch) |
|---|---|---|
| Tokens drafted per round | ~5 | 12-16 |
| Speculation strategy | Linear (single sequence) | Tree (branch 2-3 paths) |
| Failed verification cost | ~15 µs | ~600 ns |
| Acceptance rate | ~70% (5 tokens) | ~55-65% (longer seqs, tree covers) |
| Effective tokens per verify | ~3.5 | 7-10 |
| Effective multiplier | 2x | 2.5-4x |
| Step | Conservative | Liberal | Mechanism |
|---|---|---|---|
| Base (no spec decode) | 10 tok/s | 10 tok/s | Your M4 Studio today |
| Sovereign base improvement | 12.5 tok/s (1.25x) | 14.2 tok/s (1.42x) | Utilization: 60% → 75-85% |
| Sovereign spec decode | 31.3 tok/s (2.5x) | 49.7 tok/s (3.5x) | Draft 8-16 tokens, tree |
| AMX parallel (+8%) | 33.8 tok/s | 53.7 tok/s | DRAM bank parallelism |
| Fused pipeline (+3%) | 34.8 tok/s | 55.3 tok/s | Zero-boundary chaining |
| Step | Conservative | Liberal | Mechanism |
|---|---|---|---|
| macOS-equivalent base | 13.7 tok/s | 13.7 tok/s | 800 GB/s x 60% |
| Sovereign base | 17.1 tok/s (75%) | 19.4 tok/s (85%) | Full bare metal |
| Sovereign spec decode | 42.8 tok/s (2.5x) | 67.8 tok/s (3.5x) | Zero VM exits |
| AMX parallel (+10%) | 47.0 tok/s | 74.6 tok/s | 24 cores x AMX direct |
| Fused pipeline (+4%) | 48.9 tok/s | 77.6 tok/s | True bare metal |
All configurations compared for 70B Q4 inference.
| Configuration | Token Gen | Prefill (8K) | KV Pools | Sovereignty |
|---|---|---|---|---|
| M4 Studio macOS (current) | 20 tok/s | ~120 tok/s | No | None |
| M4 Studio Phase 2 (conservative) | ~35 tok/s | ~120 tok/s | Yes (38+) | CPU sovereign |
| M4 Studio Phase 2 (liberal) | ~55 tok/s | ~120 tok/s | Yes (38+) | CPU sovereign |
| M3 Ultra bare metal (conservative) | ~49 tok/s | ~160 tok/s | Yes (38) | Full sovereign |
| M3 Ultra bare metal (liberal) | ~78 tok/s | ~160 tok/s | Yes (38) | Full sovereign |
| M3 Ultra + cracked simdgroup | ~78-90 tok/s | ~350 tok/s | Yes (38) | Full sovereign |
| M5 Ultra macOS (800 GB/s) | ~27-30 tok/s | ~960 tok/s | No | None |
| M5 Ultra macOS (1,100 GB/s) | ~38-42 tok/s | ~960 tok/s | No | None |
Full register access, 128 cores, 768 GB RAM — but NUMA topology instead of flat unified memory.
| Spec | EPYC 9754 | M3 Ultra |
|---|---|---|
| Cores | 128 (Zen 4c) | 24 |
| Memory BW | 460.8 GB/s | 800 GB/s |
| Max RAM | 768 GB (up to 6 TB) | 192-512 GB |
| Matrix Accel | AVX-512 VNNI | AMX (per-core) |
| TDP | ~360W | ~150W |
| Cost | ~$8K | ~$5-8K |
| Model | EPYC 9754 (768 GB) | M3 Ultra (512 GB) |
|---|---|---|
| 70B Q4 (35 GB) | Runs | Runs |
| 405B Q4 (200 GB) | Runs | Runs |
| 405B F16 (810 GB) | Runs (1.5 TB DIMMs) | Can't |
| 671B Q8 (670 GB) | Runs | Can't |
| 671B Q4 (335 GB) | Runs easily | Can't |
DeepSeek R1 671B at Q8 gets 6.2 tok/s on a single EPYC socket. That's a model 10x larger than 70B, at interactive speed, on one CPU. No GPU. No Apple. No framework that can be revoked.