Sovereign Management

IPMI steals 8-15%. You can't remove the BMC — it's soldered to the board. But you can replace everything it does. Zero SMIs. Full sensor coverage. The server manages itself.

8-15% throughput recovered
<0.5% residual overhead
0 SMIs from management
100% sensor coverage retained

Architecture: Traditional vs Sovereign

Every BMC function replaced by an inference-compatible alternative. The server manages itself instead of being managed by opaque firmware.

Traditional IPMI Stack BMC (AST2600) IPMI 2.0 / Redfish / SNMP Sensor polling, fan control, SOL, SEL, watchdog KVM capture, virtual media, power capping SMI# / IRQ / DMA Host CPU (EPYC 9754) ALL 128 cores frozen during SMI Invisible stalls: 50-800 µs each, 100-500 Hz 8-15% throughput lost replaces Sovereign Management Stack Layer 3: Fleet Orchestrator HTTP pull from daemons • Grafana • Alertmanager Emergency BMC wake via Redfish (cold standby) HTTP :9100 Layer 2: Sovereign Health Daemon (Core 0) CPU Temp MSR 0xC0010299 DIMM Temp I2C SPD 0x18-0x1F Fan PID I2C controller Power RAPL 0xC001029B Prometheus :9100 endpoint Watchdog Process-level emergency only Layer 1: BMC Cold Standby Silenced. Hard reset & firmware recovery only. <0.5% overhead • 0 SMIs Hardware Safety (always active) CPU THERMTRIP# is hardwired silicon — no firmware needed

1

Layer 1: BMC Cold Standby

+5-8% recovered instantly

Silence the BMC. Keep it for emergencies. Eliminate 95% of overhead in one step.

What this does

The BMC is soldered to your motherboard. You can't remove it. But you can silence everything it does to the host CPU. In cold standby, the BMC is alive (it still has power and network connectivity) but it stops all host-side operations: no sensor polling via SMI, no KVM capture, no SOL, no health checks, no fan control loops through the host.

The BMC retains its ability to perform a hard reset or firmware recovery — the one function that has no software alternative. Everything else gets replaced by Layers 2 and 3.

Every BMC function: silenced or preserved

BMC FunctionDefault StateCold StandbyReplaced By
Sensor polling (temp/voltage/fan)SMI every 1-3sDisabledLayer 2: MSR + I2C
Fan controlThermal loop via SMIDisabledLayer 2: PID over I2C
Power monitoringDCMI pollingDisabledLayer 2: RAPL MSR
Watchdog timerResets serverDisabledLayer 2: Process watchdog
SOL / console redirectUART IRQ per byteDisabledSSH
KVM video captureDMA every 33msDisabledSSH / VNC on demand
SEL loggingSMI per eventDisabledLayer 2: journald
Virtual mediaDMA transfersDisabledNFS / HTTP boot
IPMI in-band (KCS)100-300 SMI/sDisabledLayer 3: Redfish OOB
Hard reset / power cycleAvailablePreservedNo replacement needed
Firmware recoveryAvailablePreservedNo replacement needed
bmc-cold-standby.sh
#!/bin/bash
# bmc-cold-standby.sh — Silence BMC for sovereign management
# Run as root. BMC remains reachable via dedicated NIC for emergencies.

set -e

echo "[1/7] Disabling IPMI in-band (KCS)..."
modprobe -r ipmi_si 2>/dev/null || true
modprobe -r ipmi_devintf 2>/dev/null || true
echo "  In-band IPMI disabled (zero SMIs from IPMI transport)"

echo "[2/7] Disabling SOL and console redirection..."
ipmitool -I lanplus -H $BMC_IP -U admin -P $BMC_PASS sol deactivate 2>/dev/null || true
echo "  SOL deactivated"

echo "[3/7] Disabling watchdog..."
ipmitool -I lanplus -H $BMC_IP -U admin -P $BMC_PASS mc watchdog off 2>/dev/null || true
echo "  BMC watchdog disabled"

echo "[4/7] Setting fans to full speed (BMC exits thermal loop)..."
ipmitool -I lanplus -H $BMC_IP -U admin -P $BMC_PASS raw 0x30 0x45 0x01 0x01 2>/dev/null || \
  ipmitool -I lanplus -H $BMC_IP -U admin -P $BMC_PASS raw 0x30 0x30 0x01 0x00 2>/dev/null || \
  echo "  Set fans manually via BMC web UI"
echo "  Fans at full speed — BMC thermal polling eliminated"

echo "[5/7] Disabling KVM video capture..."
ipmitool -I lanplus -H $BMC_IP -U admin -P $BMC_PASS raw 0x32 0x73 0x01 0x00 2>/dev/null || true
echo "  Video capture disabled (zero DMA)"

echo "[6/7] Clearing SEL and reducing sensor polling..."
ipmitool -I lanplus -H $BMC_IP -U admin -P $BMC_PASS sel clear 2>/dev/null || true
echo "  SEL cleared"

echo "[7/7] Unmounting virtual media..."
ipmitool -I lanplus -H $BMC_IP -U admin -P $BMC_PASS raw 0x32 0x71 0x01 0x00 2>/dev/null || true
echo "  Virtual media unmounted"

echo ""
echo "BMC is now in cold standby."
echo "  - Hard reset: ipmitool -I lanplus -H $BMC_IP chassis power reset"
echo "  - Wake BMC:   re-run sensor polling via Redfish"
echo "  - All monitoring now handled by Layer 2 (sovereign health daemon)"
2

Layer 2: Sovereign Health Daemon

<0.5% overhead • full coverage

Lightweight daemon on the housekeeping core. Replaces every BMC monitoring function. Zero SMIs.

How it works

Core 0 is already your housekeeping core — it runs the OS, handles interrupts, and isn't doing inference. The Sovereign Health Daemon runs on this core as a regular Linux process. It reads sensors directly through MSR registers (CPU temp, power) and I2C buses (DIMM temp, VRM voltage, fan speed). No SMI. No BMC interaction. No firmware in the path.

The daemon exposes a Prometheus endpoint on :9100 that external monitoring can scrape. The scrape happens over the network — zero host CPU interrupts from the monitoring infrastructure.

Direct Sensor Access

Every sensor the BMC reads, we read directly. Same data, zero SMIs.

SensorAccess MethodAddress / PathRead FrequencyCPU Cost
CPU Tctl/TdieMSR (rdmsr)0xC00102991 Hz<1 µs
Per-CCD tempMSR (rdmsr)0xC001029A1 Hz<1 µs
DIMM tempI2C (smbus)/dev/i2c-* addr 0x18-0x1F0.1 Hz<50 µs
Fan RPMI2C (smbus)Fan controller IC1 Hz<50 µs
Fan PWM setI2C (smbus)Fan controller ICOn change<50 µs
Core power (RAPL)MSR (rdmsr)0xC001029B1 Hz<1 µs
Package powerMSR (rdmsr)0xC0010299 (pkg)1 Hz<1 µs
VRM voltageI2C (smbus)VRM controller IC0.1 Hz<50 µs
Total per cycle<0.5 ms/s
All reads from Core 0 only. Inference cores 1-127 are never touched.

Fan Control Without BMC

PID control loop

The daemon runs a simple PID loop: read CPU temp via MSR, compute error from setpoint (e.g., 80°C target), adjust fan PWM via I2C. The loop runs at 1 Hz on Core 0. Total CPU time: <1 ms per second.

Safety: CPU THERMTRIP# is a hardwired silicon signal. If the CPU die temperature exceeds the absolute maximum (~105°C for EPYC), the processor asserts THERMTRIP# directly through hardware — no firmware, no software, no BMC involvement. The VRM shuts down power within microseconds. This is welded into the silicon. Even if our daemon crashes, the fans stop, and the BMC is offline, the CPU will still protect itself.

Read Tctl MSR 0xC0010299 PID Compute error = Tctl - 80°C Set Fan PWM I2C write Sleep 1s Core 0 only Repeat <1 ms/cycle THERMTRIP# hardwired at ~105°C — protects even if daemon crashes

Prometheus Endpoint

Metrics format

The daemon serves standard Prometheus metrics on :9100/metrics. Any Prometheus instance can scrape this endpoint. The scrape is a single HTTP GET over the network — zero SMIs, zero host-side interrupts beyond normal network I/O on Core 0.

example metrics output
# HELP sovereign_cpu_temp_celsius CPU temperature from MSR
# TYPE sovereign_cpu_temp_celsius gauge
sovereign_cpu_temp_celsius{sensor="tctl"} 72.5
sovereign_cpu_temp_celsius{sensor="tdie"} 68.2
sovereign_cpu_temp_celsius{sensor="ccd0"} 65.1
sovereign_cpu_temp_celsius{sensor="ccd1"} 67.3
sovereign_cpu_temp_celsius{sensor="ccd2"} 64.8
sovereign_cpu_temp_celsius{sensor="ccd3"} 66.9
sovereign_cpu_temp_celsius{sensor="ccd4"} 63.2
sovereign_cpu_temp_celsius{sensor="ccd5"} 65.7
sovereign_cpu_temp_celsius{sensor="ccd6"} 64.1
sovereign_cpu_temp_celsius{sensor="ccd7"} 66.4

# HELP sovereign_dimm_temp_celsius DIMM temperature from I2C SPD
# TYPE sovereign_dimm_temp_celsius gauge
sovereign_dimm_temp_celsius{slot="A1"} 42.3
sovereign_dimm_temp_celsius{slot="A2"} 41.8

# HELP sovereign_fan_rpm Fan speed from I2C controller
# TYPE sovereign_fan_rpm gauge
sovereign_fan_rpm{fan="0"} 8200
sovereign_fan_rpm{fan="1"} 8150

# HELP sovereign_power_watts Power draw from RAPL MSR
# TYPE sovereign_power_watts gauge
sovereign_power_watts{domain="package"} 342.7
sovereign_power_watts{domain="cores"} 298.1

# HELP sovereign_inference_active Whether inference is running
# TYPE sovereign_inference_active gauge
sovereign_inference_active 1

Software Watchdog

Process-level, systemd-managed

The BMC watchdog's job was to reset the server if it hanged. Our replacement is more precise: a process-level watchdog that monitors the inference process specifically, managed by systemd. If the inference process dies, systemd restarts it immediately. If the system kernel panics, systemd's WatchdogSec triggers a reboot. No BMC involvement.

Advantage over BMC watchdog: the BMC can't tell the difference between "server is doing inference" and "server is hung." Ours can. It monitors the actual inference process health, not just "is the OS responding to a timer."

sovereign-health.service
[Unit]
Description=Sovereign Health Daemon
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/sovereign-healthd
Restart=always
RestartSec=1
WatchdogSec=30
CPUAffinity=0
Nice=-10

# Capabilities for MSR and I2C access
AmbientCapabilities=CAP_SYS_RAWIO CAP_DAC_READ_SEARCH
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target
3

Layer 3: Fleet Orchestrator

zero host impact

Centralized management replacing iDRAC/iLO/XClarity. HTTP pull from health daemons. Zero host CPU cost.

What this replaces

Traditional fleet management (Dell OpenManage, HPE OneView, Lenovo XClarity) polls each server's BMC via IPMI. Every poll generates SMIs on the host. With 100 servers polled every 30 seconds, each server receives ~2 polls per minute, each triggering multiple SMIs.

The Fleet Orchestrator is an external service (runs on a management node, not on inference servers) that pulls metrics from the Sovereign Health Daemons over HTTP. The inference servers are never interrupted — the daemon serves from a pre-computed buffer. The orchestrator aggregates, alerts, and dashboards.

Integration stack

Prometheus scrapes all health daemons every 15s. Grafana dashboards show fleet-wide thermals, power, and inference throughput. Alertmanager fires alerts for temp thresholds, fan failures, power anomalies. All standard open-source infrastructure — no vendor lock-in.

Emergency BMC access: If a server becomes unreachable (kernel panic, network failure), the orchestrator wakes the BMC from cold standby via Redfish to perform a hard reset. The BMC is the last resort, not the first tool.

Inference-Synchronized Maintenance

BMC firmware updates, BIOS updates, and hardware diagnostics require the BMC to be active. The orchestrator synchronizes these operations with inference workloads:

Drain Server Stop accepting requests Complete active batches Wake BMC Redfish: enable sensors Re-enable monitoring Maintain Firmware update BIOS update / diag Sleep BMC Cold standby script Verify zero SMIs Resume Accept requests Full throughput Zero inference downtime — maintenance happens on drained servers only
fleet orchestrator commands
# Drain a server from the inference pool
curl -X POST http://orchestrator:8080/api/v1/servers/epyc-042/drain

# Wake BMC for maintenance
curl -sk -u admin:password \
  -X POST https://bmc-ip/redfish/v1/Managers/1/Actions/Manager.Reset \
  -d '{"ResetType": "On"}'

# Perform maintenance (example: BMC firmware update)
curl -sk -u admin:password \
  -X POST https://bmc-ip/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate \
  -d '{"ImageURI": "https://firmware-repo/bmc-v2.1.bin"}'

# Sleep BMC (re-apply cold standby)
ssh root@epyc-042 /usr/local/bin/bmc-cold-standby.sh

# Resume inference
curl -X POST http://orchestrator:8080/api/v1/servers/epyc-042/resume

# Prometheus scrape config (add to prometheus.yml)
# scrape_configs:
#   - job_name: 'sovereign-health'
#     scrape_interval: 15s
#     static_configs:
#       - targets: ['epyc-001:9100', 'epyc-002:9100', ...]

Before and After

Traditional IPMI Management

  • IPMI in-band via KCS — 100-500 SMIs/second
  • KVM capture always running — DMA every 33 ms
  • BMC thermal loop — SMI per sensor read
  • Watchdog timer — can reset during inference
  • SOL active — UART interrupts per character
  • SEL filling — ECC events generating SMIs
  • Fleet management polls via IPMI — more SMIs
  • Management overhead: 8-15% throughput loss
  • Sensor blind spots: none (but costs 60 tok/s × 15%)

Sovereign Management

  • Zero IPMI in-band — kernel modules unloaded
  • Zero BMC DMA — video capture disabled
  • PID fan control via I2C — Core 0 only, <1 ms/s
  • Process watchdog — monitors inference, not just OS
  • SSH only — no UART, no SOL
  • journald logging — no SEL, no flash writes
  • HTTP pull from Prometheus — zero host interrupts
  • Management overhead: <0.5%
  • Sensor coverage: 100% (CPU, DIMM, fan, power, VRM)

Cumulative Recovery

What each layer recovers, and what remains.

LayerWhat It DoesOverhead EliminatedResidual After
BaselineStock IPMI, all BMC functions active8-15%
Layer 1BMC cold standby (silence all host interaction)7-14%~1%
Layer 2Sovereign health daemon (replace all monitoring)+0.5%<0.5%
Layer 3Fleet orchestrator (external pull-based monitoring)Remaining polling<0.5%
Layer 1 alone recovers 95% of lost throughput. Layers 2 and 3 replace the monitoring you lost.

One-Shot Setup

Combines all three layers into a single deployment script.

sovereign-setup.sh
#!/bin/bash
# sovereign-setup.sh — Full sovereign management deployment
# Replaces all BMC management with inference-compatible alternatives
# Run as root on each inference server

set -e

BMC_IP=${BMC_IP:-"192.168.1.100"}
BMC_USER=${BMC_USER:-"admin"}
BMC_PASS=${BMC_PASS:-"password"}

echo "=========================================="
echo " Sovereign Management Setup"
echo "=========================================="
echo ""

# ── Layer 1: BMC Cold Standby ──
echo "[Layer 1] Silencing BMC..."

# Unload IPMI kernel modules (eliminates all in-band SMIs)
modprobe -r ipmi_si 2>/dev/null || true
modprobe -r ipmi_devintf 2>/dev/null || true
modprobe -r ipmi_msghandler 2>/dev/null || true
echo "  IPMI kernel modules unloaded"

# Disable watchdog, SOL, video capture via out-of-band IPMI
IPMI="ipmitool -I lanplus -H $BMC_IP -U $BMC_USER -P $BMC_PASS"
$IPMI mc watchdog off 2>/dev/null || true
$IPMI sol deactivate 2>/dev/null || true
$IPMI raw 0x32 0x73 0x01 0x00 2>/dev/null || true
$IPMI raw 0x32 0x71 0x01 0x00 2>/dev/null || true
$IPMI sel clear 2>/dev/null || true

# Set fans to full speed before disabling BMC thermal loop
$IPMI raw 0x30 0x45 0x01 0x01 2>/dev/null || \
  $IPMI raw 0x30 0x30 0x01 0x00 2>/dev/null || true
echo "  BMC silenced (cold standby)"

# ── Layer 2: Sovereign Health Daemon ──
echo "[Layer 2] Installing health daemon..."

# Ensure I2C and MSR access
modprobe i2c-dev 2>/dev/null || true
modprobe msr 2>/dev/null || true

# Install systemd service
cat > /etc/systemd/system/sovereign-health.service <<'UNIT'
[Unit]
Description=Sovereign Health Daemon
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/sovereign-healthd
Restart=always
RestartSec=1
WatchdogSec=30
CPUAffinity=0
Nice=-10
AmbientCapabilities=CAP_SYS_RAWIO CAP_DAC_READ_SEARCH
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target
UNIT

systemctl daemon-reload
systemctl enable sovereign-health
systemctl start sovereign-health
echo "  Health daemon running on Core 0, Prometheus on :9100"

# ── Layer 3: Verify Fleet Integration ──
echo "[Layer 3] Verifying..."

# Test Prometheus endpoint
sleep 2
if curl -sf http://localhost:9100/metrics | grep -q sovereign_cpu_temp; then
  echo "  Prometheus endpoint OK"
else
  echo "  WARNING: Prometheus endpoint not responding yet"
fi

# Verify zero SMIs
SMI_START=$(rdmsr -p 0 0x00000034 2>/dev/null || echo "0")
sleep 5
SMI_END=$(rdmsr -p 0 0x00000034 2>/dev/null || echo "0")
SMI_DELTA=$(( 0x$SMI_END - 0x$SMI_START ))
echo "  SMIs in 5 seconds: $SMI_DELTA (target: 0)"

echo ""
echo "=========================================="
echo " Sovereign Management Active"
echo "=========================================="
echo "  BMC: cold standby (emergency reset only)"
echo "  Sensors: MSR + I2C on Core 0"
echo "  Monitoring: Prometheus :9100"
echo "  Overhead: <0.5%"
echo "  SMIs from management: 0"

The server manages itself. The BMC was designed for a world where servers run dozens of unpredictable workloads and need constant babysitting. Inference servers run one process, consuming every resource, forever. They don't need a management controller generating hundreds of invisible interrupts per second to tell them their temperature. They can read their own temperature. In one microsecond. Without freezing 128 cores.

Layer 1 silences the BMC. Layer 2 replaces it. Layer 3 manages the fleet. Total cost: one housekeeping core reading a few MSRs per second. Total recovery: 8-15% of your inference throughput — the equivalent of adding more servers, for free.