IPMI Impact

The Intelligent Platform Management Interface runs on every datacenter server. It monitors temperatures, controls fans, watches power, and enables remote management. It also silently steals performance from your inference workload. Here's how, why, and what to do about it.

  • 8-15% throughput loss (worst case)
  • 50-800 µs SMI stall per interrupt
  • ALL cores frozen during SMI
  • ~2% residual after mitigation

What IPMI Actually Does to Your Server

IPMI is an open standard for out-of-band server management, first published by Intel, HP, NEC, and Dell in 1998 and revised as IPMI 2.0 in 2004. Every enterprise server ships with a Baseboard Management Controller (BMC) — a dedicated ARM or ARC processor soldered to the motherboard. This processor runs its own OS (typically Linux or a proprietary RTOS), has its own network interface, and operates independently of the host CPU. It can power the server on and off, read sensors, update firmware, and provide a remote console — even when the host OS is crashed or the machine is powered down.

The problem: independently doesn't mean invisibly. The BMC shares physical resources with the host CPU. To read a temperature sensor, it generates an interrupt. To manage power, it triggers System Management Interrupts. To access its firmware or log storage, it may steal memory bus cycles. Every one of these interactions costs your inference workload time.

[Diagram: the host CPU (EPYC 9754, 128 cores running inference, 460.8 GB/s memory bandwidth — every µs counts at 60-75 tok/s) and the BMC (AST2600 or similar: ARM Cortex-A7 running its own Linux, speaking IPMI 2.0 / Redfish / SNMP, handling sensor polling, power management, and SOL via a dedicated or shared NIC) meet across shared physical resources: SMI#/IRQ signals over LPC/eSPI, the PCIe/LPC bus for BMC firmware access, the memory bus for BMC DMA and video capture, network I/O on boards with a shared NIC, the power domain for throttling/capping, and GPIO/I2C for sensors. The BMC is always running, always interrupting.]

The Six Collisions

Every one of these is a measurable performance tax on inference. They stack. In a worst-case datacenter deployment with aggressive monitoring, you can lose 8-15% of your optimized throughput to IPMI overhead alone.
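Losses compound multiplicatively, so summing them slightly overstates what you keep. A back-of-the-envelope sketch — the six per-collision loss values below are illustrative assumptions in the quoted ranges, not measurements:

```shell
# Six assumed per-collision losses (SMI, DMA, power, watchdog, SOL, SEL/fans),
# compounded multiplicatively rather than summed
keep=$(awk 'BEGIN { k = (1-0.05)*(1-0.03)*(1-0.03)*(1-0.02)*(1-0.01)*(1-0.01)
                    printf "%.1f", k * 100 }')
loss=$(awk -v k="$keep" 'BEGIN { printf "%.1f", 100 - k }')
echo "stacked: keep ${keep}% of throughput, lose ${loss}%"
```

A ~14% loss from six individually "small" taxes sits squarely inside the 8-15% worst case.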

1. System Management Interrupts (SMI)

-3 to 5% throughput

The worst offender. Silent, invisible, and it freezes every core simultaneously.

What happens

When the BMC needs to do certain operations — read a sensor through the legacy KCS interface, handle a power management event, or perform a firmware-level task — it triggers a System Management Interrupt (SMI#). This is a hardware interrupt that operates above the operating system: the kernel cannot mask it, defer it, or even observe it. The CPU immediately:

1. Freezes all cores — not one core, ALL 128 cores stop executing inference code
2. Saves processor state to SMRAM (a hidden memory region the OS can't see)
3. Enters System Management Mode (SMM) — ring -2, below the kernel
4. Executes the SMI handler (BIOS/firmware code)
5. Restores state and resumes normal execution

The OS has no visibility. There's no log entry. perf doesn't see it. top doesn't see it. From the kernel's perspective, time simply vanishes. A 200 µs SMI means all 128 cores producing zero tokens for 200 µs, and nobody knows.

Why this is devastating for inference

Your CCD speculative decoding cycle is ~76 ms. A single 200 µs SMI is 0.26% of one cycle — seems small. But BMCs can generate SMIs at rates of 100-500 per second during active monitoring. At 300 SMIs/s × 200 µs each = 60 ms of dead time per second. That's 6% of your throughput — gone, invisible, unexplainable by any OS-level tool.
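The dead-time arithmetic above, in miniature (rate and stall figures as stated):

```shell
smi_rate=300    # SMIs per second during active monitoring
smi_us=200      # µs stalled per SMI (all cores frozen)
dead_us=$(( smi_rate * smi_us ))                  # dead µs per wall-clock second
loss=$(awk -v d="$dead_us" 'BEGIN { printf "%.0f", d / 1e6 * 100 }')
echo "${dead_us} µs/s dead -> ${loss}% of throughput gone"
```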

Worse: SMIs don't respect your core isolation. isolcpus means nothing to SMM. Every SMI freezes your 112 verification cores and your 16 draft cores simultaneously.

SMI Source | Typical Duration | Frequency | Impact at 60 tok/s
Temperature sensor read | 50-100 µs | 1-10 Hz | Minimal
KCS (IPMI over LPC) | 100-300 µs | 10-100 Hz | -1 to 3%
Power management event | 200-800 µs | Varies | Spiky
BIOS error logging | 300-500 µs | On error | Rare
Memory ECC scrubbing (via SMI) | 100-200 µs | Periodic | Predictable
Aggregate (active IPMI session) | — | 100-500 Hz | -3 to 5%

Mitigation

Detect: Use hwlatdetect (from the rt-tests package): it polls the TSC with interrupts disabled and records gaps where the CPU was stolen out from under it — readings above 10 µs indicate SMI activity. On Intel hosts you can also count SMIs directly with perf stat -e msr/smi/ -a; AMD EPYC exposes no equivalent architectural counter, so hwlatdetect is the tool of choice there.

Reduce: Switch IPMI transport from KCS (legacy LPC-based, SMI-heavy) to BT or SSIF (I2C-based, uses regular interrupts instead of SMIs). Configure the BMC to use Redfish over LAN instead of in-band IPMI, which eliminates most host-side SMI triggers.

Minimize sensor polling: Set the BMC's sensor polling interval from the default (often 1-3 seconds for all sensors) to the minimum required. Many sensors can be polled every 30-60 seconds without losing meaningful thermal protection. Each poll cycle saved is an SMI eliminated.
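What the polling interval is worth, as a rough cost model — the sensor count (50) and per-SMI stall (150 µs) are assumptions for illustration:

```shell
# 50 sensors read in-band, ~150 µs SMI stall per read (both assumed)
fast=$(awk 'BEGIN { printf "%.2f", 50 / 2 }')                  # SMIs/s at 2 s polling
slow=$(awk 'BEGIN { printf "%.2f", 50 / 60 }')                 # SMIs/s at 60 s polling
stolen=$(awk 'BEGIN { printf "%.2f", (50/2) * 150 / 1000 }')   # ms/s lost at 2 s
echo "2s polling: ${fast} SMIs/s (${stolen} ms/s); 60s polling: ${slow} SMIs/s"
```

Relaxing the interval cuts the SMI rate by a factor of 30 with no change to thermal protection worth worrying about.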

detect and measure SMI impact
# Install rt-tests for hwlatdetect
sudo apt install -y rt-tests

# Measure SMI-induced latency (runs for 60 seconds)
sudo hwlatdetect --duration=60 --threshold=10
# Any spike >10 µs is likely an SMI. Count them.

# Count SMIs via MSR_SMI_COUNT (Intel-only; MSR 0x34 is not implemented
# on AMD EPYC — use hwlatdetect there)
sudo rdmsr -a 0x00000034   # SMI counter per core
sleep 10
sudo rdmsr -a 0x00000034   # Compare — delta = SMIs in 10s

# Reduce: switch IPMI to Redfish (out-of-band, no SMI)
# ipmitool to change BMC transport:
ipmitool mc setenables system_event_log=off
ipmitool mc setenables oem0=off

# Reduce sensor polling to 60s intervals — the interval is not settable
# through standard ipmitool; configure via the BMC web UI:
# Settings → Sensor Polling → 60s

2. Memory Bus Contention

-1 to 3% bandwidth

The BMC shares the memory bus. When it reads, your inference stalls.

What happens

Many BMC implementations (particularly ASPEED AST2500/2600, the most common) are connected to the host via PCIe or LPC and have DMA access to host memory. The BMC uses this for several purposes:

Video redirection: KVM-over-IP reads the host's video framebuffer from memory. Even when nobody is watching the remote console, many BMCs continuously scan the framebuffer for changes. Each scan reads 2-8 MB from the memory bus.

Virtual media: ISO mounts and virtual USB devices use DMA transfers through host memory. A virtual CD-ROM mount triggers ongoing DMA.

Shared memory regions: The BMC maintains shared memory buffers for IPMI messaging, sensor data records (SDR), and system event logs (SEL). Writing to these regions competes with your weight streaming.

Why this matters for inference

Your optimized inference uses 460.8 GB/s of memory bandwidth to stream 35 GB of weights. That bandwidth is your ceiling — every GB/s lost directly reduces tok/s. BMC DMA traffic is small in absolute terms (maybe 100-500 MB/s), but it creates bus contention at the memory controller level. When a BMC DMA request arrives, the memory controller must arbitrate between the CPU's weight streaming and the BMC's request. This adds latency to CPU memory reads — not because the BMC uses much bandwidth, but because the memory controller stalls the CPU pipeline for a few cycles to service the BMC.
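Since tokens are gated by how fast weights stream, the forward-pass rate scales roughly linearly with usable bandwidth. A sketch with this page's figures — the linear model is a simplification; real arbitration penalties are lumpier:

```shell
# passes/s ≈ usable bandwidth / weights streamed per pass (linear approximation)
passes_full=$(awk 'BEGIN { printf "%.2f", 460.8 / 35 }')          # no contention
passes_cont=$(awk 'BEGIN { printf "%.2f", 460.8 * 0.97 / 35 }')   # 3% lost to BMC
echo "0% contention: ${passes_full} passes/s; 3%: ${passes_cont} passes/s"
```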

Mitigation

Disable KVM video scanning: If you don't need remote console, disable the BMC's video capture. On most BMCs: ipmitool raw 0x32 0x73 0x01 0x00 (vendor-specific). Or through the BMC web interface: Remote Control → Console Redirection → Disabled.

Unmount virtual media: Disconnect any ISO or virtual USB devices. Each one generates continuous DMA.

Use dedicated BMC NIC: If your server has a shared NIC (BMC and host on the same physical port), IPMI network traffic flows through the host's PCIe bus. A dedicated BMC port (standard on most EPYC server boards) eliminates this entirely.

reduce BMC memory contention
# Check if virtual media is mounted (0x32 0x70/0x71 are vendor-specific
# OEM commands — Supermicro shown; consult your BMC's docs)
ipmitool raw 0x32 0x70 0x01
# Unmount any active virtual media
ipmitool raw 0x32 0x71 0x01 0x00

# Disable video capture (vendor-specific — Supermicro example)
ipmitool raw 0x32 0x73 0x01 0x00   # 0x00 = disable

# Verify BMC NIC is dedicated (not shared)
ipmitool lan print 1 | grep "MAC Address"
ip link show                       # compare MAC addresses with host interfaces
# If the BMC's MAC shows up on a host interface, you're sharing a port

# Monitor memory bandwidth to detect BMC contention
perf stat -e amd_l3/event=0x04/ -a -- sleep 5
# Run with and without BMC activity — compare bandwidth

3. Power Capping & Throttling

-0 to 10% (when active)

The BMC controls power policy. Datacenter power limits become inference limits.

What happens

Datacenters have per-rack and per-server power budgets. The BMC enforces these through DCMI (Data Center Manageability Interface), a standard built on top of IPMI. When total server power exceeds the configured cap, the BMC sends a signal to the CPU to throttle — it reduces the allowed TDP, which forces the CPU to drop clock frequency and voltage.

An EPYC 9754 has a default TDP of 360W. A datacenter might cap it at 280W to fit power budget constraints. At 280W, the CPU can't sustain boost clocks on all 128 cores. It drops from 2.25 GHz base to perhaps 1.8 GHz on some CCDs. Your verification bandwidth drops proportionally.
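The frequency ratio translates directly into lost verification bandwidth:

```shell
pct=$(awk 'BEGIN { printf "%.0f", 1.8 / 2.25 * 100 }')        # % of nominal clock
loss=$(awk 'BEGIN { printf "%.0f", (1 - 1.8/2.25) * 100 }')   # proportional loss
echo "1.8 GHz / 2.25 GHz = ${pct}% of nominal -> ~${loss}% throughput loss"
```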

How it collides with inference

Inference is thermally steady-state. Unlike burst workloads (compilation, web requests), inference draws a constant, high power load. 128 cores streaming weights continuously means the CPU runs near its power limit permanently. This is exactly the scenario where power capping is most aggressive.

Even more insidious: PROCHOT# (Processor Hot). If the BMC's thermal monitoring sees a temperature threshold exceeded, it asserts PROCHOT#, which instantly throttles the CPU to minimum frequency. A single sensor reading above threshold can drop your 60 tok/s to 15 tok/s for seconds until temperatures settle.
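Averaged over time, even short excursions are expensive. The 3-seconds-per-minute duty cycle below is an assumed example, not a measurement:

```shell
# 60 tok/s nominal, 15 tok/s while throttled, one 3 s excursion per 60 s window
avg=$(awk 'BEGIN { printf "%.2f", (60*(60-3) + 15*3) / 60 }')
loss=$(awk 'BEGIN { a = (60*57 + 15*3)/60; printf "%.2f", (1 - a/60) * 100 }')
echo "time-weighted average: ${avg} tok/s (${loss}% loss)"
```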

Power Scenario | CPU Behavior | Inference Impact
No cap (360W TDP) | Full frequency, all cores | 0% loss
320W cap | Slight reduction on boost | -2 to 3%
280W cap | Base frequency constrained | -5 to 8%
240W cap (aggressive) | Below base, cores throttled | -10 to 20%
PROCHOT# asserted | Minimum frequency, all cores | -70% (temporary)

Mitigation

Know your power budget. Check the current DCMI power cap: ipmitool dcmi power get_limit. If it's below 360W, you're already throttled. Work with your datacenter team to raise the per-server limit, or accept the throughput reduction.

Optimize the power-frequency curve. Our implementation already sets verification CCDs to 2.25 GHz base clock (Step 5). This is lower power than boost, reducing total draw. If you're power-capped, this matters — you're spending your watt budget on bandwidth-useful work, not on clock speed that doesn't help bandwidth-bound cores.

Monitor PROCHOT#. AMD platforms expose throttle events via MSR. If you see periodic tok/s drops, PROCHOT is the first suspect.

power management
# Check current power cap
ipmitool dcmi power get_limit
# "Power Limit: 360 Watts" = no cap (good)
# "Power Limit: 280 Watts" = you're throttled

# Read current power draw
ipmitool dcmi power reading
# Compare actual draw vs cap — if draw is near cap, you're hitting it

# Check for thermal throttling (AMD)
sudo rdmsr -a 0xC0010063   # P-state status, all cores
# If any core shows a higher P-state than expected, it's throttling

# Monitor PROCHOT events (MSR 0x19C is Intel's IA32_THERM_STATUS —
# not present on AMD EPYC; watch k10temp/hwmon sensors there instead)
sudo rdmsr 0x19C           # IA32_THERM_STATUS (Intel)
# Bit 0 = thermal status, Bit 1 = PROCHOT log (sticky)

# Raise power limit (if you have BMC admin access)
ipmitool dcmi power set_limit limit 400
ipmitool dcmi power activate

4. Watchdog Timer & Health Checks

-0 to 2% (can cause resets)

The BMC watches for signs of life. Inference looks a lot like a hang.

What happens

The BMC's watchdog timer expects the host OS to send periodic heartbeat signals. If the watchdog doesn't receive a heartbeat within its timeout (typically 5-10 minutes), it assumes the server has crashed and takes corrective action — which can mean hard resetting the server, power cycling, or sending an alert.

During sustained inference, the host OS is alive but your isolated cores (remember isolcpus=1-127) are doing nothing but inference. Core 0 handles system tasks including the watchdog pet. But if the system is heavily loaded and the watchdog daemon gets delayed, the BMC may decide your server is dead.

The real danger

The watchdog doesn't just cause jitter — it can kill your entire inference process by resetting the server. This has happened in production. A long inference batch runs for hours, the system is 100% utilized, the watchdog daemon gets a scheduling delay on core 0, the BMC doesn't receive its pet in time, and the server reboots. Your batch job, KV cache state, and any unsaved results are gone.

Some datacenter management tools (Dell OpenManage, HPE iLO, Lenovo XClarity) also send periodic health check commands via IPMI. Each health check triggers an SMI or interrupt to collect system data. Aggressive monitoring platforms can send these every 5-10 seconds.

Mitigation

Extend or disable the watchdog: ipmitool mc watchdog off disables it (newer ipmitool builds also offer a mc watchdog set subcommand for changing the timeout). If your datacenter requires a watchdog, extend the timeout to 30-60 minutes and ensure the watchdog daemon runs on core 0 with high priority.
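A sanity check on the 60-minute figure: the IPMI watchdog countdown is a 16-bit field in 100 ms units, so 3600 s fits with room to spare:

```shell
timeout_s=3600
ticks=$(( timeout_s * 10 ))      # countdown is stored in 100 ms ticks
max_s=$(( 65535 / 10 ))          # 16-bit field -> hard ceiling in seconds
echo "3600 s = ${ticks} ticks; field maximum ~${max_s} s (~109 min)"
```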

Reduce external health checks: Configure your management platform to poll less aggressively during inference workloads. Set SNMP trap-based monitoring instead of poll-based — the server reports problems when they happen instead of being asked constantly.

watchdog management
# Check watchdog status
ipmitool mc watchdog get
# Look for: Timer Use, Countdown, Current State

# Disable watchdog (if allowed by datacenter policy)
ipmitool mc watchdog off

# Or extend to 60 minutes (requires a recent ipmitool with the
# `mc watchdog set` subcommand; exact option syntax varies by version)
ipmitool mc watchdog set timer use os_load timeout 3600 action none

# Ensure the watchdog-petting daemon runs on core 0 (not isolated) —
# e.g. FreeIPMI's bmc-watchdog in daemon mode
taskset -c 0 bmc-watchdog -d

# Switch to trap-based monitoring (no polling)
# In BMC web UI: Configuration → SNMP → Trap Destination
# Disable: periodic sensor scanning over IPMI in-band

5. Serial-over-LAN & Console Redirection

-0.5 to 1% (when active)

Remote console access generates constant interrupts — even when nobody is watching.

What happens

Serial-over-LAN (SOL) redirects the server's serial console to the network. This is how admins get a "terminal" to the server through the BMC, even when SSH is down. SOL works by intercepting the system's UART and forwarding bytes to the BMC's network stack.

The problem: SOL generates interrupts per character. Every byte of console output triggers an interrupt on the host CPU. If your inference process logs anything to stdout/stderr, and the console is redirected, each log line generates dozens of interrupts. Even when SOL is "idle," many implementations poll the UART at 10-100 Hz, generating periodic interrupts.
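A rough sense of scale for a chatty inference process — line rate, line length, and per-interrupt cost are all assumptions:

```shell
irqs=$(( 50 * 120 ))   # 50 log lines/s x 120 chars/line, one IRQ per character
ms=$(awk -v n="$irqs" 'BEGIN { printf "%.0f", n * 2 / 1000 }')    # ~2 µs per IRQ
pct=$(awk -v n="$irqs" 'BEGIN { printf "%.1f", n * 2 / 10000 }')
echo "${irqs} IRQs/s -> ~${ms} ms/s of handler time (~${pct}% of one core)"
```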

Additional overhead: Java-based KVM

Many BMC web interfaces use Java or HTML5 KVM viewers. When a KVM session is active, the BMC continuously captures the host's video output, compresses it, and streams it over the network. On the host side, this means the BMC is performing repeated DMA reads of the video framebuffer — typically every 33-100 ms (10-30 fps). Each DMA read contends with memory bandwidth.

Even if nobody has the KVM viewer open, some BMCs leave the video capture engine running after the last session. You have to explicitly disable it.
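The framebuffer scan is easy to quantify — frame size and rate below are assumed from the ranges above. Note the raw bandwidth share is tiny; as discussed under memory bus contention, the real cost is arbitration latency, not the bytes themselves:

```shell
dma_mbps=$(( 4 * 30 ))   # 4 MB framebuffer x 30 fps capture
frac=$(awk -v d="$dma_mbps" 'BEGIN { printf "%.3f", d / 460800 * 100 }')
echo "${dma_mbps} MB/s of BMC DMA = ${frac}% of the 460.8 GB/s bus"
```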

Mitigation

Close all SOL sessions when not actively debugging. Use SSH for routine access — SSH doesn't touch IPMI at all.

Disable SOL in BIOS if you never need it: BIOS → Server Management → Console Redirection → Disabled.

Disable video capture: After any KVM session, explicitly stop the BMC's capture engine through the BMC web interface or IPMI raw commands.

Redirect inference output to files, not stdout. This eliminates UART traffic entirely.

SOL and console management
# Check active SOL sessions
ipmitool sol info 1

# Deactivate SOL
ipmitool sol deactivate

# Disable console redirection in BIOS (prevents UART polling)
# BIOS → Server Management → Console Redirection → Disabled

# Redirect inference output to file instead of console
./llama-speculative [args] > /var/log/inference.log 2>&1

# Verify no serial/UART interrupts on isolated cores
grep -iE 'ttys|uart|serial' /proc/interrupts
# Counters should stay flat (near-0/s) while inference runs

6. Firmware Updates, SEL Logging & Fan Control

-0.5 to 2% (periodic spikes)

Background BMC operations that spike at the worst times.

What happens

System Event Log (SEL): The BMC logs hardware events — temperature threshold crossings, voltage anomalies, fan failures, ECC errors — to a persistent log stored in BMC flash. Writing to flash via the LPC/eSPI bus generates SMIs. A flurry of correctable ECC errors (common in servers with 768 GB of RAM) can fill the SEL rapidly, each entry causing an SMI.

Fan control: The BMC adjusts fan speeds based on temperature readings. The control loop runs in the BMC firmware, but it reads temperature sensors through the host (I2C/SMBus or direct MSR reads). Aggressive thermal management with fast-reacting fan curves means more frequent sensor reads = more SMIs. And 128 cores running inference is hot — fans will be active.

BMC firmware updates: Some management platforms schedule automatic BMC firmware updates. During a firmware flash, the BMC may be partially unavailable, and certain operations trigger extended SMIs (500-2000 µs) as firmware regions are written. This should never happen during production inference, but automatic update policies can surprise you.

Mitigation

Clear the SEL regularly: A full SEL causes the BMC to work harder (wrap-around handling). ipmitool sel clear. Better: fix the underlying issues causing the events.

Set fan control to manual or full-speed: If your datacenter cooling can handle it, set fans to maximum speed and disable the BMC's thermal control loop. This eliminates all temperature-polling SMIs related to fan adjustments. Yes, it's louder. Your throughput doesn't care.

Disable automatic firmware updates: Schedule BMC firmware updates during maintenance windows, never during inference workloads.

SEL, fan, and firmware management
# Check SEL status and entries
ipmitool sel info
ipmitool sel list | tail -20
# If "Entries" is near "Free Space" limit, clear it:
ipmitool sel clear

# Set fan control to full speed (eliminates thermal polling SMIs)
# Supermicro example:
ipmitool raw 0x30 0x45 0x01 0x01   # 0x01 = full speed
# Dell iDRAC example:
ipmitool raw 0x30 0x30 0x01 0x00   # disable automatic fan control
ipmitool raw 0x30 0x30 0x02 0xff 0x64  # set fans to 100%

# Disable automatic BMC firmware updates
# Check with your management platform (XClarity, iLO, iDRAC)
# Always do firmware updates in scheduled maintenance windows

The Complete Mitigation Plan

You can't remove the BMC — it's soldered to the board. But you can silence it. Here's the priority-ordered checklist for every server in your inference fleet.

Priority | Action | Impact Eliminated | Risk | Reversible
1 | Switch IPMI transport to Redfish / out-of-band | -3 to 5% (SMIs) | None | Yes
2 | Disable KVM video capture | -1 to 2% (DMA) | None | Yes
3 | Raise / remove power cap | -0 to 10% | Power budget | Yes
4 | Extend watchdog to 60 min | Prevents resets | Delayed recovery | Yes
5 | Set fans to full speed | -0.5 to 1% (polling) | Noise | Yes
6 | Close SOL sessions, redirect output to file | -0.5 to 1% (IRQs) | None | Yes
7 | Clear SEL, reduce sensor polling to 60s | -0.5 to 1% (SMIs) | None | Yes
8 | Unmount virtual media | ~0.5% (DMA) | None | Yes
9 | Disable auto firmware updates | Prevents spikes | Manual updates | Yes
All actions are reversible. Priority 1 alone recovers most of the lost throughput.

Before and After

Default Datacenter Configuration

  • IPMI in-band via KCS — 100-300 SMIs per second
  • KVM capture always running — DMA every 33 ms
  • Power cap at 280W — forced frequency reduction
  • Watchdog at 5 min — resets during long batches
  • Fans on auto — thermal polling every 1-3 seconds
  • SOL active — UART interrupts per character
  • SEL filling up — ECC events generating SMIs
  • Total overhead: 8-15% throughput loss
  • Effective: 51-69 tok/s (from 60-75 tok/s potential)

After IPMI Mitigation

  • Redfish out-of-band — zero SMIs from IPMI transport
  • Video capture off — zero DMA from BMC
  • Full TDP (360W) — no frequency throttling
  • Watchdog at 60 min — no false resets
  • Fans at 100% — no thermal polling overhead
  • SOL disabled — SSH only, no UART traffic
  • SEL clear, sensor polling at 60s intervals
  • Residual overhead: ~2% (unavoidable SMIs)
  • Effective: 59-74 tok/s (near theoretical maximum)
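The before/after numbers are internally consistent with the ~2% residual:

```shell
lo=$(awk 'BEGIN { printf "%.1f", 60 * 0.98 }')   # 2% residual on the low end
hi=$(awk 'BEGIN { printf "%.1f", 75 * 0.98 }')   # ... and on the high end
echo "60-75 tok/s ceiling x 0.98 = ${lo}-${hi} tok/s (the 59-74 quoted above)"
```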

The irony: IPMI was designed to help you manage servers. For inference, the management layer is the enemy. Every sensor read, every health check, every background BMC operation exists to make generic datacenter operations safer. But inference isn't generic. It's a single process consuming every resource on the machine. The management overhead that's invisible at 5% CPU load becomes a measurable tax at 100% load.

One-Shot Automation

Run this script on each inference server before starting workloads. It applies all mitigations in the correct order.

ipmi-silence.sh
#!/bin/bash
# ipmi-silence.sh — Minimize IPMI/BMC interference for inference
# Run as root before starting inference workloads

set -e

echo "[1/9] Detecting SMI baseline..."
# MSR 0x34 (MSR_SMI_COUNT) is Intel-only; the fallback covers AMD EPYC
SMI_BEFORE=$(sudo rdmsr -p 0 0x00000034 2>/dev/null || echo "N/A")
echo "  SMI count (core 0): $SMI_BEFORE"

echo "[2/9] Disabling SOL and console redirection..."
ipmitool sol deactivate 2>/dev/null || true
echo "  SOL deactivated"

echo "[3/9] Unmounting virtual media..."
ipmitool raw 0x32 0x71 0x01 0x00 2>/dev/null || true
echo "  Virtual media unmounted"

echo "[4/9] Clearing System Event Log..."
ipmitool sel clear
echo "  SEL cleared"

echo "[5/9] Extending watchdog timeout to 60 minutes..."
ipmitool mc watchdog set timer use os_load timeout 3600 action none 2>/dev/null || \
  ipmitool mc watchdog off 2>/dev/null || true
echo "  Watchdog extended"

echo "[6/9] Setting fans to full speed..."
# Try Supermicro, then Dell, then generic
ipmitool raw 0x30 0x45 0x01 0x01 2>/dev/null || \
  ipmitool raw 0x30 0x30 0x01 0x00 2>/dev/null || \
  echo "  (Set fans manually via BMC web UI)"
echo "  Fan control set to full"

echo "[7/9] Checking power cap..."
POWER_LIMIT=$(ipmitool dcmi power get_limit 2>/dev/null | grep -m1 "Power Limit:" | awk '{print $3}')
if [ -n "$POWER_LIMIT" ] && [ "$POWER_LIMIT" -lt 360 ] 2>/dev/null; then
  echo "  WARNING: Power cap is ${POWER_LIMIT}W (below 360W TDP)"
  echo "  Consider raising: ipmitool dcmi power set_limit limit 400"
else
  echo "  Power cap OK (${POWER_LIMIT:-uncapped}W)"
fi

echo "[8/9] Reducing sensor polling interval..."
# This is vendor-specific — configure via BMC web UI if command fails
echo "  Configure via BMC web UI: Settings → Sensor Polling → 60s"

echo "[9/9] Measuring SMI rate (10 second sample)..."
# Intel-only MSR again — on AMD EPYC, verify with hwlatdetect instead
SMI_START=$(sudo rdmsr -p 0 0x00000034 2>/dev/null || echo "0")
sleep 10
SMI_END=$(sudo rdmsr -p 0 0x00000034 2>/dev/null || echo "0")
SMI_RATE=$(( 0x$SMI_END - 0x$SMI_START ))
echo "  SMIs in 10 seconds: $SMI_RATE (target: <10)"

echo ""
echo "Done. IPMI interference minimized."
echo "Residual overhead: ~2% (unavoidable platform SMIs)"
echo "Run hwlatdetect --duration=60 --threshold=10 to verify"

Why Redfish Is the Answer

The single highest-impact mitigation is moving from in-band IPMI to Redfish. Here's why.

Attribute | IPMI In-Band (KCS) | Redfish (Out-of-Band)
Transport | LPC/eSPI bus via SMI | HTTPS over BMC NIC
Host CPU impact | SMI per command (all cores frozen) | Zero (BMC-only)
Protocol | Binary, 20 years old | RESTful JSON, modern
Authentication | IPMI 2.0 (known weaknesses) | TLS + session tokens
Sensor reading | SMI per sensor | BMC reads sensors locally, serves over HTTP
Monitoring tools | ipmitool, freeipmi | curl, any HTTP client, Prometheus
Inference overhead | 3-5% | ~0%

How to switch

Every modern server BMC (iDRAC 9+, iLO 5+, Supermicro X12+, AMD SP5 platforms) supports Redfish natively. Access it at https://<bmc-ip>/redfish/v1/. You can monitor sensors, manage power, read logs, and control the server entirely through the BMC's network interface — without ever touching the host CPU.

Redfish examples
# Read all sensor data (zero host CPU impact)
curl -sk -u admin:password \
  https://bmc-ip/redfish/v1/Chassis/1/Thermal | jq '.Temperatures'

# Read power consumption
curl -sk -u admin:password \
  https://bmc-ip/redfish/v1/Chassis/1/Power | jq '.PowerControl'

# Prometheus integration (via Redfish exporter)
# github.com/stmcginnis/gofish — Go Redfish client
# No in-band IPMI needed. Zero SMIs. Full monitoring.

# Check BMC Redfish capability
curl -sk https://bmc-ip/redfish/v1/ | jq '.RedfishVersion'
# Should return "1.6.0" or higher

Silencing the BMC is Step 1. Replacing it is Step 2. The mitigations on this page tell you what to turn off. Sovereign Management tells you what to turn on — direct MSR/I2C sensor access, PID fan control, Prometheus monitoring, fleet orchestration. Zero SMIs. Full sensor coverage retained. The server manages itself.