Local Perf Testing
Every so often, I like to do some benchmarking on my own machine. Anyone who’s ever done serious performance measurement is probably already scrambling for the comment section to put me on blast: benchmarking on your desktop is basically the opposite of a controlled environment. But with a few tweaks, you can make it just good enough to trust your results.
This post is a walkthrough of how I prep my local machine for low-jitter, semi-deterministic benchmarking on Linux.
Step 0: Hardware & Plan
As with any performance work, the first step is to understand the hardware you’re running on.
In my case, I’m usually doing microbenchmarking - comparing two ways to do the same thing, not simulating full production load. So my home PC works fine as a baseline.
Here’s the hardware I’m working with:
- CPU: AMD Ryzen 9 7950X
- RAM: Corsair 32GB DDR5 5600MHz CL36
- Disk: Samsung 980 Pro NVMe
- Motherboard: Asus ROG STRIX B650E-I GAMING WiFi
- Cooling: Arctic Liquid Freezer II 240
- GPU: AMD Radeon RX 6600 XT
I’m typically CPU-bound, so most of the tuning below is CPU-centric.
Running lstopo shows my CPU topology: 32 processing units (16 cores x 2 SMT threads), split across two CCDs.
While disabling SMT (hyperthreads) in BIOS would help determinism further, I want to automate this process via bootloader options and scripts rather than manual BIOS changes.
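That said, SMT doesn’t strictly require a BIOS trip: the kernel exposes a runtime toggle under sysfs, and nosmt can also go on the kernel command line next to the isolation options. A quick sketch (the echo lines need root, so they’re shown commented out):

```shell
# Report the current SMT state: "on", "off", "forceoff", or "notsupported"
cat /sys/devices/system/cpu/smt/control 2>/dev/null || echo "notsupported"

# Disable SMT until the next reboot (as root); sibling threads go offline:
#   echo off > /sys/devices/system/cpu/smt/control
# Re-enable it:
#   echo on > /sys/devices/system/cpu/smt/control
# Or make it a boot option alongside the isolation flags: nosmt
```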
So, let’s talk about the most powerful low-jitter kernel option we’ll start with: core isolation.
Step 1: Core Isolation
When you isolate cores, you’re telling the kernel:
“Leave these CPUs alone — don’t schedule any system tasks or background threads on them. I’ll schedule work on them myself.”
Here’s what that entails:
- No kernel housekeeping: threads like rcuo, watchdog, and ksoftirqd stay on other CPUs. Your isolated cores only run your workload.
- Tickless operation (nohz_full): periodic kernel timer interrupts are disabled on isolated cores. This removes “kernel noise” that can interrupt benchmarks.
- RCU offloading (rcu_nocbs): moves deferred RCU callbacks off isolated cores, keeping them clean.
- Optional NUMA binding: bind processes and memory to the same NUMA node to reduce cross-node latency.
Why bother?
- Deterministic performance: no background jitter.
- Reproducibility: same CPU environment every run.
- Accurate profiling: stable timing and fewer “mystery” slowdowns.
On this PC I usually isolate cores 14,15,16-23 — one from CCD0, and half from CCD1.
That setup lets me play with inter-CCD latency when running multi-threaded benchmarks.
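Once booted with isolation in place, nothing runs on those cores unless you put it there yourself, typically with taskset. A sketch: ./my_bench is a placeholder binary, and the runnable line pins to CPU 0 only so the example works on any machine.

```shell
# Confirm what the kernel isolated (prints the configured list, e.g. 14-23)
cat /sys/devices/system/cpu/isolated

# Pin a benchmark to the isolated cores; on my machine that would be:
#   taskset -c 16-23 ./my_bench

# Runnable stand-in: pin a trivial loop to CPU 0, which always exists
taskset -c 0 sh -c 'i=0; while [ $i -lt 10000 ]; do i=$((i+1)); done; echo "pinned run done"'
```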
Step 2: Boot Entry
My main OS is Arch Linux (btw), using systemd-boot as the boot loader.
nick@tempest ~ $ bootctl status
System:
Firmware: UEFI 2.80 (American Megatrends 5.26)
Firmware Arch: x64
Secure Boot: disabled
TPM2 Support: yes
Measured UKI: no
Boot into FW: supported
Current Boot Loader:
Product: systemd-boot 258.1-1-arch
I maintain two boot entries - one normal and one for performance testing.
nick@tempest ~ $ bat /boot/loader/entries/*.conf
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ File: /boot/loader/entries/arch.conf
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ title Arch Linux
2 │ linux /vmlinuz-linux
3 │ initrd /amd-ucode.img
4 │ initrd /initramfs-linux.img
5 │ options root="LABEL=arch_os" rootfstype=ext4 add_efi_memmap nvidia_drm.modeset=1
6 │
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ File: /boot/loader/entries/perf.conf
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ title Arch Linux - perf
2 │ linux /vmlinuz-linux
3 │ initrd /amd-ucode.img
4 │ initrd /initramfs-linux.img
5 │ options root="LABEL=arch_os" rootfstype=ext4 add_efi_memmap nohz=on nohz_full=14,15,16-23 isolcpus=nohz,domain,
│ 14,15,16-23 rcu_nocbs=14,15,16-23 rcu_nocb_poll skew_tick=1 transparent_hugepage=never nosoftlockup mce=ignore_
│ ce audit=0 intel_pstate=disable intel_idle.max_cstate=0 idle=poll
6 │
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Key Boot Parameters
Core Isolation & Scheduling
- nohz=on and nohz_full=14,15,16-23: enables “full dynticks” mode, so the periodic scheduler tick is stopped on the isolated cores.
- isolcpus=nohz,domain,14,15,16-23: keeps the scheduler and kernel housekeeping threads off the isolated CPUs.
RCU Offloading
- rcu_nocbs=14,15,16-23: pushes RCU callbacks onto other cores.
- rcu_nocb_poll: the offloaded callback threads poll for work instead of being woken by interrupts, which means less kernel noise.
Timing & Tick Behavior
- skew_tick=1: staggers tick events across cores to avoid synchronized latency spikes.
Memory & Hugepages
- transparent_hugepage=never: disables THP to avoid background page compaction and khugepaged activity.
Reliability / Watchdog Control
- nosoftlockup: disables the soft-lockup detector, whose periodic checks add noise.
- mce=ignore_ce: ignores corrected machine-check errors, another minor source of polling.
- audit=0: disables the audit subsystem to reduce per-syscall overhead and log spam.
Power & Frequency
- intel_pstate=disable: falls back to the legacy acpi-cpufreq driver, allowing manual governor control.
- intel_idle.max_cstate=0 and idle=poll: keeps cores fully awake, with no C-state exit latency.
One caveat: the intel_pstate and intel_idle drivers only bind to Intel CPUs, so on this Ryzen system those two options should be no-ops. idle=poll is vendor-neutral and does the real work of keeping cores out of deep C-states; amd_pstate=disable is the AMD-side analogue if you want frequency control under acpi-cpufreq.
Result: your isolated CPUs stay hot, awake, and predictable — at the cost of a lot of watts.
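After rebooting into the perf entry, it’s worth verifying the kernel actually honored all of this before trusting any numbers. These sysfs paths are standard; the expected values depend on your isolated list:

```shell
# Full command line the kernel booted with
cat /proc/cmdline

# CPUs the scheduler is leaving alone (the kernel may collapse the list, e.g. 14-23)
cat /sys/devices/system/cpu/isolated

# CPUs running in full-dynticks mode
cat /sys/devices/system/cpu/nohz_full

# THP state: expect "always madvise [never]"
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo "THP not available"
```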
Step 3: Post-Boot Tuning Script
Once booted into “perf mode” I run a short setup script (as root) to finish preparing the environment.
This does the following:
- Checks we’re booted into perf mode with isolated cores
- Prints CPU and NUMA topology for visual confirmation
- Temporarily offlines and re-onlines the isolated CPUs to reset their kernel state
- Forces the “performance” CPU frequency governor
- Pins housekeeping threads to non-performance cores:
  - watchdog (CPU/system health monitor)
  - RCU threads (handle deferred kernel callbacks)
  - IRQ affinity (hardware interrupt handling)
- Leaves you with a clean, deterministic system
#!/bin/bash
set -uo pipefail

NON_PERF_CPU_LIST=${1-"0-5"}
SCALING_GOV="performance"
isolated_file="/sys/devices/system/cpu/isolated"

if [[ ! -f "${isolated_file}" ]] || [[ -z "$(cat "${isolated_file}")" ]]; then
    echo "No isolated CPUs detected!"
    echo "It looks like you haven't booted with isolation options enabled."
    echo
    total_cpus=$(nproc)
    start_isol=$(( (total_cpus * 3) / 4 ))
    end_isol=$(( total_cpus - 1 ))
    example_isol="${start_isol}-${end_isol}"
    echo "Example kernel command line options for isolation:"
    echo
    cat <<EOF
options root="LABEL=arch_os" rootfstype=ext4 add_efi_memmap \\
    nohz=on nohz_full=${example_isol} isolcpus=nohz,domain,${example_isol} \\
    rcu_nocbs=${example_isol} rcu_nocb_poll skew_tick=1 \\
    transparent_hugepage=never nosoftlockup mce=ignore_ce audit=0 \\
    intel_pstate=disable intel_idle.max_cstate=0 idle=poll
EOF
    echo
    echo "Tip: Adjust the isolated core list (${example_isol}) based on your CPU count and workload."
    exit 1
fi

# Expand ranges like "14,15,16-23" into one CPU id per line
isol_cpus=$(awk 'BEGIN{FS=","}{for(i=1;i<=NF;i++){if(split($i,range,"-")>1){for(j=range[1];j<=range[2];j++){print j}}else{print $i}}}' "${isolated_file}")

echo "Kernel boot cmdline"
cat /proc/cmdline
echo
lstopo --no-factorize --no-collapse --output-format ascii
echo
echo "Non perf CPUs (override via arg1): ${NON_PERF_CPU_LIST}"
echo "Isolated CPUs: $(cat "${isolated_file}")"

numa_nodes_count=$(lscpu | awk '/^NUMA node\(s\):/ {print $3}')
if [[ -n "${numa_nodes_count}" && "${numa_nodes_count}" -gt 1 ]]; then
    echo "System has ${numa_nodes_count} NUMA nodes. Detecting isolated CPU NUMA nodes..."
    # Each cpuN directory contains a nodeX symlink naming the NUMA node it belongs to
    numa_nodes=$(for cpu in ${isol_cpus}; do
        node_link=(/sys/devices/system/cpu/cpu${cpu}/node[0-9]*)
        if [[ -d "${node_link[0]}" ]]; then
            echo "${node_link[0]##*node}"
        else
            echo 0 # fallback: assume node 0
        fi
    done | sort -un | tr '\n' ',' | sed 's/,$//')
    echo "Isolated CPUs belong to NUMA node(s): ${numa_nodes}"
    echo "To ensure memory and threads stay local to your benchmark node(s), use:"
    echo "  numactl --cpunodebind=${numa_nodes} --membind=${numa_nodes} ./your_benchmark"
else
    echo "System has a single NUMA node. Skipping NUMA setup."
fi
echo
read -n 1 -s -r -p "Press any key to continue"
echo
echo
echo "Taking all isolated CPUs offline"
for cpu_id in ${isol_cpus}; do
    echo 0 > "/sys/devices/system/cpu/cpu${cpu_id}/online"
    echo -n "${cpu_id},"
done
echo
sleep 0.5
echo
echo "Bringing all isolated CPUs back online"
for cpu_id in ${isol_cpus}; do
    echo 1 > "/sys/devices/system/cpu/cpu${cpu_id}/online"
    echo -n "${cpu_id},"
done
echo
echo
echo "Setting scaling governor to ${SCALING_GOV} for isolated CPUs"
for cpu_id in ${isol_cpus}; do
    echo "${SCALING_GOV}" > "/sys/devices/system/cpu/cpu${cpu_id}/cpufreq/scaling_governor"
    echo -n "${cpu_id},"
done
echo
echo
echo "Setting watchdog affinity to ${NON_PERF_CPU_LIST}"
pgrep -f "watchdog" | while read -r pid; do taskset -cp "${NON_PERF_CPU_LIST}" "${pid}"; done
echo
echo "Setting RCU thread affinity to ${NON_PERF_CPU_LIST}"
pgrep -f "rcuo" | while read -r pid; do taskset -cp "${NON_PERF_CPU_LIST}" "${pid}"; done
echo
echo "Setting irq affinity to ${NON_PERF_CPU_LIST}"
for irq in /proc/irq/[0-9]*; do
    # stderr can show "permission denied" for some kernel-managed IRQs
    echo "${NON_PERF_CPU_LIST}" > "${irq}/smp_affinity_list" 2>/dev/null
done
echo
echo "System tuning complete :)"
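Putting it all together, a session in perf mode looks roughly like this; the script filename and benchmark binary are placeholders, and the CPU lists match the boot entry above:

```shell
# 1. Reboot into the "Arch Linux - perf" entry.
# 2. As root, run the tuning script, keeping housekeeping on CPUs 0-5:
#      ./perf-tune.sh 0-5
# 3. Run the benchmark pinned to the isolated cores, several times, to
#    confirm run-to-run variance has actually dropped:
#      taskset -c 16-23 ./my_benchmark

# Sanity check before trusting any numbers:
grep -o 'isolcpus=[^ ]*' /proc/cmdline || echo "WARNING: not booted in perf mode"
```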