Local Perf Testing

Posted on Nov 2, 2025

Every so often, I like to do some benchmarking on my own machine. Anyone who’s ever done serious performance measurement is probably already scrambling for the comment section to put me on blast, since benchmarking on your desktop is basically the opposite of a controlled environment. But with a few tweaks, you can make it just good enough to trust your results.

This post is a walkthrough of how I prep my local machine for low-jitter, semi-deterministic benchmarking on Linux.


Step 0: Hardware & Plan

As with any performance work, the first step is to understand the hardware you’re running on.

In my case, I’m usually doing microbenchmarking - comparing two ways to do the same thing, not simulating full production load. So my home PC works fine as a baseline.

My PC

Here’s the hardware I’m working with:

I’m typically CPU-bound, so most of the tuning below is CPU-centric.

lstopo screenshot

The lstopo output above shows my CPU topology: 32 processing units (16 cores x 2 SMT threads), split across two CCDs.
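
If you don’t have hwloc around for lstopo, lscpu from util-linux gives a similar text-only view of the thread/core/node mapping:

# Map each logical CPU to its core, socket, and NUMA node
lscpu --extended=CPU,CORE,SOCKET,NODE

# Or list the SMT siblings of a given core
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list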

Disabling SMT (hyper-threading) in the BIOS would improve determinism further, but I want this whole process automated via bootloader options and scripts rather than manual BIOS changes.

So let’s start with the most powerful low-jitter kernel feature: core isolation.


Step 1: Core Isolation

When you isolate cores, you’re telling the kernel:

“Leave these CPUs alone — don’t schedule any system tasks or background threads on them. I’ll schedule work on them myself.”

Here’s what that entails:

  1. No kernel housekeeping:
    Threads like rcuo, watchdog, and ksoftirqd stay on other CPUs. Your isolated cores only run your workload.

  2. Tickless operation (nohz_full):
    The periodic scheduler tick is stopped on isolated cores while a single task runs there. This removes “kernel noise” that can interrupt benchmarks.

  3. RCU offloading (rcu_nocbs):
    Moves deferred RCU callbacks off isolated cores, keeping them clean.

  4. Optional NUMA binding:
    Bind processes and memory to the same NUMA node to reduce cross-node latency.
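
Once booted with these options (see Step 2), you can sanity-check that the kernel honored them:

# CPUs the scheduler keeps clear of system tasks
cat /sys/devices/system/cpu/isolated

# CPUs running in full-dynticks (tickless) mode
cat /sys/devices/system/cpu/nohz_full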

Why bother?

  • Deterministic performance: no background jitter.
  • Reproducibility: same CPU environment every run.
  • Accurate profiling: stable timing and fewer “mystery” slowdowns.

lstopo screenshot with isolated cores highlighted

On this PC I usually isolate CPUs 14,15,16-23: one core (both SMT threads) from CCD0, plus half of CCD1. That setup lets me play with inter-CCD latency when running multi-threaded benchmarks, as shown below.
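
With isolation in place, nothing runs on those CPUs unless I put it there, typically via taskset (./bench here is a stand-in for whatever I’m measuring):

# One thread on an isolated CCD1 CPU
taskset -c 16 ./bench

# Two threads on different cores of the same CCD...
taskset -c 16,18 ./bench

# ...versus one thread on each CCD, to expose inter-CCD latency
taskset -c 14,16 ./bench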


Step 2: Boot Entry

My main OS is Arch Linux (btw), using systemd-boot as the boot loader.

nick@tempest ~ $ bootctl status
System:
      Firmware: UEFI 2.80 (American Megatrends 5.26)
 Firmware Arch: x64
   Secure Boot: disabled
  TPM2 Support: yes
  Measured UKI: no
  Boot into FW: supported

Current Boot Loader:
       Product: systemd-boot 258.1-1-arch

I maintain two boot entries - one normal and one for performance testing.

nick@tempest ~ $ bat /boot/loader/entries/*.conf
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: /boot/loader/entries/arch.conf
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ title   Arch Linux
   2   │ linux   /vmlinuz-linux
   3   │ initrd  /amd-ucode.img
   4   │ initrd  /initramfs-linux.img
   5   │ options root="LABEL=arch_os" rootfstype=ext4 add_efi_memmap nvidia_drm.modeset=1
   6   │
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: /boot/loader/entries/perf.conf
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ title   Arch Linux - perf
   2   │ linux   /vmlinuz-linux
   3   │ initrd  /amd-ucode.img
   4   │ initrd  /initramfs-linux.img
   5   │ options root="LABEL=arch_os" rootfstype=ext4 add_efi_memmap nohz=on nohz_full=14,15,16-23 isolcpus=nohz,domain,
       │ 14,15,16-23 rcu_nocbs=14,15,16-23 rcu_nocb_poll skew_tick=1 transparent_hugepage=never nosoftlockup mce=ignore_
       │ ce audit=0 intel_pstate=disable intel_idle.max_cstate=0 idle=poll
   6   │
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────
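
To switch into perf mode without touching the boot menu, systemd-boot can be told to use the perf entry for the next boot only (the entry ID is the filename, perf.conf here):

# Boot the perf entry once, then fall back to the default entry
bootctl set-oneshot perf.conf
systemctl reboot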

Key Boot Parameters

Core Isolation & Scheduling

  • nohz=on and nohz_full=14,15,16-23

    Enables “full dynticks” mode — no periodic ticks on isolated cores.

  • isolcpus=nohz,domain,14,15,16-23

    Keeps scheduler and housekeeping threads off isolated CPUs.
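
You can see the tick silencing directly: the LOC (local timer) row of /proc/interrupts counts tick interrupts per CPU, and the columns for the isolated CPUs should barely move:

# Snapshot the local-timer interrupt counts twice; isolated CPUs
# should show (nearly) unchanged counts between the two
grep "LOC:" /proc/interrupts
sleep 5
grep "LOC:" /proc/interrupts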

RCU Offloading

  • rcu_nocbs=14,15,16-23

    Pushes RCU callbacks to other cores.

  • rcu_nocb_poll

    Polls instead of interrupting — less kernel noise.

Timing & Tick Behavior

  • skew_tick=1

    Staggers tick events across cores to avoid synchronized spikes.

Memory & Hugepages

  • transparent_hugepage=never

    Disables THP to avoid background page compaction.
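
At runtime the kernel reports the active THP mode in brackets, so this is easy to confirm:

# Expect "always madvise [never]" after booting with transparent_hugepage=never
cat /sys/kernel/mm/transparent_hugepage/enabled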

Reliability / Watchdog Control

  • nosoftlockup

    Disables the soft-lockup watchdog, another source of periodic checks.

  • mce=ignore_ce

    Ignores corrected machine-check errors (minor noise).

  • audit=0

    Disables audit subsystem to reduce log spam.

Power & Frequency

  • intel_pstate=disable

    Switches from intel_pstate to the legacy acpi-cpufreq driver, giving manual governor control. (On this AMD box the intel_* options are effectively no-ops; I keep them so the same cmdline works on Intel machines too.)

  • intel_idle.max_cstate=0 and idle=poll

    Keeps cores fully awake with no power-saving wake-up latency; idle=poll is the generic option doing the real work here.

Result: your isolated CPUs stay hot, awake, and predictable — at the cost of a lot of watts.
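
Once the governor is set in Step 3, it’s worth watching the live clocks to confirm the isolated cores really are pegged:

# With idle=poll and the performance governor, these should sit near max frequency
watch -n1 'grep MHz /proc/cpuinfo'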

Step 3: Post-Boot Tuning Script

Once booted into “perf mode”, I run a short setup script (as root) to finish preparing the environment.

It does the following:

  1. Checks we’re booted into perf mode with isolated cores
  2. Prints CPU and NUMA topology for visual confirmation
  3. Temporarily offlines and re-onlines the isolated CPUs to reset their kernel state
  4. Forces the “performance” CPU frequency governor
  5. Pins housekeeping threads to the non-performance cores:
    • Watchdog (CPU/system health monitor)
    • RCU threads (deferred kernel callbacks)
    • IRQs (hardware interrupt handling)
  6. Leaves you with a clean, deterministic system

#!/bin/bash

set -uo pipefail

# CPUs left to run housekeeping (override via the first argument)
NON_PERF_CPU_LIST=${1:-"0-5"}
SCALING_GOV="performance"

isolated_file="/sys/devices/system/cpu/isolated"

if [[ ! -f "${isolated_file}" ]] || [[ -z "$(cat "${isolated_file}")" ]]; then
    echo "No isolated CPUs detected!"
    echo "It looks like you haven’t booted with isolation options enabled."
    echo

    total_cpus=$(nproc)
    start_isol=$(( (total_cpus * 3) / 4 ))
    end_isol=$(( total_cpus - 1 ))
    example_isol="${start_isol}-${end_isol}"

    echo "Example kernel command line options for isolation:"
    echo
    cat <<EOF
options root="LABEL=arch_os" rootfstype=ext4 add_efi_memmap \\
    nohz=on nohz_full=${example_isol} isolcpus=nohz,domain,${example_isol} \\
    rcu_nocbs=${example_isol} rcu_nocb_poll skew_tick=1 \\
    transparent_hugepage=never nosoftlockup mce=ignore_ce audit=0 \\
    intel_pstate=disable intel_idle.max_cstate=0 idle=poll
EOF
    echo
    echo "Tip: Adjust the isolated core list (${example_isol}) based on your CPU count and workload."
    exit 1
fi

# Expand the kernel's isolated list (e.g. "14,15,16-23") into one CPU ID per line
isol_cpus=$(awk 'BEGIN{FS=","}{for(i=1;i<=NF;i++){if(split($i,range,"-")>1){for(j=range[1];j<=range[2];j++){print j}}else{print $i}}}' "${isolated_file}")

echo "Kernel boot cmdline"
cat /proc/cmdline
echo

lstopo --no-factorize --no-collapse --output-format ascii

echo
echo "Non perf CPUs (override via arg1): ${NON_PERF_CPU_LIST}"
echo "Isolated CPUs: $(cat ${isolated_file})"

numa_nodes_count=$(lscpu | awk '/^NUMA node\(s\):/ {print $3}')
if [[ -n "${numa_nodes_count}" && "${numa_nodes_count}" -gt 1 ]]; then
    echo "System has ${numa_nodes_count} NUMA nodes. Detecting isolated CPU NUMA nodes..."
    # Find which NUMA node(s) contain isolated CPUs
    numa_nodes=$(for cpu in ${isol_cpus}; do
        # each cpuN directory holds a "nodeK" symlink naming its NUMA node
        for node_link in /sys/devices/system/cpu/cpu${cpu}/node[0-9]*; do
            if [[ -e "${node_link}" ]]; then
                basename "${node_link}" | sed 's/^node//'
            fi
        done
    done 2>/dev/null | sort -nu | tr '\n' ',' | sed 's/,$//')

    echo "Isolated CPUs belong to NUMA node(s): ${numa_nodes}"

    echo "To ensure memory and threads stay local to your benchmark node(s), use:"
    echo "  numactl --cpunodebind=${numa_nodes} --membind=${numa_nodes} ./your_benchmark"
else
    echo "System has a single NUMA node. Skipping NUMA setup."
fi


echo
read -n 1 -s -r -p "Press any key to continue"
echo

echo
echo "Taking all isolated CPUs offline"
for cpu_id in ${isol_cpus}; do
        echo 0 > /sys/devices/system/cpu/cpu${cpu_id}/online
        echo -n "${cpu_id},"
done
echo

sleep 0.5

echo
echo "Bringing all isolated CPUs back online"
for cpu_id in ${isol_cpus}; do
        echo 1 > /sys/devices/system/cpu/cpu${cpu_id}/online
        echo -n "${cpu_id},"
done
echo

echo
echo "Setting scaling governor to ${SCALING_GOV} for isolated CPUs"
for cpu_id in ${isol_cpus}; do
        echo ${SCALING_GOV} > /sys/devices/system/cpu/cpu${cpu_id}/cpufreq/scaling_governor
        echo -n "${cpu_id},"
done
echo

echo
echo "Setting watchdog affinity to ${NON_PERF_CPU_LIST}"
pgrep -f "watchdog" | while read pid; do taskset -cp "${NON_PERF_CPU_LIST}" "${pid}"; done

echo
echo "Setting RCU thread affinity to ${NON_PERF_CPU_LIST}"
pgrep -f "rcuo" | while read pid; do taskset -cp "${NON_PERF_CPU_LIST}" "${pid}"; done

echo
echo "Setting irq affinity to ${NON_PERF_CPU_LIST}"
for irq in /proc/irq/[0-9]*; do
    echo "${NON_PERF_CPU_LIST}" > "${irq}/smp_affinity_list" 2>/dev/null # the stderr can show "permission denied" for some kernel-managed IRQs
done

echo
echo "System tuning complete :)"