Caveat
Things you definitely want to know
Generic issues
isolcpus is our friend too
Isolating some CPUs on the kernel command line with the isolcpus= option, in order to prevent the load balancer from offloading in-band work to them, is a good idea not only with PREEMPT_RT but with any dual kernel configuration too.
By doing so, random in-band work is less likely to evict cache lines on a CPU where real-time threads briefly sleep, which reduces the odds of costly cache misses and translates positively into the latency numbers you can get. Even if EVL’s small footprint core has limited exposure to this kind of disturbance, saving a handful of microseconds is worth it when the worst-case figure is already within tens of microseconds.
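A quick way to confirm that isolation took effect after booting with, say, isolcpus=2,3 (the CPU numbers are only an example) is to read the isolated CPU list which the kernel exports in sysfs. The sketch below assumes a POSIX shell; the SYSFS_ROOT variable is a hypothetical override introduced here for testing, not a kernel or EVL feature.

```shell
#!/bin/sh
# Report which CPUs the kernel isolated from the in-band load balancer.
# SYSFS_ROOT is an illustrative override (assumption); it defaults to /sys.
isolated_cpus() {
    list=$(cat "${SYSFS_ROOT:-/sys}/devices/system/cpu/isolated" 2>/dev/null)
    # The file holds a CPU list such as "2-3", or an empty line when
    # isolcpus= was not given on the kernel command line.
    if [ -n "$list" ]; then
        echo "$list"
    else
        echo "none"
    fi
}

echo "isolated CPUs: $(isolated_cpus)"
```

Real-time threads can then be pinned to the CPUs reported here, leaving the remaining ones to in-band work.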
CONFIG_DEBUG_HARD_LOCKS is cool but ruins real-time guarantees
When CONFIG_DEBUG_HARD_LOCKS is enabled, the lock dependency engine
(CONFIG_LOCKDEP), which helps in tracking down deadlocks and other
locking-related issues, is also enabled for Dovetail’s hard locks,
which underpin most of the serialization mechanisms the EVL core
uses.
This is nice as it has the lock validator monitor the hard spinlocks
EVL uses too. However, this comes with a high price latency-wise:
seeing hundreds of microseconds spent in the validator with hard
interrupts off from time to time is not uncommon. Running the latency
monitoring utility (aka latmus) which is part of libevl in this
configuration should give you pretty ugly numbers.
In short, enabling CONFIG_DEBUG_HARD_LOCKS is fine for debugging
some locking pattern in EVL, but you won’t be able to meet real-time
requirements at the same time in such a configuration.
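Before trusting any latency figure, it may help to check whether the kernel under test was built with the lock validator. The following is a minimal sketch, assuming the kernel configuration is available as a plain-text file; KCONFIG_FILE and the kconfig_enabled helper are names made up for this example, and the /boot/config-<release> location is a common distribution convention rather than a guarantee.

```shell
#!/bin/sh
# Succeed if OPTION ($2) is built in (=y) or modular (=m) in the given
# plain-text kernel configuration file ($1).
kconfig_enabled() {
    grep -q "^$2=[ym]\$" "$1" 2>/dev/null
}

# KCONFIG_FILE is an illustrative override (assumption). Many distributions
# install the configuration of the running kernel as /boot/config-<release>.
cfg="${KCONFIG_FILE:-/boot/config-$(uname -r)}"

if [ -r "$cfg" ]; then
    for opt in CONFIG_DEBUG_HARD_LOCKS CONFIG_LOCKDEP; do
        if kconfig_enabled "$cfg" "$opt"; then
            echo "$opt is set in $cfg: do not trust latency figures from this build"
        fi
    done
else
    echo "no readable kernel configuration at $cfg"
fi
```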
CPU frequency scaling (usually) has a negative impact on latency
Enabling the ondemand CPUFreq governor - or any governor performing dynamic adjustment of the CPU frequency - may induce significant latency for EVL on your system, from ten microseconds to more than a hundred depending on the hardware. Selecting the so-called performance governor is the safe option, which guarantees that no frequency transition ever happens, keeping the CPUs at their maximum processing speed.
In other words, if CONFIG_CPU_FREQ has to be enabled in your
configuration, enabling CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE and
CONFIG_CPU_FREQ_GOV_PERFORMANCE exclusively is most often the best way
to prevent unexpectedly high latency peaks.
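On a kernel which keeps several governors available, the active one can be inspected per CPU through the CPUFreq sysfs interface. The sketch below lists the CPUs which are not running the performance governor; SYSFS_ROOT and the function name are assumptions made for the example.

```shell
#!/bin/sh
# Print the name of every CPU whose CPUFreq governor is not "performance".
# SYSFS_ROOT is an illustrative override for testing; it defaults to /sys.
non_performance_cpus() {
    for gov in "${SYSFS_ROOT:-/sys}"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without CPUFreq support (or an unmatched glob).
        [ -r "$gov" ] || continue
        if [ "$(cat "$gov")" != "performance" ]; then
            cpu_dir=${gov%/cpufreq/scaling_governor}
            echo "${cpu_dir##*/}"
        fi
    done
}

offenders=$(non_performance_cpus)
if [ -n "$offenders" ]; then
    # As root, writing "performance" to the scaling_governor file of a
    # CPU switches it at runtime, provided that governor is built in.
    echo "CPUs not running the performance governor:" $offenders
else
    echo "no CPU with a non-performance governor found"
fi
```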
Disable CONFIG_SMP for best latency on single-core systems
On single-core hardware, an SMP build may still execute some out-of-line code for dealing with various types of spinlocks, which translates into additional CPU branches and cache misses. On low-end hardware, this overhead may be noticeable.
Therefore, if you neither need SMP support nor kernel debug options
which depend on instrumenting the spinlock constructs (e.g.
CONFIG_DEBUG_PREEMPT), you may want to disable all the related kernel
options, starting with CONFIG_SMP.
Memory compaction and page migration impact real-time behavior
Several kernel configuration options related to memory management can introduce unpredictable latency through page migration and page fault handling, which is problematic for real-time workloads:
- CONFIG_COMPACTION: Enables memory compaction to reduce fragmentation by migrating pages to create larger contiguous memory regions. The compaction process can trigger page migrations that introduce latency.
- CONFIG_MIGRATION: Allows the migration of the physical location of memory pages of processes while the virtual addresses are not changed. Useful for allowing the kernel to move pages between NUMA nodes or during compaction. Page migration involves copying page contents and updating page table entries, which can take time, or cause page faults when page table entries become temporarily unavailable. The use of mlock or mlockall does NOT automatically guarantee that a page will not be migrated. See below for more.
- CONFIG_TRANSPARENT_HUGEPAGE: Transparent Hugepages (THP) reduces page faults by mapping 2MB instead of 4KB pages, but causes higher latency during faults due to intensive memory allocation and zeroing. While reducing the number of faults, THP can incur higher latency during initial memory access or when khugepaged compacts memory.
The best approach depends on your workload characteristics. In an ideal
situation, you would disable CONFIG_COMPACTION, CONFIG_MIGRATION, and
CONFIG_TRANSPARENT_HUGEPAGE entirely. In other situations, you may need
to keep these options enabled.
If you are running applications that allocate and free memory often, and/or
need a steady source of contiguous pages, you may need to keep
CONFIG_MIGRATION and CONFIG_COMPACTION enabled. Without them, your in-band
applications may experience out-of-memory issues.
Luckily, procfs provides tunables to control compaction behavior:
- vm.compaction_proactiveness: Determines how aggressively compaction is done in the background. Writing a non-zero value to this tunable immediately triggers proactive compaction. Setting it to 0 disables proactive compaction.
- vm.compact_unevictable_allowed: When set to 1, compaction is allowed to examine the unevictable LRU (mlocked pages) for pages to compact. This should be used on systems where stalls for minor page faults are an acceptable trade for large contiguous free memory. Set it to 0 to prevent compaction from moving unevictable pages. On EVL, the default value is 0 in order to avoid page faults due to compaction (CONFIG_COMPACT_UNEVICTABLE_DEFAULT).
procfs also provides an interface to manually trigger a memory compaction
operation using vm.compact_memory.
Using these options, you can perform a sequence such as the following to prime the system for reliability:
- Launch EVL threads. At the end of initialization, but before calling mlockall...
- Trigger a memory compaction manually using vm.compact_memory.
- Once complete, call mlockall to lock pages.
- Then, disable vm.compact_unevictable_allowed to prevent those pages from getting migrated.
- Finally, launch other in-band applications.
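The in-band side of the sequence above could be sketched with two small helpers writing to the tunables under /proc/sys/vm. This is only an illustration: the helper names, the PROC_ROOT override and the compact|protect front end are made up for this example, and how the application signals that it reached each step (e.g. right after mlockall returned) is left to the integrator. Writing to these files requires root privileges.

```shell
#!/bin/sh
# Helpers for the in-band side of the priming sequence.
# PROC_ROOT is an illustrative override for testing (default /proc).
vm_tunable() {
    echo "${PROC_ROOT:-/proc}/sys/vm/$1"
}

# Trigger a manual compaction of all memory zones, to be done before
# the application calls mlockall.
compact_now() {
    echo 1 > "$(vm_tunable compact_memory)"
}

# Keep compaction away from unevictable (mlocked) pages, to be done
# once the application has called mlockall.
protect_mlocked_pages() {
    echo 0 > "$(vm_tunable compact_unevictable_allowed)"
}

# Hypothetical command-line front end: "compact" before the application
# calls mlockall, "protect" right after it did.
case "${1:-}" in
    compact) compact_now ;;
    protect) protect_mlocked_pages ;;
    "") ;;
    *) echo "usage: $0 compact|protect" >&2 ;;
esac
```

Keeping the two writes as separate steps matters: compacting first lets the kernel defragment memory while the pages can still move, whereas protecting first would pin the pages wherever they happen to be.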
Architecture-specific issues
x86
Issues you can work around
- Some processor idle states may significantly increase latency for the whole software system, up to hundreds of microseconds, because some functional blocks have to be fully woken up upon an external event. The intel_idle driver, which is part of the CPU idle time management subsystem of the kernel, deals with those states, aka C-states. Depending on the microarchitecture of the Intel CPU, you may have to disable this driver in order to reduce the latency figures to acceptable values, by passing intel_idle.max_cstate=0 on the kernel command line. More details are available in the kernel documentation for the intel_idle driver.
- CONFIG_ACPI_PROCESSOR_IDLE may increase the latency upon wakeup on IRQ from idle on some SoCs (up to 30 us observed) on x86. This option is implicitly selected by the following configuration chain: CONFIG_SCHED_MC_PRIO → CONFIG_INTEL_PSTATE → CONFIG_ACPI_PROCESSOR. If out-of-range latency figures are observed on your x86 hardware, turning off this chain may help.
- Tweaking the BIOS settings may be required in order to lower the latency figures as well. Typically, you may want to check whether disabling Hyperthreading and CPU power management there helps.
- When the HPET is disabled, the watchdog which monitors the sanity of the current clocksource for the kernel may use refined-jiffies as the reference clocksource to compare with. Unfortunately, such a clocksource is fairly imprecise for timekeeping since timer interrupts might be missed. This could in turn trigger false positives with the watchdog, which would end up declaring the TSC clocksource as ‘unstable’. For instance, it has been observed that enabling CONFIG_FUNCTION_GRAPH_TRACER on some legacy hardware would systematically cause such behavior at boot. The following warning splat appearing in the kernel log is symptomatic of this problem:

    clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc-early' as unstable because the skew is too large:
    clocksource: 'refined-jiffies' wd_now: fffb7018 wd_last: fffb6e9d mask: ffffffff
    clocksource: 'tsc-early' cs_now: 68a6a7070f6a0 cs_last: 68a69ab6f74d6 mask: ffffffffffffffff
    tsc: Marking TSC unstable due to clocksource watchdog

  This is a problem because the TSC is the best-rated clocksource and is directly accessible from the vDSO, which speeds up timestamping operations. If the TSC on your hardware is known to be fine and you face this issue nevertheless, you may want to pass tsc=nowatchdog to the kernel to prevent it, or even tsc=reliable if all TSCs are reliable enough to be synchronized across CPUs. If the TSC is really unstable on some legacy hardware and you cannot ignore the watchdog alert, you can still leave it to other clocksources such as acpi_pm. Calls to evl_read_clock() would be slower compared to a direct syscall-less readout from the vDSO, but the EVL core would nevertheless manage to get timestamps from its built-in clocks at the expense of an out-of-band system call, without involving the in-band stage though. You definitely want to make sure everything is right on your platform with respect to reading timestamps by running the latmus test, which can detect any related issue.

  You can retrieve the current clocksource used by the kernel as follows:

    # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    tsc

- NMI-based perf data collection may cause the kernel to execute utterly sluggish ACPI driver code at each event. Since disabling CONFIG_PERF is not an option, passing nmi_watchdog=0 on the kernel command line at boot may help.
Note
Passing nmi_watchdog=0 turns off the hard lockup detection for the
in-band kernel. However, EVL will still detect runaway EVL threads
stuck in out-of-band execution if CONFIG_EVL_WATCHDOG is enabled.
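To verify which of the workarounds above are actually in effect on a running system, the kernel command line can be inspected from /proc/cmdline. The following sketch only reports whether each parameter is present; it does not judge whether your hardware needs it. The cmdline_has helper and the PROC_ROOT and SYSFS_ROOT overrides are assumptions made for this example.

```shell
#!/bin/sh
# Succeed if the exact parameter (e.g. "nmi_watchdog=0") was passed on
# the kernel command line. PROC_ROOT is an illustrative override.
cmdline_has() {
    cmdline="${PROC_ROOT:-/proc}/cmdline"
    [ -r "$cmdline" ] || return 1
    # One parameter per line, then look for an exact, literal match.
    tr ' ' '\n' < "$cmdline" | grep -qxF -e "$1"
}

for param in intel_idle.max_cstate=0 tsc=nowatchdog tsc=reliable nmi_watchdog=0; do
    if cmdline_has "$param"; then
        echo "$param: present"
    else
        echo "$param: absent"
    fi
done

# Clocksources the kernel could fall back to if the TSC is marked unstable.
avail="${SYSFS_ROOT:-/sys}/devices/system/clocksource/clocksource0/available_clocksource"
if [ -r "$avail" ]; then
    echo "available clocksources: $(cat "$avail")"
fi
```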
The SMI nightmare
System Management Interrupts or SMIs are special interrupts at the highest priority causing the x86 CPU to enter the System Management Mode, a variant of the flat real mode for executing some handler implemented by the BIOS. SMIs don’t go through the interrupt controller, they are detected by the CPU logic in between instructions and unconditionally dispatched from there. This introduces critical issues for real-time systems:
- SMIs may preempt the real-time code for an undefined amount of time, at any time, and cannot be masked or preempted by kernel software. Actually, the kernel software does not even know about ongoing SMI requests.
- Transitioning to/from the SMM context requires the CPU to save/restore most of its register file, switching to a different CPU mode. With multi-core systems, the BIOS may even wait for all CPU cores to enter SMM before serializing the execution of the pending SMI request. This is yet another source of unexpected delay.
- SMM handlers invoked by SMIs are implemented in the BIOS, therefore their implementation is opaque to us. We may just observe the pathological latency spots some of them cause (e.g. seeing 300 microsecond delays with USB-related SMI is common).
This means that whether you use a single kernel configuration
(PREEMPT_RT) or a dual kernel one like EVL, SMIs will bite the same
way. Very unfortunately, SMIs are commonly involved in health
monitoring operations such as thermal control in x86 chipsets, or
regular device management such as USB support, so there is no simple
and straightforward option for dealing with them.
Warning
In other words, for any x86-based development with real-time performance requirements, don’t take anything for granted: make sure to assess as early as possible the worst-case latency figures you can actually achieve with the hardware.
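As part of such an assessment, one possible way to tell whether SMIs fire at all during a test run is to sample the SMI counter which Intel processors expose as model-specific register 0x34. This is an assumption-laden sketch, not something EVL provides: it relies on the rdmsr utility from the msr-tools package, on the msr kernel module being loaded, on root privileges, and on an Intel CPU. The function names and the RDMSR override are made up for the example.

```shell
#!/bin/sh
# Read the SMI counter (MSR 0x34, Intel only) of one CPU as an unsigned
# decimal. RDMSR is an illustrative override naming the reader command.
smi_count() {
    "${RDMSR:-rdmsr}" -p "${1:-0}" -u 0x34
}

# Run a command and print how many SMIs hit CPU 0 in the meantime.
smis_during() {
    before=$(smi_count 0) || return 1
    "$@" > /dev/null 2>&1
    after=$(smi_count 0) || return 1
    echo $((after - before))
}

# Only attempt a measurement when the tool is installed and the counter
# is readable (requires root and the msr kernel module).
if command -v "${RDMSR:-rdmsr}" > /dev/null 2>&1 && smi_count 0 > /dev/null 2>&1; then
    echo "SMIs on CPU 0 over 10 seconds: $(smis_during sleep 10)"
else
    echo "SMI counter not readable on this system"
fi
```

A non-zero count while a latency test shows unexplained spikes is a hint, not a proof, that SMIs are involved; a zero count over a short window does not rule them out either.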
