EVL is now tracking kernel 5.5, heading to 5.6-rc1. A seriously silly bug in the ARM port was uncovered recently, which caused the EVL core to report spurious undefined instruction exceptions. If you run the ARM port of EVL, make sure to pull from the evl/master branch, or at the very least pick this commit, which reverts a broken and useless change.
When a common application is being (p)traced by gdb in the so-called all-stop mode, any thread which is stopped (e.g. because it hits a breakpoint or ^C is pressed) causes all other threads within the same process to stop as well. As the gdb documentation states, this allows you to examine the overall state of the program, including switching focus between threads, without worrying that things may change underfoot.
With a companion core sharing the responsibility of scheduling these threads, enforcing the all-stop mode requires a bit more work than just sending SIGSTOP to the siblings because the threads currently running out-of-band may delay stop requests which have to be issued and handled from the in-band context. In the meantime, the sibling threads running out-of-band may execute a significant amount of code before they eventually obey the in-band stop request, which is precisely what we would like to avoid.
To address this issue, support for synchronous debugger breakpoints is now available from the EVL core. This feature keeps a thread (single-)stepped by a debugger synchronized with its siblings from the same process running in the background, as follows:
- as soon as a ptracer (e.g. gdb) regains control over a thread which just hit a breakpoint or received SIGINT, sibling threads from the same process which run out-of-band are immediately frozen.
- all sibling threads which have been frozen are set to wait on a common barrier before they can be released. They are released once the stepped thread resumes and all of them have joined the barrier in out-of-band context. This ensures that siblings resume in an orderly (i.e. priority-based) manner from a common point in the timeline, instead of resuming in a staggered fashion.
Although the implementation differs, this is inspired by the similar feature available with Xenomai 3.
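The freeze-then-release protocol can be pictured with a plain user-space barrier. The following Python sketch is only an analogy built on `threading.Barrier`: it uses no EVL or ptrace API, it cannot model the priority-ordered resumption, and the names `run_all_stop` and `sibling` are made up for the illustration.

```python
import threading
import time


def run_all_stop(n_siblings=3):
    """Park sibling threads on a barrier until the stepped thread resumes."""
    # One extra party for the stepped thread: nobody leaves before it joins.
    barrier = threading.Barrier(n_siblings + 1)
    events = []
    events_lock = threading.Lock()

    def sibling(ident):
        with events_lock:
            events.append(("frozen", ident))
        barrier.wait()  # blocked here while the debugger holds the stepped thread
        with events_lock:
            events.append(("resumed", ident))

    threads = [threading.Thread(target=sibling, args=(i,)) for i in range(n_siblings)]
    for t in threads:
        t.start()

    # Stand-in for the debugging session: wait until every sibling is parked.
    while barrier.n_waiting < n_siblings:
        time.sleep(0.001)

    with events_lock:
        events.append(("stepped thread resumes", None))
    barrier.wait()  # last party joins, all siblings are released together

    for t in threads:
        t.join()
    return events
```

In the returned event list, every "frozen" entry precedes the "stepped thread resumes" entry, and every "resumed" entry follows it, which is the common release point the barrier is meant to provide.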
Unboxed the Raspberry PI 4 this week, and ported EVL to the Broadcom BCM2711 chip to get a sense of its performance when it comes to real-time duties. Actually, there was nothing to be ported, at least for the basic timer and GPIO latency tests: everything just worked out of the box over EVL v5.5-rc4. For testing EVL with an armv8 kernel on the PI 4, you will need a 64-bit root file system too, since EVL does not support the mixed ABI model yet (i.e. the EVL port to armv8 does not yet accept EVL system calls issued by executables targeting armv7).
As the vendor documentation states, “the architecture of the BCM2711 is a considerable upgrade on that used by the SoCs in earlier Pi models”. Hell yes. It is, at the very least when it comes to real-time performance. Compared to the PI 3B+ model, worst case latency is slashed by half for the same tests.
Because of past thermal issues with this chip, you really want to consider upgrading the firmware for any serious testing with the Raspberry PI 4. Bonus point: you will get a decent PXE boot in the same move.
A new serialization mechanism was implemented in order to address a common problem with adding out-of-band support to existing drivers: how to safely share a portion of the driver logic between the in-band and out-of-band stages? The idea underlying this mechanism is based on the observation that dealing with a device involves different phases. Most of them usually have no real-time requirement, such as device setup and channel configuration. Only a few may have one, such as exchanging payload data between the application and the device via I/O transfers.
A typical example would be accessing an ALSA device: we may want to control the settings of a device using the mixer application from the in-band context, or alternatively run a capture/playback loop via the PCM core from the out-of-band context for achieving bounded ultra-low latency when exchanging audio samples with the same device. We may allow the device to be reconfigured when the driver is not exchanging data with the codec in out-of-band mode and conversely, but we do not want both operations to happen concurrently inside the ALSA core. Because the application layer may not be able to ensure that such operations never overlap (e.g. playing an audio stream with aplay while changing the mixer settings with amixer from a different context), a kernel mechanism which helps in keeping the general logic safe is welcome.
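To make the requirement more concrete, here is a user-space toy in Python showing one possible shape for such a mechanism. It is a sketch under assumptions, not the EVL implementation: the `StageExclusion` class, its methods and the stage labels are all hypothetical, and letting several users of the same stage share the lock is a design assumption the text above does not state.

```python
import threading


class StageExclusion:
    """Toy lock: shared among users of one stage, exclusive across stages."""

    def __init__(self):
        self._cond = threading.Condition()
        self._stage = None   # stage currently holding the lock, if any
        self._holders = 0    # number of holders from that stage

    def try_acquire(self, stage):
        """Take the lock for `stage` unless the other stage holds it."""
        with self._cond:
            if self._holders and self._stage != stage:
                return False
            self._stage = stage
            self._holders += 1
            return True

    def acquire(self, stage):
        """Blocking variant: wait until the other stage has dropped the lock."""
        with self._cond:
            while self._holders and self._stage != stage:
                self._cond.wait()
            self._stage = stage
            self._holders += 1

    def release(self):
        """Drop one hold; wake up waiters once the last holder is gone."""
        with self._cond:
            self._holders -= 1
            if self._holders == 0:
                self._stage = None
                self._cond.notify_all()
```

In the audio example, the playback loop would hold the lock as "oob" for the duration of the transfer, while a mixer update would attempt to take it as "inband" and either fail or wait until the transfer is over.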
Documentation about benchmarking EVL is on its way. It will cover the GPIO latency test, and recommendations for measuring the worst case latency with EVL and other real-time Linux infrastructures.
The GPIO latency test was merged into the latmus program. This code is paired with a fairly simple Zephyr-based application which implements the latency monitor we need for measuring the response time of a user-space thread running on the system under test to GPIO events the monitoring device generates. Using this combo, we can even measure the response time of non-EVL, strictly single-kernel configurations, to GPIO events.
Also, Dovetail and EVL were upgraded to kernel 5.5-rc4. Trivial merge, no issue.
Dovetail was ported from kernel 5.4 to kernel 5.5-rc2. Not much fuss, except maybe on the ARM side with the rebase of the arch-specific vDSO support (arch/arm/vdso) on the generic vDSO implementation (lib/vdso) which took place upstream. This turned out to be an opportunity to rebase the originally ARM-specific vDSO support for user-mappable clocksources to the generic vDSO implementation. Except for these, other changes were fairly trivial to merge into kernel 5.5-rc2.
As I was debugging some core Xenomai issue on x86, I stumbled over the reason for CONFIG_MAXSMP breaking the interrupt pipelines for ages: enabling this kernel feature extends the range of interrupt vectors to 512K. Guess what would happen with the 3-level interrupt log both the I-pipe and Dovetail were using so far, which cannot address more than 256K vectors? Both gained a 4-level interrupt log this week to cover the entire vector space when CONFIG_MAXSMP is enabled, which did not go without a vague feeling of wearing a dunce cap.
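The arithmetic behind those limits is easy to check, assuming each level of the log is a bitmap as wide as a 64-bit machine word (that fan-out is an assumption, the entry only gives the 256K and 512K figures). The Python helpers below are illustrative names, not pipeline code.

```python
BITS_PER_WORD = 64  # assumed fan-out per level (one machine word on x86_64)


def log_capacity(levels, bits=BITS_PER_WORD):
    """Number of vectors a multi-level bitmap log can address."""
    return bits ** levels


def levels_needed(nr_vectors, bits=BITS_PER_WORD):
    """Smallest number of levels able to address nr_vectors."""
    levels = 1
    while log_capacity(levels, bits) < nr_vectors:
        levels += 1
    return levels


def split_vector(vector, levels, bits=BITS_PER_WORD):
    """Per-level bit positions for a vector, top level first."""
    positions = []
    for _ in range(levels):
        positions.append(vector % bits)
        vector //= bits
    return positions[::-1]
```

With these assumptions, `log_capacity(3)` is 262144 (256K), which falls short of a 512K vector space, while `levels_needed(512 * 1024)` is 4. For instance, vector 300000 only fits in the 4-level layout, where `split_vector(300000, 4)` gives `[1, 9, 15, 32]`.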
EVL timers were documented.
EVL cross-buffers were documented.
This week, the EVL core eventually got rid of the SMP scalability issue which plagues the Cobalt core. The ugly big spinlock (aka nklock) which still serializes all operations inside the latter is now gone from the EVL core, after a long incremental process which took place over several months, introducing per-CPU serialization in the core scheduler and rewriting portions of the EVL thread synchronization support like waitqueues and mutexes in the same move.
Implemented the EVL shim library which mimics the behavior of the EVL API based on plain POSIX calls from the native *libc. It comes in handy for quick prototyping or debugging some application code, whenever the real-time guarantees delivered by the EVL core are not required.
Out-of-band IRQs are now flowing faster through the pipeline until the EVL core can reschedule upon them. We are now on par with the I-pipe performance-wise, but with a cleaner integration, a less intrusive implementation, and a much saner locking model for interrupt descriptors.
Since a companion core exhibits a separate scheduler, there is no point in waiting patiently for the main kernel logic to finish switching its own tasks before preempting it upon out-of-band IRQ. Several micro-benchmarks on ARM and arm64 revealed that a significant portion of the maximum latency was induced by switching contexts atomically. Dovetail now supports generic non-atomic context switching for the in-band stage which allows the EVL core to preempt during this process, reducing even further the wakeup latency of out-of-band threads under high memory stress, especially when sluggish outer cache controllers are part of the picture.
The proxy now supports a ‘write granularity’ feature, in order to define a fixed size for writing chunks of data to the target in-band file. Because of this, the proxy is available for interfacing with in-band files with special requirements, like writing exactly 64-bit words to eventfd(2) objects, which turns the proxy into yet another mechanism for synchronizing out-of-band and in-band threads.
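The eventfd(2) constraint is easy to observe from plain in-band code. The Python sketch below (Linux only, Python 3.10 or later for `os.eventfd`) does not involve the proxy at all: `chunk_by_granularity` is a hypothetical helper which merely mimics cutting a byte stream into fixed-size writes, and holding back the trailing partial chunk is an assumption made for the illustration.

```python
import errno
import os
import struct


def chunk_by_granularity(data, granularity):
    """Split data into fixed-size chunks, returning (chunks, leftover)."""
    full = len(data) - len(data) % granularity
    chunks = [data[i:i + granularity] for i in range(0, full, granularity)]
    return chunks, data[full:]


def demo():
    """Feed an eventfd with 8-byte words, then try a short write."""
    efd = os.eventfd(0)
    try:
        # Two native-endian 64-bit words followed by one stray byte.
        payload = struct.pack("=QQ", 1, 2) + b"\x07"
        chunks, leftover = chunk_by_granularity(payload, 8)
        for chunk in chunks:
            os.write(efd, chunk)  # each write adds its value to the counter
        total = struct.unpack("=Q", os.read(efd, 8))[0]
        try:
            os.write(efd, leftover)  # fewer than 8 bytes: the kernel refuses it
            short_errno = None
        except OSError as exc:
            short_errno = exc.errno
        return total, short_errno
    finally:
        os.close(efd)
```

On a Linux system, `demo()` returns the accumulated counter value 3 along with `errno.EINVAL` for the short write, which is the reason why a writer must stick to 8-byte units with such files.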
Eventually finalized the out-of-band polling services for EVL. Since each instance of an EVL element is represented by a regular file, we can wait for events on the corresponding file descriptor via a common synchronous multiplexing mechanism. For instance, we could wait for an event group to have some event pending, or a proxy to receive some data from the in-band peer.
The GPIOLIB framework can now handle requests from out-of-band EVL threads, which means that any GPIO chip driver based on this generic GPIO core can deliver on ultra-low latency requirements. This is the first illustration that we need no specific driver model with the EVL core in order to cope with real-time duties in drivers: we can leverage the out-of-band operations Dovetail provides us as part of the common file operations defined by the VFS.
Just finished porting Dovetail to x86_64, which almost enabled the EVL core on this architecture in the same move given the very few arch-specific bits this core requires. Enabling libevl on this architecture was a trivial task for the same reason. So we now have support for ARM, arm64 and x86_64.
Added a scheduling policy for temporal partitioning (SCHED_TP). Useful when ARINC 653-like scheduling is required.
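As a reminder of what temporal partitioning means in general, the Python sketch below looks up which partition owns the CPU at a given time within a repeating frame. The frame layout, the 10 ms period and all names are hypothetical; this illustrates the scheduling idea only, and says nothing about how SCHED_TP is configured or implemented in EVL.

```python
from bisect import bisect_right

FRAME_MS = 10  # hypothetical length of the repeating global time frame

# Hypothetical layout: (window start offset in ms, partition id).
# None marks an idle window during which no partition may run.
WINDOWS = [(0, 0), (4, 1), (7, None)]


def active_partition(t_ms, windows=WINDOWS, frame_ms=FRAME_MS):
    """Return the partition allowed to run at time t_ms.

    Windows must be sorted by start offset, the first one starting at 0.
    """
    offset = t_ms % frame_ms  # position inside the current frame
    starts = [start for start, _ in windows]
    # Last window whose start offset is not greater than the current offset.
    return windows[bisect_right(starts, offset) - 1][1]
```

With this layout, threads of partition 0 may only run during the first 4 ms of each frame, threads of partition 1 during the next 3 ms, and the last 3 ms are left to nobody, whatever the priorities involved.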
Implementation-wise, the support for EVL timers was merged as a sub-function of the clock element, which simplifies the code.
EVL logger and mapper features are now merged into the proxy element.
Spent some time writing unit tests for libevl. Not exhilarating, but required anyway.
Thirteen years to the day after I respun the Xenomai project with Xenomai 2, it seems a good time to articulate the lessons learned from the strengths and weaknesses of Xenomai in a modern real-time core implementation based on Dovetail. After months originally spent working on the ‘steely’ core, I have decided to send the whole thing back to the drawing board instead of continuing this effort: this needs more than a revamping of the implementation, since the way such a companion core integrates into the main kernel has to be re-designed.
This decision eventually led to developing the EVL core, which originally reused portions of Cobalt’s proven infrastructure such as the scheduler and thread management system. On the other hand, the kernel-to-core and user-to-core interfaces were entirely rewritten, dropping RTDM and POSIX altogether. The portion of inherited bits was expected to decrease over time, especially in the wake of improving SMP scalability, which eventually happened a year later.
The groundwork for Dovetail was laid during this period, by introducing a new high-priority execution stage into the mainline kernel logic. On this so-called out-of-band stage, we can run an interrupt pipeline and provide alternate scheduling support to any companion software core we may want to embed. The main kernel can offload work which has to meet ultra-low latency requirements to such a core, without requiring the entire kernel machinery to abide by the real-time rules only for a handful of threads which actually need this.
During this same period, a streamlined version of Xenomai’s Cobalt core codenamed ‘steely’, which would only provide a POSIX API, was developed for the purpose of testing the Dovetail interface.
Started working on Dovetail.
Last modified: Sun, 16 Feb 2020 17:27:28 CET