Developer Notes

Generic

Turn on debug options in the kernel configuration!

During the development phase, do yourself a favour: turn on CONFIG_DEBUG_IRQ_PIPELINE and CONFIG_DEBUG_DOVETAIL.

The first one will catch many nasty issues, such as calling unsafe in-band code from out-of-band context. The second one checks the integrity of the alternate scheduling support, detecting issues in the architecture port.

The runtime overhead induced by enabling these options is marginal. Just don’t port Dovetail or implement out-of-band client code without them enabled in your target kernel, seriously.

Serialize stages when accessing shared data

If some writable data is shared between in-band and out-of-band code, you have to guard against out-of-band code preempting or racing with the in-band code which accesses the same data. This is required to prevent dirty reads and dirty writes.
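
For instance, a hard spinlock may guard such data: it serializes CPUs like a raw spinlock while also disabling interrupts in the CPU, keeping both stages away from the critical section. Here is a minimal sketch, assuming Dovetail's hard spinlock declarator (DEFINE_HARD_SPINLOCK()) and the fact that such locks are manipulated with the regular raw_spin_lock operators; the data and function names are hypothetical:

static DEFINE_HARD_SPINLOCK(shared_guard);
static u64 shared_counter;	/* hypothetical data written by both stages */

void bump_shared_counter(void)	/* callable from either stage */
{
	unsigned long flags;

	/*
	 * Interrupts are hard disabled while the lock is held, so
	 * neither stage may preempt this section on this CPU. Keep
	 * it short, and call no unsafe in-band service from it.
	 */
	raw_spin_lock_irqsave(&shared_guard, flags);
	shared_counter++;
	raw_spin_unlock_irqrestore(&shared_guard, flags);
}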

Check that the pipeline torture tests pass

Before any consideration is made to implement out-of-band code on a platform, check that interrupt pipelining is sane there by enabling CONFIG_IRQ_PIPELINE_TORTURE_TEST in the configuration. As its name suggests, this option enables test code which exercises the interrupt pipeline core and related features such as the proxy tick device.

Since the torture tests need to enable the out-of-band stage for their own purpose, you may have to disable any Dovetail-based autonomous core in the kernel configuration for running those tests, like switching off CONFIG_EVL if the EVL core is enabled.

When those tests pass, the following output should appear in the kernel log:

Starting IRQ pipeline tests...
IRQ pipeline: high-priority torture stage added.
irq_pipeline-torture: CPU0 initiates stop_machine()
irq_pipeline-torture: CPU3 responds to stop_machine()
irq_pipeline-torture: CPU1 responds to stop_machine()
irq_pipeline-torture: CPU2 responds to stop_machine()
irq_pipeline-torture: CPU1: proxy tick registered (62.50MHz)
irq_pipeline-torture: CPU2: proxy tick registered (62.50MHz)
irq_pipeline-torture: CPU3: proxy tick registered (62.50MHz)
irq_pipeline-torture: CPU0: proxy tick registered (62.50MHz)
irq_pipeline-torture: CPU0: irq_work handled
irq_pipeline-torture: CPU0: in-band->in-band irq_work trigger works
irq_pipeline-torture: CPU0: stage escalation request works
irq_pipeline-torture: CPU0: irq_work handled
irq_pipeline-torture: CPU0: oob->in-band irq_work trigger works
IRQ pipeline: torture stage removed.
IRQ pipeline tests OK.

Otherwise, if any of those tests fails, the IRQ pipeline definitely needs fixing.

Know how to differentiate safe from unsafe in-band code

Not all in-band kernel code is safe to call from out-of-band context; actually, most of it is unsafe for doing so.

Code is deemed safe in this respect when you are 101% sure that it never does, directly or indirectly, any of the following:

  • attempts to reschedule in-band wise, meaning that schedule() would end up being called. The rule of thumb is that any section of code traversing the might_sleep() check cannot be called from out-of-band context.

  • takes a spinlock of any regular type like raw_spinlock_t or spinlock_t. The former would affect the virtual interrupt disable flag, which is invalid outside of the in-band context; the latter might reschedule if CONFIG_PREEMPT is enabled.

In the early days of dual kernel support in Linux, some people would mistakenly invoke the do_gettimeofday() routine from an out-of-band context in order to get a wallclock timestamp for their real-time code. Doing so would create a deadlock if the out-of-band code preempted some in-band code running do_gettimeofday(), then re-entered the same routine on the same CPU. The out-of-band code would then spin indefinitely, waiting for the in-band context to leave the routine - which cannot happen by design - leading to a lockup. Nowadays, enabling CONFIG_DEBUG_IRQ_PIPELINE would be enough to detect such a mistake early enough to preserve your mental health.
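
By way of illustration, here is that anti-pattern transposed to a current kernel (a sketch; the handler is hypothetical, with ktime_get_real_ts64() standing in for do_gettimeofday() as the seqcount-protected in-band wallclock reader):

static irqreturn_t bogus_oob_handler(int irq, void *dev_id)
{
	/* Hypothetical out-of-band interrupt handler. */
	struct timespec64 ts;

	/*
	 * WRONG: this routine spins on an in-band seqcount; if the
	 * writer was preempted on this CPU by the out-of-band
	 * context, the read loop never terminates.
	 */
	ktime_get_real_ts64(&ts);

	return IRQ_HANDLED;
}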

Careful with disabling interrupts in the CPU

When pipelining is enabled, use hard interrupt protection with caution, especially from in-band code. Not only might this send latency figures over the top, but it might even cause random lockups should a rescheduling happen while interrupts are hard disabled.
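
If a hard-protected section is unavoidable, keep it short. A sketch, assuming Dovetail's hard interrupt protection helpers (hard_local_irq_save()/hard_local_irq_restore()):

	unsigned long flags;

	flags = hard_local_irq_save();	/* interrupts off in the CPU */
	/*
	 * Keep this very short: no rescheduling, no unsafe in-band
	 * service, since both stages are blocked on this CPU here.
	 */
	hard_local_irq_restore(flags);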

Dealing with spinlocks

Converting regular kernel spinlocks (e.g. spinlock_t, raw_spinlock_t) to hard spinlocks should involve a careful review of every section covered by such a lock, as sketched after this list. Any such section would then inherit the following requirements:

  • no unsafe in-band kernel service should be called within the section.

  • the section covered by the lock should be short enough to keep interrupt latency low.
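
As mentioned above, converting a raw_spinlock_t is typically a declaration change, since hard spinlocks are assumed to share the raw_spin_lock operators; converting a spinlock_t also entails switching its call sites to those operators. A sketch with a hypothetical lock name:

	/* Before: pipeline-unsafe if taken from out-of-band context. */
	static DEFINE_RAW_SPINLOCK(chip_lock);

	/* After: hard lock, disabling interrupts in the CPU. */
	static DEFINE_HARD_SPINLOCK(chip_lock);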

Enable RAW_PRINTK support for printk-like debugging

Unless you are lucky enough to have an ICE for debugging hard issues involving out-of-band contexts, you might have to resort to basic printk-style debugging over a serial line. Although the printk() machinery can be used from out-of-band context when Dovetail is enabled, you should rather use the raw_printk() interface for this.
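
For instance, assuming CONFIG_RAW_PRINTK is enabled and the active console driver implements the raw output handler it requires, a trace point in an out-of-band handler might look like this (format specifiers follow the usual printk() conventions):

	/* irq and delta_ns are hypothetical locals */
	raw_printk("oob: irq=%d latency=%llu\n", irq, delta_ns);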

Fundamentally preemption-safe contexts

In a few contexts, we may traverse code using unprotected, preemption-sensitive per-CPU accessors without specifically disabling preemption, because at least one of the following conditions holds:

  • if preempt_count() bears either the PIPELINE_MASK or STAGE_MASK bit, preemption is turned off, therefore CPU migration cannot happen (debug_smp_processor_id() and the preemption checks in per-CPU accessors would properly detect such a context too).

  • if we are running over the context of the in-band stage’s event log syncer (sync_current_stage()) playing a deferred interrupt, in which case the virtual interrupt disable bit is set, so no CPU migration may occur either.

For instance, the following contexts qualify:

  • clockevents_handle_event(), which is called either from the oob stage (therefore STAGE_MASK is set) when the proxy tick device is active on the CPU (see /dovetail/porting/timer/#proxy-tick-logic), or from the in-band stage playing a timer interrupt event from the corresponding device.

  • any IRQ flow handler from kernel/irq/chip.c. When called from generic_pipeline_irq() for pushing an external event to the pipeline, on_pipeline_entry() is true, which indicates that PIPELINE_MASK is set. When called for playing a deferred interrupt on the in-band stage, the virtual interrupt disable bit is set.

Checking for the out-of-band interrupt property

The IRQF_OOB action flag should not be used for testing whether an interrupt is out-of-band, because out-of-band handling may be turned on/off dynamically on an IRQ descriptor using irq_switch_oob(), which would not translate to IRQF_OOB being set/cleared for the attached action handlers.

irq_is_oob() is the right way to check for out-of-band handling.
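
In short (a sketch, assuming irq_is_oob() takes the IRQ number like irq_switch_oob() does):

	/* WRONG: action flags may not track dynamic oob switching. */
	if (action->flags & IRQF_OOB)
		...

	/* RIGHT: query the pipeline about the current IRQ state. */
	if (irq_is_oob(irq))
		...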

stop_machine() hard disables interrupts

The stop_machine() service guarantees that all online CPUs are spinning non-preemptible in a known code location before a subset of them may safely run a stop-context function. This service is typically useful for live patching the kernel code, or changing global memory mappings, so that no activity could run in parallel until the system has returned to a stable state after all stop-context operations have completed.

When interrupt pipelining is enabled, Dovetail provides the same guarantee by restoring hard interrupt disabling where virtualizing the interrupt disable flag would defeat it.

As these lines are written, all stop_machine() use cases must also exclude any oob stage activity (e.g. ftrace live patching the kernel code for installing tracepoints), or happen before any such activity can ever take place (e.g. KPTI boot mappings). Dovetail makes the basic assumption that stop_machine() cannot get in the way of latency-sensitive processes, simply because the latter could not keep running safely until a call to the former has completed anyway.

However, one should keep an eye on stop_machine() usage upstream, identifying new callers which might cause unwanted latency spots under specific circumstances (maybe even abusing the interface).

Virtual interrupt disable state breakage

When some WARN_ON() triggers due to a wrong interrupt disable state (e.g. entering the softirqs/bh code with IRQs unexpectedly [virtually] disabled), this may be due to the CPU and virtual interrupt states being out-of-sync when traversing the epilogue code after a syscall, IRQ or trap has been handled during the latest kernel entry.

Typically, do_work_pending() or do_notify_resume() should make sure to reconcile both states in the work loop, and also to restore the virtual state they received on entry before returning to their caller.

The routines just mentioned always enter from their assembly call site with interrupts hard disabled in the CPU. However, they may be entered with the virtual interrupt state enabled or disabled, depending on the kind of event which eventually led to them. Typically, a system call epilogue always enters with the virtual state enabled, whereas a fault might also occur while the virtual state is disabled. The epilogue routine called for finalizing some IRQ handling must enter with the virtual state enabled, since the latter is a prerequisite for running such code.

Losing the timer tick

The symptom of a common issue in a Dovetail port is losing the timer interrupt when the autonomous core takes control over the tick device, causing the in-band kernel to stall. After some time spent hanging, the in-band kernel may eventually complain about an RCU stall with a message like INFO: rcu_preempt detected stalls on CPUs/tasks followed by stack dump(s). In other cases, the machine may simply lock up due to an interrupt storm.

This is typical of timer interrupt events not flowing down normally to the in-band kernel anymore because something went wrong as soon as the proxy tick device replaced the regular device for serving in-band timing requests. When this happens, you should check the following code spots for bugs:

  • the timer acknowledge code is wrong once called from the oob stage, which is going to be the case as soon as an autonomous core installs the proxy tick device for interposing on the timer. Being wrong here means performing actions which are not legit from such a context.

  • the irqchip driver managing the interrupt event for the timer tick is wrong somehow, causing such interrupt to stay masked or stuck for some reason whenever it is switched to out-of-band mode. You need to double-check the implementation of the chip handlers, considering the effects and requirements of interrupt pipelining.

  • power management (CONFIG_CPUIDLE) gets in the way, often due to the infamous C3STOP misfeature turning off the original timer hardware controlled by the proxy device. A detailed explanation is given in Documentation/irq_pipeline.rst when discussing the few changes to the scheduler core for supporting the Dovetail interface. If this is acceptable from a power saving perspective, having the autonomous core prevent the in-band kernel from entering a deeper C-state is enough to fix the issue, by overriding the irq_cpuidle_control() routine as follows:

bool irq_cpuidle_control(struct cpuidle_device *dev,
			 struct cpuidle_state *state)
{
	/*
	 * Deny entering sleep state if this entails stopping the
	 * timer (i.e. C3STOP misfeature).
	 */
	if (state && (state->flags & CPUIDLE_FLAG_TIMER_STOP))
		return false;

	return true;
}

Printk-debugging such a timer issue requires enabling raw_printk() support; you won't get away with tracing the kernel behavior using the plain printk() routine, because most of the output would remain stuck in a buffer, never reaching the console driver before the board eventually hangs.

Hard interrupt masking in clock chip handlers

The only valid way of sharing a clock tick device between the in-band and out-of-band stages is to access it through the tick proxy. For this reason, we don't need to enforce hard interrupt masking in clock chip handlers to make them pipeline-safe: once proxying is active for a tick device, hardware interrupts are off across calls to its handlers whenever required. As a result, either all accesses to the clock chip handlers are proxied and proper masking is already in place, or there is no proxy, which means the in-band kernel still controls the device, in which case out-of-band accesses cannot conflict with it.

Obviously, this fact does not remove the need to serialize CPUs when accessing shared data there, in which case hard locking should be in place to ensure this.

Make no assumptions when virtualizing arch_local_irq_restore()

Do not make any assumption with respect to the current interrupt state when arch_local_irq_restore() is called; specifically, don't expect the in-band stage to be stalled on entry. Some architectures use constructs like the following, which break that assumption:

	local_save_flags(flags);
	local_irq_enable();
	...
	local_irq_restore(flags);

In that case, we do want the stall bit to be restored unconditionally from flags. A correct implementation would be:

static inline notrace void arch_local_irq_restore(unsigned long flags)
{
	inband_irq_restore(arch_irqs_disabled_flags(flags));
	barrier();
}

RCU and out-of-band context

The out-of-band context is semantically equivalent to the NMI context, therefore the current CPU cannot be in an extended quiescent state RCU-wise if running oob. CAUTION: the converse assertion is NOT true (i.e. a CPU running code on the in-band stage may be idle RCU-wise).

Common services which are safe in out-of-band context

Dovetail guarantees that the following services are safe to call from the out-of-band stage:

  • irq_work_queue() may be called from the out-of-band stage to schedule a (synthetic) interrupt in the in-band stage (see the sketch after this list).

  • __raise_softirq_irqoff() may be called from the out-of-band stage to schedule a softirq. For instance, the EVL network stack uses this to kick the NET_TX_SOFTIRQ event in the in-band stage.

  • printk() may be called from any context, including out-of-band interrupt handlers. Messages are queued, then passed to the output device(s) only when the in-band stage resumes though.
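
As an illustration of the first item above, deferring work from an out-of-band handler might look like this (a sketch with hypothetical names; IRQ_WORK_INIT() and irq_work_queue() belong to the regular irq_work API):

static void handle_inband(struct irq_work *work)
{
	/* Runs on the in-band stage, from a synthetic IRQ. */
	pr_info("request relayed from the oob stage\n");
}

static struct irq_work relay_work = IRQ_WORK_INIT(handle_inband);

static irqreturn_t oob_handler(int irq, void *dev_id)
{
	/* Hypothetical out-of-band handler deferring work in-band. */
	irq_work_queue(&relay_work);

	return IRQ_HANDLED;
}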

ARM

Context assumption with outer L2 cache

There is no reason for the outer cache to be invalidated/flushed/cleaned from an out-of-band context; all cache maintenance operations must happen from in-band code. Therefore, we neither need nor want to convert the spinlock serializing access to the cache maintenance operations for L2 to a hard lock.

The above assumption is unfortunately only partially right: at some point in the future, we may want to run DMA transfers from the out-of-band context, which could entail cache maintenance operations.

Converting that lock to a hard lock may cause latency to skyrocket on some i.MX6 hardware equipped with PL22x cache units, or PL31x units subject to errata 588369 or 727915 on particular hardware revisions, since each background operation would be polled for completion with hard irqs disabled, in order to work around some silicon bug.

