Device I/O

This section explains the changes required to add out-of-band I/O capabilities to an existing network interface controller driver from the stock Linux kernel. It does not explain how to write a NIC driver; it assumes that you already know the basics of kernel development and how such a driver is implemented. The changes described below are not Ethernet-specific. We will focus on extending a NAPI-conformant driver, which covers most network drivers these days.

All code snippets in this section are extracted from the implementation of the Freescale FEC driver for Linux v6.6 with the Dovetail changes to support out-of-band traffic.

Out-of-band I/O support in a NIC driver

In order for an out-of-band network stack to send and receive packets directly from the out-of-band execution stage, we have to extend the driver code as follows:

  • Call netdev_set_oob_capable() for the network device to advertise out-of-band capabilities. Setting this flag basically means that the driver will provide the necessary handlers and support for out-of-band I/O operations.
static int
fec_probe(struct platform_device *pdev)
{
	struct fec_enet_private *fep;
	struct fec_platform_data *pdata;
	...
	if (IS_ENABLED(CONFIG_FEC_OOB)) {
		netdev_set_oob_capable(ndev);
		netdev_info(ndev, "FEC device is oob-capable\n");
	}
	...
}
  • Provide the required handlers for turning on/off the out-of-band mode. When a companion core wants a network device to accept out-of-band traffic, the driver receives a call to the .ndo_enable_oob() handler if registered in its struct net_device_ops descriptor, see netif_enable_oob_diversion(). Conversely, the .ndo_disable_oob() handler may be called to turn off out-of-band mode if registered, see netif_disable_oob_diversion().

    A driver implementing these handlers would take all the necessary steps to enable or disable out-of-band IRQ delivery for its interrupt sources. Switching the delivery mode is performed by calling irq_switch_oob() for the proper IRQ channels. This should be done for every interrupt coming from the network interface controller which participates in handling the I/O traffic.

/* From drivers/net/ethernet/freescale/fec_main.c */

#ifdef CONFIG_FEC_OOB

static int fec_enable_oob(struct net_device *ndev)
{
	struct fec_enet_private *fep = netdev_priv(ndev);
	int nr_irqs = fec_enet_get_irq_cnt(fep->pdev), n, ret = 0;

	napi_disable(&fep->napi);
	netif_tx_lock_bh(ndev);

	for (n = 0; n < nr_irqs; n++) {
		ret = irq_switch_oob(fep->irq[n], true);
		if (ret) {
			while (--n >= 0)
				irq_switch_oob(fep->irq[n], false);
			break;
		}
	}

	netif_tx_unlock_bh(ndev);
	napi_enable(&fep->napi);

	return ret;
}

static void fec_disable_oob(struct net_device *ndev)
{
	struct fec_enet_private *fep = netdev_priv(ndev);
	int nr_irqs = fec_enet_get_irq_cnt(fep->pdev), n;

	napi_disable(&fep->napi);
	netif_tx_lock_bh(ndev);

	for (n = 0; n < nr_irqs; n++)
		irq_switch_oob(fep->irq[n], false);

	netif_tx_unlock_bh(ndev);
	napi_enable(&fep->napi);
}

#endif	/* CONFIG_FEC_OOB */

[snip]

static const struct net_device_ops fec_netdev_ops = {
	.ndo_open		= fec_enet_open,
	.ndo_stop		= fec_enet_close,
	.ndo_start_xmit		= fec_enet_start_xmit,
	.ndo_select_queue       = fec_enet_select_queue,
	.ndo_set_rx_mode	= set_multicast_list,
	.ndo_validate_addr	= eth_validate_addr,
	.ndo_tx_timeout		= fec_timeout,
	.ndo_set_mac_address	= fec_set_mac_address,
	.ndo_eth_ioctl		= fec_enet_ioctl,
#ifdef CONFIG_NET_POLL_CONTROLLER
	.ndo_poll_controller	= fec_poll_controller,
#endif
#ifdef CONFIG_FEC_OOB
	.ndo_enable_oob		= fec_enable_oob,
	.ndo_disable_oob	= fec_disable_oob,
#endif
	.ndo_set_features	= fec_set_features,
};
  • The packet transmission handler to the hardware (aka the hard transmit routine, .ndo_start_xmit()) is called both by the regular/main network stack and by the companion core to pass an outgoing packet to the driver. As a result, this handler may run in-band or out-of-band depending on the caller: this is the fundamental difference introduced by Dovetail for an oob-capable driver. Either way, the driver prepares the packet to be picked up by the DMA engine of the network controller. We need to protect this handler from concurrent accesses from the converse execution stage on other CPUs (e.g. in-band senders elsewhere while running out-of-band on the local CPU). For this, Dovetail expects the companion core to implement the netif_tx_lock_oob and netif_tx_unlock_oob hooks for serializing inter-stage access to a transmit queue.
static netdev_tx_t
fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
	struct fec_enet_private *fep = netdev_priv(ndev);
	int entries_free;
	unsigned short queue;
	struct fec_enet_priv_tx_q *txq;
	struct netdev_queue *nq;
	int ret = 0;

	queue = skb_get_queue_mapping(skb);
	txq = fep->tx_queue[queue];
	nq = netdev_get_tx_queue(ndev, queue);

	/*
	 * Lock out any sender running from the alternate execution
	 * stage on other CPUs (i.e. oob vs in-band). Clearly, in-band
	 * tasks should refrain from sending output through an
	 * oob-enabled device when aiming at the lowest possible
	 * latency for the oob players, but we still allow shared use
	 * for flexibility, which comes in handy when a single NIC
	 * only is available to convey both kinds of traffic.
	 */
	netif_tx_lock_oob(nq);

	if (skb_is_gso(skb))
		ret = fec_enet_txq_submit_tso(txq, skb, ndev);
	else
		ret = fec_enet_txq_submit_skb(txq, skb, ndev);
	if (ret) {
		netif_tx_unlock_oob(nq);
		return ret;
	}

	if (running_inband()) {
		entries_free = fec_enet_get_free_txdesc_num(txq);
		if (entries_free <= txq->tx_stop_threshold)
			netif_tx_stop_queue(nq);
	}

	netif_tx_unlock_oob(nq);

	return NETDEV_TX_OK;
}
  • Once interrupts coming from the NIC are delivered to the driver from the out-of-band stage, and the hard transmit handler can be called from either the in-band or out-of-band stage, the RX and TX code paths in the driver may be traversed from either stage. We have to adapt them accordingly. The way to do this depends on the original implementation, but the following rules apply to any driver:

    • regular [raw_]spinlocks in those code paths must be converted to hard spinlocks, so they can be acquired from either stage. As usual, a careful check is required to make sure that such a conversion would not entail latency spikes for other real-time activities.

    • DMA streaming operations should be converted to rely on pre-mapped socket buffers, since we may not request DMA mappings when running out-of-band. For this purpose, the Dovetail interface to out-of-band networking extends the page pool API with a set of oob-oriented features, which includes pre-mapping. However, synchronization calls for DMA memory (dma_sync_*_for_{device, cpu}()) are usually safe from both execution stages (except on legacy systems which have to resort to the software I/O TLB, but using bounce buffers does not qualify for low-latency performance anyway).

      As an example, the FEC driver is NAPI-based, and uses a page pool to obtain the memory pages for backing the socket buffers on RX. We simply enable this pool for out-of-band operations (PP_FLAG_PAGE_OOB).

static int
fec_enet_create_page_pool(struct fec_enet_private *fep,
			  struct fec_enet_priv_rx_q *rxq, int size)
{
	struct page_pool_params pp_params = {
		.order = 0,
		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.pool_size = size,
		.nid = dev_to_node(&fep->pdev->dev),
		.dev = &fep->pdev->dev,
		.dma_dir = DMA_FROM_DEVICE,
		.offset = FEC_ENET_XDP_HEADROOM,
		.max_len = FEC_ENET_RX_FRSIZE,
	};
	int err;

	if (fec_net_oob()) { /* Use oob-capable page pool. */
		pp_params.flags |= PP_FLAG_PAGE_OOB;
		/* An oob pool can't grow, so plan for extra space. */
		pp_params.pool_size *= 2;
	}

	rxq->page_pool = page_pool_create(&pp_params);
	if (IS_ERR(rxq->page_pool)) {
		err = PTR_ERR(rxq->page_pool);
		rxq->page_pool = NULL;
		return err;
	}
	...
}

Next, we retrieve the pre-mapped DMA address of the backing pages instead of mapping the buffers on the fly, and on completion we only synchronize the CPU caches instead of unmapping the buffers.

static dma_addr_t get_dma_mapping(struct sk_buff *skb,
				struct device *dev, void *ptr,
				size_t size, enum dma_data_direction dir)
{
	dma_addr_t addr;

	if (!fec_net_oob() || !skb_has_oob_storage(skb))
		return dma_map_single(dev, ptr, size, dir);

	/*
	 * An oob-managed storage is already mapped by the page pool
	 * it belongs to. We only need to let the device get at the
	 * pre-mapped DMA area for the specified I/O direction.
	 */
	addr = skb_oob_storage_addr(skb);
	dma_sync_single_for_device(dev, addr, size, dir);

	return addr;
}

static void release_dma_mapping(struct sk_buff *skb,
				struct device *dev, dma_addr_t addr, size_t size,
				enum dma_data_direction dir)
{
	if (!fec_net_oob() || !skb || !skb_has_oob_storage(skb)) {
		dma_unmap_single(dev, addr, size, dir);
	} else {
		/*
		 * An oob-managed storage should not be unmapped, this
		 * operation is handled when required by the page pool
		 * it belongs to. We only need to synchronize the CPU
		 * caches for the specified I/O direction.
		 */
		dma_sync_single_for_cpu(dev, addr, size, dir);
	}
}

static int fec_enet_txq_submit_skb(struct fec_enet_priv_tx_q *txq,
				   struct sk_buff *skb, struct net_device *ndev)
{
	...
	/* Push the data cache so the CPM does not get stale memory data. */
	addr = get_dma_mapping(skb, &fep->pdev->dev, bufaddr, buflen, DMA_TO_DEVICE);
	if (dma_mapping_error(&fep->pdev->dev, addr)) {
		dev_kfree_skb_any(skb);
		if (net_ratelimit())
			netdev_err(ndev, "Tx DMA memory map failed\n");
		return NETDEV_TX_OK;
	}

	if (nr_frags) {
		last_bdp = fec_enet_txq_submit_frag_skb(txq, skb, ndev);
		if (IS_ERR(last_bdp)) {
			release_dma_mapping(skb, &fep->pdev->dev, addr,
					buflen, DMA_TO_DEVICE);
			dev_kfree_skb_any(skb);
			return NETDEV_TX_OK;
		}
		...
	}
	...
}
	

Network device API

Dovetail provides the following kernel interface to companion cores for managing the devices involved in out-of-band networking.


  • bool netif_oob_diversion(const struct net_device *dev)

    Tell whether a device is currently diverting input to a companion core.

  • dev

    The network device to query.


  • void netif_enable_oob_diversion(struct net_device *dev)

    Turn on input diversion on the given network device. If the .ndo_enable_oob() handler is registered in the struct net_device_ops descriptor of the associated NIC driver, it is called to enable out-of-band operations as well. Once enabled, input diversion means that all ingress packets coming from the device are first submitted to the companion core for selection via calls to the netif_deliver_oob() hook.

  • dev

    The network device for which all input packets should be submitted to the companion core.


  • void netif_disable_oob_diversion(struct net_device *dev)

    Turn off input diversion on the given network device. If the .ndo_disable_oob() handler is registered in the struct net_device_ops descriptor of the associated NIC driver, it is called to stop out-of-band operations as well.

  • dev

    The network device which should switch back to in-band operation mode, with all ingress packets it receives flowing directly to the regular network stack.


  • void netif_enable_oob_port(struct net_device *dev)

    Enable the device as an out-of-band network port. From that point, applications may refer to dev in device binding or I/O operations with out-of-band sockets.

  • dev

    The network device to enable as an out-of-band port.


  • void netif_disable_oob_port(struct net_device *dev)

    Stop using the device as an out-of-band network port.

  • dev

    The network device which is no longer an out-of-band port.


  • bool netdev_is_oob_capable(const struct net_device *dev)

    Tell whether a device is able to handle traffic from the out-of-band stage, i.e. if netdev_set_oob_capable() was called for the device.

  • dev

    The network device to query.

  • A true return value only means that the device could handle out-of-band traffic directly from the out-of-band execution stage; it does not mean that this operating mode is currently enabled. The latter happens when netif_enable_oob_diversion() is called.


The Dovetail interface relies in part on the companion core for supporting out-of-band network I/O, by means of the following weakly bound routines which the companion core must implement.


  • __weak bool netif_deliver_oob(struct sk_buff *skb)

    This routine receives the next ingress network packet, stored in a socket buffer, which the companion core may take for out-of-band handling or leave to the in-band stack. Only packets received from devices for which out-of-band diversion is enabled are sent to this handler.

  • skb

    The socket buffer received from the driver.

  • netif_deliver_oob() should return a boolean status telling the caller whether it has picked the packet for out-of-band handling (true), or whether the packet should be left to the in-band network stack for regular handling instead (false).

    This routine may be called from either the in-band or out-of-band execution stage, depending on whether the issuing driver is operating in out-of-band mode.


  • __weak void netif_tx_lock_oob(struct netdev_queue *txq)

    This call should serialize callers from the converse Dovetail execution stage (i.e. in-band vs. out-of-band). There is no requirement for serializing callers which belong to the same stage, since the calling network stack must already ensure non-concurrent execution in contexts which may access the transmit queue. Typically, the EVL network stack would use a stage exclusion lock for this purpose.

    Each call to netif_tx_lock_oob is paired with a converse call to netif_tx_unlock_oob. The Dovetail interface does not perform recursive locking.

  • txq

    The transmit queue to lock.

  • This routine may be called from any execution stage.


  • __weak void netif_tx_unlock_oob(struct netdev_queue *txq)

    This routine unlocks a transmit queue previously locked by a call to netif_tx_lock_oob.

  • txq

    The transmit queue to unlock.

  • This routine may be called from any execution stage, but always from the same stage from which the lock was acquired.


  • __weak void process_inband_tx_backlog(struct softnet_data *sd)

    When running out-of-band, the companion core may have to postpone packet transmission to a network device which cannot directly handle traffic from that execution stage. It usually does so by accumulating the egress packets until the in-band network stack resumes in a proper context to issue the pending output. Such a context is the execution of the network TX softirq (aka NET_TX_SOFTIRQ), which calls process_inband_tx_backlog() at the very beginning of its handler, giving the companion core the opportunity to eventually hand over the pending output to the device from the in-band stage, usually by calling dev_queue_xmit().

  • sd

    The softirq data descriptor.


  • __weak bool napi_schedule_oob(struct napi_struct *n)

    This hook should implement out-of-band NAPI scheduling in the context of the out-of-band network stack, analogously to its in-band counterpart. All direct and indirect calls to __napi_schedule() and __napi_schedule_irqoff() from a NIC driver end up triggering the out-of-band NAPI scheduling instead of the in-band one when the caller is currently running on the out-of-band stage.

    What happens under the hood in order to schedule the execution of the NAPI handler from the out-of-band stage is not specified by Dovetail; this is decided by the implementation of the out-of-band network stack in the companion core. Normally, the core should plan for some out-of-band task to call the NAPI poll method, which must have been extended to support out-of-band callers.

  • n

    The NAPI instance to schedule for execution.


  • __weak bool napi_complete_oob(struct napi_struct *n)

    This hook is called by the in-band network stack when napi_complete_done() is invoked by the NIC driver from the out-of-band stage, provided input diversion is enabled for the issuing device.

  • n

    The NAPI instance notifying about RX completion.


  • Last modified: Fri, 15 Nov 2024 16:28:31 +0100