Tuesday, April 22, 2014

OPW, Xen: some initial info

Turns out I've been accepted for OPW 2014!
I'm so thrilled to be working with the Xen Project on improving block I/O support for Linux guest domains.
I'll be trying to document my three-month journey in this blog as part of the program, but writing is really not my thing; also, what I will write is (pretty obviously) just what I think I have understood from my readings and investigations. So, please bear with me and do not hesitate to comment on any mistakes you may find in these posts, if you want to.

I'll be working on adding support for the multi-queue API, recently merged in the Linux kernel, to the block I/O paravirtualized Xen driver for Linux. In the following post, I'll try to provide some initial information about the idea and its motivations; the project was proposed by Konrad Rzeszutek Wilk, to whom I have been assigned as an intern. A more in-depth description can be found in his project outline or in the documents linked from the outline itself, but I'm so happy that I'm looking forward even to writing about it, so here's my take.

Multi-queue support for the block I/O paravirtualized Xen drivers

The kernel component charged with managing block I/O issued by applications is, in Linux, referred to as the block layer. Its tasks include passing block I/O requests on to the device driver, which actually interacts with the block device. Until v3.12, the Linux kernel's block layer offered block device drivers two interfaces, the more commonly used of which is called the request interface. When exposing this interface, the block layer internally stores I/O requests, queueing them in a buffering area; while in this area, requests can be re-ordered (e.g. to reduce penalties due to disk head movements), merged (to reduce per-request processing overhead) or, more generally, subjected to the service policies implemented in the block I/O scheduling sub-component of the block layer. Buffering also serves to limit the submission rate, preventing any hardware buffer from over-running. From a more practical point of view, when in request-based mode, the block layer keeps a simple request queue for each device; it adds new requests to the queue according to its internal policies, and the device driver drains the queue from its head.
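
To make this more concrete, here is a minimal sketch of a request-based driver, with hypothetical names, following the pre-multi-queue kernel API of this era (v3.12 and earlier): the driver supplies a spinlock and a request function, and the block layer calls the latter to let the driver drain the queue.

```c
#include <linux/blkdev.h>

static spinlock_t mydev_lock;	/* becomes the queue's queue_lock */

/* Called by the block layer with mydev_lock held: drains buffered
 * (possibly merged/re-ordered) requests from the head of the queue. */
static void mydev_request_fn(struct request_queue *q)
{
	struct request *rq;

	while ((rq = blk_fetch_request(q)) != NULL) {
		/* ... program the device to service rq ... */
		__blk_end_request_all(rq, 0);	/* 0 = success */
	}
}

static int mydev_init(void)
{
	struct request_queue *q;

	spin_lock_init(&mydev_lock);
	/* Request interface: the block layer buffers, merges and
	 * schedules I/O internally, serialized by the lock we supply. */
	q = blk_init_queue(mydev_request_fn, &mydev_lock);
	return q ? 0 : -ENOMEM;
}
```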

Such an interface was designed when a high-end drive could handle hundreds of I/O operations per second ([1]); in a recent paper ([2]), Linux block layer maintainer Jens Axboe pointed out three causes that make it a bottleneck when used with a high-performance SSD that can handle hundreds of thousands of IOPS.
  1. The request queue is a major source of contention. It is protected by a single spinlock, the queue_lock, which is grabbed every time an I/O request is issued, considered for merging, enqueued in the I/O scheduler's private data structures, dispatched to the driver, or completed.
  2. Hardware interrupts and request completions reduce the effectiveness of caches. A high number of I/O operations per second causes a high number of interrupts, despite the presence of interrupt mitigation ([3]). If one CPU core is in charge of handling all hardware interrupts and forwarding them to other cores as software interrupts, as happens with most modern storage devices ([2]), that single core may spend a considerable amount of time handling interrupts and switching context, thereby polluting its caches. Also, the other cores must each be sent an inter-processor interrupt to execute the I/O completion routine.
  3. Cross-CPU memory accesses exacerbate queue_lock contention. Remote memory accesses are needed every time the completion routine for an I/O request is executed on a CPU core which is different from the one on which the request was last handled. In fact, on completion it is necessary to grab the queue_lock, as the request must be removed from the request queue. As a consequence, the cache line with the last state of the lock (stored in the cache of the core where the lock was last acquired) must be explicitly invalidated.
Efforts to remove this bottleneck focused on two key points: reducing lock contention and reducing remote memory accesses. Even before v3.13, the Linux kernel provided a way to bypass the request queue (the make request interface, as sketched below), but at the cost of forgoing any pre-processing and buffering that the block layer performs on I/O requests.
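
For comparison, here is a sketch of the make request interface in its v3.13-era form (again with hypothetical names): the driver sees each bio as soon as it is submitted, so there is no request queue and no queue_lock, but also no merging, re-ordering or rate limiting done on the driver's behalf.

```c
#include <linux/blkdev.h>
#include <linux/bio.h>

/* Receives every bio directly, bypassing the request queue. */
static void mydev_make_request(struct request_queue *q, struct bio *bio)
{
	/* ... service the bio immediately, e.g. against device memory ... */
	bio_endio(bio, 0);	/* complete it, 0 = success */
}

static int mydev_init(void)
{
	struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

	if (!q)
		return -ENOMEM;
	blk_queue_make_request(q, mydev_make_request);
	return 0;
}
```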

The multi-queue API introduced in the Linux kernel attempts to address these issues by providing device drivers with a third interface to the block layer. The request queue is split into a pool of separate per-CPU or per-NUMA-socket queues. The per-CPU mode, in particular, allows each CPU to queue requests in its private request queue in a lockless fashion, removing the previously critical lock contention; while in these software queues (or submission queues), a request can be subjected to I/O scheduling policies. The device driver, instead, accesses a set of hardware queues (or dispatch queues), whose number matches the number of queues actually supported by the device; these act as a buffer that adjusts the submission rate between the software queues and the actual dispatch to the device driver.
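
In code, a driver opts into the multi-queue interface by describing its dispatch queues and providing a queue_rq callback. The sketch below is loosely modeled on the API as first merged in v3.13 (it was reworked in later releases), and all driver names are hypothetical:

```c
#include <linux/blk-mq.h>
#include <linux/err.h>

/* Invoked per request, per hardware context; submissions reach it
 * through per-CPU software queues, with no single global lock. */
static int mydev_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	/* ... issue rq to the device ... */
	return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops mydev_mq_ops = {
	.queue_rq	= mydev_queue_rq,
	.map_queue	= blk_mq_map_queue,	/* default CPU-to-queue mapping */
};

static struct blk_mq_reg mydev_mq_reg = {
	.ops		= &mydev_mq_ops,
	.nr_hw_queues	= 1,	/* should match the device's real queues */
	.queue_depth	= 64,
	.numa_node	= NUMA_NO_NODE,
	.flags		= BLK_MQ_F_SHOULD_MERGE,
};

static int mydev_init(void *driver_data)
{
	struct request_queue *q = blk_mq_init_queue(&mydev_mq_reg, driver_data);

	return IS_ERR(q) ? PTR_ERR(q) : 0;
}
```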

The Xen hypervisor provides paravirtualized drivers for block I/O to avoid IDE emulation (such as the one provided by QEMU), which is slow. The PV block I/O driver follows Xen's split device driver model, which comprises a backend driver (living in a driver domain, usually dom0, which hosts the hardware-specific drivers) and a frontend driver (used by one of the unprivileged domains, which does not have to deal with hardware-specific or interface-specific details). The two halves of a PV driver share some memory pages, on which they keep one or more rings agreed upon by both ends (usually a single ring, or a pair of tx/rx rings); the rings are accessed by both participants through a producer/consumer protocol that uses an event channel for the needed notifications (e.g. the presence of a new request or response in the ring). This is called the PV protocol ([4]), and is used by all paravirtualized drivers in Xen.
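
To make the producer/consumer protocol a bit more concrete, here is a rough sketch of a frontend's submission path built on the generic ring macros from xen/interface/io/ring.h; the request/response types and the helper are hypothetical (the block driver's real ones are generated in xen/interface/io/blkif.h):

```c
#include <xen/interface/io/ring.h>
#include <xen/events.h>

struct mydev_request  { uint64_t id; /* operation-specific fields */ };
struct mydev_response { uint64_t id; int16_t status; };

/* Generates mydev_sring (the shared page layout) plus mydev_front_ring
 * and mydev_back_ring, the two ends' private views of it. */
DEFINE_RING_TYPES(mydev, struct mydev_request, struct mydev_response);

/* Frontend: place one request on the shared ring, then kick the event
 * channel only if the backend actually needs waking up. */
static void mydev_submit(struct mydev_front_ring *ring, int irq,
			 const struct mydev_request *req)
{
	int notify;

	*RING_GET_REQUEST(ring, ring->req_prod_pvt) = *req;
	ring->req_prod_pvt++;

	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(ring, notify);
	if (notify)
		notify_remote_via_irq(irq);
}
```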

The block PV protocol uses only one ring, which is shared between the driver domain and the guest domain. To avoid allocating a large amount of possibly-unused shared memory, the ring is initially only large enough to queue the requests and responses themselves, not the data they carry. As soon as the guest domain starts issuing requests, the frontend driver sets up a number of shared memory areas (grants) to hold the data. If the request is a write request, the memory areas are filled with the data to be written on behalf of the guest domain, and the necessary access permissions are added for the driver domain. Grant references are added to the request, and the request is queued on the ring. At this point the driver domain is notified that a new request from the guest domain is pending service; when the request is served, the driver domain parses the grant references and uses them to map the related memory areas into its own address space. As soon as the request has been fulfilled, the grants are unmapped from the driver domain's address space, a response is written on the shared ring, and the guest domain is notified of the completion; while handling this notification, the guest domain revokes the driver domain's access permissions on the grants.
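
As an illustration of the granting step on the frontend side, here is a hypothetical helper (error handling omitted; the real xen-blkfront recycles grants from a pool rather than granting a fresh page per request):

```c
#include <xen/grant_table.h>
#include <xen/interface/io/blkif.h>
#include <asm/xen/page.h>	/* pfn_to_mfn() */

/* Grant the backend access to one 4 KiB data page and record the grant
 * reference in segment seg of a block-interface request. For a write
 * the backend only reads the page, so the grant can be read-only; for
 * a read it must be writable. */
static void mydev_attach_segment(struct blkif_request *req, int seg,
				 domid_t backend_id, struct page *page)
{
	int readonly = (req->operation == BLKIF_OP_WRITE);
	grant_ref_t gref;

	gref = gnttab_grant_foreign_access(backend_id,
					   pfn_to_mfn(page_to_pfn(page)),
					   readonly);

	req->u.rw.seg[seg].gref       = gref;
	req->u.rw.seg[seg].first_sect = 0;
	req->u.rw.seg[seg].last_sect  = 7;	/* whole page, in 512 B sectors */
}
```

On completion, the frontend would revoke the backend's access with gnttab_end_foreign_access().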

With respect to the original protocol, the current implementation has benefited from many improvements, such as increasing the maximum number of in-flight I/O requests with indirect descriptors ([5]), reducing contention on a domain's grant lock, and reducing the number of memory [un]mapping operations needed to pass the I/O on to dom0 ([6]).
However, as currently implemented in Linux, the block PV driver still uses a single block thread per ring to handle I/O transmission.

Porting the multi-queue API to the driver should allow it to allocate per-vCPU block threads, giving each vCPU lockless access to a private ring. The performance benefits should include, as in the Linux kernel ([2]), increased throughput and reduced service latency for I/O requests.
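
A very rough sketch of the per-ring frontend state such a port might introduce (all names hypothetical; this is just how I currently picture the goal, not an actual design):

```c
#include <xen/interface/io/blkif.h>
#include <xen/interface/io/ring.h>

/* One instance per vCPU, mapped 1:1 onto a blk-mq hardware context:
 * each vCPU submits through its own ring and event channel, so the
 * common submission path takes no shared lock. */
struct mydev_ring_info {
	struct blkif_front_ring	ring;		/* private view of a dedicated shared ring */
	grant_ref_t		ring_ref;	/* grant reference for the shared ring page */
	unsigned int		evtchn;		/* per-ring event channel */
	unsigned int		irq;		/* Linux irq bound to evtchn */
};
```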