Friday, June 27, 2014

OPW, Linux: The block I/O layer, part 2 - The request interface

Following the many directives and suggestions given by my OPW mentor, I have been implementing the tiniest and simplest block I/O driver, whose multi-queue implementation just fills I/O request structures with available information about the CPU core the process has been issuing I/O from; this is proving very useful to test the multi-queue-capable prototype of the block I/O I am developing. This task gave me the chance to study more in depth the different APIs provided by the block layer of the Linux kernel, and to try to continue the series of blog articles about it.
Here's my take on the second part; it's come out to be more high-level than I would have wanted, but still. Comments, remarks and criticism are welcome, as always.

The block I/O layer, part 2 - The request interface

The request interface bases itself on an essential data structure that allows to the block layer to pass requests to the device driver: the request_queue structure; such a structure has two main buffering functions: it allows for requests to be pre-processed in an attempt to reduce overheads (with merging and sorting) and limits submission rate to prevent hardware buffers from over-running. request_queue structures are allocated per-device, which means that each device has its private request_queue, which can be accessed only by the device-specific block I/O driver in use for the disk. Such a structure is essentially a container of block I/O requests: its more important field, in fact, is a liked list of requests (whose head is kept in the queue_head field) which have been made available to the driver. It also keeps a set of pointer to functions used to submit a bio or dispatch a request to be handled for submission to the device. When in request mode, the block layer also allows for I/O scheduling: therefore, a request_queue structure points to an elevator_queue structure, which in its turn keeps pointers to elevator hooks. Last, but not least, the request_queue structure is protected by a spinlock, the queue_lock.

Figure 1: The block layer's layout when in request mode

If willing to move forward and see what's inside each of the black boxes, a good starting point is a trace. But how do we trace a request cleanly without being bothered by the whole lot of driver-specific and low-level-driver-specific functions, that execute on an interrupt and don't allow us to get a straight-forward trace? We could exploit some ftrace tricks to get the cleanest possible trace (as I clumsily showed in a previous blog article) choosing whether to have a more messy trace or to lose interrupt-related bits, or we can use the null block device driver. I'd personally go for the latter. It comes with the Linux kernel as a module that, when loaded, creates a configurable number of fake block devices (see the nr_devices option of the module), exploiting a block layer API of your choice (given with the queue_mode option). Another nice feature of this driver is that, when in its default configuration, everything, including dispatches and completions, happens in the context of a request insertion, which is perfect, as we can simply trace by PID. I have collected a trace of the submission of a 4KB read request using the request interface, and I will make it available as soon as I have it uploaded.

The first interesting fact, however, is not shown by the trace, as it happens on the initialization of a new device. When initializing a new device, its driver, while being loaded, chooses which interface the block layer should be using for that specific device. The interface is chosen by specifying a make_request_fn for the device, which is usually defined with the blk_queue_make_request() helper function. The default make_request function, which is to be used when wanting the block layer to stay in request mode, is blk_queue_bio().
Now that we know which function is called to submit a bio to the block layer, we can easlily find it in the trace, and see that it is called by generic_make_request(), which in its turn is called by submit_bio(). The latter function is sort of an entry point to the block layer, and used by the upper levels of the block I/O stack to submit a block I/O unit to be processed and submitted to its device.

The most interesting function between the aboved-mentioned, however, is blk_queue_bio(), which implements the core routine of the Linux block layer's request mode. I had prepared a flowchart, but it would become too much of a long figure, so I'll simply try to outline the main steps it performs in the following paragraph. I'll skip bits here and there to try to avoid boring the reader to death; if you want to go through the full function, you can find it around line 1547 of the block/blk-core.c file of the Linux kernel source tree.

Generic block layer: the blk_queue_bio() function

The first step, performed by the function even before grabbing any lock, is that of attempting a merge of the newly-submitted block I/O unit with requests in the task's plug list. When in request mode, plugging is in fact performed per-process.

        if (blk_attempt_plug_merge(q, bio, &request_count))

The blk_attempt_plug_merge() function tries to merge the bio structure with the current issuer's plug list; it checks only basic merging parameters (gathered with the invocation of blk_try_merge()), avoiding any interaction with the elevator and therefore needing no locking to be performed.
Going on, the blk_queue_bio() function finally needs to grab the queue_lock, as further actions performed in an attempt to merge the block I/O unit will involve per-device structures.


       el_ret = elv_merge(q, &req, bio);
       if (el_ret == ELEVATOR_BACK_MERGE) {
               if (bio_attempt_back_merge(q, req, bio)) {
                       elv_bio_merged(q, req, bio);
                       if (!attempt_back_merge(q, req))
                               elv_merged_request(q, req, el_ret);
                       goto out_unlock;
       } else if (el_ret == ELEVATOR_FRONT_MERGE) {
               if (bio_attempt_front_merge(q, req, bio)) {
                       elv_bio_merged(q, req, bio);
                       if (!attempt_front_merge(q, req))
                               elv_merged_request(q, req, el_ret);
                       goto out_unlock;

The elv_merge() function handles all operations that concern attempting to merge a bio structure with a request that has been already queued in the elevator. To this purpose, the block layer keeps some private fields that has the aim to make this operation faster: a) a one-hit cache of the last request that has been successfully been involved in a merge, and b) a private list of requests that are currently waiting for dispatch in the elevator. (a) is the most useful, as, if merge succeeds with this one-hit cache, no running through lists is needed; (b) probably speeds up search, too, as, in the elevator, requests might be split throughout different service queues, which might imply multiple search levels if delegating the search to the elevator itself. The elv_merge() function involves invoking one of the elevator hooks (elevator_allow_merge_fn) to ask if a bio and a request can be merged; it returns a value that represents the kind of merge that a bio and a request can undergo. In case a merge succeeds, the device is immediately unlocked and the function returns.

        req = get_request(q, rw_flags, bio, GFP_NOIO);
        if (unlikely(!req)) {
                bio_endio(bio, -ENODEV);        /* @q is dead */
                goto out_unlock;

The block layer keeps a pool of already-allocated request structures for each device.
The number of allocated requests can be retrieved by reading the special sysfs file /sys/block/<device>/queue/nr_requests, while the number of in-flight requests for the device can be read from /sys/block/<device>/inflight. In case every merge attempt of the bio fails, blk_queue_bio() invokes the function get_request() attempts to get a free request from the pool. In case it fails, the helper __get_request() function, called by the latter, activates the request starvation logic, which makes every I/O request blocking (even write requests) for every application. If, instead, a free request structure is correctly retrieved, __get_request() invokes the elevator_set_req_fn elevator hook to initialize the request's elevator-private fields; when the former function has returned, blk_queue_bio() then proceeds to initialize it from the information included in the bio.

        init_request_from_bio(req, bio);

After a new request has been correctly retrieved and initialized from the information kept in the bio, it needs to be inserted in the issuing task's plug list or to be directly submitted to the elevator. The per-process-plugging logic is triggered.

        plug = current->plug;
        if (plug) {
                if (request_count >= BLK_MAX_REQUEST_COUNT)
                        blk_flush_plug_list(plug, false);
                list_add_tail(&req->queuelist, &plug->list);
        } else {

The newly-initialized request is inserted in the task's plug list only if the task is currently plugged and the length of said list does not exceed BLK_MAX_REQUEST_COUNT, or the task has been plugged for enough few time. Otherwise, if the task is plugged, it becomes unplugged and the plug list is flushed with the blk_flush_plug_list() function. If the task is unplugged, a driver run-of-queue is triggered and the new request is directly passed on to the elevator for further processing, being inserted with the elevator_add_req_fn hook.

The elevator and the device driver

When a run-of-queue is triggered, the driver peeks the request queue and, if it contains requests, it extracts one or more of them from the queue. A run of queue can happen as a consequence of request insertion or following an interrupt from the device's controller. During driver initialization, a request_fn function is set with the blk_init_queue() function, which also allocates a new request_queue structure. When a device wants to get new requests from a request_queue, its run-of-queue invokes the driver's request_fn hook. The latter, in its turn, typically invokes the blk_fetch_request() helper function of the block layer until it no more returns a valid pointer to a request. Inside the loop, the driver handles requests as deemed appropriate.

Let's see how, e.g., the null block driver handles initialization of a request_fn function. The source code for the complete null_add_dev() function is in the drivers/block/null_blk.c file of the Linux kerrnel source tree.

nullb->q = blk_init_queue_node(null_request_fn, &nullb->lock, home_node);

The blk_init_queue_node() function is similar to blk_init_queue(), but allows to specify the memory node the request_queue should be allocated from.

As of the null block driver's request_fn function, it is implemented with a simple loop as follows.

static void null_request_fn(struct request_queue *q) 
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL) {
                struct nullb_cmd *cmd = rq->special;



The blk_fetch_request() function invokes the blk_peek_request() function, which in its turn uses one of the elevator's callbacks, elevator_dispatch_fn, to trigger the insertion of new requests in the request_queue. The number of requests inserted for each dispatch depends on the implementation of the scheduling policy implemented in the elevator.

Friday, June 13, 2014

OPW, Linux: The block I/O layer, part 1 - Base concepts

So, during the last few weeks I have been studying and tracing the block I/O layer of the Linux kernel, with the aim of getting a clearer view of how it works. A fellow student from the University of Modena and Reggio Emilia thought I might write a blog post on that. My very patient OPW mentor approved the topic and suggested that I split the blog post in different parts, so that it can be read more comfortably and without getting bored all at once. Here is part one, which attempts to explain the basic concepts and ideas behind the mechanisms implemented in the block I/O layer: it's very high-level and concise. The following parts (which will go into much more detail about implementation details) will outline how the different APIs that the block layer exposes make use of those mechanisms, and hopefully why they are useful (or they aren't) to each of the APIs. Sounds like a pretty difficult goal (at least for a beginner as I am), so if you have any criticism, please, go ahead and leave a comment.

The block I/O layer, part 1 - Base concepts

The block I/O layer is the kernel subsystem in charge of managing input/output operations performed on block devices. The need for a specific kernel component for managing such operations is given by the additional complexity of block devices with respect to, for example, character devices. When we think about a character device (e.g. the keyboard we're writing with), we think about a stream, which has a cursor running along one only direction. In other words, the information kept in the stream can be read in only one order, and is usually read until a certain position, or until a certain special character is encountered (e.g. EOF, the end-of-file character). A block device is different: data can be read from it in any order and from any position, after one or more seeks. Extra care is needed while reading from a block device, not so much for handling the I/O itself, as for ensuring performance; in fact, block I/O devices are mainly exploited as storage devices, and it is important to exploit as much as possible their potential. As Robert Love stated in his renowned book, the development of a block layer was such an important goal that it became the main aim of the 2.5 development cycle of the Linux kernel.

Seeing it as a black box, we find that the block layer interacts with two other relevant subsystems of the Linux kernel: the mapping layer and the device driver of the block device where the actual I/O will take place. To undestand what the mapping layer does, and how it interacts with the block layer, we have to briefly outline how a block I/O request actually reaches it (see also Figure 1 for a schematic view).

Figure 1: The block I/O hierarchy of the Linux kernel

A process accesses a disk with a read() or write() operation of a certain number of bytes. The sys_read() and sys_write() system calls, invoked on behalf of the library, first activate the Virtual File System, that handles two different aspects. When opening, closing or linking a file, it locates the disk and the file system hosting the data, starting from a relative or absolute path. When reading from or writing to an already-open file, it verifies if the data is already mapped into memory (that is, resides in the kernel's page cache) and, if it isn't, it determines how the data can be read from disk. When the VFS finds out that the data is not already mapped into memory, it activates the the kernel susbystem in charge of physically locating the data. The mapping layer initially finds out the size of the file system block corresponding to the data, then it computes the size of the data in terms of number of disk blocks to be transferred. In this phase, the kernel also locates the block numbers that keep the data. As a file can be kept in blocks that are physically not contiguous, the file system keeps track of the mapping between logical blocks and physical blocks. The mapping layer must therefore invoke a mapping function that accesses the file descriptor and pieces together the logical-to-physical mapping, therefore getting, at the end, the position of the actual disk blocks (from the beginning of the disk or of a partition of the disk's file system) keeping the file.
As soon as the positions of the blocks to be mapped into memory are available, the kernel starts to build the I/O requests to forward to the disk device. The block I/O layer creates a list of block I/O structures (called bio), each representing an I/O operation to be submitted to disk. Such structures hide to the upper levels of the hierarchy the specific features of the block device, providing them with an an abstract view of the devices. In the simplest case of small I/O operations one single bio is created, while in more complex cases (large-sized I/Os, vector I/O) the block layer can initiate more I/O operations at a time. Subsequently, the block layer translates I/O operations to requests made for specific sectors, to be sent to the disk controller itself; each request normally contains one or more bios (Figure 2): this helps reducing per-strucure processing overheads. Each of these requests is represented by a request structure; their dispatch to the device driver can be controlled by service policies, implemented in the I/O scheduling sub-component of the block layer.

Figure 2: Interaction between bio and request data structures
(from R. Love's "Linux Kernel Development", 3rd Edition)

At last, the block layer dispatches the I/O requests to the device driver, which handles the actual data transfer. The device driver is typically provided with a list of available requests, or provides itself a callback to insert a request in some shared data structures. The device driver issues the interface-specific commands as soon as an interrupt is issued from the disk, or a request service is requested from the upper layers of the hierarchy. This interaction between block layer and device driver highly depends on the block layer API which is being used, and will be discussed in much more detail in the next posts of the blog series.

I/O handling techniques

The question arises quite naturally here: are requests simply forwarded by the block layer to the block device driver? Well, actually, no. If they were, the existence of the block I/O layer itself would be pointless. The block layer implements, instead, a number of I/O handling techniques, some of them very simple, some very complex, with a common aim: increasing performance. Some of the implemented techniques, such as I/O scheduling, are typical of a subset of the APIs currently offered by the block layer, and won't be explained here. There are however some that are common to all available interfaces, as they provide relevant benefit regardless of the used API.

Merge and coalesce I/O operations
The first operation the block I/O layer performs on a block I/O operation is attempting to merge it to already-inserted structures. This is achieved by searching in the block layer's internal data structures some adjacent requests; if at least one adjoining request is found, the newly-issued block I/O is unified to the existing one. When the existing operation is merged to the new one (because the new one was issued for sectors that preceded the existing one's), the merge is called front merge; else, if the new operation is merged to the existing one (because it was issued for sectors that followed the existing one's) the merge is called back merge. The most common of the two is the latter, as it most likely happens that I/O is performed in a sequential fashion. When, instead, a new operation "fills a gap" between two already-existing structures, the block layer performs a coalesce of the three operations.

Figure 3: Merge and coalesce operations

The block I/O layer includes a mechanism, called plugging, that adjusts the rate at which requests are dispatched to the device driver. If requests are queued to a block device which the block layer does not regard as under load or under stress (that is, with a low number of pending requests), the dispatch of newly-issued operations to the driver is delayed (and the device is said to be plugged). This allows to the block layer to perform a higher number of merges, if more I/O is issued closely (with regard to time and space). After a certain amount of time, or when the number of outstanding I/O for a device exceeds a fixed threshold, the device is unplugged, and the requests are actually dispatched to the device driver.

Initially, plugging was performed system-wide; the idea proved to be effective, but not as effective as it would be if plugging was performed per-device, which would help relieve lock contention on the increasing number of SMP machines. Starting from 2.6.39, plugging was moved to be per-device. Lock contention was still an issue, especially in the case of many processes issuing a high number of small requests to the same block device and constantly hitting on the block layer's per-device data structures each time a merge of a block I/O operation was attempted. Starting from 3.0, plugging was therefore moved to be per-process. This allows block I/O operations to be merged and coalesced in a lockless fashion in a high number of cases; the extent of the impact of this design choice on lock contention actually depends on the API being used.

Request sorting
Some types of block devices have a great benefit from the requests being issued as much as possible in a sequential fashion. As can be probably very easily deduced, the ones that benefit the most from it are devices where the access time to data depends on some mechanical movement: rotational hard drives. When accessing a sector, a rotational disk must move the disk plate where the data is kept (rotational latency) for the sector to be accessible to a head, which must position its head on a specific track (disk head seek). If requests are issued in a random manner, the number of seeks needed to retrieve the sectors increases, adding overhead with respect to sequentially-issued operations.
The block layer typically tries to re-order requests as much as possible, so that they are submitted to the device driver and to disk in a sequential fashion. Plugging helps also with this, allowing the block layer to internally re-order requests before they are submitted.

Figure 4: Internals of a hard disk
(from, because I can't draw for my life)
Request sorting is, however, not very useful on some kinds of disks where data transfer time depends (almost) only on the size of the requests, such as solid state drives (SSD). Some solid state disks actually benefit from having many small requests, as they internally serve them in parallel therefore achieving better performance. This technique is therefore left to I/O policies to be used only where it is strictly necessary, and it is even directly skipped in some cases where it does not matter or could introduce overhead which is unnecessary or impacts on performance.

Robert Love, "Linux Kernel Development"