Friday, July 11, 2014

OPW, Linux: The block I/O layer, part 3 - The make request interface

During the past weeks, I have been learning about profiling an operating system running in a virtual machine, since I have needed to examine the driver I am working on to locate bottlenecks and work out lock contention issues. My OPW mentor has suggested that I get to grips with some popular profiling tools, such as perf and lockstat. During my bachelor thesis I already had the chance to become familiar with perf to some extent, but I am learning a lot more about collecting accurate data about the performance of a virtualized OS. For example, I read that Xen exploits Intel's Performance Monitoring Unit, which provides architectural support for collecting performance-related data.

During the tests performed prior to profiling, I also had the chance to use the null_blk block device driver to compare the performance of the CFQ and NOOP I/O schedulers under a random workload composed of greedy random readers and writers with no completion latency. Such a workload emulates Intel's IOmeter on a too-fast-to-be-real device. The throughput achieved by the CFQ I/O scheduler is half of the one achieved by NOOP, or even lower, depending on the number of processes issuing I/O.

The NOOP scheduler, however, still merges and sorts requests; none of this seems really necessary with such a workload, where I/O operations are issued in a random fashion (so there do not seem to be many merges in any case) and there is no seek penalty that would justify sorting. So, there's already something in the Linux kernel's block layer that should perform slightly better than the request API with the NOOP scheduler: the make request interface.


The block I/O layer, part 3 - The make request interface

The make request interface (or bio-based interface) essentially cuts out all processing of block I/O units following the creation of a bio structure. It therefore allows the kernel to submit a bio directly to the storage device's driver. Such an interface is useful to any block device driver needing to pre-process requests before submitting them to the actual underlying device (as, e.g., stacked drivers implementing RAID do). Even though this was not its initial purpose, the bio-based API is also useful to any block device driver that sees the block layer's processing of I/O requests as overhead; think, for example, of drivers of devices or controllers that feature a highly complex internal request processing logic, or that do not need requests to be processed at all. The drawback of such an interface is evident: a driver making use of it loses any pre-processing normally performed by the block layer.

Figure 1: Block layer layout when using the make request interface

Let's see how a driver uses such an interface, again from the code of the very simple null_blk driver. Even when in bio-based mode, the null_blk driver still needs to allocate a request_queue structure. The key, however, is defining after that an alternate make request function in place of the default one. The null_blk driver does this in its null_add_dev() function, invoked on module initialization for each simulated device it is asked to create.

nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node);
blk_queue_make_request(nullb->q, null_queue_bio);

Let's turn our attention to the bulk of the null_queue_bio() function itself. It is very simple and does not even need to allocate new request structures; however, it needs to get a command structure to handle completions afterwards. It just handles the block operation's command with no additional operations.

static void null_queue_bio(struct request_queue *q, struct bio *bio)
{
        struct nullb *nullb = q->queuedata;
        struct nullb_queue *nq = nullb_to_queue(nullb);
        struct nullb_cmd *cmd;

        cmd = alloc_cmd(nq, 1); 
        cmd->bio = bio;

        null_handle_cmd(cmd);
}

In this very simple case, completions are handled by just ending the I/O command with no error notification, as if it had been executed by a device controller. We can see how the null_blk driver does this in its end_cmd() function, which is invoked directly in the context of the previously-seen null_handle_cmd() function: it invokes the block layer's bio_endio() function, passing it the completed bio and, as second parameter, the error code.

case NULL_Q_BIO:
        bio_endio(cmd->bio, 0);
        break;


K. Wilk, "Xen Profiling: oprofile and perf" - http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf
J. Corbet, "The multiqueue block layer" - http://lwn.net/Articles/552904/

Friday, June 27, 2014

OPW, Linux: The block I/O layer, part 2 - The request interface

Following the many directives and suggestions given by my OPW mentor, I have been implementing the tiniest and simplest block I/O driver, whose multi-queue implementation just fills I/O request structures with available information about the CPU core the process has been issuing I/O from; this is proving very useful to test the multi-queue-capable prototype I am developing. This task gave me the chance to study in more depth the different APIs provided by the block layer of the Linux kernel, and to continue the series of blog articles about it.
Here's my take on the second part; it's come out more high-level than I would have wanted, but still. Comments, remarks and criticism are welcome, as always.


The block I/O layer, part 2 - The request interface


The request interface is based on an essential data structure that allows the block layer to pass requests to the device driver: the request_queue structure. Such a structure has two main buffering functions: it allows requests to be pre-processed in an attempt to reduce overheads (with merging and sorting), and it limits the submission rate to prevent hardware buffers from over-running. request_queue structures are allocated per-device, which means that each device has its private request_queue, which can be accessed only by the device-specific block I/O driver in use for the disk. Such a structure is essentially a container of block I/O requests: its most important field, in fact, is a linked list of requests (whose head is kept in the queue_head field) which have been made available to the driver. It also keeps a set of pointers to functions used to submit a bio or dispatch a request to be handled for submission to the device. When in request mode, the block layer also allows for I/O scheduling: therefore, a request_queue structure points to an elevator_queue structure, which in turn keeps pointers to the elevator hooks. Last, but not least, the request_queue structure is protected by a spinlock, the queue_lock.

Figure 1: The block layer's layout when in request mode

If we want to move forward and see what's inside each of the black boxes, a good starting point is a trace. But how do we trace a request cleanly, without being bothered by the whole lot of driver-specific and low-level-driver-specific functions that execute in interrupt context and don't allow us to get a straightforward trace? We could exploit some ftrace tricks to get the cleanest possible trace (as I clumsily showed in a previous blog article), choosing whether to have a messier trace or to lose interrupt-related bits, or we can use the null block device driver. I'd personally go for the latter. It comes with the Linux kernel as a module that, when loaded, creates a configurable number of fake block devices (see the nr_devices option of the module), exploiting a block layer API of your choice (given with the queue_mode option). Another nice feature of this driver is that, in its default configuration, everything, including dispatches and completions, happens in the context of a request insertion, which is perfect, as we can simply trace by PID. I have collected a trace of the submission of a 4KB read request using the request interface, and I will make it available as soon as I have uploaded it.

The first interesting fact, however, is not shown by the trace, as it happens on the initialization of a new device. When a new device is initialized, its driver, while being loaded, chooses which interface the block layer should use for that specific device. The interface is chosen by specifying a make_request_fn for the device, which is usually done with the blk_queue_make_request() helper function. The default make request function, to be used when we want the block layer to stay in request mode, is blk_queue_bio().
Now that we know which function is called to submit a bio to the block layer, we can easily find it in the trace, and see that it is called by generic_make_request(), which in turn is called by submit_bio(). The latter function is sort of an entry point to the block layer, and is used by the upper levels of the block I/O stack to submit a block I/O unit to be processed and submitted to its device.

The most interesting function among the above-mentioned, however, is blk_queue_bio(), which implements the core routine of the Linux block layer's request mode. I had prepared a flowchart, but it would have become too long a figure, so I'll simply try to outline the main steps it performs in the following paragraphs. I'll skip bits here and there to try to avoid boring the reader to death; if you want to go through the full function, you can find it around line 1547 of the block/blk-core.c file of the Linux kernel source tree.

Generic block layer: the blk_queue_bio() function

The first step, performed by the function even before grabbing any lock, is that of attempting a merge of the newly-submitted block I/O unit with requests in the task's plug list. When in request mode, plugging is in fact performed per-process.

        if (blk_attempt_plug_merge(q, bio, &request_count))
                return;

The blk_attempt_plug_merge() function tries to merge the bio structure with the current issuer's plug list; it checks only basic merging parameters (gathered with an invocation of blk_try_merge()), avoiding any interaction with the elevator and therefore requiring no locking.
Going on, the blk_queue_bio() function finally needs to grab the queue_lock, as further actions performed in an attempt to merge the block I/O unit involve per-device structures.

       spin_lock_irq(q->queue_lock);

       el_ret = elv_merge(q, &req, bio);
       if (el_ret == ELEVATOR_BACK_MERGE) {
               if (bio_attempt_back_merge(q, req, bio)) {
                       elv_bio_merged(q, req, bio);
                       if (!attempt_back_merge(q, req))
                               elv_merged_request(q, req, el_ret);
                       goto out_unlock;
               }
       } else if (el_ret == ELEVATOR_FRONT_MERGE) {
               if (bio_attempt_front_merge(q, req, bio)) {
                       elv_bio_merged(q, req, bio);
                       if (!attempt_front_merge(q, req))
                               elv_merged_request(q, req, el_ret);
                       goto out_unlock;
               }
       }

The elv_merge() function handles all operations that concern attempting to merge a bio structure with a request already queued in the elevator. To this purpose, the block layer keeps some private fields that aim to make this operation faster: a) a one-hit cache of the last request that has successfully been involved in a merge, and b) a private list of the requests currently waiting for dispatch in the elevator. (a) is the most useful, as, if the merge succeeds with this one-hit cache, no running through lists is needed; (b) probably speeds up the search too, as, in the elevator, requests might be split across different service queues, which might imply multiple search levels if the search were delegated to the elevator itself. The elv_merge() function involves invoking one of the elevator hooks (elevator_allow_merge_fn) to ask whether a bio and a request can be merged; it returns a value that represents the kind of merge that the bio and the request can undergo. In case a merge succeeds, the queue_lock is immediately released and the function returns.
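The one-hit cache idea can be sketched in plain userspace C; the structures and names below are simplified stand-ins I made up for illustration, not the kernel's own (the real cache is the request_queue's last_merge field):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a queued request: sectors [start, start+len). */
struct req { unsigned long start, len; };

static struct req *last_merged; /* one-hit cache, like q->last_merge */

/* Back-merge a bio [start, start+len) into r if it is adjacent to r's end. */
static int try_back_merge(struct req *r, unsigned long start, unsigned long len)
{
        if (r && r->start + r->len == start) {
                r->len += len;
                last_merged = r; /* refresh the one-hit cache */
                return 1;
        }
        return 0;
}

/* Probe the cached request first; only then run through the list. */
static int merge_bio(struct req *reqs, size_t n, unsigned long start,
                     unsigned long len)
{
        size_t i;

        if (try_back_merge(last_merged, start, len))
                return 1; /* cache hit: no list traversal at all */
        for (i = 0; i < n; i++)
                if (try_back_merge(&reqs[i], start, len))
                        return 1;
        return 0;
}
```

A sequential stream of bios keeps hitting the cached request, so the common case never walks the list; only the first merge (or a merge against a different request) pays for the search.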

        req = get_request(q, rw_flags, bio, GFP_NOIO);
        if (unlikely(!req)) {
                bio_endio(bio, -ENODEV);        /* @q is dead */
                goto out_unlock;
        }

The block layer keeps a pool of already-allocated request structures for each device.
The number of allocated requests can be retrieved by reading the special sysfs file /sys/block/<device>/queue/nr_requests, while the number of in-flight requests for the device can be read from /sys/block/<device>/inflight. In case every merge attempt on the bio fails, blk_queue_bio() invokes the get_request() function, which attempts to get a free request from the pool. In case it fails, the helper __get_request() function, called by the latter, activates the request starvation logic, which makes every I/O request blocking (even write requests) for every application. If, instead, a free request structure is correctly retrieved, __get_request() invokes the elevator_set_req_fn elevator hook to initialize the request's elevator-private fields; when that function has returned, blk_queue_bio() proceeds to initialize the request from the information included in the bio.
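As a rough userspace sketch of the pool idea (all names here are invented for illustration; the real pool lives behind struct request_list and involves mempools and congestion logic, not a flat array):

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4 /* toy value; queues default to 128 (nr_requests) */

struct toy_request { int in_use; unsigned long sector; };

static struct toy_request pool[POOL_SIZE];

/* Hand out a free pre-allocated request, or NULL when the pool is empty. */
static struct toy_request *toy_get_request(void)
{
        size_t i;

        for (i = 0; i < POOL_SIZE; i++) {
                if (!pool[i].in_use) {
                        pool[i].in_use = 1;
                        return &pool[i];
                }
        }
        return NULL; /* here the kernel would instead block the issuer */
}

/* Completion returns the structure to the pool for reuse. */
static void toy_put_request(struct toy_request *rq)
{
        rq->in_use = 0;
}
```

The point of pre-allocation is that the hot submission path never calls a general-purpose allocator; exhaustion of the pool, not memory pressure, is what throttles submitters.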

        init_request_from_bio(req, bio);

After a new request has been correctly retrieved and initialized from the information kept in the bio, it needs to be inserted in the issuing task's plug list or to be directly submitted to the elevator. Here the per-process plugging logic is triggered.

        plug = current->plug;
        if (plug) {
                if (request_count >= BLK_MAX_REQUEST_COUNT)
                        blk_flush_plug_list(plug, false);
                list_add_tail(&req->queuelist, &plug->list);
        } else {
                spin_lock_irq(q->queue_lock);
                __blk_run_queue(q);
                spin_unlock_irq(q->queue_lock);
        }

The newly-initialized request is inserted at the tail of the task's plug list if the task is currently plugged; if the length of said list exceeds BLK_MAX_REQUEST_COUNT (or the task has been plugged for too long), the list is first flushed with the blk_flush_plug_list() function, and the new request is then added to the now-empty list. If the task is not plugged, the new request is directly passed on to the elevator for further processing (being inserted with the elevator_add_req_fn hook) and a driver run-of-queue is triggered.
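The plug/flush decision above can be mimicked with a toy userspace model; BLK_MAX_REQUEST_COUNT really is 16 in the kernel, while everything else below is a simplified stand-in:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PLUG 16 /* mirrors BLK_MAX_REQUEST_COUNT */

struct toy_plug {
        unsigned long pending[TOY_MAX_PLUG]; /* queued start sectors */
        size_t count;
        size_t flushes; /* times a batch was handed to the "driver" */
};

/* Flush: in the kernel this sorts the list and feeds it to the elevator. */
static void toy_flush_plug(struct toy_plug *plug)
{
        plug->count = 0;
        plug->flushes++;
}

/* Queue a request on the per-task plug list, flushing when it grows too long. */
static void toy_queue_rq(struct toy_plug *plug, unsigned long sector)
{
        if (plug->count >= TOY_MAX_PLUG)
                toy_flush_plug(plug);
        plug->pending[plug->count++] = sector;
}
```

Note that the flush happens before the insertion, exactly as in the blk_queue_bio() snippet above: the new request always ends up on the plug list, never in the flushed batch.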

The elevator and the device driver

When a run-of-queue is triggered, the driver peeks at the request queue and, if it contains requests, extracts one or more of them from the queue. A run of queue can happen as a consequence of a request insertion or following an interrupt from the device's controller. During driver initialization, a request_fn function is set with the blk_init_queue() function, which also allocates a new request_queue structure. When a device wants to get new requests from a request_queue, its run-of-queue invokes the driver's request_fn hook. The latter, in turn, typically invokes the block layer's blk_fetch_request() helper function until it no longer returns a valid pointer to a request. Inside the loop, the driver handles requests as it deems appropriate.

Let's see how, e.g., the null block driver handles the initialization of a request_fn function. The source code for the complete null_add_dev() function is in the drivers/block/null_blk.c file of the Linux kernel source tree.

nullb->q = blk_init_queue_node(null_request_fn, &nullb->lock, home_node);

The blk_init_queue_node() function is similar to blk_init_queue(), but allows specifying the memory node the request_queue should be allocated from.

As for the null block driver's request_fn function, it is implemented with a simple loop, as follows.

static void null_request_fn(struct request_queue *q) 
{
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL) {
                struct nullb_cmd *cmd = rq->special;

                spin_unlock_irq(q->queue_lock);
                null_handle_cmd(cmd);
                spin_lock_irq(q->queue_lock);
        }

}

The blk_fetch_request() function invokes the blk_peek_request() function, which in turn uses one of the elevator's callbacks, elevator_dispatch_fn, to trigger the insertion of new requests in the request_queue. The number of requests inserted for each dispatch depends on the scheduling policy implemented in the elevator.

Friday, June 13, 2014

OPW, Linux: The block I/O layer, part 1 - Base concepts

So, during the last few weeks I have been studying and tracing the block I/O layer of the Linux kernel, with the aim of getting a clearer view of how it works. A fellow student from the University of Modena and Reggio Emilia thought I might write a blog post on that. My very patient OPW mentor approved the topic and suggested that I split the blog post into different parts, so that it can be read more comfortably and without getting boring all at once. Here is part one, which attempts to explain the basic concepts and ideas behind the mechanisms implemented in the block I/O layer: it's very high-level and concise. The following parts (which will go into much more detail about implementation) will outline how the different APIs that the block layer exposes make use of those mechanisms, and hopefully why they are useful (or aren't) to each of the APIs. Sounds like a pretty difficult goal (at least for a beginner like me), so if you have any criticism, please, go ahead and leave a comment.


The block I/O layer, part 1 - Base concepts

The block I/O layer is the kernel subsystem in charge of managing input/output operations performed on block devices. The need for a specific kernel component for managing such operations arises from the additional complexity of block devices with respect to, for example, character devices. When we think about a character device (e.g. the keyboard we're writing with), we think about a stream, which has a cursor running in only one direction. In other words, the information kept in the stream can be read in only one order, and is usually read until a certain position, or until a certain special character is encountered (e.g. EOF, the end-of-file character). A block device is different: data can be read from it in any order and from any position, after one or more seeks. Extra care is needed when reading from a block device, not so much for handling the I/O itself as for ensuring performance; in fact, block devices are mainly used as storage devices, and it is important to exploit their potential as much as possible. As Robert Love states in his renowned book, the development of a block layer was such an important goal that it became the main aim of the 2.5 development cycle of the Linux kernel.

Seen as a black box, the block layer interacts with two other relevant subsystems of the Linux kernel: the mapping layer and the device driver of the block device where the actual I/O will take place. To understand what the mapping layer does, and how it interacts with the block layer, we have to briefly outline how a block I/O request actually reaches it (see also Figure 1 for a schematic view).

Figure 1: The block I/O hierarchy of the Linux kernel

A process accesses a disk with a read() or write() operation of a certain number of bytes. The sys_read() and sys_write() system calls, invoked on behalf of the library, first activate the Virtual File System, which handles two different aspects. When opening, closing or linking a file, it locates the disk and the file system hosting the data, starting from a relative or absolute path. When reading from or writing to an already-open file, it verifies whether the data is already mapped into memory (that is, resides in the kernel's page cache) and, if it isn't, it determines how the data can be read from disk. When the VFS finds out that the data is not already mapped into memory, it activates the kernel subsystem in charge of physically locating the data. The mapping layer initially finds out the size of the file system block corresponding to the data, then computes the size of the data in terms of the number of disk blocks to be transferred. In this phase, the kernel also locates the block numbers keeping the data. As a file can be kept in blocks that are physically not contiguous, the file system keeps track of the mapping between logical blocks and physical blocks. The mapping layer must therefore invoke a mapping function that accesses the file descriptor and pieces together the logical-to-physical mapping, getting, at the end, the position of the actual disk blocks (from the beginning of the disk or of a partition of the disk's file system) keeping the file.
As soon as the positions of the blocks to be mapped into memory are available, the kernel starts to build the I/O requests to forward to the disk device. The block I/O layer creates a list of block I/O structures (called bios), each representing an I/O operation to be submitted to disk. Such structures hide the specific features of the block device from the upper levels of the hierarchy, providing them with an abstract view of the devices. In the simplest case of a small I/O operation a single bio is created, while in more complex cases (large-sized I/Os, vector I/O) the block layer can initiate more I/O operations at a time. Subsequently, the block layer translates I/O operations into requests made for specific sectors, to be sent to the disk controller itself; each request normally contains one or more bios (Figure 2): this helps reduce per-structure processing overheads. Each of these requests is represented by a request structure; their dispatch to the device driver can be controlled by service policies, implemented in the I/O scheduling sub-component of the block layer.

Figure 2: Interaction between bio and request data structures
(from R. Love's "Linux Kernel Development", 3rd Edition)
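To make the logical-to-physical step concrete, here is a toy userspace sketch; the flat table below is a deliberate oversimplification of what a real file system does with indirect blocks or extents, and all names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define FS_BLOCK_SIZE 4096UL /* a common file system block size */

/* Toy per-file map: logical block index -> physical block number on disk. */
struct toy_file_map {
        const unsigned long *physical; /* physical[i] backs logical block i */
        size_t nr_blocks;
};

/*
 * Given a byte offset into the file, return the physical block holding it,
 * or -1 if the offset lies past the end of the mapped file.
 */
static long map_offset_to_block(const struct toy_file_map *map,
                                unsigned long offset)
{
        size_t logical = offset / FS_BLOCK_SIZE;

        if (logical >= map->nr_blocks)
                return -1;
        return (long)map->physical[logical];
}
```

Because consecutive logical blocks may map to scattered physical blocks (as in the table used in the test), a single large read can translate into several non-contiguous disk requests, which is exactly why the block layer may have to build more than one bio.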

At last, the block layer dispatches the I/O requests to the device driver, which handles the actual data transfer. The device driver is typically provided with a list of available requests, or itself provides a callback to insert a request in some shared data structures. The device driver issues the interface-specific commands when an interrupt arrives from the disk, or when service of a request is demanded by the upper layers of the hierarchy. This interaction between the block layer and the device driver highly depends on the block layer API being used, and will be discussed in much more detail in the next posts of this blog series.


I/O handling techniques

The question arises quite naturally here: are requests simply forwarded by the block layer to the block device driver? Well, actually, no. If they were, the existence of the block I/O layer itself would be pointless. The block layer instead implements a number of I/O handling techniques, some of them very simple, some very complex, with a common aim: increasing performance. Some of the implemented techniques, such as I/O scheduling, are typical of a subset of the APIs currently offered by the block layer, and won't be explained here. There are however some that are common to all available interfaces, as they provide relevant benefits regardless of the API used.

Merge and coalesce I/O operations
The first operation the block I/O layer performs on a block I/O operation is attempting to merge it with already-inserted structures. This is achieved by searching the block layer's internal data structures for adjacent requests; if at least one adjoining request is found, the newly-issued block I/O is unified with the existing one. When the existing operation is merged into the new one (because the new one was issued for sectors that precede the existing one's), the merge is called a front merge; if, instead, the new operation is merged into the existing one (because it was issued for sectors that follow the existing one's), the merge is called a back merge. The most common of the two is the latter, as I/O is most likely performed in a sequential fashion. When, instead, a new operation "fills a gap" between two already-existing structures, the block layer coalesces the three operations.

Figure 3: Merge and coalesce operations
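The three cases can be sketched in a few lines of userspace C, working on sector ranges only (the structure and function names are made up for illustration):

```c
#include <assert.h>

/* A pending request covering sectors [start, start+len). */
struct toy_rq { unsigned long start, len; };

/* Back merge: the new I/O starts exactly where the request ends. */
static int toy_back_merge(struct toy_rq *rq, unsigned long start,
                          unsigned long len)
{
        if (rq->start + rq->len != start)
                return 0;
        rq->len += len;
        return 1;
}

/* Front merge: the new I/O ends exactly where the request starts. */
static int toy_front_merge(struct toy_rq *rq, unsigned long start,
                           unsigned long len)
{
        if (start + len != rq->start)
                return 0;
        rq->start = start;
        rq->len += len;
        return 1;
}

/* Coalesce: after a merge, rq may now touch the next queued request. */
static int toy_coalesce(struct toy_rq *rq, struct toy_rq *next)
{
        if (rq->start + rq->len != next->start)
                return 0;
        rq->len += next->len; /* next would then be freed */
        return 1;
}
```

The coalesce case is just a back merge applied to two requests instead of a request and a bio: a new I/O that fills the gap first grows one request, which then becomes adjacent to, and absorbs, its neighbour.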

Plugging
The block I/O layer includes a mechanism, called plugging, that adjusts the rate at which requests are dispatched to the device driver. If requests are queued to a block device that the block layer does not regard as under load or under stress (that is, one with a low number of pending requests), the dispatch of newly-issued operations to the driver is delayed (and the device is said to be plugged). This allows the block layer to perform a higher number of merges, if more I/O is issued close in time and space. After a certain amount of time, or when the number of outstanding I/O operations for a device exceeds a fixed threshold, the device is unplugged, and the requests are actually dispatched to the device driver.

Initially, plugging was performed system-wide; the idea proved to be effective, but not as effective as it would be if performed per-device, which would help relieve lock contention on the increasing number of SMP machines. Starting from 2.6.39, plugging was made per-device. Lock contention was still an issue, however, especially in the case of many processes issuing a high number of small requests to the same block device and constantly hitting the block layer's per-device data structures each time a merge of a block I/O operation was attempted. Starting from 3.0, plugging was therefore made per-process. This allows block I/O operations to be merged and coalesced in a lockless fashion in a high number of cases; the extent of the impact of this design choice on lock contention actually depends on the API being used.

Request sorting
Some types of block devices benefit greatly from requests being issued as much as possible in a sequential fashion. As can easily be deduced, the ones that benefit most are devices where the access time to data depends on some mechanical movement: rotational hard drives. When accessing a sector, a rotational disk must rotate the platter where the data is kept until the sector passes under a head (rotational latency), and the head must first be positioned on the specific track holding the sector (disk head seek). If requests are issued in a random manner, the number of seeks needed to retrieve the sectors increases, adding overhead with respect to sequentially-issued operations.
The block layer therefore typically tries to re-order requests as much as possible, so that they are submitted to the device driver, and to disk, in a sequential fashion. Plugging helps with this too, allowing the block layer to internally re-order requests before they are submitted.

Figure 4: Internals of a hard disk
(from learn44.com, because I can't draw for my life)
Request sorting is, however, not very useful on disks where the data transfer time depends (almost) only on the size of the requests, such as solid state drives (SSDs). Some solid state drives actually benefit from having many small requests, as they serve them internally in parallel, therefore achieving better performance. This technique is therefore left to I/O policies, to be used only where strictly necessary, and it is even skipped entirely in cases where it does not matter or would introduce unnecessary overhead.
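The sorting step itself is conceptually trivial; a userspace sketch of ordering a batch of queued requests by start sector might look like the following (the kernel actually keeps requests ordered as they are inserted rather than sorting a batch with qsort(), so this is purely an illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* A queued request, identified here only by its start sector. */
struct toy_rq { unsigned long sector; };

static int cmp_sector(const void *a, const void *b)
{
        const struct toy_rq *ra = a, *rb = b;

        if (ra->sector < rb->sector)
                return -1;
        return ra->sector > rb->sector;
}

/* Order a batch of queued requests so dispatch proceeds sequentially. */
static void toy_sort_requests(struct toy_rq *rqs, size_t n)
{
        qsort(rqs, n, sizeof(*rqs), cmp_sector);
}
```

After sorting, the head sweeps the platter in one direction instead of seeking back and forth, which is the whole point of elevator-style scheduling on rotational media.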


Sources
Robert Love, "Linux Kernel Development"

Thursday, May 29, 2014

OPW, Linux: Learning about Linux with ftrace

So, this week I've been studying and documenting code, asking questions to my OPW mentor (who, by the way, suggested the topic for this kick-off blog post, thank you so much!), and starting to code (and debug). I've mainly been making use of the ftrace environment to get traces from a running kernel and compare them to what I understood from a static study; also, as I'm working on a block I/O driver, I've been experimenting with the blktrace tool. The following paragraphs will try to outline how I have been using them.
If you have any feedback and the time to write a comment about it, please do; I'm just a beginner here, and can make use of any suggestion.

Learning about Linux with ftrace

ftrace, whose name is a legacy of its initial purpose as a kernel-space function tracer, is much more than that: it is available in Linux (and has been from v2.6.27 on) as a framework that features several tracers, each sporting a configuration interface exposed to userspace applications. It does include a function tracer, with the possibility to probe either only the entry points or both the entry and exit points of kernel functions; these are called, respectively, function and function_graph. Other tracers provided by the framework are geared toward tracing areas that disable interrupts: the irqsoff tracer saves a snapshot of the longest measured period of time during which interrupts were disabled; similarly, preemptoff saves a trace of the longest period that had preemption disabled; the preemptirqsoff tracer combines the two. Last, the wakeup and wakeup_rt tracers record the maximum service latency of high-priority tasks, the latter focusing on real-time tasks.
ftrace also allows for dynamic tracing, letting the system run with no measurable overhead when tracing is disabled; this convenient feature is implemented with code text rewriting: when booting the kernel, or when stopping a running trace, the code of the running kernel is dynamically modified by replacing all calls to function profilers with a no-operation instruction, thereby reducing the overhead to zero. The ability to dynamically enable and disable tracing points throughout kernel code also makes it possible to restrict traces to relevant functions, reducing the overhead unavoidably produced when tracing is enabled.

Being a versatile and efficient kernel-space tracing environment, ftrace proves to be an indispensable tool for the student or developer willing to become familiar with a kernel component. If this is the case, the most useful tracers among those listed above are probably the function and function_graph tracers, as they give a concise representation of the code's execution flow.
The interface to the ftrace framework resides in the kernel's debugfs. Assuming that the debugfs is mounted on the directory /sys/kernel/debug, the first file we'd be willing to use is probably the one that allows us to select a tracer. Let's say we want to collect a function graph. First, we can list available tracers to see if the one we want is actually compiled in the kernel.

# cat /sys/kernel/debug/tracing/available_tracers 
blk function_graph wakeup_dl wakeup_rt wakeup function nop

It seems like we have the function_graph tracer built in our kernel (want to know what that mysterious blk tracer is? Read the section "Tracing the block layer" below). We can select it immediately by echoing its name to the current_tracer file. But, let us see, before that, if any tracer has already been selected.

# cat /sys/kernel/debug/tracing/current_tracer 
nop

The nop tracer is a placeholder to indicate that no tracer is selected. Let's select function_graph instead.

# echo function_graph > /sys/kernel/debug/tracing/current_tracer 
# cat /sys/kernel/debug/tracing/current_tracer
function_graph

At this point, we can finally start tracing. But first, let's choose an interesting test case to trace. Let's say we want to follow the path of a 4KB read I/O request performed on the sda drive from syscall to dispatch. Writing it as a couple of bash command lines:

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1K count=4

The first command drops all data kept in the kernel's page cache. Dropping the cache might not be strictly necessary, but it is strongly advised here, as the blocks we are trying to read from disk might already happen to be cached in memory.
Before going on to the real experiment, we can glance at the file that will hold the trace once it has been collected. What we want to read is the /sys/kernel/debug/tracing/trace special file:

# cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |

For now the file is empty, but the header already gives us an idea of what the trace will include: for each row, the ID of the CPU where the function was executed, the duration of its execution and, as expected, its name.
Now we can wrap our experiment with tracing commands and finally execute it. Re-selecting the tracer before recording clears the trace buffer, so we can be sure we collect only the relevant output.

# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 3 > /proc/sys/vm/drop_caches
# echo 1 > /sys/kernel/debug/tracing/tracing_on
# dd if=/dev/sda of=/dev/null bs=1K count=4
# echo 0 > /sys/kernel/debug/tracing/tracing_on

We can now examine our experiment's trace by viewing the contents of the /sys/kernel/debug/tracing/trace file. Be warned: its content is likely to have grown considerably.

Improving trace accuracy

If we start eagerly studying the trace we collected, we will probably run into a number of issues that hurt the readability of the output, the most relevant being that the trace probably also includes function calls that have nothing to do with servicing the 4KB read request we wanted to observe. One reason is that some time unavoidably elapses between bash executing the "echo 1" command and the dd application actually starting; the trace will also contain all the preliminary operations performed on behalf of dd, such as loading its executable image into memory and opening the /dev/sda and /dev/null files.

The first thing we'd want to try is restricting the trace to read-related functions. Glancing at documents and manpages, we see that everything starts with the SyS_read() system call. If we could restrict tracing to only that function and its children, the output would be clearer. It turns out we can:

# echo SyS_read > /sys/kernel/debug/tracing/set_graph_function
# cat /sys/kernel/debug/tracing/set_graph_function
SyS_read

If we repeat the experiment, we see that now only SyS_read() and the functions it calls appear in the trace. Although useful, this method does not account for the fact that not all of the I/O handling happens in the context of the SyS_read() function.
Let's revert to tracing all functions by simply clearing the file:

# echo > /sys/kernel/debug/tracing/set_graph_function

What we really want, to improve the accuracy of our trace and enhance its readability, is to exploit one more feature of the ftrace environment: trace markers. These are trace lines that help our user-space application stay in sync with the kernel, for example by recording in the trace that we're about to start issuing I/O, or that we have just finished. A marker makes sense only if it is set immediately before the event we want to mark, so it is best written by a small helper program, preferably in a compiled language, which also lets us get rid of a large amount of spurious events. The official documentation page for ftrace provides an example of such a helper program. Starting from that outline, I have made available on my gist a draft of a possible implementation for our test. Even if almost trivial, it's pretty long, and would certainly be boring and uncomfortable to read on a blog, so I'm linking it here.
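
Purely to illustrate the sequence of debugfs writes such a helper performs, here is a shell sketch. TRACING_DIR, ftrace_mark and traced_run are names invented for this example (the override variable makes the sketch easy to dry-test outside debugfs); a compiled helper remains preferable, since every command the shell forks here also ends up in the trace.

```shell
# On a real system this points at ftrace's debugfs directory; it can
# be overridden for illustration or testing.
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}

ftrace_mark() {
    # Write a synchronization marker into the trace buffer.
    echo "$1" > "$TRACING_DIR/trace_marker"
}

traced_run() {
    # Enable tracing only for the duration of the workload, and
    # bracket the workload with markers to cut down spurious events.
    echo 1 > "$TRACING_DIR/tracing_on"
    ftrace_mark "workload start"
    "$@"
    ftrace_mark "workload end"
    echo 0 > "$TRACING_DIR/tracing_on"
}
```

For example, traced_run dd if=/dev/sda of=/dev/null bs=1K count=4 would bracket our read test between two markers.
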
If we are sure we want to see only operations performed on behalf of a single process, we could also trace by PID, by writing to the set_ftrace_pid file the PID of the process we want to trace. However, this would still prevent us from seeing events performed in interrupt context as a consequence of that process's actions (e.g. I/O request completions). The example I linked above includes tracing by PID, and will therefore discard some events that might be relevant to the student.
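
As a sketch of the idea (again, TRACING_DIR and the function name are invented for illustration):

```shell
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}

ftrace_trace_pid() {
    # Restrict function tracing to the given PID only.
    echo "$1" > "$TRACING_DIR/set_ftrace_pid"
}
```

A helper program like the one linked above can write its own PID to this file right before exec'ing the workload, so that the trace is restricted to the workload itself.
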

Tracing the block layer

While studying the block layer, or another kernel component related to block I/O, we might be interested in tracing only the actions related to the handling of requests. ftrace also comes in handy for collecting traces based on events; let's see which events it supports, and whether we can do something with them.

# cat /sys/kernel/debug/tracing/available_events
<it's a huge list, so I'm snipping>
block:block_rq_remap
block:block_bio_remap
block:block_split
block:block_unplug
block:block_plug
block:block_sleeprq
block:block_getrq
block:block_bio_queue
block:block_bio_frontmerge
block:block_bio_backmerge
block:block_bio_complete
block:block_bio_bounce
block:block_rq_issue
block:block_rq_insert
block:block_rq_complete
block:block_rq_requeue
block:block_rq_abort
block:block_dirty_buffer
block:block_touch_buffer
<...>
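
Any of the block events above can then be enabled by writing its name to the set_event file. A small sketch (TRACING_DIR and the function names are invented for illustration):

```shell
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}

trace_event_enable() {
    # Append one event (e.g. block:block_rq_issue) to the enabled set;
    # '>>' preserves the events already enabled, while a plain '>'
    # would first disable them all.
    echo "$1" >> "$TRACING_DIR/set_event"
}

trace_event_clear_all() {
    # Writing an empty line disables every currently enabled event.
    echo > "$TRACING_DIR/set_event"
}
```
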

ftrace is not the only option for block layer-related study and debugging. The Linux kernel block layer maintainer, Jens Axboe, has introduced a block layer I/O tracing tool, blktrace, that provides detailed traces of request-queue operations from kernel to user space. blktrace efficiently records traces to one or more buffers in a binary format, which is then parsed back into a human-readable format by the blkparse utility. blktrace also adds to the common events the log messages produced by block layer components, greatly facilitating the debugging of block I/O-related subsystems. blktrace is fully integrated with the debugfs interface provided by ftrace, and supports a fair number of the options provided by the ftrace environment (such as, e.g., trace filtering and dynamic tracepoints).
To enable the block tracer provided by the blktrace facility, it is sufficient to select blk as the current tracer. Tracing must also be enabled per block device, by writing 1 to the file /sys/block/<device>/trace/enable. Note that, when the blktrace framework is compiled into the kernel, writing to the per-block-device enable file will start/stop a trace even if the current tracer is not blk.
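
Putting the two steps together, for, say, the sda device (TRACING_DIR, SYSBLOCK_DIR and the function name are illustrative assumptions):

```shell
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}
SYSBLOCK_DIR=${SYSBLOCK_DIR:-/sys/block}

blk_trace_device() {
    # Select the blk tracer, then enable tracing for one block device,
    # e.g. 'blk_trace_device sda'.
    echo blk > "$TRACING_DIR/current_tracer"
    echo 1 > "$SYSBLOCK_DIR/$1/trace/enable"
}
```
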

Extending kernel oopses with function traces

Documenting and studying code is not the only field where a function tracer can be of great help: a complete trace is also precious when debugging a kernel component. The main issue is that we don't know when an error will occur or how critical it will be, and we might not get the chance to collect a trace of the failure at all.
ftrace can be configured to dump a trace to the terminal (or, better, to a serial console) in case of an oops. This behavior, which is disabled by default, can be changed dynamically with the /proc/sys/kernel/ftrace_dump_on_oops switch. It can also be configured at boot time by adding the ftrace_dump_on_oops option to the kernel's command line; we can also choose which tracer to record with when booting with tracing enabled, by placing the ftrace=<tracer_name> option before the former command line parameter. In his article about debugging the kernel with ftrace, Steven Rostedt suggests shrinking ftrace's per-CPU buffer especially in this scenario, as the buffer is dumped to the console as soon as an oops is produced. The same article also points out a nice feature of ftrace: it can be induced to print out its buffer with the SysRq-z combination, allowing the developer to collect a trace of the running kernel even when it is stuck with interrupts disabled.
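
As a minimal sketch of both configuration paths (PROC_SYS_DIR and the function name are invented for illustration):

```shell
# Boot-time alternative: append "ftrace=function_graph ftrace_dump_on_oops"
# to the kernel command line (the tracer selection comes first).
ftrace_dump_on_oops_enable() {
    # Runtime switch: dump the ftrace buffer to the console on an oops.
    echo 1 > "${PROC_SYS_DIR:-/proc/sys/kernel}/ftrace_dump_on_oops"
}
```
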

Sources
ftrace on eLinux: http://elinux.org/Ftrace

Tuesday, April 22, 2014

OPW, Xen: some initial info

Turns out I've been accepted for OPW 2014!
I'm so thrilled to be working with the Xen Project on its paravirtualized block I/O drivers for Linux.
I'll be trying to document my three-month journey in this blog as part of the program, but writing is really not my thing; also, what I write is (pretty obviously) just what I think I have understood from my readings and investigations. So, please bear with me, and do not hesitate to comment on any mistake you may find in these posts.

I'll be working on adding support for the multi-queue API, recently merged into the Linux kernel, to the block I/O paravirtualized Xen drivers for Linux. In the following post, I'll try to provide some initial information about the idea and its motivations; the project was proposed by Konrad Rzeszutek Wilk, to whom I have been assigned as an intern. A more in-depth and articulate description can be found in his project outline, or in the documents linked from the outline itself, but I'm so happy that I'm looking forward even to writing about it, so here is my take.

Multi-queue support for the block I/O paravirtualized Xen drivers

The kernel component in charge of managing block I/O issued by applications is, in Linux, referred to as the block layer. Its task includes passing block I/O requests on to the device driver, which actually interacts with the block device. Until v3.12, the Linux kernel's block layer offered block device drivers two interfaces, the more commonly used of which is called the request interface. When exposing this interface, the block layer internally stores I/O requests, queueing them in a buffering area; while in this area, requests can be re-ordered (e.g. to reduce penalties due to disk head movements), merged (to reduce per-request processing overheads) or, more generally, subjected to the service policies implemented in the block I/O scheduling sub-component of the block layer. Buffering also serves to limit the submission rate, preventing hardware buffers from over-running. From a more practical point of view, when in request-based mode, the block layer keeps a simple request queue for each device; it adds new requests to the queue according to its internal policies, and the device driver drains the queue from its head.

Such an interface was designed when a high-end drive could handle hundreds of I/O operations per second ([1]); in a recent paper ([2]), Linux block layer maintainer Jens Axboe pointed out the three causes that make it a bottleneck when used with a high-performance SSD that can handle hundreds of thousands of IOPS.
  1. The request queue is a major source of contention. It is protected by a single spinlock, the queue_lock, which is grabbed each time an I/O request is issued, considered for a merge, enqueued in the I/O scheduler's private data structures, dispatched to the driver or completed.
  2. Hardware interrupts and request completions reduce the effectiveness of caches. A high number of I/O operations per second causes a high number of interrupts, despite interrupt mitigation ([3]). If one CPU core is in charge of handling all hardware interrupts and forwarding them to the other cores as software interrupts, as happens with most modern storage devices ([2]), that single core may spend a considerable amount of time handling interrupts and switching context, thereby polluting its caches; also, the other cores must each undergo an inter-processor interrupt to execute the I/O completion routine.
  3. Cross-CPU memory accesses exacerbate queue_lock contention. A remote memory access is needed every time the completion routine for an I/O request runs on a CPU core different from the one where the request was last handled: since the request must be removed from the request queue on completion, the queue_lock must be grabbed, and the cache line holding the last state of the lock (stored in the cache of the core where the lock was last acquired) must be explicitly invalidated.
Efforts to remove this bottleneck focused on reducing lock contention and remote memory accesses. Even before v3.13, the Linux kernel already provided a way to short out the request queue (the make request interface), but at the cost of giving up any pre-processing and buffering performed on I/O requests by the block layer.

The multi-queue API introduced in the Linux kernel attempts to address these issues by providing device drivers with a third interface to the block layer. The request queue is split into a pool of separate per-CPU or per-NUMA-socket queues. The first mode, in particular, allows each CPU to queue requests in its private request queue in a lockless fashion, removing the previously-critical lock contention; while in these software queues (or submission queues), a request can be subjected to I/O scheduling policies. The device driver, instead, accesses a set of hardware queues (or dispatch queues), whose number matches the number of queues actually supported by the device; these act as a buffer that adjusts the I/O submission rate between the software queues and the actual dispatch to the device driver.
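
On a kernel with blk-mq support, this split can be observed from sysfs: to my understanding, each hardware context of a multi-queue device shows up as a numbered subdirectory under /sys/block/<device>/mq. A small sketch (SYSBLOCK_DIR and the function name are assumptions for illustration):

```shell
SYSBLOCK_DIR=${SYSBLOCK_DIR:-/sys/block}

blk_mq_hw_queues() {
    # List the hardware (dispatch) queues a blk-mq device exposes:
    # one numbered subdirectory per hardware context.
    ls "$SYSBLOCK_DIR/$1/mq" 2>/dev/null
}
```

For example, blk_mq_hw_queues nullb0 on a machine with the null_blk driver loaded should list one entry per hardware queue.
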

The Xen hypervisor provides paravirtualized drivers for block I/O to avoid (slow) IDE emulation, such as the one provided by QEMU. The PV block I/O driver follows Xen's split device driver model, which comprises a backend driver (kept by a driver domain, usually dom0, which hosts the hardware-specific drivers) and a frontend driver (used by an unprivileged domain, which does not have to deal with hardware-specific or interface-specific details). The two halves of a PV driver share some memory pages, where they keep one or more agreed-upon rings (usually a single ring, or a pair of tx/rx rings); the rings are accessed by both participants through a producer/consumer protocol, which uses an event channel for the needed notifications (e.g. the presence of a new request or response in the ring). This is called the "PV protocol" ([4]), and is used by all paravirtualized drivers in Xen.

The block PV protocol uses only one ring, shared between the driver domain and the guest domain. To avoid allocating a large amount of possibly-unused shared memory, the ring initially has the minimum size needed to queue requests and responses, but not the data itself. As soon as the guest domain starts issuing requests, the frontend driver requests a certain number of memory areas (grants) to hold the data. If the request is a write, the memory areas are filled with the data to be written on behalf of the guest domain, and the necessary access permissions are granted to the driver domain. A grant reference is added to the request, and the request is queued in the ring. At this point, the driver domain is notified that a new request from the guest domain is pending; when the request is finally served, the driver domain parses the grant reference and uses it to map the related memory areas into its own address space. As soon as the request has been fulfilled, the grants are unmapped from the driver domain's address space, a response is written on the shared ring, and the guest domain is notified of the completion; while handling this notification, the guest domain removes the driver domain's access permissions on the grants.

With respect to the original protocol, the current implementation has benefited from many improvements, such as increasing the maximum amount of in-flight I/O with indirect descriptors ([5]), reducing contention on a domain's grant lock, and reducing the number of memory [un]mapping operations needed to pass the I/O on to dom0 ([6]).
However, as currently implemented in Linux, the block PV driver still uses a single block thread per ring to handle I/O transmission.

Porting the multi-queue API to the driver should allow it to allocate per-vCPU block threads, giving each vCPU lockless access to a private ring. The performance benefits should include, as in the Linux kernel ([2]), higher throughput and lower service latency for I/O requests.