Showing posts with label driver. Show all posts
Showing posts with label driver. Show all posts

Friday, August 8, 2014

OPW, Linux: Profiling a Xen Guest

Here is my take at describing how have I been using some of the most popular tools when it comes to profiling. My OPW mentor suggested me to make use of these tools while looking into locking issues I had to deal with in the context of my summer project. I'll also try to include some hints that my mentor provided and should allow to gather more accurate data.
If you have any criticism or suggestion and the time to write about it, please do.


OPW, Linux: Profiling a Xen Guest

The first tool I had fun with is perf, a tool which abstracts away hardware differences of the CPUs supported by Linux and allows to use performance counters for profiling purposes. Performance counters are CPU registers used to count events that are relevant to profiling: instructions executed, cache events, branch prediction performance; they are instrumental to profiling applications as they allow to trace their execution and identify hot spots in the executed code.

The perf tool works on top of those hardware tools and uses them as building blocks to keep also per-task, per-CPU and per-workload counters, also enabling for source code instrumentation, letting the developer gather accurate annotated traces. Tracing is performed with the kprobes (with ftrace) and uprobes frameworks, which both allow for dynamic tracing, therefore highly reducing the overhead unavoidably added by tracepoints. If you are interested in dynamic tracing you can either read the manual or a previous blog post on ftrace.

From a more practical point of view, the fact that perf relies on ftrace also means that it must be compiled on top of your currently-executing kernel; in fact, it lives in a directory of the kernel source code tree, and can be build by exploiting the kernel make subsystem by issuing:

cd tools/perf
make

If you want to install the tool, just enter a make install after that. Otherwise, you'll need to use the perf executable which has been created in the tools/perf directory of your development kernel tree.

Then, we're done! Let's find hot spots in the code I'm writing. Let's record a run of the IOmeter emulator built on top of the fio flexible I/O tester and performed from a paravirtualized Xen guest on a physical block device emulated with the null_blk driver. The test script basically starts fio with the IOmeter jobfile. Note that the jobfile specifies libaio as the I/O engine to use for the test; just keep that in mind.

$ cat iometer_test.sh
#!/bin/bash
fio iometer-file-access-server

To make things simple, let's trace all available perf events.

perf record ./iometer_test.sh

Once the command has been executed, we can ask perf to list the recorded tracepoints, ordering them by placing first the ones with the higher overhead. Since the output is very long, I'm making it available in my Dropbox. I'll just paste here the first lines.

# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 271K of event 'cpu-clock'
# Event count (approx.): 67840250000
#
# Overhead    Command       Shared Object                               Symbol
# ........  .........  ..................  ...................................
#
    24.04%        fio  [kernel.kallsyms]   [k] xen_hypercall_xen_version      
     4.82%        fio  libaio.so.1.0.1     [.] io_submit                      
     3.98%        fio  libaio.so.1.0.1     [.] 0x0000000000000614             
     3.68%        fio  fio                 [.] get_io_u                       
     3.40%        fio  [kernel.kallsyms]   [k] __blockdev_direct_IO           
     2.78%        fio  fio                 [.] axmap_isset                    
     1.93%        fio  [kernel.kallsyms]   [k] lookup_ioctx                   
     1.66%        fio  [kernel.kallsyms]   [k] kmem_cache_alloc               
     1.64%        fio  [kernel.kallsyms]   [k] do_io_submit                   
     1.18%        fio  fio                 [.] _start                         
     1.16%        fio  [kernel.kallsyms]   [k] __blk_mq_run_hw_queue          
     1.08%        fio  [kernel.kallsyms]   [k] blk_throtl_bio                 
     0.91%        fio  [kernel.kallsyms]   [k] blk_mq_make_request            
     0.90%        fio  fio                 [.] fio_gettime                    
     0.90%        fio  fio                 [.] td_io_queue                    
     0.76%        fio  [kernel.kallsyms]   [k] generic_make_request_checks    
     0.72%        fio  [kernel.kallsyms]   [k] pvclock_clocksource_read       
     0.72%        fio  [kernel.kallsyms]   [k] __rcu_read_lock                
     0.71%        fio  [kernel.kallsyms]   [k] aio_read_events                
     0.69%        fio  fio                 [.] add_slat_sample                
     0.67%        fio  fio                 [.] io_u_queued_complete           
     0.66%        fio  [kernel.kallsyms]   [k] generic_file_aio_read          
     0.63%        fio  [kernel.kallsyms]   [k] __fget                         
     0.60%        fio  [kernel.kallsyms]   [k] bio_alloc_bioset               
     0.60%        fio  fio                 [.] add_clat_sample                
     0.59%        fio  [kernel.kallsyms]   [k] blkdev_aio_read                
     0.56%        fio  [kernel.kallsyms]   [k] percpu_ida_alloc               
     0.55%        fio  [kernel.kallsyms]   [k] copy_user_generic_string       
     0.54%        fio  [kernel.kallsyms]   [k] sys_io_getevents               
     0.54%        fio  [kernel.kallsyms]   [k] blkif_queue_rq

First of all, let's look at the fourth column, the one with one character enclosed in square brackets. That columns tells us whether the listed symbol is a userspace symbol (we have a dot in square brackets) or a kernelspace symbol (we see a k character in square brackets).
When it comes to userspace symbols, we can disambiguate whether we're looking at library functions or at user application symbols by glancing at the third column, which shows either the name of the library or the application. The second column, which might mislead us, shows instead the context of execution of the symbol. In this case, most of the execution happened in the context of the fio process, which is the one performing I/O during our tests.
The last column, the one on the right, shows us the name of the symbol, if perf managed to retrieve it. The first column, the one on the left, shows instead the percentage of time in which the list symbol was executing, given that 100% is the amount of time that covered the whole execution of a test.

So, perf did not record just events in the kernel, but also in the library used by fio to perform I/O, which is the libaio one we've seen before. Interesting fact, some of the library calls have a relatively high overhead: for example, the io_submit() function of libaio has been executing for almost the 5% of the test. This doesn't surprise us very much, as we kind of expected to find some hot libaio symbols by just reading Jens Axboe's paper which shows and explains profiling performed on such a test.

The most relevant hotspot, however, is in the kernel, and seems to be the xen_version hypercall, which has been executing for almost a fourth of the test. This seemed to me some kind of mistake until Konrad Wilk told me that the hypercall was probably being executed while grabbing a lock; in fact, the xen_version hypercall is used as a hypervisor callback: it is called to see if an event is pending, and if there is an interrupt is executed on return. That same hypercall, for example, is called whenever the functions spin_lock_irqsave() and spin_unlock_irqrestore() are called. This seems to identify the issue as a locking issue, so...

Debugging contention issues
... to get deeper into the investigation of such an issue, I chose to have a look at lockstat, the lock debugging framework provided by the Linux kernel. Essentially, lockstat provides statistics about the usage of locks, therefore being of great help to idenfity lock contention issues that can heavily impact performance.

First step is enabling it in the kernel's configuration. This can be done by setting CONFIG_LOCK_STAT=y or by browsing the menuconfig to the "Kernel hacking > Lock debugging (spinlocks, mutexes, etc...) > Lock usage statistics" option. As soon as the kernel has been re-compiled and re-installed with the framework enabled, we can enable lock statistics collection with:

$ echo 1 | sudo tee /proc/sys/kernel/lock_stat

From this moment, lock statistics are available in the special procfs file /proc/lock_stat, which we can glance at by issuing:

$ less /proc/lock_stat

An example of the output of that command collected on the frontend guest during the execution of the fio IOmeter test can be found, again, on my Dropbox. It's very very long and would not be much comfortable to read in here, so I'm just commenting it in this blog post.
First, we can see that the output is split in different sections, bordered with dots. Each of the sections contains information about a lock, whose name is the first thing listed there. For example, the first section is about the (&rinfo->io_lock)->rlock. Basically, it's the lock protecting access to the shared data structure used to exchange block I/O requests and responses between the guest domain and the driver domain following Xen's PV protocol. After that, we find a whole set of numbers, whose meaning is explained in the header above. We might be interested in extrapolating the number of contentions (the contentions field), the average time spent in waiting for the lock to be released (waittime-avg), the number of times that the lock has been acquired (acquisitions) and the average time the lock has been held (holdtime-avg). For example, we see that the io_lock has suffered 47500 contentions, has been acquired 1348488 times, and a thread willing to acquire it has been waiting for 25.4 microseconds.
After that batch of information, we find a listing of call traces; this indicated which threads of execution brought to acquiring that lock. Always with reference to our io_lock, we see that the two contenting traces concern the service of an interrupt, probably for a request completion, when the driver queues a response in the shared data structure (blkif_interrupt() is the key here) and the request insertion path, where the driver is queueing a request in the shared data structure (blk_queue_bio(), which is called back as soon as a completion kicks the driver for pending request insertions).

So, this is it. Seems like we've found a possible suspect. Let's make sure there aren't other relevant contention points by just glancing at the stats by ordering them by number of contentions. We can do that very easily if we know that the output is already ordered by number of contentions. Let's just then grep the headers of the sections.

$ grep : /proc/lock_stat

We can see from the output produced by this command that the second-most contented lock (rq->lock) has suffered "just" 1579 contentions during the test, which is an order of magnitude lower than the io_lock.

Tips for more accurate statistics
Just as a side note, I want to list here some tips that my mentor, Konrad Wilk, has given me on collecting more accurate data from profiling tools while dealing with Xen guests, in the hope that they will be useful to others as they have been to me.
Boot the PV guest with vPMU enabled. vPMU is a hardware support to performance monitoring, based on the performance monitoring unit, specifically prepared to help collecting performance data of virtualized operating systems. Exploiting vPMU allows to offload a huge overhead from the hypervisor, therefore allowing for more accurate data. Using it is fairly straight-forward, and involves having the vpmu=1 option in the guest's configuration file.
Use a driver domain. This removes having the dom0 as a bottleneck when using more than one guest, as the backend is located in the driver domain. This also allows to keep roles separated, and to have a specific-purpose domain just for the handling of block I/O.
Use a HVM domain as driver domain. This removes the overhead of PV MMU-related hypercalls and should allow for cleaner data.


Sources
Brendan Gregg, "Perf examples"

Wednesday, July 23, 2014

OPW, Linux: The block I/O layer, part 4 - The multi-queue interface

So, a couple of weeks ago I performed some first tests on the prototype driver using a simulated device created with the null_blk device driver. Such tests, and the following profiling, highlighted some locking issues due to contention on an internal lock kept by the frontend driver; more in detail, such a lock is instrumental to handling a ringbuffer which is used to exchange requests and responses between the frontend and the backend parts of the block I/O driver: the lock protects the ring against the insertion of new requests and the extraction of responses. My very patient OPW mentor has therefore suggested splitting the ring into two separate halves, one used for requests and another used for responses. During the last week and a half I have been working on that; while struggling with making the interface consistent with the needs of older versions of the driver, I have been writing this last blog article about the multi-queue block layer API, which concludes the series; next week I'll bore you to death with something more related to performance. Also, please note that this post is the reword of some documentation I produced during the first weeks of internship, and it has had the benefit of my mentor's reading, so it's probably much more accurate than usual.


The block I/O layer, part 4 - The multi-queue interface

The request interface was designed for devices that could handle hundreds of I/O operations per second; in a recent paper, block layer maintainer Jens Axboe noted that it suffers from design issues when used for devices that can handle hundreds of thousands of IOPS (see my first blog post or, much better, Jens Axboe's paper). One of the capital issues, lock contention, has very relevant effects event if only having multiple cores concurrently inserting block I/O requests and continuously spinning on the single queue_lock; such a situation is exacerbated when the lock is conteded also by a high-end storage device issuing interrupts and whose driver is still spinning on that same lock. The multi-queue API (also referred to as blk-mq) addresses such an issue by exploiting the ability of block I/O controllers to handle multiple requests in parallel, and thus dramatically reducing lock contention; in fact, in its most common configuration, it allows for block I/O requests to be inserted without the need to lock the whole request_queue. Let's see how it manages to do it.

The blk-mq API implements a two-levels block layer design which makes use of two separate sets of request queues: software staging queues, allocated per-CPU, and hardware dispatch queues, whose number typically match the number of actual hardware queues supported by the block device. The number of software staging queues can be higher than the number of hardware dispatch queues: in this case, two or more software queues will be part of the same hardware context, and a dispatch performed with that hardware context will pull in requests from all the associated software queues. The number of software staging queues can be, instead, less than the number of hardware queues: in this case, sequential mapping is performed. In the third and most simple case where the number of software queues equals the number of hardware queues, a direct 1:1 mapping is performed.

Figure 1: Outline of the multi-queue block layer.

Main data structures
The first relevant data structure used by the multi-queue block layer API is the blk_mq_reg structure, containing all informations of importance during the registration of a new block device to the block layer. This data structure contains the pointer to a blk_mq_ops data structure, used to keep track of the specific routines to be used by the multi-queue block layer to interact with the device's driver. A blk_mq_reg structure also keeps the number of hardware queues to be initialized, the dept of such queues and other information useful during the initialization of data structures related to the particular driver in the block layer. Another data structure of importance is the blk_mq_hw_ctx structure, which represents the hardware context to which a request_queue is associated. Its corresponding structure for the software staging queue is the blk_mq_ctx structure, which is allocated per-CPU. The function performing the mapping between these contexts is specified in the map_queue field of the driver's blk_mq_ops data structure, while the mapping built by this function is kept as the mq_map of the request_queue data structure associated to the block device.
Don't worry: a drawing, such as Figure 2, makes it clearer. Kind of.

Figure 2: Data structures used in the multi-queue block layer.

Queue initialization
When a new device driver using the multi-queue API is loaded, it creates and initializes a new blk_mq_ops structure and sets to its address the related pointer of a new blk_mq_reg. More in detail, the required operations are queue_fn, which must be set to a function in charge of handling the command (e.g. by passing it to the low-level driver), and map_queue, which performs the mapping between hardware and software contexts. Other operations are not strictly required, but can be specified in order to perform specific operations on allocation of contexts or on completion of an I/O request. As of necessary data, the driver must initialize the number of submission queues it supports, along with their size; other data are required, e.g., to determine the size of the command supported by the driver and specific flags that must be exposed to the block layer.

When a new device is initialized, its driver prepares a new data structure whose type may vary according to the device driver handling the device; such a driver-specific data structure, however, is very likely to contain a pointer to the device's gendisk struct and to the request_queue related to the device. As soon as the driver has these data structures ready, it invokes the blk_mq_init_queue() function, which initializes the hardware and software contexts and performs the mapping between them. The initialization routine also sets an alternate make_request function, subsituting to the conventional request submission path (which would include blk_make_request()) the multi-queue submission path (which includes, instead, the function blk_mq_make_request()); as usual, the alternate make_request function is set with the blk_queue_make_request() helper.

Request submission
Device initialization substituted the conventional block I/O submission function with the multi-queue-ready request-submission function, blk_mq_make_request(), letting the multi-queue structures be used transparently from the perspective of the upper layers. The make_request function used by the multi-queue block layer includes the possibility to benefit from per-process plugging, but only for drivers supporting a single hardware queue or for async requests. In case the request is sync and the driver actively uses the multi-queue interface, no plugging is performed. The make_request function also performs request merging, searching for a candidate first inside the task's plug list, if plugging is allowed, and finally in the software queue mapped to the current CPU; the submission path does not involve any I/O scheduling-related callback. Finally, make_request sends immediately to the corresponding hardware queue any sync request, while it delays this transition in case of async or flush requests, to allow for subsequent merging and more efficient dispatching.

Request dispatch
In case that an I/O request is synchronous (and therefore no plugging is allowed for it from the multi-queue block layer) its dispatch to the device driver is performed in the context of the same request; if the request is instead async or flush, and task plugging is present, its dispatch can be performed: a) in the context of the submission of another I/O request to a software queue associated to the same hardware queue; b) when the delayed work scheduled during request submission is executed.
The main run-of-queue function of the multi-queue block layer is the blk_mq_run_hw_queue(), which basically relies on another driver-specific routine, pointed by the queue_rq field of its blk_mq_ops structure. This function delays any run of queue for an async request, while it dispatches a sync request immediately to the driver. The inner function __blk_mq_run_hw_queue(), called by blk_mq_run_queue() in case the request is sync, first joins any software queue associated to the currently-in-service hardware queue; then it joins the resulting list with any entry already on the dispatch list. After collecting all to-be-served entries, the function processes them, starting each request and passing it on to the driver, with its queue_rq function. The function finally handles possible errors, by requeue or deletion of the associated requests.

Figure 3: Functions performing request transition between different data structures.

Sources
Bjørling, Matias, et al. "Linux block IO: Introducing multi-queue SSD access on multi-core systems." Proceedings of the 6th International Systems and Storage Conference. ACM, 2013. - http://kernel.dk/systor13-final18.pdf
Johnathan Corbet, "The multi-queue block layer" - http://lwn.net/Articles/552904/
The Linux kernel, blk-mq: new multi-queue block IO queueing mechanism

Friday, July 11, 2014

OPW, Linux: The block I/O layer, part 3 - The make request interface

During the past weeks, I have been learning about profiling an operating system running in a virtual machine, since I have been needing to examine the driver I am working on to locate bottlenecks and work out lock contention issues. My OPW mentor has suggested that I get to grips with some popular profiling tools, such as perf and lockstat. During my bachelor thesis I already had the chance to become familiar with perf to some extent, but I am learning a lot more about collecting accurate data about the performance of a virtualized OS. For example, I read that Xen exploits Intel's Performance Monitoring Unit, which provides architectural support for collecting performance-related data. 

During the tests performed prior to profiling, I also had the chance to make use of the null_blk block device driver to compare the performance of the CFQ and NOOP I/O schedulers with a random workload composed of greedy random readers and writers, having no completion latency. Such a workload emulates Intel's IOmeter on a too-fast-to-be-real device. The throughput achieved by the CFQ I/O scheduler is half of the the one achieved by NOOP, or even lower, depending on the number of processes issuing I/O.

The NOOP scheduler, however, still does merges and sorts requests; none of this seems really necessary with such a workload where I/O operations are issued in a random fashion (so there seem to be not many merges in any case) and there is no seek penalty that would justify sorting. So, there's already something in the Linux kernel's block layer that should perform slightly better than the request API with the NOOP scheduler: the make request interface.


The block I/O layer, part 3 - The make request interface

The make request interface (or bio-based interface) essentially shorts out all processing of block I/O units following the creation of a bio structure. It therefore allows the kernel to directly submit a bio to the storage device's driver. Such an interface is useful to any block device driver needing to perform pre-processing of requests before submitting them to the actual underlying device (such, e.g., stacked drivers implementing RAID). Even if its purpose was not initially that, the bio-based API is also useful to any block device driver that sees the block layer's processing of I/O requests as an overhead; think, for example, to drivers of devices or controller that feature a highly complex internal request processing logic or don't need requests to be processed. The drawbacks for such an interface are evident: a driver making use of it would lose any pre-processing normally performed by the block layer.

Figure 1: Block layer layout when using the make request interface

Let's see how a driver uses such an interface, again from the code of the very simple null_blk driver. Even when in bio-based mode, the null_blk driver still needs to allocate a request_queue structure. The key, however, is defining, after that, an alternate make request function with respect to the default one. The null_blk driver does this in its null_add_dev() function, invoked for each simulated device that it requires to create, on module initialization.

nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node);
blk_queue_make_request(nullb->q, null_queue_bio);

Let's turn our attention to the bulk of the null_queue_bio() function itself. It is very simple and does not even need to allocate new request structures; however it needs to get a command structure to handle completions afterwards. It just handles the block operation's command with no additional operations.

static void null_queue_bio(struct request_queue *q, struct bio *bio)
{
        struct nullb *nullb = q->queuedata;
        struct nullb_queue *nq = nullb_to_queue(nullb);
        struct nullb_cmd *cmd;

        cmd = alloc_cmd(nq, 1); 
        cmd->bio = bio;

        null_handle_cmd(cmd);
}

In this very simple case, completions are handled by just ending the I/O command with no error notification, as if it had been executed by a device controller. We can see how the null_blk driver does in its end_cmd() function, which is invoked directly in the context of the previously-seen null_handle_cmd() function: it invokes the block layer's bio_endio() function by passing to it the completed bio and the error code as its second parameter.

case NULL_Q_BIO:
        bio_endio(cmd->bio, 0);
        break;


K. Wilk, "Xen Profiling: oprofile and perf" - http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf
J. Corbet, "The multiqueue block layer" - http://lwn.net/Articles/552904/

Friday, June 27, 2014

OPW, Linux: The block I/O layer, part 2 - The request interface

Following the many directives and suggestions given by my OPW mentor, I have been implementing the tiniest and simplest block I/O driver, whose multi-queue implementation just fills I/O request structures with available information about the CPU core the process has been issuing I/O from; this is proving very useful to test the multi-queue-capable prototype of the block I/O I am developing. This task gave me the chance to study more in depth the different APIs provided by the block layer of the Linux kernel, and to try to continue the series of blog articles about it.
Here's my take on the second part; it's come out to be more high-level than I would have wanted, but still. Comments, remarks and criticism are welcome, as always.


The block I/O layer, part 2 - The request interface


The request interface bases itself on an essential data structure that allows to the block layer to pass requests to the device driver: the request_queue structure; such a structure has two main buffering functions: it allows for requests to be pre-processed in an attempt to reduce overheads (with merging and sorting) and limits submission rate to prevent hardware buffers from over-running. request_queue structures are allocated per-device, which means that each device has its private request_queue, which can be accessed only by the device-specific block I/O driver in use for the disk. Such a structure is essentially a container of block I/O requests: its more important field, in fact, is a liked list of requests (whose head is kept in the queue_head field) which have been made available to the driver. It also keeps a set of pointer to functions used to submit a bio or dispatch a request to be handled for submission to the device. When in request mode, the block layer also allows for I/O scheduling: therefore, a request_queue structure points to an elevator_queue structure, which in its turn keeps pointers to elevator hooks. Last, but not least, the request_queue structure is protected by a spinlock, the queue_lock.

Figure 1: The block layer's layout when in request mode

If willing to move forward and see what's inside each of the black boxes, a good starting point is a trace. But how do we trace a request cleanly without being bothered by the whole lot of driver-specific and low-level-driver-specific functions, that execute on an interrupt and don't allow us to get a straight-forward trace? We could exploit some ftrace tricks to get the cleanest possible trace (as I clumsily showed in a previous blog article) choosing whether to have a more messy trace or to lose interrupt-related bits, or we can use the null block device driver. I'd personally go for the latter. It comes with the Linux kernel as a module that, when loaded, creates a configurable number of fake block devices (see the nr_devices option of the module), exploiting a block layer API of your choice (given with the queue_mode option). Another nice feature of this driver is that, when in its default configuration, everything, including dispatches and completions, happens in the context of a request insertion, which is perfect, as we can simply trace by PID. I have collected a trace of the submission of a 4KB read request using the request interface, and I will make it available as soon as I have it uploaded.

The first interesting fact, however, is not shown by the trace, as it happens on the initialization of a new device. When initializing a new device, its driver, while being loaded, chooses which interface the block layer should be using for that specific device. The interface is chosen by specifying a make_request_fn for the device, which is usually defined with the blk_queue_make_request() helper function. The default make_request function, which is to be used when wanting the block layer to stay in request mode, is blk_queue_bio().
Now that we know which function is called to submit a bio to the block layer, we can easlily find it in the trace, and see that it is called by generic_make_request(), which in its turn is called by submit_bio(). The latter function is sort of an entry point to the block layer, and used by the upper levels of the block I/O stack to submit a block I/O unit to be processed and submitted to its device.

The most interesting function between the aboved-mentioned, however, is blk_queue_bio(), which implements the core routine of the Linux block layer's request mode. I had prepared a flowchart, but it would become too much of a long figure, so I'll simply try to outline the main steps it performs in the following paragraph. I'll skip bits here and there to try to avoid boring the reader to death; if you want to go through the full function, you can find it around line 1547 of the block/blk-core.c file of the Linux kernel source tree.

Generic block layer: the blk_queue_bio() function

The first step, performed by the function even before grabbing any lock, is that of attempting a merge of the newly-submitted block I/O unit with requests in the task's plug list. When in request mode, plugging is in fact performed per-process.

        if (blk_attempt_plug_merge(q, bio, &request_count))
                return;

The blk_attempt_plug_merge() function tries to merge the bio structure with the current issuer's plug list; it checks only basic merging parameters (gathered with the invocation of blk_try_merge()), avoiding any interaction with the elevator and therefore needing no locking to be performed.
Going on, the blk_queue_bio() function finally needs to grab the queue_lock, as further actions performed in an attempt to merge the block I/O unit will involve per-device structures.

       spin_lock_irq(q->queue_lock);

       el_ret = elv_merge(q, &req, bio);
       if (el_ret == ELEVATOR_BACK_MERGE) {
               if (bio_attempt_back_merge(q, req, bio)) {
                       elv_bio_merged(q, req, bio);
                       if (!attempt_back_merge(q, req))
                               elv_merged_request(q, req, el_ret);
                       goto out_unlock;
               }
       } else if (el_ret == ELEVATOR_FRONT_MERGE) {
               if (bio_attempt_front_merge(q, req, bio)) {
                       elv_bio_merged(q, req, bio);
                       if (!attempt_front_merge(q, req))
                               elv_merged_request(q, req, el_ret);
                       goto out_unlock;
               }
       }

The elv_merge() function handles all operations that concern attempting to merge a bio structure with a request that has been already queued in the elevator. To this purpose, the block layer keeps some private fields that has the aim to make this operation faster: a) a one-hit cache of the last request that has been successfully been involved in a merge, and b) a private list of requests that are currently waiting for dispatch in the elevator. (a) is the most useful, as, if merge succeeds with this one-hit cache, no running through lists is needed; (b) probably speeds up search, too, as, in the elevator, requests might be split throughout different service queues, which might imply multiple search levels if delegating the search to the elevator itself. The elv_merge() function involves invoking one of the elevator hooks (elevator_allow_merge_fn) to ask if a bio and a request can be merged; it returns a value that represents the kind of merge that a bio and a request can undergo. In case a merge succeeds, the device is immediately unlocked and the function returns.

        req = get_request(q, rw_flags, bio, GFP_NOIO);
        if (unlikely(!req)) {
                bio_endio(bio, -ENODEV);        /* @q is dead */
                goto out_unlock;
        }

The block layer keeps a pool of already-allocated request structures for each device.
The number of allocated requests can be retrieved by reading the special sysfs file /sys/block/<device>/queue/nr_requests, while the number of in-flight requests for the device can be read from /sys/block/<device>/inflight. In case every merge attempt of the bio fails, blk_queue_bio() invokes the function get_request() attempts to get a free request from the pool. In case it fails, the helper __get_request() function, called by the latter, activates the request starvation logic, which makes every I/O request blocking (even write requests) for every application. If, instead, a free request structure is correctly retrieved, __get_request() invokes the elevator_set_req_fn elevator hook to initialize the request's elevator-private fields; when the former function has returned, blk_queue_bio() then proceeds to initialize it from the information included in the bio.

        init_request_from_bio(req, bio);

After a new request has been correctly retrieved and initialized from the information kept in the bio, it needs to be inserted in the issuing task's plug list or to be directly submitted to the elevator. The per-process-plugging logic is triggered.

        plug = current->plug;
        if (plug) {
                if (request_count >= BLK_MAX_REQUEST_COUNT)
                        blk_flush_plug_list(plug, false);
                list_add_tail(&req->queuelist, &plug->list);
        } else {
                spin_lock_irq(q->queue_lock);
                __blk_run_queue(q);
                spin_unlock_irq(q->queue_lock);
        }

The newly-initialized request is inserted in the task's plug list only if the task is currently plugged and the length of said list does not exceed BLK_MAX_REQUEST_COUNT, or the task has been plugged for enough few time. Otherwise, if the task is plugged, it becomes unplugged and the plug list is flushed with the blk_flush_plug_list() function. If the task is unplugged, a driver run-of-queue is triggered and the new request is directly passed on to the elevator for further processing, being inserted with the elevator_add_req_fn hook.

The elevator and the device driver

When a run-of-queue is triggered, the driver peeks the request queue and, if it contains requests, it extracts one or more of them from the queue. A run of queue can happen as a consequence of request insertion or following an interrupt from the device's controller. During driver initialization, a request_fn function is set with the blk_init_queue() function, which also allocates a new request_queue structure. When a device wants to get new requests from a request_queue, its run-of-queue invokes the driver's request_fn hook. The latter, in its turn, typically invokes the blk_fetch_request() helper function of the block layer until it no more returns a valid pointer to a request. Inside the loop, the driver handles requests as deemed appropriate.

Let's see how, e.g., the null block driver handles initialization of a request_fn function. The source code for the complete null_add_dev() function is in the drivers/block/null_blk.c file of the Linux kerrnel source tree.

nullb->q = blk_init_queue_node(null_request_fn, &nullb->lock, home_node);

The blk_init_queue_node() function is similar to blk_init_queue(), but allows to specify the memory node the request_queue should be allocated from.

As of the null block driver's request_fn function, it is implemented with a simple loop as follows.

static void null_request_fn(struct request_queue *q) 
{
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL) {
                struct nullb_cmd *cmd = rq->special;

                spin_unlock_irq(q->queue_lock);
                null_handle_cmd(cmd);
                spin_lock_irq(q->queue_lock);
        }

}

The blk_fetch_request() function invokes the blk_peek_request() function, which in its turn uses one of the elevator's callbacks, elevator_dispatch_fn, to trigger the insertion of new requests in the request_queue. The number of requests inserted for each dispatch depends on the implementation of the scheduling policy implemented in the elevator.

Thursday, May 29, 2014

OPW, Linux: Learning about Linux with ftrace

So, this week I've been studying and documenting code and asking questions to my OPW mentor (who, by the way, suggested the topic for this kick-off blog post, thank you so much!), and starting to code (and debug). I've been mainly making use of the ftrace environment to get traces from a running kernel and compare them to what I understood from a static study; also, as I'm working on a block I/O driver, I've been also experimenting with the blktrace tool. The following paragraphs will try to outline how I have been using them.
If you have any feedback and the time to write a comment about it, please do; I'm just a beginner here, and can make use of any suggestion.

Learning about Linux with ftrace

ftrace, whose name is heritage of its initial purpose of kernel-space function tracer, is much more than that: it is available in Linux (and has been from v2.6.27 on) as a framework that features several tracers, each sporting a configuration interface exposed to userspace applications. It indeed includes a function tracer, with the possibility to probe for both the entry points only and the pair of entry/exit points of kernel functions; these are called respectively function and function_graph. Other tracers provided by the framework, instead, allow to trace especially areas that disable interrupts: the irqsoff tracer saves a snapshot of the longest measured period of time where interrupts where disabled; similarly, preemptoff saves a trace of the longest period that had preemption disabled; the preemptirqsoff tracer joins the operations performed by the two. Last, the wakeup and wakeup_rt tracers are involved in recording the maximum service latency of high-priority tasks, the latter focusing on real-time tasks.
ftrace also allows for dynamic tracing, having the system run with no measurable overhead when tracing is disabled; this convenient feature is implemented with code text rewrite: when starting up the kernel, or stopping a running trace, the code of the running kernel is dynamically modified by replacing all calls to function profilers with a no-operation instruction, thereby reducing the overhead to zero. Having the chance to dynamically enable and disable tracing points throughout kernel code also gives the possibility to restrict traces to relevant functions, reducing the overhead unavoidably produced by enabled tracing.

Being a versatile and efficient kernel-space tracing environment, ftrace proves to be an unindispensable tool to the student or developer willing to become familiar with a kernel component. If this is the case, the most useful tracers between those listed above are probably the function and function_graph tracers, as they give a concise representation of the code's execution flow.
The interface to the ftrace framework resides in the kernel's debugfs. Assuming that the debugfs is mounted on the directory /sys/kernel/debug, the first file we'd be willing to use is probably the one that allows us to select a tracer. Let's say we want to collect a function graph. First, we can list available tracers to see if the one we want is actually compiled in the kernel.

# cat /sys/kernel/debug/tracing/available_tracers 
blk function_graph wakeup_dl wakeup_rt wakeup function nop

It seems like we have the function_graph tracer built in our kernel (want to know what that mysterious blk tracer is? Read the section "Tracing the block layer" below). We can select it immediately by echoing its name to the current_tracer file. But, let us see, before that, if any tracer has already been selected.

# cat /sys/kernel/debug/tracing/current_tracer 
nop

The nop tracer is a placeholder to indicate that no tracer is selected. Let's select function_graph instead.

# echo function_graph > /sys/kernel/debug/tracing/current_tracer 
# cat /sys/kernel/debug/tracing/current_tracer
function_graph

At this point, we can finally start tracing. But first, let's choose an interesting test case to trace. Let's say we want to follow the path of a 4KB read I/O request performed on the sda drive from syscall to dispatch. Writing it as a couple of bash command lines:

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1K count=4

The first command drops all data kept in the kernel's page cache. Dropping the cache might not be necessary, but is strongly advised in this case, as the blocks we are trying to read from disk could by coincidence be already mapped into memory.
Before going on to the real experiment, we can glance at the file that will keep the trance once it's been collected. What we want to read is the /sys/kernel/debug/tracing/trace special file:

# cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |

As of now, the file is empty, but the header already gives us an idea of what the trace will include. In it, for each row, we'll see the id of the CPU where the function was executed, the duration of its execution and, as expected, its name.
Now we can add tracing around our experiment and finally execute it. We will be clearing the contents of the trace file before starting to record, just to be sure we are collecting only the relevant output.

# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 3 > /proc/sys/vm/drop_caches
# echo 1 > /sys/kernel/debug/tracing/tracing_on
# dd if=/dev/sda of=/dev/null bs=1K count=4
# echo 0 > /sys/kernel/debug/tracing/tracing_on

We can glance at our experiment's trace by viewing the contents of the /sys/kernel/debug/tracing/trace file. Be warned, it is most likely the length of the trace's content has very much increased.

Improving trace accuracy

If we start eagerly studying the trace we collected, we will probably encounter a number of issues that affect the readability of the output we collected, the most relevant being that the trace probably includes also function calls that do not concern the service of that 4KB read request we wanted to see. One of the reasons for this issue is that some time unavoidably elapses from the bash executing that "echo 1" command and actually starting the dd application; also, the trace will contain all preliminary operations performed on behalf of the dd application, such as, e.g. loading its executable image into memory, opening the /dev/sda file and the /dev/null file, and so on.

The first thing we'd want to try is restricting the trace to read-related functions. By glancing at documents and manpages, we see that everything starts with the SyS_read() system call. If we could restrict tracing to that only function and its children, the output would be clearer. Turns out we actually can:

# echo SyS_read > /sys/kernel/debug/tracing/set_graph_function
# cat /sys/kernel/debug/tracing/set_graph_function
SyS_read

If we repeat the experiment, we see that now only SyS_read and the functions it calls are in the trace. Although useful, however, this method does not take into account that not all the handling of the I/O happens in the context of the SyS_read function.
Let's revert back to tracing all function by simply clearing the file:

# echo > /sys/kernel/debug/tracing/set_graph_function

What we really want to do to improve the accuracy of our trace and enhance its readability is exploiting one more feature of the ftrace environment: setting trace markers (that is, trace lines that will help our user-space application to get in sync with the kernel, e.g. telling it that we're about to start issuing I/O so that it can report it on the trace or when we have finished). Having a marker in the trace makes sense only if the marker is set immediately before the event we want to mark. This can be done by setting up a helper program, preferrably written in a compiled language, which will allow us to get rid of a big amount of spurious events. The official documentation page for ftrace provides an example for such a helper program. Starting from that outline, I have made available on my gist a draft of possible implementation for our test. Even if almost trivial, it's pretty long, and would certainly be boring and uncomfortable to read on a blog, so I'm linking it here.
If we are sure we want to see only operations performed on behalf of a single process, we could also trace by PID, by specifying in the set_ftrace_pid file the PID of the process we want to trace. However, this still would prevent us from seeing events performed in interrupt context following actions from that process (e.g. I/O request completions). The example I linked above includes tracing by PID, and will therefore discard some events that might be relevant to the student.

Tracing the block layer

While studying the block layer or another block I/O-related kernel component, we could be interested only in tracing actions related to handling of requests. ftrace comes in handy also for collecting traces basing on events; let's see what events it supports, and if we can do something with them.

# cat /sys/kernel/debug/tracing/available_events
<it's a huge list, so I'm snipping>
block:block_rq_remap
block:block_bio_remap
block:block_split
block:block_unplug
block:block_plug
block:block_sleeprq
block:block_getrq
block:block_bio_queue
block:block_bio_frontmerge
block:block_bio_backmerge
block:block_bio_complete
block:block_bio_bounce
block:block_rq_issue
block:block_rq_insert
block:block_rq_complete
block:block_rq_requeue
block:block_rq_abort
block:block_dirty_buffer
block:block_touch_buffer
<...>

ftrace is not the only option for block layer-related study and debugging. Actually, the Linux kernel block layer maintainer, Jens Axboe, has introduced a block layer I/O tracing tool, blktrace, that provides detailed traces about request-queue operations from kernel to user space. blktrace efficiently records traces to one or more buffers in a binary format, which is then parsed back into a human-readable format by the blkparse utility. blktrace also adds to common events the log messages produced by block layer components, therefore highly facilitating debugging on block I/O-related subsystems. blktrace is completely integrated with the debugfs interface provided by ftrace, and supports a fair amount of the options provided by the ftrace environment (such as, e.g. filtering traces and dynamic tracepoints).
To enable the block tracer provided by the blktrace facility, it is sufficient to select the blk tracer as current one. It is also necessary to enable traces per-block-device, by writing to the file /sys/block/<device>/trace/enable as previously seen. Note that, when the blktrace framework is compiled in the kernel, writing to the per-block-device enable file will start/stop a trace even if the tracer is not blk.

Extending kernel oopses with function traces

Documenting and studying code is not the only field where a function tracer can be of great help: we can make good use of a complete trace even when debugging a kernel component. The main issue is that we don't know when an error will occur, how critical it will be, and ultimately might not know when, or even be able at all to, collect a trace of the failure.
ftrace can be configured to dump to terminal (or better, to serial console) a trace for us in case of an oops. This behavior, which is disabled by default can be dynamically changed with the switch /proc/sys/kernel/ftrace_dump_on_oops. It can also be configured at boot time by adding the option ftrace_dump_on_oops to the kernel's command line; note that we can also choose which tracer we want to record a trace with also when booting with traces enabled, by placing the ftrace=<tracer_name> option before the former command line parameter. In his article about debugging the kernel with ftrace, Steven Rostedt suggested shrinking the per-CPU buffer kept by ftrace especially in this occasion, as the buffer is to be dumped to the console as soon as an oops is produced. Also, the same article points out a nice feature of ftrace, that can be induced to print out the ftrace buffer with the SysRq-z combination, therefore allowing the developer to collect a trace of the running kernel even if it is stuck with interrupts disabled.

Sources
ftrace on eLinux: http://elinux.org/Ftrace

Tuesday, April 22, 2014

OPW, Xen: some initial info

Turns out I've been accepted for OPW 2014!
I'm so thrilled to be working with the Xen Project to domain support for Linux.
I'll be trying to document my three-months journey in this blog as part of the program, but writing is really not my thing; also, what I will write is (pretty obviously) just what I think I have understood from my readings and investigations. So, please bear with me and do not hesitate to comment about any mistake you may find in these posts, if you want to.

I'll be working on adding some support for the multi-queue API, recently merged in the Linux kernel, to the block I/O paravirtualized Xen driver for Linux. In the following post, I'll try to provide some initial information about the such an idea and its motivations; the project was proposed by Konrad Rzeszutek Wilk, to whom I have been assigned as intern. A more in-depth and articulate description can be found in his project outline or in the documents linked from the outline itself, but I'm so happy that I'm looking forward even to writing about it, so here is my take.

Multi-queue support for the block I/O paravirtualized Xen drivers

The kernel component charged with the task of managing block I/O issued by applications is, in Linux, referred to as the block layer. Its task includes passing on block I/O requests to the device driver, which actually interacts with the block device. Until v3.12, the Linux kernel's block layer offered to block device drivers two interfaces, one of which is more commonly used and is called the request interface. When exposing this interface, the block layer internally stores I/O requests, queueing them in a buffering area; while in this area, requests can be re-ordered (e.g. to reduce penalties due to disk head movements), merged (to reduce per-request-processing overheads) or, more in general, subjected to service policies implemented in the block I/O scheduling sub-component of the block layer. Buffering has also the function to limit submission rate to prevent any hardware buffer from over-running. From a more practical point of view, when in request-based mode, the block layer keeps a simple request queue for each device; it adds new requests to the queue according to its internal policies; the request queue is drained by the device driver from its head.

Such an interface was designed when a high-end drive could handle hundreds of I/O operations per second ([1]); in a recent paper ([2]), Linux block layer maintainer Jens Axboe pointed out the three causes that make it a bottleneck when used with a high-performance SSD that can handle hundreds of thousands of IOPS.
  1. The request queue is a major source of contention. In fact, it is protected by a single spinlock, the queue_lock, which is grabbed each time an I/O request is issued, attempted to be merged, enqueued in the I/O scheduler's private data structures, dispatched to the driver and completed.
  2. Hardware interrupts and request completions reduce usability of caches. A high number of I/O operations per second cause a high number of interrupts, despite the presence of interrupt mitigation ([3]). If one CPU core is in charge of handling all hardware interrupts and forwarding them to other cores as software interrupts, as it happens on most modern storage devices ([2]), a single core may spend a considerable amount of time handling interrupts, switching context, and therefore polluting caches. Also, other cores must undergo an inter-processor interrupt to execute the I/O completion routine.
  3. Cross-CPU memory accesses exacerbate queue_lock contention. Remote memory accesses are needed every time the completion routine for an I/O request is executed on a CPU core which is different from the one on which the request was last handled. In fact, on completion it is necessary to grab the queue_lock, as the request must be removed from the request queue. As a consequence, the cache line with the last state of the lock (stored in the cache of the core where the lock was last acquired) must be explicitly invalidated.
Efforts applied in order to remove such a bottleneck were focused on the key points of reducing lock contention and remote memory accesses. Even before v3.13, the Linux kernel already provided a way to short-out the request queue (with the make request interface), but at the cost of deferring any pre-processing and buffering performed on I/O requests by the block layer.

The multi-queue API introduced in the Linux kernel attempts to address this issue by providing to device drivers a third interface to the block layer. The request queue is split in a pool of separate per-CPU or per-NUMA-socket queues. The first mode, in particular, allows to each CPU to queue requests in its private request queue in a lockless fashion, thus reducing the previously-critical lock contention issues; while in these software queues (or submission queues), a request can be subjected to I/O scheduling policies. The device driver, instead, accesses a set of hardware queues (or dispatch queues), whose number matches the actual number of queues supported by the device driver; those act as a buffer to adjust I/O submission rate between the software queues and the actual dispatch to the device driver.

The Xen hypervisor provides paravirtualized drivers for block I/O to avoid IDE emulation (such as, e.g. the one provided by QEMU), which is slow. The PV block I/O driver follows Xen's split device driver model, which includes a backend driver (kept by a driver domain, usually dom0, which hosts hardware-specific drivers) and a frontend driver (used by one of the unprivileged domains, which does not have to deal with hardware-specific or interface-specific details). The two halves of a PV driver share some memory pages, where they keep one or more consensual rings (might usually be only one ring or a pair of tx/rx rings); those rings are accessed by both partecipants through a producer/consumer protocol which uses an event channel for the needed notifications (e.g. presence of a new request or response in the ring). This is called "PV protocol" ([4]), and is used by all paravirtualized drivers in Xen.

The block PV protocol uses only one ring, which is shared between the driver domain and the guest domain. To avoid the allocation of a large amount of possibly-unused shared memory, initially the ring has the minimum indispensable size to queue requests and responses, but not the data itself. As soon as the guest domain starts issuing requests, the frontend driver requests a certain number of memory areas (grants) to hold the data. If the request is a write request, the memory areas are filled with the data to be written on behalf of the guest domain and necessary permissions are added for the driver domain. A grant reference is added to the request, and the request is queued in the ring. The driver domain, at this point, is notified that a new request from the guest domain is pending service; when the request is finally served, the driver domain parses the grant reference and uses it to map the related memory areas to its own address space. As soon as the request has been fulfulled, the grants are unmapped from the driver domain's address space, a response is written on the shared ring and the guest domain is notified about the completion; while handling such a notification, the guest domain removes any access permission on the grants for the driver domain.

With respect to the original protocol, the current implementation has had the benefit of many improvements, such as increasing the maximum number of in-flight I/O with indirect descriptors ([5]), reducing lock contention on the grant lock of a domain and reducing the number of memory [un]mapping operations needed to pass the I/O on to dom0 ([6]).
However, as currently implemented in Linux, the block PV driver still makes use of a single block thread for each ring to handle I/O transmission.

Porting the multi-queue API to the driver should allow it to allocate per-vCPU block threads, thus allowing to each vCPU lockless access to a private ring. The performance benefits should include, as in the Linux kernel ([2]), increasing throughput and reducing service latency of I/O requests.