Friday, July 11, 2014

OPW, Linux: The block I/O layer, part 3 - The make request interface

During the past weeks, I have been learning about profiling an operating system running in a virtual machine, since I need to examine the driver I am working on to locate bottlenecks and resolve lock contention issues. My OPW mentor suggested that I get to grips with some popular profiling tools, such as perf and lockstat. During my bachelor's thesis I already had the chance to become somewhat familiar with perf, but I am now learning a lot more about collecting accurate data on the performance of a virtualized OS. For example, I read that Xen exploits Intel's Performance Monitoring Unit, which provides architectural support for collecting performance-related data.

During the tests performed prior to profiling, I also had the chance to use the null_blk block device driver to compare the performance of the CFQ and NOOP I/O schedulers under a random workload composed of greedy random readers and writers, on a device with no completion latency. Such a workload emulates Intel's IOmeter on a too-fast-to-be-real device. The throughput achieved by the CFQ I/O scheduler is half of that achieved by NOOP, or even lower, depending on the number of processes issuing I/O.

The NOOP scheduler, however, still merges and sorts requests. Neither seems really necessary with such a workload: I/O operations are issued in a random fashion, so few requests can be merged in any case, and there is no seek penalty that would justify sorting. So, the Linux kernel's block layer already offers something that should perform slightly better than the request API with the NOOP scheduler: the make request interface.
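
To make the point about merges concrete, here is a minimal userspace sketch of a back-merge check. It is not the kernel's merging code: the toy_request structure and both toy_* functions are names made up for illustration, and the only rule modelled is that a new I/O can be appended to a pending request when it starts exactly where that request ends.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a pending request: a contiguous run of sectors.
 * Hypothetical type, not the kernel's struct request. */
struct toy_request {
        unsigned long sector;   /* first sector covered */
        unsigned long nr_sects; /* length in sectors */
};

/* A new I/O can be appended only if it starts where the request ends. */
static bool toy_can_back_merge(const struct toy_request *rq,
                               unsigned long sector)
{
        return rq->sector + rq->nr_sects == sector;
}

/* Try the merge; on success the request grows and true is returned. */
static bool toy_try_back_merge(struct toy_request *rq,
                               unsigned long sector, unsigned long nr_sects)
{
        if (!toy_can_back_merge(rq, sector))
                return false;
        rq->nr_sects += nr_sects;
        return true;
}
```

With a sequential stream the check succeeds again and again; with sectors picked at random over a large device it almost never does, so the time spent looking for merge candidates is mostly wasted.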

The block I/O layer, part 3 - The make request interface

The make request interface (or bio-based interface) essentially short-circuits all processing of block I/O units that would normally follow the creation of a bio structure. It therefore allows the kernel to submit a bio directly to the storage device's driver. Such an interface is useful to any block device driver that needs to pre-process requests before submitting them to the actual underlying device (for example, stacked drivers implementing RAID). Even though that was not its original purpose, the bio-based API is also useful to any block device driver that sees the block layer's processing of I/O requests as overhead; think, for example, of drivers for devices or controllers that feature highly complex internal request processing logic, or that simply don't need requests to be processed. The drawback of such an interface is evident: a driver making use of it loses all the pre-processing normally performed by the block layer.

Figure 1: Block layer layout when using the make request interface
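
As an illustration of the kind of pre-processing a stacked driver might do in its make request function, here is a small userspace sketch of a RAID-0-style remapping. It is only a toy under stated assumptions: toy_target and toy_stripe_map() are hypothetical names, the layout is plain round-robin striping, and an I/O that crosses a chunk boundary (which a real driver would have to split) is not handled.

```c
#include <assert.h>

/* Where a logical sector ends up after striping (hypothetical type). */
struct toy_target {
        unsigned int  dev;    /* index of the underlying device */
        unsigned long sector; /* sector on that device */
};

/* Map a logical sector onto nr_devs devices, chunk_sects sectors at a time. */
static struct toy_target toy_stripe_map(unsigned long sector,
                                        unsigned long chunk_sects,
                                        unsigned int nr_devs)
{
        struct toy_target t;
        unsigned long chunk;

        assert(chunk_sects > 0 && nr_devs > 0);

        chunk = sector / chunk_sects;           /* which chunk overall */
        t.dev = (unsigned int)(chunk % nr_devs); /* round-robin device */
        /* chunks already placed on that device, plus offset in the chunk */
        t.sector = (chunk / nr_devs) * chunk_sects + sector % chunk_sects;
        return t;
}
```

For instance, with 8-sector chunks over two devices, logical sector 20 falls in chunk 2, which lands on device 0 at sector 12. A stacked driver would apply a mapping of this kind to each incoming bio and then pass the bio on to the chosen underlying device.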

Let's see how a driver uses such an interface, again from the code of the very simple null_blk driver. Even in bio-based mode, the null_blk driver still needs to allocate a request_queue structure. The key, however, is what comes after that: registering its own make request function in place of the default one. The null_blk driver does this in its null_add_dev() function, invoked on module initialization for each simulated device that it has to create.

nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node);
blk_queue_make_request(nullb->q, null_queue_bio);
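
The effect of these two calls can be mimicked in userspace with a queue that carries a function pointer. The sketch below is only a model and not kernel code: the structures are cut-down imitations, and queue_init(), queue_set_make_request(), submit(), default_make_request(), toy_dev and toy_queue_bio() are all made-up names standing in, loosely, for queue allocation, blk_queue_make_request(), bio submission, the default path, and a bio-based driver.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace imitations; they mimic, but are not, the kernel's types. */
struct bio {
        unsigned long sector;   /* first sector touched by the I/O */
        unsigned int  nr_sects; /* length in sectors */
};

struct request_queue;
typedef void (*make_request_fn)(struct request_queue *q, struct bio *bio);

struct request_queue {
        make_request_fn make_request; /* entry point for every bio */
        void *queuedata;              /* driver-private pointer */
        unsigned int scheduled;       /* bios that took the default path */
};

/* Default path: stands in for merging, sorting and request allocation. */
static void default_make_request(struct request_queue *q, struct bio *bio)
{
        (void)bio;
        q->scheduled++;
}

/* Hypothetical counterpart of allocating a queue. */
static void queue_init(struct request_queue *q, void *driver_data)
{
        q->make_request = default_make_request;
        q->queuedata = driver_data;
        q->scheduled = 0;
}

/* Hypothetical counterpart of blk_queue_make_request(). */
static void queue_set_make_request(struct request_queue *q, make_request_fn fn)
{
        assert(fn != NULL);
        q->make_request = fn;
}

/* Every submitted bio goes to whatever function the queue points at. */
static void submit(struct request_queue *q, struct bio *bio)
{
        q->make_request(q, bio);
}

/* A toy bio-based driver: it just counts the bios it receives directly. */
struct toy_dev {
        unsigned int handled;
};

static void toy_queue_bio(struct request_queue *q, struct bio *bio)
{
        struct toy_dev *dev = q->queuedata;

        (void)bio;
        dev->handled++;
}
```

Once queue_set_make_request(&q, toy_queue_bio) has been called, submit() hands every bio straight to the driver and the default path is never entered, which is the whole point of the bio-based mode.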

Let's turn our attention to the bulk of the null_queue_bio() function itself. It is very simple and does not even need to allocate new request structures; however, it does need to get a command structure to handle the completion afterwards. It then just handles the command, with no additional processing.

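static void null_queue_bio(struct request_queue *q, struct bio *bio)
{
        struct nullb *nullb = q->queuedata;
        struct nullb_queue *nq = nullb_to_queue(nullb);
        struct nullb_cmd *cmd;

        /* get a command structure to track the completion of this bio */
        cmd = alloc_cmd(nq, 1);
        cmd->bio = bio;

        /* handle the command right away */
        null_handle_cmd(cmd);
}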


In this very simple case, completions are handled by just ending the I/O command with no error, as if it had been executed by a device controller. The null_blk driver does this in its end_cmd() function, which is invoked directly in the context of the previously-seen null_handle_cmd() function: it invokes the block layer's bio_endio() function, passing the completed bio as the first parameter and the error code (zero, meaning success) as the second.

case NULL_Q_BIO:
        /* complete the bio; an error code of 0 means success */
        bio_endio(cmd->bio, 0);
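
To show what ending a bio amounts to from the submitter's point of view, here is a userspace sketch of the completion hand-off. It is a simplified model under stated assumptions, not the kernel's implementation: the bio structure is a cut-down imitation, and toy_bio_endio(), toy_end_cmd(), toy_waiter and toy_on_complete() are hypothetical names. The only behaviour modelled is that ending a bio calls back the function stored in it by whoever submitted it, passing the error code along.

```c
#include <assert.h>
#include <stddef.h>

struct bio;
typedef void (*bio_end_io_fn)(struct bio *bio, int error);

/* Cut-down imitation of a bio: just the completion hook and its context. */
struct bio {
        bio_end_io_fn end_io; /* set by whoever submitted the bio */
        void *private_data;   /* submitter's context */
};

/* Model of ending a bio: hand the result back to the submitter. */
static void toy_bio_endio(struct bio *bio, int error)
{
        if (bio->end_io)
                bio->end_io(bio, error);
}

/* Submitter side: records that the I/O finished and with which result. */
struct toy_waiter {
        int done;
        int error;
};

static void toy_on_complete(struct bio *bio, int error)
{
        struct toy_waiter *w = bio->private_data;

        w->done = 1;
        w->error = error;
}

/* What a null-style driver does: complete immediately, reporting success. */
static void toy_end_cmd(struct bio *bio)
{
        toy_bio_endio(bio, 0);
}
```

Since no real device is involved, the "driver" can report success as soon as it sees the bio; a submitter that set toy_on_complete as its hook would find done set to 1 and error set to 0 right after toy_end_cmd() returns.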

K. Wilk, "Xen Profiling: oprofile and perf" -
J. Corbet, "The multiqueue block layer" -
