Following the many directives and suggestions given by my OPW mentor, I have been implementing the tiniest and simplest block I/O driver, whose multi-queue implementation just fills I/O request structures with the available information about the CPU core the process issued the I/O from; this is proving very useful for testing the multi-queue-capable prototype of the block I/O driver I am developing. This task gave me the chance to study in more depth the different APIs provided by the block layer of the Linux kernel, and to try to continue the series of blog articles about it.
Here's my take on the second part; it's come out to be more high-level than I would have wanted, but still. Comments, remarks and criticism are welcome, as always.
The block I/O layer, part 2 - The request interface
The request interface is built on an essential data structure that allows the block layer to pass requests to the device driver: the request_queue structure. Such a structure has two main buffering functions: it allows requests to be pre-processed in an attempt to reduce overheads (with merging and sorting), and it limits the submission rate to prevent hardware buffers from over-running. request_queue structures are allocated per-device, which means that each device has its private request_queue, which can be accessed only by the device-specific block I/O driver in use for the disk. Such a structure is essentially a container of block I/O requests: its most important field, in fact, is a linked list of requests (whose head is kept in the queue_head field) which have been made available to the driver. It also keeps a set of pointers to functions used to submit a bio or dispatch a request to be handled for submission to the device. When in request mode, the block layer also allows for I/O scheduling: therefore, a request_queue structure points to an elevator_queue structure, which in turn keeps pointers to elevator hooks. Last, but not least, the request_queue structure is protected by a spinlock, the queue_lock.
Figure 1: The block layer's layout when in request mode
To move forward and see what's inside each of the black boxes, a good starting point is a trace. But how do we trace a request cleanly, without being bothered by the whole lot of driver-specific and low-level-driver-specific functions that execute on an interrupt and keep us from getting a straightforward trace? We could exploit some ftrace tricks to get the cleanest possible trace (as I clumsily showed in a previous blog article), choosing between a messier trace and losing the interrupt-related bits, or we can use the null block device driver. I'd personally go for the latter. It comes with the Linux kernel as a module that, when loaded, creates a configurable number of fake block devices (see the nr_devices option of the module), exploiting a block layer API of your choice (given with the queue_mode option). Another nice feature of this driver is that, in its default configuration, everything, including dispatches and completions, happens in the context of a request insertion, which is perfect, as we can simply trace by PID. I have collected a trace of the submission of a 4KB read request using the request interface, and I will make it available as soon as I have it uploaded.
The first interesting fact, however, is not shown by the trace, as it happens on the initialization of a new device. When a new device is initialized, its driver, while being loaded, chooses which interface the block layer should use for that specific device. The interface is chosen by specifying a make_request_fn for the device, which is usually set with the blk_queue_make_request() helper function. The default make_request function, to be used when the block layer should stay in request mode, is blk_queue_bio().
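To make the idea of "choosing the interface by setting a hook" a bit more concrete, here is a minimal user-space sketch. It is not kernel code: every toy_* name is made up for illustration, and the two counters only exist to show which path a bio took.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for struct bio and
 * struct request_queue; none of these names exist in the kernel. */
struct toy_bio {
    unsigned long sector;      /* first sector of the I/O unit */
    unsigned int nr_sectors;   /* length of the I/O unit, in sectors */
};

struct toy_queue;
typedef void (*toy_make_request_fn)(struct toy_queue *q, struct toy_bio *bio);

struct toy_queue {
    toy_make_request_fn make_request_fn;  /* chosen once, at device setup */
    unsigned int queued;   /* bios that went through "request mode" */
    unsigned int direct;   /* bios handled by a driver-private hook */
};

/* Plays the role of blk_queue_bio(): the request-mode entry point. */
static void toy_queue_bio(struct toy_queue *q, struct toy_bio *bio)
{
    (void)bio;
    q->queued++;
}

/* A driver-private hook that bypasses request mode altogether. */
static void toy_driver_make_request(struct toy_queue *q, struct toy_bio *bio)
{
    (void)bio;
    q->direct++;
}

/* Plays the role of blk_queue_make_request(): records the driver's choice. */
static void toy_queue_make_request(struct toy_queue *q, toy_make_request_fn fn)
{
    q->make_request_fn = fn;
    q->queued = 0;
    q->direct = 0;
}

/* Plays the role of the submission path: calls whichever hook was set. */
static void toy_submit_bio(struct toy_queue *q, struct toy_bio *bio)
{
    assert(q->make_request_fn != NULL);
    q->make_request_fn(q, bio);
}
```

Submitting a bio after toy_queue_make_request(&q, toy_queue_bio) bumps the queued counter, while registering toy_driver_make_request instead sends the very same bio down the other path: the upper layers never need to know which one is in use.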
Now that we know which function is called to submit a bio to the block layer, we can easily find it in the trace, and see that it is called by generic_make_request(), which in turn is called by submit_bio(). The latter function is sort of an entry point to the block layer, and is used by the upper levels of the block I/O stack to submit a block I/O unit to be processed and submitted to its device.
The most interesting function among the above-mentioned ones, however, is blk_queue_bio(), which implements the core routine of the Linux block layer's request mode. I had prepared a flowchart, but it would have become too long a figure, so I'll simply try to outline the main steps it performs in the following paragraphs. I'll skip bits here and there to try to avoid boring the reader to death; if you want to go through the full function, you can find it around line 1547 of the block/blk-core.c file of the Linux kernel source tree.
Generic block layer: the blk_queue_bio() function
The first step, performed by the function even before grabbing any lock, is that of attempting a merge of the newly-submitted block I/O unit with requests in the task's plug list. When in request mode, plugging is in fact performed per-process.
if (blk_attempt_plug_merge(q, bio, &request_count))
    return;
The blk_attempt_plug_merge() function tries to merge the bio structure with the current issuer's plug list; it checks only basic merging parameters (gathered with the invocation of blk_try_merge()), avoiding any interaction with the elevator and therefore needing no locking to be performed.
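The "basic merging parameters" essentially boil down to whether the bio and a request are contiguous on disk. Below is a small user-space sketch of that check; the toy_* types and names are hypothetical, and the real code looks at more than sector positions (I leave everything else out here).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; a sketch of the contiguity test only. */
enum toy_merge { TOY_NO_MERGE, TOY_FRONT_MERGE, TOY_BACK_MERGE };

struct toy_extent {
    uint64_t sector;       /* first sector covered */
    uint32_t nr_sectors;   /* number of sectors covered */
};

/* Decide how a bio could be glued to an existing request, if at all. */
static enum toy_merge toy_try_merge(const struct toy_extent *rq,
                                    const struct toy_extent *bio)
{
    assert(rq->nr_sectors > 0 && bio->nr_sectors > 0);

    if (rq->sector + rq->nr_sectors == bio->sector)
        return TOY_BACK_MERGE;    /* bio starts right where rq ends */
    if (bio->sector + bio->nr_sectors == rq->sector)
        return TOY_FRONT_MERGE;   /* bio ends right where rq starts */
    return TOY_NO_MERGE;          /* not adjacent: nothing to merge */
}
```

With a request covering sectors 8 to 15, a bio starting at sector 16 is a back-merge candidate, a bio covering sectors 0 to 7 is a front-merge candidate, and a bio starting at sector 32 cannot be merged at all.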
Going on, the blk_queue_bio() function finally needs to grab the queue_lock, as further actions performed in an attempt to merge the block I/O unit will involve per-device structures.
spin_lock_irq(q->queue_lock);
el_ret = elv_merge(q, &req, bio);
if (el_ret == ELEVATOR_BACK_MERGE) {
    if (bio_attempt_back_merge(q, req, bio)) {
        elv_bio_merged(q, req, bio);
        if (!attempt_back_merge(q, req))
            elv_merged_request(q, req, el_ret);
        goto out_unlock;
    }
} else if (el_ret == ELEVATOR_FRONT_MERGE) {
    if (bio_attempt_front_merge(q, req, bio)) {
        elv_bio_merged(q, req, bio);
        if (!attempt_front_merge(q, req))
            elv_merged_request(q, req, el_ret);
        goto out_unlock;
    }
}
The elv_merge() function handles all operations that concern attempting to merge a bio structure with a request that has already been queued in the elevator. To this purpose, the block layer keeps some private fields whose aim is to make this operation faster: a) a one-hit cache of the last request that has been successfully involved in a merge, and b) a private list of the requests that are currently waiting for dispatch in the elevator. (a) is the most useful as, if the merge succeeds with this one-hit cache, no running through lists is needed; (b) probably speeds up the search, too, as, in the elevator, requests might be split across different service queues, which might imply multiple search levels if the search were delegated to the elevator itself. The elv_merge() function invokes one of the elevator hooks (elevator_allow_merge_fn) to ask whether a bio and a request can be merged; it returns a value that represents the kind of merge that the bio and the request can undergo. In case a merge succeeds, the queue_lock is immediately released and the function returns.
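Here is how I picture the one-hit cache paying off, as a user-space sketch. All the toy_* names are invented, only back merges are considered, and the private list is modeled as a plain singly linked list; the scanned counter is there just to show how much list walking the cache saves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: back merges only, plain singly linked list. */
struct toy_request {
    uint64_t sector;
    uint32_t nr_sectors;
    struct toy_request *next;
};

struct toy_queue {
    struct toy_request *head;        /* requests waiting in the elevator */
    struct toy_request *last_merge;  /* (a) the one-hit cache */
    unsigned int scanned;            /* list nodes visited so far */
};

static int toy_can_back_merge(const struct toy_request *rq, uint64_t sector)
{
    return rq->sector + rq->nr_sectors == sector;
}

/* Find a request that a bio starting at @sector could be appended to. */
static struct toy_request *toy_find_back_merge(struct toy_queue *q,
                                               uint64_t sector)
{
    struct toy_request *rq;

    assert(q != NULL);

    /* (a) try the cached request first: no list walk if it matches */
    if (q->last_merge && toy_can_back_merge(q->last_merge, sector))
        return q->last_merge;

    /* (b) fall back to scanning the private list */
    for (rq = q->head; rq != NULL; rq = rq->next) {
        q->scanned++;
        if (toy_can_back_merge(rq, sector)) {
            q->last_merge = rq;   /* remember the hit for next time */
            return rq;
        }
    }
    return NULL;
}

/* Grow @rq by the size of the merged bio. */
static void toy_back_merge(struct toy_request *rq, uint32_t nr_sectors)
{
    rq->nr_sectors += nr_sectors;
}
```

A sequential stream of bios is the best case: the first lookup walks the list, every following one hits the cached request directly.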
req = get_request(q, rw_flags, bio, GFP_NOIO);
if (unlikely(!req)) {
    bio_endio(bio, -ENODEV); /* @q is dead */
    goto out_unlock;
}
The block layer keeps a pool of already-allocated request structures for each device. The number of allocated requests can be retrieved by reading the special sysfs file /sys/block/<device>/queue/nr_requests, while the number of in-flight requests for the device can be read from /sys/block/<device>/inflight. In case every merge attempt of the bio fails, blk_queue_bio() invokes the get_request() function, which attempts to get a free request from the pool. In case it fails, the __get_request() helper function, called by get_request(), activates the request starvation logic, which makes every I/O request blocking (even write requests) for every application. If, instead, a free request structure is correctly retrieved, __get_request() invokes the elevator_set_req_fn elevator hook to initialize the request's elevator-private fields; once get_request() has returned, blk_queue_bio() proceeds to initialize the request from the information included in the bio.
init_request_from_bio(req, bio);
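A bounded pool of this kind can be modeled in a few lines of user-space C. This is only a sketch under my own assumptions: the toy_* names and the pool size of 4 are made up, and where the kernel would put the caller to sleep or fail depending on the allocation flags, the sketch simply returns NULL and raises a starved flag.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_REQUESTS 4   /* made-up size; stands in for nr_requests */

struct toy_request {
    int in_use;
    unsigned long sector;
};

struct toy_pool {
    struct toy_request slots[TOY_NR_REQUESTS];
    unsigned int in_flight;   /* what the inflight file would report */
    int starved;              /* set when an allocation found no free slot */
};

static void toy_pool_init(struct toy_pool *p)
{
    size_t i;

    for (i = 0; i < TOY_NR_REQUESTS; i++)
        p->slots[i].in_use = 0;
    p->in_flight = 0;
    p->starved = 0;
}

/* Hand out a free request, or NULL (and mark starvation) if none is left. */
static struct toy_request *toy_get_request(struct toy_pool *p)
{
    size_t i;

    for (i = 0; i < TOY_NR_REQUESTS; i++) {
        if (!p->slots[i].in_use) {
            p->slots[i].in_use = 1;
            p->in_flight++;
            return &p->slots[i];
        }
    }
    p->starved = 1;
    return NULL;
}

/* Give a completed request back to the pool. */
static void toy_put_request(struct toy_pool *p, struct toy_request *rq)
{
    assert(rq->in_use);
    rq->in_use = 0;
    p->in_flight--;
    p->starved = 0;   /* a slot is free again */
}
```

The pool size is what bounds the submission rate: once TOY_NR_REQUESTS requests are in flight, a new one can only be obtained after a completion has returned a slot.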
After a new request has been correctly retrieved and initialized from the information kept in the bio, it needs to be inserted in the issuing task's plug list or to be directly submitted to the elevator. The per-process-plugging logic is triggered.
plug = current->plug;
if (plug) {
    if (request_count >= BLK_MAX_REQUEST_COUNT)
        blk_flush_plug_list(plug, false);
    list_add_tail(&req->queuelist, &plug->list);
} else {
    spin_lock_irq(q->queue_lock);
    __blk_run_queue(q);
    spin_unlock_irq(q->queue_lock);
}
If the task is currently plugged, the newly-initialized request is added to the task's plug list. Before that, if the list already holds BLK_MAX_REQUEST_COUNT requests or more, it is flushed with the blk_flush_plug_list() function, so that the plug list never grows without bound. If the task is not plugged, the new request is passed directly on to the elevator for further processing, being inserted with the elevator_add_req_fn hook, and a driver run-of-queue is triggered.
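The branch structure of the excerpt above can be replayed in user space with a couple of counters. The sketch below uses invented toy_* names and a limit of 16 that I picked for the example; flushing is reduced to moving a counter, which is of course nothing like what blk_flush_plug_list() really has to do.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REQUEST_COUNT 16   /* value picked for this sketch */

struct toy_plug {
    unsigned int count;     /* requests currently sitting in the plug list */
    unsigned int flushed;   /* requests pushed down by flushes so far */
};

struct toy_task {
    struct toy_plug *plug;  /* NULL when the task is not plugged */
    unsigned int direct;    /* requests sent straight to the elevator */
};

/* Empty the plug list, handing everything it holds to the lower layer. */
static void toy_flush_plug(struct toy_plug *plug)
{
    plug->flushed += plug->count;
    plug->count = 0;
}

/* Same shape as the excerpt: flush if the list is full, then add. */
static void toy_add_request(struct toy_task *task)
{
    struct toy_plug *plug;

    assert(task != NULL);
    plug = task->plug;

    if (plug) {
        if (plug->count >= TOY_MAX_REQUEST_COUNT)
            toy_flush_plug(plug);
        plug->count++;      /* the new request always ends up plugged */
    } else {
        task->direct++;     /* unplugged: goes down immediately */
    }
}
```

Note that, in the plugged branch, the new request lands on the plug list in both cases: a full list only means that the older requests are flushed first.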
The elevator and the device driver
When a run-of-queue is triggered, the driver peeks at the request queue and, if it contains requests, extracts one or more of them from the queue. A run-of-queue can happen as a consequence of a request insertion or following an interrupt from the device's controller. During driver initialization, a request_fn function is set with the blk_init_queue() function, which also allocates a new request_queue structure. When a device wants to get new requests from a request_queue, its run-of-queue invokes the driver's request_fn hook. The latter, in turn, typically invokes the blk_fetch_request() helper function of the block layer until it no longer returns a valid pointer to a request. Inside the loop, the driver handles requests as deemed appropriate.
Let's see how, e.g., the null block driver registers its request_fn function. The source code of the complete null_add_dev() function is in the drivers/block/null_blk.c file of the Linux kernel source tree.
nullb->q = blk_init_queue_node(null_request_fn, &nullb->lock, home_node);
The blk_init_queue_node() function is similar to blk_init_queue(), but allows specifying the memory node the request_queue should be allocated from.
As for the null block driver's request_fn function, it is implemented with a simple loop, as follows.
static void null_request_fn(struct request_queue *q)
{
    struct request *rq;

    while ((rq = blk_fetch_request(q)) != NULL) {
        struct nullb_cmd *cmd = rq->special;

        spin_unlock_irq(q->queue_lock);
        null_handle_cmd(cmd);
        spin_lock_irq(q->queue_lock);
    }
}
The blk_fetch_request() function invokes the blk_peek_request() function, which in turn uses one of the elevator's callbacks, elevator_dispatch_fn, to trigger the insertion of new requests in the request_queue. The number of requests inserted with each dispatch depends on the scheduling policy implemented by the elevator.
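To see the interplay between the fetch loop and the dispatch callback, here is a last user-space sketch. Everything in it is an assumption of mine: the toy_* names, the two singly linked lists, and above all the scheduling policy, a trivial "lowest sector first" that moves one request per dispatch.

```c
#include <assert.h>
#include <stddef.h>

struct toy_request {
    unsigned long sector;
    struct toy_request *next;
};

struct toy_queue {
    struct toy_request *dispatch_head;  /* requests visible to the driver */
    struct toy_request *sched_head;     /* requests still in the elevator */
    int (*dispatch_fn)(struct toy_queue *q);  /* the elevator's hook */
};

/* Made-up policy: move the lowest-sector request to the dispatch list.
 * Returns the number of requests moved (0 or 1). */
static int toy_dispatch_lowest(struct toy_queue *q)
{
    struct toy_request **pp, **best = NULL;
    struct toy_request *rq;

    for (pp = &q->sched_head; *pp != NULL; pp = &(*pp)->next)
        if (best == NULL || (*pp)->sector < (*best)->sector)
            best = pp;
    if (best == NULL)
        return 0;              /* the elevator holds nothing */

    rq = *best;
    *best = rq->next;          /* unlink from the elevator's list */
    rq->next = q->dispatch_head;
    q->dispatch_head = rq;
    return 1;
}

/* Like the fetch helper: ask the elevator only when nothing is ready. */
static struct toy_request *toy_fetch_request(struct toy_queue *q)
{
    struct toy_request *rq;

    assert(q->dispatch_fn != NULL);
    if (q->dispatch_head == NULL && !q->dispatch_fn(q))
        return NULL;
    rq = q->dispatch_head;
    q->dispatch_head = rq->next;
    rq->next = NULL;
    return rq;
}

/* The driver side: fetch until the queue runs dry (or @order is full),
 * recording the order in which sectors were served. */
static unsigned int toy_request_fn(struct toy_queue *q,
                                   unsigned long *order, unsigned int max)
{
    struct toy_request *rq;
    unsigned int n = 0;

    while (n < max && (rq = toy_fetch_request(q)) != NULL)
        order[n++] = rq->sector;
    return n;
}
```

Requests inserted in the order 20, 10, 30 reach the driver as 10, 20, 30: the driver's loop is oblivious to the policy, which lives entirely in the dispatch hook.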