Thursday, May 29, 2014

OPW, Linux: Learning about Linux with ftrace

So, this week I've been studying and documenting code and asking questions to my OPW mentor (who, by the way, suggested the topic for this kick-off blog post, thank you so much!), and starting to code (and debug). I've been mainly making use of the ftrace environment to get traces from a running kernel and compare them to what I understood from a static study; also, as I'm working on a block I/O driver, I've been also experimenting with the blktrace tool. The following paragraphs will try to outline how I have been using them.
If you have any feedback and the time to write a comment about it, please do; I'm just a beginner here, and can make use of any suggestion.

Learning about Linux with ftrace

ftrace, whose name is heritage of its initial purpose of kernel-space function tracer, is much more than that: it is available in Linux (and has been from v2.6.27 on) as a framework that features several tracers, each sporting a configuration interface exposed to userspace applications. It indeed includes a function tracer, with the possibility to probe for both the entry points only and the pair of entry/exit points of kernel functions; these are called respectively function and function_graph. Other tracers provided by the framework, instead, allow to trace especially areas that disable interrupts: the irqsoff tracer saves a snapshot of the longest measured period of time where interrupts where disabled; similarly, preemptoff saves a trace of the longest period that had preemption disabled; the preemptirqsoff tracer joins the operations performed by the two. Last, the wakeup and wakeup_rt tracers are involved in recording the maximum service latency of high-priority tasks, the latter focusing on real-time tasks.
ftrace also allows for dynamic tracing, having the system run with no measurable overhead when tracing is disabled; this convenient feature is implemented with code text rewrite: when starting up the kernel, or stopping a running trace, the code of the running kernel is dynamically modified by replacing all calls to function profilers with a no-operation instruction, thereby reducing the overhead to zero. Having the chance to dynamically enable and disable tracing points throughout kernel code also gives the possibility to restrict traces to relevant functions, reducing the overhead unavoidably produced by enabled tracing.

Being a versatile and efficient kernel-space tracing environment, ftrace proves to be an unindispensable tool to the student or developer willing to become familiar with a kernel component. If this is the case, the most useful tracers between those listed above are probably the function and function_graph tracers, as they give a concise representation of the code's execution flow.
The interface to the ftrace framework resides in the kernel's debugfs. Assuming that the debugfs is mounted on the directory /sys/kernel/debug, the first file we'd be willing to use is probably the one that allows us to select a tracer. Let's say we want to collect a function graph. First, we can list available tracers to see if the one we want is actually compiled in the kernel.

# cat /sys/kernel/debug/tracing/available_tracers 
blk function_graph wakeup_dl wakeup_rt wakeup function nop

It seems like we have the function_graph tracer built in our kernel (want to know what that mysterious blk tracer is? Read the section "Tracing the block layer" below). We can select it immediately by echoing its name to the current_tracer file. But, let us see, before that, if any tracer has already been selected.

# cat /sys/kernel/debug/tracing/current_tracer 
nop

The nop tracer is a placeholder to indicate that no tracer is selected. Let's select function_graph instead.

# echo function_graph > /sys/kernel/debug/tracing/current_tracer 
# cat /sys/kernel/debug/tracing/current_tracer
function_graph

At this point, we can finally start tracing. But first, let's choose an interesting test case to trace. Let's say we want to follow the path of a 4KB read I/O request performed on the sda drive from syscall to dispatch. Writing it as a couple of bash command lines:

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1K count=4

The first command drops all data kept in the kernel's page cache. Dropping the cache might not be necessary, but is strongly advised in this case, as the blocks we are trying to read from disk could by coincidence be already mapped into memory.
Before going on to the real experiment, we can glance at the file that will keep the trance once it's been collected. What we want to read is the /sys/kernel/debug/tracing/trace special file:

# cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |

As of now, the file is empty, but the header already gives us an idea of what the trace will include. In it, for each row, we'll see the id of the CPU where the function was executed, the duration of its execution and, as expected, its name.
Now we can add tracing around our experiment and finally execute it. We will be clearing the contents of the trace file before starting to record, just to be sure we are collecting only the relevant output.

# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 3 > /proc/sys/vm/drop_caches
# echo 1 > /sys/kernel/debug/tracing/tracing_on
# dd if=/dev/sda of=/dev/null bs=1K count=4
# echo 0 > /sys/kernel/debug/tracing/tracing_on

We can glance at our experiment's trace by viewing the contents of the /sys/kernel/debug/tracing/trace file. Be warned, it is most likely the length of the trace's content has very much increased.

Improving trace accuracy

If we start eagerly studying the trace we collected, we will probably encounter a number of issues that affect the readability of the output we collected, the most relevant being that the trace probably includes also function calls that do not concern the service of that 4KB read request we wanted to see. One of the reasons for this issue is that some time unavoidably elapses from the bash executing that "echo 1" command and actually starting the dd application; also, the trace will contain all preliminary operations performed on behalf of the dd application, such as, e.g. loading its executable image into memory, opening the /dev/sda file and the /dev/null file, and so on.

The first thing we'd want to try is restricting the trace to read-related functions. By glancing at documents and manpages, we see that everything starts with the SyS_read() system call. If we could restrict tracing to that only function and its children, the output would be clearer. Turns out we actually can:

# echo SyS_read > /sys/kernel/debug/tracing/set_graph_function
# cat /sys/kernel/debug/tracing/set_graph_function
SyS_read

If we repeat the experiment, we see that now only SyS_read and the functions it calls are in the trace. Although useful, however, this method does not take into account that not all the handling of the I/O happens in the context of the SyS_read function.
Let's revert back to tracing all function by simply clearing the file:

# echo > /sys/kernel/debug/tracing/set_graph_function

What we really want to do to improve the accuracy of our trace and enhance its readability is exploiting one more feature of the ftrace environment: setting trace markers (that is, trace lines that will help our user-space application to get in sync with the kernel, e.g. telling it that we're about to start issuing I/O so that it can report it on the trace or when we have finished). Having a marker in the trace makes sense only if the marker is set immediately before the event we want to mark. This can be done by setting up a helper program, preferrably written in a compiled language, which will allow us to get rid of a big amount of spurious events. The official documentation page for ftrace provides an example for such a helper program. Starting from that outline, I have made available on my gist a draft of possible implementation for our test. Even if almost trivial, it's pretty long, and would certainly be boring and uncomfortable to read on a blog, so I'm linking it here.
If we are sure we want to see only operations performed on behalf of a single process, we could also trace by PID, by specifying in the set_ftrace_pid file the PID of the process we want to trace. However, this still would prevent us from seeing events performed in interrupt context following actions from that process (e.g. I/O request completions). The example I linked above includes tracing by PID, and will therefore discard some events that might be relevant to the student.

Tracing the block layer

While studying the block layer or another block I/O-related kernel component, we could be interested only in tracing actions related to handling of requests. ftrace comes in handy also for collecting traces basing on events; let's see what events it supports, and if we can do something with them.

# cat /sys/kernel/debug/tracing/available_events
<it's a huge list, so I'm snipping>
block:block_rq_remap
block:block_bio_remap
block:block_split
block:block_unplug
block:block_plug
block:block_sleeprq
block:block_getrq
block:block_bio_queue
block:block_bio_frontmerge
block:block_bio_backmerge
block:block_bio_complete
block:block_bio_bounce
block:block_rq_issue
block:block_rq_insert
block:block_rq_complete
block:block_rq_requeue
block:block_rq_abort
block:block_dirty_buffer
block:block_touch_buffer
<...>

ftrace is not the only option for block layer-related study and debugging. Actually, the Linux kernel block layer maintainer, Jens Axboe, has introduced a block layer I/O tracing tool, blktrace, that provides detailed traces about request-queue operations from kernel to user space. blktrace efficiently records traces to one or more buffers in a binary format, which is then parsed back into a human-readable format by the blkparse utility. blktrace also adds to common events the log messages produced by block layer components, therefore highly facilitating debugging on block I/O-related subsystems. blktrace is completely integrated with the debugfs interface provided by ftrace, and supports a fair amount of the options provided by the ftrace environment (such as, e.g. filtering traces and dynamic tracepoints).
To enable the block tracer provided by the blktrace facility, it is sufficient to select the blk tracer as current one. It is also necessary to enable traces per-block-device, by writing to the file /sys/block/<device>/trace/enable as previously seen. Note that, when the blktrace framework is compiled in the kernel, writing to the per-block-device enable file will start/stop a trace even if the tracer is not blk.

Extending kernel oopses with function traces

Documenting and studying code is not the only field where a function tracer can be of great help: we can make good use of a complete trace even when debugging a kernel component. The main issue is that we don't know when an error will occur, how critical it will be, and ultimately might not know when, or even be able at all to, collect a trace of the failure.
ftrace can be configured to dump to terminal (or better, to serial console) a trace for us in case of an oops. This behavior, which is disabled by default can be dynamically changed with the switch /proc/sys/kernel/ftrace_dump_on_oops. It can also be configured at boot time by adding the option ftrace_dump_on_oops to the kernel's command line; note that we can also choose which tracer we want to record a trace with also when booting with traces enabled, by placing the ftrace=<tracer_name> option before the former command line parameter. In his article about debugging the kernel with ftrace, Steven Rostedt suggested shrinking the per-CPU buffer kept by ftrace especially in this occasion, as the buffer is to be dumped to the console as soon as an oops is produced. Also, the same article points out a nice feature of ftrace, that can be induced to print out the ftrace buffer with the SysRq-z combination, therefore allowing the developer to collect a trace of the running kernel even if it is stuck with interrupts disabled.

Sources
ftrace on eLinux: http://elinux.org/Ftrace

Tuesday, April 22, 2014

OPW, Xen: some initial info

Turns out I've been accepted for OPW 2014!
I'm so thrilled to be working with the Xen Project to domain support for Linux.
I'll be trying to document my three-months journey in this blog as part of the program, but writing is really not my thing; also, what I will write is (pretty obviously) just what I think I have understood from my readings and investigations. So, please bear with me and do not hesitate to comment about any mistake you may find in these posts, if you want to.

I'll be working on adding some support for the multi-queue API, recently merged in the Linux kernel, to the block I/O paravirtualized Xen driver for Linux. In the following post, I'll try to provide some initial information about the such an idea and its motivations; the project was proposed by Konrad Rzeszutek Wilk, to whom I have been assigned as intern. A more in-depth and articulate description can be found in his project outline or in the documents linked from the outline itself, but I'm so happy that I'm looking forward even to writing about it, so here is my take.

Multi-queue support for the block I/O paravirtualized Xen drivers

The kernel component charged with the task of managing block I/O issued by applications is, in Linux, referred to as the block layer. Its task includes passing on block I/O requests to the device driver, which actually interacts with the block device. Until v3.12, the Linux kernel's block layer offered to block device drivers two interfaces, one of which is more commonly used and is called the request interface. When exposing this interface, the block layer internally stores I/O requests, queueing them in a buffering area; while in this area, requests can be re-ordered (e.g. to reduce penalties due to disk head movements), merged (to reduce per-request-processing overheads) or, more in general, subjected to service policies implemented in the block I/O scheduling sub-component of the block layer. Buffering has also the function to limit submission rate to prevent any hardware buffer from over-running. From a more practical point of view, when in request-based mode, the block layer keeps a simple request queue for each device; it adds new requests to the queue according to its internal policies; the request queue is drained by the device driver from its head.

Such an interface was designed when a high-end drive could handle hundreds of I/O operations per second ([1]); in a recent paper ([2]), Linux block layer maintainer Jens Axboe pointed out the three causes that make it a bottleneck when used with a high-performance SSD that can handle hundreds of thousands of IOPS.
  1. The request queue is a major source of contention. In fact, it is protected by a single spinlock, the queue_lock, which is grabbed each time an I/O request is issued, attempted to be merged, enqueued in the I/O scheduler's private data structures, dispatched to the driver and completed.
  2. Hardware interrupts and request completions reduce usability of caches. A high number of I/O operations per second cause a high number of interrupts, despite the presence of interrupt mitigation ([3]). If one CPU core is in charge of handling all hardware interrupts and forwarding them to other cores as software interrupts, as it happens on most modern storage devices ([2]), a single core may spend a considerable amount of time handling interrupts, switching context, and therefore polluting caches. Also, other cores must undergo an inter-processor interrupt to execute the I/O completion routine.
  3. Cross-CPU memory accesses exacerbate queue_lock contention. Remote memory accesses are needed every time the completion routine for an I/O request is executed on a CPU core which is different from the one on which the request was last handled. In fact, on completion it is necessary to grab the queue_lock, as the request must be removed from the request queue. As a consequence, the cache line with the last state of the lock (stored in the cache of the core where the lock was last acquired) must be explicitly invalidated.
Efforts applied in order to remove such a bottleneck were focused on the key points of reducing lock contention and remote memory accesses. Even before v3.13, the Linux kernel already provided a way to short-out the request queue (with the make request interface), but at the cost of deferring any pre-processing and buffering performed on I/O requests by the block layer.

The multi-queue API introduced in the Linux kernel attempts to address this issue by providing to device drivers a third interface to the block layer. The request queue is split in a pool of separate per-CPU or per-NUMA-socket queues. The first mode, in particular, allows to each CPU to queue requests in its private request queue in a lockless fashion, thus reducing the previously-critical lock contention issues; while in these software queues (or submission queues), a request can be subjected to I/O scheduling policies. The device driver, instead, accesses a set of hardware queues (or dispatch queues), whose number matches the actual number of queues supported by the device driver; those act as a buffer to adjust I/O submission rate between the software queues and the actual dispatch to the device driver.

The Xen hypervisor provides paravirtualized drivers for block I/O to avoid IDE emulation (such as, e.g. the one provided by QEMU), which is slow. The PV block I/O driver follows Xen's split device driver model, which includes a backend driver (kept by a driver domain, usually dom0, which hosts hardware-specific drivers) and a frontend driver (used by one of the unprivileged domains, which does not have to deal with hardware-specific or interface-specific details). The two halves of a PV driver share some memory pages, where they keep one or more consensual rings (might usually be only one ring or a pair of tx/rx rings); those rings are accessed by both partecipants through a producer/consumer protocol which uses an event channel for the needed notifications (e.g. presence of a new request or response in the ring). This is called "PV protocol" ([4]), and is used by all paravirtualized drivers in Xen.

The block PV protocol uses only one ring, which is shared between the driver domain and the guest domain. To avoid the allocation of a large amount of possibly-unused shared memory, initially the ring has the minimum indispensable size to queue requests and responses, but not the data itself. As soon as the guest domain starts issuing requests, the frontend driver requests a certain number of memory areas (grants) to hold the data. If the request is a write request, the memory areas are filled with the data to be written on behalf of the guest domain and necessary permissions are added for the driver domain. A grant reference is added to the request, and the request is queued in the ring. The driver domain, at this point, is notified that a new request from the guest domain is pending service; when the request is finally served, the driver domain parses the grant reference and uses it to map the related memory areas to its own address space. As soon as the request has been fulfulled, the grants are unmapped from the driver domain's address space, a response is written on the shared ring and the guest domain is notified about the completion; while handling such a notification, the guest domain removes any access permission on the grants for the driver domain.

With respect to the original protocol, the current implementation has had the benefit of many improvements, such as increasing the maximum number of in-flight I/O with indirect descriptors ([5]), reducing lock contention on the grant lock of a domain and reducing the number of memory [un]mapping operations needed to pass the I/O on to dom0 ([6]).
However, as currently implemented in Linux, the block PV driver still makes use of a single block thread for each ring to handle I/O transmission.

Porting the multi-queue API to the driver should allow it to allocate per-vCPU block threads, thus allowing to each vCPU lockless access to a private ring. The performance benefits should include, as in the Linux kernel ([2]), increasing throughput and reducing service latency of I/O requests.