Thursday, May 29, 2014

OPW, Linux: Learning about Linux with ftrace

So, this week I've been studying and documenting code and asking questions to my OPW mentor (who, by the way, suggested the topic for this kick-off blog post, thank you so much!), and starting to code (and debug). I've been mainly making use of the ftrace environment to get traces from a running kernel and compare them to what I understood from a static study; also, as I'm working on a block I/O driver, I've been also experimenting with the blktrace tool. The following paragraphs will try to outline how I have been using them.
If you have any feedback and the time to write a comment about it, please do; I'm just a beginner here, and can make use of any suggestion.

Learning about Linux with ftrace

ftrace, whose name is heritage of its initial purpose of kernel-space function tracer, is much more than that: it is available in Linux (and has been from v2.6.27 on) as a framework that features several tracers, each sporting a configuration interface exposed to userspace applications. It indeed includes a function tracer, with the possibility to probe for both the entry points only and the pair of entry/exit points of kernel functions; these are called respectively function and function_graph. Other tracers provided by the framework, instead, allow to trace especially areas that disable interrupts: the irqsoff tracer saves a snapshot of the longest measured period of time where interrupts where disabled; similarly, preemptoff saves a trace of the longest period that had preemption disabled; the preemptirqsoff tracer joins the operations performed by the two. Last, the wakeup and wakeup_rt tracers are involved in recording the maximum service latency of high-priority tasks, the latter focusing on real-time tasks.
ftrace also allows for dynamic tracing, having the system run with no measurable overhead when tracing is disabled; this convenient feature is implemented with code text rewrite: when starting up the kernel, or stopping a running trace, the code of the running kernel is dynamically modified by replacing all calls to function profilers with a no-operation instruction, thereby reducing the overhead to zero. Having the chance to dynamically enable and disable tracing points throughout kernel code also gives the possibility to restrict traces to relevant functions, reducing the overhead unavoidably produced by enabled tracing.

Being a versatile and efficient kernel-space tracing environment, ftrace proves to be an unindispensable tool to the student or developer willing to become familiar with a kernel component. If this is the case, the most useful tracers between those listed above are probably the function and function_graph tracers, as they give a concise representation of the code's execution flow.
The interface to the ftrace framework resides in the kernel's debugfs. Assuming that the debugfs is mounted on the directory /sys/kernel/debug, the first file we'd be willing to use is probably the one that allows us to select a tracer. Let's say we want to collect a function graph. First, we can list available tracers to see if the one we want is actually compiled in the kernel.

# cat /sys/kernel/debug/tracing/available_tracers 
blk function_graph wakeup_dl wakeup_rt wakeup function nop

It seems like we have the function_graph tracer built in our kernel (want to know what that mysterious blk tracer is? Read the section "Tracing the block layer" below). We can select it immediately by echoing its name to the current_tracer file. But, let us see, before that, if any tracer has already been selected.

# cat /sys/kernel/debug/tracing/current_tracer 

The nop tracer is a placeholder to indicate that no tracer is selected. Let's select function_graph instead.

# echo function_graph > /sys/kernel/debug/tracing/current_tracer 
# cat /sys/kernel/debug/tracing/current_tracer

At this point, we can finally start tracing. But first, let's choose an interesting test case to trace. Let's say we want to follow the path of a 4KB read I/O request performed on the sda drive from syscall to dispatch. Writing it as a couple of bash command lines:

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1K count=4

The first command drops all data kept in the kernel's page cache. Dropping the cache might not be necessary, but is strongly advised in this case, as the blocks we are trying to read from disk could by coincidence be already mapped into memory.
Before going on to the real experiment, we can glance at the file that will keep the trance once it's been collected. What we want to read is the /sys/kernel/debug/tracing/trace special file:

# cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
# |     |   |                     |   |   |   |

As of now, the file is empty, but the header already gives us an idea of what the trace will include. In it, for each row, we'll see the id of the CPU where the function was executed, the duration of its execution and, as expected, its name.
Now we can add tracing around our experiment and finally execute it. We will be clearing the contents of the trace file before starting to record, just to be sure we are collecting only the relevant output.

# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 3 > /proc/sys/vm/drop_caches
# echo 1 > /sys/kernel/debug/tracing/tracing_on
# dd if=/dev/sda of=/dev/null bs=1K count=4
# echo 0 > /sys/kernel/debug/tracing/tracing_on

We can glance at our experiment's trace by viewing the contents of the /sys/kernel/debug/tracing/trace file. Be warned, it is most likely the length of the trace's content has very much increased.

Improving trace accuracy

If we start eagerly studying the trace we collected, we will probably encounter a number of issues that affect the readability of the output we collected, the most relevant being that the trace probably includes also function calls that do not concern the service of that 4KB read request we wanted to see. One of the reasons for this issue is that some time unavoidably elapses from the bash executing that "echo 1" command and actually starting the dd application; also, the trace will contain all preliminary operations performed on behalf of the dd application, such as, e.g. loading its executable image into memory, opening the /dev/sda file and the /dev/null file, and so on.

The first thing we'd want to try is restricting the trace to read-related functions. By glancing at documents and manpages, we see that everything starts with the SyS_read() system call. If we could restrict tracing to that only function and its children, the output would be clearer. Turns out we actually can:

# echo SyS_read > /sys/kernel/debug/tracing/set_graph_function
# cat /sys/kernel/debug/tracing/set_graph_function

If we repeat the experiment, we see that now only SyS_read and the functions it calls are in the trace. Although useful, however, this method does not take into account that not all the handling of the I/O happens in the context of the SyS_read function.
Let's revert back to tracing all function by simply clearing the file:

# echo > /sys/kernel/debug/tracing/set_graph_function

What we really want to do to improve the accuracy of our trace and enhance its readability is exploiting one more feature of the ftrace environment: setting trace markers (that is, trace lines that will help our user-space application to get in sync with the kernel, e.g. telling it that we're about to start issuing I/O so that it can report it on the trace or when we have finished). Having a marker in the trace makes sense only if the marker is set immediately before the event we want to mark. This can be done by setting up a helper program, preferrably written in a compiled language, which will allow us to get rid of a big amount of spurious events. The official documentation page for ftrace provides an example for such a helper program. Starting from that outline, I have made available on my gist a draft of possible implementation for our test. Even if almost trivial, it's pretty long, and would certainly be boring and uncomfortable to read on a blog, so I'm linking it here.
If we are sure we want to see only operations performed on behalf of a single process, we could also trace by PID, by specifying in the set_ftrace_pid file the PID of the process we want to trace. However, this still would prevent us from seeing events performed in interrupt context following actions from that process (e.g. I/O request completions). The example I linked above includes tracing by PID, and will therefore discard some events that might be relevant to the student.

Tracing the block layer

While studying the block layer or another block I/O-related kernel component, we could be interested only in tracing actions related to handling of requests. ftrace comes in handy also for collecting traces basing on events; let's see what events it supports, and if we can do something with them.

# cat /sys/kernel/debug/tracing/available_events
<it's a huge list, so I'm snipping>

ftrace is not the only option for block layer-related study and debugging. Actually, the Linux kernel block layer maintainer, Jens Axboe, has introduced a block layer I/O tracing tool, blktrace, that provides detailed traces about request-queue operations from kernel to user space. blktrace efficiently records traces to one or more buffers in a binary format, which is then parsed back into a human-readable format by the blkparse utility. blktrace also adds to common events the log messages produced by block layer components, therefore highly facilitating debugging on block I/O-related subsystems. blktrace is completely integrated with the debugfs interface provided by ftrace, and supports a fair amount of the options provided by the ftrace environment (such as, e.g. filtering traces and dynamic tracepoints).
To enable the block tracer provided by the blktrace facility, it is sufficient to select the blk tracer as current one. It is also necessary to enable traces per-block-device, by writing to the file /sys/block/<device>/trace/enable as previously seen. Note that, when the blktrace framework is compiled in the kernel, writing to the per-block-device enable file will start/stop a trace even if the tracer is not blk.

Extending kernel oopses with function traces

Documenting and studying code is not the only field where a function tracer can be of great help: we can make good use of a complete trace even when debugging a kernel component. The main issue is that we don't know when an error will occur, how critical it will be, and ultimately might not know when, or even be able at all to, collect a trace of the failure.
ftrace can be configured to dump to terminal (or better, to serial console) a trace for us in case of an oops. This behavior, which is disabled by default can be dynamically changed with the switch /proc/sys/kernel/ftrace_dump_on_oops. It can also be configured at boot time by adding the option ftrace_dump_on_oops to the kernel's command line; note that we can also choose which tracer we want to record a trace with also when booting with traces enabled, by placing the ftrace=<tracer_name> option before the former command line parameter. In his article about debugging the kernel with ftrace, Steven Rostedt suggested shrinking the per-CPU buffer kept by ftrace especially in this occasion, as the buffer is to be dumped to the console as soon as an oops is produced. Also, the same article points out a nice feature of ftrace, that can be induced to print out the ftrace buffer with the SysRq-z combination, therefore allowing the developer to collect a trace of the running kernel even if it is stuck with interrupts disabled.

ftrace on eLinux: