
Sunday, November 23, 2014

Linux: quick debugging example feat. blktrace

I've been silent for more than two months now, and while I have been working on a post about deadlock debugging, and on my OPW patches (v2 thread here), and finishing up my last exams, and speaking at CloudOpen Europe (see abstract, slides - and also pictures, courtesy of Lars Kurth and Sven Jaborek), and finally completing my Master's Degree (see me and my friends celebrating here), and thinking about what I want to do with my life - I also decided to try to upstream a very small bugfix patch I prepared for the Linux kernel's blktrace core.

Before getting to the patch itself, I think it could be interesting - and not too long either - to explain the debugging process behind it. So, let's bore you with some kernel oops dissection!

Quick debugging example feat. blktrace

A few months ago I was using the BFQ I/O scheduler's benchmark suite and I happened to stumble on the following kernel oops:

[ 117.383477] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 117.391959] IP: [<ffffffff8112686b>] sysfs_blk_trace_attr_store+0x2ab/0x380
[ 117.399486] PGD 1aa7f4067 PUD 1aa7b8067 PMD 0
[ 117.399488] Oops: 0002 [#1] PREEMPT SMP
[ 117.399529] Modules linked in: bnep coretemp intel_powerclamp kvm_intel arc4 iwldvm kvm i915 mac80211 uvcvideo btusb crct10dif_pclmul bluetooth videobuf2_vmalloc videobuf2_memops crc32_pclmul crc32c_intel ghash_clmulni_intel videobuf2_core iwlwifi aesni_intel v4l2_common videodev snd_hda_codec_hdmi qcserial usb_wwan aes_x86_64 lrw snd_hda_codec_conexant ppdev usbserial iTCO_wdt gf128mul snd_hda_codec_generic mxm_wmi media glue_helper ablk_helper iTCO_vendor_support joydev cfg80211 snd_hda_intel mousedev cryptd snd_hda_controller snd_hda_codec drm_kms_helper psmouse microcode evdev e1000e pcspkr mac_hid drm snd_hwdep serio_raw intel_ips parport_pc thinkpad_acpi i2c_i801 snd_pcm tpm_tis mei_me nvram parport tpm thermal snd_timer rfkill shpchp mei snd intel_agp hwmon battery ptp lpc_ich wmi ac soundcore i2c_algo_bit i2c_core intel_gtt pps_core video button acpi_cpufreq sch_fq_codel processor hid_generic usbhid hid ext4 crc16 mbcache jbd2 ehci_pci ehci_hcd sd_mod ahci sdhci_pci libahci atkbd usbcore libps2 sdhci libata led_class firewire_ohci mmc_core firewire_core scsi_mod crc_itu_t usb_common i8042 serio
[ 117.399549] CPU: 1 PID: 520 Comm: tee Not tainted 3.18.0-rc5-rs232-test-kdump #190
[ 117.399550] Hardware name: LENOVO 2522WX8/2522WX8, BIOS 6IET85WW (1.45 ) 02/14/2013
[ 117.399551] task: ffff8800b6e65080 ti: ffff8801ab0f4000 task.ti: ffff8801ab0f4000
[ 117.399555] RIP: 0010:[<ffffffff8112686b>] [<ffffffff8112686b>] sysfs_blk_trace_attr_store+0x2ab/0x380
[ 117.399556] RSP: 0018:ffff8801ab0f7df8 EFLAGS: 00010046
[ 117.399557] RAX: 0000000000000000 RBX: ffff8800b606c018 RCX: 0000000000000000
[ 117.399557] RDX: ffff8801ab0f7d98 RSI: ffff8800b6e65080 RDI: ffffffff8180a1c0
[ 117.399558] RBP: ffff8801ab0f7e68 R08: ffff8801ab0f4000 R09: ffff8801b5001700
[ 117.399559] R10: ffff8801bbc974e0 R11: ffffea0006c0bf40 R12: ffff8800b606c000
[ 117.399560] R13: 0000000000000002 R14: ffff8800b52c7500 R15: ffff880037928000
[ 117.399561] FS: 00007fd010cde700(0000) GS:ffff8801bbc80000(0000) knlGS:0000000000000000
[ 117.399562] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 117.399563] CR2: 0000000000000008 CR3: 00000001b073b000 CR4: 00000000000007e0
[ 117.399564] Stack:
[ 117.399566] 0000000000000001 0000000000000296 ffff8801ab0f7e48 ffff8801b05e6470
[ 117.399568] ffffffff81095a30 0000000000000000 ffff8801b2b92c70 000000003d312879
[ 117.399569] 00007fff7a7e1f82 0000000000000002 ffff8801b0313790 ffff8800b52c7200
[ 117.399570] Call Trace:
[ 117.399576] [<ffffffff81095a30>] ? wake_up_process+0x50/0x50
[ 117.399582] [<ffffffff813accc8>] dev_attr_store+0x18/0x30
[ 117.399588] [<ffffffff8123ee8a>] sysfs_kf_write+0x3a/0x50
[ 117.399590] [<ffffffff8123e3ce>] kernfs_fop_write+0xee/0x180
[ 117.399594] [<ffffffff811c63a7>] vfs_write+0xb7/0x200
[ 117.399596] [<ffffffff811c62b4>] ? vfs_read+0x144/0x180
[ 117.399598] [<ffffffff811c6ef9>] SyS_write+0x59/0xd0
[ 117.399603] [<ffffffff8154bf69>] system_call_fastpath+0x12/0x17
[ 117.399620] Code: 00 00 00 f0 ff 0d ce 10 9d 00 0f 84 c2 00 00 00 48 c7 c7 c0 a1 80 81 e8 64 4c 42 00 49 8b 4e 58 49 8b 46 60 48 c7 c7 c0 a1 80 81 <48> 89 41 08 48 89 08 48 b8 00 01 10 00 00 00 ad de 49 89 46 58
[ 117.399622] RIP [<ffffffff8112686b>] sysfs_blk_trace_attr_store+0x2ab/0x380
[ 117.399623] RSP <ffff8801ab0f7df8>
[ 117.399624] CR2: 0000000000000008
[ 117.399626] ---[ end trace b0bf4bf2fac3d64a ]---
[ 117.399628] note: tee[520] exited with preempt_count 1

After a few steps of narrowing the reproducer down, I could establish that it was the sequence of the following two simple bash commands:

# echo 1 > /sys/block/sda/trace/enable
# echo 0 > /sys/block/sda/trace/enable

Put even more simply, the action triggering the kernel oops was stopping a block trace. After glaring at the pesky oops in disappointment for some time, I finally decided to investigate it. After considering bisecting it for a while, I decided to just call out to gdb for help. Fortunately, I was running a custom kernel with debugging symbols compiled in, so I didn't have to run through the process I described in a previous post.

$ gdb vmlinux
GNU gdb (GDB) 7.8.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vmlinux...done.
(gdb) l *(sysfs_blk_trace_attr_store+0x2ab)
0xffffffff8112686b is in sysfs_blk_trace_attr_store (include/linux/list.h:89).
84       * This is only for internal list manipulation where we know
85       * the prev/next entries already!
86       */
87      static inline void __list_del(struct list_head * prev, struct list_head * next)
88      {
89              next->prev = prev;
90              prev->next = next;
91      }
92
93      /**
(gdb)

Well, not much help, gdb. That's pretty generic. The crashing point is a NULL pointer dereference happening in a list-handling function inlined into the sysfs_blk_trace_attr_store() blktrace function. Let's have a look at the blktrace function to see if it handles lists somewhere. Now, we could probably just guess that it lives in the blktrace core, which resides in kernel/trace/blktrace.c, but let's pretend we don't even know where the function is and see how we can find out.

Solution 1 - the simplest: grep for it and hope for the best. But we're lazy and we don't want to wait for the output to eventually appear. Solution 2 - use cscope, an efficient utility for exploring C source files. With

$ cscope -R

we simply ask cscope to index all C files (and related headers) in the current directory and all its subdirectories. We are then dropped into a text-based interface that offers a handful of queries and, most importantly, allows us to search for all definitions of a symbol. Just what we need.
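If you'd rather stay on the command line, cscope also has a non-interactive, line-oriented mode; something along these lines should print the location of the definition directly (query type 1 is "find this global definition" - do double-check the flags against your cscope version):

$ cscope -R -L -1 sysfs_blk_trace_attr_store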

Finally, we find the source code of the sysfs_blk_trace_attr_store() blktrace function, but at first it doesn't seem to help either. It's very long and does not handle lists by itself. Fortunately, by examining the code, we can see it is quite simple to follow. More specifically, we know that the issue happens when we stop a block trace, that is, when we write 0 to the /sys/block/sdX/trace/enable tunable for a block device. Now, let's have a look at this piece of code from the sysfs_blk_trace_attr_store() function:

if (attr == &dev_attr_enable) {
        if (value)
                ret = blk_trace_setup_queue(q, bdev);
        else
                ret = blk_trace_remove_queue(q);
        goto out_unlock_bdev;
}

So, just to explain what it means: if the attribute we're writing to is the trace enabler, then: if the value we're setting is true (1), we call blk_trace_setup_queue(), else we call blk_trace_remove_queue(). Then we jump to the label that unlocks the block device.

Seems like the next function we're going to look at is blk_trace_remove_queue()! And it also seems it's shorter, woot! And it handles a list, woot woot!

static int blk_trace_remove_queue(struct request_queue *q)
{
        struct blk_trace *bt;

        bt = xchg(&q->blk_trace, NULL);
        if (bt == NULL)
                return -EINVAL;

        if (atomic_dec_and_test(&blk_probes_ref))
                blk_unregister_tracepoints();

        spin_lock_irq(&running_trace_lock);
        list_del(&bt->running_list);
        spin_unlock_irq(&running_trace_lock);
        blk_trace_free(bt);
        return 0;
}

The point where the function handles lists concerns the removal of a block trace from a list keeping pointers to all running traces. Makes sense. Let's see where the insertion happens.
It seems that the only invocation of list_add(&bt->running_list, &running_trace_list) is in the blk_trace_startstop() function. That's pretty weird, because the only invocations of the latter we can find in blktrace.c are reached as a result of actions issued through the ioctl-based interface exposed by blktrace. This probably means that, when we use the sysfs tunable and set it to 1, the trace is never inserted into the running list at all.
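We don't even need cscope for that check; a plain grep over the blktrace core is enough to see who touches the running list (the command below is just one way to do it):

$ grep -nE "running_list|blk_trace_startstop" kernel/trace/blktrace.c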

By looking at the code we can also see that, when a trace is added to the running trace list, the state of the trace is changed to Blktrace_running. Also, when the trace is removed from the list in blk_trace_startstop(), it is removed only if its state is Blktrace_running.

As a proof of concept, we can try to mimic what happens in blk_trace_startstop() by replicating it in blk_trace_remove_queue() (as in the first version I proposed on lkml):

kernel/trace/blktrace.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index c1bd4ad..f58b617 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1493,9 +1493,11 @@ static int blk_trace_remove_queue(struct request_queue *q)
  if (atomic_dec_and_test(&blk_probes_ref))
   blk_unregister_tracepoints();
 
- spin_lock_irq(&running_trace_lock);
- list_del(&bt->running_list);
- spin_unlock_irq(&running_trace_lock);
+ if (bt->trace_state == Blktrace_running) {
+         spin_lock_irq(&running_trace_lock);
+         list_del(&bt->running_list);
+         spin_unlock_irq(&running_trace_lock);
+ }
  blk_trace_free(bt);
  return 0;
 }

Actually, even though this patch reached its goal, it is not the right solution: it turns out that the running trace list is reserved to the ioctl-based interface and, as such, should never be touched by the sysfs-based interface. After Namhyung Kim pointed that out, I proposed a second version of the patch, which is even smaller.

kernel/trace/blktrace.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index c1bd4ad..bd05fd2 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1493,9 +1493,6 @@ static int blk_trace_remove_queue(struct request_queue *q)
  if (atomic_dec_and_test(&blk_probes_ref))
   blk_unregister_tracepoints();
 
- spin_lock_irq(&running_trace_lock);
- list_del(&bt->running_list);
- spin_unlock_irq(&running_trace_lock);
  blk_trace_free(bt);
  return 0;
 }

Friday, August 8, 2014

OPW, Linux: Profiling a Xen Guest

Here is my take at describing how I have been using some of the most popular profiling tools. My OPW mentor suggested that I make use of these tools while looking into locking issues I had to deal with in the context of my summer project. I'll also try to include some hints that my mentor provided, which should help gather more accurate data.
If you have any criticism or suggestions and the time to write about them, please do.


Profiling a Xen Guest

The first tool I had fun with is perf, a tool which abstracts away the hardware differences of the CPUs supported by Linux and lets us use performance counters for profiling purposes. Performance counters are CPU registers used to count events that are relevant to profiling: instructions executed, cache events, branch prediction performance; they are instrumental to profiling applications as they allow tracing their execution and identifying hot spots in the executed code.

The perf tool works on top of those hardware counters and uses them as building blocks to also keep per-task, per-CPU and per-workload counters, and it enables source code instrumentation, letting the developer gather accurate annotated traces. Tracing is performed with the kprobes (via ftrace) and uprobes frameworks, which both allow for dynamic tracing, greatly reducing the overhead that tracepoints unavoidably add. If you are interested in dynamic tracing you can either read the manual or a previous blog post on ftrace.

From a more practical point of view, the fact that perf relies on ftrace also means that it should be compiled against the kernel you are currently running; in fact, it lives in a directory of the kernel source tree, and can be built by exploiting the kernel make subsystem, by issuing:

cd tools/perf
make

If you want to install the tool, just run make install after that. Otherwise, you'll need to use the perf executable that has been created in the tools/perf directory of your development kernel tree.

Then, we're done! Let's find hot spots in the code I'm writing. Let's record a run of the IOmeter emulator built on top of the fio flexible I/O tester, performed from a paravirtualized Xen guest on a block device emulated with the null_blk driver. The test script basically starts fio with the IOmeter jobfile. Note that the jobfile specifies libaio as the I/O engine to use for the test; just keep that in mind.

$ cat iometer_test.sh
#!/bin/bash
fio iometer-file-access-server
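As a side note, if you want to set up a similar backing device, the null_blk module can be loaded in the backend domain with parameters along these lines; the parameter names are the driver's standard ones, while the values here are just an example of a multiqueue, no-added-latency configuration, not my exact setup:

# modprobe null_blk nr_devices=1 queue_mode=2 irqmode=0 completion_nsec=0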

To keep things simple, let's not restrict the events being recorded and just run perf record with its default event.

perf record ./iometer_test.sh

Once the command has been executed, we can ask perf to report the recorded samples, sorted so that the symbols with the highest overhead come first. Since the output is very long, I'm making it available in my Dropbox. I'll just paste here the first lines.
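For reference, the report below was produced with something like the following command (perf report sorts symbols by overhead by default; --stdio just asks for plain-text output instead of the interactive TUI):

$ perf report --stdio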

# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 271K of event 'cpu-clock'
# Event count (approx.): 67840250000
#
# Overhead    Command       Shared Object                               Symbol
# ........  .........  ..................  ...................................
#
    24.04%        fio  [kernel.kallsyms]   [k] xen_hypercall_xen_version      
     4.82%        fio  libaio.so.1.0.1     [.] io_submit                      
     3.98%        fio  libaio.so.1.0.1     [.] 0x0000000000000614             
     3.68%        fio  fio                 [.] get_io_u                       
     3.40%        fio  [kernel.kallsyms]   [k] __blockdev_direct_IO           
     2.78%        fio  fio                 [.] axmap_isset                    
     1.93%        fio  [kernel.kallsyms]   [k] lookup_ioctx                   
     1.66%        fio  [kernel.kallsyms]   [k] kmem_cache_alloc               
     1.64%        fio  [kernel.kallsyms]   [k] do_io_submit                   
     1.18%        fio  fio                 [.] _start                         
     1.16%        fio  [kernel.kallsyms]   [k] __blk_mq_run_hw_queue          
     1.08%        fio  [kernel.kallsyms]   [k] blk_throtl_bio                 
     0.91%        fio  [kernel.kallsyms]   [k] blk_mq_make_request            
     0.90%        fio  fio                 [.] fio_gettime                    
     0.90%        fio  fio                 [.] td_io_queue                    
     0.76%        fio  [kernel.kallsyms]   [k] generic_make_request_checks    
     0.72%        fio  [kernel.kallsyms]   [k] pvclock_clocksource_read       
     0.72%        fio  [kernel.kallsyms]   [k] __rcu_read_lock                
     0.71%        fio  [kernel.kallsyms]   [k] aio_read_events                
     0.69%        fio  fio                 [.] add_slat_sample                
     0.67%        fio  fio                 [.] io_u_queued_complete           
     0.66%        fio  [kernel.kallsyms]   [k] generic_file_aio_read          
     0.63%        fio  [kernel.kallsyms]   [k] __fget                         
     0.60%        fio  [kernel.kallsyms]   [k] bio_alloc_bioset               
     0.60%        fio  fio                 [.] add_clat_sample                
     0.59%        fio  [kernel.kallsyms]   [k] blkdev_aio_read                
     0.56%        fio  [kernel.kallsyms]   [k] percpu_ida_alloc               
     0.55%        fio  [kernel.kallsyms]   [k] copy_user_generic_string       
     0.54%        fio  [kernel.kallsyms]   [k] sys_io_getevents               
     0.54%        fio  [kernel.kallsyms]   [k] blkif_queue_rq

First of all, let's look at the fourth column, the one with one character enclosed in square brackets. That column tells us whether the listed symbol is a userspace symbol (we see a dot in square brackets) or a kernelspace symbol (we see a k character in square brackets).
When it comes to userspace symbols, we can disambiguate whether we're looking at library functions or at user application symbols by glancing at the third column, which shows either the name of the library or the application. The second column, which might mislead us, shows instead the context of execution of the symbol. In this case, most of the execution happened in the context of the fio process, which is the one performing I/O during our tests.
The last column, the one on the right, shows us the name of the symbol, if perf managed to retrieve it. The first column, the one on the left, shows instead the percentage of time during which the listed symbol was executing, where 100% is the duration of the whole test.

So, perf did not record just events in the kernel, but also in the library used by fio to perform I/O, which is the libaio one we've seen before. Interestingly, some of the library calls have a relatively high overhead: for example, the io_submit() function of libaio has been executing for almost 5% of the test. This doesn't surprise us very much, as we kind of expected to find some hot libaio symbols just by reading Jens Axboe's paper, which shows and explains profiling performed on such a test.
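If we want to drill down into one of these hot spots, perf can also annotate a single symbol, interleaving the source code with the disassembly when debug information is available; something along these lines should work (io_submit is just the symbol picked from the report above, and --stdio again keeps the output non-interactive):

$ perf annotate --stdio io_submit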

The most relevant hotspot, however, is in the kernel, and seems to be the xen_version hypercall, which has been executing for almost a fourth of the test. This seemed to me like some kind of mistake, until Konrad Wilk told me that the hypercall was probably being executed while grabbing a lock; in fact, the xen_version hypercall is used as a hypervisor callback: it is called to see if an event is pending and, if there is one, the interrupt is executed on return. That same hypercall, for example, is issued whenever the functions spin_lock_irqsave() and spin_unlock_irqrestore() are called. This points to a locking issue, so...

Debugging contention issues
... to get deeper into the investigation of such an issue, I chose to have a look at lockstat, the lock debugging framework provided by the Linux kernel. Essentially, lockstat provides statistics about the usage of locks, which makes it a great help in identifying lock contention issues that can heavily impact performance.

The first step is enabling it in the kernel's configuration. This can be done by setting CONFIG_LOCK_STAT=y or by navigating menuconfig to the "Kernel hacking > Lock debugging (spinlocks, mutexes, etc...) > Lock usage statistics" option. As soon as the kernel has been re-compiled and re-installed with the framework enabled, we can enable lock statistics collection with:

$ echo 1 | sudo tee /proc/sys/kernel/lock_stat

From this moment, lock statistics are available in the special procfs file /proc/lock_stat, which we can glance at by issuing:

$ less /proc/lock_stat
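The statistics keep accumulating from the moment collection is enabled, so for a clean measurement of a single run it can be worth clearing them right before starting the workload; according to the kernel's lockstat documentation, writing 0 to the file does exactly that. A possible sequence, reusing the iometer_test.sh script from above:

$ echo 0 | sudo tee /proc/lock_stat
$ ./iometer_test.sh
$ sudo cat /proc/lock_stat > lockstat_iometer.txt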

An example of the output of that command, collected on the frontend guest during the execution of the fio IOmeter test, can be found, again, on my Dropbox. It's very, very long and would not be very comfortable to read here, so I'm just commenting on it in this blog post.
First, we can see that the output is split into sections, bordered with dots. Each section contains information about a lock, whose name is the first thing listed there. For example, the first section is about the (&rinfo->io_lock)->rlock. Basically, it's the lock protecting access to the shared data structure used to exchange block I/O requests and responses between the guest domain and the driver domain following Xen's PV protocol. After that, we find a whole set of numbers, whose meaning is explained in the header above. We might be interested in extracting the number of contentions (the contentions field), the average time spent waiting for the lock to be released (waittime-avg), the number of times the lock has been acquired (acquisitions) and the average time the lock has been held (holdtime-avg). For example, we see that the io_lock has suffered 47500 contentions, has been acquired 1348488 times, and a thread wanting to acquire it has waited, on average, 25.4 microseconds.
After that batch of information, we find a listing of call traces; these indicate which paths of execution led to acquiring that lock. Still with reference to our io_lock, we see that the two contending traces concern the service of an interrupt, probably for a request completion, when the driver queues a response in the shared data structure (blkif_interrupt() is the key here), and the request insertion path, where the driver is queueing a request in the shared data structure (blk_queue_bio(), which is called back as soon as a completion kicks the driver for pending request insertions).

So, this is it. Seems like we've found a possible suspect. Let's make sure there aren't other relevant contention points by glancing at the other sections. Since the output is already ordered by number of contentions, we can simply grep the headers of the sections.

$ grep : /proc/lock_stat

We can see from the output produced by this command that the second most contended lock (rq->lock) has suffered "just" 1579 contentions during the test, which is an order of magnitude less than the io_lock.

Tips for more accurate statistics
Just as a side note, I want to list here some tips that my mentor, Konrad Wilk, has given me on collecting more accurate data from profiling tools while dealing with Xen guests, in the hope that they will be useful to others as they have been to me.
Boot the PV guest with vPMU enabled. vPMU is hardware support for performance monitoring, based on the performance monitoring unit, specifically meant to help collect performance data of virtualized operating systems. Exploiting vPMU offloads a lot of overhead from the hypervisor, therefore allowing for more accurate data. Using it is fairly straightforward, and involves having the vpmu=1 option in the guest's configuration file (a minimal configuration fragment is sketched after this list).
Use a driver domain. This removes having the dom0 as a bottleneck when using more than one guest, as the backend is located in the driver domain. It also keeps roles separated, and gives you a special-purpose domain dedicated to handling block I/O.
Use an HVM domain as the driver domain. This removes the overhead of PV MMU-related hypercalls and should allow for cleaner data.
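Just to show where that option goes, here is a hypothetical fragment of a guest configuration file; only the vpmu=1 line is the point here, while the other entries are placeholders standing in for whatever your guest already uses:

name = "fio-pv-guest"
memory = 2048
vcpus = 2
# enable the virtual PMU for this guest
vpmu = 1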


Sources
Brendan Gregg, "Perf examples"

Thursday, May 29, 2014

OPW, Linux: Learning about Linux with ftrace

So, this week I've been studying and documenting code and asking questions to my OPW mentor (who, by the way, suggested the topic for this kick-off blog post, thank you so much!), and starting to code (and debug). I've mainly been making use of the ftrace environment to get traces from a running kernel and compare them to what I understood from a static study; and, as I'm working on a block I/O driver, I've also been experimenting with the blktrace tool. The following paragraphs will try to outline how I have been using them.
If you have any feedback and the time to write a comment about it, please do; I'm just a beginner here, and can make use of any suggestion.

Learning about Linux with ftrace

ftrace, whose name is a legacy of its initial purpose as a kernel-space function tracer, is much more than that: it is available in Linux (and has been since v2.6.27) as a framework that features several tracers, each sporting a configuration interface exposed to userspace applications. It does include a function tracer, which can probe either just the entry points of kernel functions or both their entry and exit points; these tracers are called, respectively, function and function_graph. Other tracers provided by the framework focus instead on areas that disable interrupts: the irqsoff tracer saves a snapshot of the longest measured period of time during which interrupts were disabled; similarly, preemptoff saves a trace of the longest period that had preemption disabled; the preemptirqsoff tracer combines the operations performed by the two. Last, the wakeup and wakeup_rt tracers record the maximum service latency of high-priority tasks, the latter focusing on real-time tasks.
ftrace also allows for dynamic tracing, letting the system run with no measurable overhead when tracing is disabled; this convenient feature is implemented with code text rewriting: when the kernel boots, or when a running trace is stopped, the code of the running kernel is dynamically modified by replacing all calls to the function profilers with a no-operation instruction, thereby reducing the overhead to virtually zero. Being able to dynamically enable and disable tracing points throughout kernel code also makes it possible to restrict traces to the relevant functions, reducing the overhead unavoidably produced while tracing is enabled.

Being a versatile and efficient kernel-space tracing environment, ftrace proves to be an indispensable tool for the student or developer willing to become familiar with a kernel component. If this is the case, the most useful tracers among those listed above are probably the function and function_graph tracers, as they give a concise representation of the code's execution flow.
The interface to the ftrace framework resides in the kernel's debugfs. Assuming that debugfs is mounted on the directory /sys/kernel/debug, the first file we'll want to use is probably the one that allows us to select a tracer. Let's say we want to collect a function graph. First, we can list the available tracers to see if the one we want is actually compiled into the kernel.

# cat /sys/kernel/debug/tracing/available_tracers 
blk function_graph wakeup_dl wakeup_rt wakeup function nop

It seems like we have the function_graph tracer built into our kernel (want to know what that mysterious blk tracer is? Read the section "Tracing the block layer" below). We can select it immediately by echoing its name to the current_tracer file. But first, let's check whether a tracer has already been selected.

# cat /sys/kernel/debug/tracing/current_tracer 
nop

The nop tracer is a placeholder to indicate that no tracer is selected. Let's select function_graph instead.

# echo function_graph > /sys/kernel/debug/tracing/current_tracer 
# cat /sys/kernel/debug/tracing/current_tracer
function_graph

At this point, we can finally start tracing. But first, let's choose an interesting test case to trace. Let's say we want to follow the path of a 4KB read I/O request performed on the sda drive from syscall to dispatch. Writing it as a couple of bash command lines:

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1K count=4

The first command drops all data kept in the kernel's page cache. Dropping the cache might not be necessary, but is strongly advised in this case, as the blocks we are trying to read from disk could by coincidence be already mapped into memory.
Before going on to the real experiment, we can glance at the file that will keep the trace once it's been collected. What we want to read is the /sys/kernel/debug/tracing/trace special file:

# cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |

As of now, the file is empty, but the header already gives us an idea of what the trace will include. In it, for each row, we'll see the id of the CPU where the function was executed, the duration of its execution and, as expected, its name.
Now we can add tracing around our experiment and finally execute it. We will be clearing the contents of the trace file before starting to record (re-selecting the tracer resets ftrace's ring buffer), just to be sure we are collecting only the relevant output.

# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 3 > /proc/sys/vm/drop_caches
# echo 1 > /sys/kernel/debug/tracing/tracing_on
# dd if=/dev/sda of=/dev/null bs=1K count=4
# echo 0 > /sys/kernel/debug/tracing/tracing_on

We can glance at our experiment's trace by viewing the contents of the /sys/kernel/debug/tracing/trace file. Be warned: the trace is most likely a lot longer now.

Improving trace accuracy

If we start eagerly studying the trace we collected, we will probably encounter a number of issues that affect the readability of the output, the most relevant being that the trace probably also includes function calls that have nothing to do with the service of that 4KB read request we wanted to see. One of the reasons for this is that some time unavoidably elapses between bash executing that "echo 1" command and the dd application actually starting; also, the trace will contain all the preliminary operations performed on behalf of the dd application, such as loading its executable image into memory, opening the /dev/sda and /dev/null files, and so on.

The first thing we'd want to try is restricting the trace to read-related functions. By glancing at documents and manpages, we see that everything starts with the SyS_read() system call. If we could restrict tracing to just that function and its children, the output would be clearer. Turns out we actually can:

# echo SyS_read > /sys/kernel/debug/tracing/set_graph_function
# cat /sys/kernel/debug/tracing/set_graph_function
SyS_read

If we repeat the experiment, we see that now only SyS_read and the functions it calls are in the trace. Useful as it is, however, this method does not take into account that not all of the handling of the I/O happens in the context of the SyS_read function.
Let's revert to tracing all functions by simply clearing the file:

# echo > /sys/kernel/debug/tracing/set_graph_function

What we really want to do to improve the accuracy of our trace and enhance its readability is to exploit one more feature of the ftrace environment: trace markers (that is, lines written into the trace from user space, which help us keep the kernel trace in sync with what our application is doing, e.g. recording that we're about to start issuing I/O, or that we have just finished). Having a marker in the trace makes sense only if the marker is set immediately before the event we want to mark. This can be done by setting up a helper program, preferably written in a compiled language, which will allow us to get rid of a big amount of spurious events. The official documentation page for ftrace provides an example of such a helper program. Starting from that outline, I have made available on my gist a draft of a possible implementation for our test. Even though it's almost trivial, it's pretty long, and would certainly be boring and uncomfortable to read on a blog, so I'm just linking it here.
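Just to illustrate the interface the helper relies on (the gist linked above uses a compiled program precisely to keep the latency between the marker and the interesting events small), a marker is simply a string written to the trace_marker file; a rough shell sketch of the idea could look like this:

# echo "about to start the read test" > /sys/kernel/debug/tracing/trace_marker
# dd if=/dev/sda of=/dev/null bs=1K count=4
# echo "read test finished" > /sys/kernel/debug/tracing/trace_marker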
If we are sure we want to see only operations performed on behalf of a single process, we could also trace by PID, by writing into the set_ftrace_pid file the PID of the process we want to trace. However, this would still prevent us from seeing events performed in interrupt context as a consequence of actions from that process (e.g. I/O request completions). The example I linked above includes tracing by PID, and will therefore discard some events that might be relevant to the student. A sketch of a PID-filtering wrapper follows below.
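A simple way to do the PID filtering from a script, modeled on the wrapper shown in the ftrace documentation, is to write the script's own PID to the filter file and then exec the command to trace, so that the command inherits the same PID (the script name here is just a hypothetical one):

$ cat trace_by_pid.sh
#!/bin/sh
# filter tracing on our own PID, then replace ourselves with the traced command
echo $$ > /sys/kernel/debug/tracing/set_ftrace_pid
echo function_graph > /sys/kernel/debug/tracing/current_tracer
exec "$@"

# ./trace_by_pid.sh dd if=/dev/sda of=/dev/null bs=1K count=4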

Tracing the block layer

While studying the block layer or another block I/O-related kernel component, we could be interested only in tracing actions related to the handling of requests. ftrace also comes in handy for collecting traces based on events; let's see what events it supports, and whether we can do something with them.

# cat /sys/kernel/debug/tracing/available_events
<it's a huge list, so I'm snipping>
block:block_rq_remap
block:block_bio_remap
block:block_split
block:block_unplug
block:block_plug
block:block_sleeprq
block:block_getrq
block:block_bio_queue
block:block_bio_frontmerge
block:block_bio_backmerge
block:block_bio_complete
block:block_bio_bounce
block:block_rq_issue
block:block_rq_insert
block:block_rq_complete
block:block_rq_requeue
block:block_rq_abort
block:block_dirty_buffer
block:block_touch_buffer
<...>

ftrace is not the only option for block layer-related study and debugging. Actually, the Linux kernel block layer maintainer, Jens Axboe, has introduced a block layer I/O tracing tool, blktrace, that provides detailed traces of request-queue operations from kernel to user space. blktrace efficiently records traces to one or more buffers in a binary format, which is then parsed back into a human-readable format by the blkparse utility. blktrace also adds to the common events the log messages produced by block layer components, which greatly facilitates debugging of block I/O-related subsystems. blktrace is completely integrated with the debugfs interface provided by ftrace, and supports a fair amount of the options provided by the ftrace environment (such as trace filtering and dynamic tracepoints).
To enable the block tracer provided by the blktrace facility, it is sufficient to select the blk tracer as the current tracer. It is also necessary to enable tracing per block device, by writing to the file /sys/block/<device>/trace/enable, as shown in the example below. Note that, when the blktrace framework is compiled into the kernel, writing to the per-block-device enable file will start/stop a trace even if the current tracer is not blk.
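Putting it together, a minimal sequence to capture a blk trace of our usual 4KB read through the ftrace interface could look like this (sda is simply the device used in the earlier examples):

# echo blk > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /sys/block/sda/trace/enable
# dd if=/dev/sda of=/dev/null bs=1K count=4
# echo 0 > /sys/block/sda/trace/enable
# cat /sys/kernel/debug/tracing/trace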

Extending kernel oopses with function traces

Documenting and studying code is not the only field where a function tracer can be of great help: we can make good use of a complete trace even when debugging a kernel component. The main issue is that we don't know when an error will occur, or how critical it will be, and we might not be able to collect a trace of the failure at all.
ftrace can be configured to dump a trace to the terminal (or better, to a serial console) for us in case of an oops. This behavior, which is disabled by default, can be changed dynamically with the switch /proc/sys/kernel/ftrace_dump_on_oops. It can also be configured at boot time by adding the ftrace_dump_on_oops option to the kernel's command line; note that we can also choose which tracer to record with when booting with tracing enabled, by placing the ftrace=<tracer_name> option before the former command line parameter. In his article about debugging the kernel with ftrace, Steven Rostedt suggested shrinking the per-CPU buffer kept by ftrace in this case, as the buffer is dumped to the console as soon as an oops is produced. The same article also points out a nice feature of ftrace: it can be induced to print out its buffer with the SysRq-z combination, therefore allowing the developer to collect a trace of the running kernel even if it is stuck with interrupts disabled.
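A quick sketch of the knobs mentioned above (the buffer size value is just an example, not a recommendation; buffer_size_kb is the per-CPU ring buffer size in kilobytes):

# echo 1 > /proc/sys/kernel/ftrace_dump_on_oops
# echo 256 > /sys/kernel/debug/tracing/buffer_size_kb

At boot time, the equivalent would be appending something like ftrace=function_graph ftrace_dump_on_oops to the kernel command line.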

Sources
ftrace on eLinux: http://elinux.org/Ftrace