Friday, August 8, 2014

OPW, Linux: Profiling a Xen Guest

Here is my take on describing how I have been using some of the most popular profiling tools. My OPW mentor suggested that I use these tools while looking into the locking issues I had to deal with in my summer project. I'll also try to include some hints from my mentor that should help gather more accurate data.
If you have any criticism or suggestion and the time to write about it, please do.

The first tool I had fun with is perf, a tool that abstracts away the hardware differences of the CPUs supported by Linux and lets you use performance counters for profiling. Performance counters are CPU registers that count events relevant to profiling, such as instructions executed, cache events and branch prediction performance; they are instrumental in profiling applications, as they make it possible to trace execution and identify hot spots in the executed code.

The perf tool works on top of those hardware counters and uses them as building blocks to also keep per-task, per-CPU and per-workload counters. It also supports source code instrumentation, letting the developer gather accurate annotated traces. Tracing is performed with the kprobes (with ftrace) and uprobes frameworks, both of which allow for dynamic tracing, greatly reducing the overhead that tracepoints unavoidably add. If you are interested in dynamic tracing, you can read either the manual or a previous blog post on ftrace.

From a more practical point of view, the fact that perf relies on ftrace also means that it must be built against the kernel you are currently running; in fact, it lives in a directory of the kernel source tree and can be built with the kernel's make infrastructure by issuing:

cd tools/perf
make

If you want to install the tool, just run make install after that. Otherwise, you'll need to use the perf executable that has been created in the tools/perf directory of your development kernel tree.

Then, we're done! Let's find hot spots in the code I'm writing. Let's record a run of the IOmeter emulator built on top of the fio flexible I/O tester and performed from a paravirtualized Xen guest on a physical block device emulated with the null_blk driver. The test script basically starts fio with the IOmeter jobfile. Note that the jobfile specifies libaio as the I/O engine to use for the test; just keep that in mind.

$ cat
fio iometer-file-access-server

To keep things simple, let's run perf record with its default settings; as the report header below shows, this samples the cpu-clock event.

perf record ./

Once the command has been executed, we can ask perf to list the recorded symbols, with the ones with the highest overhead first. Since the output is very long, I'm making it available in my Dropbox. I'll just paste the first lines here.

# To display the header info, please use --header/--header-only options.
# Samples: 271K of event 'cpu-clock'
# Event count (approx.): 67840250000
# Overhead    Command       Shared Object                               Symbol
# ........  .........  ..................  ...................................
    24.04%        fio  [kernel.kallsyms]   [k] xen_hypercall_xen_version      
     4.82%        fio     [.] io_submit                      
     3.98%        fio     [.] 0x0000000000000614             
     3.68%        fio  fio                 [.] get_io_u                       
     3.40%        fio  [kernel.kallsyms]   [k] __blockdev_direct_IO           
     2.78%        fio  fio                 [.] axmap_isset                    
     1.93%        fio  [kernel.kallsyms]   [k] lookup_ioctx                   
     1.66%        fio  [kernel.kallsyms]   [k] kmem_cache_alloc               
     1.64%        fio  [kernel.kallsyms]   [k] do_io_submit                   
     1.18%        fio  fio                 [.] _start                         
     1.16%        fio  [kernel.kallsyms]   [k] __blk_mq_run_hw_queue          
     1.08%        fio  [kernel.kallsyms]   [k] blk_throtl_bio                 
     0.91%        fio  [kernel.kallsyms]   [k] blk_mq_make_request            
     0.90%        fio  fio                 [.] fio_gettime                    
     0.90%        fio  fio                 [.] td_io_queue                    
     0.76%        fio  [kernel.kallsyms]   [k] generic_make_request_checks    
     0.72%        fio  [kernel.kallsyms]   [k] pvclock_clocksource_read       
     0.72%        fio  [kernel.kallsyms]   [k] __rcu_read_lock                
     0.71%        fio  [kernel.kallsyms]   [k] aio_read_events                
     0.69%        fio  fio                 [.] add_slat_sample                
     0.67%        fio  fio                 [.] io_u_queued_complete           
     0.66%        fio  [kernel.kallsyms]   [k] generic_file_aio_read          
     0.63%        fio  [kernel.kallsyms]   [k] __fget                         
     0.60%        fio  [kernel.kallsyms]   [k] bio_alloc_bioset               
     0.60%        fio  fio                 [.] add_clat_sample                
     0.59%        fio  [kernel.kallsyms]   [k] blkdev_aio_read                
     0.56%        fio  [kernel.kallsyms]   [k] percpu_ida_alloc               
     0.55%        fio  [kernel.kallsyms]   [k] copy_user_generic_string       
     0.54%        fio  [kernel.kallsyms]   [k] sys_io_getevents               
     0.54%        fio  [kernel.kallsyms]   [k] blkif_queue_rq

First of all, let's look at the fourth column, the one with a single character enclosed in square brackets. That column tells us whether the listed symbol is a userspace symbol (a dot in square brackets) or a kernelspace symbol (a k character in square brackets).
When it comes to userspace symbols, we can tell whether we're looking at library functions or at symbols of the application itself by glancing at the third column, which shows the name of either the library or the application. The second column, which might mislead us, shows instead the context in which the symbol was executed. In this case, most of the execution happened in the context of the fio process, which is the one performing I/O during our tests.
The last column, the one on the right, shows the name of the symbol, if perf managed to retrieve it. The first column, the one on the left, shows the percentage of time during which the listed symbol was executing, where 100% is the duration of the whole test run.
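The column layout described above can be used to get a quick kernel versus userspace split. What follows is only a minimal sketch: it assumes the plain-text report format pasted above, and the function name perf_split_by_space is made up for this example.

```shell
# Sum the Overhead column of a plain-text perf report, split by the
# [k] (kernel) and [.] (userspace) markers. Reads the report on stdin.
# perf_split_by_space is a hypothetical helper name, not part of perf.
perf_split_by_space() {
    awk '
        /^#/ || NF == 0 { next }              # skip header comments and blank lines
        {
            pct = $1; sub(/%$/, "", pct)      # "24.04%" -> "24.04"
            # The marker is searched for rather than taken from a fixed
            # position, since the shared object column may be empty.
            for (i = 2; i <= NF; i++) {
                if ($i == "[k]") { kernel += pct; break }
                if ($i == "[.]") { user += pct; break }
            }
        }
        END { printf "kernel %.2f%%\nuser %.2f%%\n", kernel, user }
    '
}
```

It can be fed with something like perf report --stdio | perf_split_by_space, run in the directory where the recording was made.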

So, perf did not just record events in the kernel, but also in the library used by fio to perform I/O, which is the libaio one we've seen before. Interestingly, some of the library calls have a relatively high overhead: for example, the io_submit() function of libaio was executing for almost 5% of the test. This doesn't surprise us very much, as we more or less expected to find some hot libaio symbols after reading Jens Axboe's paper, which shows and explains profiling performed on such a test.

The most relevant hotspot, however, is in the kernel, and seems to be the xen_version hypercall, which was executing for almost a fourth of the test. This looked like some kind of mistake to me until Konrad Wilk told me that the hypercall was probably being executed while grabbing a lock; in fact, the xen_version hypercall is used as a hypervisor callback: it is called to see if an event is pending and, if there is one, the interrupt is executed on return. That same hypercall, for example, is issued whenever the functions spin_lock_irqsave() and spin_unlock_irqrestore() are called. This seems to identify the issue as a locking issue, so...

Debugging contention issues
... to dig deeper into such an issue, I chose to have a look at lockstat, the lock debugging framework provided by the Linux kernel. Essentially, lockstat provides statistics about the usage of locks, which is of great help in identifying lock contention issues that can heavily impact performance.

The first step is enabling it in the kernel's configuration. This can be done by setting CONFIG_LOCK_STAT=y or by browsing menuconfig to the "Kernel hacking > Lock debugging (spinlocks, mutexes, etc...) > Lock usage statistics" option. As soon as the kernel has been re-compiled and re-installed with the framework enabled, we can enable lock statistics collection with:

$ echo 1 | sudo tee /proc/sys/kernel/lock_stat

From this moment, lock statistics are available in the special procfs file /proc/lock_stat, which we can glance at by issuing:

$ less /proc/lock_stat
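To make sure the statistics cover only the test run, one option is to clear the counters right before the test and stop collecting right after it. The following is just a sketch under a few assumptions: it relies on writing 0 to /proc/lock_stat to clear the statistics, as described in the kernel's lockstat documentation, it must be run as root, and the lockstat_run name and the LOCK_STAT_CTL and LOCK_STAT_FILE overrides are made up for this example.

```shell
# Run a command with lock statistics cleared beforehand and collection
# stopped afterwards, then save a snapshot of the statistics.
# lockstat_run, LOCK_STAT_CTL and LOCK_STAT_FILE are hypothetical names;
# the two variables only exist to point the sketch at other files.
lockstat_run() {
    ctl=${LOCK_STAT_CTL:-/proc/sys/kernel/lock_stat}
    stats=${LOCK_STAT_FILE:-/proc/lock_stat}
    snapshot=$1
    shift

    echo 0 > "$stats"            # clear previously collected statistics
    echo 1 > "$ctl"              # start collecting
    "$@"                         # run the workload
    status=$?
    echo 0 > "$ctl"              # stop collecting
    cat "$stats" > "$snapshot"   # keep a copy to read later
    return "$status"
}
```

With this in place, the fio run could be wrapped as lockstat_run lock_stat.txt fio iometer-file-access-server, and the snapshot read with less afterwards.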

An example of the output of that command, collected on the frontend guest during the execution of the fio IOmeter test, can be found, again, on my Dropbox. It's very long and would not be comfortable to read here, so I'm just commenting on it in this blog post.
First, we can see that the output is split into different sections, bordered with dots. Each section contains information about a lock, whose name is the first thing listed there. For example, the first section is about the (&rinfo->io_lock)->rlock. Basically, it's the lock protecting access to the shared data structure used to exchange block I/O requests and responses between the guest domain and the driver domain following Xen's PV protocol. After that, we find a whole set of numbers, whose meaning is explained in the header above. We might be interested in extracting the number of contentions (the contentions field), the average time spent waiting for the lock to be released (waittime-avg), the number of times the lock has been acquired (acquisitions) and the average time the lock has been held (holdtime-avg). For example, we see that the io_lock has suffered 47500 contentions, has been acquired 1348488 times, and that a thread trying to acquire it has waited 25.4 microseconds on average.
After that batch of information, we find a listing of call traces; these indicate which paths of execution led to acquiring that lock. Still with reference to our io_lock, we see that the two contending traces concern the servicing of an interrupt, probably for a request completion, when the driver queues a response in the shared data structure (blkif_interrupt() is the key here), and the request insertion path, where the driver is queueing a request in the shared data structure (blk_queue_bio(), which is called back as soon as a completion kicks the driver for pending request insertions).

So, this is it. Seems like we've found a possible suspect. Let's make sure there aren't other relevant contention points by glancing at the locks ordered by number of contentions. That is very easy to do, since the output is already sorted by number of contentions. Let's then just grep the headers of the sections.

$ grep : /proc/lock_stat

We can see from the output produced by this command that the second-most contended lock (rq->lock) has suffered "just" 1579 contentions during the test, which is more than an order of magnitude fewer than the io_lock.
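The section headers printed by grep can also be reduced to a contention ratio per lock, which makes the comparison between locks easier. This is a rough sketch only: it assumes the "lock_stat version 0.4" column layout (the class name followed by twelve numeric columns, including waittime-avg and holdtime-avg) and class names without spaces, and the lock_contention_ratio name is made up for this example.

```shell
# Print "name contentions acquisitions ratio%" for every lock class header
# found in lock_stat output read from stdin.
# Assumption: 12 numeric columns follow the class name, so the fields are
# counted from the end of the line (contentions is the 11th from the end,
# acquisitions the 5th from the end).
lock_contention_ratio() {
    awk '
        $1 ~ /:$/ && NF >= 13 {               # header lines end the name with ":"
            name = $1; sub(/:$/, "", name)
            contentions  = $(NF - 10)
            acquisitions = $(NF - 4)
            ratio = (acquisitions > 0) ? 100 * contentions / acquisitions : 0
            printf "%s %d %d %.2f%%\n", name, contentions, acquisitions, ratio
        }
    '
}
```

It reads the statistics from standard input, for example lock_contention_ratio < /proc/lock_stat. With the numbers quoted above for the io_lock, 47500 contentions over 1348488 acquisitions means that roughly 3.5% of the acquisitions were contended.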

Tips for more accurate statistics
Just as a side note, I want to list here some tips that my mentor, Konrad Wilk, has given me on collecting more accurate data from profiling tools while dealing with Xen guests, in the hope that they will be as useful to others as they have been to me.
Boot the PV guest with vPMU enabled. vPMU is support for hardware performance monitoring, based on the CPU's performance monitoring unit, specifically designed to help collect performance data from virtualized operating systems. Exploiting vPMU offloads a huge overhead from the hypervisor, which makes the collected data more accurate. Using it is fairly straightforward and involves having the vpmu=1 option in the guest's configuration file.
Use a driver domain. This keeps dom0 from becoming a bottleneck when using more than one guest, as the backend is located in the driver domain. It also keeps roles separated, with a special-purpose domain just for handling block I/O.
Use an HVM domain as driver domain. This removes the overhead of PV MMU-related hypercalls and should allow for cleaner data.

Brendan Gregg, "Perf examples"
