Showing posts with label xen. Show all posts
Showing posts with label xen. Show all posts

Tuesday, August 26, 2014

OPW: The Xen Project Developer Summit

So, about halfway through my OPW internship, I was informed that my wonderful mentor, Konrad Wilk, and Xen Project Community Manager Lars Kurth thought to allow me to attend the Xen Project Developer Summit that was to be held in Chicago on the 18th and 19th of August. I actually went there, had fun and learned a lot: now it's time to write a blog post about it!

The Xen Project Developer Summit

The keynote session featured talks from Xen community and development coordinators, which aimed to provide an insight of the main improvements and of the new features introduced in the Xen Project. Lars Kurth described the challenges of coordinating such a large and growing project and the efforts made towards better cooperation. George Dunlap, release coordinator for Xen 4.3 and 4.4, and Konrad Rzeszutek Wilk, that took over the task for the upcoming 4.5, detailed the main improvements made to the Xen hypervisor during the last development cycles. They also explained and commented on the rationale behind the feature selection work performed by release coordinators and some of the toughest decisions that are required by their duty.



Linux x86 co-maintainer David Vrabel provided a status update on the newest features added to Linux domain support and set up a wishlist for future features to be developed. During the keynote session I had the chance to meet a former Xen Project intern, Elena Ufimtseva, who worked on the implementation of vNUMA in Xen during the 2013 OPW round, mentored by Citrix Senior Engineer Dario Faggioli. She also presented her work in 2013 at the Xen Developer Summit. I could gather details and her opinion about her experience with OPW and in finding a job after the end of the internship, and I could discuss with her some of the challenges experienced by an intern and the best ways to exploit those difficult aspects by turning them into stronger development skills.

The following talks provided in-depth details on the main research aspects of Xen development. Some of them covered the performance of Xen with respect to networking, such as Jun Nakajima's talk on the main bottlenecks found while experimenting with Xen as a Network Virtualization Functions platform component and the solutions that were implemented by Intel. Other talks focused on storage, as Felipe Franciosi's insight on memory grant technologies available in Xen that can contribute to optimising aggregate workloads of several GB/s per guest (he actually allowed me to take part in the BoF session that followed his talk, therefore giving me the chance to hear further opinions and learn even more on the storage performance achieved by Xen guests). Still on the same trend, Filipe Manco presented NEC Europe's work towards tracking down performance limitations and bottlenecks that increase startup latencies of Xen guests, when they are run in bursts of thousands; he also proposed a prototype reimplementation of some Xen components to prove his points. Anil Madhavapeddy showed the benefits of the new Irminsule distributed transactional filesystem, that allows to handle storage accesses in a version control system fashion, letting unikernels running in isolated stubdomains, such as MirageOS, use a common and consistent API. More talks covered security aspects of virtualization, as Mihai Dontu's presentation, that proposed a zero-footprint implementation of memory introspection for Xen domUs that can allow a supervisor domain to perform run-time detection of malware on Xen-based guests; James Bielman described Galois' implementation of Mandatory Access Control for the Xenstore, showing how it can be managed by a centralized security server as it does not benefit from the XSM security policy. James Fehlig's talk, instead, covered the important topic of virtualization management tools, providing an overview of libvirt, a status update on the libxenlight driver and a roadmap proposal. Moving on to the topic of architecture and hardware support, Daniel Kiper approached the subject of EFI, outlining how Xen efficiently uses its infrastructure and what can be improved in the support provided by the hypervisor. Wei Liu instead described the status of vNUMA support in Xen, giving an in-depth report of its implementation and of its importance with relevant statistics.

The main session opened with a detailed overview of the Verizon Cloud architecture provided by Don Slutz, which described what features are used and the optimization it provides to both Xen and QEMU. It also featured a report on the Linux kernel delta that SUSE supports for Xen and a proposal on how to address it, delivered by Luis Rodriguez. Following another trend were some Xen-on-ARM-related talks, as the presentation by Stefano Stabellini, that provided an insight on the current state of the project and how it performs on the newest ARMv8 64-bit platforms, and the one by Julien Grall, which detailed the process of porting an OS as a Xen-on-ARM guest. Jonathan Daugherty also described, in his talk, his experience in porting FreeRTOS to Xen on a Cortex A15-based platform. More talks were performance-related, as Zoltan Kiss' presentation on network improvements made in XenServer and Feng Wu's on Intel's work on introducing interrupt posting with its virtualization technology. John Else explained his work about efficient inter-domain communication of performance data, his findings about the XenStore being the bottleneck in the current technique and proposed a lock-free, efficient solution to the issue. Talks also included the relevant topic of testing for a software ecosystem as complex as the Xen one: Ian Jackson presented Xen's automatic testing facility, osstest, outlining its last development steps and the wider set of configurations it now supports. Some of the talks were related, instead, to safety aspects of using Xen in an environment with real-time constraints. Nathan Studer and Robert VanVossen presented DornerWorks' efforts on certifying Xen for automotive, medical and avionics, the challenges behind the task, a proposed roadmap to overcome the most tricky aspects and the current state of the project. Sisu Xi described the Washington University's work on RT-Xen with the aim of combining real-time and virtualization. Willing to give, instead, a more detailed insight on unikernels, Adam Wick outlined their features and described the general rules that establish whether a unikernel is the right choice for a software infrastructure component. Glauber Costa introduced the topic of LibraryOSs, highlighting their benefits in terms of performance, lightness and scalability, describing which applications they support and how can prove to be useful to the Xen community. Philip Tricca explained the drawbacks of the static configuration used to isolate system components in OpenXT, a collection of hardened Linux-on-Xen virtual machines providing a user platform for client devices, and a new toolkit to enhance the platform's flexibility.

During the main session I met some of my fellow OPW interns. I had the chance to talk to the brilliant Mindy Preston, who worked on MirageOS's network stack fixing bugs and implementing missing RFCs, about her experience and exchange opinions about ARM-based boards. I had the chance to take part in the final OPW/GSoC-related panel with her; it also featured the very professional GSoC intern Jyotsna Prakash, who worked on cloud API support for MirageOS by implementing cloud API bindings for OCaml, along with some of our mentors and Lars Kurth as a host. The panel gave us interns the chance to provide feedback to our mentors and to the program's organization and to express our opinion about what we learned from it. It also covered very important aspects of participating in a large open-source project within a heterogeneous and just as big community: George Dunlap thoroughly explained the lights and shadows of Linus Torvalds' approach to commenting bad code, while Konrad Wilk delivered a thoughtful insight about how cultural differences can influence the interaction between developers during software review.

What Did I Learn

Being able to attend the conference was a highly educational experience. It allowed me to get a better idea of how the community is organized, to get involved even more and hear about the experience of other attendees. I also could benefit from my mentor's advice on how to interact with other developers. Having to speak in front of an audience also has always been one of the aspects of working on a project that I feared the most; the chance to take part in the panel and my mentor's very useful advice make a huge addition to my experience and will allow me to fully exploit the opportunity to share my findings and my enthusiasm with others on future occasions.

As a final note, I'd like to thank my very patient mentor, Konrad Wilk, for allowing me to take part in OPW (even if I applied to him as a candidate on the very last day before the deadline) and for his invaluable guidance during the program; I'd like to thank also the GNOME Foundation and Xen Community Manager Lars Kurth for granting me the opportunity to attend the conference, and Elena Ufimtseva for giving me the benefit of her own experience. Last, but not least, I'd like to thank my always so helpful advisor, Paolo Valente, and Citrix Senior Engineer Dario Faggioli for introducing me to the internship program.

Links

This report has been re-posted by The Xen Project on their blog!
Xen Developer Summit - Chicago, 2014 - Schedule
Slides used for many of the talks (Xen Project's official SlideShare)
See videos on The Xen Project's official YouTube channel

Friday, August 8, 2014

OPW, Linux: Profiling a Xen Guest

Here is my take at describing how have I been using some of the most popular tools when it comes to profiling. My OPW mentor suggested me to make use of these tools while looking into locking issues I had to deal with in the context of my summer project. I'll also try to include some hints that my mentor provided and should allow to gather more accurate data.
If you have any criticism or suggestion and the time to write about it, please do.


OPW, Linux: Profiling a Xen Guest

The first tool I had fun with is perf, a tool which abstracts away hardware differences of the CPUs supported by Linux and allows to use performance counters for profiling purposes. Performance counters are CPU registers used to count events that are relevant to profiling: instructions executed, cache events, branch prediction performance; they are instrumental to profiling applications as they allow to trace their execution and identify hot spots in the executed code.

The perf tool works on top of those hardware tools and uses them as building blocks to keep also per-task, per-CPU and per-workload counters, also enabling for source code instrumentation, letting the developer gather accurate annotated traces. Tracing is performed with the kprobes (with ftrace) and uprobes frameworks, which both allow for dynamic tracing, therefore highly reducing the overhead unavoidably added by tracepoints. If you are interested in dynamic tracing you can either read the manual or a previous blog post on ftrace.

From a more practical point of view, the fact that perf relies on ftrace also means that it must be compiled on top of your currently-executing kernel; in fact, it lives in a directory of the kernel source code tree, and can be build by exploiting the kernel make subsystem by issuing:

cd tools/perf
make

If you want to install the tool, just enter a make install after that. Otherwise, you'll need to use the perf executable which has been created in the tools/perf directory of your development kernel tree.

Then, we're done! Let's find hot spots in the code I'm writing. Let's record a run of the IOmeter emulator built on top of the fio flexible I/O tester and performed from a paravirtualized Xen guest on a physical block device emulated with the null_blk driver. The test script basically starts fio with the IOmeter jobfile. Note that the jobfile specifies libaio as the I/O engine to use for the test; just keep that in mind.

$ cat iometer_test.sh
#!/bin/bash
fio iometer-file-access-server

To make things simple, let's trace all available perf events.

perf record ./iometer_test.sh

Once the command has been executed, we can ask perf to list the recorded tracepoints, ordering them by placing first the ones with the higher overhead. Since the output is very long, I'm making it available in my Dropbox. I'll just paste here the first lines.

# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 271K of event 'cpu-clock'
# Event count (approx.): 67840250000
#
# Overhead    Command       Shared Object                               Symbol
# ........  .........  ..................  ...................................
#
    24.04%        fio  [kernel.kallsyms]   [k] xen_hypercall_xen_version      
     4.82%        fio  libaio.so.1.0.1     [.] io_submit                      
     3.98%        fio  libaio.so.1.0.1     [.] 0x0000000000000614             
     3.68%        fio  fio                 [.] get_io_u                       
     3.40%        fio  [kernel.kallsyms]   [k] __blockdev_direct_IO           
     2.78%        fio  fio                 [.] axmap_isset                    
     1.93%        fio  [kernel.kallsyms]   [k] lookup_ioctx                   
     1.66%        fio  [kernel.kallsyms]   [k] kmem_cache_alloc               
     1.64%        fio  [kernel.kallsyms]   [k] do_io_submit                   
     1.18%        fio  fio                 [.] _start                         
     1.16%        fio  [kernel.kallsyms]   [k] __blk_mq_run_hw_queue          
     1.08%        fio  [kernel.kallsyms]   [k] blk_throtl_bio                 
     0.91%        fio  [kernel.kallsyms]   [k] blk_mq_make_request            
     0.90%        fio  fio                 [.] fio_gettime                    
     0.90%        fio  fio                 [.] td_io_queue                    
     0.76%        fio  [kernel.kallsyms]   [k] generic_make_request_checks    
     0.72%        fio  [kernel.kallsyms]   [k] pvclock_clocksource_read       
     0.72%        fio  [kernel.kallsyms]   [k] __rcu_read_lock                
     0.71%        fio  [kernel.kallsyms]   [k] aio_read_events                
     0.69%        fio  fio                 [.] add_slat_sample                
     0.67%        fio  fio                 [.] io_u_queued_complete           
     0.66%        fio  [kernel.kallsyms]   [k] generic_file_aio_read          
     0.63%        fio  [kernel.kallsyms]   [k] __fget                         
     0.60%        fio  [kernel.kallsyms]   [k] bio_alloc_bioset               
     0.60%        fio  fio                 [.] add_clat_sample                
     0.59%        fio  [kernel.kallsyms]   [k] blkdev_aio_read                
     0.56%        fio  [kernel.kallsyms]   [k] percpu_ida_alloc               
     0.55%        fio  [kernel.kallsyms]   [k] copy_user_generic_string       
     0.54%        fio  [kernel.kallsyms]   [k] sys_io_getevents               
     0.54%        fio  [kernel.kallsyms]   [k] blkif_queue_rq

First of all, let's look at the fourth column, the one with one character enclosed in square brackets. That columns tells us whether the listed symbol is a userspace symbol (we have a dot in square brackets) or a kernelspace symbol (we see a k character in square brackets).
When it comes to userspace symbols, we can disambiguate whether we're looking at library functions or at user application symbols by glancing at the third column, which shows either the name of the library or the application. The second column, which might mislead us, shows instead the context of execution of the symbol. In this case, most of the execution happened in the context of the fio process, which is the one performing I/O during our tests.
The last column, the one on the right, shows us the name of the symbol, if perf managed to retrieve it. The first column, the one on the left, shows instead the percentage of time in which the list symbol was executing, given that 100% is the amount of time that covered the whole execution of a test.

So, perf did not record just events in the kernel, but also in the library used by fio to perform I/O, which is the libaio one we've seen before. Interesting fact, some of the library calls have a relatively high overhead: for example, the io_submit() function of libaio has been executing for almost the 5% of the test. This doesn't surprise us very much, as we kind of expected to find some hot libaio symbols by just reading Jens Axboe's paper which shows and explains profiling performed on such a test.

The most relevant hotspot, however, is in the kernel, and seems to be the xen_version hypercall, which has been executing for almost a fourth of the test. This seemed to me some kind of mistake until Konrad Wilk told me that the hypercall was probably being executed while grabbing a lock; in fact, the xen_version hypercall is used as a hypervisor callback: it is called to see if an event is pending, and if there is an interrupt is executed on return. That same hypercall, for example, is called whenever the functions spin_lock_irqsave() and spin_unlock_irqrestore() are called. This seems to identify the issue as a locking issue, so...

Debugging contention issues
... to get deeper into the investigation of such an issue, I chose to have a look at lockstat, the lock debugging framework provided by the Linux kernel. Essentially, lockstat provides statistics about the usage of locks, therefore being of great help to idenfity lock contention issues that can heavily impact performance.

First step is enabling it in the kernel's configuration. This can be done by setting CONFIG_LOCK_STAT=y or by browsing the menuconfig to the "Kernel hacking > Lock debugging (spinlocks, mutexes, etc...) > Lock usage statistics" option. As soon as the kernel has been re-compiled and re-installed with the framework enabled, we can enable lock statistics collection with:

$ echo 1 | sudo tee /proc/sys/kernel/lock_stat

From this moment, lock statistics are available in the special procfs file /proc/lock_stat, which we can glance at by issuing:

$ less /proc/lock_stat

An example of the output of that command collected on the frontend guest during the execution of the fio IOmeter test can be found, again, on my Dropbox. It's very very long and would not be much comfortable to read in here, so I'm just commenting it in this blog post.
First, we can see that the output is split in different sections, bordered with dots. Each of the sections contains information about a lock, whose name is the first thing listed there. For example, the first section is about the (&rinfo->io_lock)->rlock. Basically, it's the lock protecting access to the shared data structure used to exchange block I/O requests and responses between the guest domain and the driver domain following Xen's PV protocol. After that, we find a whole set of numbers, whose meaning is explained in the header above. We might be interested in extrapolating the number of contentions (the contentions field), the average time spent in waiting for the lock to be released (waittime-avg), the number of times that the lock has been acquired (acquisitions) and the average time the lock has been held (holdtime-avg). For example, we see that the io_lock has suffered 47500 contentions, has been acquired 1348488 times, and a thread willing to acquire it has been waiting for 25.4 microseconds.
After that batch of information, we find a listing of call traces; this indicated which threads of execution brought to acquiring that lock. Always with reference to our io_lock, we see that the two contenting traces concern the service of an interrupt, probably for a request completion, when the driver queues a response in the shared data structure (blkif_interrupt() is the key here) and the request insertion path, where the driver is queueing a request in the shared data structure (blk_queue_bio(), which is called back as soon as a completion kicks the driver for pending request insertions).

So, this is it. Seems like we've found a possible suspect. Let's make sure there aren't other relevant contention points by just glancing at the stats by ordering them by number of contentions. We can do that very easily if we know that the output is already ordered by number of contentions. Let's just then grep the headers of the sections.

$ grep : /proc/lock_stat

We can see from the output produced by this command that the second-most contented lock (rq->lock) has suffered "just" 1579 contentions during the test, which is an order of magnitude lower than the io_lock.

Tips for more accurate statistics
Just as a side note, I want to list here some tips that my mentor, Konrad Wilk, has given me on collecting more accurate data from profiling tools while dealing with Xen guests, in the hope that they will be useful to others as they have been to me.
Boot the PV guest with vPMU enabled. vPMU is a hardware support to performance monitoring, based on the performance monitoring unit, specifically prepared to help collecting performance data of virtualized operating systems. Exploiting vPMU allows to offload a huge overhead from the hypervisor, therefore allowing for more accurate data. Using it is fairly straight-forward, and involves having the vpmu=1 option in the guest's configuration file.
Use a driver domain. This removes having the dom0 as a bottleneck when using more than one guest, as the backend is located in the driver domain. This also allows to keep roles separated, and to have a specific-purpose domain just for the handling of block I/O.
Use a HVM domain as driver domain. This removes the overhead of PV MMU-related hypercalls and should allow for cleaner data.


Sources
Brendan Gregg, "Perf examples"

Tuesday, April 22, 2014

OPW, Xen: some initial info

Turns out I've been accepted for OPW 2014!
I'm so thrilled to be working with the Xen Project to domain support for Linux.
I'll be trying to document my three-months journey in this blog as part of the program, but writing is really not my thing; also, what I will write is (pretty obviously) just what I think I have understood from my readings and investigations. So, please bear with me and do not hesitate to comment about any mistake you may find in these posts, if you want to.

I'll be working on adding some support for the multi-queue API, recently merged in the Linux kernel, to the block I/O paravirtualized Xen driver for Linux. In the following post, I'll try to provide some initial information about the such an idea and its motivations; the project was proposed by Konrad Rzeszutek Wilk, to whom I have been assigned as intern. A more in-depth and articulate description can be found in his project outline or in the documents linked from the outline itself, but I'm so happy that I'm looking forward even to writing about it, so here is my take.

Multi-queue support for the block I/O paravirtualized Xen drivers

The kernel component charged with the task of managing block I/O issued by applications is, in Linux, referred to as the block layer. Its task includes passing on block I/O requests to the device driver, which actually interacts with the block device. Until v3.12, the Linux kernel's block layer offered to block device drivers two interfaces, one of which is more commonly used and is called the request interface. When exposing this interface, the block layer internally stores I/O requests, queueing them in a buffering area; while in this area, requests can be re-ordered (e.g. to reduce penalties due to disk head movements), merged (to reduce per-request-processing overheads) or, more in general, subjected to service policies implemented in the block I/O scheduling sub-component of the block layer. Buffering has also the function to limit submission rate to prevent any hardware buffer from over-running. From a more practical point of view, when in request-based mode, the block layer keeps a simple request queue for each device; it adds new requests to the queue according to its internal policies; the request queue is drained by the device driver from its head.

Such an interface was designed when a high-end drive could handle hundreds of I/O operations per second ([1]); in a recent paper ([2]), Linux block layer maintainer Jens Axboe pointed out the three causes that make it a bottleneck when used with a high-performance SSD that can handle hundreds of thousands of IOPS.
  1. The request queue is a major source of contention. In fact, it is protected by a single spinlock, the queue_lock, which is grabbed each time an I/O request is issued, attempted to be merged, enqueued in the I/O scheduler's private data structures, dispatched to the driver and completed.
  2. Hardware interrupts and request completions reduce usability of caches. A high number of I/O operations per second cause a high number of interrupts, despite the presence of interrupt mitigation ([3]). If one CPU core is in charge of handling all hardware interrupts and forwarding them to other cores as software interrupts, as it happens on most modern storage devices ([2]), a single core may spend a considerable amount of time handling interrupts, switching context, and therefore polluting caches. Also, other cores must undergo an inter-processor interrupt to execute the I/O completion routine.
  3. Cross-CPU memory accesses exacerbate queue_lock contention. Remote memory accesses are needed every time the completion routine for an I/O request is executed on a CPU core which is different from the one on which the request was last handled. In fact, on completion it is necessary to grab the queue_lock, as the request must be removed from the request queue. As a consequence, the cache line with the last state of the lock (stored in the cache of the core where the lock was last acquired) must be explicitly invalidated.
Efforts applied in order to remove such a bottleneck were focused on the key points of reducing lock contention and remote memory accesses. Even before v3.13, the Linux kernel already provided a way to short-out the request queue (with the make request interface), but at the cost of deferring any pre-processing and buffering performed on I/O requests by the block layer.

The multi-queue API introduced in the Linux kernel attempts to address this issue by providing to device drivers a third interface to the block layer. The request queue is split in a pool of separate per-CPU or per-NUMA-socket queues. The first mode, in particular, allows to each CPU to queue requests in its private request queue in a lockless fashion, thus reducing the previously-critical lock contention issues; while in these software queues (or submission queues), a request can be subjected to I/O scheduling policies. The device driver, instead, accesses a set of hardware queues (or dispatch queues), whose number matches the actual number of queues supported by the device driver; those act as a buffer to adjust I/O submission rate between the software queues and the actual dispatch to the device driver.

The Xen hypervisor provides paravirtualized drivers for block I/O to avoid IDE emulation (such as, e.g. the one provided by QEMU), which is slow. The PV block I/O driver follows Xen's split device driver model, which includes a backend driver (kept by a driver domain, usually dom0, which hosts hardware-specific drivers) and a frontend driver (used by one of the unprivileged domains, which does not have to deal with hardware-specific or interface-specific details). The two halves of a PV driver share some memory pages, where they keep one or more consensual rings (might usually be only one ring or a pair of tx/rx rings); those rings are accessed by both partecipants through a producer/consumer protocol which uses an event channel for the needed notifications (e.g. presence of a new request or response in the ring). This is called "PV protocol" ([4]), and is used by all paravirtualized drivers in Xen.

The block PV protocol uses only one ring, which is shared between the driver domain and the guest domain. To avoid the allocation of a large amount of possibly-unused shared memory, initially the ring has the minimum indispensable size to queue requests and responses, but not the data itself. As soon as the guest domain starts issuing requests, the frontend driver requests a certain number of memory areas (grants) to hold the data. If the request is a write request, the memory areas are filled with the data to be written on behalf of the guest domain and necessary permissions are added for the driver domain. A grant reference is added to the request, and the request is queued in the ring. The driver domain, at this point, is notified that a new request from the guest domain is pending service; when the request is finally served, the driver domain parses the grant reference and uses it to map the related memory areas to its own address space. As soon as the request has been fulfulled, the grants are unmapped from the driver domain's address space, a response is written on the shared ring and the guest domain is notified about the completion; while handling such a notification, the guest domain removes any access permission on the grants for the driver domain.

With respect to the original protocol, the current implementation has had the benefit of many improvements, such as increasing the maximum number of in-flight I/O with indirect descriptors ([5]), reducing lock contention on the grant lock of a domain and reducing the number of memory [un]mapping operations needed to pass the I/O on to dom0 ([6]).
However, as currently implemented in Linux, the block PV driver still makes use of a single block thread for each ring to handle I/O transmission.

Porting the multi-queue API to the driver should allow it to allocate per-vCPU block threads, thus allowing to each vCPU lockless access to a private ring. The performance benefits should include, as in the Linux kernel ([2]), increasing throughput and reducing service latency of I/O requests.