Monday, 19 November 2012

Intel Performance Monitoring: Loose Ends

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends (this post)

If you haven't already, you'll need to download the q 3.0 trial version for Linux from Kx Systems. Although it's the 32-bit version, it is fully functional apart from being time-limited to somewhere around an hour's use before you need to restart it. I was labouring under the misapprehension that it could be run on a 64-bit system, but since q is a dynamically linked application, you'll need to do something with chroot and 32-bit libraries if you want to try that.

You will also need root access to your Linux system, since the next step is to download the source code for the MSR kernel driver from GitHub and install it onto your system. The source code for this post is also hosted on GitHub and can be found here. Both directories have a Makefile and so should be easy enough to use. If you're doing performance monitoring on Linux you can probably work a Makefile.

The next part is "installing" the q instance. This is as simple as unzipping the linux.zip file in some vaguely sensible location and then setting the environment variable QHOME to the absolute path of the directory containing the l32 directory (which is very easy to find). After that you can either add the l32 directory to your path or invoke q with a relative prefix.

One thing which will make your life easier on Linux is using rlwrap to give you a sensible console experience - e.g. using the cursor keys to navigate the command line rather than emitting ASCII control codes to the screen, etc. I alias the q command as follows:

alias qq='rlwrap q'

And that's all there is to it.

Assuming you've built the source and installed the driver...

The next step is to set the LD_LIBRARY_PATH value to contain or equal the path of the directory containing the libpmc.so file. After that, you can start q by issuing the command

rlwrap q pmc.q

The script automatically loads the pmcdata.csv file and you can view its contents by simply typing the following at the q-prompt:

q).pmc.evt

You don't need to do that, of course, unless you want to see what it contains.

I've canned a few PMC configurations but you should definitely consider configuring your own. I wouldn't go so far as to say they're the most coherent set of PMC choices, but hey, mix and match and see something useful on each run. I'm talking about the functions .pmc.script1, .pmc.script2, .pmc.script3 and .pmc.script4. Each of these functions takes what I've called a domain argument, which is a symbol atom or vector with possible values `usr`os. These are flag values (in the sense that the IA32_PERFEVTSELx register has an individual bit-flag for each). By specifying one or the other or both, you determine whether the PMCs and FFCs count in rings 1, 2 & 3 (`usr), in ring 0 (`os) or in all of them. I have to say that there's not much point counting cycles etc. in the kernel (ring 0) with the trite testcode.c I uploaded to GitHub, since it only executes a user-land function (gettimeofday) and you get nonsense back when specifying just `os! Interesting, nonetheless.
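
To make the flag idea concrete, here is a minimal sketch in C of how a domain choice could map onto an IA32_PERFEVTSELx value. The bit positions (USR at bit 16, OS at bit 17, EN at bit 22) come from the Intel SDM; the macro names and the helper make_evtsel are hypothetical and are not taken from libpmc.c.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in IA32_PERFEVTSELx, per the Intel SDM. */
#define EVTSEL_USR (1ULL << 16) /* count while CPL > 0 (rings 1, 2 and 3) */
#define EVTSEL_OS  (1ULL << 17) /* count while CPL == 0 (ring 0) */
#define EVTSEL_EN  (1ULL << 22) /* enable the counter */

/* Hypothetical helper: build the value to write to an IA32_PERFEVTSELx MSR.
 * `domain` is EVTSEL_USR, EVTSEL_OS, or both OR-ed together. */
uint64_t make_evtsel(uint8_t event, uint8_t umask, uint64_t domain)
{
    /* Reject anything other than the two privilege-level flags. */
    assert((domain & ~(EVTSEL_USR | EVTSEL_OS)) == 0);

    /* Event select lives in bits 0-7 and the unit mask in bits 8-15. */
    return (uint64_t)event | ((uint64_t)umask << 8) | domain | EVTSEL_EN;
}
```

For example, the architectural "UnHalted Core Cycles" event (event 0x3C, umask 0x00) counted in user mode only would be make_evtsel(0x3C, 0x00, EVTSEL_USR), which gives 0x41003C. The FFCs are controlled by separate enable fields in IA32_FIXED_CTR_CTRL, which this sketch does not cover.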

There's also an enhancement to the libpmc.c code. Instead of relying on functions declared as extern and then packaged together by the linker, it uses the dynamic loader to look for the dynamic shared object (DSO) libtest.so. It loads this on each execution of test-run (and pre-links the symbols to avoid some ugly skew on the first run), which means that it's possible to leave kdb+ running, recompile your code with a minor change in it, and re-run the script you ran before. This way you can see immediately what effect your change has had on the execution characteristics. As my almost-three-year-old would say: "Cool, huh?" (note to self: stop saying things like that when he's around).
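
The reload-on-every-run idea can be sketched with dlopen and dlsym. This is not the actual libpmc.c code: the function name run_test_once, the symbol name you pass in, and the void(void) signature are all assumptions, and RTLD_NOW is just one plausible way of getting the pre-linking described above.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

typedef void (*test_fn)(void); /* assumed signature of the code under test */

/* Load the DSO afresh, run one symbol from it, and release it again.
 * Returns 0 on success, -1 if the library or symbol cannot be found. */
int run_test_once(const char *dso_path, const char *symbol)
{
    assert(dso_path != NULL && symbol != NULL);

    /* RTLD_NOW resolves every undefined symbol at load time, so the cost of
     * lazy binding is not charged to the first measured call. */
    void *handle = dlopen(dso_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    dlerror(); /* clear any stale error before calling dlsym */
    test_fn fn;
    *(void **)(&fn) = dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL || fn == NULL) {
        fprintf(stderr, "dlsym: %s\n", err != NULL ? err : "symbol is NULL");
        dlclose(handle);
        return -1;
    }

    fn(); /* a real harness would read the counters around this call */

    /* Releasing the handle normally lets the next dlopen pick up a rebuilt DSO. */
    dlclose(handle);
    return 0;
}
```

A call such as run_test_once("./libtest.so", "test_entry") would then pick up a freshly recompiled library on each run; test_entry is a made-up symbol name. On older glibc versions this needs linking with -ldl.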

Anyway, for the non-q programmers amongst you, you could execute a script in any of the following ways:

.pmc.script1[`os]
.pmc.script1`os
.pmc.script1[`os`usr]
.pmc.script1`os`usr

The first two lines are functionally equivalent, as are the final two.

I'll probably keep remembering things I need to add to these instructions, but one in particular is in the .pmc.runscript function. I have hard-coded (in the sense that anything in an ASCII text-file is hard-coded) the reference clock-speed of my CPU: 2.7 GHz. It uses the ratio of the actual to the reference clock cycles (from the FFCs), scaled by that reference speed, to show a rough number representing the current core speed. I've found it surprisingly accurate, and as mentioned in a previous post, the core spends most of its time at 800 MHz. The other results-column worth treating with some suspicion is the "nanos" column. This basically takes the number of reference clock ticks counted by the FFC (careful about whether it's counting in usr/os) and divides it by 2.7 to come up with a spurious "wall-time" nanosecond value. Little more than a trinket, to be honest, but it amused me at the time.
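
The arithmetic behind those two columns can be sketched in C as follows. The actual calculation lives in the q function .pmc.runscript, which isn't reproduced here, so the function names and the round-to-nearest step are assumptions; they do, however, reproduce the MHz and nanos values in the sample output at the end of this post.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed reference clock, matching the 2.7 GHz value hard-coded in the post. */
#define REF_CLOCK_MHZ 2700.0

/* Estimated core speed in MHz: the core/reference cycle ratio scaled by the
 * reference clock, rounded to the nearest integer. */
long est_core_mhz(uint64_t clk_core, uint64_t clk_ref)
{
    assert(clk_ref > 0);
    return (long)((double)clk_core * REF_CLOCK_MHZ / (double)clk_ref + 0.5);
}

/* Approximate elapsed nanoseconds: reference ticks divided by the number of
 * reference ticks per nanosecond (2.7), rounded to the nearest integer. */
long est_nanos(uint64_t clk_ref)
{
    return (long)((double)clk_ref * 1000.0 / REF_CLOCK_MHZ + 0.5);
}
```

With the first row of the sample output (clkCore 497, clkRef 1685), est_core_mhz gives 796 and est_nanos gives 624. On a different CPU you would change REF_CLOCK_MHZ to match its reference clock.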

It's definitely worth pondering, if you get the chance, whether the load-pressure exerted on your memory sub-system by the CPU will do something interesting if your CPU changes its speed-stepping to start running at full speed (or even in turbo mode). In more complex code with hefty workloads, after the first few iterations the results may show your CPU increasing its clock-speed and hence issuing more load (and store) requests to the QPI and LLC/IO Controller per time unit. Now, you could get to this point by disabling speed-stepping (probably your preferred option if you're trying to do something vaguely forensic), but it's very interesting to see how the profile of some code changes with the CPU-to-Bus ratio. The system which coped amazingly well with the load put on it by a quiesced CPU may suddenly show up as a bottleneck when the CPU's speed increases.

As a parting example, here's what I get when I execute .pmc.script1[`usr]:

instAny clkCore clkRef UopsAny UopsP015 UopsP234 L3Miss MHz nanos
-----------------------------------------------------------------
83      497     1685   131     68       47       0      796 624  
83      112     362    124     52       39       0      835 134  
83      74      227    108     45       36       0      880 84   
83      74      254    108     41       39       0      787 94   
83      73      254    112     47       34       0      776 94   
83      74      227    112     47       35       0      880 84   
83      78      281    112     48       36       0      749 104  
83      77      254    112     49       36       0      819 94   
83      79      254    112     51       36       0      840 94   
83      77      254    112     47       31       0      819 94   
83      79      281    112     51       36       0      759 104  
83      74      254    112     47       35       0      787 94   
83      79      281    112     48       37       0      759 104  
