This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends (this post)
If you haven't already, you'll need to download the q 3.0 trial version for Linux from Kx Systems. Although it's the 32-bit version, it is fully functional, apart from being time-limited to around an hour's use before you need to restart it. I was labouring under the misapprehension that it could be run on a 64-bit system, but since q is a dynamically-linked application, you'll need to do something with chroot and 32-bit libraries if you want to try that.
You will also need root access to your Linux system, since the next step is to download the source code for the MSR kernel driver from GitHub and install it onto your system. The source code for this post is also hosted on GitHub and can be found here. Both directories have a
Makefile and so should be easy enough to use. If you're doing performance monitoring on Linux you can probably work a
The next part is "installing" the q instance. This is as simple as unzipping the linux.zip file in some vaguely sensible location and then setting the environment variable QHOME to the absolute path of the directory containing the l32 directory (which is very easy to find). After that you can either add the l32 directory to your path or invoke q with a relative prefix.
One thing which will make your life easier on Linux is using rlwrap to give you a sensible console experience, e.g. using the cursor keys to navigate the command line rather than emitting ASCII control codes to the screen. I alias the q command as follows:
    alias qq='rlwrap q'
And that's all there is to it.
Assuming you've built the source and installed the driver...
The next step would be to set the LD_LIBRARY_PATH value to contain or equal the directory path containing the libpmc.so file. After that, you can start q by issuing the command
    rlwrap q pmc.q
The script automatically loads the pmcdata.csv file and you can view its contents by simply typing the following at the q-prompt:
    q).pmc.evt
Of course, you only need to do that if you want to see what it contains.
I've canned a couple of PMC configurations but you should definitely consider configuring your own. I wouldn't go so far as to say they're the most coherent set of PMC choices, but hey, mix and match and see something useful on each run. I'm talking about the functions .pmc.script4. Each of these functions takes what I've called a domain argument, which is a symbol atom or vector with possible values `usr`os. These are flag values (in the sense that the IA32_PERFEVTSELx register has individual bit-flags). By specifying one or the other or both, you determine whether the PMCs and FFCs count in ring 0 (`os), rings 1, 2 & 3 (`usr) or all of them. I have to say that there's not much point counting cycles etc. in the kernel (ring 0) with the trite testcode.c I uploaded to GitHub, since it only executes a user-land function (gettimeofday) and you get nonsense back when specifying just `os! Interesting, nonetheless.
There's also an enhancement to the libpmc.c code. Instead of relying on functions declared as extern and then packaged together by the linker, it uses the dynamic loader to look for the dynamic shared object (DSO) libtest.so. It loads this on each execution of a test-run (and pre-links the symbols to avoid some ugly skew on the first run), which means that it's possible to leave kdb+ running, recompile your code with a minor change in it, and re-run the script you ran before. This way you can see immediately what effect your change has had on the execution characteristics. As my almost-three-year-old would say: "Cool, huh?" (note to self: stop saying things like that when he's around).
Anyway, for the non-q programmers amongst you, you could execute a script in the following ways:
    .pmc.script1[`os]
    .pmc.script1`os
    .pmc.script1[`os`usr]
    .pmc.script1`os`usr
The first two lines are functionally equivalent, as are the final two.
I'll probably keep remembering things I need to add to these instructions, but one in particular is in the .pmc.runscript function. I have hard-coded (in the sense that anything in an ASCII text-file is hard-coded) the reference clock-speed of my CPU: 2.7 GHz. It uses the ratio between the values for the actual and reference clock cycles (from the FFCs) to show a rough number representing the current core speed. I've found it surprisingly accurate, and as mentioned in a previous post, the core spends most of its time at 800 MHz. The other results-column worth treating with some suspicion is the "nanos" column. This basically takes the number of reference clock ticks counted by the FFC (careful about whether it's counting in os) and divides it by 2.7 to come up with a spurious "wall-time" nanosecond value. Little more than a trinket, to be honest, but it amused me at the time.
It's definitely worth pondering, if you get the chance, whether the load-pressure exerted on your memory sub-system by the CPU will do something interesting if your CPU changes its speed-stepping to start running at full speed (or even in turbo mode). In more complex code with hefty workloads, after the first few iterations the results may show your CPU increasing its clock-speed and hence issuing more load (and store) requests to the QPI and LLC/IO Controller per time unit. Now, you could get to this point by disabling speed-stepping (probably your preferred option if you're trying to do something vaguely forensic), but it's very interesting to see how the profile of some code changes with the CPU-to-Bus ratio. The system which coped amazingly well with the load put on it by a quiesced CPU may suddenly show up as a bottleneck when its speed increases.
As a parting example, here's what I get when I execute
instAny clkCore clkRef UopsAny UopsP015 UopsP234 L3Miss MHz nanos
-----------------------------------------------------------------
83      497     1685   131     68       47       0      796 624
83      112     362    124     52       39       0      835 134
83      74      227    108     45       36       0      880 84
83      74      254    108     41       39       0      787 94
83      73      254    112     47       34       0      776 94
83      74      227    112     47       35       0      880 84
83      78      281    112     48       36       0      749 104
83      77      254    112     49       36       0      819 94
83      79      254    112     51       36       0      840 94
83      77      254    112     47       31       0      819 94
83      79      281    112     51       36       0      759 104
83      74      254    112     47       35       0      787 94
83      79      281    112     48       37       0      759 104