This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends (this post)
If you haven't already, you'll need to download the q 3.0 trial version for Linux from Kx Systems. Although it's the 32-bit version, it is fully functional apart from being time-limited to somewhere around an hour's use before you need to restart it. I was labouring under the misapprehension that it could be run as-is on a 64-bit system, but since q is a dynamically-linked application, you'll need to do something with chroot and 32-bit libraries if you want to try that.
You will also need root access to your Linux system, since the next step is to download the source code from GitHub for the MSR kernel driver and install it onto your system. The source code for this post is also hosted on GitHub and can be found here. Both directories have a Makefile and so should be easy enough to use. If you're doing performance monitoring on Linux you can probably work a Makefile.
The next part is "installing" the q instance. This is as simple as unzipping the linux.zip file in some vaguely sensible location and then setting the environment variable QHOME to the absolute path of the directory containing the l32 directory (which is very easy to find). After that you can either add the l32 directory to your path or invoke q with a relative prefix.
One thing which will make your life easier on Linux is using rlwrap to give you a sensible console experience, e.g. using the cursor keys to navigate the command line rather than emitting ASCII control codes to the screen. I alias the q command as follows:
alias qq='rlwrap q'
And that's all there is to it.
Assuming you've built the source and installed the driver...
The next step is to set the LD_LIBRARY_PATH value to contain or equal the path of the directory containing the libpmc.so file. After that, you can start q by issuing the command
rlwrap q pmc.q
The script automatically loads the pmcdata.csv file, and you can view its contents by typing the following at the q prompt:
q).pmc.evt
Of course, you only need to do that if you want to see what it contains.
I've canned a couple of PMC configurations but you should definitely consider configuring your own. I wouldn't go so far as to say they're the most coherent set of PMC choices, but hey, mix and match and see something useful on each run. I'm talking about the functions .pmc.script1, .pmc.script2, .pmc.script3 and .pmc.script4. Each of these functions takes what I've called a domain argument, which is a symbol atom or vector with possible values `usr and `os. These are flag values (in the sense that the IA32_PERFEVTSELx register has individual bit-flags). By specifying one or the other or both, you determine whether the PMCs and FFCs count in ring 0, rings 1, 2 & 3 or all of them. I have to say that there's not much point counting cycles etc. in the kernel (ring 0) with the trite testcode.c I uploaded to GitHub, since it only executes a user-land function (gettimeofday) and you get nonsense back when just specifying `os! Interesting, nonetheless.
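To make the flag idea concrete, here is a minimal sketch in C of how a usr/os choice could be folded into the control values. The bit positions are the ones the Intel SDM documents for IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL; the function and macro names are made up for illustration and are not taken from libpmc.c.

```c
#include <stdint.h>

/* IA32_PERFEVTSELx flag bits as documented in the Intel SDM. */
#define EVTSEL_USR (1u << 16)  /* count in rings 1, 2 and 3 */
#define EVTSEL_OS  (1u << 17)  /* count in ring 0 */
#define EVTSEL_EN  (1u << 22)  /* enable the counter */

/* Hypothetical helper: build an IA32_PERFEVTSELx value for a
 * programmable counter from an event code, a unit mask and the
 * two domain flags. */
uint64_t evtsel_value(uint8_t event, uint8_t umask, int usr, int os)
{
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;
    if (usr)
        v |= EVTSEL_USR;
    if (os)
        v |= EVTSEL_OS;
    return v;
}

/* Hypothetical helper: build the 4-bit field for fixed-function
 * counter idx in IA32_FIXED_CTR_CTRL (bit 0 of the field enables
 * ring 0 counting, bit 1 enables counting in rings 1 to 3). */
uint64_t fixed_ctrl_field(unsigned idx, int usr, int os)
{
    uint64_t field = 0;
    if (os)
        field |= 0x1;
    if (usr)
        field |= 0x2;
    return field << (4 * idx);
}
```

With event 0x3C and unit mask 0x00 (unhalted core cycles), asking for `usr only gives 0x41003C and asking for both domains gives 0x43003C.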
There's an added enhancement to the libpmc.c code as well. Instead of relying on functions declared as extern and then packaged together by the linker, it uses the dynamic loader to look for the dynamic shared object (DSO) libtest.so. It loads this on each execution of test-run (and pre-links the symbols to avoid some ugly skew on the first run), which means that it's possible to leave kdb+ running, recompile your code with a minor change in it, and re-run the script you ran before. This way you can see immediately what effect your change has had on the execution characteristics. As my almost-three-year-old would say: "Cool, huh?" (note to self: stop saying things like that when he's around).
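For anyone who hasn't used the dynamic loader directly, the load-per-run idea looks roughly like the sketch below. This is not the code from libpmc.c: the function name run_test_once and the symbol name passed to it are hypothetical, and I'm assuming that "pre-links the symbols" corresponds to asking for immediate binding with RTLD_NOW.

```c
#include <dlfcn.h>
#include <stdio.h>

typedef void (*test_fn)(void);

/* Hypothetical helper: load the DSO, resolve one entry point, call it,
 * then unload. Returns 0 on success and -1 if loading or lookup fails. */
int run_test_once(const char *so_path, const char *symbol)
{
    /* RTLD_NOW resolves every symbol at load time, so no lazy-binding
     * cost lands inside the code being measured. */
    void *handle = dlopen(so_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    test_fn fn;
    /* POSIX idiom for turning the void* from dlsym into a function pointer. */
    *(void **)(&fn) = dlsym(handle, symbol);
    if (fn == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    fn();  /* in a real run the counter reads would bracket this call */

    /* Unloading means the next call picks up a freshly compiled file. */
    dlclose(handle);
    return 0;
}
```

A name without a slash, such as "libtest.so", is looked up along LD_LIBRARY_PATH, which is why that variable needs to be set before starting q. On older glibc versions the program has to be linked with -ldl.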
Anyway, for the non-q programmers amongst you, you could execute a script in any of the following ways:
.pmc.script1[`os]
.pmc.script1`os
.pmc.script1[`os`usr]
.pmc.script1`os`usr
The first two lines are functionally equivalent, as are the final two.
I'll probably keep remembering things I need to add to these instructions, but one in particular is in the .pmc.runscript function. I have hard-coded (in the sense that anything in an ASCII text-file is hard-coded) the reference clock-speed of my CPU: 2.7 GHz. It uses the ratio of the actual to the reference clock cycles (from the FFCs) to show a rough number representing the current core speed. I've found it surprisingly accurate, and as mentioned in a previous post, the core spends most of its time at 800 MHz. The other results column worth treating with some suspicion is the "nanos" column. This basically takes the number of reference clock ticks counted by the FFC (careful about whether it's counting in usr/os) and divides it by 2.7 to come up with a spurious "wall-time" nanosecond value. Little more than a trinket, to be honest, but it amused me at the time.
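The arithmetic behind those two columns is simple enough to sketch in C. The function names here are made up for illustration, and the 2.7 GHz reference speed is the value hard-coded for my CPU, so substitute your own.

```c
#include <stdint.h>

/* Estimated core speed in MHz: the fraction of reference cycles that
 * were matched by actual core cycles, scaled by the reference clock.
 * ref_ghz is an assumption about the machine (2.7 in this post). */
double core_mhz(uint64_t clk_core, uint64_t clk_ref, double ref_ghz)
{
    if (clk_ref == 0)
        return 0.0;  /* nothing counted, avoid dividing by zero */
    return (double)clk_core / (double)clk_ref * ref_ghz * 1000.0;
}

/* Rough "wall-time" in nanoseconds: reference ticks divided by the
 * number of reference ticks per nanosecond. */
double ref_nanos(uint64_t clk_ref, double ref_ghz)
{
    return (double)clk_ref / ref_ghz;
}
```

Feeding in clkCore = 497 and clkRef = 1685 gives roughly 796.4 MHz and 624.1 ns, which round to the 796 and 624 shown in the first row of the .pmc.script1[`usr] output.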
It's definitely worth pondering, if you get the chance, whether the load-pressure exerted on your memory sub-system by the CPU will do something interesting if your CPU changes its speed-stepping to start running at full speed (or even in turbo mode). In more complex code with hefty workloads, after the first few iterations the results may show your CPU increasing its clock-speed and hence issuing more load (and store) requests to the QPI and LLC/IO Controller per time unit. Now, you could get to this point by disabling speed-stepping (probably your preferred option if you're trying to do something vaguely forensic), but it's very interesting to see how the profile of some code changes with the CPU-to-Bus ratio. The system which coped amazingly well with the load put on it by a quiesced CPU may suddenly show up as a bottleneck when its speed increases.
As a parting example, here's what I get when I execute .pmc.script1[`usr]:
instAny clkCore clkRef UopsAny UopsP015 UopsP234 L3Miss MHz nanos
-----------------------------------------------------------------
83      497     1685   131     68       47       0      796 624
83      112     362    124     52       39       0      835 134
83      74      227    108     45       36       0      880 84
83      74      254    108     41       39       0      787 94
83      73      254    112     47       34       0      776 94
83      74      227    112     47       35       0      880 84
83      78      281    112     48       36       0      749 104
83      77      254    112     49       36       0      819 94
83      79      254    112     51       36       0      840 94
83      77      254    112     47       31       0      819 94
83      79      281    112     51       36       0      759 104
83      74      254    112     47       35       0      787 94
83      79      281    112     48       37       0      759 104