Sunday, 11 November 2012

Fun With MSRs: Counting Performance Events On Intel

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel (this post)
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends

Hi, the last two posts have laid some groundwork for this post, in which I hope to show how you can measure various performance-related events using Intel's MSRs. This post assumes you have at least installed the MSR kernel module discussed in this earlier post. All we're going to do this time is record two MSR configuration scripts to memory and execute some arbitrary code to measure some performance metrics. One script will configure the MSRs and reset the counter values to zero, while the other will read the accumulated values after the test code has executed.

The code we're going to profile is going to be enormously simple: we're going to print something using the printf function. At least this should be very easy to alter to test how different argument types require more processing than others.

Here's the code:

Linux kernel module msrdrv.c

#include "msrdrv.h"
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>

static int loadDriver()
{
    int fd;
    fd = open("/dev/" DEV_NAME, O_RDWR);
    if (fd == -1) {
        perror("Failed to open /dev/" DEV_NAME);
    }
    return fd;
}

static void closeDriver(int fd)
{
    int e;
    e = close(fd);
    if (e == -1) {
        perror("Failed to close fd");
    }
}

/*
 * Reference:
 * Intel Software Developer's Manual Vol 3B "253669.pdf" August 2012
 * Intel Software Developer's Manual Vol 3C "326019.pdf" August 2012
 */
int main(void)
{
    int fd;
    struct MsrInOut msr_start[] = {
        { MSR_WRITE, 0x38f, 0x00, 0x00 },       // ia32_perf_global_ctrl: disable 4 PMCs & 3 FFCs
        { MSR_WRITE, 0xc1, 0x00, 0x00 },        // ia32_pmc0: zero value (35-5)
        { MSR_WRITE, 0xc2, 0x00, 0x00 },        // ia32_pmc1: zero value (35-5)
        { MSR_WRITE, 0xc3, 0x00, 0x00 },        // ia32_pmc2: zero value (35-5)
        { MSR_WRITE, 0xc4, 0x00, 0x00 },        // ia32_pmc3: zero value (35-5)
        { MSR_WRITE, 0x309, 0x00, 0x00 },       // ia32_fixed_ctr0: zero value (35-17)
        { MSR_WRITE, 0x30a, 0x00, 0x00 },       // ia32_fixed_ctr1: zero value (35-17)
        { MSR_WRITE, 0x30b, 0x00, 0x00 },       // ia32_fixed_ctr2: zero value (35-17)
        { MSR_WRITE, 0x186, 0x004101c2, 0x00 }, // ia32_perfevtsel1, UOPS_RETIRED.ALL (19-28)
        { MSR_WRITE, 0x187, 0x0041010e, 0x00 }, // ia32_perfevtsel0, UOPS_ISSUED.ANY (19.22)
        { MSR_WRITE, 0x188, 0x01c1010e, 0x00 }, // ia32_perfevtsel2, UOPS_ISSUED.ANY-stalls (19-22)
        { MSR_WRITE, 0x189, 0x004101a2, 0x00 }, // ia32_perfevtsel3, RESOURCE_STALLS.ANY (19-27)
        { MSR_WRITE, 0x38d, 0x222, 0x00 },      // ia32_perf_fixed_ctr_ctrl: ensure 3 FFCs enabled
        { MSR_WRITE, 0x38f, 0x0f, 0x07 },       // ia32_perf_global_ctrl: enable 4 PMCs & 3 FFCs
        { MSR_STOP, 0x00, 0x00 }
    };

    struct MsrInOut msr_stop[] = {
        { MSR_WRITE, 0x38f, 0x00, 0x00 },       // ia32_perf_global_ctrl: disable 4 PMCs & 3 FFCs
        { MSR_WRITE, 0x38d, 0x00, 0x00 },       // ia32_perf_fixed_ctr_ctrl: clean up FFC ctrls
        { MSR_READ, 0xc1, 0x00 },               // ia32_pmc0: read value (35-5)
        { MSR_READ, 0xc2, 0x00 },               // ia32_pmc1: read value (35-5)
        { MSR_READ, 0xc3, 0x00 },               // ia32_pmc2: read value (35-5)
        { MSR_READ, 0xc4, 0x00 },               // ia32_pmc3: read value (35-5)
        { MSR_READ, 0x309, 0x00 },              // ia32_fixed_ctr0: read value (35-17)
        { MSR_READ, 0x30a, 0x00 },              // ia32_fixed_ctr1: read value (35-17)
        { MSR_READ, 0x30b, 0x00 },              // ia32_fixed_ctr2: read value (35-17)
        { MSR_STOP, 0x00, 0x00 }
    };


    fd = loadDriver();
    ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_start);
    printf("This is a hex number 0x%x\n", -1);
    ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_stop);
    printf("uops retired:    %7lld\n", msr_stop[2].value);
    printf("uops issued:     %7lld\n", msr_stop[3].value);
    printf("stalled cycles:  %7lld\n", msr_stop[4].value);
    printf("resource stalls: %7lld\n", msr_stop[5].value);
    printf("instr retired:   %7lld\n", msr_stop[6].value);
    printf("core cycles:     %7lld\n", msr_stop[7].value);
    printf("ref cycles:      %7lld\n", msr_stop[8].value);
    closeDriver(fd);
    return 0;
}

The code (rather than the content of the MsrInOut arrays) is straighforward. It opens a the file /dev/msrdrv, the char-driver handle set up in the install.sh script by the mknod command. The code uses the ioctl command to communicate through this file-handle with the driver. If you looked through the driver source-code (see the previous posts, link above), you would see that the final parameter contains the address of the MSR controller script we want to execute. To compile the above code, I used: gcc -g -I. -O2 -o msrtest msrtest.c

Once the msr_start script has executed, the performance counters will be ticking. The program calls the printf function and the promptly invokes the ioclt function with the address of the msr_stop script. Since the stop script also contains the result-reading commands, the program then simply prints out the values now stored in various elements of the msr_stop array.

Some discussion of the contents of the msr_start and msr_stop is now necessary. I've tried to comment the code well enough to make the scripts intlligible, but here goes anyway:

The msr_start script

  1. MSR_WRITE, 0x38f: this is the global control switch for both the fixed-function and general-purpose counters. This instruction clears the "enable" bits so no further counting will take place.
  2. MSR_WRITE, 0xc1: also 0xc2, 0xc3, 0xc4. This zeros the values in the general-purpose counters.
  3. MSR_WRITE, 0x309: also 0x30a, 0x30b. This zeros the values in the fixed-function counters.
  4. MSR_WRITE, 0x186: configures the events counted into PMC0. The value 0x004101c2 enables the counter, sets it to count when executing in user-mode, and selects the event UOPS_RETIRED.ALL (event 0x0e, umask 0x01). This counts the µ-ops which were actually executed (some will not be due to branch mis-prediction). See the IA32_PERFEVTSELx image below for help in deciphering what the bit-fields mean.
  5. MSR_WRITE, 0x187: configures the events counted into PMC1: UOPS_ISSUED.ANY. This counts the µ-ops issued by the instruction decoder.
  6. MSR_WRITE, 0x188: configures the events counted into PMC2: UOPS_ISSUED.ANY — stall cycles. You should read the Intel docs about how setting cmask = 1, inv = 1 (and possibly any = 1) inverts the counter so that it counts cycles when no µ-ops are issued.
  7. MSR_WRITE, 0x189: configures the events counted into PMC3: RESOURCE_STALLS.ANY. The manual states that this counts "Cycles Allocation is stalled due to Resource Related reason [sic]"... I think I know what it means, but wouldn't put money on it.
  8. MSR_WRITE, 0x38d: configures the fixed-function counters to be enabled when the logical core is executing user-land code. See Vol. 3B, pp. 18-8, 18-5.
  9. MSR_WRITE, 0x38f: flips the bits in the global MSR control register to enable all four PMCs (0x0f) and all three FFCs (0x07). Once these values have been written by the WRMSR command, the counters will start accumulating values.
  10. MSR_STOP: exits the loop within the MSR driver.


IA32_PERFEVTSELx - version 3

The msr_stop script

  1. MSR_WRITE, 0x38f: stop the counters using the global control. Issue this first to stop the counters as soon as possible.
  2. MSR_WRITE, 0x38d: some housekeeping. This clears the enabled-flags in the fixed-function configuration MSR.
  3. MSR_READ, 0xc1: also 0xc2, 0xc3, 0xc4. These read the values from each of the four PMCs into the MsrInOut array.
  4. MSR_READ, 0x309: also 0x30a, 0x30b. These read the values from each of the three FFCs into the MsrInOut array.
  5. MSR_STOP: exits the loop within the MSR driver.

It might be quite instructive to view the contents of the msr_start and msr_stop arrays to see how the MsrInOut structs pack into memory. While the kernel module has a compile-time debug flag, it's still useful to be able to see what it is that's being sent to the driver:

(gdb) x/60xw &msr_start
0x7fffffffdf20: 0x00000002 0x0000038f 0x00000000 0x00000000
0x7fffffffdf30: 0x00000002 0x000000c1 0x00000000 0x00000000
0x7fffffffdf40: 0x00000002 0x000000c2 0x00000000 0x00000000
0x7fffffffdf50: 0x00000002 0x000000c3 0x00000000 0x00000000
0x7fffffffdf60: 0x00000002 0x000000c4 0x00000000 0x00000000
0x7fffffffdf70: 0x00000002 0x00000309 0x00000000 0x00000000
0x7fffffffdf80: 0x00000002 0x0000030a 0x00000000 0x00000000
0x7fffffffdf90: 0x00000002 0x0000030b 0x00000000 0x00000000
0x7fffffffdfa0: 0x00000002 0x00000186 0x004101c2 0x00000000
0x7fffffffdfb0: 0x00000002 0x00000187 0x0041010e 0x00000000
0x7fffffffdfc0: 0x00000002 0x00000188 0x01c1010e 0x00000000
0x7fffffffdfd0: 0x00000002 0x00000189 0x004101a2 0x00000000
0x7fffffffdfe0: 0x00000002 0x0000038d 0x00000222 0x00000000
0x7fffffffdff0: 0x00000002 0x0000038f 0x0000000f 0x00000007
0x7fffffffe000: 0x00000003 0x00000000 0x00000000 0x00000000

And the stop script before it is "executed":

(gdb) x/40xw &msr_stop
0x7fffffffe010: 0x00000002 0x0000038f 0x00000000 0x00000000
0x7fffffffe020: 0x00000002 0x0000038d 0x00000000 0x00000000
0x7fffffffe030: 0x00000001 0x000000c1 0x00000000 0x00000000
0x7fffffffe040: 0x00000001 0x000000c2 0x00000000 0x00000000
0x7fffffffe050: 0x00000001 0x000000c3 0x00000000 0x00000000
0x7fffffffe060: 0x00000001 0x000000c4 0x00000000 0x00000000
0x7fffffffe070: 0x00000001 0x00000309 0x00000000 0x00000000
0x7fffffffe080: 0x00000001 0x0000030a 0x00000000 0x00000000
0x7fffffffe090: 0x00000001 0x0000030b 0x00000000 0x00000000
0x7fffffffe0a0: 0x00000003 0x00000000 0x00000000 0x00000000

And after it has been "executed" (in which I've mixed the output of "x/40xw &msr_stop" and "x/20xg &msr_stop" to make it clearer):

0x7fffffffe010: 0x00000002 0x0000038f 0x0000000000000000
0x7fffffffe020: 0x00000002 0x0000038d 0x0000000000000000
0x7fffffffe030: 0x00000001 0x000000c1 0x0000000000002866
0x7fffffffe040: 0x00000001 0x000000c2 0x0000000000002bfa
0x7fffffffe050: 0x00000001 0x000000c3 0x0000000000017ca7
0x7fffffffe060: 0x00000001 0x000000c4 0x00000000000003ea
0x7fffffffe070: 0x00000001 0x00000309 0x0000000000000dd3
0x7fffffffe080: 0x00000001 0x0000030a 0x00000000000079c4
0x7fffffffe090: 0x00000001 0x0000030b 0x0000000000019b1e
0x7fffffffe0a0: 0x00000003 0x00000000 0x0000000000000000

The Results

These are the results printed out on my system, for one iteration through the code, with no accounting for the fixed costs of starting and stopping the performance-counters. It wasn't run under gdb so the results won't tally with the values printed above. However, it suffices for the purposes of showing how you can count with the PMCs: it's never been about measuring the performance of the printf command!

$ ./msrtest
This is a hex number 0xffffffff
uops retired:      10043
uops issued:       10956
stalled cycles:    83345
resource stalls:     371
instr retired:      3538
core cycles:       25225
ref cycles:        85131

As alluded to above, there are many potential sources of error here: that only one iteration was performed, that I didn't establish some baseline, fixed-cost values for invoking the ioctl function, that I haven't disabled hyper-threading, that I didn't bind my code to a core, that I didn't disable speed-stepping or turbo boost. All this is up to you and I'll show how you do some of them in a future post. I hope this has at least given you a head-start in understanding performance-counting events!

No comments:

Post a Comment