Friday, 9 November 2012

A Linux Kernel Module For Reading/Writing MSRs

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs (this post)
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends

It's been a while since the last post, mostly because I've been trying to get my head around the way the Intel performance monitoring instructions work. Rolling your own test-harness to measure how many clock-ticks, µops or L1 cache misses have taken place in a given stretch of code is quite involved, but don't let that put you off: it's pretty cool once you've got it all working. Of course, you don't have to roll your own, but it is in the best British traditions of pottering around in the garden shed, taking things to bits just to see how they work. Your alternatives are to download a trial copy of Intel's VTune or to use Agner Fog's free and reasonably easy-to-use "testp" library.

If you do want to do it yourself, you need to be able to write to the CPU's MSRs. The Intel docs make it clear that it's only possible to use the WRMSR instruction if you are executing with ring 0 privileges - or in other words, the instruction is executed by the kernel. This was a fairly daunting step, but it turns out that writing your own Linux driver is really quite straightforward, made all the more so by the Linux Device Drivers book. At this point I should add that reviewing how Agner Fog solved this problem was very instructive. I've adopted a similar approach in the code I discuss below, which for must-fit/must-match reasons is very similar to the driver code in his testp library.

The MSR Kernel Driver

Since I plan on writing a series of shorter posts rather than one giant monolith (cf. the epic "Relocations, Relocations" post which devoured my life for a while) I won't go into the whys or wherefores of the Linux driver code. What follows below is about as short and sweet as it can be made in order to get the job done.

Since each call to the driver involves a system call and hence a context switch (which will be included in the timing figures you record), it makes sense to send multiple read/write commands at once. A not unreasonable approach is to pass the driver an array of commands defining various writes to or reads from a number of different MSRs. Each element in the array is processed by the driver until it reaches some sort of "stop" instruction. The MsrInOut struct is defined in msrdrv.h and is shared between client and kernel code:

Shared header msrdrv.h

#ifndef _MG_MSRDRV_H
#define _MG_MSRDRV_H

#include <linux/ioctl.h>
#include <linux/types.h>

#define DEV_NAME "msrdrv"
#define DEV_MAJOR 223
#define DEV_MINOR 0

#define MSR_VEC_LIMIT 32

#define IOCTL_MSR_CMDS _IO(DEV_MAJOR, 1)

enum MsrOperation {
    MSR_NOP   = 0,
    MSR_READ  = 1,
    MSR_WRITE = 2,
    MSR_STOP  = 3,
    MSR_RDTSC = 4
};

struct MsrInOut {
    unsigned int op;              // MsrOperation
    unsigned int ecx;             // msr identifier
    union {
        struct {
            unsigned int eax;     // low double word
            unsigned int edx;     // high double word
        };
        unsigned long long value; // quad word
    };
}; // no packed attribute: gcc warns that it is unnecessary for 'MsrInOut' [-Wpacked]

#endif
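
To make the command-array idea concrete, here is a minimal user-space sketch of how a client might fill one in. The enum and struct are copied from msrdrv.h only so that the sketch compiles on its own; build_example_cmds and count_until_stop are hypothetical helper names, not part of the driver, and MSR 0x10 (IA32_TIME_STAMP_COUNTER) is just a convenient register to read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MSR_VEC_LIMIT 32

enum MsrOperation {
    MSR_NOP = 0, MSR_READ = 1, MSR_WRITE = 2, MSR_STOP = 3, MSR_RDTSC = 4
};

/* User-space copy of the struct from msrdrv.h, repeated here only so
 * that the sketch is self-contained. */
struct MsrInOut {
    unsigned int op;
    unsigned int ecx;
    union {
        struct {
            unsigned int eax;
            unsigned int edx;
        };
        unsigned long long value;
    };
};

/* Hypothetical helper: fill cmds with a TSC read, a read of MSR 0x10
 * and the terminating MSR_STOP. Returns the number of entries used,
 * including the stop marker. */
static size_t build_example_cmds(struct MsrInOut *cmds, size_t capacity)
{
    size_t n = 0;
    assert(capacity >= 3 && capacity <= MSR_VEC_LIMIT);
    memset(cmds, 0, capacity * sizeof *cmds);
    cmds[n++].op = MSR_RDTSC;
    cmds[n].op = MSR_READ;
    cmds[n++].ecx = 0x10;
    cmds[n++].op = MSR_STOP;
    return n;
}

/* Hypothetical helper: walk the array as the driver's loop does,
 * stopping at MSR_STOP or at the vector limit. Returns the number of
 * commands that precede the stop marker. */
static size_t count_until_stop(const struct MsrInOut *cmds)
{
    size_t i;
    for (i = 0; i < MSR_VEC_LIMIT; i++) {
        if (cmds[i].op == MSR_STOP)
            break;
    }
    return i;
}
```

A client would then hand the array to the driver with ioctl(fd, IOCTL_MSR_CMDS, cmds) on a descriptor opened on /dev/msrdrv, and pick the results of the reads out of each entry's value field afterwards.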

The msrdrv.c below is the entirety of the driver code. Much of it is boilerplate and will look very similar to any basic kernel driver. The interesting part runs from read_msr down to the end of msrdrv_ioctl, the function which processes the MsrInOut command array. Of all the different MsrOperation variants, the more interesting are MSR_READ and MSR_WRITE.

MSR_READ

When an MsrInOut command structure is seen with an op value of MSR_READ, the read_msr function is invoked. This executes the RDMSR instruction, using the ecx parameter as its argument by storing its value in register ECX. The result of the RDMSR operation is returned in registers EDX:EAX (high half in EDX, low half in EAX), and the C code combines these two 32-bit values into a single 64-bit value, which it then stores in the current MsrInOut struct's value field. That is a write to memory which the client code can read once the ioctl call returns.
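
The combining step is easy to try out in user space. A minimal sketch, where combine_halves is a hypothetical helper name (not a function in the driver) and the input values are arbitrary:

```c
#include <stdint.h>

/* Hypothetical helper: rebuild a 64-bit MSR value from the two 32-bit
 * halves that RDMSR leaves in EDX (high) and EAX (low). */
static uint64_t combine_halves(uint32_t eax, uint32_t edx)
{
    /* Widen edx to 64 bits before shifting; shifting a 32-bit value
     * by 32 would not move it into the upper half. */
    return ((uint64_t)edx << 32) | eax;
}
```

For instance, combine_halves(0x89abcdef, 0x01234567) yields 0x0123456789abcdef.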

MSR_WRITE

The write_msr function handles MSR_WRITE commands. Note that in this case the values in the eax and edx parameters are stored in registers EAX and EDX respectively before the WRMSR instruction is executed. No values are read or returned.
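
Going the other way, a client which holds a 64-bit value needs it as two 32-bit halves before a write. The union in MsrInOut does this implicitly on x86; the sketch below spells it out with shifts instead. split_value is a hypothetical helper name, not part of the driver.

```c
#include <stdint.h>

/* Hypothetical helper: split a 64-bit value into the EAX (low) and
 * EDX (high) halves that WRMSR expects. */
static void split_value(uint64_t value, uint32_t *eax, uint32_t *edx)
{
    *eax = (uint32_t)(value & 0xffffffffu); /* low double word  */
    *edx = (uint32_t)(value >> 32);         /* high double word */
}
```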

msrdrv.c

#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>

#include "msrdrv.h"

//#define _MG_DEBUG
#ifdef _MG_DEBUG
#define dprintk(args...) printk(args);
#else
#define dprintk(args...)
#endif

MODULE_LICENSE("Dual BSD/GPL");


static int msrdrv_open(struct inode* i, struct file* f)
{
    return 0;
}

static int msrdrv_release(struct inode* i, struct file* f)
{
    return 0;
}

static ssize_t msrdrv_read(struct file *f, char *b, size_t c, loff_t *o)
{
    return 0;
}

static ssize_t msrdrv_write(struct file *f, const char *b, size_t c, loff_t *o)
{
    return 0;
}

static long msrdrv_ioctl(struct file *f, unsigned int ioctl_num, unsigned long ioctl_param);

dev_t msrdrv_dev;
struct cdev *msrdrv_cdev;

struct file_operations msrdrv_fops = {
    .owner =          THIS_MODULE,
    .read =           msrdrv_read,
    .write =          msrdrv_write,
    .open =           msrdrv_open,
    .release =        msrdrv_release,
    .unlocked_ioctl = msrdrv_ioctl,
    .compat_ioctl =   NULL,
};

static unsigned long long read_msr(unsigned int ecx) {
    unsigned int edx = 0, eax = 0;
    unsigned long long result = 0;
    // RDMSR: MSR index in ECX, result returned in EDX:EAX
    __asm__ __volatile__("rdmsr" : "=a"(eax), "=d"(edx) : "c"(ecx));
    result = eax | ((unsigned long long)edx << 0x20);
    dprintk(KERN_ALERT "Module msrdrv: Read 0x%016llx (0x%08x:0x%08x) from MSR 0x%08x\n", result, edx, eax, ecx)
    return result;
}

static void write_msr(unsigned int ecx, unsigned int eax, unsigned int edx) {
    dprintk(KERN_ALERT "Module msrdrv: Writing 0x%08x:0x%08x to MSR 0x%04x\n", edx, eax, ecx)
    __asm__ __volatile__("wrmsr" : : "c"(ecx), "a"(eax), "d"(edx));
}

static unsigned long long read_tsc(void)
{
    unsigned int eax, edx;
    unsigned long long result;
    // RDTSC: time-stamp counter returned in EDX:EAX
    __asm__ __volatile__("rdtsc" : "=a"(eax), "=d"(edx));
    result = eax | ((unsigned long long)edx << 0x20);
    dprintk(KERN_ALERT "Module msrdrv: Read 0x%016llx (0x%08x:0x%08x) from TSC\n", result, edx, eax)
    return result;
}

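static long msrdrv_ioctl(struct file *f, unsigned int ioctl_num, unsigned long ioctl_param)
{
    struct MsrInOut *msrops;
    int i;
    if (ioctl_num != IOCTL_MSR_CMDS) {
            return -ENOTTY; // not a command this driver understands
    }
    // NOTE: the user-space pointer is dereferenced directly. On CPUs and
    // kernels with SMAP enabled this faults; copy_from_user() and
    // copy_to_user() are the robust way to move the array across.
    msrops = (struct MsrInOut*)ioctl_param;
    // Process at most MSR_VEC_LIMIT commands, stopping early at MSR_STOP.
    for (i = 0 ; i < MSR_VEC_LIMIT ; i++, msrops++) {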
        switch (msrops->op) {
        case MSR_NOP:
            dprintk(KERN_ALERT "Module " DEV_NAME ": seen MSR_NOP command\n")
            break;
        case MSR_STOP:
            dprintk(KERN_ALERT "Module " DEV_NAME ": seen MSR_STOP command\n")
            goto label_end;
        case MSR_READ:
            dprintk(KERN_ALERT "Module " DEV_NAME ": seen MSR_READ command\n")
            msrops->value = read_msr(msrops->ecx);
            break;
        case MSR_WRITE:
            dprintk(KERN_ALERT "Module " DEV_NAME ": seen MSR_WRITE command\n")
            write_msr(msrops->ecx, msrops->eax, msrops->edx);
            break;
        case MSR_RDTSC:
            dprintk(KERN_ALERT "Module " DEV_NAME ": seen MSR_RDTSC command\n")
            msrops->value = read_tsc();
            break;
        default:
            dprintk(KERN_ALERT "Module " DEV_NAME ": Unknown option 0x%x\n", msrops->op)
            return 1;
        }
    }
    label_end:

    return 0;
}


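static int msrdrv_init(void)
{
    int err;
    msrdrv_dev = MKDEV(DEV_MAJOR, DEV_MINOR);
    err = register_chrdev_region(msrdrv_dev, 1, DEV_NAME);
    if (err < 0) {
        return err;
    }
    // cdev_alloc() returns an initialised cdev; calling cdev_init() on it
    // as well would zero the structure again, wiping the owner field.
    msrdrv_cdev = cdev_alloc();
    if (!msrdrv_cdev) {
        unregister_chrdev_region(msrdrv_dev, 1);
        return -ENOMEM;
    }
    msrdrv_cdev->owner = THIS_MODULE;
    msrdrv_cdev->ops = &msrdrv_fops;
    err = cdev_add(msrdrv_cdev, msrdrv_dev, 1);
    if (err < 0) {
        kobject_put(&msrdrv_cdev->kobj);
        unregister_chrdev_region(msrdrv_dev, 1);
        return err;
    }
    printk(KERN_ALERT "Module " DEV_NAME " loaded\n");
    return 0;
}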

static void msrdrv_exit(void)
{
    cdev_del(msrdrv_cdev);
    unregister_chrdev_region(msrdrv_dev, 1);
    printk(KERN_ALERT "Module " DEV_NAME " unloaded\n");
}

module_init(msrdrv_init);
module_exit(msrdrv_exit);

Since I quite like to keep all the source in one place, here's the Makefile for the above kernel module. It uses the kernel's build-system and apart from the boilerplate the only thing which you need to specify is the argument to obj-m!

Makefile

ifneq ($(KERNELRELEASE),)
        obj-m := msrdrv.o

else
        KERNELDIR ?= /lib/modules/$(shell uname -r)/build
        PWD := $(shell pwd)

default:
        $(MAKE) -C $(KERNELDIR) M=$(PWD) modules

endif

clean:
        rm -f *.ko *.o

Once you've built the above, you should be able to install and uninstall it by executing (as root) one of the two following scripts.

install.sh

#!/bin/bash

if [ "$(whoami)" != "root" ] ; then
        echo -e "\n\tYou must be root to run this script.\n"
        exit 1
fi
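
# Create the device node (major 223, minor 0, as defined in msrdrv.h),
# open it up to all users, then load the module.
mknod /dev/msrdrv c 223 0
chmod 666 /dev/msrdrv
insmod -f msrdrv.ko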

uninstall.sh

#!/bin/bash

if [ "$(whoami)" != "root" ] ; then
        echo -e "\n\tYou must be root to run this script.\n"
        exit 1
fi

rmmod msrdrv
rm /dev/msrdrv

Of course, so far I haven't gone into any detail about what the Intel MSRs do or how to program them. That discussion might not happen all in one place; the easiest way to see how they work is in practice. To that end, I plan on showing a framework I've put together which lets you use Kx Systems' kdb+ database to launch tests and dynamically review the test-results.

2 comments:

  1. Hello. Thanks for your publication. In static long long read_msr(unsigned int ecx) . Can you explain to me why you use 0x20 ?

    1. Hi. It's been a while since I wrote this (just over 7 years!), but this line:

      eax | (unsigned long long)edx << 0x20

      which should probably have been written like this:

      eax | ((unsigned long long)edx << 0x20)

      intends to promote the value from edx into the top 32 bits of the 64-bit value. Without having re-read the Intel manual (I'll leave that to you), my guess is that the assembly instruction "rdmsr" reports its result in eax and edx.