Wednesday, 12 December 2012

Another vectorised bitwise-OR function for kdb+, AVX style


In my last post on a vectorised bitwise-OR function for kdb+, I wondered towards the end what the next interesting step would be — should it be the production of an AVX-variant of the same code, or should it be the modification of the code to handle short as well as byte values? Well, I opted for the former, and found out that it really wasn't worth the trouble, and the biggest benefit was realising some of the limitations of the AVX instruction-set. It's worth sharing, since this may help others in the future and will help produce a faster version when AVX2 instructions come out... which is a hint.

Wednesday, 21 November 2012

A vectorised bitwise-OR function for kdb+

A long time ago I started writing some shared-library functions for kdb+ which would offer bitwise comparison of integer vector types using SSE vector registers. I came up with something which worked well enough but wanted to know if it could be made to go any faster. I wondered in particular about Intel's prefetch instructions and needed a way of verifying whether they made any difference.

Monday, 19 November 2012

Intel Performance Monitoring: Loose Ends

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends (this post)

If you haven't already, you'll need to download the q 3.0 trial version for Linux from Kx Systems. Although it's the 32-bit version, it is fully-functional apart from the fact that it is time-limited to somewhere around an hour's use before you need to restart it. I was labouring under the misapprehension that it could be run on a 64-bit system, but since q is a dynamically-linked application, you'll need to do something with chroot and 32-bit libraries if you want to try that.

Wednesday, 14 November 2012

Scripting MSR Performance Tests With kdb+: Part 2

This post continues the series on performance monitoring with Intel MSRs on Linux using the batch-oriented kernel module to read and write values from and to the MSRs. The previous posts can be found here:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2 (this post ;)
- Intel Performance Monitoring: Loose Ends

This time I'm going to build the shared library used by kdb+ to launch and control the test run. It's fairly simple, since the fiddly work of calculating the values to be written to the IA32_PERFEVTSELx, IA32_FIXED_CTR_CTRL and IA32_PERF_GLOBAL_CTRL MSRs has already been done. What it will do is own the process of stopping, clearing and staring the counters, as well as running a baseline to test the fixed costs of the interation with the MSR kernel module.

Tuesday, 13 November 2012

Scripting MSR Performance Tests with kdb+

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+ (this post)
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends

One of the issues with coding performance monitoring code is the management of the PMC/FFC configuration scripts. As you can see from my previous posts (1, 2, 3), using the scripts with the MSR kernel driver is easy, but getting the right data into the script in the first place is a bit more tricky. You could certainly provide helper functions in order to facilitate the twiddling of the various bits in the IA32_PERFEVTSELx registers. However, to make it useable I think it should be possible to look up the different performance monitoring events by name.

Sunday, 11 November 2012

Fun With MSRs: Counting Performance Events On Intel

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel (this post)
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends

Hi, the last two posts have laid some groundwork for this post, in which I hope to show how you can measure various performance-related events using Intel's MSRs. This post assumes you have at least installed the MSR kernel module discussed in this earlier post. All we're going to do this time is record two MSR configuration scripts to memory and execute some arbitrary code to measure some performance metrics. One script will configure the MSRs and reset the counter values to zero, while the other will read the accumulated values after the test code has executed.

Friday, 9 November 2012

Intel MSR Performance Monitoring Basics

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics (this post)
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends

In the previous post I published code to create, build and install a Linux kernel module which would permit a user to execute a batch of commands to read from or write to Intel MSRs. This post will provide some background on using MSRs and controlling their behaviour.

A Linux Kernel Module For Reading/Writing MSRs

This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs (this post)
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends

It's been a while since the last post, mostly because I've been trying to get my head around the way the Intel performance monitoring instructions work. Rolling your own test-harness to measure how many clock-ticks, µops or L1 cache misses have taken place in a given stretch of code is quite involved — but don't let that put you off, it's pretty cool once you've got it all working. Of course, you don't have to roll your own, but it is in the best British traditions of pottering around in the garden shed, taking things to bits just to see how they work.

Saturday, 30 June 2012

Interpreting readelf -r, in this case R_X86_64_PC32

Having just put the monster Relocations, Relocations blog-post to bed, at one point I caught myself trying to compute a relocation from the information given by readelf -r. It turns out that it's a bit confusing, and not at all clear how you get from the readelf output to addresses and offsets. So, I've put together the following shared library in the hope that we can walk through that process.

Sunday, 24 June 2012

Relocations, Relocations

I've wanted to write something about symbol relocations in ELF binaries for a while now, but it's become apparent that it's no small topic, since it depends on understanding of the Executable and Linkable Format. I'm going to try to report what I've found in as much detail as I can imagine before I realise what a monumental task it actually is.

Friday, 9 March 2012

The Trouble With FSUB

It's not often you get a really nasty surprise when writing software, and even less often that the nasty surprise is lurking in the compiler (or assembler). It turns out that the Gnu assembler does not treat the operands of instructions such as fsub and fdiv uniformly - in fact, some of the time it does the opposite of what you instruct it to do.

Friday, 2 March 2012

Thursday, 1 March 2012

A Caesar Cypher? Using SIMD? Why?

There's no easy answer to that. I had applied to join a group on some social networking site which was for assembly language programmers, and to check that I hadn't got them confused with a flat-pack furnishings fanciers, the polite email asked whether I would send them a simple encryption routine written in assembly which didn't rely on xor. Since I seem to be playing with vector instructions, I thought I'd have a go using some SSE (<=4.1) instructions.

Friday, 17 February 2012

AVX matrix-multiplication, or something like it

It's been a while since the last post, and I can confidently say that I understand one or two percent of the new (well, new to me) world of AVX instructions. There was the not so not-so-brief incident involving lots of head scratching about why my test implementation using vgatherqpd would cause a SIGILL exception on my Sandy Bridge laptop. I guess cpuid does have another use outside timing loops ;)

Thursday, 16 February 2012

Stack Alignment

The System V ABI "AMD64 Architecture Processor Supplement" stipulates that the stack should be aligned to a 16-byte boundary before calling a function. It provides the following:

"The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame."

So... aligning the stack to a 16-byte boundary is easy, right?

Tuesday, 17 January 2012

Bounded ranges, constrained or clamped values

So, given the ability to build a jump table for a range of input, it would be useful to transform a given input value onto one of the valid options for that table. If you have a quick look at the pseudo code reproduced from Wikipedia, you'll see it has a validate x instruction, which if you ask me is an unapologetically brief way of describing a hashing, bounding, clamping or range-checking process. However, that's what this post is about: confining an input to a range of output values. It's actually unrelated to the generation or use of a jump table, but it doesn't take much imagination to see how it might be useful.

Switch Blocks & Jump Tables

I didn't know how jump tables were assembled prior to starting this investigation, and as anyone interested in assembly code will testify, using Google doesn't really help very much. Jonathan Bartlett's excellent "Programming From The Ground Up" and Appendix C, "C Idioms In Assembly Language" sadly doesn't cover switch blocks, but Wikipedia turns out to have a good entry.

Compare And Swap

Atomic instructions are used by the OS to provide higher-level concurrency constructs such as locks and semaphores. Probably the best known is the cmpxchg instruction, which takes two operands: a source register and destination register or address. To be useful in concurrent code, the destination operand will be a memory address. It is described in the Intel Software Developer's Manual Volume 2A at 3-188 (or page 260 according to my PDF reader) as follows:
Compares the value in the AL, AX, EAX, or RAX register with the [...] destination operand. If the two values are equal, the [...] source operand is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, EAX or RAX register. RAX register is available only in 64-bit mode.

"In 64-bit mode, the instruction’s default operation size is 32 bits. [...] Use of the REX.W prefix promotes operation to 64 bits."

Anyway, here's an example.

The initial stack, reading process arguments (and environment variables)

I wrote a 32-bit assembly application a while ago which performed the simple task of printing out the program arguments and then the environment variables. Most people have seen a C-style main method:
int main(int argc, char* argv[]) {
 //...
}
Wikipedia tells me that "Unix (though not POSIX.1) and Windows have a third argument giving the program's environment", and takes the following form:
int main(int argc, char *argv[], char *envp[]) {
 //...
}
See libc_start_main.c at lines 58-67 for the mechanics of how the signature is varied between the two forms. Depending on which form is implemented the stack frame would look different as it would have a third argument pushed onto it. However, we're not going to rely on the glibc code to invoke a main method, we're going to implement a global function called _start and see what we're given: this is at a level below what you'd see in a C-style main function.

Syntax Highlighting for Assembly Code

Having put a fair amount of time into writing the posts I've published up so far, I've become disappointed with publishing code snippets in <pre> tags. There is Alex Gorbatchev's shiny JavaScript solution, but back when this blog was on Wordpress, I couldn't use it on as I couldn't supply my own 'brush' to format GNU assembler/gas code.

Calling Java from assembler

Hi there,

You may think that this is completely insane. We have C, right? You know, that high-level language which might still be popular come the end of the year?

Yes, but then that's hardly the point. I want to know how to do the same thing in assembly. So, with that in mind, here goes.

Monday, 16 January 2012

Calling ASM functions from Java

If you've been inquisitive enough to read the "About" pages you'll see that my day job involves writing software in Java. To that end, I've put together some code which demonstrates calling a function in a shared library (written in assembler). Hopefully the following will illustrate the steps involved fairly clearly.

Hello, World!

Well, here it is, the ubiquitous "Hello, World!" example.

Despite the title, there are a couple of interesting things to note in the code below. The first is calculating the length of the string hellotxt, the second being the different usage of the $ operator in GNU assembler.