A vectorised bitwise-OR function for kdb+

Published November 21, 2012

A long time ago I started writing some shared-library functions for kdb+ which would offer bitwise comparison of integer vector types using SSE vector registers. I came up with something which worked well enough but wanted to know if it could be made to go any faster. I wondered in particular about Intel’s prefetch instructions and needed a way of verifying whether they made any difference.

Please also see the follow-up post Another vectorised bitwise-OR function, AVX style.

In order to find out, I downloaded, built and figured out how to use Agner Fog’s unique testp library. I say unique because I am unaware of any other non-commercial software which aims to help programmers accurately measure the performance of a section of code. The invariant RDTSC instruction just doesn’t cut the mustard in an age of speed-stepping and turbo boost. I have to admit to finding Agner’s library a little bit confusing, but I did get useful results with it. And then… well, then I got side-tracked, as I thought it could be simplified for the single-threaded use-case; after all, how hard could it be? Once the MSR kernel module is in place, it should be easy to roll your own controller. Then I had the idea about using kdb+ to run the controller and manage the results… and so I spent quite a bit of time not writing vectorised bitwise comparison functions and instead wrote a performance monitoring tool (entirely for my own selfish needs, hence it currently only targets the 2nd-generation Core i7 processor family).

But now I’m back on the original project and have been able to determine that the prefetch instruction does make a measurable difference; I think I’ve found the sweet spot, which knocks about 10% off the execution time of the vector-register code-section. I’ll present the code in chunks because it’s quite long, not because of the comparison code itself but because of having to handle insufficiently well-aligned data and end-of-vector data which doesn’t fill a whole XMM register. The supporting code for this blog-post can be found on my space on GitHub.

/*
 * (c) Michael Guyver, 2012, all rights reserved. 
 */
.include "savereg.macro.s"

.section .rodata
    .align    0x10                            # Align LbadTypeStr to 16-byte boundary
.LbadTypeStr:
    .asciz    "vectype"

.section .text
    .globl    bitwiseOr                       # Declare global function 'bitwiseOr'
    .type     bitwiseOr, STT_FUNC

.LbadType:                                    # Handler for invalid input vector-type
    lea      .LbadTypeStr(%rip), %rdi         # Load address of error message
    callq     krr@PLT                         # Call kdb's error function
    jmp      .LrestoreStackForExit            # Exit via the clean-up function

.LsuccessExit:
    mov       %r14, %rax                      # Copy result address to RAX
.LrestoreStackForExit:
    mov       %rbp, %rsp                      # Restore poppable SP
    pop       %r15                            # Restore non-volatile registers
    pop       %r14
    pop       %r13
    pop       %r12
    pop       %rbx
    pop       %rbp
    ret                                       # Exit to calling code

bitwiseOr:
    push      %rbp                            # Push all non-volatile registers
    push      %rbx
    push      %r12
    push      %r13
    push      %r14
    push      %r15
    mov       %rsp, %rbp                      # Store updated SP

The code above contains the simple error-reporting function .LbadType. This notifies the q binary that an exception has occurred, which is reported on return from the library function. The .LrestoreStackForExit epilogue, which is executed on every exit-path from bitwiseOr, restores the non-volatile registers from the stack and finally executes the ret instruction. Note also the .LsuccessExit function which sets the result value in the RAX register (and which then flows into the function epilogue). I’ve shown the prologue of the bitwiseOr function for comparison with the function epilogue.

It’s probably worth mentioning that I’ve stored the stack-pointer in RBP to make it simple to recover the position from which the values can be popped from the stack. This has the advantage that I don’t need to think too hard about stack-pointer alignment and can simply align it to the next lowest 16-byte boundary, as in the following code, by performing a bitwise AND with the bitwise NOT of 0x0F. Note also that this code has been modified to accommodate two additional parameters, namely the function pointers to libpmc.c’s start_counters and stop_counters.

    sub       $0x10, %rsp                     # Reserve stack space
    and       $~0xF, %rsp                     # Align to 16-byte boundary
    push      %rdx                            # Store arg2: start_counters address: 0x08(%rsp)
    push      %rcx                            # Store arg3: stop_counters address:  0x00(%rsp)

    mov       %rdi, %r12                      # Save arg0: byte-vec
    mov       %rsi, %r13                      # Save arg1: byte-mask

    movb      2(%rdi), %al                    # Copy arg0->t to lo-byte
    mov       %al, %bl                        # Copy type
    sub       $4, %bl                         # Check lower-bound
    jl        .LbadType                       # Branch if below
    sub       $4, %al                         # Check upper-bound
    jg        .LbadType                       # Branch if above

After the stack alignment and storing the two function pointers on the stack (keeping the 16-byte alignment), the code then tests the type of the first argument. Have a look at the k.h header file for the definition of the 3.0 struct k0: the third byte in the struct defines the type. At the moment the type-check tests a lower and an upper bound of 4; this isn’t nuts, it’s just waiting until I implement the code for short, int and possibly long values, whose type-values are 5, 6 and 7. If the type fails either comparison, the code branches to the error handler. The code then continues to instantiate the result object, using functions exposed by the q binary:

    # Create result object
    mov       $1, %rdi                        # Specify type (bool-vec)
    movq      8(%r12), %rsi                   # Copy veclen(arg0)
    callq     ktn@PLT                         # Create result struct
    mov       %rax, %r14                      # Store result address in non-vol register

It’s a simple code-snippet: it requests the construction of a boolean vector with the same length as input vector arg0. See code.kx.com for information on the ktn function for creating lists. The code then stores the pointer to the result struct in a spare non-volatile register for later use.
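
For readers more at home in C, the type guard and the result allocation together correspond roughly to the following sketch against the 3.0 k.h definitions, where KG (4) is the byte-vector type and KB (1) the boolean list type; the wrapper name is mine, not part of the library:

#include "k.h"                           /* kdb+ C API: K, KG, KB, krr, ktn */

/* Rough C equivalent of the guard and allocation above: accept only a byte
 * vector (t == KG == 4) for now, signal 'vectype otherwise, then build a
 * boolean result list of the same length. */
static K check_and_alloc(K arg0)
{
    if (arg0->t < KG || arg0->t > KG)    /* lower and upper bound are both 4 for now */
        return krr("vectype");
    return ktn(KB, arg0->n);             /* boolean vector, same length as the input */
}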

We’re finally getting to some of the comparison code. The basic comparison I’ve come up with is as follows:

example XMM value under test: 000102030405060708090a0b0c0d0e0f
test against vectorised mask byte: 01010101010101010101010101010101
(interim result) 00010001000100010001000100010001
compare with zeroed-operand: 00000000000000000000000000000000
(interim result) FF00FF00FF00FF00FF00FF00FF00FF00
XOR with fully-set operand: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
(interim result) 00FF00FF00FF00FF00FF00FF00FF00FF
Reduce 0xFF to boolean ‘true’ 0x01: 00010001000100010001000100010001

It’s a bit long-winded, and for the special case where you’re testing against a byte-mask of 0x01 you could skip the final three steps (note to self), but it works for the general case. In order to perform these comparisons and transformations, I need to set up a vector containing 16 copies of the byte-mask passed in as the second argument to bitwiseOr, a fully-zeroed XMM register (easy), a fully-set XMM register (easy, as it turns out) and an XMM register with each byte set to 0x01 (easy if you know that Agner Fog has taken the time and trouble to publish in his manuals the steps necessary to generate different constant values):

.LpreLoopByte:
    # shuffle mask byte into all 128 bits of XMM
    xor       %rax, %rax                      # zero register
    movb      8(%r13), %al                    # load mask-byte, arg1->g
    movq      %rax, %xmm0                     # copy byte to XMM
    pxor      %xmm1, %xmm1                    # clear mask bytes
    pshufb    %xmm1, %xmm0                    # XMM0 contains 16 copies of arg1->g
    pcmpeqw   %xmm2, %xmm2                    # set all bits

    # Initialise 0xFF->0x01 conversion mask
    pcmpeqw   %xmm3, %xmm3                    # Generates: 0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff
    psrlw     $15, %xmm3                      # Shifts:    0x0001_0001_0001_0001_0001_0001_0001_0001
    packuswb  %xmm3, %xmm3                    # Packs:     0x0101_0101_0101_0101_0101_0101_0101_0101

    movq      8(%r14), %r15                   # Copy veclen(result) -> R15
    xor       %rdx, %rdx                      # Clear counter

Having set up the constant values needed to do the comparisons, there now follows the main-loop controller code. It’s complicated by the fact that it needs to test bounding conditions, the count of elements remaining and cache-line alignment before it can branch to the high-throughput code. Read the comments for more detail:

.LloopByte:
    mov       %r15, %rax                      # Copy vec-len
    sub       %rdx, %rax                      # Calculate remaining
    jle       .LsuccessExit                   #   branch to exit if <= 0
    cmp       $0x10, %rax                     # Compare remaining with 16
    jl        .LsubXmmRemaining               #   branch if < 16 
    je        .LsingleReg                     #   branch if == 16
    lea       0x10(%r12,%rdx), %rbx           # Load address of next read
    test      $0x3f, %rbx                     # Test for 64-byte alignment
    jnz       .LsingleReg                     #   if not aligned, jump to generic handler
    cmp       $0x40, %rax                     # Read is aligned, check num bytes available
    jge       .LquadRegPre                    #   if >= 64, handle 64-bytes at a time
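
As a reading aid, the decision tree above boils down to the following scalar C sketch; the names are mine and the SIMD work is reduced to a trivial byte loop:

#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for the SIMD work on one 16- or 64-byte block. */
static void do_block(const unsigned char *src, unsigned char *dst,
                     size_t n, unsigned char mask)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (src[i] & mask) ? 0x01 : 0x00;
}

/* The same decision tree as the loop controller above, in scalar C. */
static void bitwise_or_loop(const unsigned char *src, unsigned char *dst,
                            long veclen, unsigned char mask)
{
    long counter = 0;
    for (;;) {
        long remaining = veclen - counter;
        if (remaining <= 0)                                    /* .LsuccessExit     */
            return;
        if (remaining < 16) {                                  /* .LsubXmmRemaining */
            do_block(src + counter, dst + counter, (size_t)remaining, mask);
            return;
        }
        if (remaining >= 64 &&
            ((uintptr_t)(src + counter) & 0x3F) == 0) {        /* 64-byte aligned   */
            do_block(src + counter, dst + counter, 64, mask);  /* .LquadReg         */
            counter += 64;
        } else {                                               /* == 16, unaligned, */
            do_block(src + counter, dst + counter, 16, mask);  /* or aligned < 64   */
            counter += 16;
        }
    }
}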

The code starting at label .LsingleReg handles the case where there is exactly one XMM register’s worth of data remaining, where the input data is not yet aligned to a 64-byte boundary, or where the read is aligned but fewer than 64 bytes remain (the fall-through from the controller above). The result is then copied into the boolean result-vector, which is 16-byte aligned (enabling the use of movaps without further alignment checks). Finally it jumps back to the loop controller code.

.LsingleReg:
    movaps    0x10(%r12,%rdx,1), %xmm4        # Load 16-bytes of input
    pand      %xmm0, %xmm4                    # compare with mask
    pcmpeqb   %xmm1, %xmm4                    # set to 0xFF if equal
    pxor      %xmm2, %xmm4                    # flip bits
    pand      %xmm3, %xmm4                    # convert 0xFF to 0x01
    movaps    %xmm4, 0x10(%r14,%rdx,1)        # Given >= 16 bytes, copy result
    add       $0x10, %rdx
    jmp       .LloopByte
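
For anyone who prefers intrinsics to raw assembly, the constant setup and the per-16-byte transform correspond roughly to this C sketch; the function name is mine, and the real code of course builds the constants once, outside the loop, rather than on every block:

#include <tmmintrin.h>                        /* SSSE3: _mm_shuffle_epi8 (pshufb); pulls in SSE2 */

/* Rough intrinsics rendering of the transform above:
 * out[i] = (in[i] & mask_byte) ? 0x01 : 0x00 */
static __m128i or_mask_block(__m128i in, unsigned char mask_byte)
{
    __m128i zero      = _mm_setzero_si128();                                  /* pxor xmm1, xmm1   */
    __m128i mask      = _mm_shuffle_epi8(_mm_cvtsi32_si128(mask_byte), zero); /* pshufb broadcast  */
    __m128i ones      = _mm_cmpeq_epi16(zero, zero);                          /* pcmpeqw: all bits */
    __m128i one_words = _mm_srli_epi16(ones, 15);                             /* psrlw $15         */
    __m128i one_bytes = _mm_packus_epi16(one_words, one_words);               /* packuswb: 0x01s   */

    __m128i t = _mm_and_si128(in, mask);      /* pand: AND with the vectorised mask byte    */
    t = _mm_cmpeq_epi8(t, zero);              /* pcmpeqb: 0xFF where the AND result was 0   */
    t = _mm_xor_si128(t, ones);               /* pxor: flip, 0xFF where a mask bit was set  */
    return _mm_and_si128(t, one_bytes);       /* pand: reduce 0xFF to boolean 0x01          */
}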

The next chunk of code is longer as I’ve unrolled a loop such that each iteration addresses one whole cache-line of data by using four XMM registers. A couple of things to point out: first, the .LquadRegPre label makes it easy to skip the code I’ve added to start the PMC timer in the line:

call      *0xa8(%rsp)

The value 0xa8(%rsp) references the function pointer start_counters; the 0xa8 offset allows for the values stored on the stack by the macro m_save_regs, the code for which you can review on GitHub, if you like.

Then, of course, comes the prefetcht0 instruction, the cause of the delay in this post seeing the light of day. There’s not a great deal of information available about its operation, but such guidance as exists suggests ensuring that the data it references is sufficiently far ahead of the current read-point that it will have been loaded by the time it’s needed. Furthermore, since it may result in a load-instruction being issued, it should appear after any explicit load-instructions as it may otherwise negatively impact their performance. There are several types of prefetch instruction, and although the whole area is slightly murky, I seem to get the best results with prefetcht0.

.LquadRegPre:
    m_save_regs
    call      *0xa8(%rsp)
    m_restore_regs
.LquadReg:
    movaps    0x10(%r12,%rdx,1), %xmm4        # Copy cache line into registers
    movaps    0x20(%r12,%rdx,1), %xmm5
    movaps    0x30(%r12,%rdx,1), %xmm6
    movaps    0x40(%r12,%rdx,1), %xmm7

    prefetcht0 0x210(%r12,%rdx,1)             # Prefetch src cache line 8 loops hence (after movaps above)

    pand      %xmm0, %xmm4                    # Bitwise-AND 16-byte value against the vectorised byte-mask
    pand      %xmm0, %xmm5
    pand      %xmm0, %xmm6
    pand      %xmm0, %xmm7

    pcmpeqb   %xmm1, %xmm4                    # Compare byte-values with zero, each now 0xFF or 0x00
    pcmpeqb   %xmm1, %xmm5
    pcmpeqb   %xmm1, %xmm6
    pcmpeqb   %xmm1, %xmm7

    pxor      %xmm2, %xmm4                    # Flip bytes using an operand with all bits set
    pxor      %xmm2, %xmm5
    pxor      %xmm2, %xmm6
    pxor      %xmm2, %xmm7

    pand      %xmm3, %xmm4                    # Filter bits so those set 0xFF now equal 0x01
    pand      %xmm3, %xmm5
    pand      %xmm3, %xmm6
    pand      %xmm3, %xmm7

    movaps    %xmm4, 0x10(%r14,%rdx,1)        # Copy result to output
    movaps    %xmm5, 0x20(%r14,%rdx,1)
    movaps    %xmm6, 0x30(%r14,%rdx,1)
    movaps    %xmm7, 0x40(%r14,%rdx,1)

    add       $0x40, %rdx                     # Add 64 to counter
    mov       %r15, %rax                      # Copy veclen
    sub       %rdx, %rax                      # Subtract counter from veclen
    cmp       $0x40, %rax                     # Compare remainder against 64 bytes
    jge       .LquadReg                       #   if >= 64, repeat aligned branch
    m_save_regs
    call      *0xa0(%rsp)
    m_restore_regs
    jmp       .LloopByte                      # Branch to generic loop
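
For reference, the equivalent prefetch placement with intrinsics would look something like the sketch below, with _mm_prefetch and the _MM_HINT_T0 hint standing in for prefetcht0. The mask/compare/flip/reduce steps are elided since they appear above, and the function name is mine:

#include <emmintrin.h>                        /* SSE2: __m128i, aligned loads/stores */
#include <xmmintrin.h>                        /* _mm_prefetch, _MM_HINT_T0           */

/* One unrolled iteration over a whole 64-byte cache line.  The prefetch is issued
 * after the four explicit loads and aims 512 bytes (8 iterations) ahead. */
static void process_cache_line(const unsigned char *src, unsigned char *dst, long i)
{
    __m128i a = _mm_load_si128((const __m128i *)(src + i));
    __m128i b = _mm_load_si128((const __m128i *)(src + i + 16));
    __m128i c = _mm_load_si128((const __m128i *)(src + i + 32));
    __m128i d = _mm_load_si128((const __m128i *)(src + i + 48));

    _mm_prefetch((const char *)(src + i + 512), _MM_HINT_T0);

    /* ...apply the mask/compare/flip/reduce sequence to a, b, c and d here... */

    _mm_store_si128((__m128i *)(dst + i),      a);
    _mm_store_si128((__m128i *)(dst + i + 16), b);
    _mm_store_si128((__m128i *)(dst + i + 32), c);
    _mm_store_si128((__m128i *)(dst + i + 48), d);
}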

The following series of instructions handles the case where there are fewer than 16 bytes remaining. It still loads 16 bytes into an XMM register, trusting in memory-alignment niceties to assume that we’re not going to attempt to load data off the end of the virtual page, but it takes care to ensure that it only writes the right number of bytes to the result. It does this by comparing the element-count successively with 8, then 4, 2 and finally 1, decrementing the count appropriately as bytes are written to the result vector. By way of example, assume that 15 bytes need to be processed: in that case, each of the sections which deal with 8, 4, 2 and 1 bytes will execute, since 8 + 4 + 2 + 1 = 15. With only 7 bytes remaining, the code sections for 4, 2 and 1 would execute. Etc.

.LsubXmmRemaining:
    movaps    0x10(%r12,%rdx,1), %xmm4        # Load 16-bytes of input
    pand      %xmm0, %xmm4                    # compare
    pcmpeqb   %xmm1, %xmm4                    # set to 0xff if equal
    pxor      %xmm2, %xmm4                    # flip bytes
    pand      %xmm3, %xmm4                    # convert 0xff to 0x01

    movaps    %xmm4, (%rsp)                   # Copy boolean results to stack
    xor       %rbx, %rbx                      # Zero RBX for counting bytes copied
    mov       %r15, %rax                      # Copy veclen(result)
    sub       %rdx, %rax                      # Calc remaining
    cmp       $0x08, %rax                     # Compare remaining with 8
    jl        .LltEightRemaining              #   branch < 8
    movq      (%rsp,%rbx,1), %r8              # Copy QW from stack to QW reg
    movq      %r8, 0x10(%r14,%rdx,1)          # Copy the same to the result
    add       $0x08, %rdx                     # Add 8 to counter
    add       $0x08, %rbx                     # Add 8 to src-counter
    sub       $0x08, %rax                     # Subtract 8 from remaining
.LltEightRemaining:                           # Handle < 8
    cmp       $0x04, %rax                     # Compare remaining with 4
    jl        .LltFourRemaining               #   branch < 4
    movl      (%rsp,%rbx,1), %r8d             # Copy result to DW reg
    movl      %r8d, 0x10(%r14,%rdx,1)         # Copy DW to result
    add       $0x04, %rdx                     # Add 4 to counter
    add       $0x04, %rbx                     # Add 4 to src-counter
    sub       $0x04, %rax                     # Subtract 4 from remaining
.LltFourRemaining:
    cmp       $0x02, %rax
    jl        .LltTwoRemaining
    movw      (%rsp,%rbx,1), %r8w
    movw      %r8w, 0x10(%r14,%rdx,1)         # Copy W to result
    add       $0x02, %rdx                     # Add 2 to counter
    add       $0x02, %rbx                     # Add 2 to src-counter
    sub       $0x02, %rax                     # Subtract 2 from remaining
.LltTwoRemaining:
    cmp       $0x01, %rax
    jl        .LnoneRemaining
    movb      (%rsp,%rbx,1), %r8b
    movb      %r8b, 0x10(%r14,%rdx,1)         # Copy byte to result
.LnoneRemaining:
    jmp       .LsuccessExit
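
Rendered in C, that tail write-back amounts to something like the following sketch, assuming the 16 transformed bytes have already been staged in a temporary buffer (the copy the assembly parks on the stack); the function name is mine:

#include <string.h>

/* Sketch of the tail write-back: 'tmp' holds the 16 already-transformed boolean
 * bytes and 'remaining' is < 16. */
static void copy_tail(const unsigned char *tmp, unsigned char *dst, long remaining)
{
    long copied = 0;
    if (remaining >= 8) { memcpy(dst + copied, tmp + copied, 8); copied += 8; remaining -= 8; }
    if (remaining >= 4) { memcpy(dst + copied, tmp + copied, 4); copied += 4; remaining -= 4; }
    if (remaining >= 2) { memcpy(dst + copied, tmp + copied, 2); copied += 2; remaining -= 2; }
    if (remaining >= 1) { dst[copied] = tmp[copied]; }
}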

PREFETCH performance data

I chose the .pmc.script4 set of performance counters to illustrate the effect of the prefetch instruction since it shows µ-ops as well as some memory-related counters. Here’s the code with the prefetcht0 instruction in place:

q)t0:1_.pmc.script4`usr
q)-15#t0
instAny clkCore clkRef UopsAny MemUopsLd MemUopsSv StallCycles MHz nanos
------------------------------------------------------------------------
46860   15056   50816  51589   7826      6248      596         800 18821
46860   14858   50141  51599   7826      6248      526         800 18571
46860   14814   50006  51596   7826      6248      479         800 18521
46860   15130   51086  51589   7826      6248      695         800 18921
46860   14914   50330  51589   7826      6248      585         800 18641
46860   14810   49979  51599   7826      6248      501         800 18511
46860   14877   50195  51595   7826      6248      526         800 18591
46860   14854   50141  51599   7826      6248      550         800 18571
46860   14738   49736  51578   7826      6248      508         800 18421
46860   14737   49736  51593   7826      6248      505         800 18421
46860   14656   49466  51596   7826      6248      480         800 18321
46860   15008   50654  51593   7826      6248      642         800 18761
46860   14527   49007  51578   7826      6248      384         800 18151
46860   15015   50681  51589   7826      6248      614         800 18771
46860   14804   49979  51593   7826      6248      556         800 18511

And here are the results with the prefetcht0 instruction removed:

q)t1:1_.pmc.script4`usr
q)-15#t1
instAny clkCore clkRef UopsAny MemUopsLd MemUopsSv StallCycles MHz nanos
------------------------------------------------------------------------
45298   19202   64806  50025   6263      6248      1379        800 24002
45298   18919   63861  50021   6263      6248      1265        800 23652
45298   19291   65103  50025   6263      6248      1397        800 24112
45298   19034   64239  50029   6263      6248      1234        800 23792
45298   19296   65103  50041   6263      6248      1380        800 24112
45298   19176   64698  50021   6263      6248      1331        800 23962
45298   19337   65265  50041   6264      6248      1423        800 24172
45298   18977   64050  50021   6262      6248      1255        800 23722
45298   19254   64995  50025   6264      6248      1374        800 24072
45298   18942   63915  50021   6262      6248      1262        800 23672
45298   19293   65103  50001   6263      6248      1371        800 24112
45298   19074   64374  50203   6263      6248      1244        800 23842
45298   19395   65481  50017   6263      6248      1417        800 24252
45298   18976   64050  50026   6263      6248      1242        800 23722
45298   19277   65076  50005   6264      6248      1371        800 24102

Here are the averages of the performance runs joined using uj or “union join”:

q)avgres:{[t;n] flip key[d]!enlist each value d:avg[n#t]}
q)avgres[t0;-100] uj avgres[t1;-100]
instAny clkCore  clkRef   UopsAny  MemUopsLd MemUopsSv StallCycles nanos   
---------------------------------------------------------------------------
46860   14987.26 50581.71 51595.45 7825.774  6248      602.7419    18734.23
45298   19240.52 64937.52 50030.77 6262.806  6248      1359.903    24050.71

OK, some things to note:
- the addition of the prefetcht0 instruction increases the number of instructions retired (instAny column) and µ-ops issued. Despite the increased work, the code section completes in a shorter time;
- the number of stall cycles is significantly higher in the version without the prefetcht0 instruction;
- the prefetcht0 instruction affects the counts returned by the PMC event MEM_UOP_RETIRED.ALL filtered by MEM_UOP_RETIRED.LOADS;
- the performance improvement factor appears to be around 1.28, but
- the processor is clearly executing at a lower clock frequency (helpfully calculated in the results tables as 800 MHz) and hence is not exercising the memory sub-system as hard as it would do, were it executing at full speed. What would happen in that case? Would the prefetcht0 instruction help, or would the memory sub-system reach saturation point and it make no difference at all?

Well, by increasing the size of the first parameter to 1 million (from 100k) and increasing the test iterations, I managed to get the CPU to try a bit harder. Not only the CPU, but also the memory sub-system: while 1 million bytes still fits in the L3 cache, the larger input makes it work harder. If I use an average from the last 1024 test runs (of 2048), I get the following:

q)avgres[t0;-1024] uj avgres[t1;-1024]
instAny clkCore  clkRef   UopsAny  MemUopsLd L3Miss   StallCycles MHz      nanos   
-----------------------------------------------------------------------------------
468720  208939.4 166570.2 515684.1 78138.16  15.08691 45108.97    3387.897 61692.64
453096  265471.6 211694.3 500098.8 62515.32  6.955078 51533.47    3386.98  78405.09

So…
- similar proportions of MemUopsLd, but a vastly different stall-cycle ratio compared to the low-power runs;
- when playing with different iteration counts, the L3Miss value was significantly higher than 15.0 for the run with the prefetcht0 instruction;
- the prefetcht0 version is faster by a factor of 1.27, which is nice;
- if the nanos column really can be treated as wall time, then a function which (once warmed up etc.) handles 1 million bitwise comparisons in 61 micros is probably going to work out OK.

A production version?

I have put together a non-commercial version for you to play with which you can download from here. You can load it using the kdbasm.q scriptlet:

/ load the shared library 'libbwcmp.so' and assign the 2-arg function 'bitwiseOr' to .mg.or
.mg.or:`libbwcmp 2:(`bitwiseOr;2);
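
For completeness, the entry point that 2: binds here has the usual shape of a kdb+ shared-library function, K arguments in and a K result out; roughly:

#include "k.h"

/* Entry point as seen by q's 2: loader: two K arguments in, one K result out.
 * (I'm assuming the non-commercial build drops the two counter-function
 * parameters used while benchmarking.) */
K bitwiseOr(K vec, K mask);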

An easy way to execute it and see its results is as follows:

q)n:2000
q).mg.or[n#`byte$ til 0xFF;0x01]
01010101010101010101010101010101010101010101010101010101010101010101010101010..
q).mg.or[n#`byte$ til 0xFF;0x02]
00110011001100110011001100110011001100110011001100110011001100110011001100110..
q).mg.or[n#`byte$ til 0xFF;0x03]
01110111011101110111011101110111011101110111011101110111011101110111011101110..
q).mg.or[n#`byte$ til 0xFF;0x08]
00000000111111110000000011111111000000001111111100000000111111110000000011111..
q).mg.or[n#`byte$ til 0xFF;0x10]
00000000000000001111111111111111000000000000000011111111111111110000000000000..
q).mg.or[n#`byte$ til 0xFF;0xFF]
01111111111111111111111111111111111111111111111111111111111111111111111111111..
q)\t .mg.or[n#`byte$ til 0xFF;0xFF]
0
q)/ \:D/

Next steps?

Well, there’s clearly scope for extending this function to support more integer types, though I do wonder about the utility of writing something for long values: that’s a lot of flag bits you’d be storing. I think that after perhaps writing something for short values, targeting the AVX registers is the obvious thing to do. It shouldn’t be too hard, I don’t think (famous last words?), since apart from more stringent memory-alignment requirements, the approach ought to be almost identical. It will be interesting to see how much faster it goes, how many instructions and µ-ops are issued, and how it affects the memory controllers, since if the code executes faster they will have to work harder to service the load-requests. Anyway, that’s for another day. Time to put this one to bed.