Another vectorised bitwise-OR function for kdb+, AVX style

Published

December 12, 2012

In my last post on a vectorised bitwise-OR function for kdb+, I wondered towards the end what the next interesting step would be — should it be the production of an AVX-variant of the same code, or should it be the modification of the code to handle short as well as byte values? Well, I opted for the former, and found out that it really wasn’t worth the trouble, and the biggest benefit was realising some of the limitations of the AVX instruction-set. It’s worth sharing, since this may help others in the future and will help produce a faster version when AVX2 instructions come out… which is a hint.

When someone starts talking about “AVX” it’s usual to think of 256-bit, 32-byte, half-cache-line-wide vector registers and instructions.

The thing is, only some of the AVX instructions can operate on 32-byte YMM registers, and pending the arrival of Haswell and the AVX2 instructions, some of them offer no obvious advantage over their SSE counterparts. In his “Optimizing subroutines in assembly language” manual (PDF), Agner Fog states (s. 13.6):

The AVX instruction set supports floating point operations in 256-bit vectors. Integer operations in 256-bit vectors require the AVX2 instruction set.
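To make the distinction concrete, here’s a minimal C sketch using intrinsics rather than assembly (purely illustrative, not part of the kdb+ extension): the 256-bit floating-point AND is available under plain AVX, whereas the 256-bit integer AND maps to an instruction that only arrives with AVX2.

#include <immintrin.h>

/* Illustrative only: 256-bit floating-point operations such as vandps are
 * part of AVX, but 256-bit integer operations such as vpand on YMM
 * registers require AVX2, hence the guard. */
__m256 and_floats(__m256 a, __m256 b)
{
    return _mm256_and_ps(a, b);              /* AVX:  vandps ymm, ymm, ymm */
}

#ifdef __AVX2__
__m256i and_bytes(__m256i a, __m256i b)
{
    return _mm256_and_si256(a, b);           /* AVX2: vpand ymm, ymm, ymm */
}
#endif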

However, even where an AVX instruction doesn’t provide access to wider registers, the VEX-encoded form is still necessary, because there is a cost to mixing AVX and legacy SSE instructions. Again, per Agner Fog:

All the existing XMM instructions have two versions in the AVX instruction set. A legacy version which leaves the upper half (bit 128-255) of the target YMM register unchanged, and a VEX-prefix version which sets the upper half to zero in order to remove false dependences. If a code sequence contains both 128-bit and 256-bit instructions then it is important to use the VEX-prefix version of the 128-bit instructions. Mixing 256 bit instructions with legacy 128-bit instructions without VEX prefix will cause the processor to switch the register file between different states which may cost many clock cycles on Intel’s Sandy Bridge processor.
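That’s why the code below brackets its AVX work with vzeroupper on entry and exit. For anyone working from C with intrinsics instead of assembly, a hedged sketch of the same precaution looks like this: issue _mm256_zeroupper() after the last 256-bit operation, before control returns to code that may contain legacy (non-VEX) SSE instructions.

#include <immintrin.h>
#include <stddef.h>

/* Sketch only: scale an array with 256-bit AVX operations, then clear the
 * upper halves of the YMM registers so that legacy SSE code in the caller
 * doesn't pay the state-transition penalty. Assumes dst/src are 32-byte
 * aligned and n is a multiple of 4. */
void scale_doubles(double *dst, const double *src, size_t n, double factor)
{
    __m256d f = _mm256_set1_pd(factor);
    for (size_t i = 0; i < n; i += 4)
        _mm256_store_pd(dst + i, _mm256_mul_pd(_mm256_load_pd(src + i), f));
    _mm256_zeroupper();                      /* emits vzeroupper */
}

(Compilers targeting AVX will typically insert vzeroupper at function boundaries for you; the explicit call just makes the intent visible.)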

So, taking my current use-case, the logical, vectorised comparison of individual byte values, there’s no benefit to using vpcmpeqb, vpxor and vpand over the SSE pcmpeqb, pxor and pand instructions, since the AVX forms still only operate on XMM registers; it’s just a question of cost-avoidance. Some performance testing of the converted bitwise-OR function bears this out, and the AVX version is, in fact, ever so slightly slower (although I’m happy to take responsibility for that).
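For reference, the per-16-byte data flow that both versions implement looks like this when sketched with 128-bit intrinsics (the function and parameter names are mine, not the extension’s):

#include <emmintrin.h>

/* Sketch of the per-16-byte kernel: AND the input with the broadcast mask
 * byte, compare with zero, invert, then squash the 0xff lanes down to
 * boolean 0x01 values. */
static __m128i masked_test_16(__m128i input, __m128i mask,
                              __m128i zero, __m128i all_ones, __m128i ones_01)
{
    __m128i t = _mm_and_si128(input, mask);  /* input & mask                         */
    t = _mm_cmpeq_epi8(t, zero);             /* 0xff where (input & mask) == 0       */
    t = _mm_xor_si128(t, all_ones);          /* flip: 0xff where (input & mask) != 0 */
    return _mm_and_si128(t, ones_01);        /* 0xff -> boolean 0x01                 */
}

Whether the compiler emits the legacy or VEX-prefixed encodings for those intrinsics depends on the target flags, which is exactly the cost-avoidance point above.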

The conversion of the existing SSE code was fairly straightforward, and once I realised that I wasn’t going to have to deal with YMM registers, the loop-accounting stayed identical: I didn’t have to handle fewer than a full YMM register’s worth of byte-values, or work out how to copy those fewer-than-32 bytes to the result-vector efficiently. The steps were quite simple:
- ensure that memory-reads and writes are properly aligned (there’s a small C sketch of the alignment test after this list); and
- ensure that only AVX instructions are used.
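The alignment test in the main loop amounts to checking the read cursor (the input payload plus the loop offset) against a 64-byte, cache-line boundary before taking the wide path; in C the check would look something like this sketch:

#include <stdint.h>

/* Sketch of the cache-line alignment test used to decide whether the
 * cache-line-at-a-time path can be taken for this iteration. */
static int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3f) == 0;
}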

kdbavxor.s
# (c) Michael Guyver, 2012, all rights reserved. Please see the copy on GitHub for the full licence.

.section .rodata
    .align    0x10
.LbadTypeStr:
    .asciz   "vectype"

.section .text

.globl    mgBitwiseOr
.type     mgBitwiseOr, STT_FUNC

.LhandleBadType:                            # Handler for invalid input vector-type
    lea        .LbadTypeStr(%rip), %rdi     # Load address of error message
    callq      krr@PLT                      # Call kdb's error-reporting function
    jmp        .LrestoreStackForExit        # Exit via the clean-up function

.LsuccessExit:
    mov        %r14, %rax                   # Copy result address to RAX

.LrestoreStackForExit:
    mov        %rbp, %rsp                   # Restore SP
    pop        %r15                         # Restore callee-saved registers
    pop        %r14
    pop        %r13
    pop        %r12
    pop        %rbx
    pop        %rbp
    vzeroupper                              # Guard against expensive AVX->SSE transition 
    ret                                     # Exit to caller

    .align 0x10                             # Align entry-point to 16-byte boundary

mgBitwiseOr:
    push       %rbp                         # Push all callee-saved registers
    push       %rbx
    push       %r12
    push       %r13
    push       %r14
    push       %r15
    mov        %rsp, %rbp                   # Store updated SP
    sub        $0x10, %rsp                  # Ensure alignment reserves at least 16 bytes
    and        $-0x20, %rsp                 # Align SP to 32-byte boundary

.LstoreInputArgs:
    mov        %rdx, 0x08(%rsp)             # Store arg2
    mov        %rcx, 0x00(%rsp)             # Store arg3
    mov        %rdi, %r12                   # Copy/store arg0
    mov        %rsi, %r13                   # Copy/store arg1

    vzeroupper                              # Guard against expensive SSE->AVX transition

.LtestInputType:
    cmpb       $4, 2(%rdi)                  # Compare input vec-type with 4 (byte vector)
    jne        .LhandleBadType              #   branch if not-equal

.LcreateResultObject:
    mov        $1, %rdi                     # arg0: vector-type
    movq       8(%r12), %rsi                # arg1: copy input vector-length
    callq      ktn@PLT                      # Create result vector
    mov        %rax, %r14                   # Safe-store result pointer

.LsetupAvxConstants:
    xor        %rax, %rax                   # Zero register
    movb       8(%r13), %al                 # Copy mask byte-value
    movq       %rax, %xmm0                  # Copy again to XMM0

    vpxor      %xmm1, %xmm1, %xmm1          # Clear XMM1: arg[2] = arg[0] ^ arg[1]
    vpshufb    %xmm1, %xmm0, %xmm0          # Byte-wise shuffle with zeroed control: broadcast mask byte to all lanes; 2B/4-282, p. 284
    vpcmpeqw   %xmm2, %xmm2, %xmm2          # Set all bits in XMM2

    vpcmpeqw   %xmm3, %xmm3, %xmm3          # Set all bits in XMM3:      0xffff_ffff_ffff_ffff_...
    vpsrlw     $15, %xmm3, %xmm3            # Shift words right 15 bits: 0x0001_0001_0001_0001_...
    vpackuswb  %xmm3, %xmm3, %xmm3          # Packs:                     0x0101_0101_0101_0101_...

.LsetupLoopCounter:
    xor        %rdx, %rdx                   # Clear loop-counter
    mov        8(%r14), %r15                # Copy result->length to R15

.LloopStart:
    mov        %r15, %rax                   # Copy vec-len
    sub        %rdx, %rax                   # Calculate remaining
    je         .LsuccessExit                #   branch to exit-handler if remaining == 0
    cmp        $0x10, %rax                  # Compare remaining with 16
    jl         .LsubXmmRemaining            #   branch if < 16
    je         .LprocessOneXmm              #   branch if == 16
    lea        0x10(%r12,%rdx), %rbx        # Calculate address of read cursor
    test       $0x3f, %rbx                  # Test for 64-byte alignment
    jnz        .LprocessOneXmm              #   if not aligned, jump to generic handler
    cmp        $0x40, %rax                  # Read is aligned, check num bytes remaining
    jge        .LprocessCacheLine           #   if >= 64, jump to cache-line handler

.LprocessOneXmm:
    vmovaps    0x10(%r12,%rdx,1), %xmm4     # Load 16 bytes of input
    vpand      %xmm0, %xmm4, %xmm4          # AND with the broadcast mask byte
    vpcmpeqb   %xmm1, %xmm4, %xmm4          # 0xff where (input & mask) == 0
    vpxor      %xmm2, %xmm4, %xmm4          # Flip: 0xff where (input & mask) != 0
    vpand      %xmm3, %xmm4, %xmm4          # Convert 0xff to boolean 0x01
    vmovaps    %xmm4, 0x10(%r14,%rdx,1)     # Store 16 result bytes
    add        $0x10, %rdx                  # Advance counter by 16
    jmp        .LloopStart                  # Next iteration

The next section contains some minor differences from the original SSE version which I thought worth mentioning. In an attempt to lower the work done by the memory controller, I have issued two 32-byte vmovaps loads instead of four 16-byte movaps loads. The fact that the memory controller does “less” work is measurable insofar as this code issues 60.0% of the number of MEM_UOP_RETIRED.LOADS events recorded for the SSE version.

In order to unpack the 32-byte YMM registers into XMM registers, I’ve used the vperm2f128 instruction with an imm8 selector that copies the YMM high double-quad-word/octo-word to the low DQW/OW of another YMM register, or in other words into an XMM register. The same instruction, with a different selector, performs the reverse, recombining the two processed halves so that the result is written to memory in two 32-byte chunks.
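In intrinsics terms, the lane split and re-join look roughly like this (a sketch of the idea rather than the extension code; _mm256_permute2f128_si256 maps to vperm2f128 and is available in plain AVX):

#include <immintrin.h>

/* Sketch of the lane shuffle: selector 0x01 brings the high 128-bit lane of
 * v down into the low lane of the result, and selector 0x20 stitches two
 * (processed) low lanes back together into one 256-bit value. */
static __m256i high_lane_to_low(__m256i v)
{
    return _mm256_permute2f128_si256(v, v, 0x01);   /* high lane -> low lane */
}

static __m256i join_low_lanes(__m256i lo_half, __m256i hi_half)
{
    /* result.low = lo_half.low, result.high = hi_half.low */
    return _mm256_permute2f128_si256(lo_half, hi_half, 0x20);
}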

    .align 0x10                             # Align jump target to 16-byte boundary

.LprocessCacheLine:
    vmovaps    0x10(%r12,%rdx,1), %ymm4     # Copy 32 bytes to YMM4
    vmovaps    0x30(%r12,%rdx,1), %ymm6     # Copy 32 bytes to YMM6

    prefetcht0 0x310(%r12,%rdx,1)           # Hint to prefetch cache line several loops hence

    vperm2f128 $0x01, %ymm4, %ymm4, %ymm5   # Unpack high-bytes from YMM4 to XMM5
    vperm2f128 $0x01, %ymm6, %ymm6, %ymm7   # Unpack high-bytes from YMM6 to XMM7

    vpand      %xmm0, %xmm4, %xmm4          # AND each block with the broadcast mask
    vpand      %xmm0, %xmm5, %xmm5          #
    vpand      %xmm0, %xmm6, %xmm6          #
    vpand      %xmm0, %xmm7, %xmm7          #

    vpcmpeqb   %xmm1, %xmm4, %xmm4          # 0xff where (input & mask) == 0
    vpcmpeqb   %xmm1, %xmm5, %xmm5          #
    vpcmpeqb   %xmm1, %xmm6, %xmm6          #
    vpcmpeqb   %xmm1, %xmm7, %xmm7          #

    vpxor      %xmm2, %xmm4, %xmm4          # Flip: 0xff where (input & mask) != 0
    vpxor      %xmm2, %xmm5, %xmm5          #
    vpxor      %xmm2, %xmm6, %xmm6          #
    vpxor      %xmm2, %xmm7, %xmm7          #

    vpand      %xmm3, %xmm4, %xmm4          # Convert 0xff to boolean 0x01
    vpand      %xmm3, %xmm5, %xmm5          #
    vpand      %xmm3, %xmm6, %xmm6          #
    vpand      %xmm3, %xmm7, %xmm7          #

    vperm2f128 $0x20, %ymm5, %ymm4, %ymm4   # Copy XMM5 to the high-bytes of YMM4
    vperm2f128 $0x20, %ymm7, %ymm6, %ymm6   # Copy XMM7 to the high-bytes of YMM6

    vmovaps    %ymm4, 0x10(%r14,%rdx,1)     # Copy 32 bytes to result
    vmovaps    %ymm6, 0x30(%r14,%rdx,1)     # Copy 32 bytes to result

    add        $0x40, %rdx                  # Loop accounting: add 64 to counter
    mov        %r15, %rax                   # Copy result length
    sub        %rdx, %rax                   # Subtract counter from result.len
    cmp        $0x40, %rax                  # Compare result with 64
    jge        .LprocessCacheLine           #   if greater or equal, loop again
    jmp        .LloopStart                  # Otherwise, enter generic loop handler

The following code is, again, functionally identical to the SSE variant except for the replacement of legacy SSE instructions with their AVX counterparts. This section of code copies less than a full XMM register’s worth of bytes to the result-vector. If, for example, there are 7 bytes remaining to copy, it skips the quad-word copy (8 bytes) but copies the next double-word (4 bytes), word (2 bytes) and byte to the result vector (4 + 2 + 1 = 7).

.LsubXmmRemaining:
    vmovaps   0x10(%r12,%rdx,1), %xmm4      # Load 16-bytes of input
    vpand     %xmm0, %xmm4, %xmm4           # compare
    vpcmpeqb  %xmm1, %xmm4, %xmm4           # set to 0xff if equal
    vpxor     %xmm2, %xmm4, %xmm4           # flip bytes
    vpand     %xmm3, %xmm4, %xmm4           # convert 0xff to 0x01

    vmovaps   %xmm4, (%rsp)                 # Copy boolean results to stack
    xor       %rbx, %rbx                    # Zero RBX for counting bytes copied
    mov       %r15, %rax                    # Copy veclen(result)
    sub       %rdx, %rax                    # Calc remaining
    cmp       $0x08, %rax                   # Compare remaining with 8
    jl        .LltEightRemaining            #   branch < 8
    movq      (%rsp,%rbx,1), %r8            # Copy QW from stack to QW reg
    movq      %r8, 0x10(%r14,%rdx,1)        # Copy the same to the result
    add       $0x08, %rdx                   # Add 8 to counter
    add       $0x08, %rbx                   # Add 8 to src-counter
    sub       $0x08, %rax                   # Subtract 8 from remaining
.LltEightRemaining:                         # Handle < 8
    cmp       $0x04, %rax                   # Compare remaining with 4
    jl        .LltFourRemaining             #   branch < 4
    movl      (%rsp,%rbx,1), %r8d           # Copy DW from stack to DW reg
    movl      %r8d, 0x10(%r14,%rdx,1)       # Copy DW to result
    add       $0x04, %rdx                   # Add 4 to counter
    add       $0x04, %rbx                   # Add 4 to src-counter
    sub       $0x04, %rax                   # Subtract 4 from remaining
.LltFourRemaining:
    cmp       $0x02, %rax                   # Compare remaining with 2
    jl        .LltTwoRemaining              #   branch < 2
    movw      (%rsp,%rbx,1), %r8w           # Copy W from stack to W reg
    movw      %r8w, 0x10(%r14,%rdx,1)       # Copy W to result
    add       $0x02, %rdx                   # Add 2 to counter
    add       $0x02, %rbx                   # Add 2 to src-counter
    sub       $0x02, %rax                   # Subtract 2 from remaining
.LltTwoRemaining:
    cmp       $0x01, %rax                   # Compare remaining with 1
    jl        .LnoneRemaining               #   branch if nothing left
    movb      (%rsp,%rbx,1), %r8b           # Copy byte from stack to byte reg
    movb      %r8b, 0x10(%r14,%rdx,1)       # Copy byte to result
.LnoneRemaining:
    jmp       .LsuccessExit
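For anyone who finds the descending-width copy easier to follow in C, this is roughly the same logic (a sketch; the buffer and parameter names are mine):

#include <stdint.h>
#include <string.h>

/* Sketch of the tail copy: the 16 processed boolean bytes sit in a
 * temporary buffer, and only `remaining` of them (0-15) are copied to the
 * result in 8-, 4-, 2- and 1-byte chunks. */
static void copy_tail(uint8_t *dst, const uint8_t *tmp, size_t remaining)
{
    size_t off = 0;
    if (remaining >= 8) { memcpy(dst + off, tmp + off, 8); off += 8; remaining -= 8; }
    if (remaining >= 4) { memcpy(dst + off, tmp + off, 4); off += 4; remaining -= 4; }
    if (remaining >= 2) { memcpy(dst + off, tmp + off, 2); off += 2; remaining -= 2; }
    if (remaining >= 1) { dst[off] = tmp[off]; }
}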

I’m not going to go into an analysis of this code in any great depth — but a comparison of the two different functions from the same q session shows the following results (the AVX version is on top):

q).pmc.avgres:{[t;n] flip key[d]!enlist each value d:avg[n#t] }
q).pmc.setAvx[]
q)avx:.pmc.script4`usr; 
q).pmc.setSse[]
q)sse:.pmc.script4`usr; 
q).pmc.avgres[avx;-1024] uj .pmc.avgres[sse;-1024]
instAny clkCore  clkRef   UopsAny  MemUopsLd L3Miss   StallCycles MHz      nanos   
-----------------------------------------------------------------------------------
468950  208697   165777.4 485184.3 46915.16  1.169922 31703.37    3399.318 61399.37
468954  208664.1 165706.5 515918.6 78157.2   3.637695 42506.97    3400.265 61372.37

Some items of note:
- an insignificant difference in the number of instructions executed between the two variants (instAny);
- a noticeable difference in the number of µ-ops issued (UopsAny);
- a very significant difference in the number of memory-load µ-ops issued (MemUopsLd);
- a very significant difference in the number of L3 misses (L3Miss);
- a very significant difference in the number of stall-cycles experienced by the two different functions (StallCycles);
- an entirely equivalent average CPU clock speed for each test run (MHz, derived from clkCore and clkRef)… yet
- an almost identical number of reference clocks (clkRef) and hence wall-time for both variants. In light of the above, I find this very difficult to explain — the AVX version has everything going for it. It’s like playing top trumps but losing in spite of having a card with all the best scores. If you’ve got any ideas, please drop me a line?

Like the last time, I’ve included the code for this post on GitHub, complete with licence details etc.