In my last post on a vectorised bitwise-OR function for kdb+, I wondered towards the end what the next interesting step would be — should it be the production of an AVX variant of the same code, or should it be the modification of the code to handle short as well as byte values? Well, I opted for the former, and found out that it really wasn't worth the trouble; the biggest benefit was realising some of the limitations of the AVX instruction set. It's worth sharing, since this may help others in the future and will help produce a faster version when the AVX2 instructions come out... which is a hint.

When someone starts talking about "AVX" it's usual to think of 256-bit, 32-byte, half-cache-line-wide vector registers and instructions.
The thing is, only some of the AVX instructions can operate on the 32-byte YMM registers, and pending the arrival of Haswell and the AVX2 instruction set, many of them offer no obvious advantage over their SSE counterparts. In his "Optimizing subroutines in assembly language" manual (PDF), Agner Fog states (s. 13.6):
The AVX instruction set supports floating point operations in 256-bit vectors. Integer operations in 256-bit vectors require the AVX2 instruction set.
However, while some AVX instructions don't provide access to wider registers, they're necessary since there is a cost to mixing AVX and legacy SSE instructions. Again, per Agner Fog:
All the existing XMM instructions have two versions in the AVX instruction set. A legacy version which leaves the upper half (bit 128-255) of the target YMM register unchanged, and a VEX-prefix version which sets the upper half to zero in order to remove false dependences. If a code sequence contains both 128-bit and 256-bit instructions then it is important to use the VEX-prefix version of the 128-bit instructions. Mixing 256 bit instructions with legacy 128-bit instructions without VEX prefix will cause the processor to switch the register file between different states which may cost many clock cycles on Intel's Sandy Bridge processor.
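To make that concrete, here's a minimal sketch (not taken from the kdb+ function itself) of the kind of mix Agner Fog warns about, and of the vzeroupper guard that avoids the register-file state switch:

# Hypothetical sketch: mixing a 256-bit AVX instruction with a legacy
# (non-VEX) SSE instruction forces Sandy Bridge to save/restore the upper
# halves of the YMM registers, which can cost many clock cycles.
        vmovaps (%rsi), %ymm0        # 256-bit AVX load: upper YMM halves now in use
        pxor    %xmm1, %xmm1         # legacy SSE encoding -> expensive state transition

# Preferred: stay with VEX-encoded instructions, and issue vzeroupper before
# control passes to code that may use legacy SSE.
        vmovaps (%rsi), %ymm0
        vpxor   %xmm1, %xmm1, %xmm1  # VEX-encoded 128-bit form, zeroes bits 128-255
        vzeroupper                   # clear upper halves before legacy-SSE code runs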
So to take my current use-case, the logical, vectorised comparison of individual byte values, there's no benefit to using vpcmpeqb, vpxor and vpand over the SSE pcmpeqb, pxor and pand instructions since they all only operate on XMM registers — it's just a question of cost-avoidance. Some performance testing of the converted bitwise-OR function bears this out, and the AVX version is, in fact, ever so slightly slower (although I'm happy to take responsibility for that).
The conversion of the existing SSE code was fairly straightforward, and once I realised that I wasn't going to have to deal with YMM registers for the remainder handling, the loop-accounting was identical to the SSE version. By that I mean I avoided having to deal with fewer than a full YMM register's worth of byte-values, and with how to copy those fewer-than-32 bytes to the result-vector efficiently. The steps were quite straightforward:
- ensure that memory-reads and writes are properly aligned; and
- ensure that only AVX instructions are used.
kdbavxor.s
# (c) Michael Guyver, 2012, all rights reserved. Please see the copy on GitHub for the full licence.

        .section .rodata
        .align  0x10
.LbadTypeStr:
        .asciz  "vectype"

        .section .text
        .globl  mgBitwiseOr
        .type   mgBitwiseOr, STT_FUNC

.LhandleBadType:                            # Handler for invalid input vector-type
        lea     .LbadTypeStr(%rip), %rdi    # Load address of error message
        callq   krr@PLT                     # Call kdb's error-reporting function
        jmp     .LrestoreStackForExit       # Exit via the clean-up function

.LsuccessExit:
        mov     %r14, %rax                  # Copy result address to RAX

.LrestoreStackForExit:
        mov     %rbp, %rsp                  # Restore SP
        pop     %r15                        # Restore non-durable registers
        pop     %r14
        pop     %r13
        pop     %r12
        pop     %rbx
        pop     %rbp
        vzeroupper                          # Guard against expensive AVX->SSE transition
        ret                                 # Exit to caller

        .align  0x10                        # Align entry-point to 16-byte boundary
mgBitwiseOr:
        push    %rbp                        # Push all non-durable registers
        push    %rbx
        push    %r12
        push    %r13
        push    %r14
        push    %r15
        mov     %rsp, %rbp                  # Store updated SP
        sub     $0x10, %rsp                 # Ensure alignment reserves at least 16 bytes
        and     $-0x20, %rsp                # Align SP to 32-byte boundary
.LstoreInputArgs:
        mov     %rdx, 0x08(%rsp)            # Store arg2
        mov     %rcx, 0x00(%rsp)            # Store arg3
        mov     %rdi, %r12                  # Copy/store arg0
        mov     %rsi, %r13                  # Copy/store arg1
        vzeroupper                          # Guard against expensive SSE->AVX transition
.LtestInputType:
        cmpb    $4, 2(%rdi)                 # Compare input vec-type with 4
        jne     .LhandleBadType             # branch if not-equal
.LcreateResultObject:
        mov     $1, %rdi                    # arg0: vector-type
        movq    8(%r12), %rsi               # arg1: copy input vector-length
        callq   ktn@PLT                     # Create result vector
        mov     %rax, %r14                  # Safe-store result pointer
.LsetupAvxConstants:
        xor     %rax, %rax                  # Zero register
        movb    8(%r13), %al                # Copy mask byte-value
        movq    %rax, %xmm0                 # Copy again to XMM0
        vpxor   %xmm1, %xmm1, %xmm1         # Clear XMM1: arg[2] = arg[0] ^ arg[1]
        vpshufb %xmm1, %xmm0, %xmm0         # Byte-wise shuffle: msk=arg[0], val=arg[1], dst=arg[2]; 2B/4-282, p. 284
        vpcmpeqw %xmm2, %xmm2, %xmm2        # Set all bits in XMM2
        vpcmpeqw %xmm3, %xmm3, %xmm3        # Set all bits in XMM3: 0xffff_ffff_ffff_ffff_...
        vpsrlw  $15, %xmm3, %xmm3           # Shifts right by 15 bits: 0x0001_0001_0001_0001_...
        vpackuswb %xmm3, %xmm3, %xmm3       # Packs: 0x0101_0101_0101_0101_...
.LsetupLoopCounter:
        xor     %rdx, %rdx                  # Clear loop-counter
        mov     8(%r14), %r15               # Copy result->length to R15
.LloopStart:
        mov     %r15, %rax                  # Copy vec-len
        sub     %rdx, %rax                  # Calculate remaining
        je      .LsuccessExit               # branch to exit-handler if remaining == 0
        cmp     $0x10, %rax                 # Compare remaining with 16
        jl      .LsubXmmRemaining           # branch if < 16
        je      .LprocessOneXmm             # branch if == 16
        lea     0x10(%r12,%rdx), %rbx       # Calculate address of read cursor
        test    $0x3f, %rbx                 # Test for 64-byte alignment
        jnz     .LprocessOneXmm             # if not aligned, jump to generic handler
        cmp     $0x40, %rax                 # Read is aligned, check num bytes remaining
        jge     .LprocessCacheLine          # if >= 64, jump to cache-line handler
.LprocessOneXmm:
        vmovaps 0x10(%r12,%rdx,1), %xmm4    # Load 16 bytes of input
        vpand   %xmm0, %xmm4, %xmm4         # AND with broadcast mask byte
        vpcmpeqb %xmm1, %xmm4, %xmm4        # Byte-wise compare with zero: equal -> 0xff
        vpxor   %xmm2, %xmm4, %xmm4         # Invert: 0xff <-> 0x00
        vpand   %xmm3, %xmm4, %xmm4         # Convert 0xff to 0x01 (kdb+ boolean)
        vmovaps %xmm4, 0x10(%r14,%rdx,1)    # Store 16 result bytes
        add     $0x10, %rdx
        jmp     .LloopStart
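As a worked example of what the five-instruction core above computes for a single byte (the mask value 0x02 and the input byte 0x06 are chosen purely for illustration):

# Hypothetical walk-through of one byte through the core sequence, with the
# mask 0x02 broadcast into XMM0, XMM1 = 0x00..., XMM2 = 0xff..., XMM3 = 0x01...:
        vpand    %xmm0, %xmm4, %xmm4   # 0x06 & 0x02           = 0x02
        vpcmpeqb %xmm1, %xmm4, %xmm4   # (0x02 == 0x00) ? 0xff : 0x00 = 0x00
        vpxor    %xmm2, %xmm4, %xmm4   # 0x00 ^ 0xff           = 0xff
        vpand    %xmm3, %xmm4, %xmm4   # 0xff & 0x01           = 0x01
# i.e. the result byte is 0x01 exactly when the input byte has at least one
# of the mask bits set, which is the boolean vector handed back to kdb+.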
The next section contains some minor differences from the original SSE version which I thought it worth mentioning. In an attempt to lower the work done by the memory controller, I have issued two vmovaps instead of four movaps instructions. The fact that the memory controller does "less" work is measurable insofar as this code issues 60.0% of the number of MEM_UOP_RETIRED.LOADS events.
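In other words, the load side of one 64-byte iteration shrinks from four 16-byte loads to two 32-byte loads. The SSE half of this sketch is a reconstruction rather than a verbatim excerpt of the previous post's code:

# SSE version (reconstructed for illustration): four 16-byte loads per cache line
        movaps  0x10(%r12,%rdx,1), %xmm4
        movaps  0x20(%r12,%rdx,1), %xmm5
        movaps  0x30(%r12,%rdx,1), %xmm6
        movaps  0x40(%r12,%rdx,1), %xmm7

# AVX version: two 32-byte loads per cache line
        vmovaps 0x10(%r12,%rdx,1), %ymm4
        vmovaps 0x30(%r12,%rdx,1), %ymm6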
In order to unpack the 32-byte YMM registers into XMM registers, I've used the vperm2f128 instruction, with the imm8 value selecting which 128-bit lane to copy: the YMM high double-quad-word/octo-word is copied to the low DQW/OW of another YMM register — or in other words, into an XMM register. I use the same process in reverse so that the result is written to memory in two 32-byte chunks.

        .align  0x10                        # Align jump target to 16-byte boundary
.LprocessCacheLine:
        vmovaps 0x10(%r12,%rdx,1), %ymm4    # Copy 32 bytes to YMM4
        vmovaps 0x30(%r12,%rdx,1), %ymm6    # Copy 32 bytes to YMM6
        prefetcht0 0x310(%r12,%rdx,1)       # Hint to prefetch cache line several loops hence
        vperm2f128 $0x01, %ymm4, %ymm4, %ymm5 # Unpack high-bytes from YMM4 to XMM5
        vperm2f128 $0x01, %ymm6, %ymm6, %ymm7 # Unpack high-bytes from YMM6 to XMM7
        vpand   %xmm0, %xmm4, %xmm4         # Perform bitwise comparison
        vpand   %xmm0, %xmm5, %xmm5         #
        vpand   %xmm0, %xmm6, %xmm6         #
        vpand   %xmm0, %xmm7, %xmm7         #
        vpcmpeqb %xmm1, %xmm4, %xmm4        # Byte-wise compare with zero
        vpcmpeqb %xmm1, %xmm5, %xmm5        #
        vpcmpeqb %xmm1, %xmm6, %xmm6        #
        vpcmpeqb %xmm1, %xmm7, %xmm7        #
        vpxor   %xmm2, %xmm4, %xmm4         # Invert
        vpxor   %xmm2, %xmm5, %xmm5         #
        vpxor   %xmm2, %xmm6, %xmm6         #
        vpxor   %xmm2, %xmm7, %xmm7         #
        vpand   %xmm3, %xmm4, %xmm4         # Convert 0xff to 0x01
        vpand   %xmm3, %xmm5, %xmm5         #
        vpand   %xmm3, %xmm6, %xmm6         #
        vpand   %xmm3, %xmm7, %xmm7         #
        vperm2f128 $0x20, %ymm4, %ymm5, %ymm4 # Copy XMM5 to the high-bytes of YMM4
        vperm2f128 $0x20, %ymm6, %ymm7, %ymm6 # Copy XMM7 to the high-bytes of YMM6
        vmovaps %ymm4, 0x10(%r14,%rdx,1)    # Copy 32 bytes to result
        vmovaps %ymm6, 0x30(%r14,%rdx,1)    # Copy 32 bytes to result
        add     $0x40, %rdx                 # Loop accounting: add 64 to counter
        mov     %r15, %rax                  # Copy result length
        sub     %rdx, %rax                  # Subtract counter from result.len
        cmp     $0x40, %rax                 # Compare result with 64
        jge     .LprocessCacheLine          # if greater or equal, loop again
        jmp     .LloopStart                 # Otherwise, enter generic loop handler
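For reference, here's a minimal sketch of how the imm8 value drives vperm2f128's lane selection, per the Intel manual (AT&T operand order is $imm8, src2, src1, dst): bits [1:0] choose dst[127:0] and bits [5:4] choose dst[255:128], each from {0: src1[127:0], 1: src1[255:128], 2: src2[127:0], 3: src2[255:128]}.

# Sketch of the lane selection used in the loop above:
        vperm2f128 $0x01, %ymm4, %ymm4, %ymm5 # ymm5[127:0] = ymm4[255:128]: high lane now usable as XMM5
        vperm2f128 $0x20, %ymm4, %ymm5, %ymm4 # rebuild a 32-byte value from one 128-bit lane of each source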
The following code is, again, functionally identical to the SSE variant except for the replacement of legacy SSE instructions with their AVX counterparts. This section of code copies less than a full XMM register's worth of bytes to the result-vector. Say, for example, there are 7 bytes remaining to copy: it would skip the quad-word copy (8 bytes) but would copy the next double-word (4 bytes), word (2 bytes) and byte (1 byte) of the XMM value now on the stack to the result vector (4 + 2 + 1 = 7).
.LsubXmmRemaining:
        vmovaps 0x10(%r12,%rdx,1), %xmm4    # Load 16-bytes of input
        vpand   %xmm0, %xmm4, %xmm4         # compare
        vpcmpeqb %xmm1, %xmm4, %xmm4        # set to 0xff if equal
        vpxor   %xmm2, %xmm4, %xmm4         # flip bytes
        vpand   %xmm3, %xmm4, %xmm4         # convert 0xff to 0x01
        vmovaps %xmm4, (%rsp)               # Copy boolean results to stack
        xor     %rbx, %rbx                  # Zero RBX for counting bytes copied
        mov     %r15, %rax                  # Copy veclen(result)
        sub     %rdx, %rax                  # Calc remaining
        cmp     $0x08, %rax                 # Compare remaining with 8
        jl      .LltEightRemaining          # branch < 8
        movq    (%rsp,%rbx,1), %r8          # Copy QW from stack to QW reg
        movq    %r8, 0x10(%r14,%rdx,1)      # Copy the same to the result
        add     $0x08, %rdx                 # Add 8 to counter
        add     $0x08, %rbx                 # Add 8 to src-counter
        sub     $0x08, %rax                 # Subtract 8 from remaining
.LltEightRemaining:                         # Handle < 8
        cmp     $0x04, %rax                 # Compare remaining with 4
        jl      .LltFourRemaining           # branch < 4
        movl    (%rsp,%rbx,1), %r8d         # Copy DW from stack to DW reg
        movl    %r8d, 0x10(%r14,%rdx,1)     # Copy DW to result
        add     $0x04, %rdx                 # Add 4 to counter
        add     $0x04, %rbx                 # Add 4 to src-counter
        sub     $0x04, %rax                 # Subtract 4 from remaining
.LltFourRemaining:
        cmp     $0x02, %rax
        jl      .LltTwoRemaining
        movw    (%rsp,%rbx,1), %r8w
        movw    %r8w, 0x10(%r14,%rdx,1)     # Copy W to result
        add     $0x02, %rdx                 # Add 2 to counter
        add     $0x02, %rbx                 # Add 2 to src-counter
        sub     $0x02, %rax                 # Subtract 2 from remaining
.LltTwoRemaining:
        cmp     $0x01, %rax
        jl      .LnoneRemaining
        movb    (%rsp,%rbx,1), %r8b
        movb    %r8b, 0x10(%r14,%rdx,1)     # Copy byte to result
.LnoneRemaining:
        jmp     .LsuccessExit
I'm not going to go into an analysis of this code in any great depth — but a comparison of the two different functions from the same q session shows the following results (the AVX version is on top):
q).pmc.avgres:{[t;n] flip key[d]!enlist each value d:avg[n#t] }
q).pmc.setAvx[]
q)avx:.pmc.script4`usr;
q).pmc.setSse[]
q)sse:.pmc.script4`usr;
q).pmc.avgres[avx;-1024] uj .pmc.avgres[sse;-1024]
instAny clkCore  clkRef   UopsAny  MemUopsLd L3Miss   StallCycles MHz      nanos
-----------------------------------------------------------------------------------
468950  208697   165777.4 485184.3 46915.16  1.169922 31703.37    3399.318 61399.37
468954  208664.1 165706.5 515918.6 78157.2   3.637695 42506.97    3400.265 61372.37
Some items of note:
- an insignificant difference in the number of instructions executed between the two variants (instAny);
- a noticeable difference in the number of µ-ops issued (UopsAny);
- a very significant difference in the number of memory-load µ-ops issued (MemUopsLd);
- a very significant difference in the number of L3 misses (L3Miss);
- a very significant difference in the number of stall-cycles experienced by the two different functions (StallCycles);
- an entirely equivalent average CPU clock speed for each test run (MHz, derived from clkCore and clkRef)... yet
- an almost identical number of reference clocks (clkRef) and hence wall-time for both variants.

In light of the above, I find this very difficult to explain — the AVX version has everything going for it. It's like playing Top Trumps but losing in spite of having a card with all the best scores. If you've got any ideas, please drop me a line.
Like the last time, I've included the code for this post on GitHub, complete with licence details etc.