Some Assembly Required - The Trouble With FSUB

It’s not often you get a really nasty surprise when writing software, and even less often that the nasty surprise is lurking in the compiler (or assembler). It turns out that the Gnu assembler does not treat the operands of instructions such as fsub and fdiv uniformly — in fact, some of the time it does the opposite of what you instruct it to do.

Were I to start ranting about this state of affairs, I would probably mention in passing that one of the attractions of writing assembly was that you get exactly what you expect, rather than a compiler writer’s interpretation of your interpretation of a higher-level language, and that it might be a shame if someone managed to corrupt your intention when you wrote “subtract y from x” and instead substituted “subtract x from y”.

The bug, as it was christened long before I happened upon it, occurs when using the Gnu assembler with a non-commutative x87 FPU instruction where the source operand is %st(0), and is documented fairly quietly in the Gnu Assembler documentation. Let’s have a look at the fsub instruction.

The Intel manual provides the following:

Opcode	Instruction	Description
`D8 E0+i`	`FSUB ST(0), ST(i)`	Subtract ST(i) from ST(0) and store result in ST(0).
`DC E8+i`	`FSUB ST(i), ST(0)`	Subtract ST(0) from ST(i) and store result in ST(i).
`DE E8+i`	`FSUBP ST(i), ST(0)`	Subtract ST(0) from ST(i), store result in ST(i), and pop register stack.

It continues:

Subtracts the source operand from the destination operand and stores the difference in the destination location. The destination operand is always an FPU data register; the source operand can be a register or a memory location. Source operands in memory can be in single-precision or double-precision floating-point format or in word or doubleword integer format.

So far, so good. There is of course the usual operand reversal for gas:

; Intel
    fsub     st(0),  st(1)         ; st(0) <-- st(0)-st(1)

# Gnu Assembler
    fsub     %st(1),  %st(0)       # st(0) <-- st(0)-st(1)

In the example above, the source operand is %st(1), so all’s well and you get what you expected. However, if you write

    fsub     %st(0),  %st(1)       # st(1) <-- st(1)-st(0) ... or is it?

what you get is not st(1) <-- st(1)-st(0), but rather st(1) <-- st(0)-st(1). Has it reversed the operands for the subtraction (but at least had the decency to store the result in the right register)?

Unfortunately, this is exactly what it does, as you can see for yourself if you compile and execute the fsubtest.s application whose listing appears below (a simple Makefile is listed too). The output of the program is as follows:

input:  st(0): 5.0
        st(1): 7.0
instr:  fsub %st(0), %st(1)
output: st(0): 5.0
        st(1): -2.0             [<- No: right register, wrong result, Ed.]

input:  st(0): 5.0
        st(1): 7.0
instr:  fsub %st(1), %st(0)
output: st(0): -2.0
        st(1): 7.0

input:  st(0): 5.0
        st(1): 7.0
instr:  fsubr %st(0), %st(1)
output: st(0): 5.0
        st(1): 2.0             [<- No: right register, wrong result, Ed.]

input:  st(0): 5.0
        st(1): 7.0
instr:  fsubr %st(1), %st(0)
output: st(0): 2.0
        st(1): 7.0

I don’t know whether the following table makes things any easier to follow, but the expected and actual outcomes are made explicit:

instruction		assembly	example	behaviour
`fsub %st(0), %st(1)`	expected	`%st(1)= %st(1)-%st(0)`	`%st(1) = 7-5 = 2`
	actual	`%st(1)= %st(0)-%st(1)`	`%st(1) = 5-7 = -2`	“wrong”
`fsub %st(1), %st(0)`	expected	`%st(0)= %st(0)-%st(1)`	`%st(0) = 5-7 = -2`
	actual	`%st(0)= %st(0)-%st(1)`	`%st(0) = 5-7 = -2`
`fsubr %st(0), %st(1)`	expected	`%st(1)= %st(0)-%st(1)`	`%st(1) = 5-7 = -2`
	actual	`%st(1)= %st(1)-%st(0)`	`%st(1) = 7-5 = 2`	“wrong”
`fsubr %st(1), %st(0)`	expected	`%st(0)= %st(1)-%st(0)`	`%st(0) = 7-5 = 2`
	actual	`%st(0)= %st(1)-%st(0)`	`%st(0) = 7-5 = 2`

The table hopefully shows how the assembler fails to produce the correct opcodes for the fsub and fsubr instructions where %st(0) is the source operand. Put another way, the assembler has fsub %st(0), %st(n) mixed up with fsubr %st(0), %st(n).
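
The four cases in the table can be modelled in a few lines of Python. This is a minimal sketch of the behaviour observed above, not of how gas is implemented: the function name x87_sub and the sysv_compat flag are hypothetical, and only the two-register fsub/fsubr forms are covered.

```python
def x87_sub(mnemonic, src, dst, stack, sysv_compat=True):
    """Return a copy of `stack` after `mnemonic %st(src), %st(dst)`.

    Operands are in AT&T order (source first); stack[0] is %st(0).
    sysv_compat=True models the assembler behaviour described above.
    """
    if mnemonic not in ("fsub", "fsubr"):
        raise ValueError("only fsub and fsubr are modelled")
    reverse = mnemonic == "fsubr"
    # The quirk: with %st(0) as source and %st(i) as destination,
    # the opposite operation ends up in the object file.
    if sysv_compat and src == 0 and dst != 0:
        reverse = not reverse
    out = list(stack)
    if reverse:
        out[dst] = stack[src] - stack[dst]  # fsubr: dst <- src - dst
    else:
        out[dst] = stack[dst] - stack[src]  # fsub:  dst <- dst - src
    return out


stack = [5.0, 7.0]  # %st(0) = 5.0, %st(1) = 7.0, as in the test program
print(x87_sub("fsub", 0, 1, stack))                     # [5.0, -2.0]  what you get
print(x87_sub("fsub", 0, 1, stack, sysv_compat=False))  # [5.0, 2.0]   what you asked for
```

With sysv_compat left at its default, the four calls corresponding to the test program reproduce the output listed earlier; with it set to False they reproduce the "expected" rows of the table.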

I only found out about the bug as I was trying to translate some FASM code into gas and couldn’t understand the reason for the incorrect result. The problem was worse as the only logical explanation was that the compiler was broken — which is usually a good indicator that your logic is flawed and you need to check it again.

The other thing which surprises me is that this bug isn’t the top result for a search such as “(Gnu assembler|gas) fsub”. I couldn’t find mention of it in Blum’s Professional Assembly Language book, a very good reference regardless. It turns out that the fdiv instruction family is also mangled by the Gnu assembler. Alan Modra, who has dealt with this bug on the Sourceware and GCC mailing lists since at least 1999, writes:

 Here are examples of `broken' opcodes.  You might like to verify that your
Unixware assemblers produce the same.
   1 0000 DCE3           fsub %st,%st(3)
   2 0002 DCEB           fsubr %st,%st(3)
   3 0004 DCF3           fdiv %st,%st(3)
   4 0006 DCFB           fdivr %st,%st(3)
   5 0008 DEE3           fsubp %st,%st(3)
   6 000a DEEB           fsubrp %st,%st(3)
   7 000c DEF3           fdivp %st,%st(3)
   8 000e DEFB           fdivrp %st,%st(3)
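
The listing follows a regular pattern, which the Python sketch below tries to capture. The second-byte bases are the Intel encodings for the `OP ST(i), ST(0)` register forms (first byte DC, or DE for the popping variants); the names encode, INTEL_BASE and SWAP are invented for this example and do not correspond to anything in the binutils sources.

```python
# Intel encodings of `OP ST(i), ST(0)`: second byte is base + i.
INTEL_BASE = {"fsubr": 0xE0, "fsub": 0xE8, "fdivr": 0xF0, "fdiv": 0xF8}

# In the compatible ("broken") mode each mnemonic is assembled as its
# reversed counterpart when the source register is %st.
SWAP = {"fsub": "fsubr", "fsubr": "fsub", "fdiv": "fdivr", "fdivr": "fdiv"}


def encode(mnemonic, i, sysv_compat=True):
    """Return the two bytes for `mnemonic %st, %st(i)` (AT&T operand order)."""
    pop = mnemonic.endswith("p")          # fsubp, fsubrp, fdivp, fdivrp
    op = mnemonic[:-1] if pop else mnemonic
    if op not in INTEL_BASE or not 0 <= i <= 7:
        raise ValueError("unsupported instruction or register")
    if sysv_compat:
        op = SWAP[op]
    return bytes([0xDE if pop else 0xDC, INTEL_BASE[op] + i])


for name in ("fsub", "fsubr", "fdiv", "fdivr",
             "fsubp", "fsubrp", "fdivp", "fdivrp"):
    print(encode(name, 3).hex().upper(), f"{name} %st,%st(3)")
```

Run with the default sysv_compat=True, the loop prints the same byte pairs as the quoted listing; passing sysv_compat=False gives the bytes an Intel-conformant assembler would be expected to emit.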

Here’s the short test program I mentioned earlier. It passes a function pointer as a parameter to a routine which loads two values onto the FPU stack, calls the function through the pointer, and then prints the inputs together with the resulting %st(0) and %st(1). It’s very simple:

fsubtest.s

.section .data
msgfmt:
        .ascii "input:  st(0): %.1f\n"
        .ascii "        st(1): %.1f\n"
        .ascii "instr:  %s\n"
        .ascii "output: st(0): %.1f\n"
        .asciz "        st(1): %.1f\n\n"
s_subst0st1:
        .asciz "fsub %st(0), %st(1)"
s_subst1st0:
        .asciz "fsub %st(1), %st(0)"
s_subrst0st1:
        .asciz "fsubr %st(0), %st(1)"
s_subrst1st0:
        .asciz "fsubr %st(1), %st(0)"

st0:
        .double 5.0
st1:
        .double 7.0

.section .bss
.lcomm result, 0x8

.section .text
.globl _start
_start:

        nop
        lea     fsubst0st1, %rdi
        call    finvoker

        lea     fsubst1st0, %rdi
        call    finvoker

        lea     fsubrst0st1, %rdi
        call    finvoker

        lea     fsubrst1st0, %rdi
        call    finvoker

        xor     %rdi, %rdi
        call    exit

fsubst0st1:
        fsub    %st(0), %st(1)
        lea     s_subst0st1, %rsi
        ret

fsubst1st0:
        fsub    %st(1), %st(0)
        lea     s_subst1st0, %rsi
        ret

fsubrst0st1:
        fsubr   %st(0), %st(1)
        lea     s_subrst0st1, %rsi
        ret

fsubrst1st0:
        fsubr   %st(1), %st(0)
        lea     s_subrst1st0, %rsi
        ret

finvoker:
        push    %rbp                         # Store base-pointer
        mov     %rsp, %rbp                   # Remember stack-pointer so the alignment can be undone
        and     $~0xF, %rsp                  # Align stack-pointer for call to printf

        finit
        fldl    st1                          # Push value at st1 onto FP stack
        fldl    st0                          # Push value at st0 onto FP stack
        call    *%rdi                        # Invoke function pointer (also sets RSI to the instruction text)
        movsd   st0, %xmm0                   # Copy value at st0 to XMM0
        movsd   st1, %xmm1                   # Copy value at st1 to XMM1

        fstpl   result                       # Copy st(0) to result and pop FP stack
        movsd   result, %xmm2                # Copy value from result to XMM2
        fstpl   result                       # Repeat for next top-of-stack
        movsd   result, %xmm3                # Copy value to XMM3

        lea     msgfmt, %rdi                 # Load address of msgfmt into RDI
        mov     $0x4, %al                    # Number of vector registers (XMM0-XMM3) used for varargs
        call    printf                       # Print inputs, instruction text and results
        mov     %rbp, %rsp                   # Undo the stack alignment
        pop     %rbp                         # Restore base pointer prior to return
        ret

And its Makefile:

Makefile

all: fsubtest.o
	ld --dynamic-linker /lib/ld-linux-x86-64.so.2 -o fsubtest -lc fsubtest.o

fsubtest.o: fsubtest.s
	as -gstabs -o fsubtest.o fsubtest.s

clean:
	rm -f *.o fsubtest

So that’s it?

Well, actually, there’s more. It turns out that you can’t inspect what the assembler has done for you by using objdump -D myprog, as it mistranslates the opcodes again. A stock (“broken”) objdump will swear blind that the opcodes you are looking at are, in fact, the ones which you asked for:

00000000004002c5 <fsubst0st1>:
  4002c5:       dc e1                   fsub   %st,%st(1)    # NO! Those opcodes are not FSUB!
  4002c7:       48 8d 34 25 15 05 60    lea    0x600515,%rsi
  4002ce:       00 
  4002cf:       c3                      retq   

00000000004002d0 <fsubst1st0>:
  4002d0:       d8 e1                   fsub   %st(1),%st
  4002d2:       48 8d 34 25 29 05 60    lea    0x600529,%rsi
  4002d9:       00 
  4002da:       c3                      retq   

00000000004002db <fsubrst0st1>:
  4002db:       dc e9                   fsubr  %st,%st(1)    # NO! Those opcodes are not FSUBR!
  4002dd:       48 8d 34 25 3d 05 60    lea    0x60053d,%rsi
  4002e4:       00 
  4002e5:       c3                      retq   

00000000004002e6 <fsubrst1st0>:
  4002e6:       d8 e9                   fsubr  %st(1),%st
  4002e8:       48 8d 34 25 52 05 60    lea    0x600552,%rsi
  4002ef:       00 
  4002f0:       c3                      retq

However, if you use a “fixed” version of objdump, it shows you the true state of affairs. When I say “fixed” I mean that it’s been compiled with the SYSV386_COMPAT preprocessor value #defined as 0, about which more follows below.

00000000004002c5 <fsubst0st1>:
  4002c5:       dc e1                   fsubr  %st,%st(1)     # fsubr now correctly reflects the opcodes
  4002c7:       48 8d 34 25 15 05 60    lea    0x600515,%rsi
  4002ce:       00 
  4002cf:       c3                      retq

00000000004002d0 <fsubst1st0>:
  4002d0:       d8 e1                   fsub   %st(1),%st
  4002d2:       48 8d 34 25 29 05 60    lea    0x600529,%rsi
  4002d9:       00 
  4002da:       c3                      retq

00000000004002db <fsubrst0st1>:
  4002db:       dc e9                   fsub   %st,%st(1)     # fsub now correctly reflects the opcodes
  4002dd:       48 8d 34 25 3d 05 60    lea    0x60053d,%rsi
  4002e4:       00 
  4002e5:       c3                      retq

00000000004002e6 <fsubrst1st0>:
  4002e6:       d8 e9                   fsubr  %st(1),%st
  4002e8:       48 8d 34 25 52 05 60    lea    0x600552,%rsi
  4002ef:       00 
  4002f0:       c3                      retq
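
The difference between the two objdump builds can be mimicked with a tiny decoder. Again, this is a sketch under stated assumptions: it handles only the DC/DE register forms discussed here, the names disassemble, INTEL_NAMES and SWAP are hypothetical, and the real objdump pads its output differently.

```python
# Intel mnemonics for the `OP ST(i), ST(0)` forms, keyed by the second
# opcode byte with the register number masked off.
INTEL_NAMES = {0xE0: "fsubr", 0xE8: "fsub", 0xF0: "fdivr", 0xF8: "fdiv"}
SWAP = {"fsub": "fsubr", "fsubr": "fsub", "fdiv": "fdivr", "fdivr": "fdiv"}


def disassemble(code, sysv_compat=True):
    """Decode a two-byte DC/DE instruction into AT&T-style text.

    sysv_compat=True models a stock objdump; False models one built
    with SYSV386_COMPAT defined as 0.
    """
    first, second = code
    if first not in (0xDC, 0xDE) or second < 0xE0:
        raise ValueError("not one of the affected encodings")
    name = INTEL_NAMES[second & 0xF8]
    if sysv_compat:
        name = SWAP[name]
    if first == 0xDE:
        name += "p"
    return f"{name} %st,%st({second & 0x07})"


code = bytes([0xDC, 0xE1])                     # bytes found in fsubst0st1
print(disassemble(code))                       # fsub %st,%st(1)
print(disassemble(code, sysv_compat=False))    # fsubr %st,%st(1)
```

The same two bytes are reported under two different names depending on the flag, which is why the stock tool appears to confirm that the assembler did what was asked.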

Of course, what we really want is to generate the correct opcodes in the first place. To do so, you need to assemble the source file with a “fixed” as binary or use one of the other work-arounds outlined below, and you still have to use a “fixed” version of objdump to inspect the result:

00000000004002cd <fsubst0st1>:
  4002cd:       dc e9                   fsub   %st,%st(1)     # Correct
  4002cf:       48 8d 34 25 5d 05 60    lea    0x60055d,%rsi
  4002d6:       00 
  4002d7:       c3                      retq

00000000004002d8 <fsubst1st0>:
  4002d8:       d8 e1                   fsub   %st(1),%st
  4002da:       48 8d 34 25 71 05 60    lea    0x600571,%rsi
  4002e1:       00 
  4002e2:       c3                      retq

00000000004002e3 <fsubrst0st1>:
  4002e3:       dc e1                   fsubr  %st,%st(1)     # Correct
  4002e5:       48 8d 34 25 85 05 60    lea    0x600585,%rsi
  4002ec:       00 
  4002ed:       c3                      retq

00000000004002ee <fsubrst1st0>:
  4002ee:       d8 e9                   fsubr  %st(1),%st
  4002f0:       48 8d 34 25 9a 05 60    lea    0x60059a,%rsi
  4002f7:       00 
  4002f8:       c3                      retq

So what does one do to fix it?

The first mention of the issue I found is on the binutils mailing list by Alan Modra in an exchange with Horst von Brand. It seems that Modra was aware of the issue prior to this as he says:

FYI, here's a comment I added to binutils/include/opcode/i386.h, just to
make you aware of a horrible kludge.

 /* The UnixWare assembler, and probably other AT&T derived ix86 Unix
   assemblers, generate floating point instructions with reversed
   source and destination registers in certain cases.  Unfortunately,
   gcc and possibly many other programs use this reversed syntax, so
   we're stuck with it.

   eg. `fsub %st(3),%st' results in st <- st - st(3) as expected, but
   `fsub %st,%st(3)' results in st(3) <- st - st(3), rather than
   the expected st(3) <- st(3) - st !

   This happens with all the non-commutative arithmetic floating point
   operations with two register operands, where the source register is
   %st, and destination register is %st(i).  Look for FloatDR below.  */

#ifndef UNIXWARE_COMPAT
/* Set non-zero for broken, compatible instructions.  Set to zero for
   non-broken opcodes at your peril.  gcc generates UnixWare
   compatible instructions.  */
#define UNIXWARE_COMPAT 1
#endif

I would love to get rid of this stupidity, but that needs a
synchronised update of both gcc and binutils.

So that is the reason why it hasn’t been fixed: GCC is coded to expect the broken behaviour. However, somewhat confusingly, Modra later posts a patch to GCC in March 2000 and renames the preprocessor value from UNIXWARE_COMPAT to SYSV386_COMPAT (I guess that Sourceware binutils != Gnu GCC). If the value is set to 0, it causes GCC to emit the correct instructions to its assembler (in the hope that the assembler is expecting them). Just to be clear: setting SYSV386_COMPAT to 0 also fixes the as binary. To build binutils in this way you need to pass CPPFLAGS to the configure script as follows:

./configure CPPFLAGS=-DSYSV386_COMPAT=0 --prefix=/path/to/basedir/etc

Setting CPPFLAGS is the preferred way of passing preprocessor flags, and it leaves the default CFLAGS (e.g. -g -O2) in place.

The subject pops up from time to time on the mailing list, with gems such as this:

> I was reading the manual (vol 2a) and it looks like this is supposed  
> to assemble as de f9,
> am I nuts or reading something wrong?

See the comment at the start of include/opcode/i386.h.  gcc is nuts.  :)

It all seems to go quiet until 2007, when H.J. Lu gets involved and proposes to make the output selection a runtime option. This is now incorporated into gas, and specifying -mmnemonic=intel as an argument to as causes the test-case to produce the correct output bytes. However, I’m not sure what else this switch changes, as in the patch there’s a comment which reads:

+  /* intel_mnemonic implies intel_syntax.  */
+  intel_mnemonic = intel_syntax = mnemonic_flag;

Is that something that I want? What are the effects on my code if intel_syntax is enabled? The Gnu as docs suggest another (probably better) way of causing the assembler to behave in the way you expect it to, and that’s through using the .intel_mnemonic directive in the source. It’s not clear whether the directive stays in effect for the rest of the source file being assembled or only until the next appearance of the .att_mnemonic directive. The effect of .intel_mnemonic is described in the docs as follows:

9.13.5 AT&T Mnemonic versus Intel Mnemonic

as supports assembly using Intel mnemonic. .intel_mnemonic selects Intel mnemonic with Intel syntax, and .att_mnemonic switches back to the usual AT&T mnemonic with AT&T syntax for compatibility with the output of gcc. Several x87 instructions, fadd, fdiv, fdivp, fdivr, fdivrp, fmul, fsub, fsubp, fsubr and fsubrp, are implemented in AT&T System V/386 assembler with different mnemonics from those in Intel IA32 specification. gcc generates those instructions with AT&T mnemonic.

To be honest, none of this is very clear: “.intel_mnemonic selects Intel mnemonic with Intel syntax”. What does that mean? The various side-effects of compiling with SYSV386_COMPAT set to 0, assembling with -mmnemonic=intel or using the .intel_mnemonic directive are completely opaque, and they render the use of gas questionable for those writing assembler, unless they tip-toe around the set of FPU instructions which are the known trouble-makers and write “wrong” assembler to produce “right” opcodes, which is a horrible idea. There should be a dedicated switch for putting as into a sane, predictable mode for fixing these instructions. Or, if that is exactly what the Intel mnemonic switches do, could someone please make this clear in the docs and describe exactly what you get when using the switches described? Please?