Recently it has come up on the LKML and other blogs about how C is not thread safe. The nice piece of example code given is

int trylock()
{
int res;

if (res == 0)
++acquires_count;

return res;
}

At first glance it seems OK, and I wasn't exactly sure what the problem was until I pulled it apart. GCC is well within its rights to produce the following code (annotated for clarity).

push   %ebp                     ; setting up stack frame
mov    %esp,%ebp                ; save esp to ebp
sub    \$0x8,%esp                ; make some room on stack
movl   <mutex>,(%esp)           ; copy mutex to stack
mov    <acquires_count>,%edx    ; copy acquires_count to %edx
cmp    \$0x1,%eax                ; set carry flag if result == 0
mov    %edx,<acquires_count>    ; write back acquires_count
leave                           ; release stack frame
ret                             ; done!

This code uses a bit of a trick to avoid a branch around code. It uses the carry flag, which is simple enough to understand when adding. Consider if you are adding 0001 + 1111 you end up with 11111, where the top bit is the carry bit. You can then feed this carry bit into another part of the addition operation if you were, say, using two 32 bit numbers to represent a 64 bit number on a 32 bit system.

The trick uses the dual nature of the carry flag as a "borrow" bit; if a < b when a - b then the bit is set. Consider the simplest case of 0 - 1 -- the carry bit is set to indicate to the next operation that a bit needs to be borrowed. Imagine for a moment we have a four bit machine and want to do the 8-bit unsigned operation

1111 0000 -   (240) -
0000 0001     (  1)
==== ====

First we do 0000 - 0001 which leaves us 1111 with the borrow bit set. Then we do 1111 - 0000 but take note that the borrow bit is set, so get rid of the borrowed bit, leaving us 1110. Putting the two together gives us 1110 1111 or 239.

Therefore the cmp \$0x1,<result> instruction (which is really just result - 1 with no output but setting flags) can only set the carry flag (i.e. borrow a bit) if the value is 0; any other value in result means there will be no need to borrow and the carry flag is not set.

Consequently the add with carry (adc) will add one to the result only if the carry bit was set. A neat trick which avoids a jump, but not so much thread safe: if you are interrupted after the cmp but before the mov store you can write back the wrong value. Most people (including me till 5 minutes ago) looking at the code would assume that it was safe on the theory that acquires_count is not touched if you don't have the lock because it is inside the if statement.

Interestingly, the Intel compiler does the simpler jump method

push   <mutex>                     ; push argument
pop    %ecx                        ; get return addr on top
test   %eax,%eax                   ; AND %eax,%eax
jne    <trylock+0x16>  --------+   ; if that wasn't 0, jump out
ret                    <-------+   ; done!

Frankly I have no idea which one is faster/better (I would guess the second one because you don't always dirty a cache line, but you'd have to measure that against the cost of the branch) but obviously GCC thinks one thing and Intel thinks another and I bet if you put them in a room together you would end up with 3 opinions! I think the major point is that once upon a time it was mostly people writing kernels doing things like accessing I/O ports or other memory mapped areas that needed to be very careful the compiler didn't optimise out (or in) something incorrectly. With multi-threaded C code that problem moves up to the application developer too.

A different instruction set can help here. The Itanium version is possibly less prone to this sort of thing because predication makes it faster to implement conditional branches -- any instruction with a predicate that is not set is skipped. The load and add is safe because it will only be done with the lock held. However, if anyone is wondering why Itanium come with such big caches, the difference between the 6 line version and the Itanium version below should answer that.

[MII]       alloc r32=ar.pfs,4,3,0               ; alloc stack frame
mov r33=b0                           ; return pointer
mov r34=r1
[MMI]       addl r14=72,r1;;                     ; r11 = &mutex from GOT
ld8 r11=[r14]                        ;
nop.i 0x0;;
[MIB]       mov r35=r11                          ; put into argument 1
nop.i 0x0
[MII]       cmp4.eq.unc p0,p6=r8,r0              ; if result !0, set predicate p6
mov.i ar.pfs=r32
mov r1=r34;;                         ; load acquires_count from GOT
nop.i 0x0
[MII]       ld4 r9=[r10]
mov b0=r33