technovelty

weblog of Ian Wienand

RSS  |  technovelty home  |  page of ian  |  ian@wienand.org

Non-thread safe C

Recently it has come up on the LKML and other blogs about how C is not thread safe. The nice piece of example code given is

int trylock()
{
    int res;

    res = pthread_mutex_trylock(&mutex);
    if (res == 0)
            ++acquires_count;

    return res;
}

At first glance it seems OK, and I wasn't exactly sure what the problem was until I pulled it apart. GCC is well within its rights to produce the following code (annotated for clarity).

push   %ebp                     ; setting up stack frame
mov    %esp,%ebp                ; save esp to ebp
sub    $0x8,%esp                ; make some room on stack
movl   <mutex>,(%esp)           ; copy mutex to stack
call   <pthread_mutex_trylock>  ; call pthread_mutex_trylock
mov    <acquires_count>,%edx    ; copy acquires_count to %edx
cmp    $0x1,%eax                ; set carry flag if result == 0
adc    $0x0,%edx                ; add with carry
mov    %edx,<acquires_count>    ; write back acquires_count
leave                           ; release stack frame
ret                             ; done!

This code uses a bit of a trick to avoid a branch around code. It uses the carry flag, which is simple enough to understand when adding. Consider if you are adding 0001 + 1111 you end up with 11111, where the top bit is the carry bit. You can then feed this carry bit into another part of the addition operation if you were, say, using two 32 bit numbers to represent a 64 bit number on a 32 bit system.

The trick uses the dual nature of the carry flag as a "borrow" bit; if a < b when a - b then the bit is set. Consider the simplest case of 0 - 1 -- the carry bit is set to indicate to the next operation that a bit needs to be borrowed. Imagine for a moment we have a four bit machine and want to do the 8-bit unsigned operation

1111 0000 -   (240) -
0000 0001     (  1)
==== ====

First we do 0000 - 0001 which leaves us 1111 with the borrow bit set. Then we do 1111 - 0000 but take note that the borrow bit is set, so get rid of the borrowed bit, leaving us 1110. Putting the two together gives us 1110 1111 or 239.

Therefore the cmp $0x1,<result> instruction (which is really just result - 1 with no output but setting flags) can only set the carry flag (i.e. borrow a bit) if the value is 0; any other value in result means there will be no need to borrow and the carry flag is not set.

Consequently the add with carry (adc) will add one to the result only if the carry bit was set. A neat trick which avoids a jump, but not so much thread safe: if you are interrupted after the cmp but before the mov store you can write back the wrong value. Most people (including me till 5 minutes ago) looking at the code would assume that it was safe on the theory that acquires_count is not touched if you don't have the lock because it is inside the if statement.

Interestingly, the Intel compiler does the simpler jump method

push   <mutex>                     ; push argument
call   <pthread_mutex_trylock>
pop    %ecx                        ; get return addr on top
test   %eax,%eax                   ; AND %eax,%eax
jne    <trylock+0x16>  --------+   ; if that wasn't 0, jump out
addl   $0x1,<acquires_count>   |
ret                    <-------+   ; done!

Frankly I have no idea which one is faster/better (I would guess the second one because you don't always dirty a cache line, but you'd have to measure that against the cost of the branch) but obviously GCC thinks one thing and Intel thinks another and I bet if you put them in a room together you would end up with 3 opinions! I think the major point is that once upon a time it was mostly people writing kernels doing things like accessing I/O ports or other memory mapped areas that needed to be very careful the compiler didn't optimise out (or in) something incorrectly. With multi-threaded C code that problem moves up to the application developer too.

A different instruction set can help here. The Itanium version is possibly less prone to this sort of thing because predication makes it faster to implement conditional branches -- any instruction with a predicate that is not set is skipped. The load and add is safe because it will only be done with the lock held. However, if anyone is wondering why Itanium come with such big caches, the difference between the 6 line version and the Itanium version below should answer that.

 [MII]       alloc r32=ar.pfs,4,3,0               ; alloc stack frame
             mov r33=b0                           ; return pointer
             mov r34=r1
 [MMI]       addl r14=72,r1;;                     ; r11 = &mutex from GOT
             ld8 r11=[r14]                        ;
             nop.i 0x0;;
 [MIB]       mov r35=r11                          ; put into argument 1
             nop.i 0x0
             br.call.sptk.few b0=<pthread_mutex_trylock>
 [MII]       cmp4.eq.unc p0,p6=r8,r0              ; if result !0, set predicate p6
             mov.i ar.pfs=r32
             mov r1=r34;;                         ; load acquires_count from GOT
 [MIB]       addl r10=144,r1                      ;
             nop.i 0x0
       (p06) br.cond.dptk.few <trylock+0x80>;;    ; jump to [1] if p6 set
 [MII]       ld4 r9=[r10]
             mov b0=r33
             addl r3=144,r1;;
 [MMI]       adds r2=1,r9;;                       ; acquires_count++
             st4 [r3]=r2                          ; save updated value
             nop.i 0x0
 [MIB]       nop.m 0x0                            ; [1]
             nop.i 0x0
             br.few <trylock+0x90>;;              ; jump to [2]
 [MII]       nop.m 0x0
             mov b0=r33
             nop.i 0x0;;
 [MIB]       nop.m 0x0                            ; [2]
             nop.i 0x0
             br.ret.sptk.many b0;;

I guess the moral of the story is that multi-threading is hard, but we knew that anyway.

posted at: Mon, 29 Oct 2007 12:13 | in /code/badcode | permalink | add comment (8 others)

Posted by Maurice Massar at Mon Oct 29 18:51:26 2007

"A neat trick which avoids a jump, but not so much thread safe: if you are interrupted after the cmp but before the store you can write back the wrong value."

Why? The carry flag must not be changed by a
context switch, and if "acquires_count" is
read/written by different threads, any access
to it must be protected.

Posted by Jon at Mon Oct 29 20:32:42 2007

Interesting discussion, but also regarding "A neat trick which avoids a jump, but not so much thread safe: if you are interrupted after the cmp but before the store you can write back the wrong value." - for someone not well versed in assembly, where is the store? is it a sub-instruction of the adc? so is the problem if you are interrupted between the cmp and the adc in your listing?

Posted by Ian at Mon Oct 29 21:08:39 2007

The problem is you're storing acquires_count in a register then
unconditionally writing it back, whether you have the lock or not.
Most people looking at that would assume the opposite; if we don't
have the lock we don't touch the variable

-i

Posted by Daniel Burrows at Tue Oct 30 01:06:53 2007

Ow, that made my head hurt.

If I'm not mistaken, the upshot here is that anything with multiple readers and writers has to be "volatile", or the compiler might optimize it into a bug.  Is that right?  It adds yet another easy-to-miss detail that has to be perfectly right all the time or risk subtle errors ... so it's certainly right in line with the rest of C.

Posted by Jan Hudec at Tue Oct 30 18:20:15 2007

As far as I know, increment is not atomic. Never. It never was specified as atomic.

Even the add instruction itself might not, in fact, be atomic on SMP system (on i386 it needs lock prefix, on some CPUs other tricks are needed). AFAIK even volatile will not make increment atomic. That's why Linux kernel goes to great lengths to define all operations on atomic_t in assembly.

Unfortunately pthread library does not seem to have anything with atomic increment (except a little unwieldy semaphores), but recent glib does have atomic increment and such (also in assembly!).

Posted by Ian at Tue Oct 30 19:19:44 2007

Yep, increment isn't atomic (although in this case that isn't the issue).  You might like to check out libatomic-ops (http://www.hpl.hp.com/research/linux/atomic_ops/ which has been written for all sorts of atomic operations.

Posted by jldugger at Tue Oct 30 20:22:13 2007

Fortunately, we use nesC in the my lab, and it has keyword support for atomic. This works out well in our environment, where the compiler simply turns off  interrupts. Theoretically one could replace them with a more complicated locking system, but I don't think anyone's been motivated to do so yet.

Posted by amit kr singh at Fri May 29 14:01:22 2009

hi this is a free website

Add a comment
*Name
*Email (not shown)
Website
*Comment:
Anti-spam:
* denotes required field

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.