technovelty

weblog of Ian Wienand

RSS  |  technovelty home  |  page of ian  |  ianw@ieee.org

Conditional Operator Harmful ? yes : no

It seems existing versions of gcc don't warn when you use an assignment as a truth value as the first operand to the conditional operator. For example:

int fn(int i)
{
	return (i = 0x20) ? 0x40 : 0x80;
}

Presumably you meant ==, but unfortunately that does not warn with gcc (maybe it will one day).

We can use the tree dumping mechanisms (previously mentioned here) to confirm our problem. For the following function gcc creates:

fn (i)
{
  int D.1541;
  int iftmp.0;

  i = 32;
  if (1)
    {
      iftmp.0 = 64;
    }
  else
    {
      iftmp.0 = 128;
    }
  D.1541 = iftmp.0;
  return D.1541;
}

As you can see, this isn't the result we wanted. This can be quite nasty if you use the conditional operator in a hidden fashion; for example behind an assert. A common idiom is:

#define ASSERT(x) (x) ? : fail()

/* programmer now subsitutes a == for */
ASSERT(x = y);

You'll currently get no warning you've slipped up and used the assert wrong. Moral of the story: a quick audit of your code might turn up some surprising uses! However, in this case, the Intel compiler picks this one up:

$ icc -c test.c
test.c(3): warning #187: use of "=" where "==" may have been intended
  	   return (i = 0x20) ? 0x40 : 0x80;

Although it's often not trivial to move to another compiler, as a rule I would recommend trying alternative compilers for your code as it's a great sanity check.

posted at: Sat, 26 Apr 2008 08:29 | in /code/c | permalink | add comment (1 others)

Checking users of typedefs

If you poke around the Linux VM layers you might wonder why there is a STRICT_MM_TYPECHECKS option. It is there to stop programmers touching opaque types. Linus has said many times that a typedef should only be made for a completely opaque type. This therefore implies that rather than operate on it directly, there is some sort of "API" to get at the values.

A simple example is below. my_long_t is opaque; the only way you should use its value is via the my_val accessor.

But how do we enforce this, when the type is really just an unsigned long? We wrap it up in something else, in this case a struct.

#ifndef STRICT
  typedef unsigned long my_long_t;
# define my_val(x) (x)

#else
  typedef struct { unsigned long t; } my_long_t;
# define my_val(x) ((x).t)

#endif

my_long_t my_naughty_function(my_long_t input)
{
        return input * 1000;
}

We can now see the outcome

$ gcc -c strict.c
$ gcc -DSTRICT -c strict.c
strict.c: In function 'my_naughty_function':
strict.c:15: error: invalid operands to binary *

So now we have an error where we do the wrong thing, rather than no notification at all! This niftily moves you up from "bug via visual inspection" to "won't compile" (Rusty Russell talked about this scale at linux.conf.au once -- does anyone have a reference to that talk?). Why not just leave it like this then? Well it might confuse the compiler; if you have rules about always passing structures on the stack, for example, the above will suck. In general, it's best to hide as little from the compiler as possible, to give it the best chance of doing a good job for you.

posted at: Fri, 30 Jun 2006 13:35 | in /code/c | permalink | add comment (1 others)

GCC Trampolines

Paul Wayper writes about trampolines. This is something I've come across before, when I was learning about function pointers with IA64.

Nested functions are a strange contraption, and probably best avoided (people who should know agree). For example, here is a completely obfuscated way to double a number.

#include <stdio.h>

int function(int arg) {

        int nested_function(int nested_arg) {
                return arg + nested_arg;
        }

        return nested_function(arg);
}


int main(void)
{
        printf("%d\n", function(10));
}

Let's disassemble that on IA64 (because it is so much easier to understand).

0000000000000000 :
   0:   00 10 15 08 80 05       [MII]       alloc r34=ar.pfs,5,4,0
   6:   c0 80 33 7e 46 60                   adds r12=-16,r12
   c:   04 08 00 84                         mov r35=r1
  10:   01 00 00 00 01 00       [MII]       nop.m 0x0
  16:   10 02 00 62 00 80                   mov r33=b0
  1c:   04 00 01 84                         mov r36=r32;;
  20:   0a 70 40 18 00 21       [MMI]       adds r14=16,r12;;
  26:   00 00 39 20 23 e0                   st4 [r14]=r32
  2c:   01 70 00 84                         mov r15=r14
  30:   10 00 00 00 01 00       [MIB]       nop.m 0x0
  36:   00 00 00 02 00 00                   nop.i 0x0
  3c:   48 00 00 50                         br.call.sptk.many b0=70 
  40:   0a 08 00 46 00 21       [MMI]       mov r1=r35;;
  46:   00 00 00 02 00 00                   nop.m 0x0
  4c:   20 02 aa 00                         mov.i ar.pfs=r34
  50:   00 00 00 00 01 00       [MII]       nop.m 0x0
  56:   00 08 05 80 03 80                   mov b0=r33
  5c:   01 61 00 84                         adds r12=16,r12
  60:   11 00 00 00 01 00       [MIB]       nop.m 0x0
  66:   00 00 00 02 00 80                   nop.i 0x0
  6c:   08 00 84 00                         br.ret.sptk.many b0;;

0000000000000070 :
  70:   0a 40 00 1e 10 10       [MMI]       ld4 r8=[r15];;
  76:   80 00 21 00 40 00                   add r8=r32,r8
  7c:   00 00 04 00                         nop.i 0x0
  80:   11 00 00 00 01 00       [MIB]       nop.m 0x0
  86:   00 00 00 02 00 80                   nop.i 0x0
  8c:   08 00 84 00                         br.ret.sptk.many b0;;

Note that nothing funny happens there at all. The nested function looks exactly like a real function, but we don't do any of the alloc or anything to get us a new stack frame -- we're using the stack frame of the old function. No trampoline involved, just a straight branch.

So why do we need a trampoline? Because a nested function has an extra property over a normal function; the stack pointer must be set to be the stack pointer of the function it is nested within. This means that if we were to pass around pointers to the nested function, we would have to keep track that this isn't a pointer to just any old function, but a pointer to special function that needs its stack setup to come from the nested function. C doesn't work like this, there is only one type "pointer to function".

The easiest way is therefore to create a little function that sets the stack to the right place and calls the code. This "trampoline" is a normal function, so no need for trickery. For example, we can modify this to use a function pointer to the nested function:

#include <stdio.h>

int function(int arg) {

        int nested_function(int nested_arg) {
                return arg + nested_arg;
        }

        int (*function_pointer)(int arg) = nested_function;

        return function_pointer(arg);
}


int main(void)
{
        printf("%d\n", function(10));
}

We can see that now we are calling through a function pointer, we go through __ia64_trampoline, which is a little code stub in libgcc.a which loads up the right stack pointer and calls the "nested" function.

function:
	.prologue 12, 33
	.mmb
	.save ar.pfs, r34
	alloc r34 = ar.pfs, 1, 3, 1, 0
	.fframe 48
	adds r12 = -48, r12
	nop 0
	.mmi
	addl r15 = @ltoffx(__ia64_trampoline#), r1
	mov r35 = r1
	.save rp, r33
	mov r33 = b0
	.body
	;;
	.mmi
	nop 0
	adds r8 = 24, r12
	adds r14 = 16, r12
	.mmi
	ld8.mov r15 = [r15], __ia64_trampoline#
	;;
	st4 [r14] = r32
	nop 0
	.mmi
	mov r14 = r8
	;;
	st8 [r14] = r15, 8
	adds r15 = 40, r12
	;;
	.mfi
	st8 [r14] = r15, 8
	nop 0
	addl r15 = @ltoff(@fptr(nested_function.1814#)), gp
	;;
	.mmi
	ld8 r15 = [r15]
	;;
	st8 [r14] = r15, 8
	adds r15 = 16, r12
	;;
	.mmb
	st8 [r14] = r15
	ld8 r14 = [r8], 8
	nop 0
	.mmi
	ld4 r36 = [r15]
	;;
	nop 0
	mov b6 = r14
	.mbb
	ld8 r1 = [r8]
	nop 0
	br.call.sptk.many b0 = b6
	;;
	.mii
	mov r1 = r35
	mov ar.pfs = r34
	mov b0 = r33
	.mib
	nop 0
	.restore sp
	adds r12 = 48, r12
	br.ret.sptk.many b0
	.endp function#
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8

So, the summary is that if you're taking the address of a nested function, to avoid having different types of pointers to functions if they are nested or not gcc puts in the stub "trampoline" code.

posted at: Mon, 15 May 2006 23:53 | in /code/c | permalink | add comment (0 others)

Unrolling and Duff's Device

Loop unrolling can be a fascinating art.

The canonical example of loop unrolling is something like

for (i = 0; i < 100; i++)
    x[i] = i;

becoming

for (i = 0; i < 25; i+=4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;
}

Longer runs of instructions afford you a better chance of managing to find instruction level parallelism; that is instructions that can all happen together. You can keep the pipeline fuller and hence run faster, at the cost of some expanded code size.

But unlike that example, you generally do not know where your loop ends. In the above example, we made four copies of the loop body, reducing the iterations by 4. For an indeterminate number of iterations, we would need to do the original loop n % 4 times and then the unrolled body floor(n / 4) times. So to spell it out, if we want to do the loop 7 times, then we execute the original loop 3 times and the unrolled portion once.

Our new code might look something like

int extras = n % 4;
for (i = 0; i < extras; i++)
    x[i] += i;

for ( ; i < n / 4 ; i+=4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;
}

That leads to a little trick called "Duff's Device", references to which can be found most anywhere (however I found most of them a little unclear, hence this post).

Duff's device looks like this

send(to, from, count)
register short *to, *from;
register count;
{
        register n=(count+7)/8;

        switch(count%8) {
        case 0: do {    *to = *from++;
        case 7:         *to = *from++;
        case 6:         *to = *from++;
        case 5:         *to = *from++;
        case 4:         *to = *from++;
        case 3:         *to = *from++;
        case 2:         *to = *from++;
        case 1:         *to = *from++;
                } while(--n>0);
        }
}

Now that doesn't look like loop unrolling, but it really is. In this case, we have unwound a loop doing simply *to = *from 8 times. But we still have that extra bit problem, since we have an indeterminate count.

To understand it, let's just work through what happens for possible inputs. For values of count from 1-7, n is one and count % 8 is just count so we fall right into the case statements, which fall through to each other without break and never repeat. The nifty bit comes when count is greater than 7. In this case, our first iteration comes from the switch where we fall through to do the extra count % 8 iterations. After that, we loop back to the start of the unrolling and do our 8 step loop as many times as we require.

Now by post-incrementing *to you can use this to implement memcpy(). An interesting exercise might be to see how much slower this is than the highly optimised memcpy that comes with glibc.

The code is most certainly ingenious and this page explains some of the history of it. Today I think it is mostly good as an exercise on "why it works".

If you're wondering how gcc handles unrolling, it creates the unrolled loop and the does a sort of fall through test which evaluates how many extra loops are required and branches into the unrolled portion at the correct point. A little extract is below, you can see r14 holds the number of extra loops required and branches higher up the unrolled code for each extra loop required.

...
  70: [MFB]       cmp4.eq p6,p7=0,r14
  76:             nop.f 0x0
  7c:       (p06) br.cond.dpnt.few 150 ;;
  80: [MFB]       cmp4.eq p6,p7=1,r14
  86:             nop.f 0x0
  8c:       (p06) br.cond.dpnt.few 130 ;;
  90: [MFB]       cmp4.eq p6,p7=2,r14
  96:             nop.f 0x0
  9c:       (p06) br.cond.dpnt.few 120 ;;
  a0: [MFB]       cmp4.eq p6,p7=3,r14
  a6:             nop.f 0x0
  ac:       (p06) br.cond.dpnt.few 110 ;;
  b0: [MFB]       cmp4.eq p6,p7=4,r14
  b6:             nop.f 0x0
  bc:       (p06) br.cond.dpnt.few 100 ;;
  c0: [MFB]       cmp4.eq p6,p7=5,r14
  c6:             nop.f 0x0
  cc:       (p06) br.cond.dpnt.few f0 ;;
  d0: [MMI]       cmp4.eq p6,p7=6,r14;;
  d6:       (p07) st4 [r16]=r17,4
  dc:             nop.i 0x0
...

posted at: Fri, 24 Mar 2006 09:32 | in /code/c | permalink | add comment (0 others)

Poking around inside the compiler

Shehjar, one of our students, came up with the interesting little

int i = 5;
printf("%d %d %d", i, i--, ++i);

and wondered exactly what it should do. Well, the C standard says that arguments to functions are evaluated first, but does not specify in what order they are evaluated. So this is a classic example of undefined behaviour (in fact it's even listed in the common list of not a bug not a bug by the GCC manual). In fact, ++ and friends don't actually create sequence points within the code (points where things are known to be evaluated), so they can play host to all sorts of undefined behaviour like above.

Of course, gcc doesn't leave you blind to this. If you turn on -Wall you get warning: operation on 'i' may be undefined. You do compile with -Wall right?

But this doesn't change the fact that that code prints 5,6,5 on x86 and on Itanium it prints 5,5,5. This suggested some investigation.

Firstly, lets simplify this down to

extern blah(int, int, int);

void function(void)
{
        int i = 5;
        blah(i, i--, ++i);
}

The problem is, looking at the output of the assembly doesn't tell us that much, because GCC is smart and by the time it has got to assembly, it is simply moving constant values into registers. So we need some way to peek into what GCC thinks it's doing to come up with those values.

Hence the magic -fdump flag. Building with this gives us dumps of the stages GCC is going through as it generates it's code.

$ gcc -fdump-tree-all -S -O ./test.c

This gives us lots of dumps, in numerical order as gcc is moving along. With modern GCC's, a good place to pick things up is to look is in the .gimple output file. Gimplification is the process GCC goes through optimising its internal tree structure (called GIMPLE after SIMPLE, apparently).

The .gimple file is pretty-printed from the tree representation back into C code. So, on i386 it looks like.

x86 $ cat test.c.t08.gimple

;; Function function (function)

function ()
{
  int i.0;
  int i;

  i = 5;
  i = i + 1;
  i.0 = i;
  i = i - 1;
  blah (i, i.0, i);
}

And on Itanium it looks like

ia64 $ cat test.c.t08.gimple
;; Function function (function)

function ()
{
  int i.0;
  int i;

  i = 5;
  i.0 = i;
  i = i - 1;
  i = i + 1;
  blah (i, i.0, i);
}

At this point, we can clearly see how gcc ends up with the idea that the code translates into 5,6,5 on x86 and 5,5,5 on Itanium. We also know that it is not the fault of optimisation, since we can see just what it is about to optimise matches the output we get. Why exactly it reaches this conclusion on the different architectures escapes me at the moment, but I'm slowing reading through and understanding the GCC code base hoping to one day find enlightenment :) Certainly passing arguments is very different on x86 compared to IA64 (on the stack rather than via registers) so no doubt some small piece of the backend config tweaks this into happening.

Another interesting thing to look at is the raw output of the tree gcc is using. You can see that via something like

$ gcc -S -fdump-tree-all-raw  -o test test.c

Which gives you an output like


;; Function function (function)

function ()
@1      function_decl    name: @2       type: @3       srcp: test.c:4
                         extern         body: @4
@2      identifier_node  strg: function lngt: 8
@3      function_type    size: @5       algn: 8        retn: @6
                         prms: @7
@4      bind_expr        type: @6       vars: @8       body: @9
@5      integer_cst      type: @10      low : 8
@6      void_type        name: @11      algn: 8
@7      tree_list        valu: @6
@8      var_decl         name: @12      type: @13      scpe: @1
                         srcp: test.c:6                artificial
                         size: @14      algn: 32       used: 1
@9      statement_list   0   : @15      1   : @16      2   : @17
                         3   : @18      4   : @19
@10     integer_type     name: @20      unql: @21      size: @22
                         algn: 64       prec: 36       unsigned
                         min : @23      max : @24
@11     type_decl        name: @25      type: @6       srcp: <built-in>:0
@12     identifier_node  strg: i.0      lngt: 3
@13     integer_type     name: @26      size: @14      algn: 32
                         prec: 32       min : @27      max : @28
@14     integer_cst      type: @10      low : 32
@15     modify_expr      type: @6       op 0: @29      op 1: @30
@16     modify_expr      type: @13      op 0: @29      op 1: @31
@17     modify_expr      type: @13      op 0: @8       op 1: @29
@18     modify_expr      type: @13      op 0: @29      op 1: @32
@19     call_expr        type: @13      fn  : @33      args: @34
@20     identifier_node  strg: bit_size_type           lngt: 13
@21     integer_type     name: @20      size: @22      algn: 64
                         prec: 36
@22     integer_cst      type: @10      low : 64
@23     integer_cst      type: @10      low : 0
@24     integer_cst      type: @10      low : -1
@25     identifier_node  strg: void     lngt: 4
@26     type_decl        name: @35      type: @13      srcp: <built-in>:0
@27     integer_cst      type: @13      high: -1       low : -2147483648
@28     integer_cst      type: @13      low : 2147483647
@29     var_decl         name: @36      type: @13      scpe: @1
                         srcp: test.c:5                size: @14
                         algn: 32       used: 1
@30     integer_cst      type: @13      low : 5
@31     plus_expr        type: @13      op 0: @29      op 1: @37
@32     minus_expr       type: @13      op 0: @29      op 1: @37
@33     addr_expr        type: @38      op 0: @39
@34     tree_list        valu: @29      chan: @40
@35     identifier_node  strg: int      lngt: 3
@36     identifier_node  strg: i        lngt: 1
@37     integer_cst      type: @13      low : 1
@38     pointer_type     size: @14      algn: 32       ptd : @41
@39     function_decl    name: @42      type: @41      srcp: test.c:1
                         undefined      extern
@40     tree_list        valu: @8       chan: @43
@41     function_type    size: @5       algn: 8        retn: @13
                         prms: @44
@42     identifier_node  strg: blah     lngt: 4
@43     tree_list        valu: @29
@44     tree_list        valu: @13      chan: @45
@45     tree_list        valu: @13      chan: @46
@46     tree_list        valu: @13      chan: @7

You can see the @X points to other nodes, which builds up to an entire tree structure of different nodes which represent the program. You can read more about the internal representation here.

Take home point? Using these dumps is a good way to start peering into what GCC is actually doing underneath its fairly extensive hood.

posted at: Tue, 07 Feb 2006 15:45 | in /code/c | permalink | add comment (2 others)

inline functions

Chatting with Hal, one of our summer students, we came across a bevy of interesting things to do with inlining code.

posted at: Thu, 15 Dec 2005 15:44 | in /code/c | permalink | add comment (0 others)

how to do max and min properly

The canonical definition of max is something like

#define max(a,b) ( ( (a) >= (b) ) ? (a) : (b))

But this raises many issues with types in C. What if you pass in an unsigned int and a signed int all at the same time? Under what circumstances is the comparison valid? What type should be returned?

The kernel gets around this by being tricky

/*
 * min()/max() macros that also do
 * strict type-checking.. See the
 * "unnecessary" pointer comparison.
 */
#define min(x,y) ({ \
        typeof(x) _x = (x);     \
        typeof(y) _y = (y);     \
        (void) (&_x == &_y);            \
        _x < _y ? _x : _y; })

The give you a clue to how this works by talking about the pointer comparison. C99 6.5.9.2 says that for the == comparison

Both operands are pointers to qualified or unqualified version of compatible types

Where compatible types are defined in 6.7.2.1

Two types have compatible type if their types are the same.

There are additional rules, but gcc is right to warn you when you compare pointers of incompatible types. Note that if the inputs are not pointers, but arithmetic types, the C99 standard says that normal arithmetic conversions apply. You can see that the outcome of this is really probably not what you want, but the compiler has no reason to warn you.

Thus the above code makes a warning spew out if you try to compare values of incompatible types. You can try it yourself

$ cat test.c
#include <stdio.h>

int main(void)
{
        int i = -100;
        unsigned int j = 100;

        if (&i == &j);

        if (i < j)
                printf("OK\n");
        if (i > j)
                printf("i is -100, why is that > 100?\n");

        return 0;
}

$ gcc -Wall -o test test.c
test.c: In function 'main':
test.c:8: warning: comparison of distinct pointer types lacks a cast

$ ./test
i is -100, why is that > 100?

Not the most straight forward warning in the world, but sufficient to let you know you're doing something wrong. Note we don't get any warning where the problem really is, except we get a value we don't expect. C certainly can be a very tricky language to get right!

posted at: Thu, 08 Dec 2005 17:19 | in /code/c | permalink | add comment (0 others)

What exactly does -Bsymblic do?

-Bsymbolic is described by the ld man page as

When creating a shared library, bind references to global symbols to the definition within the shared library, if any. Normally, it is possible for a program linked against a shared library to override the definition within the shared library. This option is only meaningful on ELF platforms which support shared libraries.

Indeed, it is interesting to peer under the hood a little. In fact, the only thing the -Bsymbolic flag does when building a shared library is add a flag in the dynamic section of the binary called DT_SYMBOLIC. Having a look at the ABI description of the flag is a little more instructive.

This element's presence in a shared object library alters the dynamic linker's symbol resolution algorithm for references within the library. Instead of starting a symbol search with the executable file, the dynamic linker starts from the shared object itself. If the shared object fails to supply the referenced symbol, the dynamic linker then searches the executable file and other shared objects as usual.

Ahh, now we can have a peek into the dynamic loader to see what exactly it does with this flag. Pull open the glibc source and have a look at elf/dl-load.c, particularly at the bit in _dl_map_obj_from_fd where we have

 /* If this object has DT_SYMBOLIC set modify now its scope.  We don't
     have to do this for the main map.  */
  if ((mode & RTLD_DEEPBIND) == 0
      && __builtin_expect (l->l_info[DT_SYMBOLIC] != NULL, 0)
      && &l->l_searchlist != l->l_scope[0])
    {
      /* Create an appropriate searchlist.  It contains only this map.
         This is the definition of DT_SYMBOLIC in SysVr4.  */
      l->l_symbolic_searchlist.r_list =
        (struct link_map **) malloc (sizeof (struct link_map *));

      if (l->l_symbolic_searchlist.r_list == NULL)
        {
          errstring = N_("cannot create searchlist");
          goto call_lose_errno;
        }

      l->l_symbolic_searchlist.r_list[0] = l;
      l->l_symbolic_searchlist.r_nlist = 1;

      /* Now move the existing entries one back.  */
      memmove (&l->l_scope[1], &l->l_scope[0],
               (l->l_scope_max - 1) * sizeof (l->l_scope[0]));

      /* Now add the new entry.  */
      l->l_scope[0] = &l->l_symbolic_searchlist;
    }

So we can see (in typical glibcistic code) that we insert ourselves at the beginning of the link map, and move everything else along further down the chain.

Finally, an example to illustrate. It's a bit long, but worth understanding.

ianw@lime:/tmp/symbolic$ cat Makefile
all: test testsym

clean:
        rm -f *.so test testsym

liboverride.so : override.c
        $(CC) -shared -fPIC -o liboverride.so override.c

libtest.so : libtest.c
        $(CC) -shared -fPIC -o libtest.so libtest.c

libtestsym.so : libtest.c
        $(CC) -shared -fPIC -Wl,-Bsymbolic -o libtestsym.so libtest.c

test : test.c libtest.so liboverride.so
        $(CC) -L. -ltest -o test test.c

testsym : test.c libtestsym.so liboverride.so
        $(CC) -L. -ltestsym -o testsym test.c

ianw@lime:/tmp/symbolic$ cat libtest.c
#include 

int foo(void) {
        printf("libtest foo called\n");
        return 1;
}

int test_foo(void)
{
        return foo();
}
ianw@lime:/tmp/symbolic$ cat override.c
#include 

int foo(void)
{
        printf("override foo called\n");
        return 0;
}
ianw@lime:/tmp/symbolic$ cat test.c
#include 

int main(void)
{
        printf("%d\n", test_foo());
}


ianw@lime:/tmp/symbolic$ make
cc -shared -fPIC -o libtest.so libtest.c
cc -shared -fPIC -o liboverride.so override.c
cc -L. -ltest -o test test.c
cc -shared -fPIC -Wl,-Bsymbolic -o libtestsym.so libtest.c
cc -L. -ltestsym -o testsym test.c

So we have an application, a library (compiled once with -Bsymbolic and once without) and another library with a symbol foo that will override our original definition.

ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. ./test
libtest foo called
1
ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. ./testsym
libtest foo called
1

No surprises here ... both libraries call the "builtin" foo and everything is fine.

ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. LD_PRELOAD=./liboverride.so  ./test
override foo called
0
ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. LD_PRELOAD=./liboverride.so  ./testsym
libtest foo called
1

Now we see the difference. The preloaded foo was ignored in the second case, because the libtest was told to look within itself for symbols before looking outside.

-Bsymbolic is a rather big hammer to maintain symbol visiblity with however; it makes it all or nothing. Chances are you'd be better of with a version script. For example, with the following version script we can achieve the same thing.

ianw@lime:/tmp/symbolic$ cat Versions
{global: test_foo;  local: *; };
ianw@lime:/tmp/symbolic$ make
cc -shared -fPIC -Wl,-version-script=Versions -o libtestver.so libtest.c
cc -L. -ltestver -o testver test.c
ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. LD_PRELOAD=./liboverride.so ./testver
libtest foo called
1

You can see that since the version of foo in libtestver.so was local it got bound before the LD_PRELOAD, so has the same effect as -Bsymbolic.

Moral of the story? Version your symbols!

posted at: Thu, 10 Nov 2005 16:40 | in /code/c | permalink | add comment (0 others)

Stripping unused functions

benno raised an interesting question on getting rid of unused functions from a binary. If you link a group of object files together, the linker will be smart enough to leave out any files that it know doesn't get used. But what about functions that don't get used?

The best solution I could think of was to use -ffunction-sections and then --gc-sections to remove those sections that aren't used.

$ cat foo.c
int foo(void) {};
int bar(void) {};

$ cat main.c
int
main(void)
{
  foo();
  return 0;
};

$ gcc -c main.c
$ gcc -c -ffunction-sections foo.c
$ gcc -Wl,--gc-sections foo.o main.o -o program
$ objdump -d ./program | grep bar

This seems well out of scope for the linker to resolve, as it's job isn't to splice and dice text sections. The compiler doesn't really have enough information to see what is going on either. I think it would be possible to analyse the relocations in a group of pre-linked object files, and then compare that against a list of global functions in those same object files and if there isn't a relocation against it, prune the function out of the object.

I think the nicest way to do that would be with libelf wrappers for Python. But surely someone has looked at doing that before?

posted at: Wed, 09 Nov 2005 15:02 | in /code/c | permalink | add comment (0 others)

cdecl

cdecl is another tool that has been around since the days when Trumpet Winsock and SLIP were cutting edge (the README says 1996) which I just found a use for.

I find it handy for debugging things like

#include <stdio.h>
#include <setjmp.h>

jmp_buf j;

extern int* fun1(void);
extern int* fun2(void);
extern int* fun3(void);
int* foo(int *p);

int* foo (int *p)
{
	p = fun1();
	if (setjmp (j))
	   return p;

	   /* longjmp (j) may occur in fun3. */
	   p = fun3();
	   return p;
}
ianw@lime:/tmp$ gcc -g -c  -O2 -W -Wall -Wstrict-prototypes -Wmissing-prototypes -Werror -g -O3 jmp.c
cc1: warnings being treated as errors
jmp.c: In function 'foo':
jmp.c:11: warning: argument 'p' might be clobbered by 'longjmp' or 'vfork'

I can never remember the correct volatile for this situation

ianw@lime:~$ cdecl
cdecl> explain volatile int *p
declare p as pointer to volatile int
cdecl> explain int *volatile v
declare p as volatile pointer to int

So clearly the second version is the one I want, since I need to indicate that the pointer value might change underneath us. Nifty!

posted at: Wed, 02 Nov 2005 14:28 | in /code/c | permalink | add comment (0 others)

Investigating floats

float variables are, on most current hardware, instrumented as as single precision IEEE 754 value. We know that floating point values only approximate real values, but it is often hard to visualise. When we have the binary decimal 0.12345 we are really saying something like 1*10-1 + 2*10-2 + 3*10-3 .... Thus a binary decimal 0.10110 describes 1*2-1 + 0*2-2 + 1*2-3 ...

So clearly, with one decimal digit we can describe the range 0.0 - 0.9, but with one binary digit we can describe only 0 and 1/2 = 0.5. The fractions we can represent exactly are the dyadic rationals. Other rationals repeat (just like in base 10) and others are irrational. So for example, 0.1 can not be repesented exactly without infinite precision.

The program below illustrates how a single precision floating point value works by showing what bits are set and adding them up.

For example

$ ./float .5
0.500000 = 1 * (1/2^0) * 2^-1
0.500000 = 1 * (1/1) * 2^-1
0.500000 = 1 * 1 * 0.500000
0.500000 = 0.5
$ ./float .1
0.100000 = 1 * (1/2^0 + 1/2^1 + 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + 1/2^16 + 1/2^17 + 1/2^20 + 1/2^21 + 1/2^23) * 2^-4
0.100000 = 1 * (1/1 + 1/2 + 1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + 1/65536 + 1/131072 + 1/1048576 + 1/2097152 + 1/8388608) * 2^-4
0.100000 = 1 * 1.60000002384 * 0.062500
0.100000 = 0.10000000149

I use doubles to show how the errors creep in.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* return 2^n */
int two_to_pos(int n)
{
	if (n == 0)
		return 1;
	return 2 * two_to_pos(n - 1);
}

double two_to_neg(int n)
{
	if (n == 0)
		return 1;
	return 1.0 / (two_to_pos(abs(n)));
}

double two_to(int n)
{
	if (n >= 0)
		return two_to_pos(n);
	if (n < 0)
		return two_to_neg(n);
	return 0;
}

/* Go through some memory "m" which is the 24 bit significand of a
   floating point number.  We work "backwards" from the bits
   furthest on the right, for no particular reason. */
double calc_float(int m, int bit)
{
	/* 23 bits; this terminates recursion */
	if (bit > 23)
		return 0;

	/* if the bit is set, it represents the value 1/2^bit */
	if ((m >> bit) & 1)
		return 1.0L/two_to(23 - bit) + calc_float(m, bit + 1);

	/* otherwise go to the next bit */
	return calc_float(m, bit + 1);
}

int main(int argc, char *argv[])
{
	float f;
	int m,i,sign,exponent,significand;

	if (argc != 2)
	{
		printf("usage: float 123.456\n");
		exit(1);
	}

	if (sscanf(argv[1], "%f", &f) != 1)
	{
		printf("invalid input\n");
		exit(1);
	}

	/* We need to "fool" the compiler, as if we start to use casts
	   (e.g. (int)f) it will actually do a conversion for us.  We
	   want access to the raw bits, so we just copy it into a same
	   sized variable. */
	memcpy(&m, &f, 4);

	/* The sign bit is the first bit */
	sign = (m >> 31) & 0x1;

	/* Exponent is 8 bits following the sign bit */
	exponent = ((m >> 23) & 0xFF) - 127;

	/* Significand fills out the float, the first bit is implied
	   to be 1, hence the 24 bit OR value below. */
	significand = (m & 0x7FFFFF) | 0x800000;

	/* print out a power representation */
	printf("%f = %d * (", f, sign ? -1 : 1);
	for(i = 23 ; i >= 0 ; i--)
	{
		if ((significand >> i) & 1)
			printf("%s1/2^%d", (i == 23) ? "" : " + ",
			       23-i);
	}
	printf(") * 2^%d\n", exponent);

	/* print out a fractional representation */
	printf("%f = %d * (", f, sign ? -1 : 1);
	for(i = 23 ; i >= 0 ; i--)
	{
		if ((significand >> i) & 1)
			printf("%s1/%d", (i == 23) ? "" : " + ",
			       (int)two_to(23-i));
	}
	printf(") * 2^%d\n", exponent);

	/* convert this into decimal and print it out */
	printf("%f = %d * %.12g * %f\n",
	       f,
	       (sign ? -1 : 1),
	       calc_float(significand, 0),
	       two_to(exponent));

	/* do the math this time */
	printf("%f = %.12g\n",
	       f,
	       (sign ? -1 : 1) *
	       calc_float(significand, 0) *
	       two_to(exponent)
		);

	return 0;
}

posted at: Thu, 20 Oct 2005 16:20 | in /code/c | permalink | add comment (0 others)

printf tricks

A sometimes handy trick is registering your own specifiers for printf. Although it is GNU only, it is still a handy trick. For example, glibc comes with a built-in extension to reduce a float for you, e.g.

$ cat test.c
#include 
#include 

int main(void)
{
        float bytes = 1024.0 * 1024.0 * 1024.0;

        register_printf_function ('b', printf_size, printf_size_info);

        printf("%.0bB\n", bytes);

}

$ gcc -o test test.c
$ ./test
1gB

If we register the handler with a capital letter, the base will be 1000, otherwise it is 1024. It is unfortunate that this doesn't appear to support SI units, but it is still handy (and may one day).

The glibc manual has full details on customising printf further. You can easily add a specifier for your own structures, etc.

posted at: Wed, 19 Oct 2005 10:02 | in /code/c | permalink | add comment (0 others)

Even numbers

Is

int even(int n)
{
        return (n & 1);
}

a valid way of determining the evenness of a number? For unsigned integers, yes. For signed positive integers also yes. However, C99 6.2.6.2-2 states

For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. [...] If the sign bit is one, the value shall be modified in one of the following ways

In other words, there are three possible representations of a signed integer in C, and one of them (one's complement) has a negative representation that fails the above test. However, I don't think there are actually any one's complement machines, so you're pretty safe.

However, Murphy's law dictates that eventually your code will run on some bizarre device that uses one's complement. Thus the best bet is to use a remainder (%) and let the compiler figure it out for you. Have a look at what the compiler gets up to for the most straight forward case:

$ cat test.c
int function(int i)
{
        return (i % 2);
}

$ gcc -O2 -S test.c
$ cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
.globl function
        .type   function, @function
function:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        andl    $-2147483647, %eax
        js      .L5
        popl    %ebp
        ret
        .p2align 4,,15
.L5:
        decl    %eax
        orl     $-2, %eax
        incl    %eax
        popl    %ebp
        ret
        .size   function, .-function
        .ident  "GCC: (GNU) 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)"
        .section        .note.GNU-stack,"",@progbits

It's a bit tricky because that remainder needs to account for negative values. So $-2147483657 == -0x7FFFFFFF, which in two's complement is binary 1000 0000 0000 0000 0000 0000 0000 0001. So the andl checks the last bit just like we did before, however having the top bit set means that if the top bit is set in the other operand (i.e. the sign bit, indicating it is a negative number), the sign flag will be set and the js (jump on sign bit) will be triggered (that next bit at .L5 is a consequence of C99 6.5.5 6 where we have the relation (a/b)*b + a%b == a).

We can reduce the even function down to

$ cat test.c
int function(int i)
{
        return ((unsigned int)i % 2);
}

$ gcc -O2 -S test.c

$ cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
.globl function
        .type   function, @function
function:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        andl    $1, %eax
        popl    %ebp
        ret
        .size   function, .-function
        .ident  "GCC: (GNU) 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)"
        .section        .note.GNU-stack,"",@progbits

Which is the same as the original on this two's complement x86. In this trivial case the cast to unsigned could have been used with the original & 1 check and would have been OK, but my feeling is this the % is a little more clear (and if you need to worry about negative values your hand is forced).

I think the moral is to be worried whenever you pull out a bitwise operator and try to pass it off the work to the compiler (with a quick sanity check).

posted at: Wed, 14 Sep 2005 09:51 | in /code/c | permalink | add comment (0 others)

Buffering with streams

When you know about buffering it's trivial, but when you don't you can get some very unexpected behaviour from UNIX.

For example, the following program will print nothing.

#include 

int main(void)
{
        printf("hello world");

        while (1)
                sleep(1);
}

This is because by default your stdout stream (which printf uses) will be in line buffering mode, where output will not be printed until a newline (\n) is seen.

You can either add the newline (as was probably intended, but it's an easy enough thing to leave off!) or manually call fflush to flush the buffer.

There are three buffer modes available; unbuffered, line and block buffered. You can access them with the setbuf call. Unbuffered is the default for stderr, while your tty by default takes on line buffering. Anything connected to a block device (e.g. a file on a disk, or pipes) take on block buffering, where the best amounts of data for the device are written at once.

But essentially, if your output is missing check for those newlines!

posted at: Fri, 26 Aug 2005 23:51 | in /code/c | permalink | add comment (0 others)

sched_yield considered harmful

According to recent discussions sched_yield can cause performance problems with some code.

I haven't looked at the OpenLDAP code, but a common problem is that if you call pthread_cond_signal it doesn't actually relinquish control from the current thread to the thread (if any) that has been signaled. The thread that has been signalled has to wait for its next time slice before it will realise it has been woken up, and the signalling thread loops around the mutex for the conditional doing nothing. Thus a common idiom is to call pthread_cond_signal and then sched_yield to relinquish control from your thread to the waiting thread (I've certainly done this myself).

This can lead to very bad performance under certain circumstances since sched_yield on Linux is saying to the scheduler "I have nothing important to do". This may actually be the opposite of what is really true, as usually developers see this problem when there is so much work to do you don't get a chance to relinquish control and make some forward progress with it. This is especially bad on a SMP machine where you probably don't want to relinquish control at all, since you don't need to give up your timeslice for other threads to start running on other processors; what allows you forward progress on a uniprocessor might be slowing you down needlessly on a SMP machine. On some other systems sched_yield will relinquish control to another thread in your thread group, but this is not specified by the standard so you can not depend on it.

One solution to such a problem is presented below. This is a fairly standard situation where are request is made and needs to be handled. To avoid using sched_yield one needs to queue the requests for a dispatcher to handle and take an extra lock that it is possible to sleep on when required. On a uniprocessor, when requests are flooding in request() will just be looping queueing work until its timeslice is complete. The typical idiom is to place a yield after the unlock of dispatcher.mutex to allow the dispatcher to do some work. However by adding another lock you can make request sleep, giving the dispatcher time to clear the queue. You will notice if you include the usleep that everything happens nice and sequentially on uniprocessor or SMP; this is because the dispatcher gets a chance to run as the requests go to sleep. Conversely without the sleep the requests are flooding in as fast as possible; to keep your request handling latency reasonable you don't want to be giving up the CPU to every Tom, Dick and Harry every time you handle a new request!

Compile this once as

$ gcc -Wall -o server server.c -lpthread

to see the problem, then once as

$ gcc -Wall -o server-resched server.c -lpthread -DRESCHED 

to see the dispatcher handling the batched requests. I believe this is a good solution to minimise latency below a CPU timeslice when requests are flooding, without having to "fool" the scheduler with sched_yield. Something to keep in mind is that if you find yourself using sched_yield when you may very well have more useful work to do, then chances are you're doing the wrong thing.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>

struct request_tag_t {
	struct request_tag_t *next;
	int operation;
};

struct dispatcher_tag_t {
	struct request_tag_t *first;
	struct request_tag_t *last;
	int running;
	pthread_mutex_t mutex;
	pthread_cond_t request;
	pthread_mutex_t resched;
};

struct dispatcher_tag_t dispatcher = {
	.first = NULL,
	.last = NULL,
	.running = 0,
	.mutex = PTHREAD_MUTEX_INITIALIZER,
	.request = PTHREAD_COND_INITIALIZER,
	.resched = PTHREAD_MUTEX_INITIALIZER,
};

int rand_num(int range)
{
	return (int)((range)*1.0*rand()/(RAND_MAX+1.0));
}

void *dispatcher_routine(void *arg)
{
	struct request_tag_t *request;

	printf("dispatcher starting\n");

	while (1)
	{
		/* take the dispatcher queue mutex */
		pthread_mutex_lock(&dispatcher.mutex);

		/* if nothing to do, wait on the dispatcher queue
		 * condition variable until there is */
		while (dispatcher.first == NULL)
			pthread_cond_wait(&dispatcher.request, &dispatcher.mutex);

		/* take a request from the queue */
		request = dispatcher.first;
		dispatcher.first = request->next;
		if (dispatcher.first == NULL)
			dispatcher.last = NULL;

		/* handle request, probably by handling it in a new thread. */
		printf("I'm dispatching request %d\n", request->operation);

		/* free request we have just finished with */
		free(request);

		/* release queue lock */
		pthread_mutex_unlock(&dispatcher.mutex);
#ifdef RESCHED
		/* release the resched lock to signal we've seen the queue */
		pthread_mutex_unlock(&dispatcher.resched);
#endif
	}
}

int tries = 1;

void request(int operation)
{
	struct request_tag_t *request;

	/* take the dispatcher queue mutex */
	pthread_mutex_lock(&dispatcher.mutex);

	/* start the thread if it isn't */
	if (!dispatcher.running)
	{
		pthread_t thread;
		printf("dispatcher not running, starting\n");
		dispatcher.running = 1;
		pthread_create(&thread, NULL, dispatcher_routine, NULL);
	}

	/* allocate a new request */
	request = malloc(sizeof(struct request_tag_t));
	if (request == NULL)
		exit(1);
	request->next = NULL;
	request->operation = operation;

	/* add to the queue, maintaing first and last */
	if (dispatcher.first == NULL) {
		dispatcher.first = request;
		dispatcher.last = request;
	} else {
		(dispatcher.last)->next = request;
		dispatcher.last = request;
	}

	/* signal something to do */
	pthread_cond_signal(&dispatcher.request);

	/* free the queue lock */
	pthread_mutex_unlock(&dispatcher.mutex);

	/* temping to put a sched_yield() here */

#ifdef RESCHED
	/* try the resched lock.  if we can't have it, that means
	 * we've already got it and the dispatch thread hasn't had a
	 * chance to run and unlock it.  We queue up to 10 requests
	 * before allowing the dispatcher to run by sleeping on the
	 * mutex
	 */
	if (pthread_mutex_trylock(&dispatcher.resched) == EBUSY) {
		if (tries++ == 10) {
			pthread_mutex_lock(&dispatcher.resched);
			tries = 1;
		}
	} else 
		tries = 1;
#endif
}

int main(void) {
	while(1)
	{
		int n = rand_num(3);
		printf("requesting %d\n", n);
		request(n);

		/* if you uncomment this, everything should play
		 * nicely with one dispatch per request, because other
		 * threads get a chance to run.  Otherwise we have
		 * requests flooding in as fast as possible 
		 */

		/* usleep(500000); */
	}
}

posted at: Tue, 23 Aug 2005 10:36 | in /code/c | permalink | add comment (0 others)

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.