Unrolling and Duff's Device

Fri 24 March 2006

Loop unrolling can be a fascinating art.

The canonical example of loop unrolling is something like

for (i = 0; i < 100; i++)
    x[i] = i;

becoming

for (i = 0; i < 25; i+=4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;
}

Longer runs of instructions afford you a better chance of managing to find instruction level parallelism; that is instructions that can all happen together. You can keep the pipeline fuller and hence run faster, at the cost of some expanded code size.

But unlike that example, you generally do not know where your loop ends. In the above example, we made four copies of the loop body, reducing the iterations by 4. For an indeterminate number of iterations, we would need to do the original loop n % 4 times and then the unrolled body floor(n / 4) times. So to spell it out, if we want to do the loop 7 times, then we execute the original loop 3 times and the unrolled portion once.

Our new code might look something like

int extras = n % 4;
for (i = 0; i < extras; i++)
    x[i] += i;

for ( ; i < n / 4 ; i+=4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;
}

That leads to a little trick called "Duff's Device", references to which can be found most anywhere (however I found most of them a little unclear, hence this post).

Duff's device looks like this

send(to, from, count)
register short *to, *from;
register count;
{
        register n=(count+7)/8;

        switch(count%8) {
        case 0: do {    *to = *from++;
        case 7:         *to = *from++;
        case 6:         *to = *from++;
        case 5:         *to = *from++;
        case 4:         *to = *from++;
        case 3:         *to = *from++;
        case 2:         *to = *from++;
        case 1:         *to = *from++;
                } while(--n>0);
        }
}

Now that doesn't look like loop unrolling, but it really is. In this case, we have unwound a loop doing simply *to = *from 8 times. But we still have that extra bit problem, since we have an indeterminate count.

To understand it, let's just work through what happens for possible inputs. For values of count from 1-7, n is one and count % 8 is just count so we fall right into the case statements, which fall through to each other without break and never repeat. The nifty bit comes when count is greater than 7. In this case, our first iteration comes from the switch where we fall through to do the extra count % 8 iterations. After that, we loop back to the start of the unrolling and do our 8 step loop as many times as we require.

Now by post-incrementing *to you can use this to implement memcpy(). An interesting exercise might be to see how much slower this is than the highly optimised memcpy that comes with glibc.

The code is most certainly ingenious and this page explains some of the history of it. Today I think it is mostly good as an exercise on "why it works".

If you're wondering how gcc handles unrolling, it creates the unrolled loop and the does a sort of fall through test which evaluates how many extra loops are required and branches into the unrolled portion at the correct point. A little extract is below, you can see r14 holds the number of extra loops required and branches higher up the unrolled code for each extra loop required.

...
  70: [MFB]       cmp4.eq p6,p7=0,r14
  76:             nop.f 0x0
  7c:       (p06) br.cond.dpnt.few 150 ;;
  80: [MFB]       cmp4.eq p6,p7=1,r14
  86:             nop.f 0x0
  8c:       (p06) br.cond.dpnt.few 130 ;;
  90: [MFB]       cmp4.eq p6,p7=2,r14
  96:             nop.f 0x0
  9c:       (p06) br.cond.dpnt.few 120 ;;
  a0: [MFB]       cmp4.eq p6,p7=3,r14
  a6:             nop.f 0x0
  ac:       (p06) br.cond.dpnt.few 110 ;;
  b0: [MFB]       cmp4.eq p6,p7=4,r14
  b6:             nop.f 0x0
  bc:       (p06) br.cond.dpnt.few 100 ;;
  c0: [MFB]       cmp4.eq p6,p7=5,r14
  c6:             nop.f 0x0
  cc:       (p06) br.cond.dpnt.few f0 ;;
  d0: [MMI]       cmp4.eq p6,p7=6,r14;;
  d6:       (p07) st4 [r16]=r17,4
  dc:             nop.i 0x0
...

Poking around inside the compiler

Tue 07 February 2006

Shehjar, one of our students, came up with the interesting little

int i = 5;
printf("%d %d %d", i, i--, ++i);

and wondered exactly what it should do. Well, the C standard says that arguments to functions are evaluated first, but does not specify in what order they are evaluated. So this is a classic example of undefined behaviour (in fact it's even listed in the common list of not a bug not a bug by the GCC manual). In fact, ++ and friends don't actually create sequence points within the code (points where things are known to be evaluated), so they can play host to all sorts of undefined behaviour like above.

Of course, gcc doesn't leave you blind to this. If you turn on -Wall you get warning: operation on 'i' may be undefined. You do compile with -Wall right?

But this doesn't change the fact that that code prints 5,6,5 on x86 and on Itanium it prints 5,5,5. This suggested some investigation.

Firstly, lets simplify this down to

extern blah(int, int, int);

void function(void)
{
        int i = 5;
        blah(i, i--, ++i);
}

The problem is, looking at the output of the assembly doesn't tell us that much, because GCC is smart and by the time it has got to assembly, it is simply moving constant values into registers. So we need some way to peek into what GCC thinks it's doing to come up with those values.

Hence the magic -fdump flag. Building with this gives us dumps of the stages GCC is going through as it generates it's code.

$ gcc -fdump-tree-all -S -O ./test.c

This gives us lots of dumps, in numerical order as gcc is moving along. With modern GCC's, a good place to pick things up is to look is in the .gimple output file. Gimplification is the process GCC goes through optimising its internal tree structure (called GIMPLE after SIMPLE, apparently).

The .gimple file is pretty-printed from the tree representation back into C code. So, on i386 it looks like.

x86 $ cat test.c.t08.gimple

;; Function function (function)

function ()
{
  int i.0;
  int i;

  i = 5;
  i = i + 1;
  i.0 = i;
  i = i - 1;
  blah (i, i.0, i);
}

And on Itanium it looks like

ia64 $ cat test.c.t08.gimple
;; Function function (function)

function ()
{
  int i.0;
  int i;

  i = 5;
  i.0 = i;
  i = i - 1;
  i = i + 1;
  blah (i, i.0, i);
}

At this point, we can clearly see how gcc ends up with the idea that the code translates into 5,6,5 on x86 and 5,5,5 on Itanium. We also know that it is not the fault of optimisation, since we can see just what it is about to optimise matches the output we get. Why exactly it reaches this conclusion on the different architectures escapes me at the moment, but I'm slowing reading through and understanding the GCC code base hoping to one day find enlightenment :) Certainly passing arguments is very different on x86 compared to IA64 (on the stack rather than via registers) so no doubt some small piece of the backend config tweaks this into happening.

Another interesting thing to look at is the raw output of the tree gcc is using. You can see that via something like

$ gcc -S -fdump-tree-all-raw  -o test test.c

Which gives you an output like

;; Function function (function)

function ()
@1      function_decl    name: @2       type: @3       srcp: test.c:4
                         extern         body: @4
@2      identifier_node  strg: function lngt: 8
@3      function_type    size: @5       algn: 8        retn: @6
                         prms: @7
@4      bind_expr        type: @6       vars: @8       body: @9
@5      integer_cst      type: @10      low : 8
@6      void_type        name: @11      algn: 8
@7      tree_list        valu: @6
@8      var_decl         name: @12      type: @13      scpe: @1
                         srcp: test.c:6                artificial
                         size: @14      algn: 32       used: 1
@9      statement_list   0   : @15      1   : @16      2   : @17
                         3   : @18      4   : @19
@10     integer_type     name: @20      unql: @21      size: @22
                         algn: 64       prec: 36       unsigned
                         min : @23      max : @24
@11     type_decl        name: @25      type: @6       srcp: <built-in>:0
@12     identifier_node  strg: i.0      lngt: 3
@13     integer_type     name: @26      size: @14      algn: 32
                         prec: 32       min : @27      max : @28
@14     integer_cst      type: @10      low : 32
@15     modify_expr      type: @6       op 0: @29      op 1: @30
@16     modify_expr      type: @13      op 0: @29      op 1: @31
@17     modify_expr      type: @13      op 0: @8       op 1: @29
@18     modify_expr      type: @13      op 0: @29      op 1: @32
@19     call_expr        type: @13      fn  : @33      args: @34
@20     identifier_node  strg: bit_size_type           lngt: 13
@21     integer_type     name: @20      size: @22      algn: 64
                         prec: 36
@22     integer_cst      type: @10      low : 64
@23     integer_cst      type: @10      low : 0
@24     integer_cst      type: @10      low : -1
@25     identifier_node  strg: void     lngt: 4
@26     type_decl        name: @35      type: @13      srcp: <built-in>:0
@27     integer_cst      type: @13      high: -1       low : -2147483648
@28     integer_cst      type: @13      low : 2147483647
@29     var_decl         name: @36      type: @13      scpe: @1
                         srcp: test.c:5                size: @14
                         algn: 32       used: 1
@30     integer_cst      type: @13      low : 5
@31     plus_expr        type: @13      op 0: @29      op 1: @37
@32     minus_expr       type: @13      op 0: @29      op 1: @37
@33     addr_expr        type: @38      op 0: @39
@34     tree_list        valu: @29      chan: @40
@35     identifier_node  strg: int      lngt: 3
@36     identifier_node  strg: i        lngt: 1
@37     integer_cst      type: @13      low : 1
@38     pointer_type     size: @14      algn: 32       ptd : @41
@39     function_decl    name: @42      type: @41      srcp: test.c:1
                         undefined      extern
@40     tree_list        valu: @8       chan: @43
@41     function_type    size: @5       algn: 8        retn: @13
                         prms: @44
@42     identifier_node  strg: blah     lngt: 4
@43     tree_list        valu: @29
@44     tree_list        valu: @13      chan: @45
@45     tree_list        valu: @13      chan: @46
@46     tree_list        valu: @13      chan: @7

You can see the @X points to other nodes, which builds up to an entire tree structure of different nodes which represent the program. You can read more about the internal representation here.

Take home point? Using these dumps is a good way to start peering into what GCC is actually doing underneath its fairly extensive hood.

inline functions

Thu 15 December 2005

Chatting with Hal, one of our summer students, we came across a bevy of interesting things to do with inlining code.

C99 6.7.4 tells us that inline can only be considered a hint, by the standard. gcc may choose not to inline things marked inline for a number of reasons.
The gcc attribute `always_inline <http://gcc.gnu.org/onlinedocs/gcc-4.0.2/gcc/Function-Attributes.html#Function-Attributes>`_ only applies to functions that are already declared with the inline function specifier.
always_inline will inline even with optimisation turned off. The kernel wants this on all the time so has
```
#define inline                  inline          __attribute__((always_inline))
```
Without optimisation turned on (-O) -finline-functions does nothing. This is because gcc doesn't have enough information to analyse the functions and decide whether to put things inline or not.
Code bloat is an obvious issue with inlining large amounts of code; C++ can be particularly bad with excessive inlining (and is one reason why gcc might choose to ignore inline directives).
Particularly relevant to the kernel is stack bloat -- if you join two functions together that separately use half the available stack when inlined, gcc may not be able to figure out the scope of variables is different, and they will all get separate stack space. For example
```
static inline void stack_user1(void)
{
    int blah[1000];
}

static inline void stack_user2(void)
{
    int blah[1000];
}

/* we use a lot of stack without realising it */
int caller(void)
{
    stack_user1();
    stack_user2();
}
```
Documenation on using inline as a macro with the advantage of type checking.

how to do max and min properly

Thu 08 December 2005

The canonical definition of max is something like

#define max(a,b) ( ( (a) >= (b) ) ? (a) : (b))

But this raises many issues with types in C. What if you pass in an unsigned int and a signed int all at the same time? Under what circumstances is the comparison valid? What type should be returned?

The kernel gets around this by being tricky

/*
 * min()/max() macros that also do
 * strict type-checking.. See the
 * "unnecessary" pointer comparison.
 */
#define min(x,y) ({ \
        typeof(x) _x = (x);     \
        typeof(y) _y = (y);     \
        (void) (&_x == &_y);            \
        _x < _y ? _x : _y; })

The give you a clue to how this works by talking about the pointer comparison. C99 6.5.9.2 says that for the == comparison

Both operands are pointers to qualified or unqualified version of compatible types

Where compatible types are defined in 6.7.2.1

Two types have compatible type if their types are the same.

There are additional rules, but gcc is right to warn you when you compare pointers of incompatible types. Note that if the inputs are not pointers, but arithmetic types, the C99 standard says that normal arithmetic conversions apply. You can see that the outcome of this is really probably not what you want, but the compiler has no reason to warn you.

Thus the above code makes a warning spew out if you try to compare values of incompatible types. You can try it yourself

$ cat test.c
#include <stdio.h>

int main(void)
{
        int i = -100;
        unsigned int j = 100;

        if (&i == &j);

        if (i < j)
                printf("OK\n");
        if (i > j)
                printf("i is -100, why is that > 100?\n");

        return 0;
}

$ gcc -Wall -o test test.c
test.c: In function 'main':
test.c:8: warning: comparison of distinct pointer types lacks a cast

$ ./test
i is -100, why is that > 100?

Not the most straight forward warning in the world, but sufficient to let you know you're doing something wrong. Note we don't get any warning where the problem really is, except we get a value we don't expect. C certainly can be a very tricky language to get right!

What exactly does -Bsymblic do?

Thu 10 November 2005

-Bsymbolic is described by the ld man page as

When creating a shared library, bind references to global symbols to the definition within the shared library, if any. Normally, it is possible for a program linked against a shared library to override the definition within the shared library. This option is only meaningful on ELF platforms which support shared libraries.

Indeed, it is interesting to peer under the hood a little. In fact, the only thing the -Bsymbolic flag does when building a shared library is add a flag in the dynamic section of the binary called DT_SYMBOLIC. Having a look at the ABI description of the flag is a little more instructive.

`` This element's presence in a shared object library alters the dynamic linker's symbol resolution algorithm for references within the library. Instead of starting a symbol search with the executable file, the dynamic linker starts from the shared object itself. If the shared object fails to supply the referenced symbol, the dynamic linker then searches the executable file and other shared objects as usual.``

Ahh, now we can have a peek into the dynamic loader to see what exactly it does with this flag. Pull open the glibc source and have a look at `elf/dl-load.c <http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/elf/dl-load.c?rev=1.249.2.17&content-type=text/x-cvsweb-markup&cvsroot=glibc>`_, particularly at the bit in _dl_map_obj_from_fd where we have

/* If this object has DT_SYMBOLIC set modify now its scope.  We don't
    have to do this for the main map.  */
 if ((mode & RTLD_DEEPBIND) == 0
     && __builtin_expect (l->l_info[DT_SYMBOLIC] != NULL, 0)
     && &l->l_searchlist != l->l_scope[0])
   {
     /* Create an appropriate searchlist.  It contains only this map.
        This is the definition of DT_SYMBOLIC in SysVr4.  */
     l->l_symbolic_searchlist.r_list =
       (struct link_map **) malloc (sizeof (struct link_map *));

     if (l->l_symbolic_searchlist.r_list == NULL)
       {
         errstring = N_("cannot create searchlist");
         goto call_lose_errno;
       }

     l->l_symbolic_searchlist.r_list[0] = l;
     l->l_symbolic_searchlist.r_nlist = 1;

     /* Now move the existing entries one back.  */
     memmove (&l->l_scope[1], &l->l_scope[0],
              (l->l_scope_max - 1) * sizeof (l->l_scope[0]));

     /* Now add the new entry.  */
     l->l_scope[0] = &l->l_symbolic_searchlist;
   }

So we can see (in typical glibcistic code) that we insert ourselves at the beginning of the link map, and move everything else along further down the chain.

Finally, an example to illustrate. It's a bit long, but worth understanding.

ianw@lime:/tmp/symbolic$ cat Makefile
all: test testsym

clean:
        rm -f *.so test testsym

liboverride.so : override.c
        $(CC) -shared -fPIC -o liboverride.so override.c

libtest.so : libtest.c
        $(CC) -shared -fPIC -o libtest.so libtest.c

libtestsym.so : libtest.c
        $(CC) -shared -fPIC -Wl,-Bsymbolic -o libtestsym.so libtest.c

test : test.c libtest.so liboverride.so
        $(CC) -L. -ltest -o test test.c

testsym : test.c libtestsym.so liboverride.so
        $(CC) -L. -ltestsym -o testsym test.c

ianw@lime:/tmp/symbolic$ cat libtest.c
#include <stdio.h>

int foo(void) {
        printf("libtest foo called\n");
        return 1;
}

int test_foo(void)
{
        return foo();
}
ianw@lime:/tmp/symbolic$ cat override.c
#include <stdio.h>

int foo(void)
{
        printf("override foo called\n");
        return 0;
}
ianw@lime:/tmp/symbolic$ cat test.c
#include <stdio.h>

int main(void)
{
        printf("%d\n", test_foo());
}


ianw@lime:/tmp/symbolic$ make
cc -shared -fPIC -o libtest.so libtest.c
cc -shared -fPIC -o liboverride.so override.c
cc -L. -ltest -o test test.c
cc -shared -fPIC -Wl,-Bsymbolic -o libtestsym.so libtest.c
cc -L. -ltestsym -o testsym test.c

So we have an application, a library (compiled once with -Bsymbolic and once without) and another library with a symbol foo that will override our original definition.

ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. ./test
libtest foo called
1
ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. ./testsym
libtest foo called
1

No surprises here ... both libraries call the "builtin" foo and everything is fine.

ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. LD_PRELOAD=./liboverride.so  ./test
override foo called
0
ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. LD_PRELOAD=./liboverride.so  ./testsym
libtest foo called
1

Now we see the difference. The preloaded foo was ignored in the second case, because the libtest was told to look within itself for symbols before looking outside.

-Bsymbolic is a rather big hammer to maintain symbol visiblity with however; it makes it all or nothing. Chances are you'd be better of with a version script. For example, with the following version script we can achieve the same thing.

ianw@lime:/tmp/symbolic$ cat Versions
{global: test_foo;  local: *; };
ianw@lime:/tmp/symbolic$ make
cc -shared -fPIC -Wl,-version-script=Versions -o libtestver.so libtest.c
cc -L. -ltestver -o testver test.c
ianw@lime:/tmp/symbolic$ LD_LIBRARY_PATH=. LD_PRELOAD=./liboverride.so ./testver
libtest foo called
1

You can see that since the version of foo in libtestver.so was local it got bound before the LD_PRELOAD, so has the same effect as -Bsymbolic.

Moral of the story? Version your symbols!

Update 2010-03-22 : see the update for more information on code optimisation with -Bsymbolic.

Stripping unused functions

Wed 09 November 2005

benno raised an interesting question on getting rid of unused functions from a binary. If you link a group of object files together, the linker will be smart enough to leave out any files that it know doesn't get used. But what about functions that don't get used?

The best solution I could think of was to use -ffunction-sections and then --gc-sections to remove those sections that aren't used.

$ cat foo.c
int foo(void) {};
int bar(void) {};

$ cat main.c
int
main(void)
{
  foo();
  return 0;
};

$ gcc -c main.c
$ gcc -c -ffunction-sections foo.c
$ gcc -Wl,--gc-sections foo.o main.o -o program
$ objdump -d ./program | grep bar

This seems well out of scope for the linker to resolve, as it's job isn't to splice and dice text sections. The compiler doesn't really have enough information to see what is going on either. I think it would be possible to analyse the relocations in a group of pre-linked object files, and then compare that against a list of global functions in those same object files and if there isn't a relocation against it, prune the function out of the object.

I think the nicest way to do that would be with libelf wrappers for Python. But surely someone has looked at doing that before?

cdecl

Wed 02 November 2005

cdecl is another tool that has been around since the days when Trumpet Winsock and SLIP were cutting edge (the README says 1996) which I just found a use for.

I find it handy for debugging things like

#include <stdio.h>
#include <setjmp.h>

jmp_buf j;

extern int* fun1(void);
extern int* fun2(void);
extern int* fun3(void);
int* foo(int *p);

int* foo (int *p)
{
    p = fun1();
    if (setjmp (j))
       return p;

       /* longjmp (j) may occur in fun3. */
       p = fun3();
       return p;
}
ianw@lime:/tmp$ gcc -g -c  -O2 -W -Wall -Wstrict-prototypes -Wmissing-prototypes -Werror -g -O3 jmp.c
cc1: warnings being treated as errors
jmp.c: In function 'foo':
jmp.c:11: warning: argument 'p' might be clobbered by 'longjmp' or 'vfork'

I can never remember the correct volatile for this situation

ianw@lime:~$ cdecl
cdecl> explain volatile int *p
declare p as pointer to volatile int
cdecl> explain int *volatile v
declare p as volatile pointer to int

So clearly the second version is the one I want, since I need to indicate that the pointer value might change underneath us. Nifty!

Investigating floats

Thu 20 October 2005

float variables are, on most current hardware, instrumented as as single precision IEEE 754 value. We know that floating point values only approximate real values, but it is often hard to visualise. When we have the binary decimal 0.12345 we are really saying something like 1*10-1 + 2*10-2 + 3*10-3 .... Thus a binary decimal 0.10110 describes 1*2-1 + 0*2-2 + 1*2-3 ...

So clearly, with one decimal digit we can describe the range 0.0 - 0.9, but with one binary digit we can describe only 0 and 1/2 = 0.5. The fractions we can represent exactly are the dyadic rationals. Other rationals repeat (just like in base 10) and others are irrational. So for example, 0.1 can not be repesented exactly without infinite precision.

The program below illustrates how a single precision floating point value works by showing what bits are set and adding them up.

For example

$ ./float .5
0.500000 = 1 * (1/2^0) * 2^-1
0.500000 = 1 * (1/1) * 2^-1
0.500000 = 1 * 1 * 0.500000
0.500000 = 0.5
$ ./float .1
0.100000 = 1 * (1/2^0 + 1/2^1 + 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + 1/2^16 + 1/2^17 + 1/2^20 + 1/2^21 + 1/2^23) * 2^-4
0.100000 = 1 * (1/1 + 1/2 + 1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + 1/65536 + 1/131072 + 1/1048576 + 1/2097152 + 1/8388608) * 2^-4
0.100000 = 1 * 1.60000002384 * 0.062500
0.100000 = 0.10000000149

I use doubles to show how the errors creep in.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* return 2^n */
int two_to_pos(int n)
{
    if (n == 0)
        return 1;
    return 2 * two_to_pos(n - 1);
}

double two_to_neg(int n)
{
    if (n == 0)
        return 1;
    return 1.0 / (two_to_pos(abs(n)));
}

double two_to(int n)
{
    if (n >= 0)
        return two_to_pos(n);
    if (n < 0)
        return two_to_neg(n);
    return 0;
}

/* Go through some memory "m" which is the 24 bit significand of a
   floating point number.  We work "backwards" from the bits
   furthest on the right, for no particular reason. */
double calc_float(int m, int bit)
{
    /* 23 bits; this terminates recursion */
    if (bit > 23)
        return 0;

    /* if the bit is set, it represents the value 1/2^bit */
    if ((m >> bit) & 1)
        return 1.0L/two_to(23 - bit) + calc_float(m, bit + 1);

    /* otherwise go to the next bit */
    return calc_float(m, bit + 1);
}

int main(int argc, char *argv[])
{
    float f;
    int m,i,sign,exponent,significand;

    if (argc != 2)
    {
        printf("usage: float 123.456\n");
        exit(1);
    }

    if (sscanf(argv[1], "%f", &f) != 1)
    {
        printf("invalid input\n");
        exit(1);
    }

    /* We need to "fool" the compiler, as if we start to use casts
       (e.g. (int)f) it will actually do a conversion for us.  We
       want access to the raw bits, so we just copy it into a same
       sized variable. */
    memcpy(&m, &f, 4);

    /* The sign bit is the first bit */
    sign = (m >> 31) & 0x1;

    /* Exponent is 8 bits following the sign bit */
    exponent = ((m >> 23) & 0xFF) - 127;

    /* Significand fills out the float, the first bit is implied
       to be 1, hence the 24 bit OR value below. */
    significand = (m & 0x7FFFFF) | 0x800000;

    /* print out a power representation */
    printf("%f = %d * (", f, sign ? -1 : 1);
    for(i = 23 ; i >= 0 ; i--)
    {
        if ((significand >> i) & 1)
            printf("%s1/2^%d", (i == 23) ? "" : " + ",
                   23-i);
    }
    printf(") * 2^%d\n", exponent);

    /* print out a fractional representation */
    printf("%f = %d * (", f, sign ? -1 : 1);
    for(i = 23 ; i >= 0 ; i--)
    {
        if ((significand >> i) & 1)
            printf("%s1/%d", (i == 23) ? "" : " + ",
                   (int)two_to(23-i));
    }
    printf(") * 2^%d\n", exponent);

    /* convert this into decimal and print it out */
    printf("%f = %d * %.12g * %f\n",
           f,
           (sign ? -1 : 1),
           calc_float(significand, 0),
           two_to(exponent));

    /* do the math this time */
    printf("%f = %.12g\n",
           f,
           (sign ? -1 : 1) *
           calc_float(significand, 0) *
           two_to(exponent)
        );

    return 0;
}