printf tricks

A sometimes handy trick is registering your own conversion specifiers for printf. It is GNU only, but still useful. For example, glibc comes with a built-in handler, printf_size, that reduces a floating point value to a human readable size for you, e.g.

$ cat test.c
#include <stdio.h>
#include <printf.h>

int main(void)
{
        float bytes = 1024.0 * 1024.0 * 1024.0;

        register_printf_function ('b', printf_size, printf_size_info);

        printf("%.0bB\n", bytes);

}

$ gcc -o test test.c
$ ./test
1gB

If we register the handler with a capital letter, the divisor will be 1000; otherwise it is 1024. It is unfortunate that this doesn't appear to print proper SI prefixes (note the lowercase g above), but it is still handy (and may one day).

The glibc manual has full details on customising printf further. You can easily add a specifier for your own structures, etc.
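
The glibc manual walks through an example of exactly this; below is a hedged sketch in the same style (the struct point, the %P specifier and the function names are mine, purely for illustration):

#include <stdio.h>
#include <printf.h>

struct point {
        int x, y;
};

/* handler: formats one struct point * argument */
static int print_point(FILE *stream, const struct printf_info *info,
                       const void *const *args)
{
        const struct point *p = *((const struct point **) args[0]);
        char buf[64];

        snprintf(buf, sizeof(buf), "(%d,%d)", p->x, p->y);

        /* honour the field width and left-justification flag */
        return fprintf(stream, "%*s",
                       info->left ? -info->width : info->width, buf);
}

/* arginfo: tell printf we consume exactly one argument, a pointer */
static int print_point_arginfo(const struct printf_info *info, size_t n,
                               int *argtypes)
{
        if (n > 0)
                argtypes[0] = PA_POINTER;
        return 1;
}

int main(void)
{
        struct point p = { .x = 3, .y = 4 };

        register_printf_function('P', print_point, print_point_arginfo);

        /* gcc -Wall will warn about the unknown %P conversion, but it works */
        printf("the point is %P\n", &p);

        return 0;
}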

Even numbers

Is checking the low bit, as in

int odd(int n)
{
        return (n & 1);
}

a valid way of determining whether a number is odd or even? For unsigned integers, yes. For signed positive integers, also yes. However, C99 6.2.6.2 paragraph 2 states

For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. [...] If the sign bit is one, the value shall be modified in one of the following ways

  • the corresponding value with the sign bit 0 is negated (sign and magnitude);
  • the sign bit has the value -(2^N) (two's complement);
  • the sign bit has the value -(2^N - 1) (one's complement).

(Here N is the number of value bits.) In other words, there are three possible representations of a signed integer in C, and one of them (one's complement) has negative representations that fail the above test. For example, in 8-bit one's complement -2 is 1111 1101, so -2 & 1 gives 1 and -2 is wrongly reported as odd; in two's complement it is 1111 1110 and the test works. However, I don't think there are actually any one's complement machines in use, so you're pretty safe.

However, Murphy's law dictates that eventually your code will run on some bizarre device that uses one's complement. Thus the best bet is to use the remainder operator (%) and let the compiler figure it out for you. Have a look at what the compiler gets up to for the most straightforward case:

$ cat test.c
int function(int i)
{
        return (i % 2);
}

$ gcc -O2 -S test.c
$ cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
.globl function
        .type   function, @function
function:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        andl    $-2147483647, %eax
        js      .L5
        popl    %ebp
        ret
        .p2align 4,,15
.L5:
        decl    %eax
        orl     $-2, %eax
        incl    %eax
        popl    %ebp
        ret
        .size   function, .-function
        .ident  "GCC: (GNU) 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)"
        .section        .note.GNU-stack,"",@progbits

It's a bit tricky because the remainder needs to account for negative values. $-2147483647 is -0x7FFFFFFF, which in two's complement is binary 1000 0000 0000 0000 0000 0000 0000 0001. So the andl checks the low bit just as we did before, but because the mask also has the top bit set, the sign flag of the result is set whenever the top bit of the input is set (i.e. the input is negative), and the js (jump if sign) is taken. The fix-up code at .L5 is a consequence of C99 6.5.5 paragraph 6, which requires (a/b)*b + a%b == a; with division truncating toward zero, -3 / 2 is -1, so -3 % 2 must be -1 rather than 1.

We can reduce the function down to

$ cat test.c
int function(int i)
{
        return ((unsigned int)i % 2);
}

$ gcc -O2 -S test.c

$ cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
.globl function
        .type   function, @function
function:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        andl    $1, %eax
        popl    %ebp
        ret
        .size   function, .-function
        .ident  "GCC: (GNU) 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)"
        .section        .note.GNU-stack,"",@progbits

This is the same code a plain & 1 check would produce on this two's complement x86. In this trivial case the cast to unsigned could equally have been used with the original & 1 check, but my feeling is that the % is a little more clear (and if you need to worry about negative values your hand is forced anyway).

I think the moral is to be wary whenever you pull out a bitwise operator; prefer to pass the work off to the compiler (with a quick sanity check of its output).
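
If you want the portable spelling without worrying about representations, something like the following minimal sketch (the function names are mine) lets the compiler pick the best instruction sequence:

/* parity tests that are correct for any signed representation */
int is_even(int n)
{
        return n % 2 == 0;
}

int is_odd(int n)
{
        /* note: n % 2 is -1 for negative odd n, so compare against 0 */
        return n % 2 != 0;
}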

Buffering with streams

When you know about buffering it's trivial, but when you don't you can get some very unexpected behaviour from UNIX.

For example, the following program will print nothing.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        printf("hello world");

        while (1)
                sleep(1);
}

This is because by default your stdout stream (which printf writes to) is line buffered when connected to a terminal: output is not actually written out until a newline (\n) is seen (or the buffer fills).

You can either add the newline (as was probably intended, but it's an easy enough thing to leave off!) or manually call fflush to flush the buffer.
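
For instance, a minimal sketch of the fflush approach, which gets the text out even without the newline:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        printf("hello world");
        fflush(stdout);   /* push the buffered output out now */

        while (1)
                sleep(1);
}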

There are three buffering modes available: unbuffered, line buffered and block (fully) buffered. You can select between them with the setvbuf call (setbuf is a simpler variant). Unbuffered is the default for stderr, while a stream connected to a tty defaults to line buffering. Anything connected to a file on disk or to a pipe takes on block buffering, where data is written out in chunks of a size that suits the underlying device.
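
A minimal sketch of selecting a mode explicitly with setvbuf (which must be called before any other operation on the stream):

#include <stdio.h>

int main(void)
{
        /* _IONBF = unbuffered, _IOLBF = line buffered, _IOFBF = block buffered */
        setvbuf(stdout, NULL, _IONBF, 0);

        /* with stdout unbuffered this appears immediately,
         * even without a trailing newline */
        printf("hello world");

        return 0;
}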

But essentially, if your output is missing, check for those newlines!

sched_yield considered harmful

According to recent discussions sched_yield can cause performance problems with some code.

I haven't looked at the OpenLDAP code, but a common problem is that calling pthread_cond_signal doesn't actually relinquish control from the current thread to the thread (if any) that has been signalled. The signalled thread has to wait for its next time slice before it realises it has been woken up, while the signalling thread spins around the condition variable's mutex doing nothing useful. Thus a common idiom is to call pthread_cond_signal and then sched_yield to hand control from your thread to the waiting thread (I've certainly done this myself).

This can lead to very bad performance under certain circumstances, since sched_yield on Linux tells the scheduler "I have nothing important to do". That may be the opposite of the truth: developers usually hit this problem precisely when there is so much work to do that the signalled thread never gets a chance to run and make forward progress. It is especially bad on an SMP machine, where you probably don't want to relinquish control at all; you don't need to give up your timeslice for other threads to start running on other processors, so what helps forward progress on a uniprocessor may be slowing you down needlessly on SMP. On some other systems sched_yield will relinquish control to another thread in your thread group, but this is not specified by the standard so you cannot depend on it.

One solution to such a problem is presented below. This is a fairly standard situation where a request is made and needs to be handled. To avoid sched_yield, the requests are queued for a dispatcher thread to handle, and an extra lock is added that the requester can sleep on when required. On a uniprocessor, when requests are flooding in, request() will just loop queueing work until its timeslice is complete. The typical idiom is to place a yield after the unlock of dispatcher.mutex to allow the dispatcher to do some work; but by adding another lock you can instead make request() sleep, giving the dispatcher time to clear the queue. You will notice that if you enable the usleep, everything happens nice and sequentially on either a uniprocessor or SMP, because the dispatcher gets a chance to run as the requests go to sleep. Conversely, without the sleep the requests flood in as fast as possible; to keep your request handling latency reasonable you don't want to be giving up the CPU to every Tom, Dick and Harry every time you handle a new request!

Compile this once as

$ gcc -Wall -o server server.c -lpthread

to see the problem, then once as

$ gcc -Wall -o server-resched server.c -lpthread -DRESCHED

to see the dispatcher handling the batched requests. I believe this is a good solution to minimise latency below a CPU timeslice when requests are flooding, without having to "fool" the scheduler with sched_yield. Something to keep in mind is that if you find yourself using sched_yield when you may very well have more useful work to do, then chances are you're doing the wrong thing.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>

struct request_tag_t {
    struct request_tag_t *next;
    int operation;
};

struct dispatcher_tag_t {
    struct request_tag_t *first;
    struct request_tag_t *last;
    int running;
    pthread_mutex_t mutex;
    pthread_cond_t request;
    pthread_mutex_t resched;
};

struct dispatcher_tag_t dispatcher = {
    .first = NULL,
    .last = NULL,
    .running = 0,
    .mutex = PTHREAD_MUTEX_INITIALIZER,
    .request = PTHREAD_COND_INITIALIZER,
    .resched = PTHREAD_MUTEX_INITIALIZER,
};

int rand_num(int range)
{
    return (int)((range)*1.0*rand()/(RAND_MAX+1.0));
}

void *dispatcher_routine(void *arg)
{
    struct request_tag_t *request;

    printf("dispatcher starting\n");

    while (1)
    {
        /* take the dispatcher queue mutex */
        pthread_mutex_lock(&dispatcher.mutex);

        /* if nothing to do, wait on the dispatcher queue
         * condition variable until there is */
        while (dispatcher.first == NULL)
            pthread_cond_wait(&dispatcher.request, &dispatcher.mutex);

        /* take a request from the queue */
        request = dispatcher.first;
        dispatcher.first = request->next;
        if (dispatcher.first == NULL)
            dispatcher.last = NULL;

        /* handle request, probably by handling it in a new thread. */
        printf("I'm dispatching request %d\n", request->operation);

        /* free request we have just finished with */
        free(request);

        /* release queue lock */
        pthread_mutex_unlock(&dispatcher.mutex);
#ifdef RESCHED
        /* release the resched lock to signal we've seen the queue */
        pthread_mutex_unlock(&dispatcher.resched);
#endif
    }
}

int tries = 1;

void request(int operation)
{
    struct request_tag_t *request;

    /* take the dispatcher queue mutex */
    pthread_mutex_lock(&dispatcher.mutex);

    /* start the dispatcher thread if it isn't already running */
    if (!dispatcher.running)
    {
        pthread_t thread;
        printf("dispatcher not running, starting\n");
        dispatcher.running = 1;
        pthread_create(&thread, NULL, dispatcher_routine, NULL);
    }

    /* allocate a new request */
    request = malloc(sizeof(struct request_tag_t));
    if (request == NULL)
        exit(1);
    request->next = NULL;
    request->operation = operation;

    /* add to the queue, maintaining first and last */
    if (dispatcher.first == NULL) {
        dispatcher.first = request;
        dispatcher.last = request;
    } else {
        (dispatcher.last)->next = request;
        dispatcher.last = request;
    }

    /* signal something to do */
    pthread_cond_signal(&dispatcher.request);

    /* free the queue lock */
    pthread_mutex_unlock(&dispatcher.mutex);

    /* tempting to put a sched_yield() here */

#ifdef RESCHED
    /* try the resched lock.  if we can't have it, that means
     * we've already got it and the dispatch thread hasn't had a
     * chance to run and unlock it.  We queue up to 10 requests
     * before allowing the dispatcher to run by sleeping on the
     * mutex
     */
    if (pthread_mutex_trylock(&dispatcher.resched) == EBUSY) {
        if (tries++ == 10) {
            pthread_mutex_lock(&dispatcher.resched);
            tries = 1;
        }
    } else
        tries = 1;
#endif
}

int main(void) {
    while(1)
    {
        int n = rand_num(3);
        printf("requesting %d\n", n);
        request(n);

        /* if you uncomment this, everything should play
         * nicely with one dispatch per request, because other
         * threads get a chance to run.  Otherwise we have
         * requests flooding in as fast as possible
         */

        /* usleep(500000); */
    }
}

on structure alignment

C99 6.7.2.1 paragraph 12 states

Each non-bit-field member of a structure or union object is aligned in an implementation-defined manner appropriate to its type.

For example, gcc 4 has changed the way structures line up on the stack for IA64.

#include <stdio.h>

struct disk_stat {
        int fd;
        unsigned count;
};

int main(void)
{
        int blah;
        struct disk_stat test;

        printf("%p\n", &test);
}
ianw@lime:/tmp$ vi test.c
ianw@lime:/tmp$ gcc-4.0 -o test test.c
ianw@lime:/tmp$ ./test
0x60000fffff8eb480
0x60000fffff8eb484
ianw@lime:/tmp$ gcc-3.4 -o test test.c
ianw@lime:/tmp$ ./test
0x60000fffffafb470
0x60000fffffafb480

This is allowable because the two members of the structure (an int and an unsigned) only require 4 byte alignment. It may make for worse code, though; I guess the lesson is to think about how your structure might be laid out, and if required give it explicit alignment.
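
For example, if other code really does depend on the old 16 byte placement, gcc can be told so explicitly; a sketch using the aligned attribute (16 is just an illustrative value):

struct disk_stat {
        int fd;
        unsigned count;
} __attribute__((aligned(16)));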

const with functions

gcc 4 now throws up a warning about functions defined with const; it takes the const as a qualifier on the return type.

ianw@lime:/tmp$ cat const.c
static const int const_function(int i)
{
    return i + 100;
}

void call_const_function(void)
{
        int b = const_function(20);
}
ianw@lime:/tmp$ gcc-4.0 -Wall -c const.c
const.c:2: warning: type qualifiers ignored on function return type
const.c: In function 'call_const_function':
const.c:11: warning: unused variable 'b'

When you declare a function const you are telling the compiler that it examines nothing but its arguments (not global memory, and not memory through pointer arguments) and has no side effects other than its return value. This allows the compiler to be smarter when compiling the code, for example eliminating repeated calls with the same arguments.

Type qualifiers (like const) are ignored on a function return type; the returned value is not an lvalue, so qualifying it means nothing, and gcc 4 warns to make that clear.

Thus the way to indicate to gcc that the function is const is via __attribute__((const)).
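
A sketch of the earlier function with the attribute applied (the attribute can equally go on a prototype in a header):

static int const_function(int i) __attribute__((const));

static int const_function(int i)
{
        return i + 100;
}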

Actually, I built two IA64 kernels, one with a const function properly attributed and one with it not, and the generated code wasn't actually any different. But it might be, one day! I am told that in the past, measurable improvements have been seen when an attributed function is called twice from within the same function.

Using LD_PRELOAD to override a function

For some reason, people seem to get this quite wrong a lot of the time. Certainly one should not be playing with symbols that start with __ unless you really know what you're doing with them.

ianw@lime:~/tmp/override$ cat override.c
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <dlfcn.h>

pid_t getpid(void)
{
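        /* look up the next occurrence of the getpid symbol in the
         * search order after this preloaded library, i.e. the "real"
         * getpid from libc */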
        pid_t (*orig_getpid)(void) = dlsym(RTLD_NEXT, "getpid");
        printf("Calling GETPID\n");

        return orig_getpid();
}

ianw@lime:~/tmp/override$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        printf("%d\n", getpid());
}

ianw@lime:~/tmp/override$ gcc -shared -fPIC -o liboverride.so override.c -ldl
ianw@lime:~/tmp/override$ gcc -o test test.c
ianw@lime:~/tmp/override$ LD_PRELOAD=./liboverride.so ./test
Calling GETPID
15187

Superblock scheduling and tail duplication

I hate it when compiler people start talking and I have no idea what they are on about. It turns out if you take the time to read their papers, you can glean a little of the black art that is compilers.

Superblocks and tail duplication are quite important on an architecture like IA64, because the compiler has to find as much instruction level parallelism (ILP) as possible. What this means is that the longer the list of instructions you can get queued up which don't depend on each other, the better. Once an instruction depends on an earlier one you have to introduce a stop, which slows things down and leaves the processor waiting around doing nothing.

So a superblock is an abstraction the compiler builds internally by finding blocks of instructions that are likely to execute sequentially. Superblocks have one entry point, but may have multiple exit points. The compiler can create superblocks either by static analysis of the code (guessing which paths are likely by looking at them at compile time; an interesting problem worthy of hours on its own) or by looking at profile information gathered from running the code (i.e. which paths are followed under real conditions). Tail duplication is an extension that makes even bigger superblocks by essentially copying code. An example might illustrate:

function()
{
  do {
      /* entry */
      if (something_likely)
         /* part 1 */
      else
         /* part 2 */
      /* both */
  } while (something)
}

Could be changed to something like

function()
{
a:
 /* entry */               |-> superblock 1
 if (something_likely) {   |
    /* part 1 */           |
    /* both   */           |
    if (something)         |
       goto a              |
    else                   |
       return              |
 }
 /* part 2 */           |-> superblock 2
 /* both   */           |
 if (something)         |
    goto a              |
 else                   |
    return              |
}

You can see how it duplicates the common part of the code (both) to make larger blocks; this is tail duplication. Notice how the first superblock takes into account the entry block as well -- since the compiler knows the if statement is likely to be taken (from some sort of static analysis or profiling information), it should try to keep the whole thing together.

Superblocks let the compiler make better decisions about code scheduling. Imagine that in entry you do a load from some known memory address. You only read this value in superblock 1, never write it. However, say that in superblock 2 you do a store to an address unknown at compile time (this is called a hazard). You then can't hoist that first load out of entry, because if you end up taking superblock 2 the store may change the value (the compiler can't know what is being changed, so it has to assume the worst). Thus you end up loading the same thing on every iteration, even though it probably hasn't changed. But you want your superblock as big as possible, so you really want to include the entry block ... you can't have your cake and eat it too!

Luckily, smart people have come up with ways to analyse a superblock and remove what they call loop invariant code from it. This means the compiler can realise it can move the load in entry out of superblock 1 into another block, since superblock 1 never changes the value. The compiler then makes superblock 2 branch back to this "new" block rather than straight to superblock 1. The value gets loaded once on first entry to superblock 1, and again only if it is (possibly) changed by superblock 2. Enjoy your cake!
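
Continuing the pseudocode from above (a sketch only; the load label is mine, and *p stands for the known address loaded in entry):

function()
{
load:
 /* value = *p */          |-> loop invariant block (load hoisted out)
a:
 /* entry, uses value */   |-> superblock 1, never writes *p
 if (something_likely) {   |
    /* part 1 */           |
    /* both   */           |
    if (something)         |
       goto a              |   /* no store happened, value still valid */
    else                   |
       return              |
 }
 /* part 2, stores through an unknown pointer */   |-> superblock 2
 /* both   */           |
 if (something)         |
    goto load           |   /* *p may have changed, reload it */
 else                   |
    return              |
}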

So next time a compiler geek starts talking about superblocks, maybe we'll almost be able to keep up!