sched_yield considered harmful

According to recent discussions, sched_yield can cause performance problems with some code.

I haven't looked at the OpenLDAP code, but a common problem is that calling pthread_cond_signal doesn't actually relinquish control from the current thread to the thread (if any) that has been signalled. The signalled thread has to wait for its next time slice before it realises it has been woken up, while the signalling thread loops around the mutex associated with the condition variable, doing nothing useful. Thus a common idiom is to call pthread_cond_signal and then sched_yield to hand control from your thread to the waiting thread (I've certainly done this myself).
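As a rough illustration of that idiom, here is a minimal sketch; the queue_mutex, queue_cond and enqueue() names are placeholders of mine, not taken from any particular project:

#include <pthread.h>
#include <sched.h>

/* Illustrative only: these stand in for whatever protects the shared
 * work queue in your program. */
extern pthread_mutex_t queue_mutex;
extern pthread_cond_t queue_cond;
extern void enqueue(int item);

void submit_work(int item)
{
    pthread_mutex_lock(&queue_mutex);
    enqueue(item);                      /* make some work available */
    pthread_cond_signal(&queue_cond);   /* wake a waiting worker... */
    pthread_mutex_unlock(&queue_mutex);

    /* ...and immediately give up the CPU in the hope the woken thread
     * runs right away, rather than waiting for our slice to end.  This
     * is the pattern the rest of this article argues against. */
    sched_yield();
}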

This can lead to very bad performance under certain circumstances, since calling sched_yield on Linux tells the scheduler "I have nothing important to do". That may be the opposite of what is actually true: developers usually see this problem when there is so much work coming in that the woken thread never gets a chance to run and make forward progress on it. It is especially bad on an SMP machine, where you probably don't want to relinquish control at all; you don't need to give up your timeslice for other threads to start running on other processors, so what allows forward progress on a uniprocessor may be slowing you down needlessly on an SMP machine. On some other systems sched_yield will relinquish control to another thread in your thread group, but this behaviour is not specified by the standard, so you cannot depend on it.

One solution to such a problem is presented below. It models a fairly standard situation where a request is made and needs to be handled. To avoid using sched_yield, the requests are queued for a dispatcher thread to handle, and an extra lock is taken that the requesting thread can sleep on when required. On a uniprocessor, when requests are flooding in, request() will just keep looping and queueing work until its timeslice is complete. The typical idiom would be to place a yield after the unlock of dispatcher.mutex to let the dispatcher do some work; by adding another lock, however, you can make request() sleep, giving the dispatcher time to clear the queue. You will notice that if you include the usleep, everything happens nice and sequentially on either a uniprocessor or an SMP machine, because the dispatcher gets a chance to run while the requests sleep. Conversely, without the sleep the requests flood in as fast as possible; to keep your request handling latency reasonable you don't want to be giving up the CPU to every Tom, Dick and Harry every time you handle a new request!

Compile this once as

$ gcc -Wall -o server server.c -lpthread

to see the problem, then once as

$ gcc -Wall -o server-resched server.c -lpthread -DRESCHED

to see the dispatcher handling the batched requests. I believe this is a good way to keep latency below a CPU timeslice when requests are flooding in, without having to "fool" the scheduler with sched_yield. Something to keep in mind: if you find yourself using sched_yield when you may very well have more useful work to do, chances are you're doing the wrong thing.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>

struct request_tag_t {
    struct request_tag_t *next;
    int operation;
};

struct dispatcher_tag_t {
    struct request_tag_t *first;
    struct request_tag_t *last;
    int running;
    pthread_mutex_t mutex;
    pthread_cond_t request;
    pthread_mutex_t resched;    /* gate request() sleeps on when built with -DRESCHED */
};

struct dispatcher_tag_t dispatcher = {
    .first = NULL,
    .last = NULL,
    .running = 0,
    .mutex = PTHREAD_MUTEX_INITIALIZER,
    .request = PTHREAD_COND_INITIALIZER,
    .resched = PTHREAD_MUTEX_INITIALIZER,
};

int rand_num(int range)
{
    return (int)((range)*1.0*rand()/(RAND_MAX+1.0));
}

void *dispatcher_routine(void *arg)
{
    struct request_tag_t *request;

    printf("dispatcher starting\n");

    while (1)
    {
        /* take the dispatcher queue mutex */
        pthread_mutex_lock(&dispatcher.mutex);

        /* if nothing to do, wait on the dispatcher queue
         * condition variable until there is */
        while (dispatcher.first == NULL)
            pthread_cond_wait(&dispatcher.request, &dispatcher.mutex);

        /* take a request from the queue */
        request = dispatcher.first;
        dispatcher.first = request->next;
        if (dispatcher.first == NULL)
            dispatcher.last = NULL;

        /* handle the request; a real server would probably hand it off to a worker thread */
        printf("I'm dispatching request %d\n", request->operation);

        /* free request we have just finished with */
        free(request);

        /* release queue lock */
        pthread_mutex_unlock(&dispatcher.mutex);
#ifdef RESCHED
        /* release the resched lock to signal we've seen the queue.
         * Note this may unlock a mutex locked by the request thread;
         * that isn't strictly portable, but the default Linux mutex
         * doesn't check ownership, so here it acts as a simple gate. */
        pthread_mutex_unlock(&dispatcher.resched);
#endif
    }
}

int tries = 1;

void request(int operation)
{
    struct request_tag_t *request;

    /* take the dispatcher queue mutex */
    pthread_mutex_lock(&dispatcher.mutex);

    /* start the dispatcher thread if it isn't already running */
    if (!dispatcher.running)
    {
        pthread_t thread;
        printf("dispatcher not running, starting\n");
        dispatcher.running = 1;
        pthread_create(&thread, NULL, dispatcher_routine, NULL);
    }

    /* allocate a new request */
    request = malloc(sizeof(struct request_tag_t));
    if (request == NULL)
        exit(1);
    request->next = NULL;
    request->operation = operation;

    /* add to the queue, maintaining first and last */
    if (dispatcher.first == NULL) {
        dispatcher.first = request;
        dispatcher.last = request;
    } else {
        (dispatcher.last)->next = request;
        dispatcher.last = request;
    }

    /* signal something to do */
    pthread_cond_signal(&dispatcher.request);

    /* free the queue lock */
    pthread_mutex_unlock(&dispatcher.mutex);

    /* tempting to put a sched_yield() here */

#ifdef RESCHED
    /* try the resched lock.  If we can't have it, that means
     * we already hold it and the dispatcher thread hasn't had a
     * chance to run and unlock it.  We queue up to 10 requests
     * before letting the dispatcher run by sleeping on the
     * mutex.
     */
    if (pthread_mutex_trylock(&dispatcher.resched) == EBUSY) {
        if (tries++ == 10) {
            pthread_mutex_lock(&dispatcher.resched);
            tries = 1;
        }
    } else
        tries = 1;
#endif
}

int main(void) {
    while(1)
    {
        int n = rand_num(3);
        printf("requesting %d\n", n);
        request(n);

        /* if you uncomment this, everything should play
         * nicely with one dispatch per request, because other
         * threads get a chance to run.  Otherwise we have
         * requests flooding in as fast as possible
         */

        /* usleep(500000); */
    }
}