The quickest way to do nothing

As I was debugging something recently, an instruction popped up that seemed a little incongruous:

lea 0x0(%edi,%eiz,1),%edi

Now this is an interesting instruction on a few levels. Firstly, %eiz is a pseudo-register that simply evaluates to zero, somewhat like r0 on MIPS; I don't think it is really in common usage. But when you look closer, this instruction is a fancy way of doing nothing. It's a little clearer in Intel syntax:

lea    edi,[edi+eiz*1+0x0]

So we can see that this uses scaled indexed addressing to load into %edi the value of %edi plus 0 * 1, with a displacement of 0x0. In other words, it puts the value of %edi into %edi, which does nothing. So why would this appear?
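To make the address arithmetic concrete, here is a minimal sketch in C of what lea computes for this addressing mode. The helper name effective_address is my own invention for illustration; it only models the base + index * scale + displacement calculation and is not how the processor or any tool implements it.

```c
#include <stdint.h>

/* Hypothetical helper (not from any library): models the 32-bit
 * effective address base + index * scale + displacement that lea
 * computes for scaled indexed addressing. Arithmetic wraps at 32 bits. */
static uint32_t effective_address(uint32_t base, uint32_t index,
                                  uint32_t scale, int32_t disp)
{
    return base + index * scale + (uint32_t)disp;
}
```

With an index of zero, a scale of 1 and a displacement of 0, effective_address(x, 0, 1, 0) is just x for any x, which is why writing the result back into the base register changes nothing.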

What we can see from the disassembly is that this single instruction takes up an impressive 7 bytes:

8048489:    8d bc 27 00 00 00 00    lea    edi,[edi+eiz*1+0x0]
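To see where the 7 bytes go, the bytes after the 0x8d opcode can be split into their ModRM and SIB fields. The sketch below is a simplified decoder of my own (the struct and function names are made up for illustration); it assumes 32-bit addressing and ignores special cases such as a SIB base of 5 with mod 0.

```c
#include <stdint.h>

/* Hypothetical types for illustration: the bit fields of the ModRM
 * and SIB bytes that follow the opcode. */
struct modrm { unsigned mod, reg, rm; };
struct sib   { unsigned scale, index, base; };

/* ModRM layout: mod in bits 7-6, reg in bits 5-3, rm in bits 2-0. */
static struct modrm decode_modrm(uint8_t b)
{
    struct modrm m = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
    return m;
}

/* SIB layout: scale in bits 7-6 (stored as a power of two),
 * index in bits 5-3, base in bits 2-0. */
static struct sib decode_sib(uint8_t b)
{
    struct sib s = { 1u << ((b >> 6) & 3u), (b >> 3) & 7u, b & 7u };
    return s;
}

/* Displacement size in bytes implied by mod (simplified: mod 1 means
 * an 8-bit displacement, mod 2 a 32-bit one, otherwise none). */
static unsigned disp_bytes(unsigned mod)
{
    return mod == 1 ? 1u : mod == 2 ? 4u : 0u;
}
```

For 0xbc this gives mod 2, reg 7 (edi) and rm 4 (a SIB byte follows); for 0x27 it gives scale 1, index 4 (no index register, which the disassembler shows as eiz) and base 7 (edi). A mod of 2 asks for a 4-byte displacement, so the total is opcode + ModRM + SIB + 4 = 7 bytes.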

Now, compare that to a standard nop, which requires just a single byte to encode. Padding out 7 bytes of space would therefore require 7 nop instructions to be issued, which is a significantly slower way of doing nothing! Let's investigate just how much...

Below is a simple program that does nothing in a tight loop, firstly using nops and then the lea do-nothing method.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t cycle_t;

static inline cycle_t
i386_get_cycles(void)
{
        cycle_t result;
        __asm__ __volatile__("rdtsc" : "=A" (result));
        return result;
}

#define get_cycles i386_get_cycles

int main() {

    int i = 0;  /* must start at zero: the first loop reads it */
    uint64_t t1, t2;

    t1 = get_cycles();

    /* nop do nothing */
    while (i < 100000) {
        __asm__ __volatile__("nop;nop;nop");
        i++;
    }
    t2 = get_cycles();
    /* uint64_t is not long on i386, so cast and print as unsigned long long */
    printf("%llu\n", (unsigned long long)(t2 - t1));

    i = 0;
    t1 = get_cycles();

    /* lea do-nothing */
    while (i < 100000) {
        __asm__ __volatile__("lea 0x0(%edi,%eiz,1),%edi");
        i++;
    }

    t2 = get_cycles();
    /* uint64_t is not long on i386, so cast and print as unsigned long long */
    printf("%llu\n", (unsigned long long)(t2 - t1));
}

Firstly, you'll notice that rather than the 7 bytes mentioned before, we're comparing 3-byte sequences. That's because the lea instruction ends up encoded as:

8048388:       8d 3c 27                lea    (%edi,%eiz,1),%edi

When you hand-code this instruction, you can't actually convince the assembler to pad out those extra zeros for the zero displacement because it realises it doesn't need them, so why would it waste the space? So, how did they get in there in the original disassembly? If gas is trying to align something by padding, it has built-in sequences for the most efficient way of doing that for different sizes (you can see it in i386_align_code of gas/config/tc-i386.c, which adds the extra 4 bytes in directly).

Anyway, we can build and test this out (note you need the special -mindex-reg flag passed to gas to use the %eiz syntax):

$ gcc -O3 -Wa,-mindex-reg  -o wait wait.c
$ ./wait
300072
189945

So, if you need 3 bytes of padding in your code for some reason, padding with three no-ops takes roughly 1.6 times as long (about 58% slower) as a single larger instruction, at least on my aging Pentium M laptop.
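As a quick check on the arithmetic, the two cycle counts printed above can be compared directly. This is only a sketch: percent_slower is a name I made up, and the numbers are simply the two counts from the run shown earlier, which also include the loop overhead.

```c
/* Hypothetical helper: how much longer "slow" takes than "fast",
 * expressed as a percentage of "fast". */
static double percent_slower(double slow, double fast)
{
    return (slow / fast - 1.0) * 100.0;
}
```

Calling percent_slower(300072.0, 189945.0) gives a little under 58, so the nop loop takes about 1.58 times as many cycles as the lea loop.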

So now you can rest easy knowing that even though your code is doing nothing, it is doing it in the most efficient manner possible!