technovelty

weblog of Ian Wienand

RSS  |  technovelty home  |  page of ian  |  ianw@ieee.org

Why symbol visibility is good

ELF has two related concepts for managing symbols in your programs. The first concept is the symbol binding. Global binding means the symbol is visible outside the file being built; local binding is the opposite and keeps the symbol local only (static) and weak is like global, but suggests that the symbol can be overridden.

$ cat syms.c
static int local(void) { }

int global(void) { }

int  __attribute__((weak)) weak(void) { }

$ gcc -o syms -c syms.c

$ readelf --syms ./syms

Symbol table '.symtab' contains 10 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
...
     5: 00000000     8 FUNC    LOCAL  DEFAULT    1 local
     8: 00000008     8 FUNC    GLOBAL DEFAULT    1 global
     9: 00000010     8 FUNC    WEAK   DEFAULT    1 weak
...

This is all well and good, but starts breaking down when you want to load many different modules and keep strict API's (such as, say, dynamic libraries!).

Consider that for two files to share a common function, the function must end up with a global visibility.

$ cat file1.c
void common_but_not_part_of_api(void) { }

$ cat file2.c
extern void common_but_not_part_of_api(void);

void api_function(void) {
     common_but_not_part_of_api();
}

$ gcc -shared -fPIC  -o library file1.c file2.c
$ readelf --syms ./library

Symbol table '.symtab' contains 60 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
...
    53: 00000424    29 FUNC    GLOBAL DEFAULT   11 api_function
    55: 0000041c     5 FUNC    GLOBAL DEFAULT   11 common_but_not_part_of_api
...

In the example above, both the function we want exported (api_function) and the function we don't wish exported (common_but_not_part_of_api) end up with exactly the same attributes. Binding attributes are useful for the linker putting together object files; but aren't a complete solution.

To combat this, ELF provides for visibility attributes. Symbols can be default, protected, hidden or internal. Using these attributes, we can flag extra information for the dynamic loader so it can know which symbols are for public consumption, and which are for internal use only.

The most logical way to use this is to make all symbols by default hidden with -fvisibility=hidden and then "punch holes in the wall" for those symbols you want visible.

$ cat file1.c
void common_but_not_part_of_api(void) { }

$ cat file2.c
extern void common_but_not_part_of_api(void);

void  __attribute__((visibility("default"))) api_function(void) {
      common_but_not_part_of_api();
}

$ gcc -fvisibility=hidden -shared -fPIC  -o library file1.c file2.c
$ readelf --syms ./library

Symbol table '.symtab' contains 60 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
    48: 000003cc     5 FUNC    LOCAL  HIDDEN   11 common_but_not_part_of_api
    54: 000003d4    29 FUNC    GLOBAL DEFAULT  11 api_function

Now the dynamic loader has enough information to distinguish between the two, and can stop any external access to common_but_not_part_of_api easily.

This extra information also has potential for performance improvements. Any time a symbol may be overridden, the compiler must generate a program lookup table (PLT) entry for the function so that the dynamic loader can re-direct the function call. The PLT is a trampoline which gets the correct address of the function being called (from the global offset table, GOT) and bounces the function call to the right place. An example should illustrate:

Bouncing via the PLT

In the first example, there was not enough information to tell if the function would ever be able to be overridden, hence a PLT entry had to be created and the function called through it (disassemble it to see the details!). With correct symbol visibility attributes, there is enough information to know that common_but_not_part_of_api is never to be overridden, hence the PLT (and the associated costs of trampolining) can be avoided.

The internal attribute is even stricter; it says that this function will never be called from outside this module (for example, we might pass the address of common_but_not_part_of_api to anyone). This can lead to even better code, because on many architectures transitioning to another module might involve flipping global pointer registers or other similarly expensive operations.

So that's how symbol binding and visibility attributes can work together to get you the best performance possible from your program!

posted at: Wed, 08 Oct 2008 17:13 | in /code | permalink | add comment (0 others)

Whoda' thunk it!

I learned a new word today, thunk. At first, given the context, I thought it meant text (as in program code) hunk, but the Hacker's Dictionary suggests it's actually more a term for a closure.

For those not familiar with what a closure is, it's a "thunk" of encapsulated code created by the compiler which is dynamically generated and managed. Stealing the Wikipedia example:

Closures

Nested functions are probably the closest thing to a closure you can get with C (I've talked about nested functions and trampolines before). However, because the stack disappears it's not nearly as useful, I think the term for these is lexical closures, because it's basically a way to share some stack and restrict name-spaces.

Amusingly, the code using the term "thunk" was assembler, so my first search was for thunk gas (as in GNU Assembler). This of course lead to endless articles on the high price of gas — a non-renewable fossil fuel running out and capitalists realising inelastic demand can be used to line their pockets? Whoda' thunk it!

posted at: Mon, 07 Jul 2008 23:46 | in /code | permalink | add comment (1 others)

Conditional Operator Harmful ? yes : no

It seems existing versions of gcc don't warn when you use an assignment as a truth value as the first operand to the conditional operator. For example:

int fn(int i)
{
	return (i = 0x20) ? 0x40 : 0x80;
}

Presumably you meant ==, but unfortunately that does not warn with gcc (maybe it will one day).

We can use the tree dumping mechanisms (previously mentioned here) to confirm our problem. For the following function gcc creates:

fn (i)
{
  int D.1541;
  int iftmp.0;

  i = 32;
  if (1)
    {
      iftmp.0 = 64;
    }
  else
    {
      iftmp.0 = 128;
    }
  D.1541 = iftmp.0;
  return D.1541;
}

As you can see, this isn't the result we wanted. This can be quite nasty if you use the conditional operator in a hidden fashion; for example behind an assert. A common idiom is:

#define ASSERT(x) (x) ? : fail()

/* programmer now subsitutes a == for */
ASSERT(x = y);

You'll currently get no warning you've slipped up and used the assert wrong. Moral of the story: a quick audit of your code might turn up some surprising uses! However, in this case, the Intel compiler picks this one up:

$ icc -c test.c
test.c(3): warning #187: use of "=" where "==" may have been intended
  	   return (i = 0x20) ? 0x40 : 0x80;

Although it's often not trivial to move to another compiler, as a rule I would recommend trying alternative compilers for your code as it's a great sanity check.

posted at: Sat, 26 Apr 2008 08:29 | in /code/c | permalink | add comment (1 others)

Compare and Swap with PIC

Our Dear Leader Sam Hocevar has previously blogged about PIC and inline ASM. Today I came across a sort of extension to this problem.

Consider the following code, which implements a double word compare and swap using the x86 cmpxchg8b instruction (for a bonus you can lock it to make it atomic).

#include <stdio.h>

typedef struct double_word_t {
	int a;
	int b;
} double_word;

/* atomically compare old and mem, if they are the same then copy new
   back to mem */
int compare_and_swap(double_word *mem,
		     double_word old,
		     double_word new) {

	char result;
	__asm__ __volatile__("lock; cmpxchg8b %0; setz %1;"
			     : "=m"(*mem), "=q"(result)
			     : "m"(*mem), "d" (old.b), "a" (old.a),
			       "c" (new.b), "b" (new.a)
			     : "memory");
	return (int)result;
}

int main(void)
{

	double_word w = {.a = 0, .b = 0};
	double_word old = {.a = 17, .b = 42};
	double_word new = {.a = 12, .b = 13};

	/* old != w, therefore nothing happens */
	compare_and_swap(&w, old, new);
	printf("Should fail -> (%d,%d)\n", w.a, w.b);

	/* old == w, therefore w = new */
	old.a = 0; old.b = 0;
	compare_and_swap(&w, old, new);
	printf("Should work  -> (%d,%d)\n", w.a, w.b);

	return 0;
}

This type of CAS can be used to implement lock-free algorithms (I've previously blogged about that sort of thing).

The problem is that the cmpxchg8b uses the ebx register, i.e. pseudo code looks like:

if(EDX:EAX == Destination) {
    ZF = 1;
    Destination = ECX:EBX;
}
else {
    ZF = 0;
    EDX:EAX = Destination;
}

PIC code reserves ebx for internal use, so if you try to compile that with -fPIC you will get an error about not being able to allocate ebx.

A first attempt to create a PIC friendly version would simply save and restore ebx and not gcc anything about it, something like:

  __asm__ __volatile__("pushl %%ebx;"   /* save ebx used for PIC GOT ptr */
		       "movl %6,%%ebx;" /* move new_val2 to %ebx */
		       "lock; cmpxchg8b %0; setz %1;"
		       "pop %%ebx;"     /* restore %ebx */
	    	       : "=m"(*mem), "=q"(result)
		       : "m"(*mem), "d" (old.b), "a" (old.a),
		         "c" (new.b), "m" (new.a) : "memory");

Unfortunately, this isn't a generic solution. It works fine with the PIC case, because gcc will not allocate ebx for anything else. But in the non-PIC case, there is a chance that ebx will be used for addr. This would cause a probably fairly tricky bug to track down!

The solution is to use the #if __PIC__ directive to either tell gcc you're clobbering ebx in the non-PIC case, or just keep two versions around; one that saves and restores ebx for PIC and one that doesn't.

posted at: Fri, 01 Feb 2008 21:03 | in /code/arch | permalink | add comment (0 others)

Playing with the x86 PMU

A discussion about commas lead to an excuse to have a play with the IA-32 processor performance managment unit (PMU). To start, take two versions of a program to count the number of commas in a text file — one in C and one in Python. The C one runs faster on the input data set of ~60MiB of random data, but why?

The CPU performance monitors give are the key to getting some idea of where the programs spend their time. I like to use perfmon2 because it's what I know, but Oprofile can do it too. All the available events for IA-32 are described in the manual; I currently of no better way of finding out about them than just reading it. On Itanium I reccommend Caliper which, for the most common situations, does most of the work for you and presents it in a nice report. Intel's Vtune also does a similar thing.

The first thing to investigate is if the CPU is getting enough instructions to keep busy. The IFU_MEM_STALL metric is a good place to start as it is triggered when the instruction fetch pipeline is stalled, presumably waiting on either the ITLB or the trace buffer (Icache).

$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL ./comma < ./randomcommas
340559375 CPU_CLK_UNHALTED
   192115 IFU_MEM_STALL
$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1287100
4571257047 CPU_CLK_UNHALTED
 71981750 IFU_MEM_STALL

That works out to 0.05% of total cycles for the C version and 1.5% for the Python version, neither of which sets off immediate warning bells. If it did, we could start drilling down to things like the L2_IFETCH and ITLB_MISS events, or the BR_* branch events to try and see why the CPU is having to wait to get its next instruction.

Next it is useful to find the CPI (cycles per instruction). This is calculated by the ratio of retired instructions against the number of CPU cycles; since a superscalar machine can issue more than one instruction per cycle this should ideally be much greater than 1 (for example, an Itanium can execute up to 6 instructions each cycle).

$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED ./comma < ./randomcommas
542953593 INST_RETIRED
340612036 CPU_CLK_UNHALTED
$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1194455205 INST_RETIRED
4569931735 CPU_CLK_UNHALTED

This works out at a CPI of 1.59 for the C version and 0.26 for the Python version. The Python version is clearly spending a lot of time waiting, because it isn't even able to issue one instruction every cycle.

At this point it seems the CPU has enough instructions to do, but it is sitting around waiting to get through those instructions. This suggests the waiting is related to getting data from the cache.

The load and store requests from the L2 cache are accounted to the L2_LD and L2_ST events respectively. These have the ability to mask out cache lines in different states of the MESI protocol, but for this we don't care so just ask pfmon to show us everything.

$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I ./comma < randomcommas 
102505 L2_LD:M:E:S:I
   167 L2_ST:M:E:S:I
$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
3278774 L2_LD:M:E:S:I
  10457 L2_ST:M:E:S:I

This shows us that the Python version does quite a few more stores than the C counterpart. Considering this program should simply be reading the input stream and counting the number of commas, we do not expect much store traffic at all. This suggests the Python version is doing some extra copying, for whatever reason (maybe some Python expert can pinpoint it?).

We can drill down a bit more into the memory related latencies. The DCU_LINES_IN event gives the total number of lines allocated in the cache. Another event, DCU_MISS_OUTSTANDING, gives a weighted measure of the cycles spent waiting for a cache line to be brought in. Each cycle spent waiting is weighted by the number of outstanding cache misses (I think the Pentium M I'm using can have up to 4 cache miss requests outstanding at once) and has some caveats, but can be considered a rough estimate of the time spent waiting for a cache line to be brought in. Therefore divding DCU_MISS_OUTSTANDING by DCU_LINES_IN gives us an approximate metric of how long a cache miss takes.

$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN ./comma < randomcommas 
769736 DCU_MISS_OUTSTANDING
102387 DCU_LINES_IN

$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
99810150 DCU_MISS_OUTSTANDING
 4240179 DCU_LINES_IN

So that works out to 7.5 cycles for the C version and 23 cycles for the Python version. This seems to strongly suggest that it is memory traffic that is weighing the Python version down.

That is only the initial phase; the results give a high level idea of why one program is running slower than the other. The initial analysis phase generally consists of taking the ratios of certain events to try and get some idea in your head of what the program is doing. Then comes the really hard work; drilling down to figure out how to fix it!

Some useful references:

posted at: Fri, 21 Dec 2007 11:25 | in /code/arch | permalink | add comment (2 others)

Non-thread safe C

Recently it has come up on the LKML and other blogs about how C is not thread safe. The nice piece of example code given is

int trylock()
{
    int res;

    res = pthread_mutex_trylock(&mutex);
    if (res == 0)
            ++acquires_count;

    return res;
}

At first glance it seems OK, and I wasn't exactly sure what the problem was until I pulled it apart. GCC is well within its rights to produce the following code (annotated for clarity).

push   %ebp                     ; setting up stack frame
mov    %esp,%ebp                ; save esp to ebp
sub    $0x8,%esp                ; make some room on stack
movl   <mutex>,(%esp)           ; copy mutex to stack
call   <pthread_mutex_trylock>  ; call pthread_mutex_trylock
mov    <acquires_count>,%edx    ; copy acquires_count to %edx
cmp    $0x1,%eax                ; set carry flag if result == 0
adc    $0x0,%edx                ; add with carry
mov    %edx,<acquires_count>    ; write back acquires_count
leave                           ; release stack frame
ret                             ; done!

This code uses a bit of a trick to avoid a branch around code. It uses the carry flag, which is simple enough to understand when adding. Consider if you are adding 0001 + 1111 you end up with 11111, where the top bit is the carry bit. You can then feed this carry bit into another part of the addition operation if you were, say, using two 32 bit numbers to represent a 64 bit number on a 32 bit system.

The trick uses the dual nature of the carry flag as a "borrow" bit; if a < b when a - b then the bit is set. Consider the simplest case of 0 - 1 -- the carry bit is set to indicate to the next operation that a bit needs to be borrowed. Imagine for a moment we have a four bit machine and want to do the 8-bit unsigned operation

1111 0000 -   (240) -
0000 0001     (  1)
==== ====

First we do 0000 - 0001 which leaves us 1111 with the borrow bit set. Then we do 1111 - 0000 but take note that the borrow bit is set, so get rid of the borrowed bit, leaving us 1110. Putting the two together gives us 1110 1111 or 239.

Therefore the cmp $0x1,<result> instruction (which is really just result - 1 with no output but setting flags) can only set the carry flag (i.e. borrow a bit) if the value is 0; any other value in result means there will be no need to borrow and the carry flag is not set.

Consequently the add with carry (adc) will add one to the result only if the carry bit was set. A neat trick which avoids a jump, but not so much thread safe: if you are interrupted after the cmp but before the mov store you can write back the wrong value. Most people (including me till 5 minutes ago) looking at the code would assume that it was safe on the theory that acquires_count is not touched if you don't have the lock because it is inside the if statement.

Interestingly, the Intel compiler does the simpler jump method

push   <mutex>                     ; push argument
call   <pthread_mutex_trylock>
pop    %ecx                        ; get return addr on top
test   %eax,%eax                   ; AND %eax,%eax
jne    <trylock+0x16>  --------+   ; if that wasn't 0, jump out
addl   $0x1,<acquires_count>   |
ret                    <-------+   ; done!

Frankly I have no idea which one is faster/better (I would guess the second one because you don't always dirty a cache line, but you'd have to measure that against the cost of the branch) but obviously GCC thinks one thing and Intel thinks another and I bet if you put them in a room together you would end up with 3 opinions! I think the major point is that once upon a time it was mostly people writing kernels doing things like accessing I/O ports or other memory mapped areas that needed to be very careful the compiler didn't optimise out (or in) something incorrectly. With multi-threaded C code that problem moves up to the application developer too.

A different instruction set can help here. The Itanium version is possibly less prone to this sort of thing because predication makes it faster to implement conditional branches -- any instruction with a predicate that is not set is skipped. The load and add is safe because it will only be done with the lock held. However, if anyone is wondering why Itanium come with such big caches, the difference between the 6 line version and the Itanium version below should answer that.

 [MII]       alloc r32=ar.pfs,4,3,0               ; alloc stack frame
             mov r33=b0                           ; return pointer
             mov r34=r1
 [MMI]       addl r14=72,r1;;                     ; r11 = &mutex from GOT
             ld8 r11=[r14]                        ;
             nop.i 0x0;;
 [MIB]       mov r35=r11                          ; put into argument 1
             nop.i 0x0
             br.call.sptk.few b0=<pthread_mutex_trylock>
 [MII]       cmp4.eq.unc p0,p6=r8,r0              ; if result !0, set predicate p6
             mov.i ar.pfs=r32
             mov r1=r34;;                         ; load acquires_count from GOT
 [MIB]       addl r10=144,r1                      ;
             nop.i 0x0
       (p06) br.cond.dptk.few <trylock+0x80>;;    ; jump to [1] if p6 set
 [MII]       ld4 r9=[r10]
             mov b0=r33
             addl r3=144,r1;;
 [MMI]       adds r2=1,r9;;                       ; acquires_count++
             st4 [r3]=r2                          ; save updated value
             nop.i 0x0
 [MIB]       nop.m 0x0                            ; [1]
             nop.i 0x0
             br.few <trylock+0x90>;;              ; jump to [2]
 [MII]       nop.m 0x0
             mov b0=r33
             nop.i 0x0;;
 [MIB]       nop.m 0x0                            ; [2]
             nop.i 0x0
             br.ret.sptk.many b0;;

I guess the moral of the story is that multi-threading is hard, but we knew that anyway.

posted at: Mon, 29 Oct 2007 12:13 | in /code/badcode | permalink | add comment (7 others)

Scrambled text

There's this thing going around on Facebook suggesting you are smart if you can read a paragaph where the middle letters of the words are scrambled.

Everything old is new again, since I remember reading about this in 2003. As far as I remember being able to read the scrambled text was not related to intelligence in any way.

Anyway, as a Friday afternoon distraction I wrote a little javascript to scramble text.

PlainScrambled
Scramble It

Ahh, the good-olde Fisher-Yates shuffle algorithm, friend of the first year tutorial!

posted at: Fri, 12 Oct 2007 18:19 | in /code/web | permalink | add comment (4 others)

Short review of guarded page-tables

Anyone who has done an operating systems 101 course is familiar with the standard hierarchical page-table structure. This is essentially a radix-tree where bits of the virtual address are taken as indexes into an array of pointers, each pointing to the next level down or, at the leaves, to a physical entry.

Hierarchical page-table

One potential issue with this style of page-table is with a very large, very sparse address space (such as in a single address-space operating system) there will be many intermediate steps with possibly only a single translation at the end (i.e. lots of single pages scattered all around your address-space).

Thus we can add a bit-field guard to each of the entries in the page table (Liedtke, 1994) to help indicate only those paths through the tree which are valid (similar to a Patricia tree).

Translation first proceeds as usual; the top bits are used as an index into the top level table. However, each entry now contains extra information about further valid addresses. The guard is masked against the next bits in the virtual address; if it matches (e.g. in our example is 0x2) the next step can proceed directly to the leaf node, bypassing the middle look-up. The entry can be marked as either final (i.e. pointing to a leaf node) or intermediate, in which case the process continues with the next level (which also has its own guard, and so forth). Any entry that does not match the guard is known to be false, and therefore a fault can be raised.

Hierarchical page-table with guard

To examine this further, consider that we know that each page in our 24-bit address space above is 4KiB, thanks to the 212 = 4KiB offset. Therefore each of the 16 possible values of the top four bits selects 1MiB of the 224 = 16MiB address space; e.g. 0x0 selects the first megabyte, 0x1 the second and so forth. The second 4 bits selects a 64KiB region of this 1MiB, and the last 4 bits selects a 4KiB region of this 64KiB region.

This scheme could work quite well if we were using only 16 pages each separated by 1MiB. Each of these pages would only require an entry in the top level with an appropriate guard which could then be marked as final and point directly to a physical frame.

Once things become less sparse however, we need to start filling in middle levels. If you have 2 pages within a 16MiB region, you need at least one extra bit of address to tell them apart. If you have 3 or 4 pages, you need at least 2 bits to tell them apart, and so forth.

It would therefore be beneficial to have variable level sizes to dynamically adapt and create the most efficient lookup possible given the address space layout of a process. For example, if you knew your process was only going map pages 0x123 (100100011) and 0x124 (100100100) a good option would be to make your top-level only two entries (i.e. check the top bit) with a guard on the entry at index 0x1 of 00100. You then need a second level with 8 pointers to differentiate the remaining 3 bits (each of these would be final and hence point directly to a physical page).

The more pages you start mapping, the less bits you can discard with the guard. If you follow this through to a guard of 0 bits you end up with a linear page-table.

It has been shown (Elphinstone, 1999) that fixed level sizes with 16 entries tends to work well. For a 32-bit address-space with 4KiB pages (e.g. 12 bits offset) this leaves 20 bits to be mapped by the page-tables; with each level mapping 24 bits this means a 5 level tree. The guard values provide enough path-compression that the table is not constantly walked. The disadvantage is that it may "over-allocate" meaning more space is dedicated to page-tables than strictly required.

Deciding where and when to split levels, what size indexes to put at each level and updating all the guard values dynamically to make the smallest, most efficient page-table would create a variable-radix page-table (Szmajda and Heiser, 2003).

In summary, a guarded page-table is more useful the more sparse your address space is. A variable-radix guarded page-table is complex, but could offer advantages for implementing variable page-size support and thus having positive effects on TLB coverage.

posted at: Wed, 05 Sep 2007 15:59 | in /code | permalink | add comment (0 others)

Fun with -funroll-loops

I was playing with loop unrolling, and decided to investigate it a little. The basis is this very simple program which looks like it could do with some loop unrolling.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1500 * 1024 * 1024)

int fn(int *buf) {
        int i;

	for(i=0; i < (SIZE/sizeof(int)); i+=1)
		buf[i] = i;

	return 0;
}

int main(void)
{
        int *buf = malloc(SIZE);
        assert(buf);
	return fn(buf);
}

Firstly, lets have a look at what happens to the function when -funroll-loops is applied. Below is the de-compilation with some comments on what each line is doing.

Non-optimised
                         | nop.m 0x0
r14=0                    | mov r14=r0
save the "loop counter"  | mov.i r2=ar.lc
                         | nop.m 0x0
one less than iterations | movl r15=0x176fffff;;
                         | nop.m 0x0
initalise loop counter   | mov.i ar.lc=r15
                         | nop.i 0x0;;
                         | loop:
*buf = i, buf++          | st4 [r32]=r14,4
i++                      | adds r14=1,r14
dec lc, branch if > 0    | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore loop counter     | mov.i ar.lc=r2
Loop unrolled
                         | nop.m 0x0
r26 = 0                  | mov r26=r0
save lc                  | mov.i r2=ar.lc
                         | nop.m 0x0
iter / 32                | movl r14=0x2edffff;;
                         | nop.m 0x0
setup loop counter       | mov.i ar.lc=r14
                         | nop.i 0x0
                         | loop:
r3 = buf                 | mov r3=r32
r8 = 1                   | adds r8=1,r26
r25 = 3                  | adds r25=3,r26
r23 = buf[3]             | adds r23=12,r32
r24 = 4                  | adds r24=4,r26
r22 = buf[4]             | adds r22=16,r32;;
buf[0] = 0; r3=buf+1     | st4 [r3]=r26,4
r21 = 5                  | adds r21=5,r26
r19 = buf[5]             | adds r19=20,r32
r20 = 6                  | adds r20=6,r26
r18 = buf[6]             | adds r18=24,r32
r16 = 7                  | adds r16=7,r26;;
                         | nop.m 0x0
buf[1] = 1, r3=buf+2     | st4 [r3]=r8,4
r15 = buf[7]             | adds r15=28,r32
r17 = 2                  | adds r17=1,r8
r26 += 8                 | adds r26=8,r26
buf += 8 (for next iter) | adds r32=32,r32;;
buf[2] = 2               | st4 [r3]=r17
buf[3] = 3               | st4 [r23]=r25
                         | nop.i 0x0
buf[4] = 4               | st4 [r22]=r24
buf[5] = 5               | st4 [r19]=r21
                         | nop.i 0x0
buf[6] = 6               | st4 [r18]=r20
buf[7] = 7               | st4 [r15]=r16
loop, dec branch         | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore lc               | mov.i ar.lc=r2
                         | br.ret.sptk.many b0;;

We can see that in the first case the loop counter is setup to run 393216000 times (e.g. 1500*1024*1024/sizeof(int)) and a very simple loop of storing the value and incrementing is undertaken. In the second version, the code becomes more complex. We can see the loop now executes 49152000 times (e.g. 1500*1024*1024/sizeof(int)/8) so we can infer that for each loop iteration, 8 values have been unrolled to be written into our buf array. Indeed, when pulled apart we can see the 8 stores (numbers just reflect the first time through).

So, is it faster?

$ time ./loop

real    0m1.209s
user    0m0.388s
sys     0m0.820s

$ timme ./loop-unroll

real    0m1.056s
user    0m0.244s
sys     0m0.812s

Yes! Great. We can suggest a number of reasons why, but it would be interesting to get some hard statistics on why this is so. Thanks to the Itanium performance counters, we can. Firstly, lets start by seeing how cycles are being spent.

$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop
 547923609 CPU_CYCLES
1179995250 IA64_INST_RETIRED_THIS
    104431 NOPS_RETIRED
 154285534 BACK_END_BUBBLE_ALL

$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop-unroll
 311570590 CPU_CYCLES
1327451286 IA64_INST_RETIRED_THIS
 147560435 NOPS_RETIRED
  16088889 BACK_END_BUBBLE_ALL

Now, the Itanium can issue up to six instructions (IA64_INST_RETIRED_THIS) in two bundles of three instructions per cycle (CPU_CYCLES). However, you can't have dependencies between registers within a 3-instruction bundle (and branches and interruptions play havoc) so sometimes you need to pad with no-ops (this is why you need a good compiler). We can work out some figures from this, namely the useful instructions per cycle (e.g. instructions retired - nop instructions / cycles).

TestUseful instructions/cycle
loop1179995250 - 104431 / 547923609 = 2.15
unroll1327451286 - 147560435 / 311570590 = 3.78

This tells us that for each given cycle, the unrolled-loop version is doing more work (ideally we'd like that to be a the theoretical maximum of 6). So what is the unrolled version doing for all this time? Bubbles are where the CPU is waiting for something; a prime candidate is moving data from caches to registers. Bubbles make up (154285534/547823609) 28% of the cycles on the loop version, but only (16088889/311570590) 5% of the time for the unrolled version. This means the processor is spending a lot less time waiting for things to happen with the unrolled loop version.

We can drill down further to see what bubbles are occurring.

$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop
 14229581 BE_L1D_FPU_BUBBLE_ALL
    97669 BE_EXE_BUBBLE_ALL
     1260 BE_RSE_BUBBLE_ALL
131354866 BACK_END_BUBBLE_FE

$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop-unroll
22676169 BE_L1D_FPU_BUBBLE_ALL
   83064 BE_EXE_BUBBLE_ALL
    1263 BE_RSE_BUBBLE_ALL
  718850 BACK_END_BUBBLE_FE

We are now breaking the stalls down to see where they are occuring. L1D_FPU_BUBBLES are essentially related to micro-architectural issues with the caches; things like the L1D cache being overwhelmend and interactions with the hardware page-table walker. BE_EXE bubbles are more simply described a waiting for data to come from caches (or even main memory) to registers. BE_RSE bubbles are related to the register stack engine, and "front end" bubbles (BACK_END_BUBBLE_FE) are where there are not enough instructions coming from the "front-end" for the "back-end" to issue out to functional units in the processor.

In this case, the major cause of problems for the simple loop seems to be not being able to get enough instructions through to keep the processor busy. We can try to pin down what is happening in the front-end:

$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop
   27727 FE_BUBBLE_IMISS
    4910 FE_BUBBLE_TLBMISS
98094189 FE_BUBBLE_BRANCH
52615907 FE_BUBBLE_BUBBLE

$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop-unroll
780729 FE_BUBBLE_IMISS
  4858 FE_BUBBLE_TLBMISS
291620 FE_BUBBLE_BRANCH
111089 FE_BUBBLE_BUBBLE

What we can see here is that the increased code size of the unrolled version causes us a lot more i-cache misses, as expected. However, the problem for the unrolled version seems to be the time spent by the front-end dealing with the branch. This is not surprising since there is a rumor that there is a front-end buffer which fills up after 3 repeated cycles with branch instructions; apparently this is a problem addressed in the new Montecito chips (if anyone has one, I'll update this if you send me the numbers). Since we are doing 393216000 branches all within a single bundle its not surprising we've hit it!

Just to confirm, we can break-down the BE_EXE bubbles. Remember, BE_EXE bubbles are caused by stalls in the execution phase primarily due to cache latencies or inter-register dependencies. Looking at our code, we don't re-use the same register over-and-over so we would not expect to spend a long time waiting for registers, and indeed this is what we see.

$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop
79601 BE_EXE_BUBBLE_GRALL
  919 BE_EXE_BUBBLE_GRGR

$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop-unroll
76627 BE_EXE_BUBBLE_GRALL
  918 BE_EXE_BUBBLE_GRGR

Therefore in this case, the speed increase isn't coming from better cache behaviour as we might have expected, but greater ability to keep the processor busy with instructions between branches.

We can also examine the major events happening with the unrolled case. Here we see BE_L1D_FPU bubbles being the major overhead. On Itanium floating-point avoids the L1 cache (it is too big, and not re-used enough to make it useful) hence the concatenation of the event acronym. These bubbles can be broken down into the following events (measured in two runs, since you can only do 4 at a time).

$ pfmon -e be_l1d_fpu_bubble_l1d,be_l1d_fpu_bubble_l1d_dcurecir,\
  be_l1d_fpu_bubble_l1d_tlb,be_l1d_fpu_bubble_l1d_stbufrecir ./loop-unroll
 9623193 BE_L1D_FPU_BUBBLE_L1D_DCURECIR
     414 BE_L1D_FPU_BUBBLE_L1D_TLB
     282 BE_L1D_FPU_BUBBLE_L1D_STBUFRECIR
$ pfmon -e be_l1d_fpu_bubble_l1d_fullstbuf,\
    be_l1d_fpu_bubble_l1d_l2bpress,be_l1d_fpu_bubble_l1d_tlb ./loop-unroll
      23 BE_L1D_FPU_BUBBLE_L1D_FULLSTBUF
17220177 BE_L1D_FPU_BUBBLE_L1D_L2BPRESS
     425 BE_L1D_FPU_BUBBLE_L1D_TLB

These events expose micro-architectural details, the full extent of which is only gleaned by reading the processor manuals (a PhD in computer architecture probably helps too!). The store buffers are not an issue here -- the caches are more than able to keep up with our few writes. The TLB is also not involved; we have pretty good TLB coverage thanks to the virtual-linear page table and hardware page-table walker.

The two clues are DCURECIR and L2BPRESS. Essentially DCURECIR happens when references have to "re-circulate"; this is a side-effect of the data not being available. The L2BPRESS figure suggests why this is so; this event happens when the L2 cache can not handle any more requests. The L2 cache keeps a queue of outstanding requests; when this queue is full (for example, full of long latency requests to get data from main memory) the L2 cache causes "back-pressure" which stops the L1 cache from issuing new requests. This suggests the program is chewing through a lot of data and not using the cache very effectively, which is exactly what we are doing. Pre-fetching hints can help (but as described in a previous entry on hacking Enigma may not always help), but in this case there probably isn't any spare bandwidth to pre-fetch in because essentially no processing is happening.

What does all this mean? I don't know, but playing with performance monitors is fun!

posted at: Wed, 22 Aug 2007 16:22 | in /code/arch | permalink | add comment (0 others)

On incorrectly utilising "utilising"

My supervisor pointed out a very annoying habit I had formed when writing technical documents, best summed up below:

"It would be difficult to find a page of governmental, military, or academic writing that doesn't have on it the word utilize. It must be one of the most over-utilized words in the world. It seems as though people out to impress people with the significance of what they're doing use utilize when they should use use.

Utilize is not an elegant variation of the word use; it has its own distinct meaning. When you utilize something, you make do with something not normally used for the purpose, e.g., you utilize a dime when the bloody screwdriver is nowhere to be found. If the screwdriver were there, you'd use it, not utilize a stupid dime for the purpose. Use use when you mean use, and utilize only when it's properly used to mean--to use something not normally used. The computer went off-line, so they utilized Mr. Wang's abacus, the one he liked to use. Despite the temporary breakdown, the computer's use-rate was up (not its utilization-rate)."

Cheney, "Getting the words right" (1983)

You will now probably notice this subtle verbiage in most everything you read!

posted at: Tue, 21 Aug 2007 10:19 | in /code/badcode | permalink | add comment (5 others)

What actually happens when you plug in a USB device?

Today I started wondering what actually happens when you plug in a USB device. The little tour below goes from starting the USB subsystem to plugging something in, and I hope it is reasonably accurate. This entry is probably best read with a copy of the Linux kernel handy.

Linux USB Core, lower-layers

We can start our tour right at the very bottom in the heart of the USB core.

Things really start at the USB initialisation function in drivers/usb/core/usb.c:usb_init(). The first interesting call is to drivers/base/bus.c:bus_register(). We see that it passes as struct bus_type which looks like:

struct bus_type usb_bus_type = {
        .name =         "usb",
        .match =        usb_device_match,
        .uevent =       usb_uevent,
        .suspend =      usb_suspend,
        .resume =       usb_resume,
};

This is registering a new type of bus with the Linux driver core framework. The bus doesn't have much yet, just a name and some helper functions, but registering a bus sets up the kobject hierarchy that gets exported through /sys/bus/ (/sys/bus/usb in this case) and will allow the further hierarchical building of devices underneath by attaching them as the system runs. This is like the root directory of the USB system.

Your desktop/laptop/palmtop etc has a host controller which directly interfaces to USB devices; common types are UHCI, OHCI and EHCI. The drivers for these various types of controllers live in drivers/usb/host. These controllers are similar but different, so to minimise code duplication Linux has a Host Controller Driver framework (drivers/usb/core/hcd.c) which abstracts most of the common operations from the host controller driver.

The HCD layer does this by keeping a struct usb_hcd (drivers/usb/core/hcd.h) with all common information in it for a host controller. Each of host controller drivers fills out a struct hc_driver for its hardware dependent operations, as per below (taken from the UHCI driver)

static const struct hc_driver uhci_driver = {
        .description =          hcd_name,
        .product_desc =         "UHCI Host Controller",
        .hcd_priv_size =        sizeof(struct uhci_hcd),

        /* Generic hardware linkage */
        .irq =                  uhci_irq,
        .flags =                HCD_USB11,

        /* Basic lifecycle operations */
        .reset =                uhci_init,
        .start =                uhci_start,
#ifdef CONFIG_PM
        .suspend =              uhci_suspend,
        .resume =               uhci_resume,
        .bus_suspend =          uhci_rh_suspend,
        .bus_resume =           uhci_rh_resume,
#endif
        .stop =                 uhci_stop,

        .urb_enqueue =          uhci_urb_enqueue,
        .urb_dequeue =          uhci_urb_dequeue,

        .endpoint_disable =     uhci_hcd_endpoint_disable,
        .get_frame_number =     uhci_hcd_get_frame_number,

        .hub_status_data =      uhci_hub_status_data,
        .hub_control =          uhci_hub_control,
};

USB overview

It might be helpful to clarify a few USB concepts now. A USB device defines a group of end-points, where are grouped together into an interface. An end-point can be either "IN" or "OUT", and sends data in one direction only. End-points can have a number of different types:

There can be many interfaces (made of multiple end-points) and interfaces are grouped into "configurations". Most devices only have a single configuration.

Intel UHCI

You can see how this works at the host controller level with the above diagram clagged from the Intel UHCI documents. The controller has a "frame" register which is incremented every millisecond. Each frame pointer points to a queue of "transfer descriptors". The driver needs to schedule this work so that 90% of the time is given to isochronous data, and 10% left for control data. Should any time remain, the bulk data is transferred. You can see that any transfer descriptor for isochronous data will not be retried, but other data sits in a queue so it is never lost.

The USB layer communicates through USB request blocks, or URBs. A URB contains information about what end-point this request relates to, data, any related information or attributes and a call-back function to be called when the URB is complete. Drivers submit URBs to the USB core, which manages them in co-ordination with the USB host (see the urb_enqueue functions provided by the host driver). Your data gets sent off to the USB device by the USB core, and when its done your call-back is triggered.

Root Hub

There is one more element quite fundamental to the USB core, which is the hub -- all USB devices plug into a hub. The USB controller implements a root hub; you can have multiple root hubs in a machine, and other hubs can then connect to root hubs. The hub driver lives in drivers/usb/core/hub.c. The USB initialisation function starts up the khubd thread (drivers/usb/core/hub.c:usb_hub_init()) which waits for and handles USB events, but we will return to that later.

The hub driver is the first USB driver to be setup, so by examining how that works we can get a feel for how other drivers work.

The hub driver setup starts in drivers/usb/core/usb.c:usb_init() where the drivers/usb/core/hub.c:usb_hub_init() function is called. This calls drivers/usb/core/driver.c:usb_register_driver() which adds itself to the USB bus we mentioned previously. This sets up usb_probe_device() to handle any probe events from the Linux driver core. At this point the hub driver is ready to claim anything that looks like a USB hub.

The root hub setup phase comes out of the HCD setup phase, which proceeds something like this. The Linux driver core goes through all devices on the PCI bus (including the USB host controller of course) and calls the probe() function the device's driver has registered (the host controller registered itself in its initialisation function drivers/usb/core/uhci-hcd.c:uhci_hcd_init()). The USB host controller driver wires up the HCD layer function drivers/usb/core/hcd-pci.c:usb_hcd_pci_probe() to handle this probe (see struct pci_driver uhci_pci_driver in uhci-hcd.c; usb_hcd_pci_probe() does some generic setup, but then calls back into the host driver start() function to do any device specific setup).

usb_hcd_pci_probe() ends up calling drivers/usb/core/hcd.c:usb_add_hcd() which does some generic HCD setup and ends up calling register_root_hub().

register_root_hub() creates a new USB device and registers it with drivers/usb/core/hub.c:usb_new_device(). usb_new_device() first calls drivers/usb/core/config.c:usb_get_configuration which sets up the interface (all hubs only have one interface; the interrupt interface to notify of events on the hub) for the device before registering with the Linux driver core via drivers/base/core/device_add(). device_add() then causes the USB bus to be rescanned.

Binding root hub to a driver

Now we can examine how a USB device gets associated with a driver. To summarise, what needs to happen is the hub driver needs to bind to the host controllers root hub. This illustrates the general concept of a new device binding to a driver.

There is, believe it or not, more layers that come into play now. There is a "generic" USB driver that handles the setup of interfaces in the USB core. As we mentioned earlier a device has a series of end-points grouped together into an interface, and then may have multiple interfaces for different things. Drivers really only care about communicating with the device at the interface level, so the USB core takes care of getting things to this stage for you.

In drivers/usb/core/usb.c:usb_init() the final call is to drivers/usb/core/driver.c:usb_register_device_driver(). This does some simple wrapping of the driver, most importantly setting up usb_probe_device() to handle any probes. It then registers this with Linux driver core with a call to driver_register.

Remembering that drivers/usb/hub.c:usb_new_device() has called device_add(), the driver core will now see this new device and start probing for a driver to handle it. The USB generic driver is going to be called, which has registered drivers/usb/core/driver.c:usb_probe_device() as its probe function. This converts the Linux driver core device back to a USB device (i.e. the USB device that was registered by register_root_hub()) and calls calls the drivers probe function drivers/usb/core/generic.c:generic_probe().

The role of the generic driver is to get the interfaces on the device up and running. Firstly it calls drivers/usb/generic.c:choose_configuration() which simply reads through the device data and chooses a sane configuration. But wait, how does it know what is a sane configuration for the root hub? All the information has been "faked" for the root hub in drivers/usb/core/hcd.c:usb2_rh_dev_descriptor and the like. The root hub details are defined by the USB specification, so these can be kept statically.

Assuming everything is OK, drivers/usb/core/message.c:usb_set_configuration() is called to set up the chosen configuration. It uses the helper function usb_control_message() to send a message to the device about what configuration mode to go into. It then goes through the available interfaces setting up the kernel structures, and adds them with device_add().

Inch by inch, we are getting closer to having the USB system up and running. When a new interface is added, the driver core will now try and find a driver to handle it. When an interface driver is registered with drivers/usb/core/driver.c:usb_register_driver() it sets the probe function to drivers/usb/core/driver.c:usb_probe_interface(). So the driver core calls this function, and the first thing it does is checks the ID against the IDs the driver is happy to handle from the drivers id_table. It uses usb_match_one_id() for this, which can match the class, subclass and protocol to make sure the it will work with this driver.

The root hub is part of the hub class, so can be handled by the hub driver. Therefore drivers/usb/core/hub.c:hub_probe() will be called to get the hub driver to bind with the new hub. This does some sanity checking and calls into hub_configure(). This then does some more general setup, but things get interesting when the interrupt end-point is setup. The interrupt end-point on a hub will send an event whenever something happens on the hub, such as a device being plugged in or unplugged. This is done by creating a URB and binding it to the end-point, asking that hub_irq be called whenever this URB is complete (e.g. when an event is received).

New events on the hub

At this point, the system is waiting for something to happen. The root hub is setup and listening for new events - we are ready to plug in our device.

When this happens the host controller will raise an interrupt signalling that one of its ports has changed state. This will be handled by drivers/usb/host/uhci-hcd.c:uhci_irq(). This checks that the interrupt wasn't due to an error and then calls drivers/usb/host/uhci-q.c:uhci_scan_schedule(). This function implements the interface between the UHCI "transfer data" messages and the Linux URB scheme. It goes through the queues as illustrated in the figure above and finds any complete URBs and calls their completion function.

You may remember that the interrupt end-point of the hub was associated with a URB that would call drivers/usb/core/hub.c:hub_irq(). The UHCI code will go through it's queues, find that this URB is complete and therfore call the completion function.

We previously mentioned that the kernel starts the khubd daemon to handle hub events. We can see that hub_irq does some simple error checking, but its real job is to notify khubd that there is a new event on this hub.

hub_events() is the khubd daemon thread doing all the work. Once notified the hub has new event, it goes and reads the hub status to see what to do. A new device appearing will trigger a port change event, which is handled by hub_port_connect_change(). This does some initial setup of the device, but most importantly now calls usb_new_device().

At this point, if the USB driver for this device is loaded, the device is essentially ready to go. As with the root hub, the device will be probed and its interfaces found with the generic driver, and then those interfaces will be registered and any driver asked if they wish to bind to them.

Loading drivers

The question arises, however, about how modules are dynamically loaded. Keeping every single module for every possible USB device that may plug into the system is clearly sub-optimal, so we wish to load a device driver module only when required.

Most everyone is aware of udev which handles /dev device nodes these days. udev sits around listening for uevents which get sent to it via the kernel. These uevents get sent via netlink, which is like Unix sockets but different. To get uevent messages all you need to do is open a PF_NETLINK socket to the NETLINK_KOBJECT_UEVENT "port", e.g. as the code below extract from udev does.

memset(&snl, 0x00, sizeof(struct sockaddr_nl));
snl.nl_family = AF_NETLINK;
snl.nl_pid = getpid();
snl.nl_groups = 1;

uevent_netlink_sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
if (uevent_netlink_sock == -1) {
   err("error getting socket: %s", strerror(errno));
   return -1;
}

So we have udev sitting around up in userspace waiting for messages that come from the kernel (as a side note, if /sys/kernel/uevent_helper is filled in with a path that will be run and receive the events with environment variables set; this is useful in early boot before udev has started.

Thus we want to get a message out to udev that there is a new USB interface around, and it should try and load a module to bind to it.

We previously identified drivers/usb/core/message.c:usb_set_configuration() as reading the interfaces for a device and registering them with the driver core. When this registers the interface, it also registers a uevent helper drivers/usb/core/message.c:usb_if_uevent(). This function gets called by the Linux driver core when the driver is added into the driver hierarchy. It adds some information to the uevent:

if (add_uevent_var(envp, num_envp, &i,
	   buffer, buffer_size, &length,
	   "INTERFACE=%d/%d/%d",
	   alt->desc.bInterfaceClass,
	   alt->desc.bInterfaceSubClass,
	   alt->desc.bInterfaceProtocol))
	return -ENOMEM;

if (add_uevent_var(envp, num_envp, &i,
	   buffer, buffer_size, &length,
	   "MODALIAS=usb:v%04Xp%04Xd%04Xdc%02Xdsc%02Xdp%02Xic%02Xisc%02Xip%02X",
	   le16_to_cpu(usb_dev->descriptor.idVendor),
	   le16_to_cpu(usb_dev->descriptor.idProduct),
	   le16_to_cpu(usb_dev->descriptor.bcdDevice),
	   usb_dev->descriptor.bDeviceClass,
	   usb_dev->descriptor.bDeviceSubClass,
	   usb_dev->descriptor.bDeviceProtocol,
	   alt->desc.bInterfaceClass,
	   alt->desc.bInterfaceSubClass,
	   alt->desc.bInterfaceProtocol))
	return -ENOMEM;

udev now has all the information required ask modprobe to load a module, if it can. The rule on my system looks something like:

# load the drivers
ENV{MODALIAS}=="?*",    RUN+="/sbin/modprobe --use-blacklist $env{MODALIAS}"

Now, all going well, modprobe loads an appropriate module based on the vendor id, etc. Loading the driver causes a re-scan of the USB bus and the driver will be probed to see if wants to handle any of the unbinded USB devices (using the same procedure as was done setting up the root hub). It should take over control of the device, and that's it your USB device is up and running!

Simple!

Handy resources if you want to know more: Linux Device Drivers, USB2 Specification, UHCI Design Guide.

posted at: Fri, 20 Jul 2007 16:29 | in /code/linux | permalink | add comment (2 others)

Dig Jazz Applet

I really enjoy DIG Jazz from the ABC. The digital radio channel via the TV set-top box is great to have on in the background when working from home, or you can use the streaming versions when work pays for the bandwidth :)

One slightly annoying thing is you hear something good and have to go to the website to figure out what was just played. I de-compiled the flash application to see where it gets its information, and it turns out it is just a fairly boring XML file. Since I've been meaning to get to know PyGtk a bit better, I knocked up a simple applet that, when clicked on, shows you the current song playing.

Dig Jazz Applet

Source code available here or I also made a Debian package.

posted at: Tue, 22 May 2007 14:39 | in /code/gnome | permalink | add comment (1 others)

Crikey Cleaner

I quite enjoy reading the thinking man's New Idea that is Crikey. I do not, however, enjoy the HTML of their daily email; for a new-media company it is very 1996. For example, every paragraph has what should be done once in a style-sheet (although there is a weird mix of using CSS and not), multiple breaks are used to add space and links are hidden behind a nasty re-director that gives you absolutely no clue where you're going to end up. You can choose plain text but it comes so mangled, with included random Unicode characters, that it's more hassle than it's worth. The W3C validator is a sea of red, should you try and run it through that; not that it really knows what to do because there's no DOCTYPE declaration. Here's a particularly bad sample:

<p style="font-size: 14px; font-family: Arial, Helvetica, sans-serif"> <i class="credit">Sophie Black writes:</i> <br> <br> <p style="font-size: 14px; font-family: Arial, Helvetica, sans-serif"> Although the AFP is reporting that the <a href="http://tracking.crikey.com.au/LinkRedirector.aspx?clid=3976b864-c145-409c-bfd6-6ffab8303e50&rid=2ad7a942-b9db-4727-8ab3-078bdec50e97" target="_blank">noon deadline</a> ... </p>

As given, the bad HTML and huge advertisements in big fixed sized tables make it all but impossible to read on a Palm Pilot screen, which is my preferred medium for the bus. Thus I wrote clean-crikey.py to strip out all the crap and leave you with something reasonable (and as a bonus save you anywhere up to 40KiB).

What I now do is run the daily email through this, and rsync it to a private place which is subscribed to via an Avantgo user-created channel. Then I just sync using the built-in wireless of the T|X and I'm done. But crikey, it shouldn't be that hard! Just send valid HTML with a range of sensible style-sheets...

Update: in response to a letter requesting a way to remove all articles by Christian Kerr published in today's Crikey, I have updated the script to strip articles with a particular byline if -b is passed.

posted at: Wed, 04 Apr 2007 14:01 | in /code/hacks | permalink | add comment (0 others)

Stressing the TLB with matrices

When playing around with memory management a matrix multiply is often mentioned as a TLB thrasher, and a good way to stress your system. It's interesting to illustrate this process a little.

The first thing to remember is that C99 (6.5.2.1 to be precise) dictates that arrays are always stored in row-major order, as shown below.

Example of row-major order

So, remembering back to first year algebra, to multiply two matrices we multiply and sum rows and columns, illustrated linearly as the computer see it below. Black lines are page boundaries, while boxes are matrix elements.

Matrix multiplication

The simplest code to do this is usually along the lines of

int a[LEN][LEN];
int b[LEN][LEN];
int r[LEN][LEN];

int i,j,k

for(i=0; i < LEN; i++)
	 for(j = 0; i < LEN; j++)
	       for(k = 0; k < LEN; k++)
		     r[i][j] += a[i][k] + b[k][j];

We can see how we have a repeated linear walk over the memory holding matrix b, done for each element in a. If we make b sufficiently large (say, a few hundred pages) we can be sure that by the time we get to the end, we have kicked out the earlier TLB entries. Then we start all over again, and start faulting pages back in. Thus a matrix multiply problem comes down to doing repeated page-sized stride walks over a linear region.

Looking at the diagram, it is also clearer what can solve this problem. We can either have more TLB entries, so we don't have to kick out entries which we will use again, or have larger pages -- e.g. less black lines. Or a smarter algorithm ...

posted at: Wed, 10 Jan 2007 11:21 | in /code/linux | permalink | add comment (0 others)

Set-bit Population Counts

We were visiting our friends at HP yesterday, and whilst discussing some of the Itanium product line plans, the comment that certain TLA's have a real sweet spot for Itanium because it can tell them the number of set bits in a word very quickly. I was aware of the popcnt instruction, but wasn't aware people were buying machines to use it.

I instrumented some code to run a test -- one using the GCC built-in (which falls through to the popcnt instruction on Itanium), one using a re-implementation of the generic GCC version, and one described in the excellent "Algorithms for programmers" book by Jörg Arndt (download it now if you don't have it!).

A naive implementation would do it with 64 shifts, which would take forever (for a fairly good explanation of why shifting takes so long, see the patent on optimised bit-shifting). The gcc built-in way is a clever hack:

const int count_table[256] =
{
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
};

static inline unsigned long countit_table(unsigned long x) {
  int i, ret = 0;

  for (i = 0; i < 64; i += 8)
    ret += count_table[(x >> i) & 0xff];

  return ret;
}

The results I saw, in cycles, were:

Algorithm 2Ghz AMD Opteron 270 1.5Ghz Itanium2
Built in 30 6
Table lookup 45 27
Shifting 60 48

The AMD GCC built-in seems to create some slightly tighter code than my extract table lookup version -- it doesn't unroll the loop and I assume the cache does a good job of the prediction. But if your applications rely on finding set bits heavily, then Itanium is the platform for you!

posted at: Fri, 15 Dec 2006 13:39 | in /code/arch | permalink | add comment (2 others)

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.