Playing with the x86 PMU

A discussion about commas lead to an excuse to have a play with the IA-32 processor performance managment unit (PMU). To start, take two versions of a program to count the number of commas in a text file — one in C and one in Python. The C one runs faster on the input data set of ~60MiB of random data, but why?

The CPU performance monitors give are the key to getting some idea of where the programs spend their time. I like to use perfmon2 because it's what I know, but Oprofile can do it too. All the available events for IA-32 are described in the manual; I currently of no better way of finding out about them than just reading it. On Itanium I reccommend Caliper which, for the most common situations, does most of the work for you and presents it in a nice report. Intel's Vtune also does a similar thing.

The first thing to investigate is if the CPU is getting enough instructions to keep busy. The IFU_MEM_STALL metric is a good place to start as it is triggered when the instruction fetch pipeline is stalled, presumably waiting on either the ITLB or the trace buffer (Icache).

$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL ./comma < ./randomcommas
340559375 CPU_CLK_UNHALTED
   192115 IFU_MEM_STALL
$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1287100
4571257047 CPU_CLK_UNHALTED
 71981750 IFU_MEM_STALL

That works out to 0.05% of total cycles for the C version and 1.5% for the Python version, neither of which sets off immediate warning bells. If it did, we could start drilling down to things like the L2_IFETCH and ITLB_MISS events, or the BR_* branch events to try and see why the CPU is having to wait to get its next instruction.

Next it is useful to find the CPI (cycles per instruction). This is calculated by the ratio of retired instructions against the number of CPU cycles; since a superscalar machine can issue more than one instruction per cycle this should ideally be much greater than 1 (for example, an Itanium can execute up to 6 instructions each cycle).

$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED ./comma < ./randomcommas
542953593 INST_RETIRED
340612036 CPU_CLK_UNHALTED
$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1194455205 INST_RETIRED
4569931735 CPU_CLK_UNHALTED

This works out at a CPI of 1.59 for the C version and 0.26 for the Python version. The Python version is clearly spending a lot of time waiting, because it isn't even able to issue one instruction every cycle.

At this point it seems the CPU has enough instructions to do, but it is sitting around waiting to get through those instructions. This suggests the waiting is related to getting data from the cache.

The load and store requests from the L2 cache are accounted to the L2_LD and L2_ST events respectively. These have the ability to mask out cache lines in different states of the MESI protocol, but for this we don't care so just ask pfmon to show us everything.

$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I ./comma < randomcommas
102505 L2_LD:M:E:S:I
   167 L2_ST:M:E:S:I
$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
3278774 L2_LD:M:E:S:I
  10457 L2_ST:M:E:S:I

This shows us that the Python version does quite a few more stores than the C counterpart. Considering this program should simply be reading the input stream and counting the number of commas, we do not expect much store traffic at all. This suggests the Python version is doing some extra copying, for whatever reason (maybe some Python expert can pinpoint it?).

We can drill down a bit more into the memory related latencies. The DCU_LINES_IN event gives the total number of lines allocated in the cache. Another event, DCU_MISS_OUTSTANDING, gives a weighted measure of the cycles spent waiting for a cache line to be brought in. Each cycle spent waiting is weighted by the number of outstanding cache misses (I think the Pentium M I'm using can have up to 4 cache miss requests outstanding at once) and has some caveats, but can be considered a rough estimate of the time spent waiting for a cache line to be brought in. Therefore divding DCU_MISS_OUTSTANDING by DCU_LINES_IN gives us an approximate metric of how long a cache miss takes.

$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN ./comma < randomcommas
769736 DCU_MISS_OUTSTANDING
102387 DCU_LINES_IN

$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
99810150 DCU_MISS_OUTSTANDING
 4240179 DCU_LINES_IN

So that works out to 7.5 cycles for the C version and 23 cycles for the Python version. This seems to strongly suggest that it is memory traffic that is weighing the Python version down.

That is only the initial phase; the results give a high level idea of why one program is running slower than the other. The initial analysis phase generally consists of taking the ratios of certain events to try and get some idea in your head of what the program is doing. Then comes the really hard work; drilling down to figure out how to fix it!

Some useful references: