technovelty

weblog of Ian Wienand

RSS  |  technovelty home  |  page of ian  |  ianw@ieee.org

Set-bit Population Counts

We were visiting our friends at HP yesterday, and whilst discussing some of the Itanium product line plans, the comment that certain TLA's have a real sweet spot for Itanium because it can tell them the number of set bits in a word very quickly. I was aware of the popcnt instruction, but wasn't aware people were buying machines to use it.

I instrumented some code to run a test -- one using the GCC built-in (which falls through to the popcnt instruction on Itanium), one using a re-implementation of the generic GCC version, and one described in the excellent "Algorithms for programmers" book by Jörg Arndt (download it now if you don't have it!).

A naive implementation would do it with 64 shifts, which would take forever (for a fairly good explanation of why shifting takes so long, see the patent on optimised bit-shifting). The gcc built-in way is a clever hack:

const int count_table[256] =
{
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
};

static inline unsigned long countit_table(unsigned long x) {
  int i, ret = 0;

  for (i = 0; i < 64; i += 8)
    ret += count_table[(x >> i) & 0xff];

  return ret;
}

The results I saw, in cycles, were:

Algorithm 2Ghz AMD Opteron 270 1.5Ghz Itanium2
Built in 30 6
Table lookup 45 27
Shifting 60 48

The AMD GCC built-in seems to create some slightly tighter code than my extract table lookup version -- it doesn't unroll the loop and I assume the cache does a good job of the prediction. But if your applications rely on finding set bits heavily, then Itanium is the platform for you!

posted at: Fri, 15 Dec 2006 13:39 | in /code/arch | permalink | add comment (2 others)

Posted by Alain Billoire at Fri Aug 10 02:54:18 2007

One such application that relies heavily on pop count is Monte Carlo simulation in statistical physics, using the so-called multi-spin coding technique. I modified my program after reading your post and got a factor seven improvement in speed on tanium.

Posted by Blaisorblade at Thu Sep 25 08:53:55 2008

Tanenbaum's Structured Computer Organization mention that popcnt was included in the Sparc architecture to allow simulating nuclear bomb explosion...
And if it helps so much Montecarlo simulations (even if I don't know this coding technique), that becomes really reasonable...

Add a comment
*Name
*Email (not shown)
Website
*Comment:
*Word above?
* denotes required field

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.