Articles in the hacks category

Crikey Cleaner

I quite enjoy reading the thinking man's New Idea that is Crikey. I do not, however, enjoy the HTML of their daily email; for a new-media company it is very 1996. For example, every paragraph has what should be done once in a style-sheet (although there is a weird mix of using CSS and not), multiple breaks are used to add space and links are hidden behind a nasty re-director that gives you absolutely no clue where you're going to end up. You can choose plain text but it comes so mangled, with included random Unicode characters, that it's more hassle than it's worth. The W3C validator is a sea of red, should you try and run it through that; not that it really knows what to do because there's no DOCTYPE declaration. Here's a particularly bad sample:

<p style="font-size: 14px; font-family: Arial, Helvetica, sans-serif"> <i class="credit">Sophie Black writes:</i> <br> <br> <p style="font-size: 14px; font-family: Arial, Helvetica, sans-serif"> Although the AFP is reporting that the <a href="http://tracking.crikey.com.au/LinkRedirector.aspx?clid=3976b864-c145-409c-bfd6-6ffab8303e50&rid=2ad7a942-b9db-4727-8ab3-078bdec50e97" target="_blank">noon deadline</a> ... </p>

As given, the bad HTML and huge advertisements in big fixed sized tables make it all but impossible to read on a Palm Pilot screen, which is my preferred medium for the bus. Thus I wrote clean-crikey.py to strip out all the crap and leave you with something reasonable (and as a bonus save you anywhere up to 40KiB).

What I now do is run the daily email through this, and rsync it to a private place which is subscribed to via an Avantgo user-created channel. Then I just sync using the built-in wireless of the T|X and I'm done. But crikey, it shouldn't be that hard! Just send valid HTML with a range of sensible style-sheets...

Update: in response to a letter requesting a way to remove all articles by Christian Kerr published in today's Crikey, I have updated the script to strip articles with a particular byline if -b is passed.

GCC beats the Intel compiler at cracking Enigma?

I read about attempts to crack old Enigma messages on Bruce Schneier's Blog (then I think it hit Slashdot).

I've always liked Enigma since I read The Code Book, and I instantly grabbed the binary and put it on my local IA64 box, a 900Mhz Itanium2.

Usually for this sort of thing, the Intel compiler produces much better code than GCC. So I thought it might be fun to see how much better! Jeff Gilchrist gave me the incarnation to do a little benchmark, and the results surprised me:

$ time ./enigma -c -f B:524:AZ:AAA -t B:524:AZ:ZZZ dict/00trigr.gen dict/00bigr.gen ct/orig1
gcc -O3

real    3m30.219s
user    3m25.492s
sys     0m0.256s

icc -O3

real    5m25.663s
user    5m18.900s
sys     0m0.252s

That seems to be the wrong way around! The first step step seemed to be to fiddle with compiler options. -O3 can be a bit aggressive for the Intel compiler, and not knowing anything about the code base I wondered if profile guided optimisation might help (where you run the binary once with profiling code, and then recompile so the compiler has more information.

icc -O3 pgo

real    5m24.414s
user    5m18.324s
sys     0m0.264s

icc -O2 pgo

real    5m23.997s
user    5m12.680s
sys     0m0.284s

icc -O2

real    5m18.919s
user    5m12.224s
sys     0m0.212s

Clearly we're not helping here, but -O2 seems to be the place to start. Next step was to use a profiler to see where to start. q-syscollect showed me what was taking up the most time.

icc -O2 profile

% time      self     cumul     calls self/call  tot/call name
 51.87    147.51    147.51     50.2M     2.94u     2.94u triscore
 33.83     96.20    243.72     46.6M     2.07u     2.07u biscore
 10.36     29.47    273.19     13.1M     2.25u     2.27u icscore
  2.75      7.83    281.02         -         -         - hillclimb

gcc -O3 profile

% time      self     cumul     calls self/call  tot/call name
 54.71    102.35    102.35     49.9M     2.05u     2.05u triscore
 26.69     49.93    152.29     46.4M     1.08u     1.08u biscore
 12.69     23.74    176.03     13.1M     1.81u     1.89u icscore
  2.54      4.75    180.78     14.5k      327u      327u init_path_lookup_H_M3
  1.74      3.26    184.04         -         -         - hillclimb

So, triscore seems to be the place to start. Trying to handle code that runs for 5 minutes doesn't lead to a good debugging experience, so I pulled triscore out so I could analyse it separately.

$ time ./triscore-gcc

real    0m1.399s
user    0m1.384s
sys     0m0.004s

$ time ./triscore-icc

real    0m2.346s
user    0m2.316s
sys     0m0.000s

Ok, now we have a small test case that runs in a few seconds. The code I used is available here.

Time to pull out Caliper and start peeking under the hood.

Caliper dcache output gcc
=========================
L1 data cache miss percentage:
   9.79 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)

Percent of data references accessing L1 data cache:
  99.89 = 100 * (L1D_READS / DATA_REFERENCES)

Caliper dcache output icc
=========================
L1 data cache miss percentage:
  17.63 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)

Percent of data references accessing L1 data cache:
  81.76 = 100 * (L1D_READS / DATA_REFERENCES)

We can see the average latency for the misses:

% Total                                        Avg.
 Dcache  Cumulat        Sampled       Dcache  Dcache
Latency    % of          Dcache      Latency  Laten.
 Cycles    Total         Misses       Cycles  Cycles   Load Module
------------------------------------------------------------------
100.00   100.00            1919        12664     6.6   triscore-icc
100.00   100.00             995         6919     7.0   triscore-gcc

Which lets us approximate how much extra time we spend waiting for cache.

gcc
L1D_READ_MISSES.ALL     x___   0       91013769

icc
L1D_READ_MISSES.ALL     x___   0      166681605

With a cycle taking ~1ns, average latency of 7 cycles (166681605 - 91013769) * 1 * 9 = ~ 0.6 seconds.

Let's have a look at what the TLB thinks about all this.

icc
===
Percentage of data references covered by L1 and L2 DTLB:
  99.02 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)

Percentage of data references covered by the HPW:
   0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)

Percentage of data references covered by software trap:
   0.98 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

gcc
===
Percentage of data references covered by L1 and L2 DTLB:
  100.00 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)

Percentage of data references covered by the HPW:
   0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)

Percentage of data references covered by software trap:
   0.00 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

Software trap? That sounds slow. We can ask Caliper to show us the instructions that are causing the software trap.

Function Details
------------------------------------------------------------------------------------------------------
% Total                                                 %       %       %
Sampled       Sampled      DTLB      DTLB      DTLB   DTLB    DTLB    DTLB          Line|
   DTLB          DTLB       L2        HPW      Soft    L2      HPW    Soft          Slot|  >Statement|
 Misses        Misses     Fills     Fills     Fills   Fill    Fill    Fill     Col,Offset  Instruction
------------------------------------------------------------------------------------------------------
100.00     [test-icc::triscore, 0x40000000000008c0, triscore.c]
                 600       429         0       171    71.5     0.0    28.5              6  Function Totals
         ----------------------------------------------------------------------------------------------
[** cut out other stuff **]
                 144         0         0       144     0.0     0.0   100.0       0x0c70:0    M_      lfetch        [r87] ;;
                   0         0         0         0     0.0     0.0     0.0             :1    M       lfetch        [r67]
                   0         0         0         0     0.0     0.0     0.0             :2    I       nop.i         0

Ah, huh! lftech is an operation to pre-fetch data you know you're about to use into the cache. After discussions with the Caliper people, this doesn't actually raise a software fault to the operating system, but does actually go to memory via the hashed page table walker (more on that later). Let's check out that load.

(gdb) break *(triscore + 0xc70)
Breakpoint 1 at 0x4000000000001530: file triscore.c, line 26.
(gdb) r
Starting program: /home/ianw/programs/enigma-test/extract/enigma-test/triscore-icc

Breakpoint 1, 0x4000000000001530 in triscore (stbrett=0x0, ciphertext=0x0, len=0) at triscore.c:26
26        for (i = 2; i < len-15; i += 16) {
(gdb) print/x $r87
$1 = 0x600010010fe8f9c0
(gdb) print/x *($r87)
Cannot access memory at address 0x600010010fe8f9c0
(gdb) print/x $r67
$2 = 0x6000000000005be8
(gdb) print/x *($r67)
$3 = 0x0

Well, that address in r87 looks really bogus. It seems that what is happening is that ICC is being very aggressive with its prefetching algorithm, to the point it is getting it wrong. This brings us close to the answer, but we can just check out a few other things.

Firstly, these programs are using a lot of local registers. Maybe it is register stack engine traffic; after all maybe ICC is more aggressive with register allocations.

-- gcc -------------------------------------------
                            PLM
Event Name                 U..K  TH          Count
--------------------------------------------------
BACK_END_BUBBLE.ALL        x___   0      249132805
BE_RSE_BUBBLE.ALL          x___   0           3727
BE_RSE_BUBBLE.OVERFLOW     x___   0           1792
BE_RSE_BUBBLE.UNDERFLOW    x___   0           1935
--------------------------------------------------

-- icc -------------------------------------------
                            PLM
Event Name                 U..K  TH          Count
--------------------------------------------------
BACK_END_BUBBLE.ALL        x___   0      440096824
BE_RSE_BUBBLE.ALL          x___   0           4128
BE_RSE_BUBBLE.OVERFLOW     x___   0           1792
BE_RSE_BUBBLE.UNDERFLOW    x___   0           2336
--------------------------------------------------

Nup; but we can see that the pipeline stalls are coming from the backend (BACK_END_BUBBLE.ALL). This mostly rules out bad instruction scheduling (i.e. the processor doesn't have enough to do) and probably shows why IPO didn't make much difference before.

By poking into the backend pipeline even more, we can see overhead is really coming from increased TLB activity.

-- gcc --------------------------------------------------
                                   PLM
Event Name                        U..K  TH          Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR    x___   0      100968826
BE_L1D_FPU_BUBBLE.L1D_HPW         x___   0           8937
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS    x___   0           1358
BE_L1D_FPU_BUBBLE.L1D_TLB         x___   0        6354270
---------------------------------------------------------

-- icc --------------------------------------------------
                                   PLM
Event Name                        U..K  TH          Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR    x___   0      431700491
BE_L1D_FPU_BUBBLE.L1D_HPW         x___   0      181649893
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS    x___   0           3207
BE_L1D_FPU_BUBBLE.L1D_TLB         x___   0       33004161
---------------------------------------------------------

Here we have confirmation of what appears to be going on. The time spent waiting for the hardware page table walker is orders of magnitude greater with ICC.

On Itanium, there is a cache of the page tables kept in memory acting as a third level TLB. When a TLB miss is taken first the 1st level TLB is checked, then the larger and slower second level, and finally the hardware goes to main memory and checks the "virtually hashed page table". Obviously this third level, being in system memory and lower down the memory hierarchy is much slower.

The lfetch instruction is asking to bring in a bogus cache line for a far out virtual address. This requires a TLB mapping; which obviously doesn't exist. So the normal TLB fault path takes over, finally ending up going to main memory for the hashed page table check. This hashed page table page also has to be moved into the cache, which causes even more cache misses.

At this point, the processor can't find the mapping and gives up; only if it is an aggressive lfetch.fault will the fault go through to the operating system (which then walks the OS page tables).

As a final step, let's recompile and run the Enigma test with -no-prefetch.

real    3m39.243s
user    3m33.648s
sys     0m0.176s

Still about 9 seconds slower than gcc, but close enough for me! I still think that with this code, ICC is being too smart.

So, moral of the story? Firstly, check your code actually does run faster with another compiler (I only started testing with the assumption I would find out how much faster ICC was). Secondly, if your code is running slowly check for aggressive prefetching of cache lines, it might be doing more damage than harm by wasting time looking up TLB entries that never exist. Thirdly, Caliper (and the underlying perfmon and Itanium PMU system) is really cool, and next time someone says Itanium is crap ask them how they would track that problem down on processor de jour!

moinmoin RSS import macro

RSSReader.py is a macro I wrote to import an RSS feed into MoinMoin. I do realise this voids the point of a Wiki in that text is editable and fluid, but on the other hand it seems logical to keep related things presented within your Wiki, even if they are generated externally. I think this is safer than XSLT scripts to present the RSS for a public Wiki, since MoinMoin doesn't seem to offer a great deal of flexibility restricting arbitrary code injection.

This whole idea was made possible to conceive and build within an few hours thanks to the Universal Feed Parser. Vive le logiciel libre!

Import a patch into subversion

I keep expecting there to be an easy way to do this (feel free to comment if there is), but I wanted to replicate the BitKeeper bk import -tpatch behaviour with subversion where you can give and diff and it will be imported into the repository. Importantly this involves adding and removing files as per the patch.

I thus wrote svnapply.py to do this. It actually turned out to be a little more in depth than I had hoped, as there are a few nuances with different diff outputs.

I even tested it by importing a 33mb kernel diff, comparing an export of it to a tree patched using the usual patch, and also re-creating the diff and applying it again, again comparing to the normally patched version. It all came out the same.

Passcard

Secure passwords are hard to come up with and remember. Things like Key Ring for Palm are good, but it means you have to have your Palm Pilot on you constantly. Even worse is something tied to your desktop/laptop. Since the only thing I reliably have on me most all times is my wallet, a paper based solution is best.

I got the idea for this from ./. You have a matrix with the alphabet and random letters, then choose an easy to remember word for each site.

a sample passcard

So for GMail my password might be 4pfMnrMIZb. For extra security, in case someone finds the card, I can always prepend with something extra not on the card (maybe always put a ! first or something).

Latex/awk sources to make a business card sized PDF of random letters available here. Laminated it should make a fairly decent increase in online personal security with minimal effort.

MAC Address to Producer String

Here's a little script that can tell you the producer assigned to a (globally unique) MAC address according to the officially designated allocations. With this you will be able to quantify the number of Apple users who can not follow instructions sufficiently to stop themselves tripping the network trip wire.

If you're googling for "mac address allocations" you really want the "Organizationally Unique Identifier" from here

#!/usr/bin/python

import os, pickle, sys

OUI_ALLOCATIONS="http://standards.ieee.org/regauth/oui/oui.txt"
CACHE_FILE="/home/ianw/.mac2prod.cache"

class MACCache:

    maccache = {}

    def __init__(self, recache):

        if (recache):
            self.setup_cache()
            return
        try:
            f = open(CACHE_FILE, "r")
            unpickler = pickle.Unpickler(f)
            self.maccache = unpickler.load()
        except:
            self.setup_cache()

    def setup_cache(self):
        f = os.popen("wget --quiet -O - " + OUI_ALLOCATIONS)
        for l in f.readlines():
            if l.find("(hex)") > 0:
                sl = l.split("\t")
                name = sl[2].strip()
                mac = sl[0][0:8].replace("-",":")
                self.maccache[mac] = name

        of = open(CACHE_FILE, 'w')
        pickler = pickle.Pickler(of)
        pickler.dump(self.maccache)


def usage():
    print "usage: mac2prod [-re-cache] 00:00:00:00:00:00"
    sys.exit(1)
if not len(sys.argv):
    usage()

if (sys.argv[1] == "-re-cache"):
    recache = True
    sys.argv.remove(sys.argv[1])
else:
    recache = False

if len(sys.argv[1]) != 17 or len(sys.argv[1].split(":")) != 6 :
    usage()

mc = MACCache(recache)

try:
    print mc.maccache[sys.argv[1][0:8].upper()]
except:
    print "Unknown Producer"

Spam cleaning an mbox file

The problem is that I have a large mbox file full of messages which I want to strip spam from with spamassassin. Ordering of the messages is important (i.e. they must go in and come out the other end in order, with identified spam removed).

My solution is to run the following script from formail thusly

$ formail < in-mbox -s /home/ianw/spam-clean.py spam-free-box
#!/usr/bin/python
import os, sys, popen2

if len(sys.argv) != 2:
    print "usage: spam-clean.py output"
    sys.exit(1)

#read in message from stdin.
message = sys.stdin.read()

sa = popen2.Popen3("/usr/bin/spamc -c")
sa.tochild.write(message)
sa.tochild.close()

if sa.wait() != 0:
    print "discarding spam ..."
    sys.exit(1)
else:
    print "ok ..."
    f = open(sys.argv[1], 'a')
    f.write(message)
    f.close()

It's slow, but a little profiling shows me that most of the time is spent asleep (i.e. in the wait() call). One neat project would be to daemonise this and take messages in a fifo manner so we could run a few spamassassins at the same time but maintain message order.

Hey, you can see my house from here!

Hey, you can see my house from here! (at least until Google changes something).

update: it only took them a day to break it, but I did notice they embed a "Powered by Google" icon now and a link to the terms of use. The terms of use on the photo imagery seem to rule out a nice API being on the cards, but it would ceratinly open the door to some really cool stuff.

update 2: Egg on my face!

But in the terms and conditions you get if you click on the "Terms and conditions" link displayed on your map, it still says

The photographic imagery made available for display through Google maps is provided under a nonexclusive, non-transferable license for use only by you. You may not use the imagery in any commercial or business environment or for any commercial or business purposes for yourself or any third parties.

However, the API terms and conditions says

1.2 Photographic Imagery. The Google map images accessible to you through the Service may contain photographic imagery. Your use of this photographic imagery is limited to displaying it to end users within the Service itself, and in the same manner, form, format, and appearance as it is provided by the Service. You may not, nor may you allow others to, copy, distribute, display, alter, or otherwise use, this photographic imagery except as it is provided to you through the Service. Google reserves the sole right and discretion to determine whether your display of photgraphic images through the Service is in conformance with this Section, and also reserves the right to terminate or suspend your access to photographic imagery at any time for any reason, without notice.

Obviously the intent is that you can embed a map in your page but don't fiddle with it too much. But it would be nice if the terms and conditions were cleared up.