Rob Pike Talk - "Interpreting the Data"

Attended a talk by Rob Pike from Google (who is important enough to have his own Wikipedia entry).

In short, the talk was on a language he jointly wrote which I imagine is the tool people at Google use to get figures like the Zeitgeist numbers (I don't think he actually mentioned the name, something like Sawzall? Whatever it is, I think it's also the name of a regular expression function in the language). Google actually have lots of indexes, not just the entire internet but all the search queries, machine logs, developer check-in and build logs and probably what people eat for lunch. You can imagine it would be cool to see this data in interesting ways.

No database is going to handle all this data. In fact, a complex relational database isn't really needed because most of the processing can be thought of as associative, ie. (result a + result b) + result c = result a + (result b + result c), and commutative, ie. result a + result b + result c = result c + result b + result a.
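Those two properties are what let the work be split across machines: each machine can fold up its own partial result, and the partials can be combined in any order. A minimal sketch in Python (the shard data here is entirely made up):

```python
# Three hypothetical machines each count values in their own shard of the
# data. Because + is associative and commutative, the partial sums can be
# combined in any order and the total always agrees.
from functools import reduce

shard_a = [1, 4, 2]    # per-record values seen by machine A
shard_b = [7, 1]       # machine B
shard_c = [3, 3, 5]    # machine C

# Each machine reduces its own shard independently...
partials = [sum(shard) for shard in (shard_a, shard_b, shard_c)]

# ...and the partials can then be merged in either direction.
left_to_right = reduce(lambda x, y: x + y, partials)
right_to_left = reduce(lambda x, y: x + y, list(reversed(partials)))

assert left_to_right == right_to_left == 26
```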

The beauty of it is that you only ever think of things one record at a time. You then "emit" your transformation of the record upwards to an aggregator, which, as the name suggests, aggregates your results into something interesting. These aggregators plug in with the MapReduce implementation, which would be another few talks in itself, although it can apparently sort a terabyte of data in minutes. There are lots of aggregators, like sum, top, histograms, sort etc. All these communications happen through a Google standardised IDL layer called "protocol buffers" which has bindings into many languages like C/C++, Python, etc. It was suggested you can think of this IDL layer as "XML without the XML crap".
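To make the emit/aggregator model concrete, here's a rough Python sketch of the idea as I understood it — the aggregator classes and the log records are my own invention, not the real language's API:

```python
# Sketch of the per-record "emit" model: each record is processed in
# isolation, and values are emitted to named aggregators (sum, top, ...),
# which fold them together. All names and data here are hypothetical.
from collections import Counter

class SumAggregator:
    """Adds up everything emitted to it."""
    def __init__(self):
        self.total = 0
    def emit(self, value):
        self.total += value

class TopAggregator:
    """Tracks the n most frequently emitted keys."""
    def __init__(self, n):
        self.n = n
        self.counts = Counter()
    def emit(self, key):
        self.counts[key] += 1
    def result(self):
        return self.counts.most_common(self.n)

# Hypothetical query log: one query string per record.
records = ["cat pictures", "news", "cat pictures",
           "weather", "news", "cat pictures"]

total_queries = SumAggregator()
top_queries = TopAggregator(2)

for record in records:       # only ever one record at a time
    total_queries.emit(1)
    top_queries.emit(record)

print(total_queries.total)    # 6
print(top_queries.result())   # [('cat pictures', 3), ('news', 2)]
```

The point of the shape is that the loop body never looks at two records at once, so the records could just as well be scattered across thousands of machines.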

So as I understand it, you run Rob's tool over your index, extracting the records you want. This is extremely parallelisable, since you only consider one record at a time and the results are associative/commutative. You pass this upwards to your aggregators, which plug into the MapReduce stuff which is also highly parallelised. They have plans to "pipe" these together (analogous to a Unix pipe), though they haven't done that yet (I imagine you'd want that for something like "most increasing searches" ... find all searches for last week, then all for this week, then take the difference of the two sets, then sort the result set, etc). Rob has a job he runs every morning to check stuff which takes a week of CPU time but completes in a few minutes of real time.
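My "most increasing searches" guess above would look something like this as a pipeline of stages — this is purely my own illustration with made-up numbers, not anything Rob showed:

```python
# Sketch of "piping" one aggregation stage into the next: count queries per
# week, take the per-query difference, then sort by growth. All data is
# invented for the example.
from collections import Counter

last_week = Counter({"olympics": 10, "weather": 50, "news": 40})
this_week = Counter({"olympics": 90, "weather": 55, "news": 35})

# Stage 2: the two counters from stage 1 feed a per-query growth figure.
growth = {query: this_week[query] - last_week[query] for query in this_week}

# Stage 3: sort by growth, most increasing first.
rising = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
print(rising[0])   # ('olympics', 80)
```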

One interesting part is the handling of crap index entries -- the language runs in two modes: one where you crash out on entries you can't handle, and one where they are ignored but logged. This means you can test in the strict mode, but when you run it on the Google indexes, where there are obviously pathological cases you didn't think about (a query string with a million 'a's, for example), the program keeps running.
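The two modes might be sketched like this in Python — the mode names and the fragile transformation are my own, just to show the strict/lenient split:

```python
# Sketch of the two run modes: "strict" crashes on a bad record (useful
# while testing), lenient skips and logs it, so one pathological entry
# can't kill a long-running job over the whole index.
import logging

logging.basicConfig(level=logging.WARNING)

def process(record):
    # A deliberately fragile transformation: fails on non-numeric input.
    return int(record) * 2

def run(records, strict=True):
    results = []
    for record in records:
        try:
            results.append(process(record))
        except ValueError:
            if strict:
                raise   # crash out so the bug is noticed during testing
            logging.warning("skipping bad record: %r", record)
    return results

print(run(["1", "2", "3"]))                    # [2, 4, 6]
print(run(["1", "aaaa", "3"], strict=False))   # [2, 6]
```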

It was pointed out that this model (consider only one record at a time) doesn't fit well with graph-structured data, such as for calculating PageRank ... Rob responded that alternative mechanisms are indeed used to calculate PageRank-type figures. But you can certainly check the PageRank of a page via this tool, and easily run queries like "what is the highest-PageRanked page on a site" (for example, one site apparently has the Acrobat Reader download page as its highest-ranked page).