Articles in the talks category

Google Maps - Lars Rasmussen

Google Maps has a new branch office in Darling Harbour, Sydney.

Google Maps serves about 500 million map tiles ("squares"). Within weeks they hope to have high-resolution imagery of Sydney, with one or two more zoom levels than is currently available. Zachary's Pizza in Berkeley is Lars' tip for the best pizza in the world.

Google Maps originally started life as a C++ desktop application at Where2. They took it to Google, who said they liked the web. They tried plugins and Flash. With AJAX it took 4 weeks to build a prototype, as opposed to years for the C++ app. JavaScript has inconsistencies across browsers, but compared to C++ across operating systems it is easy. However, compared to Google Earth it is nowhere near as rich in features.

On the first day they hit the peak load they had predicted for six months in. Adding satellite imagery caused traffic to surge dramatically in a single day.

Their design philosophy

  • end-user is royalty
  • go beyond the browser lowest common denominator : because they're Google, the browser makers ask them what they want.
  • as dynamic as native apps; as simple as the Google website
  • do what it takes
  • launch early + often : with 4 months of development, figure 3 months to launch. Larry Page played with the demo and said "just go". labs.google.com is the playground.

Google Maps is now under Google Local. They listen to feedback: they discussed the frame layout endlessly for the merge, launched it, people hated it, and they changed it within a week.

DHTML overview : JavaScript handles events, which can modify the structure of the page. The browser does nothing until the event handler finishes, then re-renders the page off screen and displays it.

The back button was solved with virtual pages. What does it mean to push the back button? Back should step back through your search history, not random junk. Normally the browser remembers clicks in iframes as history. To get around this you use a "virtual" iframe which copies what is happening into the main page; this way you control which pages are recorded and consequently what the back button shows.

Backend map providers got nervous about hacks. housingmaps.com had inadvertently broken a link back to the data provider, so Google wasn't being charged. The API wasn't a technical challenge, but a contractual one. Google loved the maps hacks but couldn't tell anyone.

Google would never have built a bus route map for Seattle, but they are inspired by it. Google watches API keys to see what cool things are happening. The Katrina imagery request came from the government. If you want a job, write a good maps hack.

Growth is mostly about integrating all of Google's data sources.

Cafe in San Francisco? There are 20 billion web pages that can probably tell you where "Silicon Valley" is, but Google Maps doesn't know. User logs show that users assume Google knows that stuff -- maybe it comes from the "know it all" search box which people think can do anything.

Processing : different data sources with different confidence; can you join them? The API uses Tele Atlas(?) data, while Navteq data is on the site. Can they be merged? Can other databases be merged in? It is about adding spatial infrastructure to Google's search infrastructure: "cheapest gas via work".

Routing: a shortest-path algorithm over a weighted graph. If the underlying data is correct it works fine. However, traffic is not taken into account; a taxi driver berated him because Maps takes him straight into traffic -- though some drivers love it, because tourists bring it and they make money. Ridefinder: nobody uses it, but you get GPS feeds from taxi drivers which you could use to re-weight your maps.
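That routing description boils down to Dijkstra's algorithm. Here is a minimal sketch (the road graph and weights are invented for illustration, nothing to do with Google's actual data):

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm: graph maps node -> {neighbour: weight}."""
    queue = [(0, start, [start])]   # (cost so far, node, path taken)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in seen:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []

# Hypothetical road segments weighted by travel time in minutes.
roads = {
    "home": {"highway": 5, "backstreet": 8},
    "highway": {"city": 10},
    "backstreet": {"city": 9},
}
print(shortest_path(roads, "home", "city"))   # -> (15, ['home', 'highway', 'city'])
```

Taking traffic into account would just mean re-weighting edges: if a GPS feed reported congestion and bumped the highway weight from 10 to 20, the chosen route would flip to the back street.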

Drawing: pretty anti-aliased maps are nice. Others have had to catch up.

Questions: When are Sydney maps coming? They are very tied to their data format, and Sydney map data doesn't come in that format. They are working to remove that; early next year. Where can you get Aussie map data? Sensis, Map Data Sciences. It all comes from the government in some form.

Different grid systems around the world? There are no cartographers on the team, really. Japan uses a different datum; lat/long values depend on different models of the ovalness of the earth. Datums need to be normalised.

How do they make money? They haven't settled on that just yet. Google figures users == money; not sure how, but something will come up. Advertisements are the obvious answer for "local search" stuff.

An API via C++? No, the web is the focus.

Satellite tile age? Higher resolution == fresher. Google Earth looks after the imagery; the new Sydney imagery is about 6 months old. Imagery has a "novelty" feel: big hits when first released, but users don't always come back.

Censorship of photos: they have to respect it.

Obfuscated JavaScript? It makes the code smaller. Time to download is now measured in seconds; it needs to be milliseconds.

Government fudging? Not that he knows about ...

Projections of tiles? New, different ones are on the way, maybe: night satellite, winter/summer, geological features, historical maps. They have the data, and maybe one day.

Disputed borders: no easy answers. Chinese law says you can't move map data out of the country, but they do have maps on servers in China. They might have to present different things to different users.

Spectral Analysis of Network Traffic

Professor John Heidemann - Information Sciences Institute, University of Southern California

Spectral analysis can help to find information in any sort of periodic data. Some network traffic is periodic; for example, a saturated 100 Mb/s connection with a 1500-byte MTU should be sending out packets at around 7600 Hz.
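The rate follows from dividing link speed by packet size in bits. My arithmetic below comes out a little higher than the 7600 Hz quoted; the difference is presumably in how much per-frame overhead gets counted:

```python
link_bps = 100_000_000          # 100 Mb/s link
mtu_bytes = 1500                # maximum transmission unit

# Ignoring framing overhead entirely:
raw_rate = link_bps / (mtu_bytes * 8)
print(round(raw_rate))          # -> 8333 packets/sec

# Counting Ethernet overhead (preamble 8 + header 14 + FCS 4 + inter-frame gap 12 bytes):
framed_rate = link_bps / ((mtu_bytes + 38) * 8)
print(round(framed_rate))       # -> 8127 packets/sec
```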

John's group have been looking into a few different areas.

Classifying DOS attacks as coming from single or multiple attackers is possible. Each attacker sends at a specific frequency. As you add attackers who are slightly out of sync, you tend to see lower frequencies on the FFT graph become more prominent. Musicians know this effect: when two notes are slightly out of tune you hear 'beats' where the two waves alternately reinforce and cancel. Their work was able to classify attacks as single or multiple attackers quite reliably.
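The beats effect is easy to demonstrate with NumPy (this is just the musical analogy in code, not their classifier): sum two slightly detuned tones, rectify, and the spectrum shows a strong low-frequency component at the difference frequency.

```python
import numpy as np

fs = 1000                       # sample rate (Hz)
t = np.arange(0, 2, 1 / fs)     # two seconds of samples

# Two "attackers" sending at slightly different frequencies.
signal = np.cos(2 * np.pi * 100 * t) + np.cos(2 * np.pi * 106 * t)

# Rectifying (a crude envelope detector) turns the beat into a real
# spectral component at the 6 Hz difference frequency.
spectrum = np.abs(np.fft.rfft(np.abs(signal)))
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

low = (freqs > 1) & (freqs < 50)            # skip DC, look below 50 Hz
beat = freqs[low][np.argmax(spectrum[low])]
print(beat)                                 # -> 6.0
```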

To defeat this, attackers would either need to be in sync, which entails global clocks, etc., or to vary the packet rate (you need quite a lot of variety in the rate to defeat their heuristics). If you are varying the rate then by definition you are not sending as fast as you can, so there may be some benefit there.

Fingerprinting attacks is also possible. They divide traffic into segments and keep the mean and covariance of the frequencies seen in each segment. From these fingerprints you can identify particular "troops" of people attacking you. This can be useful for reporting cybercrime, as you need to show you are being targeted. Again, defeating fingerprinting means varying your frequencies. An attack's rate limit comes from the network, the host or the tool itself. Most attacks don't limit themselves, or give a predictable limiting fingerprint. They are usually written badly, so they are host limited (i.e. they do so much work sending a packet that they can't saturate the link). The fingerprints could still be detected under a range of background cross-traffic.
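The segment-based fingerprint could be sketched like this (an illustration of the idea only, not their implementation; the window size and synthetic "tool" traces are assumptions):

```python
import numpy as np

def spectral_fingerprint(trace, window=256):
    """Mean and covariance of per-window magnitude spectra."""
    n = len(trace) // window
    windows = trace[: n * window].reshape(n, window)
    spectra = np.abs(np.fft.rfft(windows, axis=1))
    return spectra.mean(axis=0), np.cov(spectra, rowvar=False)

def fingerprint_distance(fp_a, fp_b):
    """Crude similarity measure: distance between mean spectra."""
    return np.linalg.norm(fp_a[0] - fp_b[0])

# Two synthetic attack tools with different characteristic rates.
rng = np.random.default_rng(0)
t = np.arange(4096)
tool_a = np.sin(2 * np.pi * 0.05 * t) + 0.1 * rng.standard_normal(4096)
tool_b = np.sin(2 * np.pi * 0.20 * t) + 0.1 * rng.standard_normal(4096)

same = fingerprint_distance(spectral_fingerprint(tool_a),
                            spectral_fingerprint(tool_a))
diff = fingerprint_distance(spectral_fingerprint(tool_a),
                            spectral_fingerprint(tool_b))
print(same < diff)   # -> True
```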

Performance analysis is another area the group has started looking into. The main benefit of using spectral analysis is that you don't need to analyse individual data flows; the analysis is stateless. You can see where things are network limited by, for example, looking for spikes around 7600 Hz on a 100 Mb/s network. You do need to be careful with the FFT windowing to make sure you are seeing decent results.

Questions

Do fingerprints look the same when being attacked from LA as from New York, for example? This is an area for future research. Have they studied wireless networks? No, but others are using these techniques. Could you use a vector of packet sizes and do the FFT on that, ameliorating some of the problems with fuzz at different packet sizes? That is a possible area for future research. Should networking students take signal processing courses? Maybe!

Gordon Bell IEEE Lecture

Titled: Industry's evolutionary path or Moore's Law: Que sera sera

Gordon Bell is a pioneer of the computer industry. When he talks he drops names like Bill Gates, Paul Allen, Bill Joy and Jim Gray. He spent some time at UNSW several decades ago.

CPU speed increases are stalling; further gains from Moore's Law will come from multi-core designs. Storage is going crazy, allowing many interesting possibilities. GPUs are also an interesting area, keeping ahead of Moore's Law.

Gordon suggests everything "cyberizable" will eventually be cyberized. He gave a flowchart of how this happens: world to continent to region (intranet) to campus to home to car to body to in-body! He also said that data, telephony and television will converge; there is no point in keeping three delivery mechanisms around. This is already happening.

He expanded on "cyberization", saying it means there will be a digital copy of everything located in cyberspace. This means that the gyprock in your walls will have RFID tags with serial numbers.

Gordon gave his own law, Bell's Law. It is based around computer classes, and says that every decade a new, lower-cost class of computers emerges, defined primarily by its platform, interface and interconnect. You can see how this has happened from mainframes to PCs to networked PCs to wireless mobile PDAs.

Gordon converted to believing that large, single-image computers aren't the future in about 1995. However, clusters are too hard to program. He jokes that the Gordon Bell Prize is given because programming clusters is so bad that anyone who tries deserves a prize.

One reason he thinks UNIX never took off until Linux was price and vendor lock-in. Bill Joy's law is that you don't write software unless you can sell 100,000 copies and it costs you less than $10 million. Bill Gates' law is that you don't write software unless you can sell 1,000,000 copies and it costs you $10 million.

Pharmix was one of Gordon's investments. They created a molecular mechanics accelerator chip for modelling drug-receptor interactions. It gave a 1000x speedup over a 1GHz PC, but they abandoned it in favour of commodity GPU hardware.

Very large disks are the driver of the new world vs. the old world. In the old world there is a mainframe which costs cents per transaction and around $85/gigabyte/year. The new world is scaled-out PCs (Google); transactions cost basically nothing and storage runs around $1 million/year/petabyte. You now capture everything, because you can. For example, Brewster Kahle runs archive.org for around $2k/terabyte/year, all on open source hardware.
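Working those numbers through (decimal units; my arithmetic, not Gordon's slide), the scaled-out world comes to a dollar or two per gigabyte per year against $85 for the mainframe:

```python
mainframe_per_gb = 85.0                    # $/gigabyte/year (old world)

# $1 million per petabyte per year; 1 PB = 1,000,000 GB (decimal)
google_per_gb = 1_000_000 / 1_000_000      # $1/GB/year

# archive.org: ~$2000 per terabyte per year; 1 TB = 1000 GB
archive_per_gb = 2000 / 1000               # $2/GB/year

print(mainframe_per_gb / google_per_gb)    # -> 85.0 (times cheaper)
```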

An example of how this information could have been used: book stores could have used "perfect inventory" to kill Amazon. They could have put up a website that tells you whether to go down the road to get your book or order it for delivery. They missed the boat on that.

Gordon is interested in the "1TB life". It turns out a guy called Vannevar Bush proposed the "memex" back in 1945, in which an individual could store all their books, records, etc. Gordon showed how he calculated an average life based on a typical day: emails, web pages, scanned pages, 8 hours of recorded audio and a few digital photos will take about 65 years to fill 1TB, which these days can be carried around.
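A quick sanity check of that figure (my arithmetic; the per-item sizes are guesses, not Gordon's): 1TB over 65 years is a budget of roughly 42 MB a day, which everyday text plus heavily compressed audio plausibly fits inside.

```python
terabyte = 10 ** 12                      # bytes, decimal
years = 65
per_day = terabyte / (years * 365)
print(round(per_day / 1e6))              # -> 42 MB/day budget

# A hypothetical day's capture; all sizes assumed for illustration.
day = {
    "100 emails at ~20 kB":      100 * 20_000,       # 2 MB
    "200 web pages at ~50 kB":   200 * 50_000,       # 10 MB
    "8h speech audio at 1 kB/s": 8 * 3600 * 1_000,   # 28.8 MB
}
print(sum(day.values()) <= per_day)      # -> True, fits the budget
```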

Add to this a GPS, a heart rate monitor, video and you have a complete record of everything you ever did.

He sees sensor networks as the next step in the Bell's Law chain. He told us about his investment in Dust Networks, which is based on wireless sensor technology. They have created little units with GPS and IrDA which the army is dropping in the middle of nowhere. No one really knows why.

Gordon's vision for the next 10 years is mixed. He sees Moore's Law continuing. He wants paper-quality screens and terabyte personal stores (running WinFS). He sees Murphy's Law staying with us, with complex systems throwing up problems we didn't foresee. There will be astronomically sized databases. Personal authentication will be required to access any network service of value. "It's the internet, stupid" (shades of "The network is the computer", anyone?). Of course, he is positive about sensor networks.

Questions

He mentioned that POTS didn't evolve, but the point was raised that network providers did come up with packetization. He agreed.

A question was asked about the difficulty of programming clusters. Gordon suggested that individual problems can be broken up to run on clusters, but general-purpose applications do not do well (Google, when it comes down to it, is individual applications). He feels this is the "holy grail" of clusters.

Obviously a lot of his technology raises questions about surveillance and big brother. Gordon said it worried him, but suggested there are many advantages. There was also a bit of fatalism: this is the way things are going; what can you do?

He was asked about the cell processor. He thought it was interesting.

Other questions were about quantum computing and nerve interfaces; he liked the ideas but would wait until he saw products in the lab before investing.

I missed a question about another technologist who evidently talks about the digital lifestyle.

He was asked about battery life not keeping up with computers. He mentioned we need to make trade offs between speed and battery life. There is no point having a battery that lasts twice as long if it takes twice as long to do anything.

Rob Pike Talk - "Interpreting the Data"

I attended a talk by Rob Pike from Google (who is important enough to have his own Wikipedia entry).

In short, the talk was on a language he jointly wrote which I imagine is the tool people at Google use to get figures like the Zeitgeist numbers (I don't think he actually mentioned the name; something like Sawzall? Whatever it is, I think it's also the name of a regular expression function in the language). Google actually has lots of indexes: not just the entire internet but all the search queries, machine logs, developer check-in and build logs, and probably what people eat for lunch. You can imagine it would be cool to see this data in interesting ways.

No database is going to handle all this data. In fact, a complex relational database isn't really needed, because most of the data can be thought of as associative, i.e. (result a + result b) + result c = result a + (result b + result c), and commutative, i.e. result a + result b + result c = result c + result b + result a.

The beauty of it is that you only ever think of things one record at a time. You then "emit" your transformation of the record upwards to an aggregator, which, as the name suggests, aggregates your results into something interesting. These aggregators plug in with the MapReduce implementation, which would be another few talks in itself, although it can apparently sort a terabyte of data in minutes. There are lots of aggregators: sum, top, histograms, sort, etc. All these communications happen through a Google-standardised IDL layer called "protocol buffers", which has bindings into many languages like C/C++, Python, etc. It was suggested you can think of this IDL layer as "XML without the XML crap".
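A toy Python imitation of the record-at-a-time model (nothing like the real Sawzall/MapReduce code, just the shape of it): the per-record function only ever sees one record and "emits" to aggregators, and because the aggregation is associative and commutative, records can be processed in any order on any machine.

```python
from collections import Counter

# Two aggregator "tables": a running sum and a top-N counter.
queries_total = 0
top_terms = Counter()

def emit_count(n):
    global queries_total
    queries_total += n

def process_record(record):
    """Sees exactly one record at a time; emits upwards to aggregators."""
    emit_count(1)
    top_terms.update(record["query"].split())

records = [
    {"query": "google maps sydney"},
    {"query": "sydney pizza"},
]
for record in records:           # any order, any machine: results merge
    process_record(record)

print(queries_total)             # -> 2
print(top_terms.most_common(1))  # -> [('sydney', 2)]
```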

So as I understand it, you run Rob's tool over your index, extracting the records you want. This is extremely parallelisable, since you only consider one record at a time and the results are associative/commutative. You pass these upwards to your aggregators, which plug into the MapReduce stuff, which is also highly parallelised. They have plans to "pipe" these together (analogous to a Unix pipe), though they haven't done that yet (I imagine you'd want that for something like "most increasing searches": find all searches for last week, then all for this week, then take the difference of the two sets, then sort the result set, etc). Rob has a job he runs every morning to check stuff which takes a week of CPU time and completes in a few minutes.

One interesting part is the handling of crap index entries -- the language runs in two modes: one where you crash out on entries you can't handle, and one where they are ignored but logged. This means you can test strictly, but when you run on the Google indexes, where there are obviously pathological cases you didn't think about (a query string with a million 'a's, for example), the program keeps running.
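The two-mode behaviour might look like this in Python terms (a sketch of the idea, not Sawzall's actual syntax or semantics):

```python
def run(records, handler, strict=False):
    """Apply handler to every record; skip-and-log bad ones unless strict."""
    results, skipped = [], 0
    for record in records:
        try:
            results.append(handler(record))
        except Exception as err:
            if strict:
                raise                 # testing mode: crash out immediately
            skipped += 1              # production mode: log and keep going
            print(f"skipped bad record: {err!r}")
    return results, skipped

records = ["3", "7", "a" * 10, "2"]   # one pathological entry
results, skipped = run(records, int)
print(results, skipped)               # -> [3, 7, 2] 1
```

Passing `strict=True` instead would raise on the pathological entry, which is what you want while testing the program.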

It was pointed out that this model (consider only one record at a time) doesn't fit well with a graph structure of data, such as for calculating PageRank ... Rob responded that alternative mechanisms are indeed used to calculate PageRank-type figures. But you can certainly check the PageRank of a page via this tool, and easily run queries like "what is the highest PageRanked page on a site" (for example, adobe.com has the Acrobat Reader download page as its highest-ranked page).