Probabilisitic datastructures

Advait (our AI supervisor) gave a good talk last week on probabilistic datastructures. He showed how you could save orders of magnitude of space if you were prepared to have an almost right answer (with probability bounds).

He motivated the talk using the example of which is trying to store unique pictures of cats. You can use a bloom filter to do the same job as a hash table here.   If you had 40Mb of raw cat pictures then you could use a 0.6Mb bloom filter an an efficient way of looking up whether you already had a certain picture with 4% error. If you wanted an exact answer with (e.g.) a hash table it would take you 4Mb. Obviously in reality you would have petabytes of cat pictures – then the savings start to add up.

We then looked at cardinality estimation – this is the problem of estimating how many unique cat pictures we have.  For our 40Mb example above you can do this in 125kb using a linear counter, or in 2kb using a loglog counter.

The next question was counting the frequency of each cat picture.

One might say we are trying to graph the long tail of cats.

One might say we are trying to graph the long tail of cats.

A count-min sketch can do this in just 48kb of space (with 4% error).

We went in to each of the above in detail to understand how they worked and Advait had a quiz on each one to see how they would store some example data.

Advait is using probabilistic algorithms in his research on visualising large datasets and so to close he showed us a sneak preview of his upcoming publication on this.

The talk was presented really clearly and confidently with lots of audience interaction.  All the talk of cat databases gave us a real world (!) scenario for the algorithms.  And it was nice to hear a bit about his research too.