Decomposition

Hey, I'm Sean J. Taylor and I study information systems at NYU.

«   Locality-Sensitive Hashing   »

I spent a few hours searching for material on locality-sensitive hashing and figured I would share it here so others wouldn't have to look so hard.

The problem LSH solves is that finding nearest neighbors is a very expensive--both in time and space--operation for large feature-spaces. By hashing input vectors, such as bag-of-word vectors, in a way such that similar vectors are likely to have the same hashes, lookup of near and nearest neighbors becomes a very quick operation.

Notice that this is kind of the opposite property of a cryptographic hash: you don't want similar passwords to have similar hashes because that reveals important information that is supposed to be obscured.

The killer application for LSH is in high-dimensional spaces. Think about matching text, images, even audio files (e.g. Shazam) quickly and efficiently. The application I have in mind is event detection and tracking on Twitter. Essentially matching a bunch of Tweets that are about the same thing and then finding the Tweet that started it all.

Implementations

Hamilton Ulmer has some Python code that is available on github that will hash a large number of vectors using all available cpu cores. This is the simplest and most readable implementation I have found in Python.

There's another Python implementation that I am checking out in bitbucket.

There's a Perl project called Nilsimsa that has been ported to Ruby and Python. Nilsima appears to implement a specific kind of LSH which is designed to work on blocks of text.

This C++ implementation looks fast and useful, but also like it's a real pain to compile. I haven't tried, but if someone could get this running with Ptyhon bindings it would probably be a big win.

Technical Resources

For a short tutorial, I would highly recommend these notes by Slaney and Casey which explain the problem and solution in about 4 pages and 9 equations.

The really technical (and not so useful for a noob like me) stuff, including the original paper on LSH, can be found on a site maintained by Alex Andoni. He says he has an alpha-version of some code available, but you must email him to find it.