Text Analysis (Part-6)
In the previous lectures we covered various important text analysis techniques. This lecture is not about a technique to perform text analysis, but about a technique to speed up text analysis itself, known as Locality Sensitive Hashing (LSH).
Locality Sensitive Hashing
Why LSH?
The task of finding nearest neighbours, such as finding similar documents, is very common. Using brute force to check all possible pairs gives the exact nearest neighbour, but it is not practical for large data because it is too time-consuming. Approximate algorithms for this task have been an area of active research; although these algorithms don't guarantee the exact answer, they provide a good approximation while being faster and more scalable. Locality sensitive hashing (LSH) is one such algorithm, and it has various applications.
Basic Idea
LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity.
The goal of LSH is to find near-duplicate pairs of documents in a given large collection of documents. For this, the LSH algorithm is divided into 3 steps:
- Shingling
- Min Hashing
- Locality-sensitive hashing
The basic idea is to chain these three steps: shingle the documents, compress the shingle sets into short signatures with min hashing, and then band the signatures to find candidate pairs.
Shingling
In this step, we convert each document into a set of substrings of length k (also known as k-shingles or k-grams). The key idea is to represent each document in our collection as a set of k-shingles.
Example: Consider a document D = “Nadal”. If we're interested in 2-shingles, our set will be {Na, ad, da, al}; similarly, the set of 3-shingles will be {Nad, ada, dal}. A minimal sketch in Python is shown below.
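As a concrete illustration, here is a small sketch of character-level shingling in Python; the helper name `shingles` is our own choice, not from the lecture:

```python
def shingles(text, k=2):
    """Return the set of k-shingles (contiguous length-k substrings) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("Nadal", k=2))  # {'Na', 'ad', 'da', 'al'} (set order may vary)
print(shingles("Nadal", k=3))  # {'Nad', 'ada', 'dal'}
```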
Certain facts regarding the shingles:
- Similar documents are more likely to share more shingles
- Reordering paragraphs in a document or changing a few words doesn't have much effect on its shingles
- A k value of 8–10 is generally used in practice. A small value results in shingles that appear in most documents (and are therefore bad for differentiating documents)
Jaccard Index
We have a representation of each document in the form of shingles. Now we need a metric to measure similarity between documents. The Jaccard Index is one metric we can use. The Jaccard Index between any two documents A and B is defined as:

Jaccard Index(A, B) = |A ∩ B| / |A ∪ B|
It's also known as intersection over union (IOU). Suppose A: “Nadal” and B: “Nadia”; then their 2-shingle representations will be:
A: {Na, ad, da, al} and B: {Na, ad, di, ia}.
Thus, Jaccard Index = |{Na, ad}| / |{Na, ad, da, al, di, ia}| = 2/6 ≈ 0.33
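The same computation as a short Python sketch, reusing the `shingles` helper defined above (so run the two snippets together):

```python
def jaccard(a, b):
    """Jaccard Index = |A ∩ B| / |A ∪ B| for two shingle sets."""
    return len(a & b) / len(a | b)

A = shingles("Nadal", k=2)  # {'Na', 'ad', 'da', 'al'}
B = shingles("Nadia", k=2)  # {'Na', 'ad', 'di', 'ia'}
print(jaccard(A, B))        # 0.333... (= 2/6)
```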
Note that a larger number of common shingles results in a larger Jaccard Index, and hence the documents are more likely to be similar.
Hashing
The idea of hashing is to convert each document to a small signature using a hashing function H. Suppose a document in our corpus is denoted by d. Then:
- H(d) is the signature and it’s small enough to fit in memory
- If similarity(d1,d2) is high then Probability(H(d1)==H(d2)) is high
- If similarity(d1,d2) is low then Probability(H(d1)==H(d2)) is low
Min Hashing
This is a somewhat technical but critical aspect of the algorithm.
Step 1: Pick a random permutation (π) of the row indices of the document–shingle matrix.
Step 2: For each column C, the hash value is the index of the first row (in the permuted order) in which C has value 1. Do this several times (with different permutations) to create the signature of a column. The key property is that, for a random permutation, the probability that two columns get the same MinHash value equals the Jaccard Index of their shingle sets, so the fraction of matching signature entries estimates the documents' similarity.
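A minimal sketch of min hashing in Python, assuming the `shingles` helper defined earlier (function names are ours). In practice random hash functions are used instead of explicit permutations for speed, but explicit permutations match the two steps above directly:

```python
import random

def minhash_signatures(shingle_sets, num_hashes=100, seed=42):
    """MinHash via random permutations: for each permutation, a document's
    hash is the smallest permuted row index among its shingles (i.e. the
    first row, in permuted order, where the document's column has a 1)."""
    vocab = sorted(set().union(*shingle_sets))   # all shingles = rows of the matrix
    rng = random.Random(seed)
    signatures = [[] for _ in shingle_sets]
    for _ in range(num_hashes):
        perm = list(range(len(vocab)))
        rng.shuffle(perm)                        # random permutation of row indices
        row = {s: perm[i] for i, s in enumerate(vocab)}
        for sig, doc in zip(signatures, shingle_sets):
            sig.append(min(row[s] for s in doc))
    return signatures

sigs = minhash_signatures([shingles("Nadal"), shingles("Nadia")], num_hashes=200)
match = sum(a == b for a, b in zip(*sigs)) / 200
print(match)  # close to 1/3, the true Jaccard Index of the two shingle sets
```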
Locality-Sensitive Hashing
The general idea of LSH is to find an algorithm such that, given the signatures of 2 documents, it tells us whether those 2 documents form a candidate pair, i.e. whether their similarity is greater than a threshold t. Note that we are taking the similarity of signatures as a proxy for the Jaccard Similarity between the original documents.
- Hash the columns of the signature matrix M using several hash functions
- If 2 documents hash into the same bucket for at least one of the hash functions, we take those 2 documents as a candidate pair
Band Partition
The algorithm is as follows (a code sketch appears after the list):
- Divide the signature matrix into b bands, each band having r rows
- For each band, hash its portion of each column to a hash table with k buckets
- Candidate column pairs are those that hash to the same bucket for at least one band
- Tune b and r to catch most similar pairs but few non-similar pairs
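A useful known fact for tuning: if two documents have signature similarity s, the probability that they become a candidate pair in at least one band is 1 − (1 − s^r)^b, which behaves like a step function around the threshold t ≈ (1/b)^(1/r). Below is a minimal, self-contained sketch of band partition in Python; the function name and the toy signatures are ours:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Band partition: split each signature into b bands of r rows, hash each
    band's portion of the column, and pair up columns that share a bucket."""
    assert all(len(sig) == b * r for sig in signatures)
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            key = tuple(sig[band * r:(band + 1) * r])  # this band's portion of the column
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))    # same bucket in at least 1 band
    return candidates

sig_a = [1, 5, 2, 9, 4, 4]   # toy 6-entry signatures: b=3 bands of r=2 rows
sig_b = [1, 5, 7, 9, 4, 4]
print(lsh_candidate_pairs([sig_a, sig_b], b=3, r=2))  # {(0, 1)}: first and third bands match
```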