Text Analysis (Part-6)
In the previous lectures we covered various important text analysis techniques. This lecture is not about a technique to perform text analysis, but about a technique to speed up text analysis itself, known as Locality Sensitive Hashing (LSH).
Locality Sensitive Hashing
Why LSH?
The task of finding nearest neighbours, such as finding similar documents, is very common. Using brute force to check all possible pairs gives the exact nearest neighbour, but it is not practical for large data because it is too time-consuming. Approximate algorithms for this task have been an area of active research; although these algorithms don't guarantee the exact answer, they provide a good approximation while being faster and more scalable. Locality sensitive hashing (LSH) is one such algorithm, and it has various applications.
Basic Idea
LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity.
The goal of LSH is to find near-duplicate pairs of documents in a given large collection of documents. For this, the LSH algorithm is divided into 3 steps:
- Shingling
- Min Hashing
- Locality-sensitive hashing
The basic idea is to chain these three steps: shingle the documents, compress the shingle sets into short signatures with min hashing, and then band the signatures to find candidate pairs.
Shingling
In this step, we convert each document into a set of substrings of length k (also known as k-shingles or k-grams). The key idea is to represent each document in our collection as a set of k-shingles.
Example: Consider a document D = “Nadal”. If we're interested in 2-shingles, our set will be {Na, ad, da, al}; similarly, the set of 3-shingles will be {Nad, ada, dal}. A minimal sketch in Python is shown below.
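As a concrete illustration, here is a small sketch of character-level shingling in Python; the helper name `shingles` is our own choice, not from the lecture:

```python
def shingles(text, k=2):
    """Return the set of k-shingles (contiguous length-k substrings) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("Nadal", k=2))  # {'Na', 'ad', 'da', 'al'} (set order may vary)
print(shingles("Nadal", k=3))  # {'Nad', 'ada', 'dal'}
```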
Certain facts regarding the shingles:
- Similar documents are more likely to share more shingles
- Reordering paragraphs in a document or changing a few words doesn't have much effect on its shingles
- A k value of 8–10 is generally used in practice. A small value results in shingles that appear in most documents (and are therefore bad for differentiating documents)
Jaccard Index
We have a representation of each document in the form of shingles. Now we need a metric to measure similarity between documents. The Jaccard Index is one metric we can use. The Jaccard Index between any two documents A and B is defined as:

Jaccard Index(A, B) = |A ∩ B| / |A ∪ B|
It's also known as intersection over union (IOU). Suppose A: “Nadal” and B: “Nadia”; then their 2-shingle representations will be:
A: {Na, ad, da, al} and B: {Na, ad, di, ia}.
Thus, Jaccard Index = |{Na, ad}| / |{Na, ad, da, al, di, ia}| = 2/6 ≈ 0.33
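The same computation as a short Python sketch, reusing the `shingles` helper defined above (so run the two snippets together):

```python
def jaccard(a, b):
    """Jaccard Index = |A ∩ B| / |A ∪ B| for two shingle sets."""
    return len(a & b) / len(a | b)

A = shingles("Nadal", k=2)  # {'Na', 'ad', 'da', 'al'}
B = shingles("Nadia", k=2)  # {'Na', 'ad', 'di', 'ia'}
print(jaccard(A, B))        # 0.333... (= 2/6)
```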
Note that a larger number of common shingles results in a larger Jaccard Index, and hence the documents are more likely to be similar.
Hashing
The idea of hashing is to convert each document to a small signature using a hashing function H. Suppose a document in our corpus is denoted by d. Then:
- H(d) is the signature and it’s small enough to fit in memory
- If similarity(d1,d2) is high then Probability(H(d1)==H(d2)) is high
- If similarity(d1,d2) is low then Probability(H(d1)==H(d2)) is low
Min Hashing
This is a somewhat technical but critical aspect of the algorithm.
Step 1: Pick a random permutation (π) of the row indices of the document–shingle matrix.
Step 2: For each column C, the hash value is the index of the first row (in the permuted order) in which C has value 1. Do this several times (with different permutations) to create the signature of a column. The key property is that, for a random permutation, the probability that two columns get the same MinHash value equals the Jaccard Index of their shingle sets, so the fraction of matching signature entries estimates the documents' similarity.
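A minimal sketch of min hashing in Python, assuming the `shingles` helper defined earlier (function names are ours). In practice random hash functions are used instead of explicit permutations for speed, but explicit permutations match the two steps above directly:

```python
import random

def minhash_signatures(shingle_sets, num_hashes=100, seed=42):
    """MinHash via random permutations: for each permutation, a document's
    hash is the smallest permuted row index among its shingles (i.e. the
    first row, in permuted order, where the document's column has a 1)."""
    vocab = sorted(set().union(*shingle_sets))   # all shingles = rows of the matrix
    rng = random.Random(seed)
    signatures = [[] for _ in shingle_sets]
    for _ in range(num_hashes):
        perm = list(range(len(vocab)))
        rng.shuffle(perm)                        # random permutation of row indices
        row = {s: perm[i] for i, s in enumerate(vocab)}
        for sig, doc in zip(signatures, shingle_sets):
            sig.append(min(row[s] for s in doc))
    return signatures

sigs = minhash_signatures([shingles("Nadal"), shingles("Nadia")], num_hashes=200)
match = sum(a == b for a, b in zip(*sigs)) / 200
print(match)  # close to 1/3, the true Jaccard Index of the two shingle sets
```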
Locality-Sensitive Hashing
The general idea of LSH is to find an algorithm such that, given the signatures of 2 documents, it tells us whether those 2 documents form a candidate pair, i.e. whether their similarity is greater than a threshold t. Note that we are taking the similarity of signatures as a proxy for the Jaccard Similarity between the original documents.
- Hash the columns of the signature matrix M using several hash functions
- If 2 documents hash into the same bucket for at least one of the hash functions, we take those 2 documents as a candidate pair
Band Partition
The algorithm is as follows (a code sketch appears after the list):
- Divide the signature matrix into b bands, each band having r rows
- For each band, hash its portion of each column to a hash table with k buckets
- Candidate column pairs are those that hash to the same bucket for at least one band
- Tune b and r to catch most similar pairs but few non-similar pairs
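A useful known fact for tuning: if two documents have signature similarity s, the probability that they become a candidate pair in at least one band is 1 − (1 − s^r)^b, which behaves like a step function around the threshold t ≈ (1/b)^(1/r). Below is a minimal, self-contained sketch of band partition in Python; the function name and the toy signatures are ours:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Band partition: split each signature into b bands of r rows, hash each
    band's portion of the column, and pair up columns that share a bucket."""
    assert all(len(sig) == b * r for sig in signatures)
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            key = tuple(sig[band * r:(band + 1) * r])  # this band's portion of the column
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))    # same bucket in at least 1 band
    return candidates

sig_a = [1, 5, 2, 9, 4, 4]   # toy 6-entry signatures: b=3 bands of r=2 rows
sig_b = [1, 5, 7, 9, 4, 4]
print(lsh_candidate_pairs([sig_a, sig_b], b=3, r=2))  # {(0, 1)}: first and third bands match
```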