Text Analysis (Part-2)

The previous lecture / blog was an introduction to Text Analysis; from this blog onwards we dive deeper into its concepts. This blog covers some of the common methods of Text Analysis that were introduced in the previous lectures.

TERM FREQUENCY INVERSE DOCUMENT FREQUENCY - TFIDF

TFIDF is a very basic and simple technique of text analysis. It is an improvement over the binary vectorizer and the count vectorizer: the table values are the tfidf score of each word instead of the word count, which is why it is called the tfidf vectorizer.

Idea

TFIDF uses the term frequency and the document frequency to measure the importance of a word. When the term frequency increases, the intuition is that the word is important in characterising the document and that the document is centred around it.

On the other hand, as the document frequency increases, comparing two documents becomes harder; the intuition is that the word is too common and will not help in the analysis of the document.

Thus, a high term frequency and a low document frequency are desired in order to get a better text analysis result.

TFIDF Score

The TFIDF score is determined by the term frequency (tf) and the document frequency (df): the tf tends to increase the score, whereas the df tends to decrease it, as described by the idea above.

The score is given as below:

tfidf(w, d) = tf(w, d) × log(N / df_w)

where N = number of documents in the corpus and df_w = number of documents in which the word w appears.

When df_w = N, tfidf(w, d) = 0. In other words, if a word appears in every document its importance is pushed down to zero, no matter how high its term frequency is.

Since the document frequency enters the score inversely, through the log(N / df_w) factor, this part is referred to as the inverse document frequency (idf): the score is high for words that are frequent within a document but rare across the corpus, and low for words that are common everywhere.
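Below is a minimal Python sketch of this score on a made-up toy corpus. Library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalisation, so their exact numbers will differ slightly.

```python
import math
from collections import Counter

# Toy corpus; in practice the documents would already be tokenized,
# lower-cased and stop-word filtered.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the fish swam in the bowl".split(),
]

N = len(docs)                    # number of documents in the corpus
df = Counter()                   # document frequency of each word
for doc in docs:
    df.update(set(doc))          # count each word at most once per document

def tfidf(word, doc):
    """tfidf(w, d) = tf(w, d) * log(N / df_w), as defined above."""
    if df[word] == 0:
        return 0.0
    tf = doc.count(word)         # raw term frequency within this document
    return tf * math.log(N / df[word])

print(tfidf("the", docs[0]))     # 0.0  -- "the" appears in every document
print(tfidf("cat", docs[0]))     # ~1.10 -- "cat" appears in only one document
```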

LATENT SEMANTIC ANALYSIS - LSA

Before learning about Latent Semantic Analysis, some terminology and concepts such as Principal Component Analysis need to be covered, so the discussion of LSA will continue after these concepts.

Basic Terminology

1. Uni-grams:  Each term or word in a predefined vocabulary forms one dimension of the space; these single terms are called uni-grams.

Note: the columns may also be bi-grams, tri-grams or, in general, n-grams. A bi-gram is a pair of two contiguous words in the document; similarly, an n-gram is a sequence of n contiguous words.

Problems with these "grams": 

  • They increase the dimensionality exponentially 


  • They increase the sparsity, since the number of zero values in the structure grows exponentially, leading to an increase in space and time complexity

Despite these problems, n-grams are used in order to reduce the amount of information lost and thus preserve more of the meaning of the document.
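As a quick illustration of how the dimensionality grows with n, here is a small sketch using scikit-learn's CountVectorizer (the corpus is made up for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "text analysis is fun",
    "latent semantic analysis finds hidden topics",
    "term frequency inverse document frequency",
]

# Uni-grams only vs. uni-grams + bi-grams vs. up to tri-grams: the vocabulary
# (and hence the dimensionality of the document vectors) grows quickly with n.
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    X = vec.fit_transform(docs)
    print(ngram_range, "->", X.shape[1], "dimensions")
```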

2. Stemming:  Words sharing the same root can be collapsed to a common stem; for example Organise, Organises and Organising all share the stem Organis. Stemming involves the use of heuristics to collapse multiple words to the same stem.

Lemmatization is an alternative to stemming in which a dictionary look-up is performed; it also considers the grammatical constructs of the document, so it is generally preferred over stemming.

It is to be noted that stemming and lemmatization are part of data preprocessing and can be applied after tokenization and stop word removal.
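Below is a small sketch of both operations using NLTK's PorterStemmer and WordNetLemmatizer (assuming NLTK and its WordNet data are available); exact outputs depend on the stemmer and dictionary used.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # the lemmatizer needs the WordNet dictionary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Heuristic stemming collapses the inflected forms to a common stem.
for w in ["organise", "organises", "organising"]:
    print(w, "->", stemmer.stem(w))

# Lemmatization does a dictionary look-up and uses the part of speech, so it
# can return a real word where the stemmer only chops off suffixes.
print(stemmer.stem("was"), "vs", lemmatizer.lemmatize("was", pos="v"))  # e.g. 'wa' vs 'be'
```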

Principal Component Analysis - PCA

Principal Component Analysis is a statistical method aimed at finding a lower dimensional representation of our data, in such a way that the new dimensions can be ordered by the amount of variance they capture in the original data. The new coordinates it finds are also orthogonal to each other.

A simple example can explain the idea behind PCA: the figure below shows a two dimensional state space containing a number of data points. In case 1, where the origin is at O, high variance can be seen along both the x and y axes, represented by AB and BC respectively.


Now, in case 2, if we move the origin to O' and obtain two new dimensions PCA 1 and PCA 2 by rotating the x and y axes, we notice that most of the variance is captured along one axis only: there is very high variance along PCA 1 and very low variance along PCA 2.

Thus PCA helps in the reduction of dimensions; in the case above, two dimensions are reduced to one.

Generalising PCA

In a high dimensional space, say n-dimensional, PCA finds new coordinates in which the dimensions can be ordered by the decreasing variance associated with them. Mathematically:

Var(PCA 1) ≥ Var(PCA 2) ≥ … ≥ Var(PCA n)

Thus PCA gives us more compact dimensions, and each of these dimensions is essentially uncorrelated with the others.
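A minimal NumPy sketch of this procedure on made-up two dimensional data, following the centre-the-data, eigen-decompose-the-covariance, project recipe (libraries such as scikit-learn wrap the same steps in a PCA class):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data that mostly varies along one direction, as in the example above:
# y is roughly 2x plus a little noise.
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

# 1. Centre the data (move the origin to O').
centred = data - data.mean(axis=0)

# 2. Eigen-decompose the covariance matrix; the eigenvectors are the new axes
#    (PCA 1, PCA 2) and the eigenvalues are the variances along them.
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Order the axes by decreasing variance: Var(PCA 1) >= Var(PCA 2).
order = np.argsort(eigvals)[::-1]
print("variance along each new axis:", eigvals[order])

# 4. Project onto the first component only: 2 dimensions reduced to 1.
reduced = centred @ eigvecs[:, order[:1]]
print(reduced.shape)   # (200, 1)
```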

Now that the terminology and PCA have been discussed, we can move on to Latent Semantic Analysis, which will be studied through the Singular Value Decomposition.

Singular Value Decomposition - SVD

Singular Value Decomposition is widely used for text analysis. To understand the idea behind SVD, consider a matrix X whose n rows are the words of a vocabulary and whose m columns are the documents in the corpus. The values inside the matrix can be word counts, binary numbers representing the presence of a word, or tfidf scores.


The matrix is multiplied by its transpose to obtain the covariance of one document with respect to the other m − 1 documents (the m × m matrix XᵀX); eigen value decomposition of this matrix gives us the matrix V.


Thus the initial matrix X is decomposed into three matrices, namely U, a diagonal matrix of singular values Σ, and V, so that X = U Σ Vᵀ,

where U contains the eigenvectors of XXᵀ, i.e. the covariance of one word with the other words, and V contains the eigenvectors of XᵀX, i.e. the covariance of one document with the other documents.

The structure of the decomposition is as follows: U is n × r, Σ is r × r and Vᵀ is r × m, where r is the number of singular values. Multiplying the three matrices back together gives us the original n × m matrix X.
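A small NumPy sketch of this decomposition on a made-up 5 × 4 term-document matrix (the counts are arbitrary):

```python
import numpy as np

# Toy term-document matrix X: n = 5 words (rows), m = 4 documents (columns).
# The entries could equally be binary indicators or tfidf scores.
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
    [1, 0, 0, 1],
], dtype=float)

# SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)            # (5, 4) (4,) (4, 4)
print(np.allclose(X, U @ np.diag(s) @ Vt))   # True: the product gives back X

# Keeping only the top k singular values gives a lower dimensional
# approximation of X -- this truncation is the essence of LSA.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(X_k.shape)                             # still (5, 4), but rank 2
```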

Advantages of SVD

  • Only the first few singular values and their corresponding columns (the highlighted blocks in the figure) need to be kept, which means we can map the words and documents to a lower dimensional representation without losing much information (this comes from PCA)
  • These lower dimensional vectors can be placed in the state space and used, for example, to compute document similarity (see the sketch below)
  • SVD helps to capture the correlations between words that exist in the data
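As an illustration of the document similarity point above, here is a small sketch using scikit-learn's TruncatedSVD, a common way of applying LSA to a sparse tfidf matrix; the corpus, the number of components (2) and the use of TfidfVectorizer are choices made purely for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the doctor examined the patient in the clinic",
    "the physician treated the patient at the hospital",
    "the striker scored a goal in the football match",
]

# tfidf representation, then project the documents onto k latent "topics".
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
docs_lsa = lsa.fit_transform(X)              # shape: (3 documents, 2 topics)

# Cosine similarities between the documents in the reduced topic space.
print(cosine_similarity(docs_lsa))
```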

Important note on LSA

LSA, which is based on SVD, gives a lower dimensional representation of our documents and hypothesises hidden variables: the topics that may have been in the mind of the author while writing the document.

Thus Latent Semantic Analysis gives us the shift from a syntactic representation (what the document says) to a semantic representation (what the document means).

