Text Analysis (Part-1)

 

What is Text Analysis? 

Text analysis is the process of automatically classifying and extracting meaningful information from unstructured text. It involves detecting and interpreting trends and patterns to obtain relevant insights from data in just seconds.

Text Analysis - Transition from Structured Data to Unstructured Data

Text analysis is usually performed on text that sits inside a document in an unstructured format, unlike the structured data used in the supervised and unsupervised learning discussed earlier. Thus the first task of text analysis turns out to be the conversion of this unstructured data into structured data, i.e. representing the data in tabular form; this process is known as Text Vectorisation (discussed later in the post).

Understanding the Document

As text analysis involves extracting information from documents, it becomes important to understand the basic characteristics of the documents that are fed to the text analyser. Some of them are mentioned below:

  • Length: Different documents have different lengths, and this factor has to be considered while extracting information, because the length of the documents on which the analyser is trained may differ from that of the documents it will encounter in the future.
  • Language: A document is a sequence of words from a selected vocabulary, and it is important to know the language in which the document is written.
  • Author of the Document: This is a very important aspect to consider while training a text analysis model, because each individual has a different vocabulary and writing style. Thus, when comparing the available documents, a machine should not merely spot the words used; the focus should be on what the document conveys. This forces text analysis to be a study of the "semantics" of the document rather than of the "concrete meaning" of each word.

The Three Schools

Over the course of its evolution, Text Analysis has seen contributions from "Three Major Schools", which are named and explained as follows:
  • The Information Retrieval Scientists: The information retrieval scientists are the ones involved in building search engines, helping us find what we need from the huge mass of content on the internet. Given their domain, their focus in text analysis can be quoted as
"Finding a needle in the haystack when given query by the user"
  • The Bayesians: The Bayesians, or the statisticians, were interested in understanding the statistical properties of documents.
  • The Neural Scientists: The people working on neural networks and deep learning wanted to get away from the symbols, i.e. the words, and for this came up with a numeric, continuous representation of those symbols.

Sub Disciplines of Text Analysis 

There are numerous fields and systems where text analysis is required and currently used; some of them are mentioned below:
  • Dialogue systems
  • Machine Translation
  • Text to speech 
  • Question answering systems (like Siri, Google Assistant, etc.)
  • Spell Correction System

Basic Terminology 

There is some basic terminology that should be known before we learn more about Text Analysis:

1. Stop Words: Stop words are words that do not really add to the meaning of the document at the statistical level; they appear as grammatical constructs of the language. For example: the, that, when, whom, who, etc.

2. Tokenisation: When we represent each word in a document as a separate entity, the process is called Tokenisation, and each such word is said to be a token.

3. Term Frequency: The number of times a word appears in a document is its Term Frequency; term frequency is document specific.

4. Document Frequency: The number of documents in which a word appears is its Document Frequency; document frequency is corpus specific.
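
As a small illustration of these terms, here is a minimal sketch in plain Python on a made-up two-document corpus (the stop word list is illustrative): it tokenises each document, removes stop words, and computes term and document frequencies.

```python
from collections import Counter

STOP_WORDS = {"the", "that", "when", "whom", "who", "on"}

corpus = [
    "the cat sat on the mat",    # document 0
    "the dog chased the cat",    # document 1
]

# Tokenisation: split each document into individual tokens (words)
tokenised = [doc.lower().split() for doc in corpus]

# Stop word removal
filtered = [[tok for tok in doc if tok not in STOP_WORDS] for doc in tokenised]

# Term Frequency: counts of a word inside ONE document (document specific)
term_freq = [Counter(doc) for doc in filtered]
print(term_freq[0])        # Counter({'cat': 1, 'sat': 1, 'mat': 1})

# Document Frequency: number of documents that contain a word (corpus specific)
doc_freq = Counter(word for doc in filtered for word in set(doc))
print(doc_freq["cat"])     # 2 -> "cat" appears in both documents
```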

Different Approaches for Text Analysis

So far we have seen "The Three Schools" that have contributed to Text Analysis, and now is the right time to introduce the methods established by these schools. The methods are discussed in depth in further posts.

1. TFIDF

Term Frequency Inverse Document Frequency, i.e. TFIDF, is an approach for identifying how important a word appearing in a document is, as compared to the other words appearing in other documents.

Thus TFIDF is a method that plays with the occurrences of different words; it can be said to capture the syntactic relations between documents rather than the semantic relations, and it therefore fails when comparing two documents that depict the same idea but whose vocabularies differ drastically.
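
As a hedged sketch (assuming scikit-learn is installed, with a made-up three-document corpus), TFIDF weights can be computed as follows; words that are frequent in one document but rare across the rest of the corpus get the highest scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
]

vectoriser = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectoriser.fit_transform(corpus)   # shape: (3 documents, |V| words)

print(vectoriser.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))            # one row of TFIDF weights per document
```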

2. LSA

LSA, which stands for Latent Semantic Analysis, is the approach that moves from a syntactic representation of language to a semantic representation of language, thus understanding the meaning of the document and making text analysis more effective.
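
A minimal sketch of LSA, again assuming scikit-learn: TFIDF vectors of the same toy corpus are compressed by a truncated SVD into a small number of latent "concepts" (the number of concepts here is purely illustrative).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
]

# TFIDF gives the word-occurrence view; the SVD compresses it into latent concepts
lsa = make_pipeline(TfidfVectorizer(stop_words="english"),
                    TruncatedSVD(n_components=2))
doc_concepts = lsa.fit_transform(corpus)

# Each row is a document in the 2-dimensional concept space, so documents about
# the same idea land close together even when their words differ.
print(doc_concepts.round(2))
```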

Note: Both TFIDF and LSA were methods designed by the information retrieval school.

3. LDA

LDA stands for Latent Dirichlet Allocation and was designed by the Bayesians; being statisticians, they went for a probabilistic approach, and like LSA, LDA also focuses on the idea depicted in the document rather than on the words, as TFIDF does. LDA falls into a very popular category of text analysis models called Topic Models.
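
A hedged sketch of a topic model with scikit-learn's LatentDirichletAllocation on a made-up corpus; the number of topics and the corpus itself are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
    "the market rallied as stock prices rose",
]

# LDA works on raw word counts rather than TFIDF weights
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row sums to 1: the probability that the document belongs to each topic
doc_topics = lda.fit_transform(counts)
print(doc_topics.round(2))
```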

It should also be noted that the Bayesians have been involved in other popular models such as Probabilistic Latent Semantic Analysis (pLSA), Hidden Markov Models (HMM) and Conditional Random Fields (CRF).

4. Word Embedding

Word Embedding is one of the latest approaches used today, developed by the neural scientists at Google, Stanford and Facebook. The most famous word embedding models or methods include Word2Vec, GloVe (Global Vectors) and fastText.

The idea used by the neural scientists is to create a high-dimensional space of around 100 to 300 dimensions and map each word into that space, thus obtaining a vector for each word, in such a manner that similar words are close to each other and words with no relation are far apart.

Besides adding semantic analysis of documents, word embeddings made it possible to do "mathematics" on symbols (words), something no one had ever thought of. Let us consider a beautiful example to see how mathematics was included in word embeddings.


Maths at work!
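
A hedged illustration of this word arithmetic is the classic "king - man + woman ≈ queen" relation. The sketch below assumes gensim and one of its downloadable pretrained GloVe models; any pretrained word-vector file would do.

```python
import gensim.downloader as api

# Load a small set of pretrained word vectors (~50 dimensions per word)
vectors = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") lands near vector("queen"),
# i.e. the embedding space has learned the gender relation as a direction.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```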

In short, word embeddings help capture the idea of a document and give a mathematical representation of words as well.

Text Vectorisation

Very soon we will get deep into Text Analysis concepts and study the above-mentioned approaches in detail. Before that, we must note that there is one technique, called Text Vectorisation, which is used in almost all the approaches mentioned above.

Thus it becomes one of the initial steps in building any text analyser, the other being stop word removal (i.e. vectorisation is done after stop word removal).

As discussed earlier, a document is unstructured data and thus needs to be structured before we use it, and a text vectoriser is what helps us do this. We can formally define it as:

Text Vectorisation is the process of translating the corpus into a tabular form in which every document of the corpus represents a row and each word in the corpus represents a column.

Thus each row can be interpreted as a vector in |V| dimensions, where |V| is the size of the vocabulary used in the corpus.

Why Vectorisation?

The method is called vectorisation because we take each document into a high-dimensional space, and the point in that n-dimensional space that represents the document is called a vector.

The similarity between two documents depends on the angle between their vectors, which makes this representation useful when comparing two documents.
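
A minimal sketch of this comparison, using made-up count vectors for two documents and the cosine of the angle between them:

```python
import numpy as np

doc_a = np.array([2, 1, 0, 1])   # counts of the vocabulary words in document A
doc_b = np.array([1, 1, 0, 0])   # counts of the same words in document B

# Cosine similarity: 1.0 means identical direction, 0.0 means nothing in common
cosine = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(round(float(cosine), 3))   # ~0.866 for these toy vectors
```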

Approaches for Vectorisation

Before using any approach to vectorise, we need to do the following:

  • Tokenization of the documents
  • Removing the punctuation 
  • Stop word removal

After doing these operations we get something called a Bag of Words, in which there are no grammatical constraints and the order is not important; it is treated simply as a collection of words in a bag.

Now we have the bag of words from the corpus, which will help us in structuring (tabulating) the data, and this can be done in either of the two ways given below:

1. Binary Vectoriser

A binary vectoriser is one which only marks the presence or absence of a given word in a particular document. If a word is present in a document then it is marked as 1, otherwise it is marked as 0.

2. Count Vectoriser

As the name specifies, this vectoriser actually tells how many times a particular word has appeared in a particular document, i.e. we keep the count of each word in each document, thus getting more information about the corpus.

The following example should create a clear picture of what is explained above.
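
This is a hedged sketch using scikit-learn's CountVectorizer on a two-document toy corpus: with binary=True it behaves as a binary vectoriser, and with the defaults it behaves as a count vectoriser.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the cat mat",
    "the dog chased the cat",
]

binary_vec = CountVectorizer(binary=True)
count_vec = CountVectorizer()

print(binary_vec.get_feature_names_out())            # the columns (vocabulary)
print(binary_vec.fit_transform(corpus).toarray())    # 1 = word present, 0 = absent
print(count_vec.fit_transform(corpus).toarray())     # actual number of occurrences
```

Note how "cat" is marked 1 in both rows of the binary table, while the count table records 2 for the first document, which is the extra information a count vectoriser retains.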

