Text Analysis (Part-5)

The last lecture, on Tuesday, completed the contributions from the Bayesians; the next lectures will be based on the contributions from the neural scientists, i.e. word embeddings. This blog consists of a discussion of one of the word embedding techniques, called Word2Vec.

What is Word Embedding?

Word embeddings are one of the most popular representations of document vocabulary. They are vector representations of a particular word, capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

Need for Word Embedding

The need for word embeddings can be understood from the following example sentences.

Sentence 1: Have a good day.

Sentence 2: Have a great day.

The two sentences hardly differ in meaning. If we construct a vocabulary from them, we get the set

V = {Have, a, good, great, day}

Now create a one-hot encoded vector for each of the words in V. The length of each one-hot encoded vector equals the size of V (= 5). Each vector is all zeros except for a one at the index representing the corresponding word in the vocabulary. Thus the encodings we have are

Have = [1,0,0,0,0]`

a = [0,1,0,0,0]`

good = [0,0,1,0,0]`

great = [0,0,0,1,0]`

day = [0,0,0,0,1]` 

(` represents transpose)
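As a quick illustration, here is a minimal Python sketch of how these one-hot vectors can be built (the vocabulary order used for the indices is an assumption):

import numpy as np

# Toy vocabulary from the two example sentences (the index order is an assumption).
vocab = ["Have", "a", "good", "great", "day"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("good"))   # [0. 0. 1. 0. 0.]
print(one_hot("great"))  # [0. 0. 0. 1. 0.]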

If we visualize these encodings, we can think of a 5-dimensional space, where each word occupies one dimension and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.

Thus the objective of word embedding is to have words with similar contexts occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. the angle should be close to 0.
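A small sketch of the cosine measure mentioned above (the dense vectors here are made-up placeholders, not learned embeddings):

import numpy as np

def cosine_similarity(u, v):
    """cos(angle) = (u . v) / (|u| |v|); values close to 1 mean similar direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot vectors: 'good' and 'great' are orthogonal, so their similarity is 0.
print(cosine_similarity(np.array([0, 0, 1, 0, 0]), np.array([0, 0, 0, 1, 0])))

# Hypothetical 3-dimensional embeddings: similar words point in similar directions.
good_emb = np.array([0.9, 0.1, 0.4])
great_emb = np.array([0.85, 0.15, 0.5])
print(cosine_similarity(good_emb, great_emb))  # close to 1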


Word2Vec

Word2Vec is one of the most popular techniques for learning word embeddings using a neural network. It was developed by Tomas Mikolov and his team at Google in 2013.

Word2Vec constructs such an embedding. It can be obtained using two methods (both involving neural networks):

1. Continuous Bag of Words (CBOW) 

2. Skip Gram

Continuous Bag of Words (CBOW)

This method takes the context of each word as the input and tries to predict the word corresponding to that context. Let us consider the previous example:

Have a great day

If the word great is given as input to the neural network, we try to predict the target word day using the single context word great. More specifically, we use the one-hot encoding of the input word and measure the output error against the one-hot encoding of the target word day. In the process of predicting the target word, we learn the vector representation of the target word.

A deeper look into the actual architecture of word2vec is shown below.


The input, i.e. the context word, is a one-hot encoded vector of size V. The hidden layer contains N neurons, and the output is again a V-length vector whose elements are the softmax values.

The terms in the figure are as follows:

  • W_vn is the weight matrix that maps the input x to the hidden layer (a V×N matrix)
  • W'_nv is the weight matrix that maps the hidden-layer outputs to the final output layer (an N×V matrix)
The hidden-layer neurons simply copy the weighted sum of the inputs to the next layer. There is no activation like sigmoid, tanh or ReLU. The only non-linearity is the softmax calculation in the output layer.
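A minimal numpy sketch of this single-context forward pass, assuming toy dimensions V = 5 and N = 3 (the names W_vn and W_prime_nv mirror the matrices described above; the random weights stand in for learned ones):

import numpy as np

V, N = 5, 3                           # vocabulary size, hidden layer size
rng = np.random.default_rng(0)
W_vn = rng.normal(size=(V, N))        # input -> hidden weights (V x N)
W_prime_nv = rng.normal(size=(N, V))  # hidden -> output weights (N x V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([0, 0, 0, 1, 0.0])       # one-hot vector for the context word 'great'

h = W_vn.T @ x                        # hidden layer: a copy of one row of W_vn, no activation
u = W_prime_nv.T @ h                  # raw scores for every word in the vocabulary
y = softmax(u)                        # probability of each word being the target ('day')
print(y.round(3), y.sum())            # the V probabilities sum to 1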

The above model uses a single context word to predict the target; multiple context words can be used to do the same.


The above model takes C context words. W_vn is still used to calculate the hidden-layer inputs, but we now take an average over all C projected context word inputs.
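Continuing the numpy sketch above (reusing W_vn, W_prime_nv and softmax from it), the only change for C context words is averaging their projections before the hidden layer:

# One-hot vectors for the C = 3 context words 'Have', 'a', 'day'.
contexts = [np.array([1, 0, 0, 0, 0.0]),
            np.array([0, 1, 0, 0, 0.0]),
            np.array([0, 0, 0, 0, 1.0])]

h = np.mean([W_vn.T @ x_c for x_c in contexts], axis=0)  # average of the C projections
y = softmax(W_prime_nv.T @ h)                            # distribution over the target word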

So far we have seen how word representations are generated using the context words. But there is one more way to do the same: we can use the target word (whose representation we want to generate) to predict the context, and in the process we produce the representations. The variant that does this is called Skip Gram.

Skip Gram

The Skip Gram architecture looks like the multiple-context CBOW model flipped around. To some extent that is true.

Here we input the target word into the network, and the model outputs C probability distributions, one for each context position. Each of these C distributions contains V probabilities, one for each word in the vocabulary.
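A rough numpy sketch of this forward pass under the same toy dimensions: the output weight matrix is shared across the C context positions, so a single softmax distribution is scored against each true context word.

import numpy as np

V, N = 5, 3
rng = np.random.default_rng(1)
W_vn = rng.normal(size=(V, N))        # input -> hidden (V x N)
W_prime_nv = rng.normal(size=(N, V))  # hidden -> output (N x V), shared by all C positions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target = np.array([0, 0, 0, 0, 1.0])  # one-hot for the input word 'day'
context_ids = [0, 1, 3]               # indices of the C = 3 context words 'Have', 'a', 'great'

h = W_vn.T @ target
y = softmax(W_prime_nv.T @ h)         # one distribution over the V words

# Training objective: sum of negative log-probabilities of the true context words.
loss = -sum(np.log(y[c]) for c in context_ids)
print(round(loss, 3))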

It is to be noted that in both cases, the network uses back-propagation to learn.
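In practice the back-propagation is rarely coded by hand; a library such as gensim trains both variants for you. A minimal usage sketch, assuming gensim 4.x (where the sg flag selects between CBOW and Skip Gram):

from gensim.models import Word2Vec

# A tiny toy corpus; real training needs far more text to give meaningful vectors.
sentences = [["have", "a", "good", "day"],
             ["have", "a", "great", "day"]]

# sg=0 -> CBOW, sg=1 -> Skip Gram; vector_size is N, window controls the context size C.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["great"])                     # the learned 10-dimensional embedding
print(model.wv.similarity("good", "great"))  # cosine similarity of the two vectors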
