Text Analysis (Part-7)

This is the last lecture on Text Analysis, and in it we discuss another word embedding technique known as GloVe.

GloVe

The basic idea behind the GloVe word embedding is to derive the relationships between words from global statistics. One of the simplest ways to do this is to look at the co-occurrence matrix. A co-occurrence matrix tells us how often a particular pair of words occurs together; each value in the matrix is the count of one pair of words appearing together.

Example: Consider a sentence in a corpus: “I play cricket, I love cricket and I love football”. The co-occurrence matrix for this corpus counts how often each pair of words appears next to each other; we can build it as shown below.
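A minimal sketch of building this co-occurrence matrix in Python, assuming a symmetric context window of size 1 and a lowercased, punctuation-free tokenization (both are our assumptions, not something fixed by GloVe itself):

from collections import Counter

tokens = "I play cricket I love cricket and I love football".lower().split()

window = 1  # assumed symmetric context window
cooc = Counter()
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            cooc[(w, tokens[j])] += 1

print(cooc[('play', 'cricket')])   # 1
print(cooc[('love', 'cricket')])   # 1
print(cooc[('love', 'football')])  # 1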

Now, we can easily compute conditional probabilities for pairs of words. For instance, let us focus on the word “cricket”:

p(cricket | play) = 1

p(cricket | love) = 0.5

Next, we can compute the ratio of probabilities:

p(cricket | play) / p(cricket | love) = 2

As the ratio is greater than 1, we can infer that “play” is more relevant to “cricket” than “love” is. Similarly, if the ratio were close to 1, we would conclude that both words are equally relevant to “cricket”.
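Continuing the sketch above, the conditional probabilities and their ratio can be read straight off the counts (the word frequencies come from the same token list):

word_count = Counter(tokens)

p_cricket_given_play = cooc[('play', 'cricket')] / word_count['play']   # 1 / 1 = 1.0
p_cricket_given_love = cooc[('love', 'cricket')] / word_count['love']   # 1 / 2 = 0.5

print(p_cricket_given_play / p_cricket_given_love)   # 2.0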

We are thus able to derive relationships between words using simple global statistics. This is the idea behind the GloVe pretrained word embedding.

GloVe learns to encode the information contained in these probability ratios in the form of word vectors. The most general form of the model is given below.
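In the notation of the original GloVe paper (Pennington et al., 2014), this most general form relates two word vectors and one context word vector to a ratio of co-occurrence probabilities:

F(w_i, w_j, w̃_k) = P_ik / P_jk

where w_i and w_j are word vectors, w̃_k is a context word vector, and P_ik = p(k | i) is the probability that word k appears in the context of word i. Training chooses the vectors so that this relation holds, which the paper reduces to a weighted least-squares fit on the logarithms of the co-occurrence counts.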

Implementing GloVe Pretrained Embedding

Load the Model

# load the whole GloVe embedding into memory
import numpy as np

embeddings_index = dict()
f = open('glove.6B.300d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Loaded %s word vectors.' % len(embeddings_index))

Output: Loaded 400000 word vectors.
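As a quick sanity check (a hedged sketch; 'cricket' is simply a word we assume is present in the GloVe vocabulary), each entry should be a 300-dimensional vector:

vector = embeddings_index.get('cricket')
print(vector.shape)   # (300,)
print(vector[:5])     # first five components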

Creating the Embedding Matrix

# create a weight matrix for the words in the training docs
embedding_matrix = np.zeros((size_of_vocabulary, 300))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
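The snippet above assumes a Keras Tokenizer that was fitted on the training documents in the earlier parts of this series; a minimal sketch of that setup (the variable name x_tr for the list of training texts is an assumption) would be:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_tr)                        # x_tr: list of training texts (assumed)
size_of_vocabulary = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding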

Defining the Architecture - Pretrained Embeddings

from keras.models import Sequential
from keras.layers import Embedding, LSTM, GlobalMaxPooling1D, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

model = Sequential()

# embedding layer, initialized with the GloVe weights and frozen
model.add(Embedding(size_of_vocabulary, 300, weights=[embedding_matrix], input_length=100, trainable=False))

# lstm layer
model.add(LSTM(128, return_sequences=True, dropout=0.2))

# global max pooling over the time dimension
model.add(GlobalMaxPooling1D())

# dense layers
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# add loss function, metrics, optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# adding callbacks
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', save_best_only=True, verbose=1)

# print the summary of the model
print(model.summary())


Output: the model summary shows that the number of trainable parameters is just 227,969. That is a huge drop, because the frozen GloVe embedding layer contributes only non-trainable parameters.
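We can verify this count by hand: the LSTM layer contributes 4 × ((300 + 128) × 128 + 128) = 219,648 parameters, the Dense(64) layer contributes 128 × 64 + 64 = 8,256, and the output layer contributes 64 × 1 + 1 = 65, giving 219,648 + 8,256 + 65 = 227,969 trainable parameters in total.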

Training the Model

history = model.fit(np.array(x_tr_seq), np.array(y_tr), batch_size=128, epochs=10, validation_data=(np.array(x_val_seq), np.array(y_val)), verbose=1, callbacks=[es, mc])

Evaluating the Performance of the Model

# loading best model

from keras.models import load_model
model = load_model('best_model.h5')

# evaluation: evaluate returns [loss, accuracy] because the model was compiled with metrics=["acc"]
_, val_acc = model.evaluate(np.array(x_val_seq), np.array(y_val), batch_size=128)
print(val_acc * 100)


Output: 88.49

Thus, the validation accuracy of the model comes out to be about 88%, which is pretty good.
