As we mentioned in the Word2Vec notebook, training your Embedding Matrix involves setting up some fake task for a Neural Network to optimize over.
Stanford’s GloVe Embedding model is very similar to the Word2Vec implementation, but with one crucial difference:
GloVe places greater importance on how frequently two words co-occur.
First, an enormous vocab_size x vocab_size matrix is constructed from a pass through your entire corpus, with one row and one column for every unique word.
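As a rough illustration (the function and variable names here are hypothetical, not from the original notebook), that co-occurrence pass might look something like this, storing only the non-zero entries of the vocab_size x vocab_size matrix:

```python
from collections import Counter

def build_cooccurrence(sentences, window=10):
    """Map (center_word, context_word) -> co-occurrence count."""
    counts = Counter()
    for tokens in sentences:
        for center_idx, center in enumerate(tokens):
            # Look at every word within +/- `window` positions of the center word
            lo = max(0, center_idx - window)
            hi = min(len(tokens), center_idx + window + 1)
            for context_idx in range(lo, hi):
                if context_idx != center_idx:
                    counts[(center, tokens[context_idx])] += 1
    return counts

# Tiny usage example on a toy, already-tokenized corpus
corpus = [["glove", "builds", "vectors", "from", "cooccurrence", "counts"]]
X_counts = build_cooccurrence(corpus, window=2)
```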
Then, to reduce dimensionality, we look for a factorization that minimizes the following objective:

$\sum_i \sum_j f(X_{ij})(\Theta^T_i e_j + b_i + b_j - \log(X_{ij}))^2$

where:
- $X_{ij}$ is the number of times word $i$ appears in the context of word $j$ (say, within a proximity of 10 words)
- $f(\cdot)$ is a weighting term that zeros out the contribution if the two words don't ever appear near each other
- $b_i$ and $b_j$ are bias terms at the word level
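To make the objective concrete, here is a hedged NumPy sketch of the loss. The parameter names are assumptions that mirror the formula (rows of `Theta` for center words, rows of `e` for context words, a bias vector `b` indexed at both $i$ and $j$), and the weighting $f(x) = \min((x / x_{max})^{\alpha}, 1)$ is the standard GloVe choice rather than anything specific to this notebook:

```python
import numpy as np

def glove_loss(X, Theta, e, b, x_max=100.0, alpha=0.75):
    """Weighted least-squares objective summed over non-zero co-occurrences."""
    loss = 0.0
    rows, cols = np.nonzero(X)  # f(.) zeros out pairs that never co-occur
    for i, j in zip(rows, cols):
        x_ij = X[i, j]
        f = min((x_ij / x_max) ** alpha, 1.0)
        inner = Theta[i] @ e[j] + b[i] + b[j] - np.log(x_ij)
        loss += f * inner ** 2
    return loss

# Shapes: X is (vocab_size, vocab_size); Theta and e are (vocab_size, dim); b is (vocab_size,)
vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(vocab_size, vocab_size)).astype(float)
loss = glove_loss(X,
                  rng.normal(size=(vocab_size, dim)),
                  rng.normal(size=(vocab_size, dim)),
                  np.zeros(vocab_size))
```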
Because the co-occurrence matrix construction is exhaustive, GloVe involves a considerable up-front computation cost. This calculation, however, does lend itself to some pretty straightforward parallelization, as sketched below.
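For example (again a hypothetical sketch, reusing the `build_cooccurrence` helper from above), the corpus can be split into chunks whose co-occurrence counts are accumulated independently in worker processes and then merged:

```python
from collections import Counter
from functools import partial
from multiprocessing import Pool

def parallel_cooccurrence(corpus_chunks, window=10, n_workers=4):
    """Sum per-chunk co-occurrence counts computed in parallel."""
    with Pool(n_workers) as pool:
        partials = pool.map(partial(build_cooccurrence, window=window),
                            corpus_chunks)
    # Counter.update adds counts, so this merges the partial matrices
    total = Counter()
    for counts in partials:
        total.update(counts)
    return total
```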