GloVe Embedding
As we mentioned in the Word2Vec notebook, training your Embedding Matrix involves setting up some fake task for a Neural Network to optimize over.
Stanford’s GloVe Embedding model is very similar to the Word2Vec implementation, but with one crucial difference:
GloVe places a higher importance on the frequency of co-occurrence between two words.
Training Notes
First, an enormous vocab_size x vocab_size co-occurrence matrix is constructed by making a pass through your entire corpus, collecting every unique word and counting how often each pair of words appears together.
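As a rough sketch of what that construction might look like (the window size, helper name, and toy corpus below are illustrative assumptions, not the Stanford implementation):

```python
from collections import defaultdict

# A rough sketch of building a symmetric co-occurrence matrix from a tokenized
# corpus with a fixed context window (stored sparsely as a dict of counts).
def build_cooccurrence(sentences, window_size=10):
    # One pass to collect every unique word and assign it a row/column index.
    vocab = {word: idx for idx, word in
             enumerate(sorted({w for sent in sentences for w in sent}))}
    counts = defaultdict(float)  # (i, j) -> number of co-occurrences

    for sent in sentences:
        ids = [vocab[w] for w in sent]
        for pos, i in enumerate(ids):
            lo = max(0, pos - window_size)
            hi = min(len(ids), pos + window_size + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    counts[(i, ids[ctx])] += 1.0
    return vocab, counts

# Toy usage: two tiny "sentences", window of 2 words on either side.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
vocab, counts = build_cooccurrence(corpus, window_size=2)
print(counts[(vocab["cat"], vocab["sat"])])  # 1.0
```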
Then, to reduce dimensionality, we look for a factorization that minimizes the following
$\sum_i \sum_j f(X_{ij})\left(\theta_i^T e_j + b_i + b_j - \log(X_{ij})\right)^2$
Where:

- $X_{ij}$ is the number of times word $i$ appears in the context of word $j$ (say, within a proximity of 10 words)
- $f()$ is a weighting term that zeros out the term if the two words don't ever appear near each other
- $b_i$ and $b_j$ are bias terms at the word level
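As a hedged sketch, here is what minimizing that objective with plain gradient descent might look like in NumPy. The weighting $f(x) = \min((x / x_{max})^{0.75}, 1)$ follows the original GloVe paper; every hyperparameter below is an illustrative assumption rather than a tuned value.

```python
import numpy as np

# A minimal sketch of minimizing the objective above with full-batch gradient
# descent. theta and e correspond to theta_i and e_j in the formula; b and
# b_ctx are the word-level biases.
def train_glove(counts, vocab_size, dim=50, x_max=100.0, alpha=0.75,
                lr=0.05, epochs=25, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.1, size=(vocab_size, dim))
    e = rng.normal(scale=0.1, size=(vocab_size, dim))
    b = np.zeros(vocab_size)
    b_ctx = np.zeros(vocab_size)

    # Only pairs with X_ij > 0 contribute; f() zeros out everything else.
    pairs = [(i, j, x) for (i, j), x in counts.items() if x > 0]

    for _ in range(epochs):
        for i, j, x in pairs:
            weight = min((x / x_max) ** alpha, 1.0)               # f(X_ij)
            diff = theta[i] @ e[j] + b[i] + b_ctx[j] - np.log(x)  # model error
            grad = weight * diff
            d_theta = grad * e[j]
            d_e = grad * theta[i]
            theta[i] -= lr * d_theta
            e[j] -= lr * d_e
            b[i] -= lr * grad
            b_ctx[j] -= lr * grad
    return theta, e
```

After training, a common choice is to take the sum `theta + e` as the final embedding for each word, since both matrices end up encoding similar information.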
Runtime
Because the co-occurrence matrix construction is exhaustive, GloVe involves a considerable up-front computation cost. This calculation, however, lends itself to some pretty straightforward parallelization.
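One way that parallelization might look, assuming a per-chunk counting helper along the lines of the co-occurrence sketch above: split the sentences across worker processes, count co-occurrences independently, and sum the partial counts.

```python
from collections import Counter
from multiprocessing import Pool

# A hedged sketch of parallel co-occurrence counting: chunks of the corpus are
# counted in separate processes and merged. Assumes the vocab was built
# beforehand (e.g. with build_cooccurrence-style logic).
def count_chunk(args):
    sentences, vocab, window_size = args
    counts = Counter()
    for sent in sentences:
        ids = [vocab[w] for w in sent]
        for pos, i in enumerate(ids):
            lo = max(0, pos - window_size)
            hi = min(len(ids), pos + window_size + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    counts[(i, ids[ctx])] += 1
    return counts

def parallel_cooccurrence(sentences, vocab, window_size=10, workers=4):
    chunk_size = max(1, len(sentences) // workers)
    chunks = [sentences[k:k + chunk_size]
              for k in range(0, len(sentences), chunk_size)]
    with Pool(workers) as pool:
        partials = pool.map(count_chunk,
                            [(chunk, vocab, window_size) for chunk in chunks])
    # Chunks are disjoint, so the per-chunk counts simply add together.
    total = Counter()
    for part in partials:
        total.update(part)
    return total
```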