GloVe Embedding

As we mentioned in the Word2Vec notebook, training your Embedding Matrix involves setting up some fake task for a Neural Network to optimize over.

Stanford’s GloVe Embedding model is very similar to the Word2Vec implementation, but with one crucial difference:

GloVe places a higher importance on frequency of co-occurrence between two words.

Training Notes

First, an enormous vocab_size x vocab_size co-occurrence matrix X is constructed in a single pass through your entire corpus, with one row and one column per unique word.
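As a rough illustration of that pass, here is a minimal sketch that counts co-occurrences for a toy sentence. The function name `cooccurrence_counts`, the `window` parameter, and the example sentence are made up for this note, and a sparse `Counter` stands in for the dense matrix. It uses raw counts; the original GloVe paper additionally down-weights pairs that are farther apart, which is left out here.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=10):
    """Count how often each pair of words appears within `window` tokens of each other."""
    counts = Counter()
    for pos, word in enumerate(tokens):
        start = max(0, pos - window)
        # Look back at the preceding `window` tokens only; counting each pair
        # in both directions keeps the resulting matrix symmetric.
        for context in tokens[start:pos]:
            counts[(word, context)] += 1
            counts[(context, word)] += 1
    return counts

# Toy corpus (an assumption for illustration, not real training data).
tokens = "the cat sat on the mat".split()
X = cooccurrence_counts(tokens, window=2)

# Pairs that never co-occur are simply absent, so a lookup returns 0.
print(X[("the", "sat")], X[("cat", "mat")])
```

Storing only the non-zero pairs is a design choice for the sketch: most entries of a real vocab_size x vocab_size matrix are zero, so a sparse structure is far cheaper than a dense array.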

Then, to reduce dimensionality, we look for a factorization that minimizes the following objective:

$\sum_i \sum_j f(X_{ij})(\Theta_i^T e_j + b_i + b_j - \log(X_{ij}))^2$


  • X_ij is the number of times word i appears in the context of word j (say, within 10 words of it)
  • f() is a weighting term that is zero when the two words never appear near each other, so the undefined log(0) terms drop out of the sum
  • Θ and e are the two sets of word vectors being learned
  • b_i and b_j are bias terms at the word level
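To make the objective concrete, here is a minimal NumPy sketch that evaluates it for a given co-occurrence matrix. The names `weight`, `glove_loss`, `b_theta`, and `b_e` are chosen for this note. The particular form of f(), with `x_max=100` and `alpha=0.75`, follows the values suggested in the GloVe paper, but treat it as an assumption here since the text above only says that f() zeros out for pairs that never co-occur.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): 0 for pairs that never co-occur, capped at 1 for frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, Theta, E, b_theta, b_e):
    """Weighted squared error between Theta_i . e_j + b_i + b_j and log(X_ij)."""
    # Swap zeros for ones before the log; f(0) = 0 removes those terms anyway.
    log_X = np.log(np.where(X > 0, X, 1.0))
    # pred[i, j] = Theta_i . e_j + b_i + b_j for every pair at once.
    pred = Theta @ E.T + b_theta[:, None] + b_e[None, :]
    return np.sum(weight(X) * (pred - log_X) ** 2)

# Toy setup (assumed sizes): 4 words, 3-dimensional embeddings.
rng = np.random.default_rng(0)
X = np.array([[0., 5., 1., 0.],
              [5., 0., 2., 1.],
              [1., 2., 0., 3.],
              [0., 1., 3., 0.]])
Theta = rng.normal(size=(4, 3))
E = rng.normal(size=(4, 3))
b_theta = np.zeros(4)
b_e = np.zeros(4)

print(glove_loss(X, Theta, E, b_theta, b_e))
```

In training, this scalar would be minimized with gradient descent over `Theta`, `E`, and the two bias vectors; the sketch only evaluates it.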


Because building the co-occurrence matrix requires an exhaustive pass over the corpus, GloVe involves a considerable up-front computation cost. This calculation, however, does lend itself to some pretty straightforward parallelization.
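One way to see why the counting parallelizes: counts from separate documents can be computed independently and then added together. The sketch below shows only that map-and-merge structure on two made-up documents and runs serially; the helper `count_pairs` is a name chosen for this note. In practice the list comprehension could be handed to something like `multiprocessing.Pool.map`, which is not done here.

```python
from collections import Counter
from functools import reduce

def count_pairs(tokens, window=10):
    """Co-occurrence counts for a single document (one shard of the corpus)."""
    counts = Counter()
    for pos, word in enumerate(tokens):
        for context in tokens[max(0, pos - window):pos]:
            counts[(word, context)] += 1
            counts[(context, word)] += 1
    return counts

# Hypothetical shards: each document is counted independently of the others.
documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

# "Map" step: the part that could run on separate workers.
partial_counts = [count_pairs(doc, window=2) for doc in documents]

# "Reduce" step: Counter addition sums the counts key by key.
X = reduce(lambda a, b: a + b, partial_counts, Counter())

print(X[("the", "sat")])
```

This works because no context window spans two documents, so the shards share no state until the final merge.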