Word2Vec

As mentioned in the Word Embeddings notebook, there are many ways to train a Neural Network to produce a Word Embedding matrix for a given vocabulary.

One of the more popular approaches is Word2Vec, and TensorFlow ships a well-known implementation of it. This notebook should provide a high-level intuition of this training approach.

Fake Task

The key takeaway for understanding how we fit an embedding layer is that we set our data up to solve an arbitrary problem when iterating over a corpus of text. For instance:

Given a random context word, c, how likely is it that we’ll see a target word, t, within N spaces from the context?

Visually, this looks like the following:

from IPython.display import Image

Image('images/word2vec_task.png')

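To make the setup concrete, here's a rough sketch of how those (context, target) pairs might be pulled from a sentence. The helper name, the window size N=2, and the random choice of target are illustrative assumptions, not the actual TensorFlow implementation.

```python
import random

def make_context_target_pairs(tokens, window=2):
    """Pair each context word with one target drawn from +/- `window` positions."""
    pairs = []
    for i, context in enumerate(tokens):
        # candidate targets are the words within `window` spaces of the context
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        candidates = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, random.choice(candidates)))
    return pairs

sentence = 'the quick brown fox jumped over the lazy dog'.split()
make_context_target_pairs(sentence, window=2)[:5]
```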

Because this feeds into a softmax classifier, the outputs are calculated as

$Pr(t \vert c) = \frac{e^{\Theta_t^T e_c}}{\sum_j e^{\Theta_j^T e_c}}$

This means that we’re learning both the Embedding Matrix, E, and some throw-away matrix, Theta, that we won’t use after training.

However, this gets tricky when considering compute costs. Even with a 10,000-word vocabulary (considered pretty small for text data), we're doing a ton of summing in that denominator term at every step, then again when calculating the cost function

$\mathcal{L}(\hat{y}, y) = - \sum_i y_i \log(\hat{y}_i)$
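To see where that cost comes from, here's a minimal numpy sketch of the forward pass. The 10,000-word vocabulary, 300-dimensional embeddings, and word indices are made-up values for illustration; the point is that the denominator forces a dot product against every row of Theta on every step.

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300   # assumed sizes, for illustration only

E = np.random.randn(vocab_size, embed_dim) * 0.01       # embedding matrix we actually want
Theta = np.random.randn(vocab_size, embed_dim) * 0.01   # throw-away softmax weights

context_idx, target_idx = 42, 7   # arbitrary word indices

e_c = E[context_idx]                              # embedding for the context word
logits = Theta @ e_c                              # one score per vocabulary word
probs = np.exp(logits) / np.exp(logits).sum()     # softmax: sums over all 10,000 words

# cross-entropy: y is one-hot at the target, so only one term survives
loss = -np.log(probs[target_idx])
```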

Not ideal. Instead we might try…

Negative Sampling

Same idea. We want to build a fake task with the ultimate goal of extracting an Embedding Matrix, E.

Except this time, we’re going to be more thoughtful about the way we sample our training data. Consider a sentence

The quick brown fox jumped over the lazy dog
  • We’d start, as before, by picking a context word: fox
  • Then we pick a random word within N spaces from it. We’ll say dog.
  • Now (this is where it’s different), we’ll generate k random words from our vocabulary that aren’t within N of fox

    apple, door, nebraska, eerie, his

  • Finally we construct tuples of the form

    (fox, dog, 1)
    (fox, apple, 0)
    (fox, door, 0)
    (fox, nebraska, 0)
    (fox, eerie, 0)
    (fox, his, 0)

to optimize the equation

$Pr(y=1 \vert c, t) = \sigma(\Theta_t^T e_c)$

Or “given this context word and this target word, how likely is it that they’re within proximity of one another?”
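As a rough sketch of that sampling step and the sigmoid scoring: the tiny vocabulary, k=5, and the uniform draw of negatives below are illustrative assumptions (real implementations typically sample negatives from a smoothed unigram distribution rather than checking the window).

```python
import random
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_samples(context, target, vocab, k=5):
    """One positive (context, target, 1) tuple plus k negative (context, noise, 0) tuples."""
    # toy version: only exclude the context and target themselves
    negatives = random.sample([w for w in vocab if w not in (context, target)], k)
    return [(context, target, 1)] + [(context, w, 0) for w in negatives]

vocab = sorted(set('the quick brown fox jumped over lazy dog apple door nebraska eerie his'.split()))
examples = negative_samples('fox', 'dog', vocab, k=5)

# each tuple is scored with sigma(theta_t . e_c) rather than a full softmax
embed_dim = 50
E = {w: np.random.randn(embed_dim) * 0.01 for w in vocab}       # embedding vectors
Theta = {w: np.random.randn(embed_dim) * 0.01 for w in vocab}   # output vectors

for context, target, label in examples:
    p = sigmoid(Theta[target] @ E[context])   # Pr(y=1 | c, t)
    print(f'({context}, {target}, {label}) -> {p:.3f}')
```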

So instead of training on 10k values in our softmax layer each step, we’re only training k+1 binary classification tasks at a time.