Recurrent Neural Network Basics
Recurrent Neural Networks are designed to learn information from sequential data.
We start with datasets whose points are sequences of time steps, for example:
- words in a sentence
- sequential stock ticks
- days of weather in a row
Thus, we say that there are T_x elements in a given data point.
In the most basic case, we have a handoff of information, a_t, from step t-1 to step t. At each step this gets run through a cell, which outputs:
- y_t: the prediction at the t-th step
- a_t: the carry-forward information passed back into the same cell at the next step
Visually:
Each cell calculates its output from both the current element and the output carried forward from the previous element.
from IPython.display import Image
Image('images/base_rnn.png')
Specifically, at the cell level this happens through pairs of weight matrices and bias terms.
- W_aa, b_a: to this activation from the last activation
- W_ax, b_x: to this activation from the input data
- W_ya, b_y (not pictured): to this output from this activation
Image('images/rnn_cell.PNG')
The values for W_aa, W_ya, and W_ax are literally the same objects regardless of which element, t, you're looking at; the same weights are shared across every time step.
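To make this concrete, here's a minimal numpy sketch of one forward step of the cell, assuming a tanh hidden activation and a sigmoid output (the shapes and variable values are illustrative, not taken from the figures):

import numpy as np

def rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    # carry-forward activation: combines the last activation and the current input
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # prediction at this step (sigmoid here; use softmax for multi-class outputs)
    y_t = 1 / (1 + np.exp(-(W_ya @ a_t + b_y)))
    return a_t, y_t

# illustrative shapes: n_a hidden units, n_x input features
n_a, n_x = 5, 3
rng = np.random.default_rng(0)
W_aa, W_ax = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(1, n_a))
b_a, b_y = np.zeros((n_a, 1)), np.zeros((1, 1))

# the same W_aa, W_ax, W_ya, b_a, b_y would be reused at every time step t
a_t, y_t = rnn_cell_forward(rng.normal(size=(n_x, 1)), np.zeros((n_a, 1)),
                            W_aa, W_ax, W_ya, b_a, b_y)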
Cost Function
The output at each time step has its own loss function that looks like our typical logistic/softmax cost:
$\mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) = -y^{\langle t \rangle} \log \hat{y}^{\langle t \rangle} - (1-y^{\langle t \rangle}) \log (1-\hat{y}^{\langle t \rangle})$
The key difference, however, is that there's an overarching cost $\mathcal{L}$ that sums these per-step losses across all time steps:
$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_x} \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle})$
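As a quick sketch of that sum, assuming we already have arrays of per-step predictions y_hat and true labels y (binary targets, one per time step; the values below are made up):

def total_loss(y_hat, y):
    # sum the per-step cross-entropy losses over all T_x time steps
    per_step = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return per_step.sum()

y_hat = np.array([0.9, 0.2, 0.7])  # illustrative predictions at each step
y = np.array([1.0, 0.0, 1.0])      # illustrative true labels at each step
total_loss(y_hat, y)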
Image('images/rnn_backprop.png')
This gives us a great deal of flexibility in how we construct our recurrent networks.
Different Architectures
Image('images/rnn_types.png')
- one-to-one: vanilla MLP
- one-to-many: text generation
- many-to-one: gender prediction based on an audio clip (sketched in code below)
- many-to-many (1): language translation
- many-to-many (2): video classification
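For instance, a many-to-one model can run the same cell over every element of the sequence and keep only the final prediction. A rough sketch, reusing the rnn_cell_forward function defined above:

def rnn_many_to_one(x_seq, a_0, W_aa, W_ax, W_ya, b_a, b_y):
    # x_seq: non-empty list of T_x input column vectors; only the last prediction is kept
    a_t = a_0
    for x_t in x_seq:
        a_t, y_t = rnn_cell_forward(x_t, a_t, W_aa, W_ax, W_ya, b_a, b_y)
    return y_t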