Recurrent Neural Network Basics

Recurrent Neural Networks are designed to learn from sequential data.

We start with datasets where each point is a sequence of time steps, for example:

  • the words in a sentence
  • a series of sequential stock ticks
  • consecutive days of weather

We say that a given data point has T_x such elements.
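For concreteness, a sentence of T_x = 4 words could be stored as T_x vectors in a row; the one-hot encoding below is just an illustrative choice:

import numpy as np

# "the cat sat down" as T_x = 4 one-hot rows over a toy 6-word vocabulary
vocab = ["the", "cat", "sat", "down", "dog", "ran"]
sentence = ["the", "cat", "sat", "down"]
x_seq = np.eye(len(vocab))[[vocab.index(w) for w in sentence]]  # shape (4, 6)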

In the most basic case, we have some handoff of information: the activation a^<t-1> passed from step t-1 to step t. This gets run through a cell at each step, which outputs:

  • y^<t>, the prediction at the tth step
  • a^<t>, the carry-forward information passed back into the same cell


Each cell computes its output from both pieces of information: this element's input as well as the activation handed forward from the previous element.
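In symbols, this is a minimal abstraction of the sentence above (f and g here are stand-ins for the cell's internal functions, which are spelled out in the next section):

$a^{\langle t \rangle} = f(a^{\langle t-1 \rangle}, x^{\langle t \rangle}) \qquad \hat{y}^{\langle t \rangle} = g(a^{\langle t \rangle})$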


Specifically, at the cell level this happens through pairs of weight matrices and bias terms.

  • W_aa, b_a: from the last activation to this activation
  • W_ax, b_x: from the input data to this activation
  • W_ya, b_y: from this activation to this output
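Written out, with tanh and softmax as typical (but not the only) choices of activation, the cell computes:

$a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$

$\hat{y}^{\langle t \rangle} = \mathrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$

(When a separate input bias b_x is used, it is simply added alongside b_a; in practice the two are often folded into a single term.)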


The values of W_aa, W_ya, and W_ax are literally the same objects regardless of which element, t, you're looking at: the weights (and biases) are shared across every time step.
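A minimal numpy sketch of this unrolled forward pass (the function name, shapes, and toy values are illustrative assumptions, not from the original). Note that the same weights and biases are reused at every step t:

import numpy as np

def rnn_forward(x_seq, a0, W_aa, W_ax, W_ya, b_a, b_y):
    """Run a basic RNN over a sequence.

    x_seq: (T_x, n_x) inputs; a0: (n_a,) initial activation.
    Returns the per-step predictions and the final activation.
    """
    a = a0
    y_hats = []
    for x_t in x_seq:                      # one iteration per time step t
        # Same weight matrices and biases at every t -- weights are shared.
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        logits = W_ya @ a + b_y
        y_hat = np.exp(logits) / np.exp(logits).sum()   # softmax
        y_hats.append(y_hat)
    return np.stack(y_hats), a

# Toy shapes: T_x=3 steps, n_x=4 input dims, n_a=5 hidden units, n_y=2 outputs
rng = np.random.default_rng(0)
T_x, n_x, n_a, n_y = 3, 4, 5, 2
y_hats, a_T = rnn_forward(
    rng.normal(size=(T_x, n_x)), np.zeros(n_a),
    rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)),
    rng.normal(size=(n_y, n_a)), np.zeros(n_a), np.zeros(n_y),
)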

Cost Function

The output at each time step has its own loss function that looks like our typical logistic/softmax cost:

$\mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) = -y^{\langle t \rangle} \log \hat{y}^{\langle t \rangle} - (1-y^{\langle t \rangle}) \log (1-\hat{y}^{\langle t \rangle})$

The key difference here, however, is that there's an overarching cost $\mathcal{L}$ that sums across all time steps:

$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_x} \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle})$
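A quick numpy sketch of this sum for the binary case (the numbers are illustrative stand-ins, not real data):

import numpy as np

# Illustrative labels y^<t> and predictions yhat^<t> for T_x = 4 steps
y     = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# Per-step logistic losses L^<t>; the overall cost is their sum over t
per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
total_loss = per_step.sum()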



This gives us a great deal of flexibility in how we construct our Recurrent Networks.

Different Architectures



  • one-to-one: a vanilla MLP (no recurrence)
  • one-to-many: text generation, as sketched below
  • many-to-one: gender prediction from an audio clip
  • many-to-many (T_x ≠ T_y): language translation
  • many-to-many (T_x = T_y): frame-by-frame video classification
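As one example of this flexibility, a one-to-many generator can seed the cell with a single input and then feed each prediction back in as the next step's input. A hypothetical sketch, using the same cell arithmetic as above (it assumes the output and input dimensions match):

import numpy as np

def generate(x0, a0, W_aa, W_ax, W_ya, b_a, b_y, steps):
    """One-to-many: seed with one input, then feed each softmax
    prediction back in as the next input.
    Assumes n_y == n_x so predictions can be re-fed."""
    x, a, outputs = x0, a0, []
    for _ in range(steps):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        logits = W_ya @ a + b_y
        x = np.exp(logits) / np.exp(logits).sum()   # softmax prediction
        outputs.append(x)
    return np.stack(outputs)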