Intuition on Neural Network sequence models

LSTM (Long Short-Term Memory)

An LSTM network is just another kind of artificial neural network, one that falls into the category of Recurrent Neural Networks. Much like convolutional layers help a neural network learn about image features, LSTM cells help the network learn from temporal data, something which other Machine Learning models traditionally struggled with.

Core Concept: How do LSTM cells work?

Each LSTM cell in our Neural Network will only look at a single column of its inputs, and also at the previous column’s LSTM cell’s output. Each LSTM cell therefore has two different input vectors: the previous LSTM cell’s output (which gives it some information about the previous input column) and its own input column.

How LSTM Cells work:

Forget Gate

The “forget gate” is a sigmoid layer that regulates how much the previous cell’s state will influence this one’s. It takes as input both the previous cell’s “hidden state” (another output vector) and the actual inputs from the previous layer. Since it is a sigmoid, it returns a vector of “probabilities”: values between 0 and 1. These multiply the previous cell’s state element-wise, regulating how much influence each component holds over this cell’s state.
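The forget gate described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation; the dimensions, weight names (`W_f`, `b_f`), and random values are all hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W_f = rng.standard_normal((n_hid, n_in + n_hid))  # forget-gate weights (illustrative)
b_f = np.zeros(n_hid)

x_t = rng.standard_normal(n_in)      # this step's input column
h_prev = rng.standard_normal(n_hid)  # previous cell's hidden state
c_prev = rng.standard_normal(n_hid)  # previous cell's state

# The sigmoid squashes each entry into (0, 1): a per-dimension "keep" fraction.
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Element-wise multiplication regulates how much of the old state survives.
kept_state = f_t * c_prev
```

Note that the gate never zeroes anything out exactly; a sigmoid output of near 0 simply shrinks that component of the old state toward nothing.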

Input Gate

Unlike the forget gate, the input gate’s output is added to the previous cell’s state (after it has been multiplied by the forget gate’s output). The input gate is the element-wise product of two different layers’ outputs, though they both take the same input as the forget gate (previous cell’s hidden state, and previous layer’s outputs):

  • A sigmoid unit, which regulates how much of the new information to let in. Like the forget gate, it takes values between 0 and 1.

  • A tanh unit, which actually extracts the new information. Notice tanh takes values between -1 and 1.
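Putting the two units together with the forget gate gives the cell-state update. Again a minimal NumPy sketch with hypothetical dimensions and weight names (`W_i`, `W_c`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
concat = rng.standard_normal(n_in + n_hid)  # [h_prev, x_t], same input as the forget gate

W_i = rng.standard_normal((n_hid, n_in + n_hid))  # input-gate (sigmoid) weights
W_c = rng.standard_normal((n_hid, n_in + n_hid))  # candidate (tanh) weights

i_t = sigmoid(W_i @ concat)      # in (0, 1): how much new information to let in
c_tilde = np.tanh(W_c @ concat)  # in (-1, 1): the new information itself

f_t = rng.uniform(size=n_hid)        # stand-in for the forget gate's output
c_prev = rng.standard_normal(n_hid)  # previous cell's state

# New cell state: gated old state plus gated new information.
c_t = f_t * c_prev + i_t * c_tilde
```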

The LSTM cell’s outputs

The cell’s state is what the next LSTM cell will receive as input, along with this cell’s hidden state. The hidden state is another tanh unit applied to this cell’s state, multiplied by another sigmoid unit that takes the previous layer’s and cell’s outputs as input (just like the forget gate).
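That last sigmoid is often called the output gate. A short sketch of how the hidden state is produced, with hypothetical sizes and the weight name `W_o` assumed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
c_t = rng.standard_normal(n_hid)                        # this cell's new state
concat = rng.standard_normal(n_in + n_hid)              # [h_prev, x_t], as before
W_o = rng.standard_normal((n_hid, n_in + n_hid))        # output-gate weights

o_t = sigmoid(W_o @ concat)  # sigmoid gate over the same inputs as the forget gate
h_t = o_t * np.tanh(c_t)     # hidden state passed on to the next cell

# (c_t, h_t) together are this LSTM cell's two outputs to the next time step.
```

Because `o_t` is in (0, 1) and `tanh` is in (-1, 1), every component of the hidden state stays strictly between -1 and 1.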

GRU (Gated Recurrent Units)

The GRU is a newer generation of recurrent neural network and is pretty similar to an LSTM. GRUs get rid of the cell state and use the hidden state alone to transfer information. A GRU also has only two gates: a reset gate and an update gate.

Update Gate

The update gate acts similarly to the forget and input gates of an LSTM: it decides what information to throw away and what new information to add.

Reset Gate

The reset gate is another gate used to decide how much past information to forget.
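The two gates combine into a single-state update. A minimal NumPy sketch of one GRU step, with hypothetical dimensions and weight names (`W_z`, `W_r`, `W_h`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4
x_t = rng.standard_normal(n_in)      # this step's input
h_prev = rng.standard_normal(n_hid)  # previous hidden state (no separate cell state)

W_z = rng.standard_normal((n_hid, n_in + n_hid))  # update-gate weights
W_r = rng.standard_normal((n_hid, n_in + n_hid))  # reset-gate weights
W_h = rng.standard_normal((n_hid, n_in + n_hid))  # candidate weights

concat = np.concatenate([h_prev, x_t])
z_t = sigmoid(W_z @ concat)  # update gate: keep old info vs. take new info
r_t = sigmoid(W_r @ concat)  # reset gate: how much past information to forget

# The candidate hidden state is computed from the *reset* previous state.
h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))

# One hidden state carries everything forward.
h_t = (1 - z_t) * h_prev + z_t * h_tilde
```

Since the update gate both discards old information (via `1 - z_t`) and admits new information (via `z_t`), it plays the roles of the LSTM’s forget and input gates at once.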


RNNs are good at processing sequence data for predictions but suffer from short-term memory. LSTMs and GRUs were created to mitigate short-term memory using mechanisms called gates. Gates are just small neural networks that regulate the flow of information through the sequence chain. LSTMs and GRUs are used in state-of-the-art deep learning applications like speech recognition, speech synthesis, and natural language understanding.

Bahdanau Attention in seq2seq modeling

Neural Machine Translation, the task for which Bahdanau et al. first introduced attention in sequence modeling, is the task of learning a neural network model to perform human language translation. A seq2seq model aims to transform an input sequence (the source) into an output sequence (the target). Here the source is a sequence in one human language and the target is a sequence in another, desired human language (e.g., English to Spanish).

Learning the attention weights
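In Bahdanau-style (additive) attention, each decoder step scores every encoder hidden state against the decoder’s previous state, and a softmax turns those scores into weights. A minimal NumPy sketch, with hypothetical sizes and the parameter names `W_s`, `W_h`, and `v` assumed:

```python
import numpy as np

rng = np.random.default_rng(4)
n_hid, T = 4, 5                        # hidden size and source length (hypothetical)
s_prev = rng.standard_normal(n_hid)            # decoder state before emitting this word
H = rng.standard_normal((T, n_hid))            # encoder hidden states h_1..h_T

# Additive scoring: e_j = v^T tanh(W_s s_prev + W_h h_j), one score per source position.
W_s = rng.standard_normal((n_hid, n_hid))
W_h = rng.standard_normal((n_hid, n_hid))
v = rng.standard_normal(n_hid)

scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v

# Softmax turns the scores into attention weights that sum to 1.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Context vector: the attention-weighted sum of encoder states, fed to the decoder.
context = alpha @ H
```

Because the scoring network is differentiable, these weights are learned end to end with the rest of the translation model.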
