LSTM (Long Short-Term Memory)
An LSTM network is a kind of artificial neural network that falls into the category of Recurrent Neural Networks (RNNs). Much like convolutional layers help a network learn image features, LSTM cells help a network learn from temporal data, something other machine learning models have traditionally struggled with.
Core Concept: How do LSTM cells work?
Each LSTM cell in our neural network looks at only a single column of its inputs, along with the output of the previous column’s LSTM cell. An LSTM cell therefore has two input vectors: the previous LSTM cell’s output (which gives it some information about the previous input column) and its own input column.
The core concepts of an LSTM are the cell state and its various gates. The cell state acts as a transport highway that carries relevant information all the way down the sequence chain. You can think of it as the “memory” of the network. In theory, the cell state can carry relevant information throughout the processing of the sequence, so even information from early time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information is added to it or removed from it via gates. The gates are small neural networks that decide which information is allowed onto the cell state; during training, they learn what information is relevant to keep or forget.
How LSTM Cells work:
The “forget gate” is a sigmoid layer that regulates how much the previous cell’s state will influence this one’s. It takes as input both the previous cell’s “hidden state” (another output vector) and the actual inputs from the previous layer. Since it is a sigmoid, it returns a vector of “probabilities”: values between 0 and 1. These multiply the previous cell’s state elementwise to regulate how much influence it holds, which gives the starting point for this cell’s state.
Unlike the forget gate, the input gate’s output is added to the previous cell’s state (after it has been multiplied by the forget gate’s output). The input gate is the elementwise product of two different layers’ outputs, both of which take the same input as the forget gate (the previous cell’s hidden state and the previous layer’s outputs):
- A sigmoid unit, regulating how much the new information will impact this cell’s output.
- A tanh unit, which actually extracts the new information. Note that tanh takes values between -1 and 1.
The product of these two units (which could, again, be 0, or be exactly equal to the tanh output, or anything in between) is added to this neuron’s cell state.
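The cell-state update described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `update_cell_state`, the parameter names (`W_f`, `W_i`, `W_g` and their biases), and the choice to concatenate the hidden state with the input are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_cell_state(c_prev, h_prev, x, params):
    """Apply the forget gate and the input gate to get the new cell state."""
    # Both gates see the same input: the previous hidden state joined with the current input.
    z = np.concatenate([h_prev, x])
    f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget gate, values in (0, 1)
    i = sigmoid(params["W_i"] @ z + params["b_i"])  # sigmoid unit of the input gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])  # tanh unit, values in (-1, 1)
    # Elementwise: scale the old state, then add the gated new information.
    return f * c_prev + i * g

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
params = {}
for name in ("f", "i", "g"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
    params[f"b_{name}"] = np.zeros(hidden)

c_prev = rng.normal(size=hidden)
h_prev = np.zeros(hidden)
x = rng.normal(size=inputs)
c_new = update_cell_state(c_prev, h_prev, x, params)
```

Because the forget gate multiplies and the input gate adds, each entry of the new state can move away from the scaled old state by at most 1 in absolute value per step.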
The LSTM cell’s outputs
The cell’s state is what the next LSTM cell receives as input, along with this cell’s hidden state. The hidden state is a tanh applied to this cell’s state, multiplied elementwise by another sigmoid unit (the output gate) that takes the previous layer’s and previous cell’s outputs as input, just like the forget gate.
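As a rough sketch of this output step, the snippet below computes a hidden state from an already updated cell state. The names `hidden_from_state`, `W_o`, and `b_o` are hypothetical, and the all-zero weights are chosen only so that the gate value is easy to read off.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_from_state(c, h_prev, x, W_o, b_o):
    """Compute the hidden state from the current cell state."""
    # The output gate sees the same input as the forget gate.
    z = np.concatenate([h_prev, x])
    o = sigmoid(W_o @ z + b_o)  # output gate, values in (0, 1)
    # Squash the cell state into (-1, 1), then let the gate scale it.
    return o * np.tanh(c)

c = np.array([2.0, -2.0, 0.0])  # an example cell state
h_prev = np.zeros(3)
x = np.ones(2)
W_o = np.zeros((3, 5))          # zero weights and bias make every gate value 0.5
b_o = np.zeros(3)
h = hidden_from_state(c, h_prev, x, W_o, b_o)
```

With the gate fixed at 0.5, the hidden state here is simply half of `tanh(c)`, so it always stays strictly inside (-1, 1) even when the cell state grows larger.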
GRU (Gated Recurrent Units)
The GRU is a newer generation of recurrent neural network and is quite similar to an LSTM. GRUs get rid of the cell state and use the hidden state to transfer information. A GRU also has only two gates: a reset gate and an update gate.
The update gate acts similarly to the forget and input gates of an LSTM. It decides what information to throw away and what new information to add.
The reset gate is used to decide how much past information to forget.
GRUs have fewer tensor operations and are therefore a little faster to train than LSTMs. Still, researchers and engineers usually try both to determine which one works better for their use case.
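A single GRU step can be sketched as follows. The parameter names and the `gru_step` helper are assumptions for this example, and the sign convention for the update gate (whether `z` or `1 - z` weights the old state) varies between papers and libraries; this sketch picks one of them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, p):
    """One GRU step: two gates and no separate cell state."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ hx + p["b_r"])  # reset gate
    # The reset gate decides how much of the past enters the candidate state.
    h_cand = np.tanh(p["W_h"] @ np.concatenate([r * h_prev, x]) + p["b_h"])
    # The update gate blends the old hidden state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(1)
hidden, inputs = 4, 3
p = {}
for name in ("z", "r", "h"):
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
    p[f"b_{name}"] = np.zeros(hidden)

h = np.zeros(hidden)
for x in rng.normal(size=(6, inputs)):  # a sequence of 6 time steps
    h = gru_step(h, x, p)
```

Compared with an LSTM step, there are three weight matrices instead of four and only one state vector to carry forward, which is where the saving in tensor operations comes from.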
RNNs are good at processing sequence data for predictions but suffer from short-term memory. LSTMs and GRUs were created to mitigate short-term memory using mechanisms called gates. Gates are just neural networks that regulate the flow of information through the sequence chain. LSTMs and GRUs are used in state-of-the-art deep learning applications such as speech recognition, speech synthesis, and natural language understanding.
Bahdanau Attention in seq2seq modeling
Neural Machine Translation, the task for which Bahdanau et al. first introduced attention in sequence modeling, is the task of training a neural network model to perform human language translation. A seq2seq model aims to transform an input sequence (source) into an output sequence (target). Here the source is a sequence in one human language and the target is a sequence in another (e.g., English to Spanish).
A seq2seq model has two parts, an encoder and a decoder. In the early days of neural seq2seq models (Sutskever et al.), both parts were RNNs/LSTMs/GRUs. This approach has an inherent problem in the way information flows between the two parts. A context vector, which is simply the output of the encoder at the last time step, is fed into the decoder. This context vector acts as a bottleneck between the two parts and therefore controls how well the seq2seq model succeeds at the task. When the source is long, RNNs/LSTMs/GRUs tend not to remember the whole source sequence, which hinders decoding and makes the whole approach less effective. Is there a way to look not just at the final-timestep encoder output, but at whatever the encoder outputs at all of the timesteps? Enter attention, as described by Bahdanau et al.
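The bottleneck can be made concrete with a toy encoder. The sketch below uses a plain tanh RNN instead of an LSTM or GRU to keep it short; the `encode` function, the sizes, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def encode(source, W_h, W_x, b):
    """Run a plain tanh RNN over the source and return every hidden state."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in source:
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return np.stack(states)  # shape: (timesteps, hidden)

rng = np.random.default_rng(2)
hidden, emb = 8, 5
W_h = rng.normal(scale=0.3, size=(hidden, hidden))
W_x = rng.normal(scale=0.3, size=(hidden, emb))
b = np.zeros(hidden)

short_source = rng.normal(size=(4, emb))   # a 4-token source
long_source = rng.normal(size=(50, emb))   # a 50-token source

states_short = encode(short_source, W_h, W_x, b)
states_long = encode(long_source, W_h, W_x, b)

# Without attention, only the last encoder output reaches the decoder.
context_short = states_short[-1]
context_long = states_long[-1]
```

Both context vectors have the same fixed size, so the 50-token source has to be squeezed into exactly as many numbers as the 4-token one. Attention instead makes use of all the rows of `states_long`.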
Some notations first:
x — the input sequence
y — the output sequence
s_t — decoder output at timestep ‘t’
c_t — context vector at timestep ‘t’, which is also the input to the decoder
How do we compute this new context vector, which solves our bottleneck problem? We use attention.
Bahdanau attention is a way to compute the relative importance of each timestep of the source sequence with respect to each element of the output sequence. Let me explain.
To generate an output at timestep ‘t’, the input to the decoder, i.e. the context vector at time ‘t’, is a convex combination of the encoder outputs at all timesteps. The weights (alphas) of this convex combination are obtained by the attention mechanism. There is one set of alphas for each output timestep.
How do we find the convex combination weights? We learn them using fully-connected neural networks. The defining property of a convex combination, non-negative coefficients that sum to one, can be achieved by applying a softmax to the logits of the fully-connected network.
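To see how a softmax turns arbitrary logits into convex combination weights, consider this small sketch. The encoder outputs and logits are made-up numbers, not values from any trained model.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Three encoder outputs (one row per source timestep), made up for illustration.
encoder_outputs = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
logits = np.array([2.0, 0.5, -1.0])  # one unnormalized score per source timestep

alphas = softmax(logits)             # positive weights that sum to one
context = alphas @ encoder_outputs   # convex combination of the rows
```

Because the weights are positive and sum to one, every coordinate of `context` lies between the smallest and largest value of that coordinate across the encoder outputs, and the timestep with the highest logit contributes the most.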
When generating decoder output at timestep ‘t’, the attention weight of encoder output at timestep ‘i’ is denoted by alpha_t,i.
The weights are computed by two fully-connected layers, w_1 and w_2, which take as input the decoder output s_t-1 concatenated with the encoder output at time ‘i’, i.e., h_i.
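One way to sketch this computation is shown below. It assumes a tanh nonlinearity between the two layers and a second layer that outputs a single score per timestep; the function name `attention_weights`, the sizes, and the random weights are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def attention_weights(s_prev, H, W1, b1, w2):
    """Return alpha_t,i for every encoder timestep i, given s_t-1."""
    T = H.shape[0]
    # Pair the previous decoder output with every encoder output h_i.
    pairs = np.concatenate([np.tile(s_prev, (T, 1)), H], axis=1)  # (T, d_s + d_h)
    hidden = np.tanh(pairs @ W1.T + b1)  # first fully-connected layer (tanh is an assumption)
    logits = hidden @ w2                 # second layer: one score per timestep
    return softmax(logits)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(3)
T, d_s, d_h, d_a = 6, 4, 5, 7
H = rng.normal(size=(T, d_h))        # encoder outputs h_1 ... h_T
s_prev = rng.normal(size=d_s)        # decoder output s_t-1
W1 = rng.normal(scale=0.5, size=(d_a, d_s + d_h))
b1 = np.zeros(d_a)
w2 = rng.normal(scale=0.5, size=d_a)

alphas = attention_weights(s_prev, H, W1, b1, w2)
c_t = alphas @ H                     # context vector fed to the decoder at timestep t
```

The same two layers are reused for every pair (s_t-1, h_i), so the number of parameters does not depend on the length of the source sequence.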