What is LSTM – Long Short Term Memory?

LSTM excels in sequence prediction tasks because it can capture long-term dependencies. It is well suited to time series, machine translation, and speech recognition, where the order of the data matters. This article provides an in-depth introduction to LSTM, covering the LSTM model, its architecture, its working principles, and the role it plays in various applications.

Long Short-Term Memory (LSTM) is an improved version of the recurrent neural network (RNN), designed by Hochreiter & Schmidhuber.

A traditional RNN has a single hidden state that is passed through time, which can make it difficult for the network to learn long-term dependencies. LSTM models address this problem by introducing a memory cell, which is a container that can hold information for an extended period.

LSTM architectures are capable of learning long-term dependencies in sequential data, which makes them well-suited for tasks such as language translation, speech recognition, and time series forecasting.

LSTMs can also be used in combination with other neural network architectures, such as Convolutional Neural Networks (CNNs) for image and video analysis.

The LSTM architecture is built around a memory cell that is controlled by three gates: the input gate, the forget gate, and the output gate. These gates decide what information to add to, remove from, and output from the memory cell.

The input gate controls what information is added to the memory cell.
The forget gate controls what information is removed from the memory cell.
The output gate controls what information is output from the memory cell.

This lets LSTM networks selectively retain or discard information as it flows through the network, which is what enables them to learn long-term dependencies.

The LSTM maintains a hidden state, which acts as the short-term memory of the network. The hidden state is updated based on the input, the previous hidden state, and the memory cell’s current state.

Bidirectional LSTM Model

Bidirectional LSTM (Bi LSTM/BLSTM) is a recurrent neural network (RNN) that processes sequential data in both the forward and backward directions. This gives a Bi LSTM access to both past and future context at every time step, unlike a traditional LSTM, which processes sequential data in one direction only and therefore sees only the past.

Bi LSTMs are made up of two LSTM networks, one that processes the input sequence in the forward direction and one that processes the input sequence in the backward direction.
The outputs of the two LSTM networks are then combined to produce the final output.
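The forward and backward passes can be sketched in a few lines. This is a minimal illustration, not a full Bi LSTM: `toy_step` is a hypothetical single-unit recurrent update that stands in for an LSTM cell, and the two directions are combined by pairing (concatenating) their outputs at each time step, which is one common choice.

```python
import math

def toy_step(h_prev, x_t):
    """Stand-in for an LSTM cell: a single-unit recurrent update (hypothetical weights)."""
    return math.tanh(0.5 * h_prev + x_t)

def run_direction(step, sequence):
    """Run a recurrent step over a sequence and collect the state at every time step."""
    h = 0.0
    outputs = []
    for x_t in sequence:
        h = step(h, x_t)
        outputs.append(h)
    return outputs

def bidirectional(step_fwd, step_bwd, sequence):
    """Pair the forward-pass and backward-pass outputs for each time step."""
    forward = run_direction(step_fwd, sequence)
    # Process the reversed sequence, then flip the result back to input order.
    backward = run_direction(step_bwd, sequence[::-1])[::-1]
    return list(zip(forward, backward))

sequence = [0.0, 0.0, 1.0, 0.0]  # a single "event" at time step 2
outputs = bidirectional(toy_step, toy_step, sequence)
for t, (f, b) in enumerate(outputs):
    print(t, round(f, 3), round(b, 3))
```

At time step 0 the forward output is still zero because the event has not happened yet, while the backward output is already non-zero because that pass has seen the event at step 2.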

LSTM models, including Bi LSTMs, have demonstrated state-of-the-art performance across various tasks such as machine translation, speech recognition, and text summarization.

Networks in LSTM architectures can be stacked to create deep architectures, enabling the learning of even more complex patterns and hierarchies in sequential data. Each LSTM layer in a stacked configuration captures different levels of abstraction and temporal dependencies within the input data.

The LSTM architecture has a chain structure in which each repeating unit, called a cell, contains four neural network layers: one for each of the three gates and one that produces candidate values.

Information is retained by the cells, and the memory manipulations are done by the gates. There are three gates:

Forget Gate

The forget gate removes information that is no longer useful from the cell state. Two inputs, x_t (the input at the current time step) and h_{t-1} (the previous hidden state), are fed to the gate, multiplied by a weight matrix, and added to a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1 for each element of the cell state. An output near 0 means the corresponding piece of information is forgotten, and an output near 1 means it is retained for future use. The equation for the forget gate is:

[Tex] f_t = σ(W_f · [h_{t-1}, x_t] + b_f) [/Tex]
where:

W_f represents the weight matrix associated with the forget gate.
[h_{t-1}, x_t] denotes the concatenation of the previous hidden state and the current input.
b_f is the bias associated with the forget gate.
σ is the sigmoid activation function.
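As a rough illustration of the forget gate equation, the sketch below evaluates it in plain Python. The sizes (two hidden units, one input feature) and all weight values are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    """Compute f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell unit."""
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]

# Hypothetical sizes: 2 hidden units, 1 input feature, so W_f is 2 x 3.
W_f = [[0.5, -0.3, 0.8],
       [-1.0, 0.2, 0.1]]
b_f = [0.0, 0.0]
h_prev = [0.1, -0.2]
x_t = [1.0]

f_t = forget_gate(W_f, b_f, h_prev, x_t)
print(f_t)  # each entry lies strictly between 0 and 1
```

Each entry of `f_t` is a soft "keep fraction" for the matching element of the cell state rather than a hard 0 or 1.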

Input gate

The input gate adds useful information to the cell state. First, a sigmoid function applied to the inputs h_{t-1} and x_t regulates the information and selects which values to update, similar to the forget gate. Then, the tanh function creates a vector of candidate values Ĉ_t, each between -1 and +1, from h_{t-1} and x_t. Finally, the candidate vector and the regulated values are multiplied to obtain the information that is added to the cell state. The equations for the input gate are:

[Tex] i_t = σ(W_i · [h_{t-1}, x_t] + b_i) [/Tex]

[Tex]Ĉ_t = tanh(W_c · [h_{t-1}, x_t] + b_c) [/Tex]

We multiply the previous cell state C_{t-1} by f_t, discarding the information we had previously chosen to forget. Next, we add i_t ⊙ Ĉ_t. This represents the new candidate values, scaled by how much we chose to update each state value.

[Tex]C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t [/Tex]

where

⊙ denotes element-wise multiplication
tanh is the hyperbolic tangent activation function
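The cell state update can be sketched as follows. All weights are set to zero and only the biases are varied, an artificial setup chosen here so that the gates are easy to reason about: a large positive bias pushes a gate towards 1, and a large negative bias pushes it towards 0. The dictionary keys and helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def update_cell_state(p, c_prev, h_prev, x_t):
    """C_t = f_t * C_{t-1} + i_t * C_hat_t, with element-wise products."""
    concat = h_prev + x_t
    f_t = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], concat)]
    i_t = [sigmoid(z) for z in affine(p["W_i"], p["b_i"], concat)]
    c_hat = [math.tanh(z) for z in affine(p["W_c"], p["b_c"], concat)]
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]

zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # 2 hidden units, 1 input feature
c_prev, h_prev, x_t = [2.0, -4.0], [0.0, 0.0], [1.0]

# Forget gate open, input gate closed: the old cell state is carried over.
retain = {"W_f": zeros, "b_f": [20.0, 20.0],
          "W_i": zeros, "b_i": [-20.0, -20.0],
          "W_c": zeros, "b_c": [0.0, 0.0]}
c_retained = update_cell_state(retain, c_prev, h_prev, x_t)

# Forget gate closed, input gate open: the candidate values replace the old state.
overwrite = {"W_f": zeros, "b_f": [-20.0, -20.0],
             "W_i": zeros, "b_i": [20.0, 20.0],
             "W_c": zeros, "b_c": [20.0, 20.0]}
c_overwritten = update_cell_state(overwrite, c_prev, h_prev, x_t)

print(c_retained)     # close to [2.0, -4.0]
print(c_overwritten)  # close to [1.0, 1.0]
```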

Output gate

The output gate extracts useful information from the current cell state to be presented as output. First, a vector is generated by applying the tanh function to the cell state. Then, a sigmoid function applied to the inputs [Tex]h_{t-1} [/Tex]and [Tex]x_t[/Tex] regulates the information and selects which values to output. Finally, the vector and the regulated values are multiplied, and the result is sent both as the output and as the hidden state passed to the next time step. The equations for the output gate are:

[Tex]o_t = σ(W_o · [h_{t-1}, x_t] + b_o) [/Tex]

[Tex]h_t = o_t ⊙ tanh(C_t) [/Tex]
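Putting the three gates together, one full time step might look like the sketch below. It follows the equations in this article and computes the hidden state as the output gate multiplied element-wise by tanh of the cell state, as the paragraph above describes. The helper names, the random initialisation range, and the sizes are assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(p, h_prev, c_prev, x_t):
    """One LSTM time step: returns the new hidden state and the new cell state."""
    concat = h_prev + x_t                                           # [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], concat)]  # forget gate
    i_t = [sigmoid(z) for z in affine(p["W_i"], p["b_i"], concat)]  # input gate
    c_hat = [math.tanh(z) for z in affine(p["W_c"], p["b_c"], concat)]  # candidates
    o_t = [sigmoid(z) for z in affine(p["W_o"], p["b_o"], concat)]  # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def init_params(hidden, inputs, seed=0):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    rng = random.Random(seed)
    p = {}
    for name in ("f", "i", "c", "o"):
        p["W_" + name] = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
                          for _ in range(hidden)]
        p["b_" + name] = [0.0] * hidden
    return p

params = init_params(hidden=3, inputs=2)
h, c = [0.0] * 3, [0.0] * 3
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for x_t in sequence:
    h, c = lstm_step(params, h, c, x_t)  # the state is carried to the next step
print(h)
print(c)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every entry of the hidden state stays strictly between -1 and 1, while the cell state itself is not bounded in this way.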

Some well-known applications of LSTM include:

Language Modeling: LSTMs have been used for natural language processing tasks such as language modeling, machine translation, and text summarization. They can be trained to generate coherent and grammatically correct sentences by learning the dependencies between words in a sentence.
Speech Recognition: LSTMs have been used for speech recognition tasks such as transcribing speech to text and recognizing spoken commands. They can be trained to recognize patterns in speech and match them to the corresponding text.
Time Series Forecasting: LSTMs have been used for time series forecasting tasks such as predicting stock prices, weather, and energy consumption. They can learn patterns in time series data and use them to make predictions about future events.
Anomaly Detection: LSTMs have been used for anomaly detection tasks such as detecting fraud and network intrusion. They can be trained to identify patterns in data that deviate from the norm and flag them as potential anomalies.
Recommender Systems: LSTMs have been used for recommendation tasks such as recommending movies, music, and books. They can learn patterns in user behavior and use them to make personalized recommendations.
Video Analysis: LSTMs have been used for video analysis tasks such as object detection, activity recognition, and action classification. They can be used in combination with other neural network architectures, such as Convolutional Neural Networks (CNNs), to analyze video data and extract useful information.

Feature	LSTM (Long Short-Term Memory)	RNN (Recurrent Neural Network)
Memory	Has a separate memory cell that allows it to learn long-term dependencies in sequential data	Has only the hidden state, with no separate memory cell
Directionality	Processes the sequence in one direction by default; a bidirectional variant (Bi LSTM) processes it in both directions	Processes the sequence in one direction by default; bidirectional RNNs also exist
Training	More parameters and more computation per time step because of the gates and memory cell, but gradients are more stable over long sequences	Fewer parameters and less computation per time step, but prone to vanishing and exploding gradients on long sequences
Long-term dependency learning	Yes	Limited
Ability to learn sequential data	Yes	Yes
Applications	Machine translation, speech recognition, text summarization, natural language processing, time series forecasting	Natural language processing, machine translation, speech recognition, image processing, video processing

Problem with Long-Term Dependencies in RNN

Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining a hidden state that captures information from previous time steps. However, they often struggle to learn long-term dependencies, where information from distant time steps is crucial for making accurate predictions. The difficulty stems from how gradients behave during training, and it is known as the vanishing gradient and exploding gradient problems.

The two issues are described below:

Vanishing Gradient

During backpropagation through time, gradients can become extremely small as they are multiplied through the chain of recurrent connections, causing the model to have difficulty learning dependencies that are separated by many time steps.

Exploding Gradient

Conversely, gradients can explode during backpropagation, leading to numerical instability and making it difficult for the model to converge.
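Both effects can be seen in a deliberately simplified model. The sketch below assumes the gradient is scaled by the same scalar factor at every time step; in a real RNN the factor is a Jacobian matrix that changes from step to step, so this only illustrates the trend.

```python
def gradient_scale(recurrent_factor, steps):
    """Size of a gradient after being multiplied by the same factor at every time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= recurrent_factor
    return scale

for factor in (0.9, 1.0, 1.1):
    print(factor, gradient_scale(factor, 100))
```

Over 100 time steps, a factor of 0.9 shrinks the gradient to well below one ten-thousandth of its original size, while a factor of 1.1 grows it more than ten-thousand-fold.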

Different Variants on Long Short-Term Memory

Over time, several variants and improvements to the original LSTM architecture have been proposed.

Vanilla LSTM

This is the standard LSTM architecture, based on the design proposed by Hochreiter and Schmidhuber. It includes memory cells with input, forget, and output gates to control the flow of information (the forget gate was added to the original design in follow-up work by Gers et al.). The key idea is to allow the network to selectively update and forget information from the memory cell.

Peephole Connections

In the peephole LSTM, the gates are allowed to look at the cell state in addition to the hidden state. This allows the gates to consider the cell state when making decisions, providing more context information.
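In the concatenation notation used in this article, one common way to write the peephole gates is to add the cell state to each gate's input. This is a sketch of that formulation: the weight matrices W_f, W_i, and W_o are assumed to be widened to match the longer input, and some formulations instead restrict the peephole weights to act element-wise.

[Tex] f_t = σ(W_f · [C_{t-1}, h_{t-1}, x_t] + b_f) [/Tex]

[Tex] i_t = σ(W_i · [C_{t-1}, h_{t-1}, x_t] + b_i) [/Tex]

[Tex] o_t = σ(W_o · [C_t, h_{t-1}, x_t] + b_o) [/Tex]

The forget and input gates look at the previous cell state C_{t-1}, while the output gate looks at the updated cell state C_t.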

Gated Recurrent Unit (GRU)

GRU is an alternative to LSTM, designed to be simpler and computationally more efficient. It combines the input and forget gates into a single “update” gate and merges the cell state and hidden state. While GRUs have fewer parameters than LSTMs, they have been shown to perform similarly on many tasks in practice.
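For comparison with the LSTM equations, a GRU time step can be sketched as below. It uses one widely used formulation with an update gate z_t and a reset gate r_t; the names in the parameter dictionary are hypothetical, and some libraries swap the roles of z_t and 1 - z_t in the final interpolation. The weights are all zero here so that the result is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(p, h_prev, x_t):
    """One GRU time step: there is no separate cell state, only the hidden state."""
    concat = h_prev + x_t
    z_t = [sigmoid(a) for a in affine(p["W_z"], p["b_z"], concat)]  # update gate
    r_t = [sigmoid(a) for a in affine(p["W_r"], p["b_r"], concat)]  # reset gate
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t              # [r_t * h_{t-1}, x_t]
    h_hat = [math.tanh(a) for a in affine(p["W_h"], p["b_h"], gated)]  # candidate state
    # The update gate interpolates between the old state and the candidate.
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_hat)]

hidden, inputs = 2, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {"W_z": zeros, "b_z": [0.0] * hidden,
          "W_r": zeros, "b_r": [0.0] * hidden,
          "W_h": zeros, "b_h": [0.0] * hidden}

h_t = gru_step(params, [0.4, -0.8], [1.0])
print(h_t)  # [0.2, -0.4]
```

With zero weights the update gate is exactly 0.5 and the candidate state is 0, so the new hidden state is half of the old one. The single update gate does the work that the forget and input gates share in an LSTM.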

Long Short-Term Memory (LSTM) is a powerful type of recurrent neural network (RNN) that is well-suited for handling sequential data with long-term dependencies. It addresses the vanishing gradient problem, a common limitation of RNNs, by introducing a gating mechanism that controls the flow of information through the network. This allows LSTMs to learn and retain information from the past, making them effective for tasks like machine translation, speech recognition, and natural language processing.

1. What is LSTM and why it is used?

LSTM, or Long Short-Term Memory, is a type of recurrent neural network designed for sequence tasks, excelling in capturing and utilizing long-term dependencies in data.

2. How does LSTM work?

LSTMs use a cell state to store information about past inputs. This cell state is updated at each step of the network, and the network uses it to make predictions about the current input. The cell state is updated using a series of gates that control how much information is allowed to flow into and out of the cell.

3. What are LSTM examples?

Example applications of LSTM include speech recognition, machine translation, and time series prediction, all of which rely on its ability to capture long-term dependencies in sequential data.

4. What is the difference between LSTM and Gated Recurrent Unit (GRU)?

LSTM has a separate cell state and three gates (input, forget, and output) that control information flow, whereas GRU merges the cell state into the hidden state and uses two gates (update and reset). LSTM has more parameters and is typically slower to train, while GRU is simpler and faster.

5. What is the difference between LSTM and RNN?

RNNs have a simple recurrent structure with unidirectional information flow.
LSTMs have a gating mechanism that controls information flow and a cell state for long-term memory.
LSTMs generally outperform RNNs in tasks that require learning long-term dependencies.

6. Is LSTM faster than CNN?

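Generally, no. An LSTM processes a sequence one time step at a time, which limits parallelism, whereas the convolutions in a CNN can be computed in parallel. The two also serve different purposes: LSTMs are for sequential data; CNNs are for spatial data.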

7. Is LSTM faster than GRU?

Generally, no. GRUs have fewer parameters, which can lead to faster training compared to LSTMs.
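The difference in size can be estimated with a quick count. The sketch below assumes the formulation used in this article, where each gate or candidate layer has one weight matrix over the concatenated [h_{t-1}, x_t] and one bias vector; some libraries store two bias vectors per layer, so their reported counts are slightly higher. The sizes 128 and 64 are arbitrary.

```python
def layer_params(hidden, inputs):
    """Parameters in one gate or candidate layer: weight matrix plus bias vector."""
    return hidden * (hidden + inputs) + hidden

def lstm_params(hidden, inputs):
    """Forget gate, input gate, candidate layer, and output gate."""
    return 4 * layer_params(hidden, inputs)

def gru_params(hidden, inputs):
    """Update gate, reset gate, and candidate layer."""
    return 3 * layer_params(hidden, inputs)

print(lstm_params(128, 64))  # 98816
print(gru_params(128, 64))   # 74112
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.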
