In an LSTM network, the forget gate is a crucial component that helps manage the flow of information through the memory cell. Its primary function is to decide which information should be discarded from the cell state. This is important for maintaining relevant information over long sequences and discarding what is no longer needed.
Here's how the forget gate works:
1. **Input to the forget gate:** The forget gate takes two inputs: the previous hidden state ($h_{t-1}$) and the current input ($x_t$). These inputs are concatenated and passed through a linear layer followed by a sigmoid activation function.
2. **Sigmoid activation:** The sigmoid function outputs values between 0 and 1. This output determines the extent to which each piece of information in the cell state should be forgotten. A value close to 0 means "forget this information," while a value close to 1 means "keep this information."
3. **Element-wise multiplication:** The output of the forget gate is then multiplied element-wise with the cell state from the previous time step ($C_{t-1}$). This operation scales down, and effectively removes, the information that the forget gate has decided to discard.
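The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full LSTM cell: the dimensions (hidden size 4, input size 3) and the random initialization are arbitrary assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions, chosen only for illustration
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # forget-gate weights
b_f = np.zeros(hidden_size)                                         # forget-gate bias

h_prev = rng.standard_normal(hidden_size)  # previous hidden state h_{t-1}
x_t    = rng.standard_normal(input_size)   # current input x_t
C_prev = rng.standard_normal(hidden_size)  # previous cell state C_{t-1}

# Steps 1-2: concatenate [h_{t-1}, x_t], apply the affine map, then the sigmoid
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Step 3: element-wise multiplication scales each cell-state component
# by its gate value; components with f_t near 0 are effectively erased
C_scaled = f_t * C_prev
```

Because the sigmoid never reaches exactly 0 or 1, the gate softly attenuates information rather than deleting it outright.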
Mathematically, the forget gate can be represented as:
$$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$
Where:
- $f_t$ is the forget gate vector.
- $\sigma$ is the sigmoid function.
- $W_f$ is the weight matrix for the forget gate.
- $b_f$ is the bias for the forget gate.
- $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input.
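To make the shapes in this equation concrete, here is a tiny check with zero-initialized parameters; the sizes (hidden state of length 4, input of length 3) are arbitrary assumptions for the example.

```python
import numpy as np

h_prev = np.zeros(4)  # h_{t-1}
x_t = np.zeros(3)     # x_t
concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t], length 4 + 3 = 7

W_f = np.zeros((4, 7))  # maps the concatenation back to the hidden size
b_f = np.zeros(4)

f_t = 1.0 / (1.0 + np.exp(-(W_f @ concat + b_f)))  # sigma(W_f . [h_{t-1}, x_t] + b_f)
print(f_t.shape)  # (4,) — one gate value per cell-state component
# With all-zero weights and bias, every pre-activation is 0, so every
# gate value is sigma(0) = 0.5: the gate starts out "half remembering".
```

Note that $W_f$ must have as many columns as the concatenated vector $[h_{t-1}, x_t]$ is long, and as many rows as the cell state has components.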
By using the forget gate, LSTMs can effectively manage which information to retain and which to discard, allowing them to capture long-term dependencies in the data.