In the Seq2Seq model that we explored, the encoder processes input data and produces a context vector: a single vector that encapsulates the entire input sequence. This vector is then passed to the decoder, which uses it to generate the output sequence. However, it’s important to consider the challenge of compressing all the information from the input sequence into a single vector. This compression likely leads to a loss of valuable information, as the context vector may not fully reflect the intricacies of the entire input.

Additionally, during the generation of the output sequence, the decoder might struggle to produce accurate results when relying solely on a single fixed representation of the source. Different parts of the input sequence can be more relevant at different stages of output generation. This limitation—where the context vector becomes a bottleneck—motivates the introduction of the Attention mechanism, which addresses the issue of fixed representation by allowing the model to focus on different parts of the input dynamically.

Attention Mechanism: A Solution to the Fixed Representation Problem

The core idea of attention is to create additional pathways from the encoder to the decoder, enabling the decoder to focus on specific parts of the input sequence at each step of the output generation process. Consider the following example:

[Figure: attention over the encoder states while translating “te quiero” to “I love you”]

In this scenario, we want to translate the Spanish sentence “te quiero” to the English phrase “I love you”. After the encoder processes all tokens in the input sentence, the model is ready to predict the first word of the English sentence. Attention begins by assessing the similarity between each of the encoder’s hidden states $s_1, s_2, s_3$ and the first decoder hidden state $h_1$. This similarity is quantified through attention scores such as $\text{score}(h_1, s_1)$, $\text{score}(h_1, s_2)$, and $\text{score}(h_1, s_3)$.

Next, these scores are passed through a softmax function to produce attention weights. The softmax function normalizes the scores, converting them into values between 0 and 1 that sum to 1. These weights indicate the significance of each encoded input word in determining the first translated word. Finally, the attention output is calculated as a weighted sum of the encoder states, using the attention weights.
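
For instance, if the three scores were 2.0, 1.0, and 0.1 (hypothetical values chosen only to illustrate the arithmetic), the softmax step would give

$$a_1 = \frac{e^{2.0}}{e^{2.0} + e^{1.0} + e^{0.1}} \approx 0.66, \qquad a_2 \approx 0.24, \qquad a_3 \approx 0.10,$$

and the attention output would be $c_1 \approx 0.66\,s_1 + 0.24\,s_2 + 0.10\,s_3$, dominated by the encoder state the decoder currently finds most relevant.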

For a decoder time step $t$, given $m$ encoder states $s_1, \dots, s_m$, the general computation scheme is as follows:

  1. Attention Scores:
$$\text{score}(h_t, s_k), \quad \text{where } k = 1, \dots, m$$
  2. Attention Weights:
$$a_k^{(t)} = \frac{\exp\left(\text{score}(h_t, s_k)\right)}{\sum_{i=1}^{m} \exp\left(\text{score}(h_t, s_i)\right)}$$
  3. Attention Output:
$$c_t = a_1^{(t)} s_1 + a_2^{(t)} s_2 + \dots + a_m^{(t)} s_m = \sum_{k=1}^{m} a_k^{(t)} s_k$$
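
A minimal NumPy sketch of these three steps, assuming dot-product scoring and toy random vectors purely for illustration (the function name, array shapes, and values are assumptions, not from any particular library):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Compute the attention output for one decoder time step.

    decoder_state:  h_t, shape (d,)
    encoder_states: s_1..s_m stacked as rows, shape (m, d)
    Dot-product scoring is assumed here for simplicity.
    """
    # 1. Attention scores: score(h_t, s_k) for k = 1..m
    scores = encoder_states @ decoder_state          # shape (m,)

    # 2. Attention weights: softmax over the scores
    exp_scores = np.exp(scores - scores.max())       # subtract max for numerical stability
    weights = exp_scores / exp_scores.sum()          # shape (m,), sums to 1

    # 3. Attention output: weighted sum of the encoder states
    context = weights @ encoder_states               # shape (d,)
    return context, weights

# Toy example: m = 3 encoder states ("te", "quiero", <eos>), hidden size d = 4
rng = np.random.default_rng(0)
s = rng.normal(size=(3, 4))   # encoder hidden states
h = rng.normal(size=(4,))     # first decoder hidden state h_1
c, a = attention(h, s)
print("attention weights:", a)   # three values between 0 and 1 summing to 1
print("attention output:", c)    # weighted combination of the encoder states
```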

Computing Attention Scores

There are several ways to compute attention scores, but two of the most popular methods are Bahdanau Attention and Luong Attention.

  • Bahdanau Attention (Additive Attention):

    • The attention score is computed using a feedforward neural network. The decoder’s hidden state $h_{t-1}$ is combined with each encoder hidden state $s_k$ to compute the attention score. The resulting context vector $c_t$ is then used, along with $h_{t-1}$, as input to the decoder at time step $t$.
$$\text{score}(h_{t-1}, s_k) = v_a^\top \tanh\left(W_a\,[s_k ; h_{t-1}]\right)$$
  • Luong Attention (Multiplicative Attention):

    • The attention score is calculated as the dot product between the decoder’s hidden state and each encoder’s hidden state. Luong proposed three variants: dot, general, and concat (the dot and general forms are sketched in code after the comparison below).

      • Dot: The simplest variant, where the attention score is the dot product of the decoder and encoder hidden states: $\text{score}(h_t, s_k) = s_k^\top h_t$

      • General: Similar to dot, but introduces a learned weight matrix: $\text{score}(h_t, s_k) = s_k^\top W_a h_t$

      • Concat: Similar to Bahdanau’s method but with a slightly different formulation.

Bahdanau attention is often considered more flexible but computationally more expensive. Luong attention is simpler and faster, particularly in the dot-product form.
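
As a rough illustration of the difference, here is a minimal NumPy sketch of the two scoring styles. The hidden size, parameter shapes, and random initialization are illustrative assumptions, not values taken from either paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # hidden size (illustrative)
s_k = rng.normal(size=(d,))            # one encoder hidden state
h_t = rng.normal(size=(d,))            # decoder hidden state (h_{t-1} in the Bahdanau case)

# Learned parameters (randomly initialized here just for the sketch)
W_a_add = rng.normal(size=(d, 2 * d))  # additive: projects the concatenation [s_k; h]
v_a = rng.normal(size=(d,))            # additive: projection down to a scalar score
W_a_gen = rng.normal(size=(d, d))      # multiplicative "general": learned weight matrix

def additive_score(h_prev, s):
    """Bahdanau-style (additive): v_a^T tanh(W_a [s_k; h_{t-1}])."""
    concat = np.concatenate([s, h_prev])        # shape (2d,)
    return v_a @ np.tanh(W_a_add @ concat)      # scalar score

def dot_score(h, s):
    """Luong "dot": s_k^T h_t."""
    return s @ h

def general_score(h, s):
    """Luong "general": s_k^T W_a h_t."""
    return s @ (W_a_gen @ h)

print("additive:", additive_score(h_t, s_k))
print("dot:     ", dot_score(h_t, s_k))
print("general: ", general_score(h_t, s_k))
```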

References