Introduction
The goal of this series of posts is to build the foundational knowledge needed to understand modern state-of-the-art LLMs, and to gain a comprehensive understanding of GPT. I prefer doing this by reading the seminal papers themselves, rather than online articles.
In my previous post, I covered some of the papers that took sequence-based models from RNNs to the attention mechanism in encoder-decoder architectures.
This post will focus on the “Attention is all you need” paper, which introduced the transformer architecture to the world and has since had an enormous effect on the AI landscape.
Papers to be covered in this series
[THIS POST] Transformers, following Vaswani et al. 2017 Google Attention is all you need
GPT-1, following Radford et al. 2018 Improving Language Understanding by Generative Pre-Training
GPT-2, following Radford et al. 2018 Language Models are Unsupervised Multitask Learners
BERT, following Devlin et al. 2019 Google Pre-training of Deep Bidirectional Transformers for Language Understanding
RoBERTa, following Liu et al. 2019 A Robustly Optimized BERT Pretraining Approach
GPT-3: Few shot learners, 2020 OpenAI Language Models are Few-Shot Learners
PaLM: following Chowdhery et al. 2022 Scaling Language Modeling with Pathways
Maybe: MACAW-LLM, following Lyu et al. 2023 Multi-Modal Language Modeling
Paper
Transformers, following Vaswani et al. 2017 Google Attention is all you need
To understand the transformers paper, let’s understand the building blocks that make up the transformer architecture.
Building Block 1: Attention
In its essence, attention allows the model to look back at previous inputs, conditioned on the current state, in an efficient manner.
Softmax attention
Softmax attention is the simplest form, where we take advantage of 3 facts:
- Softmax outputs a probability distribution / weights over a set of inputs
- Softmax amplifies larger values in the input
- Weighted average is a soft-selection that is differentiable
Using these facts, we can take a vector q, compute softmax weights over an input set U, and produce the weighted average as the output.
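As a minimal NumPy sketch of that idea (toy shapes; the `softmax_attention` helper is my own illustration, not anything from the paper):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_attention(q, U):
    # scores[i] = q . U[i]; softmax turns them into weights; the output
    # is a differentiable "soft selection" over the rows of U.
    scores = U @ q
    weights = softmax(scores)
    return weights @ U

# Toy example: 4 input vectors of dimension 3, and one query vector.
U = np.random.randn(4, 3)
q = np.random.randn(3)
print(softmax_attention(q, U))   # a 3-dim vector: a weighted blend of the rows of U
```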
In the paper, attention takes a more elaborate form, with queries, keys, and values:
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Origins of attention
- Additive attention was originally described in the 2014 paper by Dzmitry Bahdanau et al. on neural machine translation
Refer to the diagram above from the Bahdanau et al. paper. In it, attention is realised by creating a context vector C that is generated via an alignment model. The model produces weights \(\alpha_{ij}\) that build a weighted sum over the hidden states \(h_j\) of the encoded sentence. This is what provides the “attention” mechanism.
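A rough sketch of that alignment model, assuming a single previous decoder state and using hypothetical matrices `W_s`, `W_h` and vector `v` as stand-ins for the learned alignment parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    # Alignment scores: e_j = v . tanh(W_s s_prev + W_h h_j), one per encoder position.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v
    alpha = softmax(scores)        # the alpha weights over the input positions
    context = alpha @ H            # context vector C: weighted sum of the h_j
    return context, alpha

# Toy shapes: 5 encoder states of size 8, decoder state of size 8, alignment size 16.
T, d_enc, d_dec, d_att = 5, 8, 8, 16
H = np.random.randn(T, d_enc)
s_prev = np.random.randn(d_dec)
W_s = np.random.randn(d_dec, d_att)
W_h = np.random.randn(d_enc, d_att)
v = np.random.randn(d_att)
context, alpha = additive_attention(s_prev, H, W_s, W_h, v)
```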
Queries, Keys, and Values
Today, the attention mechanism is generalised further, with 3 separate query, key, and value vectors that together create the final output. In self-attention (as in an encoder-only model), all 3 of these are generated from the same input vector.
Why create 3 vectors?
Because they help the model learn different representations of the input that contribute to the output.
- Query vectors enable the model to learn the optimal way to query the previous state.
- Key vectors enable the model to expose the input state in a way that optimises similarity matching between query and input.
- Value vectors enable the model to learn the optimal way to output the knowledge about the input that is most useful for generation.
Together, these 3 vectors enable the model to learn the best way to query the input, match against it, and carry forward the most important knowledge that will help produce the right output.
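A minimal sketch of where the three vectors come from in a self-attention setting (toy sizes; the random matrices `W_Q`, `W_K`, `W_V` are stand-ins for learned parameters):

```python
import numpy as np

d_model = 8                         # toy embedding size
X = np.random.randn(6, d_model)     # 6 input tokens, one row per token

# Three separate learned projection matrices (random stand-ins here).
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q   # how each position queries the others
K = X @ W_K   # how each position exposes itself for matching
V = X @ W_V   # what each position contributes to the output if attended to
```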
The following diagram is helpful for understanding how the query, key, and value vectors interact to produce the outputs Y.
- Note that Q, K, and V all come from the original inputs \(X_i\)
- We compute the similarity between Q and K to produce the alignment matrix \(E_{ij}\)
- When \(K_j\) is more relevant to \(Q_i\), \(E_{ij}\) will be higher
- Softmax is used to turn the \(E_{ij}\) into a probability distribution \(A_{ij}\)
- Finally, the values \(V_j\) are summed, weighted by \(A_{ij}\), to produce the outputs \(Y_i\)
In summary, the attention mechanism transforms the input into 3 separate spaces (queries, keys, and values), and gives the model a way to learn to dynamically attend to the most relevant parts of the historical input state.
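Continuing the toy sketch above, the bullets map almost directly onto code (scaling is left out here; the paper's \(\sqrt{d_k}\) scaling comes up next):

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax: each row becomes a probability distribution.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    E = Q @ K.T           # alignment matrix: E[i, j] = similarity of Q_i and K_j
    A = softmax_rows(E)   # A[i, :] is a probability distribution over positions j
    Y = A @ V             # Y_i = sum_j A[i, j] * V_j
    return Y
```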
Coming back to the paper, the authors note that they prefer this multiplicative (dot-product) attention mechanism due to its computational efficiency - even though additive attention had previously been shown to outperform unscaled dot-product attention for larger key dimensions.
Multiplicative (dot-product) attention was introduced by Luong et al.
The authors hypothesize that multiplicative attention had underperformed because, for large \(d_k\), the dot products grow large in magnitude and push the softmax into regions where the gradients are close to 0. So they choose to scale down the logits before passing them to the softmax.
We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values
Mathematically
Visually
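In code, a compact NumPy version of that formula might look like this (single head, no masking, toy shapes assumed):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # scaled logits
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                  # softmax over the keys
    return A @ V                                           # weighted sum of the values
```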
Building Block 2: Multi-Head & Self Attention
Further, the authors propose multi-head attention. This is essentially a way to run the attention process in parallel across multiple heads instead of a single head.
So instead of performing a single attention function with \( d_{model} \)-dimensional queries, keys, and values, they run N attention heads in parallel, each with \( d_{model}/N \) dimensions.
The reason for doing this?
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
Mathematically:
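And as a rough NumPy sketch, reusing the `scaled_dot_product_attention` function from above. Note that the paper uses separate per-head projection matrices \(W_i^Q, W_i^K, W_i^V\); slicing one large projection, as done here, is a simplification:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    # Project the inputs once, split the result into n_heads smaller heads,
    # run scaled dot-product attention in each, then concatenate and project.
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, cols], K[:, cols], V[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_O
```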
Self Attention
In the paper, the overall model is an encoder-decoder model, and self-attention is applied within both the encoder and decoder stacks (the decoder additionally attends to the encoder output via encoder-decoder attention). However, for generative models like GPT, the decoder-only stack with self-attention has become the popular mechanism. Self-attention is essentially where attention is computed over the sequence itself, rather than over the output of a separate encoder.
In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
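A quick usage sketch of what "the same place" means in practice, reusing the hypothetical `multi_head_attention` function from above:

```python
import numpy as np

# Self-attention: Q, K, and V are all projections of the same sequence X
# (here, imagine X is the output of the previous layer).
d_model, n_tokens, n_heads = 8, 6, 2
X = np.random.randn(n_tokens, d_model)
W_Q, W_K, W_V, W_O = (np.random.randn(d_model, d_model) for _ in range(4))

Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)  # sketch from above
print(Y.shape)   # (6, 8): every position attends to every other position
```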
The core of this goes back to the original intention described towards the beginning of the paper.
As a reminder
This inherently sequential nature (of RNNs) precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples
The authors detail this by comparing the complexity of different layer types, and by showing that the path a signal must traverse to connect long-range dependencies is short with attention, but relatively long in other architectures.
Below is the tabular version for comparison
A few key points
Comparison with convolutions
- With convolutions, connecting all pairs of positions requires a stack of \(O(n/k)\) convolutional layers (or \(O(\log_k(n))\) with dilated convolutions), so the path between long-range dependencies grows with depth. Convolutional layers are also generally more expensive than recurrent layers, by a factor of k.
Comparison with recurrence
- The core win here is that recurrent layers require \(O(n)\) sequential operations, which drops to \(O(1)\) with self-attention.
Attention is also more interpretable
- The authors inspect attention distributions from their models and find that individual attention heads learn interpretable behaviours, which makes it easier to reason about the relationships between positions and tokens.
Building Block 3: Positional Encoding
Since the authors got rid of recurrence and convolutions entirely, they need to provide the model with positional information to compensate for this missing and crucial context.
To that effect, they chose to create positional encodings with the same dimension \(d_{model}\) as the token embeddings, so that the two can be summed.
But they chose not to make them learnable parameters - and that makes sense to me.
They create the positional embeddings with the following logic
We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, \(PE_{pos+k}\) can be represented as a linear function of \(PE_{pos}\).
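A minimal sketch of the sinusoidal encoding, following the paper's formulas \(PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})\) and \(PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})\) (toy sizes chosen here):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

# Same dimension as the token embeddings, so the two can simply be added.
pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```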
The d2l.ai book has the best explanation to this that I could find.
If we plot different columns, we can see that one can easily be transformed into the other, via linear transformations.
Even after this though, I don’t think I fully understand this part well. For now, I’ve marked this as a TODO, and will come back to it later.
Final Architecture: Transformers
Paper
Transformers, following Vaswani et al. 2017 Google Attention is all you need
The problem it's solving
This inherently sequential nature (of RNNs) precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples
Intention
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
Architecture
Since we now understand attention, multi-head attention, and positional encodings, these building blocks can be put together to build the final architecture shown in the paper.
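As a rough sketch of how the pieces compose on the encoder side (reusing the `multi_head_attention` sketch from above; dropout, the embedding layers, the decoder's masked self-attention and encoder-decoder attention, and the final linear + softmax are all omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(X, attn_weights, ffn_weights, n_heads):
    # Sub-layer 1: multi-head self-attention, residual connection, layer norm.
    W_Q, W_K, W_V, W_O = attn_weights
    X = layer_norm(X + multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads))
    # Sub-layer 2: position-wise feed-forward, residual connection, layer norm.
    X = layer_norm(X + feed_forward(X, *ffn_weights))
    return X
```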
Conclusion
The paper packs a ton of things into it. It's brilliant, but it probably takes a few iterations to absorb all the content well.
I intend to dive into the code that is available in tensor2tensor and update this post with more understanding and learnings from the code.
In the next post, I intend to cover GPT-1 and GPT-2, and work my way towards GPT-3 and other state-of-the-art model architectures and additions.