A Beginner’s Guide to Understanding Transformer…

Manish Agarwal
20 min read · Sep 27, 2023


Before we dive into the world of Transformers, let’s talk about Recurrent Neural Networks (RNNs). They were like the superheroes of AI when it came to dealing with sequences, like sentences or time-series data. But these heroes had some weaknesses, which made researchers look for a better solution.

1.1. The Reign of RNNs

Recurrent Neural Networks (RNNs) were widely used for sequential data tasks, such as natural language processing, speech recognition, and time series forecasting. Their recurrent nature allowed them to maintain a hidden state that could capture information from previous time steps.

Limitations of RNNs:

  1. Slow Computation for Long Sequences: Imagine you’re reading a really long storybook with hundreds of pages. RNNs are like reading the book one page at a time, which can take a very long time. It’s like trying to finish a marathon by walking slowly.
  2. Vanishing or Exploding Gradients: Imagine you’re trying to learn how to ride a bicycle. With RNNs, sometimes the learning process gets stuck, like you’re stuck in mud and can’t move forward. Other times it goes wild, like you’re riding way too fast and can’t control the bicycle. This makes learning tricky and unpredictable.
  3. Difficulty in Accessing Information from Long Ago: Imagine you’re trying to remember something important from a conversation you had a week ago. RNNs are like trying to recall that conversation while your memory fades quickly. Just as it’s hard for you to access information from long ago, it’s hard for RNNs to remember things from the distant past of a sequence.

1.2. The Birth of Transformers

Because RNNs had their limitations, researchers started looking for a better way to do things. This search eventually gave birth to the transformer architecture, which has become incredibly fascinating and useful in the world of Generative AI.

The Transformer team introduced the self-attention mechanism, akin to a skilled detective focusing on crucial clues in a complex case. This addressed the problem of vanishing gradients and allowed the model to capture long-range dependencies effectively. Transformers also revolutionized parallelization, enabling the model to process an entire sequence simultaneously, like a team of investigators working on different aspects of a case at the same time. Moreover, multi-head attention was introduced, allowing the model to grasp various relationships in the data, just as specialized experts in different fields provide comprehensive insights at a conference.

The Transformer team came up with some smart ideas, but I know it can all seem a bit confusing. Don’t worry! We’re going to break it down step by step, like solving a puzzle. Soon, you’ll understand how it all works, and it won’t seem so complicated.

In this architecture, there are two main parts: the encoder and the decoder. Let’s start by unraveling the encoder part.

2. Encoder Part

Let’s walk through the steps of how a Transformer processes input data, using a simple text-based example.

2.1. Input Embedding

As we know, machines can only learn from numbers, so first we need to convert text into something they can understand: numbers. However, there’s a challenge. A single number per word can’t capture the richness of a word’s meaning. That’s where embeddings (such as those learned by GloVe or BERT) come into play. Each word is mapped to a row of a 2D array of shape [x, 512], where ‘x’ is the number of unique tokens in our vocabulary and 512 is the embedding size, known as ‘d_model’ in the original Transformer paper.

Imagine you have a sentence: “Hi I Am From Bharat”. The first step is to convert each word into an input ID and then into an embedding vector. These vectors represent the words in a high-dimensional space, allowing the model to understand their relationships.

Real-World Analogy: Think of this step as translating each word spoken at the party into a universal language that everyone can understand. This language consists of vectors representing the words’ meanings.
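To make this concrete, here is a minimal Python (PyTorch) sketch of turning the example sentence into input IDs and then embedding vectors. The toy vocabulary and the randomly initialized `nn.Embedding` are illustrative assumptions, not the tokenizer or weights of any trained model.

```python
import torch
import torch.nn as nn

# Toy vocabulary for the example sentence (illustrative only)
vocab = {"hi": 0, "i": 1, "am": 2, "from": 3, "bharat": 4}
d_model = 512  # embedding size used in the original Transformer paper

# Step 1: convert words to input IDs
sentence = "hi i am from bharat"
input_ids = torch.tensor([[vocab[w] for w in sentence.split()]])  # shape: [1, 5]

# Step 2: look up an embedding vector for each ID
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
word_vectors = embedding(input_ids)  # shape: [1, 5, 512]

print(word_vectors.shape)  # torch.Size([1, 5, 512])
```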

2.2. Positional Encoding

Humans understand the meaning of words not only from the words themselves but also from their positions in a sentence. Similarly, Transformers need a way to understand the sequential order of data. Positional encoding provides the model with information about the position of words or tokens in the input sequence. We therefore want the positional encoding to follow a pattern that the model can learn.

Next, we add positional encoding to these word embeddings. As mentioned earlier, this helps the model understand the order of words in the sentence.

Real-World Analogy: Picture each person at the party wearing a badge with a unique position label, like “Person 1,” “Person 2,” and so on. This label tells you where each person is standing.

When it comes to adding positional encoding, researchers explored various methods, but two prominent approaches stand out: Linear Positional Encoding and Trigonometric Positional Encoding. Among these, Trigonometric Positional Encoding emerges as the superior choice for several compelling reasons:

  1. Distinguishable Patterns: Think of trigonometric functions like unique and colorful patterns, such as swirls and waves. They help the model see distinct shapes in the data. On the other hand, linear values are like a simple, straight line that doesn’t provide much detail. Using these simple lines doesn’t give the model enough complexity to understand different positions in a sequence.
  2. Generalization: Trigonometric functions are like flexible tools that can adjust to different situations. Imagine you have a rubber band that can stretch and fit various objects, big or small. Trigonometric functions work similarly; they adapt well to sequences of different lengths. Linear values, however, are rigid and might struggle to capture the precise positions when sequences are different lengths.
  3. Variable Frequency: Think of trigonometric functions as having different speeds or rhythms. Some parts of a song might be fast, while others are slow. Similarly, trigonometric functions introduce this variability into the positional encodings. This helps the model recognize different relationships between words in a sequence. This is crucial when words that are related but far apart in the sentence need to be understood.

In summary, Trigonometric Positional Encoding outshines linear alternatives by creating distinguishable patterns, facilitating generalization across sequences of different lengths, and introducing variable frequencies to capture nuanced positional relationships. These advantages make it the preferred choice for enhancing the Transformer model’s understanding of word order and context.

Let’s compare positional encoding using both linear values and trigonometric functions for the sentence “hi i am from bharat.”

Linear Positional Encoding:

In linear positional encoding, the positions are represented by simple linear values. Let’s assign each word in the sentence a linear position value:

  • “hi” at position 1
  • “i” at position 2
  • “am” at position 3
  • “from” at position 4
  • “bharat” at position 5

Now, let’s create a linear positional encoding vector for each word:

  • “hi” linear encoding: [1, 0, 0, 0, 0]
  • “i” linear encoding: [0, 1, 0, 0, 0]
  • “am” linear encoding: [0, 0, 1, 0, 0]
  • “from” linear encoding: [0, 0, 0, 1, 0]
  • “bharat” linear encoding: [0, 0, 0, 0, 1]

As you can see, the linear encoding simply assigns a value of 1 at the word’s position and 0 everywhere else (essentially a one-hot vector). While this method provides some positional information, it lacks complexity and might not capture intricate positional relationships effectively.

Trigonometric Positional Encoding:

Now, let’s create positional encodings using the trigonometric functions for the same sentence:

  • Define the model’s dimensionality (d_model), let’s say it’s 5.
  • Calculate the positional encoding for each word using sine and cosine functions:

For “hi” at position 1:

  • Dimension 1 (i = 0, sine): sin(1 / 10000^(2 * 0 / 5)) = sin(1) ≈ 0.8415
  • Dimension 2 (i = 0, cosine): cos(1 / 10000^(2 * 0 / 5)) = cos(1) ≈ 0.5403
  • Dimension 3 (i = 1, sine): sin(1 / 10000^(2 * 1 / 5)) = sin(0.0251) ≈ 0.0251
  • Dimension 4 (i = 1, cosine): cos(1 / 10000^(2 * 1 / 5)) = cos(0.0251) ≈ 0.9997
  • Dimension 5 (i = 2, sine): sin(1 / 10000^(2 * 2 / 5)) = sin(0.0006) ≈ 0.0006

So, the trigonometric positional encoding for “hi” is approximately [0.8415, 0.5403, 0.0251, 0.9997, 0.0006]. Repeat this process for each word in the sentence.

As you can see, the trigonometric encoding introduces varying values for each dimension based on the position, creating a more complex and distinguishable pattern compared to linear encoding. This allows the model to capture intricate positional relationships effectively, which is crucial for understanding the order of words in a sequence, especially in longer and more complex sentences.

Here’s the formula used for positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)), where:

  • PE(pos, 2i): Represents the sine function for the position ‘pos’ and the ‘2i’-th dimension of the positional encoding.
  • PE(pos, 2i+1): Represents the cosine function for the same position ‘pos’ and the ‘2i+1’-th dimension of the positional encoding.
  • ‘pos’ is the position of the word in the sequence.
  • ‘i’ indexes the pairs of dimensions of the positional encoding (each pair gets one sine and one cosine).
  • ‘d_model’ is the model’s dimensionality, often set to 512 in Transformer models.
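Here is a minimal Python sketch of this formula, written to match the toy example above (d_model = 5, positions starting at 1). It is an illustration of the sinusoidal scheme, not an excerpt from any particular library.

```python
import numpy as np

def positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = np.zeros((num_positions, d_model))
    for pos in range(1, num_positions + 1):            # positions 1..n, as in the example
        for dim in range(d_model):
            i = dim // 2                               # pair index
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos - 1, dim] = np.sin(angle) if dim % 2 == 0 else np.cos(angle)
    return pe

pe = positional_encoding(num_positions=5, d_model=5)
print(np.round(pe[0], 4))  # encoding for "hi" at position 1
# -> [0.8415 0.5403 0.0251 0.9997 0.0006]
```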

2.3. Attention Mechanism

Before starting with the multi-head mechanism, let’s first discuss the self-attention mechanism.

2.3.1 Self-Attention Mechanism

Let’s try to understand it with the party analogy. Think of self-attention as a cocktail party. You’re given a transcript of all the conversations at the party, and you want to understand what each person is saying. Instead of reading every word from start to finish, you decide to pay more attention to the conversations that are relevant to your interests. In a similar way, a Transformer examines each word in a sentence and figures out how important every other word is to it in the overall context.

Now, let’s explore the mathematical concept behind self-attention. First, we define variables v1, v2, v3, v4, v5 for each word vector obtained after positional encoding (please ignore the Mv, Mq, Mk variables shown in red in the image; we’ll discuss them later). Next, we construct three vectors for every word: a key, a value, and a query vector. Each of these sets corresponds to its own matrix:

  • Key Matrix (K): This matrix helps the model identify relevant words based on the current word. It’s like a guidebook.
  • Value Matrix (V): This matrix stores information about the words in the sentence. It’s like an encyclopedia.
  • Query Matrix (Q): This matrix represents the current word’s “question” to understand its context. It’s like asking a specific question about the sentence.

At this initial stage, the query, key, and value matrices are all identical copies of the input word vectors; the learned weights that differentiate them come in later.

Now, we calculate the attention scores by taking the dot product between the query (Q) and key (K) matrices. For our current word “am”, this tells us how much attention it should pay to every word in the sentence:

(The scores below are illustrative values chosen for understanding purposes.)

Attention Scores (A):

[4] ← attention to “hi”
[4] ← attention to “i”
[6] ← attention to “am” itself (highest)
[3] ← attention to “from”
[4] ← attention to “bharat”

To make the scores more interpretable and stable, we scale them down (dividing by the square root of the vector dimension, √5 ≈ 2.24 in this toy example) and apply the softmax function. This converts the scores into probabilities that sum to 1:

Scaled and Softmaxed Scores:

[0.1643] ← weight on “hi”
[0.1643] ← weight on “i”
[0.4019] ← weight on “am” itself (highest)
[0.1051] ← weight on “from”
[0.1643] ← weight on “bharat”

Now, let’s calculate the final contextualized representation for the word “am” based on these softmax probabilities and the value vectors for each word (the values below are chosen randomly for illustration, and “…” means the vectors continue up to 512 dimensions):

  • Value Vector for “hi”: [1, 0,... 2]
  • Value Vector for “i”: [0, 1,... 2]
  • Value Vector for “am” itself: [2, 0,... 1]
  • Value Vector for “from”: [1, 1,... 0]
  • Value Vector for “bharat”: [0, 2,... 1]

We’ll calculate the weighted sum of these value vectors for “am” using the softmax probabilities:

  • Y1 - Contribution from “hi”: 0.1643 * [1, 0,... 2] = [0.1643, 0,... 0.3286]
  • Y2 - Contribution from “i”: 0.1643 * [0, 1,... 2] = [0, 0.1643,... 0.3286]
  • Y3 - Contribution from “am” itself: 0.4019 * [2, 0,... 1] = [0.8038, 0,... 0.4019]
  • Y4 - Contribution from “from”: 0.1051 * [1, 1,... 0] = [0.1051, 0.1051,... 0]
  • Y5 - Contribution from “bharat”: 0.1643 * [0, 2,... 1] = [0, 0.3286,... 0.1643]

Summing Contributions: Now, we sum up all these contributions to calculate the final contextualized representation for “am”:

Final Contextualized Representation for “am”:

[0.1643 + 0 + 0.8038 + 0.1051 + 0] = [1.0732]
[0 + 0.1643 + 0 + 0.1051 + 0.3286] = [0.5980]

... (and so on, for all 512 columns)

[0.3286 + 0.3286 + 0.4019 + 0 + 0.1643] = [1.2234]

The final contextualized representation for the word “am” is therefore approximately [1.0732, 0.5980, ..., 1.2234]. This vector blends information from every word in the sentence, weighted by how relevant each one is to “am” in the context of “hi i am from bharat.”

It shows that “am” gives the most weight to itself while still incorporating contributions from the other words, allowing the Transformer model to understand the significance of “am” in its context within the sentence.

The process described above is the self-attention mechanism. However, there is a limitation in the explanation so far: we haven’t introduced any trainable weights, so the model has no way to learn which relationships matter. To address this, we introduce learned weight matrices (shown in red in the image as Mv, Mq, and Mk) that transform the inputs into the value, query, and key vectors.
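Putting the pieces together, here is a compact PyTorch sketch of single-head self-attention with learned query, key, and value projections. The weight matrices are randomly initialized stand-ins for the trained Mq, Mk, and Mv mentioned above, purely for illustration.

```python
import math
import torch
import torch.nn as nn

d_model = 512
seq_len = 5  # "hi i am from bharat"

# Word vectors after embedding + positional encoding (random stand-ins)
x = torch.randn(seq_len, d_model)

# Learned projection matrices (the Mq, Mk, Mv of the text)
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)

# Attention scores: dot product of every query with every key, scaled down
scores = Q @ K.T / math.sqrt(d_model)      # shape: [5, 5]
weights = torch.softmax(scores, dim=-1)    # each row sums to 1

# Each output row is a weighted sum of the value vectors
output = weights @ V                       # shape: [5, 512]
print(weights[2])  # how much "am" (position 2) attends to every word
```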

2.3.2 Multi-Head Attention

Now, let’s add a twist to our self-attention analogy. What if you’re not just interested in one topic at the party but multiple topics simultaneously? Multi-head attention allows the Transformer to do just that. It divides the self-attention mechanism into multiple “heads,” each focusing on different aspects of the input data.

Conversion from self to Multi-Head Attention:

Multiple Sets of Parameters:

  • In multi-head attention, we start by creating multiple sets of learnable parameters for key, query, and value matrices. These sets are often referred to as heads.
  • For example, if we want to use four heads, we create four sets of key matrices (K1, K2, K3, K4), four sets of query matrices (Q1, Q2, Q3, Q4), and four sets of value matrices (V1, V2, V3, V4).

Splitting the Attention Space:

  • Next, we split the self-attention mechanism into multiple heads. Each head operates on the same input sequence but with different sets of parameters (key, query, and value matrices).
  • Each head independently computes attention scores and performs the weighted sum of values based on its parameters.

Concatenating and Projection:

  • After computing the outputs for each head, we concatenate them. This means stacking the output vectors of each head along a new dimension.
  • To ensure that the concatenated output retains the same dimensionality as the input, we project it using a linear transformation (a learned weight matrix).

Final Linear Transformation:

  • Finally, the concatenated and projected multi-head outputs are further linearly transformed to produce the final output of the multi-head attention mechanism.
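In code, the steps above look roughly like the following PyTorch sketch, which splits d_model = 512 into 8 heads of 64 dimensions each (8 heads is the choice used in the original paper; other counts work too). It is a minimal illustration, not a production implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One big projection per role; logically this holds num_heads sets of Q/K/V matrices
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # final linear transformation

    def forward(self, x):
        seq_len, d_model = x.shape

        # Project, then split into heads: [seq_len, num_heads, d_head] -> [num_heads, seq_len, d_head]
        def split(t):
            return t.view(seq_len, self.num_heads, self.d_head).transpose(0, 1)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))

        # Each head computes its own attention independently
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ V                                # [num_heads, seq_len, d_head]

        # Concatenate the heads and project back to d_model
        concat = heads.transpose(0, 1).reshape(seq_len, d_model)
        return self.W_o(concat)

mha = MultiHeadAttention()
out = mha(torch.randn(5, 512))   # 5 tokens, e.g. "hi i am from bharat"
print(out.shape)                 # torch.Size([5, 512])
```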

Benefits of Multi-Head Attention:

  • Multi-head attention allows the model to focus on different aspects of the input sequence simultaneously, capturing various relationships and patterns.
  • It enhances the model’s ability to attend to different parts of the sequence effectively, leading to richer representations.
  • It reduces overfitting by learning diverse patterns in parallel.
  • It can learn complex dependencies within the data that may not be captured by a single head.

2.4 Add & Norm

After the multi-head attention phase, the Transformer model has collected valuable information about the words in a sentence, each weighted by its importance. Now, it’s time to process this information effectively using the Add & Norm layer.

The “Add” Step: Addressing Vanishing Gradients

In the “Add” step of the Add & Norm layer, the Transformer addresses the vanishing gradients problem using residual connections. This step involves adding the input of the previous layer to its output. In simpler terms, it’s like preserving the original information while allowing new insights to be added.

Imagine you’re at a party, and you’ve been taking notes on conversations as they happen. The “Add” step is like keeping your original notes intact while revising and adding new information from ongoing conversations. This ensures that valuable insights from earlier discussions are not lost, similar to how the Transformer retains essential information from the previous layer.

The “Norm” Step: Normalizing Overall Information

After the “Add” step, it’s essential to ensure that the combined information aligns well with the overall context. This is where normalization, or “Norm,” comes into play. It ensures that the overall information scales properly and fits cohesively with the larger context.

Using the layer normalization formula, with the mean (μ), the standard deviation (σ), a learned scaling factor (γ), and a learned shift (β), the Transformer adjusts the combined information. It ensures that the information from different sources (words or conversations) harmonizes and doesn’t disrupt the overall flow.
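As a rough sketch, assuming PyTorch’s built-in LayerNorm (which provides the learned γ and β) and a placeholder sublayer standing in for attention or the feedforward block, the Add & Norm step looks like this:

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)             # provides the learned scale (γ) and shift (β)
sublayer = nn.Linear(d_model, d_model)   # stand-in for the attention or feedforward block

x = torch.randn(5, d_model)              # output of the previous layer

# "Add": the residual connection keeps the original information intact
# "Norm": layer normalization rescales the combined result
out = norm(x + sublayer(x))
print(out.shape)  # torch.Size([5, 512])
```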

2.5 FeedForward Layer

Following the “Add & Norm” step, the Transformer progresses to the “Feedforward” layer and another “Add & Norm” layer.

The feedforward layer is where the Transformer processes the information further. This layer applies mathematical transformations to refine and enhance the information.

The Feedforward Layer in the Transformer is like a step where the model takes the information it gathered from conversations (like at a party) and talks about it with friends to understand it better. It’s like having a group discussion to analyze what you’ve learned.

Here’s how it works in simple terms:

  • Input: The Feedforward Layer gets information from the previous step where the model paid attention to words. This information is like what you heard from conversations.
  • Transformation: Inside the Feedforward Layer, the model does some math to make the information better. It’s like having a conversation with friends to figure out what everything means. This math helps the model understand relationships and patterns in the information.
  • Making It Complex: The Feedforward Layer makes the information richer by expanding it into a much larger hidden space, which lets the model see details better. For example, it might use 3072 hidden units.
  • Adding Curves: It also adds non-linearity (curves and twists) to the information, for example with a ReLU activation. This is like adding different flavors to your food to make it tastier. It helps the model see interesting patterns in the data.
  • Output: After all this, the Feedforward Layer gives the model a new version of the information. This new version is like a richer and more detailed story. It helps the model understand things better.

In simple words, the Feedforward Layer helps the Transformer model understand information deeply. It’s like having a deep conversation with friends after listening to different stories at a party. This makes the model really good at understanding language and doing complex tasks.
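To make this concrete, here is a minimal feedforward block in PyTorch, using d_model = 512 and the 3072 hidden units mentioned above (the exact sizes vary between models, so treat them as illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block: expand, apply a non-linearity, project back."""
    def __init__(self, d_model=512, d_hidden=3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # "making it complex": expand to 3072 hidden units
            nn.ReLU(),                     # "adding curves": the non-linearity
            nn.Linear(d_hidden, d_model),  # project back down to d_model
        )

    def forward(self, x):
        return self.net(x)  # applied to every token position independently

ffn = FeedForward()
print(ffn(torch.randn(5, 512)).shape)  # torch.Size([5, 512])
```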

Let’s take another analogy to understand it better. Initially, you assemble a team of 144 detectives (akin to attention blocks), grouped in sets of 12 for each layer of your investigation. However, there’s a problem: without the FFN, all the detectives in each set tend to act the same way and produce similar findings. It’s like having a team of investigators who follow the same procedures and reach nearly identical conclusions.

To tackle the case more effectively, you enhance your detective team by adding a Feedforward Network (FFN) after each detective (attention block). These FFNs serve as special training programs for each detective, enabling them to develop unique skills and insights. This specialization makes each self-attention block behave like a distinct, trainable model within the broader Transformer architecture. By doing so, the Transformer can efficiently handle complex language-understanding tasks, optimizing the contribution of each “detective” to arrive at a well-rounded solution, much like an ensemble of experts collaboratively solving a challenging case.

3. Decoder Part

The decoder’s primary function is to take an encoded representation of the input sequence (often generated by the encoder) and produce an output sequence, one step at a time. This output sequence could be a translation of the input text into another language, a summarized version of the input, or any other text-based task.

Decoder Components: The decoder consists of several key components, each contributing to its overall functionality:

  1. Masked Multi-Head Self-Attention: Similar to the encoder, the decoder utilizes self-attention mechanisms. However, it uses a masked version of self-attention called Masked Multi-Head Self-Attention. This mechanism allows the decoder to focus only on the previous tokens in the output sequence, preventing it from “cheating” by looking ahead.
  2. Multi-Head Cross-Attention: In addition to self-attention, the decoder employs multi-head cross-attention. This mechanism enables the decoder to consider the relevant parts of the input sequence (the encoder’s output) while generating each token in the output sequence. It aligns the decoder’s focus with the most pertinent information in the input.
  3. Positional Encoding: Just like in the encoder, positional encoding is added to the embeddings of tokens in the decoder. This provides the model with information about the position of each token in the output sequence, allowing it to understand the sequential order.
  4. Feedforward Layer: After attending to the relevant parts of the input and considering the position of each token, the decoder employs a feedforward layer. This layer performs mathematical transformations on the token representations, refining and enhancing them.
  5. Add & Norm Layer: Similar to the encoder, the decoder uses an “Add & Norm” layer after each attention and feedforward operation. It combines the information gathered from different components and ensures that it aligns well with the overall context.

We’ve covered most of the components in the encoder part of the Transformer model. Now, let’s focus on the remaining component: Masked Multi-Head Attention.

Masked Multi-Head Attention is a key component of the Transformer decoder, and it plays a crucial role in generating sequences of text while ensuring that the model doesn’t cheat by looking ahead in the sequence. Let’s dive into the details with the example “hi i am from bharat.”

Masked Multi-Head Attention Overview:

In the context of our cocktail party analogy, think of Masked Multi-Head Attention as a participant at the party who is responding to a question while only considering the conversations they’ve heard up to that point. They shouldn’t peek at what others might say in the future.

Here’s how Masked Multi-Head Attention works:

1. Input and Splitting Heads:

  • Input: We start with the word embeddings for the input sequence, which in our example is “hi i am from bharat.”
  • We split these embeddings into multiple “heads.” These heads are parallel attention mechanisms that allow the model to focus on different parts of the input simultaneously.

2. Self-Attention within Each Head:

  • Within each head, the model calculates attention scores for each word in the input sequence based on its interactions with other words in the sequence. It does this for every word in the sequence.
  • For example, when calculating the attention for the word “am,” it considers how relevant each of the words “hi,” “i,” “am,” “from,” and “bharat” is to “am.”

3. Masking the Future:

  • Here’s where the “masking” comes into play. To prevent cheating and ensure that the model only attends to words that come before the current word, we apply a mask to the attention scores.
  • For example, when calculating the attention for the word “am,” the model cannot consider the words that come after “am” in the sequence (i.e., “from” and “bharat”). It can only look at “hi,” “i,” and “am.”

4. Combining Heads:

  • Each head generates its own set of attention-weighted values for each word in the sequence.
  • These individual results from each head are then combined to create a comprehensive representation of the word, taking into account multiple perspectives.

5. Output:

  • The output of Masked Multi-Head Attention is a set of contextually enriched word representations. These representations capture how each word relates to other words in the sequence while respecting the masking that prevents cheating.

Example with “hi i am from bharat”: Let’s consider the word “am” in the sequence. During Masked Multi-Head Attention for “am,” the model calculates attention scores for all the words in the sequence up to “am” (i.e., “hi,” “i,” and “am”). It assigns higher attention scores to the relevant words in this context. Since it’s masked, it doesn’t consider “from” and “bharat.”

This process ensures that the model generates each word in the output sequence while only considering the information available up to that point. It’s like responding to a question at the party based on what has been said so far without knowing the future conversations.
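Here is a small sketch of the masking step itself: a causal (lower-triangular) mask sets the scores for future positions to negative infinity before the softmax, so each word can only attend to itself and the words before it. The query and key tensors below are random stand-ins for illustration.

```python
import math
import torch

seq_len, d_k = 5, 64   # 5 tokens: "hi i am from bharat"
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)   # [5, 5] raw attention scores

# Causal mask: position j is blocked for query i whenever j > i (i.e., j is in the future)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)
print(weights[2])  # row for "am": non-zero only for "hi", "i", and "am"
```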

Inference and training of a Transformer model

Training

Time Step = 1

It all happens in one time step! During training, the decoder is given the entire target sentence at once (shifted right and masked), so the predictions for every position are computed in a single parallel pass rather than word by word.

The encoder outputs, for each word, a vector that captures not only its meaning (the embedding) and its position, but also its interaction with the other words, by means of multi-head attention.

Input Sentence: “hi i am from bharat”

Step 1: Tokenization and Positional Encoding

  • Tokenization converts the input sentence into subword units: [“hi”, “i”, “am”, “from”, “bharat”].
  • Positional encoding is added to each token, indicating its position in the sequence.

Step 2: Encoder Processing

  • The tokenized and position-encoded input sentence is passed through the encoder of the trained Transformer model.
  • The encoder generates contextualized representations for each token in the input sentence.

Step 3: Start Decoding

  • The decoder starts the decoding process with a special token, typically <SOS>, indicating the beginning of the translation.
  • The decoder generates tokens one at a time, conditioning on the previous tokens it has generated.

Inference

Step 4: Token Generation (Time Step 1)

  • Decoder Input: <SOS>
  • The decoder applies Masked Multi-Head Self-Attention over the tokens generated so far, and Multi-Head Cross-Attention over the encoder’s outputs, focusing on the relevant parts of the source sentence (the encoder representations).
  • The decoder generates the first token of the translation. Let’s assume it predicts the token “नमस्ते” (which means “hello” in Hindi).

Step 5: Token Generation (Time Step 2)

  • Decoder Input: <SOS> नमस्ते
  • The decoder again uses masked self-attention over the generated “नमस्ते” and cross-attention over the encoder’s outputs.
  • The decoder generates the next token based on the attended information. Let’s assume it predicts “मैं” (which means “I” in Hindi).

Step 6: Token Generation (Time Step 3)

  • Decoder Input: <SOS> नमस्ते मैं
  • Similar to previous steps, the decoder attends to the encoder’s outputs and the generated tokens “नमस्ते” and “मैं.”
  • The decoder generates the next token. Let’s assume it predicts “भारत” (which means “India” in Hindi).

Step 7: Token Generation (Time Step 4)

  • Decoder Input: <SOS> नमस्ते मैं भारत
  • The decoder continues generating tokens by attending to the encoder’s outputs and the tokens generated so far.
  • It predicts the next token. Let’s assume it predicts “से” (which means “from” in Hindi).

Step 8: Token Generation (Time Step 5)

  • Decoder Input: <SOS> नमस्ते मैं भारत से
  • The decoder attends to the encoder’s outputs and the tokens generated up to this point.
  • It generates the next token. Let’s assume it predicts “हूँ” (which means “am” in Hindi).

Step 9: Token Generation (Time Step 6)

  • Decoder Input: <SOS> नमस्ते मैं भारत से हूँ
  • The decoder continues generating tokens, attending to the encoder’s outputs and the tokens generated so far.
  • It predicts the next token. Let’s assume it predicts the special token <EOS>, indicating the end of the translation.

Step 10: Translation Output

  • The generated sequence of tokens, excluding the <SOS> and <EOS> tokens, represents the translation: "नमस्ते, मैं भारत से हूँ."

Step 11: Post-processing

  • Post-processing may involve detokenization and handling special tokens to obtain the final translation.

Inference strategy

At every step above, we selected the word with the maximum softmax value. This strategy is called greedy decoding, and it usually does not perform very well. A better strategy is to keep, at each step, the top B most probable words, evaluate all the possible next words for each of them, and retain the top B most probable sequences. This is the beam search strategy, and it generally performs better.
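Here is a minimal sketch of the greedy decoding loop described above, using PyTorch’s nn.Transformer purely as an untrained stand-in; the vocabulary size, sos_id, and eos_id are hypothetical placeholders, and a real system would use a trained model and tokenizer. Beam search would replace the single argmax with tracking the top B partial sequences at each step.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: in practice these come from a trained model and its tokenizer
vocab_size, sos_id, eos_id = 1000, 1, 2
model = nn.Transformer(d_model=512, batch_first=True)   # untrained, for shape illustration only
embed = nn.Embedding(vocab_size, 512)
to_logits = nn.Linear(512, vocab_size)

src = embed(torch.randint(0, vocab_size, (1, 5)))        # source tokens: "hi i am from bharat"

# Greedy decoding: at each step, append the single most probable next token
generated = [sos_id]
for _ in range(20):                                      # cap the output length
    tgt = embed(torch.tensor([generated]))
    tgt_mask = model.generate_square_subsequent_mask(len(generated))
    out = model(src, tgt, tgt_mask=tgt_mask)             # [1, len(generated), 512]
    next_id = to_logits(out[:, -1]).argmax(dim=-1).item()  # greedy choice
    generated.append(next_id)
    if next_id == eos_id:                                # stop at the end-of-sequence token
        break

print(generated)  # token IDs of the generated translation
```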

References & Resources

  1. Attention is All You Need (Paper)
  2. The math behind Attention: Keys, Queries, and Values matrices — YouTube Video
