In this post, we will explore the Decoder-Only Transformer, the foundation of ChatGPT, through a simple code example. For the code, I referred to Josh Starmer’s video, Coding a ChatGPT Like Transformer From Scratch in PyTorch. I highly recommend watching the video if you’re unfamiliar with the concept of Decoder-Only Transformer. At the end of the video, Josh outlines the key differences between a standard Transformer and a Decoder-Only Transformer.

A Decoder-Only Transformer has a single unit responsible for both encoding the input and generating the output. In contrast, a standard Transformer uses two units: an Encoder to process the input and a Decoder to generate the output.
A standard Transformer uses two types of attention during inference: Self-Attention and Encoder-Decoder Attention. During training, it uses Masked Self-Attention but only on the output. On the other hand, a Decoder-Only Transformer utilizes only one type of attention, Masked Self-Attention.

Now that we understand the differences between the Decoder-Only Transformer and a standard Transformer, let’s get started by importing the Python modules that we will use in this post.

Import Python Modules

import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

from torch.utils.data import TensorDataset, DataLoader

Creat Training Data

We want to build a model which can respond correctly to two different prompts.

“How is living in Amsterdam?”
“Living in Amsterdam is how?”

The answer to both prompts will be “Exciting”.

Our vocabulary for this model will consist of 7 words (or tokens): “how”, “is”, “living”, “in”, “amsterdam”, “exciting”, and “<EOS>” (End of Sentence).

# map the tokens to numbers for word embedding
# nn.Embedding only accepts numbers as input
token_to_id = {'how': 0,
               'is': 1,
               'living': 2,
               'in': 3,
               'amsterdam': 4,
               'exciting': 5,
               '<EOS>': 6}

# from numbers back to the original tokens
id_to_token = dict(map(reversed, token_to_id.items()))

# the tokens for input during training come from the promopts as well as from generating the output
inputs = torch.tensor([[token_to_id['how'],
                        token_to_id['is'],
                        token_to_id['living'],
                        token_to_id['in'],
                        token_to_id['amsterdam'],
                        token_to_id['<EOS>'],
                        token_to_id['exciting']],

                      [token_to_id['living'],
                       token_to_id['in'],
                       token_to_id['amsterdam'],
                       token_to_id['is'],
                       token_to_id['how'],
                       token_to_id['<EOS>'],
                       token_to_id['exciting']]])

labels = torch.tensor([[token_to_id['is'],
                        token_to_id['living'],
                        token_to_id['in'],
                        token_to_id['amsterdam'],
                        token_to_id['<EOS>'],
                        token_to_id['exciting'],
                        token_to_id['<EOS>']],

                       [token_to_id['in'],
                        token_to_id['amsterdam'],
                        token_to_id['is'],
                        token_to_id['how'],
                        token_to_id['<EOS>'],
                        token_to_id['exciting'],
                        token_to_id['<EOS>']]])

dataset = TensorDataset(inputs, labels)
dataloader = DataLoader(dataset)

# let's look at the first input and label data
next(iter(dataloader))

[tensor([[0, 1, 2, 3, 4, 6, 5]]), tensor([[1, 2, 3, 4, 6, 5, 6]])]

Positional Encoding

Since nn.Embedding will handle creating word embeddings for us, the next step is positional encoding.

Unlike RNNs, where tokens are processed sequentially, in Transformers, we process all tokens simultaneously. Therefore, we need to add a value that provides information about the order of the tokens in the sequence. For this purpose, we use alternating sine and cosine functions.

The formulas for positional encoding are:

\[PE_{pos, 2i}=sin(pos/10000^{2i/d\_model})\] \[PE_{pos, 2i+1}=cos(pos/10000^{2i/d\_model})\]

Where:

$pos$ is the position of the token in the input.
$d_model$ is the dimensionality of the word embedding
$i$ indicates the position within the embedding dimension. Both $pos$ and $i$ start at 0, and $i$ increments by 1 for each successive sine and cosine pair.

If we assume each token has 4-dimensional word embedding ($d_model=4$), each input token will also have 4 corresponding position encoding values. For example, the positional encoding values for the first token, “how”, are

\[PE_{0, 2\times0}=sin(0/10000^{2\times0/4})=sin(0)=0\] \[PE_{0, 2\times0+1}=cos(0/10000^{2\times0/4})=cos(0)=1\] \[PE_{0, 2\times1}=sin(0/10000^{2\times1/4})=sin(0)=0\] \[PE_{0, 2\times1+1}=cos(0/10000^{2\times0/4})=cos(0)=1\]

class PositionEncoding(nn.Module):

  def __init__(self, d_model=4, max_len=20):
    # max_len is the maximum number of tokens our Transformer can process (input + output)
    super().__init__()
    
    # create a zero-filled matrix of position encoding values
    pe = torch.zeros(max_len, d_model)
    
    # column vector that represents the positions
    position = torch.arange(start=0, end=max_len, step=1).float().unsqueeze(1)
    
    # row vector that represents 2*i for each word embedding
    embedding_index = torch.arange(start=0, end=d_model, step=2).float()
    
    # pos/10000**(2*i/d_model)
    div_term = 1/torch.tensor(10000)**(embedding_index/d_model)
    
    # we replace the even columns with values from the sine function
    pe[:, 0::2] = torch.sin(position * div_term)

    # we replace the odd columns with values from the cosine function
    pe[:, 1::2] = torch.cos(position * div_term)

    # to ensure the position encoding values get moved to a GPU if we use one
    self.register_buffer('pe', pe)

  def forward(self, word_embeddings):
    # take word embedding values and add the position encoding values element-wisely
    return word_embeddings + self.pe[:word_embeddings.size(0), :]

Masked Self-Attention

As discussed in the previous post, Masked Self-Attention works by comparing each word to itself and all preceding words in the sentence.

For example, in our first prompt, “how is living in Amsterdam,” the Masked Self-Attention values for the first token, “how,” will only reflect its similarity to itself. In contrast, the Masked Self-Attention values for the third token, “living,” will capture its similarity to itself and the preceding tokens, “how” and “is.”

The Masked Self-Attention mechanism is mathematically defined as:

$@Attention(Q,K,V)=SoftMax(\frac{QK^T}{\sqrt{d_k}}+M)V$$

In a standard Transformer, this mechanism is used only in the decoder during training. However, in a Decoder-Only Transformer, masked self-attention is applied continuously, both during training and inference, incorporating both input and output.

class Attention(nn.Module):

  def __init__(self, d_model=4):
    super().__init__()

    self.d_model = d_model

    # the weight matrices to compute Query, Key, and Value have d_model rows and d_model columns
    # we use nn.Linear as we will dot product -> encoded tokens * weights
    self.W_q = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
    self.W_k = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
    self.W_v = nn.Linear(in_features=d_model, out_features=d_model, bias=False)

    # to keep track of which indices are for row and columns
    self.row_dim = 0
    self.col_dim = 1

  def forward(self, encodings_for_q, encodings_for_k, encodings_for_v, mask=None):
    
    # compute Query, Key, and Values
    q = self.W_q(encodings_for_q)
    k = self.W_k(encodings_for_k)
    v = self.W_v(encodings_for_v)

    # matrix multiplication between q and the transpose of k -> similarities
    sims = torch.matmul(q, k.transpose(dim0=self.row_dim, dim1=self.col_dim))

    # scale the similarities by the square root of the dimension of k
    scaled_sims = sims / torch.tensor(k.size(self.col_dim)**0.5)

    # incase we want to add the mask to prevent early tokens from looking at later tokens
    if mask is not None:
      # add the mask to the scaled similarities
      # mask has negative infinity values for the token to be ignored and 0 for the others
      scaled_sims = scaled_sims.masked_fill(mask=mask, value=-1e9)

    # softmax function to determine the percentage of influence that each token should have on the others
    attention_percents = F.softmax(scaled_sims, dim=self.col_dim)

    # multiply the attention weights by V
    attention_scores = torch.matmul(attention_percents, v)

    return attention_scores

The Decoder-Only Transformer

We can now create a class for the Decoder-Only Transformer, building on the classes we’ve already implemented and adding the necessary steps for completing the model.

class DecoderOnlyTransformer(nn.Module):

  def __init__(self, num_tokens=7, d_model=4, max_len=20):
    # num_tokens is the number of tokens inthe vocabulary
    super().__init__()

    # create an embedding object
    self.we = nn.Embedding(num_embeddings=num_tokens,
                           embedding_dim=d_model)

    # position encoding object
    self.pe = PositionEncoding(d_model=d_model,
                               max_len=max_len)

    # attention object
    self.self_attention = Attention(d_model=d_model)

    # create a fully connected layer
    self.fc_layer = nn.Linear(in_features=d_model, out_features=num_tokens)

    # self.loss = nn.CrossEntropyLoss()

  def forward(self, token_ids):
    # token_ids: an array of token id numbers for inputs

    # 1. Word embedding
    word_embeddings = self.we(token_ids)

    # 2. Position encoding
    position_encoded = self.pe(word_embeddings)

    # 3. Create the mask
    # torch.tril() leaves the values in the lower triangle as they are (1) and turns everything else into 0s.
    mask = torch.tril(torch.ones((token_ids.size(dim=0), token_ids.size(dim=0))))
    # Trues for 0s
    mask = mask == 0

    # # 4. Masked self attention
    self_attention_values = self.self_attention(position_encoded,
                                                position_encoded,
                                                position_encoded,
                                                mask=mask)

    # 5. Residual connection
    residual_connection_values = position_encoded + self_attention_values

    # 6. Fully connected layer
    fc_layer_output = self.fc_layer(residual_connection_values)

    return fc_layer_output

Training

Now that we have defined our Transformer model, the next step is to train it by optimizing the model parameters.

# create a model from DecoderOnlyTransformer()
model = DecoderOnlyTransformer(num_tokens=len(token_to_id), d_model=4, max_len=20)

# define loss and optimizer
# nn.CrossEntropyLoss() applies softmax function for us
loss_fn = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.05)

# number of epochs
epochs = 100

for epoch in range(epochs):
  # put the model in training mode
  model.train()

  # we have two prompts
  for i, data in enumerate(dataloader):
    inputs, labels = data

    # forward pass
    outputs = model(inputs[0])

    # calculate the loss
    loss = loss_fn(outputs, labels[0])

    # make gradients zero
    optimizer.zero_grad()

    # backpropagation
    loss.backward()
    
    # update the parameters
    optimizer.step()

  if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f}")

Epoch: 0 | Loss: 2.28202
Epoch: 10 | Loss: 0.66447
Epoch: 20 | Loss: 0.06849
Epoch: 30 | Loss: 0.00769
Epoch: 40 | Loss: 0.00344
Epoch: 50 | Loss: 0.00221
Epoch: 60 | Loss: 0.00162
Epoch: 70 | Loss: 0.00126
Epoch: 80 | Loss: 0.00101
Epoch: 90 | Loss: 0.00083

Make Predictions

Finally, we will use the model to predict the response to the prompts we created.

model_input = torch.tensor([token_to_id["how"],
                            token_to_id["is"],
                            token_to_id["living"],
                            token_to_id["in"],
                            token_to_id["amsterdam"],
                            token_to_id["<EOS>"]])

input_length = model_input.size(dim=0)

# the model generates a prediction for each input token
predictions = model(model_input)

# we're only interested in the prediction of <EOS> token (last prediction)
predicted_id = torch.tensor([torch.argmax(predictions[-1, :])])
predicted_ids = predicted_id

# we create a loop to generate output tokens
# until we reach the maximum number of tokens that our model can generate
max_length = 20
for i in range(input_length, max_length):
  # or the model generates the <EOS> token
  if (predicted_id == token_to_id["<EOS>"]): # if the prediction is <EOS>, then we are done
    break

  # each time we generate a new output token, we add it to the input
  model_input = torch.cat((model_input, predicted_id))

  predictions = model(model_input)
  predicted_id = torch.tensor([torch.argmax(predictions[-1,:])])
  predicted_ids = torch.cat((predicted_ids, predicted_id))

print("Predicted Tokens:\n")
for id in predicted_ids:
    print("\t", id_to_token[id.item()])

Predicted Tokens:

	 exciting
	 <EOS>

model_input = torch.tensor([token_to_id["living"],
                            token_to_id["in"],
                            token_to_id["amsterdam"],
                            token_to_id["is"],
                            token_to_id["how"],
                            token_to_id["<EOS>"]])

input_length = model_input.size(dim=0)

# the model generates a prediction for each input token
predictions = model(model_input)

# we're only interested in the prediction of <EOS> token (last prediction)
predicted_id = torch.tensor([torch.argmax(predictions[-1, :])])
predicted_ids = predicted_id

# we create a loop to generate output tokens
# until we reach the maximum number of tokens that our model can generate
max_length = 20
for i in range(input_length, max_length):
  # or the model generates the <EOS> token
  if (predicted_id == token_to_id["<EOS>"]): # if the prediction is <EOS>, then we are done
    break

  # each time we generate a new output token, we add it to the input
  model_input = torch.cat((model_input, predicted_id))

  predictions = model(model_input)
  predicted_id = torch.tensor([torch.argmax(predictions[-1,:])])
  predicted_ids = torch.cat((predicted_ids, predicted_id))

print("Predicted Tokens:\n")
for id in predicted_ids:
    print("\t", id_to_token[id.item()])

Predicted Tokens:

	 exciting
	 <EOS>

We observe that the model functions precisely as intended.

Building a Decoder-Only Transformer Model

Understanding the Foundation of ChatGPT Through Practical Coding