Transformer Architecture
This project implements a Transformer block from scratch, using PyTorch for tensor operations and certain pre-built layers. The block consists of the following components:
[Diagram of the Transformer block]

- Layer Normalization: A from-scratch implementation of `LayerNorm` that uses `nn.Parameter` for trainable scale and shift parameters (see the sketch after this list).
- Masked Multi-Head Attention: The `MultiHeadAttention` logic is implemented from scratch. It uses PyTorch's `nn.Linear` for the weight matrices and `nn.Dropout` (see the sketch after this list).
- Dropout: Applied via PyTorch's `nn.Dropout` for regularization within the attention mechanism and on the residual connections.
- Shortcut Connections (Residual Connections): Implemented by adding the input of each sublayer to its output.
- Feed Forward Network: A `FeedForward` network that uses PyTorch's `nn.Linear` layers and a custom-implemented `GELU` activation function (see the sketch after this list).
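
As a concrete illustration, here is a minimal sketch of such a layer normalization module. The constructor argument `emb_dim` and the `eps` value are assumptions; only the use of `nn.Parameter` for the trainable scale and shift comes from the description above.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer normalization with trainable scale and shift (sketch)."""
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps  # assumed epsilon for numerical stability
        self.scale = nn.Parameter(torch.ones(emb_dim))   # trainable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # trainable bias

    def forward(self, x):
        # Normalize each token vector over the embedding dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
```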
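
The masked multi-head attention could be sketched as follows. The parameter names (`d_in`, `d_out`, `context_length`, `num_heads`, `qkv_bias`) and the causal-mask mechanics are illustrative assumptions; the use of `nn.Linear` for the weight matrices and `nn.Dropout` follows the description above.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Causal (masked) multi-head self-attention (sketch)."""
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask blocks attention to future positions
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project and split into heads: (b, num_heads, num_tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention with the causal mask applied
        attn_scores = q @ k.transpose(2, 3)
        mask = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask, float("-inf"))
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Merge heads back and project to the output dimension
        context = (attn_weights @ v).transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(context)
```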
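
And a sketch of the feed-forward network with a custom `GELU`, assuming the common tanh approximation of GELU and a 4x expansion of the embedding dimension; the config key `emb_dim` is an assumption.

```python
import math
import torch
import torch.nn as nn

class GELU(nn.Module):
    """Tanh approximation of the GELU activation (sketch)."""
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    """Position-wise feed-forward network: expand to 4x the embedding
    dimension, apply GELU, project back (sketch)."""
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
```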
Implementation Details
- The configuration for the Transformer block is defined in the `GPT_CONFIG_124M` dictionary, which includes parameters such as the embedding dimension, the number of attention heads, and the dropout rate (a hypothetical example follows this list).
- The `TransformerBlock` class combines all of the components into a single block (see the sketch after this list).
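
For reference, here is a hypothetical `GPT_CONFIG_124M` consistent with the usage example below (embedding dimension 768). The key names and the remaining values are assumptions and may differ from the notebook.

```python
# Hypothetical configuration; key names and most values are assumed.
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # vocabulary size (assumed, GPT-2 tokenizer)
    "context_length": 1024,  # maximum number of input tokens (assumed)
    "emb_dim": 768,          # embedding dimension (matches the usage example)
    "n_heads": 12,           # number of attention heads (assumed)
    "n_layers": 12,          # number of Transformer blocks (assumed)
    "drop_rate": 0.1,        # dropout rate (assumed)
    "qkv_bias": False,       # bias in the query/key/value projections (assumed)
}
```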
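
The block itself might then look like the sketch below, which wires together the `LayerNorm`, `MultiHeadAttention`, and `FeedForward` sketches from above. The pre-LayerNorm ordering (normalize, transform, dropout, then add the shortcut) is an assumption about the layout, not a statement of the notebook's exact code.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Transformer block combining attention, feed-forward, layer norm,
    dropout, and shortcut connections (sketch)."""
    def __init__(self, cfg):
        super().__init__()
        # Uses the LayerNorm, MultiHeadAttention, and FeedForward sketches above
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"],
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Attention sublayer with a shortcut (residual) connection
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        # Feed-forward sublayer with a shortcut connection
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x
```

Because each sublayer only adds a correction on top of its input, gradients can flow through the shortcut path unchanged, which is what keeps deep stacks of these blocks trainable.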
Usage
An example is provided in the notebook to demonstrate how to use the `TransformerBlock`:
```python
# Example usage of the TransformerBlock
import torch

torch.manual_seed(123)

x = torch.rand(2, 4, 768)                  # (batch_size, num_tokens, emb_dim)
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)                          # same shape as the input

print("Input shape:", x.shape)
print("Output shape:", output.shape)
```
Written on July 11, 2025