Transformer Architecture
This project implements a Transformer block from scratch, using PyTorch for tensor operations and certain pre-built layers. The block consists of the following components:
- Layer Normalization: A from-scratch implementation of `LayerNorm`. It uses `nn.Parameter` for trainable scale and shift parameters (see the sketch after this list).
- Masked Multi-Head Attention: The `MultiHeadAttention` logic is implemented from scratch. It utilizes PyTorch’s `nn.Linear` for the weight matrices and `nn.Dropout` (also sketched below).
- Dropout: Applied via PyTorch’s `nn.Dropout` for regularization within the attention mechanism and on the residual connections.
- Shortcut Connections (Residual Connections): Implemented by adding the input of a sublayer to its output.
- Feed Forward Network: A `FeedForward` network that uses PyTorch’s `nn.Linear` layers and a custom-implemented `GELU` activation function.
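The pieces above might look roughly like the following. This is a minimal sketch rather than the notebook’s exact code: the constructor signatures, the tanh approximation used for `GELU`, the 4x expansion inside `FeedForward`, and the `cfg["emb_dim"]` key name are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Layer normalization with trainable scale and shift (nn.Parameter)."""
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


class GELU(nn.Module):
    """Tanh approximation of the GELU activation."""
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi))
            * (x + 0.044715 * x**3)
        ))


class FeedForward(nn.Module):
    """Two nn.Linear layers with a GELU in between, expanding 4x internally."""
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)


class MultiHeadAttention(nn.Module):
    """Masked (causal) multi-head self-attention built from nn.Linear projections."""
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # Causal mask: 1s above the diagonal mark future positions to hide.
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project inputs and split into heads: (b, num_heads, num_tokens, head_dim).
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention with the causal mask applied before softmax.
        attn_scores = q @ k.transpose(2, 3) / self.head_dim**0.5
        mask = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores = attn_scores.masked_fill(mask, float("-inf"))
        attn_weights = self.dropout(torch.softmax(attn_scores, dim=-1))

        # Merge the heads back together and apply the output projection.
        context = (attn_weights @ v).transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(context)
```

Registering the causal mask as a buffer keeps it on the same device as the module’s parameters without treating it as a trainable weight.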
Implementation Details
- The configuration for the Transformer block is defined in the `GPT_CONFIG_124M` dictionary. This includes parameters such as the embedding dimension, number of attention heads, and dropout rate.
- The `TransformerBlock` class combines all the components into a single block (an illustrative version is sketched below).
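Here is a sketch of how the configuration and the block itself might fit together, reusing the classes sketched above. The key names and values (e.g. `context_length`, `qkv_bias`, and the GPT-2 124M-style numbers) and the pre-LayerNorm ordering are assumptions; only the 768-dimensional embedding is confirmed by the usage example below.

```python
import torch.nn as nn  # (already imported in the sketch above)

# Illustrative values; the notebook's actual key names and numbers may differ.
GPT_CONFIG_124M = {
    "emb_dim": 768,          # embedding dimension (matches the 768 used in the usage example)
    "n_heads": 12,           # number of attention heads
    "context_length": 1024,  # maximum sequence length for the causal mask
    "drop_rate": 0.1,        # dropout rate
    "qkv_bias": False,       # whether the query/key/value projections use a bias
}


class TransformerBlock(nn.Module):
    """Transformer block: attention and feed-forward sublayers, each preceded by
    LayerNorm and wrapped with dropout plus a shortcut (residual) connection."""
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            dropout=cfg["drop_rate"],
            num_heads=cfg["n_heads"],
            qkv_bias=cfg["qkv_bias"],
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Attention sublayer with a shortcut connection.
        shortcut = x
        x = self.drop_shortcut(self.att(self.norm1(x)))
        x = x + shortcut

        # Feed-forward sublayer with a shortcut connection.
        shortcut = x
        x = self.drop_shortcut(self.ff(self.norm2(x)))
        return x + shortcut
```

Adding the sublayer input back to its (dropout-regularized) output is exactly the shortcut connection described in the component list above.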
Usage
An example is provided in the notebook to demonstrate how to use the `TransformerBlock`.
```python
# Example: pass a random batch through the TransformerBlock
import torch

torch.manual_seed(123)                      # for reproducibility
x = torch.rand(2, 4, 768)                   # (batch_size, num_tokens, emb_dim)
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)
```
Written on July 11, 2025