Transformer Models From Scratch in Python
AI, But Simple Issue #52

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Running this newsletter at no cost is no longer sustainable, so we're relying on other measures to cover operational expenses. Thanks again for reading!
Transformer neural networks (Vaswani et al. 2017) are a type of neural network specialized for modern natural language processing (NLP) tasks, such as translation, text generation, and chatbots.
They consist of an encoder and a decoder that together work as an all-around NLP solution. Architecturally, the main innovation behind the transformer is the self-attention mechanism, an adaptation of the attention mechanism originally created for machine translation with RNNs.
The attention mechanism allows transformers to weigh the importance of different tokens (words or symbols) in an input sequence when making predictions.
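To make this concrete, here is a minimal sketch of scaled dot-product attention in PyTorch; the function name and tensor shapes are illustrative, and we build the real thing as part of the encoder later on:
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v have shape (batch, seq_len, d_k)
    d_k = q.size(-1)
    # compare every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # softmax turns each row of scores into weights that sum to 1
    weights = F.softmax(scores, dim=-1)
    # each output position is a weighted sum of the value vectors
    return weights @ v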
In this issue, we’ll look at creating a transformer encoder from scratch using Python and PyTorch. We’ll dive into the full code and the conceptual basis behind the transformer, the architecture of choice for the most popular large language models (such as GPT or BERT).

This week, we have our third tutorial with code. Feel free to follow along, play around with the code, and learn about transformers. If you want to review some information on how transformers work conceptually, check out the issue below:
Introduction to Transformers and the Attention Mechanism (Intermediate)
Note that basic Python knowledge, along with some knowledge of libraries such as PyTorch, is expected, so if you’re still new to them, we suggest going through some beginner tutorials to learn the syntax.
The code in this issue can be executed on a normal home PC and does not require a large amount of VRAM. It can also be executed on hosted notebooks such as Google Colab.
The first thing we’ll do to build this architecture is to inspect how it works, that is, how the data flows through the structure. Inputs are embedded and positionally encoded before being passed into the encoder.
The encoder itself is composed of two main components: a self-attention block and a feedforward layer.
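As a rough preview of what those two components look like in code, here is a minimal sketch of one encoder block in PyTorch; the class name and the sizes (d_model, n_heads, d_ff) are illustrative, and nn.MultiheadAttention stands in for the self-attention we build ourselves in this tutorial:
import torch
from torch import nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # self-attention sub-layer with a residual connection and layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # position-wise feedforward sub-layer with a residual connection and layer norm
        return self.norm2(x + self.ff(x))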

Let’s say our input sentence is a simple one, such as “Deep learning is simple.”
The first thing we need to do is to start with all of our imports. We’ll import every library at once so that everything is ready for later, rather than importing piece by piece as we go:
from transformers import AutoTokenizer
import torch
from torch import nn
import torch.nn.functional as F
import math
We need to process the data first. Let’s tokenize the sentence, then embed the tokens:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'Deep learning is simple.'
# tokenize the sentence into integer IDs and return PyTorch tensors
input = tokenizer(text, add_special_tokens=False, return_tensors='pt')
print(input)
The above code should produce the following output:
{'input_ids': tensor([[2784, 4083, 2003, 3722, 1012]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
In the above code, we used the pretrained tokenizer from the BERT model (not the model itself) to tokenize our sentence. In the output, the input_ids tensor shows that the tokenizer has mapped each word of “Deep learning is simple.” (and the final period) to an integer token ID.
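To see what those IDs actually stand for, we can map them back to strings in the tokenizer’s vocabulary (the decoded tokens below assume the bert-base-uncased vocabulary):
print(tokenizer.convert_ids_to_tokens(input['input_ids'][0].tolist()))
# expected: ['deep', 'learning', 'is', 'simple', '.']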