This repository contains an implementation of the Transformer architecture from the paper "Attention Is All You Need". This README serves as a basic introduction to the takeaways, results, and training steps, for better documentation of the project.
The project was mainly inspired by
- the motivation to implement the original Transformer architecture from scratch
- the fact that generating baby Shakespeare text is interesting :) and it gives the model a simple task to run on

Lastly, thanks to Andrej's project for providing the idea of generating Shakespeare text, some insights on data shaping, and a nice introduction to the Transformer.
- The objective function can differ across models. Define the input and compute the loss carefully.
- Evaluate different datasets before training. Test whether the produced model makes sense by training on a partial dataset first.
- The data pair (input, previous output) in autoregressive models adds another layer of complexity to both
  - the dataset design, since we probably need to maintain a high-quality "Q&A"-like dataset, and
  - the training process, as developers need to think about how to feed the previous output to the model without giving it the answer (see the sketch after this list).
- It is worth building a data pipeline from the beginning, instead of thinking about each data-processing step on demand.
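One common way to feed the previous output without giving the model the answer is teacher forcing: pair each input window with the same window shifted right by one token and rely on a causal mask at training time. The sketch below only illustrates that idea for a plain next-token setup; the helper name, the `block_size` of 128, and the slicing scheme are assumptions, and the repo's own pairing of (input, previous output) may differ.

```python
import numpy as np

def build_pairs(token_ids, block_size=128):
    """Pair each input window with the same window shifted right by one
    token (teacher forcing); a causal mask at training time keeps the
    model from peeking at the token it must predict.
    Illustrative sketch only -- not this repo's actual data code."""
    pairs = []
    for start in range(len(token_ids) - block_size - 1):
        x = token_ids[start : start + block_size]           # what the model sees
        y = token_ids[start + 1 : start + block_size + 1]   # what it should predict
        pairs.append((x, y))
    return np.array(pairs)  # shape (num_pairs, 2, block_size)
```

The resulting `(num_pairs, 2, block_size)` shape is consistent with the `(2041475, 2, 128)` arrays printed in the training logs below, but treat the repo's data pipeline as authoritative.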
A sample of the generated text:
QUEEN whom Gentleman
His of
All's.
KING my V. short a the at. for. most and. DIANA
The cornets. day.
****
of the the on. and An were.
BRANDON
Read Exeunt ****
Of the hither. draw the the the
requireth's.
Neighbour's : ****
Another repose, But the ****
We have the following key parameters:
- Token type number (vocabulary size): 27743
- Dataset size: 2045795 (the corpus contains 36 of Shakespeare's plays)
- Model size: ~16M parameters
- Data factor: 10%. Since the dataset is large, only a subset is used for training: the first 10% of all Shakespeare play tokens. This fraction is called the data factor (see the sketch below).
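As a rough illustration (not the repo's actual code), the data factor can be thought of as a simple prefix slice of the tokenized corpus; the function name below is hypothetical.

```python
def apply_data_factor(token_ids, data_factor=0.1):
    """Keep only the first `data_factor` fraction of the tokenized corpus.
    Illustrative only; the repo applies the -f flag inside its own pipeline."""
    cutoff = int(len(token_ids) * data_factor)
    return token_ids[:cutoff]
```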
- Clone this repo with `git clone --recursive https://github.com/xiaoxi-s/transformer-with-shakespeare.git`
- Run `conda env create -f requirements.yml`, which will create an environment called `transformer`
- Activate the environment with `conda activate transformer`
- Install wandb manually with `conda install -c conda-forge wandb`, as wandb is not readily available in the default conda channels

Note: if you did not pass the `--recursive` flag to the clone command, run `git submodule update --init --recursive` to download the submodule explicitly.
- Export your Weights & Biases API key as an environment variable with `export WANDB_API_KEY=<api key goes here>`
- Run `python main.py -e <epoch number> -f <data factor>` to start training
Hyperparameters are specified in `hyperparams.py` (a rough illustration follows below). If you want to disable wandb, pass the `-q` flag to the `python main.py ...` command.
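For orientation only, a `hyperparams.py` for a model of this size might look roughly like the sketch below. Every name and value here is an assumption except the sequence length of 128 and the vocabulary size of 27743 visible in the training logs, so treat `hyperparams.py` itself as the source of truth.

```python
# Illustrative sketch only -- the actual names and values live in hyperparams.py.
BLOCK_SIZE = 128        # sequence length; matches the (N, 2, 128) data shape in the logs
VOCAB_SIZE = 27743      # token type number reported by the training script
BATCH_SIZE = 512        # batch size used in the Lambda Cloud run below (default is an assumption)
N_LAYERS = 6            # encoder/decoder blocks (assumption)
N_HEADS = 8             # attention heads per block (assumption)
D_MODEL = 256           # model width (assumption)
DROPOUT = 0.1           # (assumption)
LEARNING_RATE = 3e-4    # (assumption)
```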
The maps between word (token) and index are stored in `ind_to_vocab.pkl` and `vocab_to_ind.pkl`. Run `python build_vocab.py` under the `./data` folder to regenerate the maps.
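As a minimal sketch of how the maps can be inspected (the paths assume you run this from the folder containing the pickles, and the presence of `"the"` in the vocabulary is an assumption):

```python
import pickle

# Load the token<->index maps produced by build_vocab.py.
with open("ind_to_vocab.pkl", "rb") as f:
    ind_to_vocab = pickle.load(f)   # index -> word/token
with open("vocab_to_ind.pkl", "rb") as f:
    vocab_to_ind = pickle.load(f)   # word/token -> index

print(len(vocab_to_ind))                   # expected to match the 27743 token types above
print(ind_to_vocab[vocab_to_ind["the"]])   # round-trip a (presumably present) common word
```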
One of the future directions is to improve training efficiency using distributed/parallel training. However, the network architecture limits how much parallel training can help. One epoch with a batch size of 512 on a `gpu_8x_a100_80gb_sxm4` instance on Lambda Cloud takes ~17 minutes; see the detailed output below. Increasing the batch size to 1024 results in a CUDA out-of-memory error.
(transformer) ubuntu@207-211-161-88:~/transformer-with-shakespeare$ python3 main.py -e 2 -f 1 -q
Disable wandb
Hello World!
CUDA available: True
CUDA device count: 8
Epochs: 2
Data factor: 1.0
Enable PyTorch Data parallelism
17.199999 M parameters
Token type number: 27743
Loading data...
Length of data: 2041475
Shape of np data: (2041475, 2, 128)
Tensorizing data...
data shape: torch.Size([2041475, 2, 128])
Train dataset length: 1429033
Test dataset length: 612442
Epoch 1/2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2792/2792 [17:11<00:00, 2.71batch/s]
The time spent in the first epoch is ~17 minutes.
(transformer) ubuntu@129-146-98-70:~/transformer-with-shakespeare$ cat train.out
Enable wandb
Hello World!
CUDA available: True
CUDA device count: 1
Epochs: 77
Data factor: 1.0
Enable PyTorch Data parallelism
17.199999 M parameters
Token type number: 27743
Loading data...
Length of data: 2041475
Shape of np data: (2041475, 2, 128)
Tensorizing data...
data shape: torch.Size([2041475, 2, 128])
Train dataset length: 1429033
Test dataset length: 612442
Epoch 1/77: 6%|▌ | 677/11165 [01:28<22:43, 7.69batch/s]
The estimated time per epoch on a single GPU is ~22 minutes. At least for the current architecture, parallel training does not improve training efficiency very much (a sketch of the data-parallel pattern follows below).
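The "Enable PyTorch Data parallelism" line in the logs suggests something like `torch.nn.DataParallel`, which splits each batch across the visible GPUs and gathers the outputs on one device. The sketch below shows that general pattern under that assumption; `TinyModel` is a stand-in, not the repo's actual model class.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in module; the repo's actual Transformer would go here."""
    def __init__(self, vocab_size=27743, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return self.head(self.embed(x))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyModel().to(device)

# DataParallel scatters each batch across all visible GPUs and gathers the
# results, so the effective per-GPU batch is batch_size / num_gpus.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
```

Because DataParallel replicates the model each step and gathers outputs on a single device, it often yields limited speedups for small models; `torch.nn.parallel.DistributedDataParallel` is the usual next step if scaling remains a goal.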