This repository contains an implementation of the Transformer architecture from the paper "Attention Is All You Need". This README serves as a basic introduction to the takeaways, results, and training steps, for better documentation of the project.
The project was mainly inspired by
- the motivation to implement the original Transformer architecture from scratch
- the fact that generating baby Shakespeare text is interesting :) and it gives the model a simple task to run on

Lastly, thanks to Andrej's project for providing the idea of generating Shakespeare text, some insights on data shaping, and a nice introduction to the Transformer.
- The objective function can differ across models. Define the input and compute the loss carefully.
- Evaluate different datasets before training. Test whether the produced model makes sense by training on a partial dataset first.
- The data pair (input, previous output) in autoregressive models adds another layer of complexity to both
  - the dataset design, since we probably need to maintain a high-quality "Q&A"-like dataset, and
  - the training process, as developers need to think about how to feed the previous output to the model without giving it the answer (see the sketch after this list).
- It is worth building a data pipeline from the beginning, instead of thinking about each data-processing step on demand.
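One common way to feed the previous output without giving the model the answer is teacher forcing: pair each input window with the same window shifted right by one token and rely on a causal mask at training time. The sketch below only illustrates that idea for a plain next-token setup; the helper name, the `block_size` of 128, and the slicing scheme are assumptions, and the repo's own pairing of (input, previous output) may differ.

```python
import numpy as np

def build_pairs(token_ids, block_size=128):
    """Pair each input window with the same window shifted right by one
    token (teacher forcing); a causal mask at training time keeps the
    model from peeking at the token it must predict.
    Illustrative sketch only -- not this repo's actual data code."""
    pairs = []
    for start in range(len(token_ids) - block_size - 1):
        x = token_ids[start : start + block_size]           # what the model sees
        y = token_ids[start + 1 : start + block_size + 1]   # what it should predict
        pairs.append((x, y))
    return np.array(pairs)  # shape (num_pairs, 2, block_size)
```

The resulting `(num_pairs, 2, block_size)` shape is consistent with the `(2041475, 2, 128)` arrays printed in the training logs below, but treat the repo's data pipeline as authoritative.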
A sample of the generated text:
QUEEN whom Gentleman
His of
All's.
KING my V. short a the at. for. most and. DIANA
The cornets. day.
****
of the the on. and An were.
BRANDON
Read Exeunt ****
Of the hither. draw the the the
requireth's.
Neighbour's : ****
Another repose, But the ****
We have the following key parameters:
- Token type number (vocabulary size): 27743
- Dataset size: 2045795 (the corpus contains 36 of Shakespeare's plays)
- Model size: ~16M parameters
- Data factor: 10%. Since the dataset is large, only a subset is used for training: the first 10% of all Shakespeare play tokens. This fraction is called the data factor (see the sketch below).
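As a rough illustration (not the repo's actual code), the data factor can be thought of as a simple prefix slice of the tokenized corpus; the function name below is hypothetical.

```python
def apply_data_factor(token_ids, data_factor=0.1):
    """Keep only the first `data_factor` fraction of the tokenized corpus.
    Illustrative only; the repo applies the -f flag inside its own pipeline."""
    cutoff = int(len(token_ids) * data_factor)
    return token_ids[:cutoff]
```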
- Clone this repo with `git clone --recursive https://github.com/xiaoxi-s/transformer-with-shakespeare.git`
- Run `conda env create -f requirements.yml`, which will create an environment called `transformer`
- Activate the environment with `conda activate transformer`
- Install wandb manually with `conda install -c conda-forge wandb`, as wandb is not readily available in the default conda channels

Note: if you did not pass the `--recursive` flag to the clone command, run `git submodule update --init --recursive` to download the submodule explicitly.
- Export your Weights & Biases API key as an environment variable with `export WANDB_API_KEY=<api key goes here>`
- Run `python main.py -e <epoch number> -f <data factor>` to start training
Hyperparameters are specified in `hyperparams.py` (a rough illustration follows below). If you want to disable wandb, pass the `-q` flag to the `python main.py ...` command.
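For orientation only, a `hyperparams.py` for a model of this size might look roughly like the sketch below. Every name and value here is an assumption except the sequence length of 128 and the vocabulary size of 27743 visible in the training logs, so treat `hyperparams.py` itself as the source of truth.

```python
# Illustrative sketch only -- the actual names and values live in hyperparams.py.
BLOCK_SIZE = 128        # sequence length; matches the (N, 2, 128) data shape in the logs
VOCAB_SIZE = 27743      # token type number reported by the training script
BATCH_SIZE = 512        # batch size used in the Lambda Cloud run below (default is an assumption)
N_LAYERS = 6            # encoder/decoder blocks (assumption)
N_HEADS = 8             # attention heads per block (assumption)
D_MODEL = 256           # model width (assumption)
DROPOUT = 0.1           # (assumption)
LEARNING_RATE = 3e-4    # (assumption)
```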
The maps between word (token) and index are stored in `ind_to_vocab.pkl` and `vocab_to_ind.pkl`. Run `python build_vocab.py` under the `./data` folder to regenerate the maps.
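As a minimal sketch of how the maps can be inspected (the paths assume you run this from the folder containing the pickles, and the presence of `"the"` in the vocabulary is an assumption):

```python
import pickle

# Load the token<->index maps produced by build_vocab.py.
with open("ind_to_vocab.pkl", "rb") as f:
    ind_to_vocab = pickle.load(f)   # index -> word/token
with open("vocab_to_ind.pkl", "rb") as f:
    vocab_to_ind = pickle.load(f)   # word/token -> index

print(len(vocab_to_ind))                   # expected to match the 27743 token types above
print(ind_to_vocab[vocab_to_ind["the"]])   # round-trip a (presumably present) common word
```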
One of the future directions is to improve training efficiency using distributed/parallel training. However, the network architecture limits how much parallel training can help. One epoch with a batch size of 512 on a `gpu_8x_a100_80gb_sxm4` instance on Lambda Cloud takes ~17 minutes; see the detailed output below. Increasing the batch size to 1024 results in a CUDA out-of-memory error.
(transformer) ubuntu@207-211-161-88:~/transformer-with-shakespeare$ python3 main.py -e 2 -f 1 -q
Disable wandb
Hello World!
CUDA available: True
CUDA device count: 8
Epochs: 2
Data factor: 1.0
Enable PyTorch Data parallelism
17.199999 M parameters
Token type number: 27743
Loading data...
Length of data: 2041475
Shape of np data: (2041475, 2, 128)
Tensorizing data...
data shape: torch.Size([2041475, 2, 128])
Train dataset length: 1429033
Test dataset length: 612442
Epoch 1/2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2792/2792 [17:11<00:00, 2.71batch/s]
The time spent in the first epoch is ~17 minutes.
(transformer) ubuntu@129-146-98-70:~/transformer-with-shakespeare$ cat train.out
Enable wandb
Hello World!
CUDA available: True
CUDA device count: 1
Epochs: 77
Data factor: 1.0
Enable PyTorch Data parallelism
17.199999 M parameters
Token type number: 27743
Loading data...
Length of data: 2041475
Shape of np data: (2041475, 2, 128)
Tensorizing data...
data shape: torch.Size([2041475, 2, 128])
Train dataset length: 1429033
Test dataset length: 612442
Epoch 1/77: 6%|▌ | 677/11165 [01:28<22:43, 7.69batch/s]
The estimated time per epoch on a single GPU is ~22 minutes. At least for the current architecture, parallel training does not improve training efficiency very much (a sketch of the data-parallel pattern follows below).
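The "Enable PyTorch Data parallelism" line in the logs suggests something like `torch.nn.DataParallel`, which splits each batch across the visible GPUs and gathers the outputs on one device. The sketch below shows that general pattern under that assumption; `TinyModel` is a stand-in, not the repo's actual model class.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in module; the repo's actual Transformer would go here."""
    def __init__(self, vocab_size=27743, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return self.head(self.embed(x))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyModel().to(device)

# DataParallel scatters each batch across all visible GPUs and gathers the
# results, so the effective per-GPU batch is batch_size / num_gpus.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
```

Because DataParallel replicates the model each step and gathers outputs on a single device, it often yields limited speedups for small models; `torch.nn.parallel.DistributedDataParallel` is the usual next step if scaling remains a goal.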