NOTE: Since I run this code on my local PC, I need to reduce batch_size and max_sequence_length.
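For example, a reduced local-PC configuration might look like this (the exact values are illustrative assumptions, not the ones actually used):

```python
# Reduced settings for training on a local PC.
# The specific numbers here are assumptions for illustration.
BATCH_SIZE = 8             # instead of e.g. 32 on a larger GPU
MAX_SEQUENCE_LENGTH = 128  # instead of BERT's maximum of 512
```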
mkdir dataset
cd dataset
kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
unzip -q jigsaw-toxic-comment-classification-challenge.zip
unzip -q train.csv.zip
download_tokenizer_model.ipynb : use this one to save the pretrained tokenizer and model on my PC
prepare_data.ipynb : use this one to preprocess the data and export it to a new data file
utils.py : helper functions
train.py : training script
./bert/ : where I saved both the tokenizer and the model
dataset/ : the data goes here
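The download-and-save step in download_tokenizer_model.ipynb could be sketched as follows. This is a minimal sketch, assuming the checkpoint is `bert-base-uncased` from the Hugging Face hub and that `./bert/` is the save directory from the layout above; the checkpoint name and `num_labels` are assumptions:

```python
# Sketch of download_tokenizer_model.ipynb (assumptions: checkpoint name,
# save path "./bert", and 6 labels for the Jigsaw toxic comment task).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=6,  # the Jigsaw dataset has 6 toxicity label columns
    problem_type="multi_label_classification",
)

# Save both locally so the other scripts can load them offline.
tokenizer.save_pretrained("./bert")
model.save_pretrained("./bert")
```

Later scripts can then load offline with `AutoTokenizer.from_pretrained("./bert")`.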
TODO:
- add a scheduler to the optimizer
- train with a bigger batch_size on the full dataset
- inference
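The scheduler TODO could be sketched like this. A common choice for BERT fine-tuning is linear warmup followed by linear decay; this pure-Python version shows the schedule itself (the warmup/total step counts and base learning rate are made-up placeholders, and with PyTorch this is what `torch.optim.lr_scheduler.LambdaLR` would wrap):

```python
def linear_warmup_decay(step, warmup_steps, total_steps, base_lr):
    """Linear warmup from 0 to base_lr, then linear decay back to 0.

    Sketch of the schedule only; hook it to the optimizer via
    torch.optim.lr_scheduler.LambdaLR in the actual training loop.
    """
    if step < warmup_steps:
        # Ramp up linearly during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Example: base LR 2e-5, 100 warmup steps out of 1000 total (placeholders).
lrs = [linear_warmup_decay(s, 100, 1000, 2e-5) for s in range(1000)]
```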