
BERTurk Training Dataset Preparation #28

Open
kaansonmezoz opened this issue Oct 31, 2021 · 6 comments

@kaansonmezoz

Hello Stefan,

I'm going to train another BERT model from scratch with a different pre-training objective. Then I will use it to compare against BERTurk and other Turkish pre-trained language models. In order to evaluate the impact of the pre-training task properly, the model should be trained with similar data and parameters.

In the README file it was stated that:

> The current version of the model is trained on a filtered and sentence segmented version of the Turkish OSCAR corpus, a recent Wikipedia dump, various OPUS corpora and a special corpus provided by Kemal Oflazer.

I've already collected Kemal Oflazer's corpus and the OSCAR corpus, but there are a few things I'm curious about. If you can answer them, I will be happy 🙂

  1. Did you apply filtering and sentence segmentation only to the OSCAR corpus, or did you apply them to the other corpora too?
  2. What kind of filtering did you apply? Was it something like removing sentences with fewer than 5 tokens from the corpus?
  3. Did you use only full stops for sentence segmentation?
  4. Do you remember which Wikipedia dump was used?
  5. Which OPUS corpora did you use? There are plenty of datasets in OPUS; there are even datasets from Wikipedia, such as WikiMatrix v1 and Wikipedia and wikimedia v20210402. Did you use them too?
  6. Did you apply any extra pre-processing methods apart from BertTokenizer's?

Also, if you still have the corpora built from the public datasets, would you mind sharing them? It would make things a lot easier for me and save me a lot of trouble 🙂

Thanks in advance 🙂

@stefan-it
Owner

Hi @kaansonmezoz,

thanks for your interest in our models 🤗

  1. The complete training corpus was filtered and sentence segmented, basically with:

```python
from nltk.tokenize import sent_tokenize

# "line" is one raw line from the corpus: split it into Turkish sentences
# and keep only those with more than 5 whitespace-separated tokens.
for sent in sent_tokenize(line, "turkish"):
    if len(sent.split()) > 5:
        print(sent)
```

So it was not applied only to the OSCAR subcorpus (a fuller version of this loop is sketched in the PS at the end of this comment).

  2. I kept only sentences longer than 5 tokens (split on whitespace), see above :)

  3. Not only full stops are considered for sentence segmentation; NLTK's Punkt model also handles other sentence-final punctuation such as question marks and exclamation marks.

  4. I just looked it up in my "data lake": the trwiki-latest-pages-articles.xml.bz2 dump is 480M in size, with a timestamp of 2 Feb 2020.

  5. I could find the following OPUS-related files, with a timestamp of 3 Feb 2020:

bible-uedin.txt  GNOME.txt  JW300.txt  OpenSubtitles.txt  opus.all  QED.txt  SETIMES.txt  Tanzil.txt  Tatoeba.txt  TED2013.txt  Wikipedia.txt

  6. For pre-processing (of the pre-training data) the official BERT implementation was used, so basically all pre-processing steps can be found here: https://github.com/google-research/bert/blob/master/tokenization.py#L161-L182. First a basic tokenization step is done, followed by the WordPiece step; I did not add any extra steps. A short illustration follows right after this list.
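To make that concrete, here is a minimal, illustrative sketch of the basic-tokenization-plus-WordPiece step. It uses the Hugging Face tokenizer as a stand-in for the official BERT code, and the checkpoint name below is just the released BERTurk model; for your own pre-training you would point it at your own vocab file instead:

```python
from transformers import BertTokenizer

# Illustrative stand-in for google-research/bert's FullTokenizer
# (BasicTokenizer followed by WordpieceTokenizer). The checkpoint name
# is the released cased BERTurk model and serves only as an example
# vocabulary; swap in your own vocab when you pre-train from scratch.
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

text = "Merhaba dünya, bu bir deneme cümlesidir."
tokens = tokenizer.tokenize(text)              # basic tokenization + WordPiece
ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary ids

print(tokens)
print(ids)
```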

Please just give me your mail address and I can immediately send you the link to the corpus used for pre-training 🤗
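
PS: in case it is useful while you wait for the mail, here is a rough, self-contained sketch of the filtering loop from answer 1 applied to a whole corpus file. The file names are placeholders, and the input is assumed to contain one raw document or paragraph per line:

```python
import nltk
from nltk.tokenize import sent_tokenize

# Depending on your NLTK version, the "punkt" (or newer "punkt_tab") data
# package is needed; it ships the Turkish sentence tokenizer model.
nltk.download("punkt")

# Placeholder file names for illustration only.
with open("raw_corpus.txt", encoding="utf-8") as fin, \
     open("filtered_corpus.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        for sent in sent_tokenize(line, "turkish"):
            # Keep only sentences with more than 5 whitespace-separated tokens.
            if len(sent.split()) > 5:
                fout.write(sent + "\n")
```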

@hazalturkmen

hazalturkmen commented Nov 3, 2021

Hi @stefan-it,
Can I get the link to the corpus used for pre-training?
Thanks,

@stefan-it
Owner

Hey @hazalturkmen, no problem, just give me an email address where I can contact you 🤗

@hazalturkmen

Thanks, @stefan-it,
Here is my email address:

hazalturkmen91@gmail.com

@kaansonmezoz
Author

kaansonmezoz commented Nov 4, 2021

@stefan-it Thank you for the detailed explanation. My email is sonmezozkaan@gmail.com 🙂

You are a life saver! ❤️

@stefan-it
Owner

Mails are out 🤗
