BERTurk Training Dataset Preparation #28
Hi @kaansonmezoz, thanks for your interest in our models 🤗
from nltk.tokenize import sent_tokenize

# keep only sentences with more than 5 whitespace-separated tokens
for sent in sent_tokenize(line, "turkish"):
    if len(sent.split()) > 5:
        print(sent)

So it is not only applied to the OSCAR subcorpus here.
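For illustration, a minimal sketch of running that filter over a whole plain-text corpus file; the file names corpus.txt and filtered.txt are placeholders, not the actual pre-processing script:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # Punkt models, including the Turkish sentence tokenizer

# write every sentence with more than 5 tokens to the filtered output file
with open("corpus.txt", encoding="utf-8") as src, \
     open("filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        for sent in sent_tokenize(line.strip(), "turkish"):
            if len(sent.split()) > 5:
                dst.write(sent + "\n")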
With a timestamp of
Please just give me your mail address and I can immediately send you the link to the corpus used for pre-training 🤗
Hi @stefan-it,
Hey @hazalturkmen, no problem, just give me an email address where I can contact you 🤗
Thanks, @stefan-it,
@stefan-it Thank you for the detailed explanation. My email is sonmezozkaan@gmail.com 🙂 You are a life saver! ❤️
Mails are out 🤗
Hello Stefan,
I'm going to train another BERT model with a different pre-training objective from scratch. Then I will use it to compare with BERTurk and other Turkish pre-trained language models. In order to evaluate the impact of the pre-training task properly, the model should be trained with similar data and parameters.
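A rough sketch of what such a from-scratch run could look like with Hugging Face Transformers; the corpus file, tokenizer and hyper-parameters below are placeholders, and the masked-LM objective is only a stand-in for whatever pre-training objective is actually compared:

from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# placeholder tokenizer; a comparable setup would train its own vocab on the same corpus
tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-turkish-cased")

dataset = load_dataset("text", data_files={"train": "turkish_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

config = BertConfig(vocab_size=tokenizer.vocab_size)  # BERT-base sized, randomly initialised
model = BertForMaskedLM(config)                       # trained from scratch, not fine-tuned

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-from-scratch",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()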
In the README file it was stated that:
I've already collected Kemal Oflazer's and OSCAR's corpus. But there are things I'm curious about. If you can answer them, I will be happy 🙂
Did you also use WikiMatrix v1, Wikipedia and wikimedia v20210402?
Also, if you have the public datasets' corpora, do you mind sharing them? It would make things a lot easier for me and save me from the trouble 🙂
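For reference, a minimal sketch of pulling the Turkish OSCAR subcorpus with the Hugging Face datasets library; the config name unshuffled_deduplicated_tr and the output file name are assumptions and not necessarily the exact dump used for BERTurk:

from datasets import load_dataset

# download the deduplicated Turkish split of OSCAR and dump it to one plain-text file
oscar_tr = load_dataset("oscar", "unshuffled_deduplicated_tr", split="train")
with open("oscar_tr.txt", "w", encoding="utf-8") as f:
    for record in oscar_tr:
        f.write(record["text"].replace("\n", " ") + "\n")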
Thanks in advance 🙂