New classification models #1657
base: dev
Conversation
```python
def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    xs = [a for a in seg.segment(x_long[0]) if len(a) > 0]
    return tuple(xs)
```
Hey!
For texts with sentences that exceed 256 tokens, we get this (in ner_mult_long_demo):

RuntimeError: input sequence after bert tokenization shouldn't exceed 256 tokens.
Would it be possible to add some code here that splits each sentence in xs into chunks of at most 256 tokens, using transformers.BertTokenizerFast or something like that?
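For reference, here is roughly what I mean by checking the subword count with the fast tokenizer (just an illustration; the model name is the multilingual one used elsewhere in this thread, not necessarily what the config loads):

```python
from transformers import BertTokenizerFast

# Illustrative only: count WordPiece tokens per sentence to spot the ones
# that would exceed the model's 256-token limit.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

for sentence in xs:
    n_subwords = len(tokenizer.tokenize(sentence))
    if n_subwords > 256:
        print(f"too long ({n_subwords} tokens): {sentence[:50]}...")
```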
I know this may cause some inaccuracy in certain cases, for example if a sentence gets split in the middle of "Michael Jackson", but at least it won't raise a RuntimeError and fail the whole thing :)
There's just one important caveat: if a chunk boundary would fall in the middle of a word, e.g. "un|believable", the whole word should remain intact and move to the next chunk rather than being split.
Many thanks!
Something like this seems to at least prevent most of the crashes:
```python
import pysbd
from transformers import BertTokenizer
...

def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    sentences = [a for a in seg.segment(x_long[0]) if len(a) > 0]

    def split_long_sentence(sentence: str) -> list:
        # Split a single sentence into chunks of at most 250 WordPiece tokens.
        tokens = tokenizer.tokenize(sentence)
        if len(tokens) <= 250:
            return [sentence]
        chunks = []
        current_chunk = []
        current_token_count = 0
        for token in tokens:
            if current_token_count + 1 > 250:
                # The current chunk is full: flush it and start a new one.
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = [token]
                current_token_count = 1
            else:
                current_chunk.append(token)
                current_token_count += 1
        if current_chunk:
            chunks.append(current_chunk)
        return [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]

    processed_sentences = []
    for sentence in sentences:
        processed_sentences.extend(split_long_sentence(sentence))
    return tuple(processed_sentences)
```
I couldn't get it to work with 256 (maybe I'm not using the correct pretrained model or the right parameters), since it still exceeded 256 tokens quite often, so I put 250.
This doesn't keep words intact, but it at least prevents most of the crashes for inputs that would otherwise fail and return no results.
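If keeping words intact turns out to matter, one option (just a sketch, the helper name is made up) is to group the WordPiece continuation pieces ("##...") with the piece they attach to, and then pack whole words into chunks:

```python
def split_tokens_keeping_words(tokens, max_len=250):
    # Group WordPiece tokens into words: a new word starts at any piece
    # that does not begin with "##".
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)
        else:
            words.append([tok])
    # Greedily pack whole words into chunks of at most max_len tokens.
    # (A single word longer than max_len would still need a hard split.)
    chunks, current = [], []
    for word in words:
        if current and len(current) + len(word) > max_len:
            chunks.append(current)
            current = []
        current.extend(word)
    if current:
        chunks.append(current)
    return chunks
```

Then split_long_sentence could just return [tokenizer.convert_tokens_to_string(c) for c in split_tokens_keeping_words(tokens)] instead of building the chunks inline.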
Ideally, "250" and "bert-base-multilingual-cased" should come as arguments in the SentenceDelimiter
function, but I'm not sure how to pass those from the json config file.
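On the Python side it could look something like this (just a sketch with keyword defaults so nothing changes if no arguments are passed, and using simple fixed-size slicing for brevity; how to wire the values in from the json config is still the open question):

```python
def SentenceDelimiter(x_long,
                      max_tokens: int = 250,
                      tokenizer_name: str = "bert-base-multilingual-cased"):
    # Same idea as above, with the hard-coded values lifted into
    # keyword arguments so the current behaviour stays the default.
    seg = pysbd.Segmenter(clean=False)
    tokenizer = BertTokenizer.from_pretrained(tokenizer_name)
    sentences = [a for a in seg.segment(x_long[0]) if len(a) > 0]
    processed = []
    for sentence in sentences:
        tokens = tokenizer.tokenize(sentence)
        if len(tokens) <= max_tokens:
            processed.append(sentence)
        else:
            for i in range(0, len(tokens), max_tokens):
                chunk = tokens[i:i + max_tokens]
                processed.append(tokenizer.convert_tokens_to_string(chunk))
    return tuple(processed)
```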
Thanks!