
New classification models #1657

Open · wants to merge 39 commits into dev

Conversation

@Kolpnick (Collaborator) commented Jul 27, 2023

Major Features and Improvements

  • Added new multilingual intents classification model
  • Added new multilingual emotions classification model
  • Added new multilingual sentiments classification model
  • Added new multilingual topics classification model
  • Added new multilingual insults classification model
  • Added new multilingual NER model
  • Added config with sentence split for NER

Bug Fixes and Other Changes

@ghnp5 commented Sep 14, 2024

Hey!
Just wondering when you believe this might get in? 😊
Many thanks!

def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    xs = [a for a in seg.segment(x_long[0]) if len(a) > 0]
    return tuple(xs)

@ghnp5 commented Jan 26, 2025

Hey!

For texts with sentences that exceed 256 tokens, we get this (in ner_mult_long_demo):

RuntimeError: input sequence after bert tokenization shouldn't exceed 256 tokens.

Would it be possible to add some code here so that each sentence in xs is split into chunks of at most 256 tokens, using transformers.BertTokenizerFast or something like that?

I know it may cause some inaccuracy in some cases, if for example the sentence gets split in the middle of "Michael Jackson", but at least it won't cause a RuntimeError and fail the whole thing :)

There's just one important caveat: if a chunk boundary would fall in the middle of a word, e.g. "un|believable", the word should remain intact and move to the next chunk rather than being split.

Many thanks!


Something like this seems to at least prevent most of the crashes:

import pysbd
from transformers import BertTokenizer

...

def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

    sentences = [a for a in seg.segment(x_long[0]) if len(a) > 0]

    def split_long_sentence(sentence: str) -> list:
        # Measure length in WordPiece tokens, the same units the 256-token
        # limit is expressed in.
        tokens = tokenizer.tokenize(sentence)

        # Short sentences pass through unchanged.
        if len(tokens) <= 250:
            return [sentence]

        chunks = []
        current_chunk = []
        current_token_count = 0

        for token in tokens:
            # Start a new chunk once the current one holds 250 tokens.
            if current_token_count + 1 > 250:
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = [token]
                current_token_count = 1
            else:
                current_chunk.append(token)
                current_token_count += 1

        if current_chunk:
            chunks.append(current_chunk)

        # Detokenize each chunk back into plain text for the NER pipeline.
        return [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]

    processed_sentences = []
    for sentence in sentences:
        processed_sentences.extend(split_long_sentence(sentence))

    return tuple(processed_sentences)

I couldn't get it to work with 256 (maybe I'm not using the correct pretrained model or the right parameters), since many times it still exceeded 256 after all, so I put 250.

This doesn't keep words intact, but it at least prevents most of the crashes that would otherwise fail the whole request and return no results.
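
For reference, here is a minimal sketch of a word-preserving variant; the helper name is hypothetical, and it assumes the same bert-base-multilingual-cased tokenizer and 250-token budget. It groups WordPiece continuation pieces ("##...") back into whole words before packing them into chunks, so a word is never divided across two chunks:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def split_long_sentence_whole_words(sentence: str, max_tokens: int = 250) -> list:
    tokens = tokenizer.tokenize(sentence)
    if len(tokens) <= max_tokens:
        return [sentence]

    # Group continuation pieces ("##...") with the piece that starts the word.
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1].append(token)
        else:
            words.append([token])

    # Pack whole words into chunks of at most max_tokens WordPiece tokens.
    # Note: a single word longer than max_tokens would still overflow its chunk.
    chunks, current, count = [], [], 0
    for word in words:
        if current and count + len(word) > max_tokens:
            chunks.append(current)
            current, count = [], 0
        current.extend(word)
        count += len(word)
    if current:
        chunks.append(current)

    return [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]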

Ideally, "250" and "bert-base-multilingual-cased" would be passed as arguments to the SentenceDelimiter function, but I'm not sure how to pass those in from the JSON config file.
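
If the PR can use a DeepPavlov-style registered component rather than a plain function, the values could probably be supplied as component parameters in the config's pipe entry. A rough sketch, in which the registration name "sentence_delimiter" and both parameter names are assumptions rather than anything defined in this PR:

import pysbd
from transformers import BertTokenizer
from deeppavlov.core.common.registry import register
from deeppavlov.core.models.component import Component


@register("sentence_delimiter")
class SentenceDelimiterComponent(Component):
    """Splits a long input text into sentences before NER."""

    def __init__(self, pretrained_model: str = "bert-base-multilingual-cased",
                 max_tokens: int = 250, **kwargs):
        self.segmenter = pysbd.Segmenter(clean=False)
        self.tokenizer = BertTokenizer.from_pretrained(pretrained_model)
        self.max_tokens = max_tokens

    def __call__(self, x_long):
        sentences = [s for s in self.segmenter.segment(x_long[0]) if len(s) > 0]
        # Chunking with self.tokenizer and self.max_tokens (e.g. the helper
        # sketched above) would go here before returning.
        return tuple(sentences)

The pipe entry in the JSON config could then look roughly like {"class_name": "sentence_delimiter", "pretrained_model": "bert-base-multilingual-cased", "max_tokens": 250, "in": [...], "out": [...]}, since DeepPavlov passes a pipe component's extra keys to its constructor.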

Thanks!
