New classification models #1657
base: dev
Conversation
```python
def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    xs = [a for a in seg.segment(x_long[0]) if len(a) > 0]
    return tuple(xs)
```
Hey!
For texts with sentences that exceed 256 tokens, we get this (in ner_mult_long_demo):

RuntimeError: input sequence after bert tokenization shouldn't exceed 256 tokens.
Would it be possible to add some code here that splits each sentence in xs into chunks of at most 256 tokens, using transformers.BertTokenizerFast or something like that?
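For reference, here is roughly what I mean by checking the subword count with the fast tokenizer (just an illustration; the model name is the multilingual one used elsewhere in this thread, not necessarily what the config loads):

```python
from transformers import BertTokenizerFast

# Illustrative only: count WordPiece tokens per sentence to spot the ones
# that would exceed the model's 256-token limit.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

for sentence in xs:
    n_subwords = len(tokenizer.tokenize(sentence))
    if n_subwords > 256:
        print(f"too long ({n_subwords} tokens): {sentence[:50]}...")
```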
I know this may cause some inaccuracy in certain cases, for example if a sentence gets split in the middle of "Michael Jackson", but at least it won't raise a RuntimeError and fail the whole thing :)
There's just one important caveat: if a chunk boundary would fall in the middle of a word, e.g. "un|believable", the whole word should remain intact and move to the next chunk rather than being split.
Many thanks!
Something like this seems to at least prevent most of the crashes:
```python
import pysbd
from transformers import BertTokenizer
...

def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    sentences = [a for a in seg.segment(x_long[0]) if len(a) > 0]

    def split_long_sentence(sentence: str) -> list:
        # Split a single sentence into chunks of at most 250 WordPiece tokens.
        tokens = tokenizer.tokenize(sentence)
        if len(tokens) <= 250:
            return [sentence]
        chunks = []
        current_chunk = []
        current_token_count = 0
        for token in tokens:
            if current_token_count + 1 > 250:
                # The current chunk is full: flush it and start a new one.
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = [token]
                current_token_count = 1
            else:
                current_chunk.append(token)
                current_token_count += 1
        if current_chunk:
            chunks.append(current_chunk)
        return [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]

    processed_sentences = []
    for sentence in sentences:
        processed_sentences.extend(split_long_sentence(sentence))
    return tuple(processed_sentences)
```
I couldn't get it to work with 256 (maybe I'm not using the correct pretrained model or the right parameters), since it still exceeded 256 tokens quite often, so I put 250.
This doesn't keep words intact, but it at least prevents most of the crashes for inputs that would otherwise fail and return no results.
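If keeping words intact turns out to matter, one option (just a sketch, the helper name is made up) is to group the WordPiece continuation pieces ("##...") with the piece they attach to, and then pack whole words into chunks:

```python
def split_tokens_keeping_words(tokens, max_len=250):
    # Group WordPiece tokens into words: a new word starts at any piece
    # that does not begin with "##".
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)
        else:
            words.append([tok])
    # Greedily pack whole words into chunks of at most max_len tokens.
    # (A single word longer than max_len would still need a hard split.)
    chunks, current = [], []
    for word in words:
        if current and len(current) + len(word) > max_len:
            chunks.append(current)
            current = []
        current.extend(word)
    if current:
        chunks.append(current)
    return chunks
```

Then split_long_sentence could just return [tokenizer.convert_tokens_to_string(c) for c in split_tokens_keeping_words(tokens)] instead of building the chunks inline.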
Ideally, "250" and "bert-base-multilingual-cased" should come as arguments in the SentenceDelimiter
function, but I'm not sure how to pass those from the json config file.
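On the Python side it could look something like this (just a sketch with keyword defaults so nothing changes if no arguments are passed, and using simple fixed-size slicing for brevity; how to wire the values in from the json config is still the open question):

```python
def SentenceDelimiter(x_long,
                      max_tokens: int = 250,
                      tokenizer_name: str = "bert-base-multilingual-cased"):
    # Same idea as above, with the hard-coded values lifted into
    # keyword arguments so the current behaviour stays the default.
    seg = pysbd.Segmenter(clean=False)
    tokenizer = BertTokenizer.from_pretrained(tokenizer_name)
    sentences = [a for a in seg.segment(x_long[0]) if len(a) > 0]
    processed = []
    for sentence in sentences:
        tokens = tokenizer.tokenize(sentence)
        if len(tokens) <= max_tokens:
            processed.append(sentence)
        else:
            for i in range(0, len(tokens), max_tokens):
                chunk = tokens[i:i + max_tokens]
                processed.append(tokenizer.convert_tokens_to_string(chunk))
    return tuple(processed)
```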
Thanks!