ValueError: NLP engine 'transformers' is not available. #1239

Closed
ragesh2000 opened this issue Dec 26, 2023 · 28 comments

Comments

@ragesh2000

ragesh2000 commented Dec 26, 2023

I am trying to use a Transformers-based Named Entity Recognition model with the following configuration, and I am getting the error shown below it.

configuration = {
    "nlp_engine_name": "transformers",
    "models": [{
                    "lang_code": "en",
                    "model_name": {
                        "spacy": "en_core_web_sm",
                        "transformers": "bigcode/starpii",
                    },
                }],
}

ValueError: NLP engine 'transformers' is not available. Make sure you have all required packages installed

What else needs to be installed? I have followed the documentation.

@omri374
Contributor

omri374 commented Dec 26, 2023

transformers doesn't come with the vanilla Presidio installation. Have you installed it with the [transformers] extra?

pip install "presidio_analyzer[transformers]"
pip install presidio_anonymizer
python -m spacy download en_core_web_sm
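
A quick way to confirm that the extra actually landed in the active environment is to check whether the relevant packages are importable. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and the package list is an assumption (`spacy_huggingface_pipelines` appears in the warning path later in this thread; the exact set pulled in by the `[transformers]` extra may differ between Presidio versions).

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of packages behind the 'transformers' NLP engine; adjust as needed.
required = ["spacy", "transformers", "spacy_huggingface_pipelines"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All assumed packages are importable.")
```

If anything is reported missing, re-running the `pip install` command above inside the same environment is the first thing to try.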

@ragesh2000
Author

Thanks @omri374. That helped.

@ragesh2000
Author

Sorry for reopening the issue; I need one more clarification. When we use a transformers model this way, will Presidio look for entities in both the spaCy and the transformers models? If so, is there any chance of a conflict between entity names? Or is there anything specific I need to do in my code? @omri374

@omri374
Contributor

omri374 commented Dec 27, 2023

The TransformersNlpEngine replaces the spaCy NER model with a transformers model, so you wouldn't get results from both. If you would like to have them running in parallel, see this issue: #1238. In short, one of them would have to be an NLP engine and the other, a recognizer.

@ragesh2000
Author

ragesh2000 commented Dec 27, 2023

I asked because I got a warning and the output was missing a required entity:
/home/ragesh/miniconda3/envs/presidio/lib/python3.8/site-packages/spacy_huggingface_pipelines/token_classification.py:138: UserWarning: Skipping annotation, {'entity_group': 'USERNAME', 'score': 0.9650769, 'word': ' rageshkr', 'start': 10, 'end': 19} is overlapping or can't be aligned for doc 'my name is rageshkr and iam going to dubai'
Why is this happening? Is this warning related to the model_to_presidio_entity_mapping parameter in the config? I am not sure what mapping I need to define here.

@omri374
Contributor

omri374 commented Dec 27, 2023

Can you please share a reproducible example?

@ragesh2000
Author

Sure.

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

conf_file = '/home/ragesh/Documents/presidio/config.yaml'

provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine, 
    supported_languages=["en"]
)

results_english = analyzer.analyze(text='my name is rageshkr and iam going to dubai', language="en", return_decision_process=True)

and my config file is

nlp_engine_name: transformers
models:
  -
    lang_code: en
    model_name:
      spacy: en_core_web_sm
      transformers: bigcode/starpii

ner_model_configuration:
  labels_to_ignore:
  - O
  aggregation_strategy: simple # "simple", "first", "average", "max"
  stride: 16
  alignment_mode: strict # "strict", "contract", "expand"
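
On the `alignment_mode` setting: the span in the warning starts at character 10, which is the space before "rageshkr" (the word itself starts at 11). With `strict` alignment, a span that does not begin and end exactly on token boundaries is typically rejected, which may explain the "can't be aligned" message. The sketch below illustrates the three modes under a simplifying assumption of whitespace tokenization; the `align` helper is hypothetical, and spaCy's own tokenizer and whitespace handling may behave differently.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def token_spans(text: str) -> List[Span]:
    """Character spans of whitespace-separated tokens (a stand-in for a real tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def align(text: str, start: int, end: int, mode: str = "strict") -> Optional[Span]:
    """Snap a character span to token boundaries, or return None if it cannot be aligned."""
    tokens = token_spans(text)
    if mode == "strict":
        # The span must start and end exactly on token boundaries.
        starts = {s for s, _ in tokens}
        ends = {e for _, e in tokens}
        return (start, end) if start in starts and end in ends else None
    if mode == "contract":
        # Keep only tokens that lie completely inside the span.
        kept = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps the span at all.
        kept = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (kept[0][0], kept[-1][1]) if kept else None


text = "my name is rageshkr and iam going to dubai"
for mode in ("strict", "contract", "expand"):
    print(mode, align(text, 10, 19, mode))
```

In this simplified model, `strict` rejects the 10-19 span, while `contract` and `expand` both snap it to 11-19, the word "rageshkr".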

@ragesh2000
Author

@omri374 Any update on this?

@omri374
Contributor

omri374 commented Dec 28, 2023

Yes I'm on it. Will update soon.

@omri374
Contributor

omri374 commented Dec 28, 2023

I think the reason you're missing an entity is not because of this warning, but because of the mapping of the model's entity names to Presidio's. The model outputs USERNAME which isn't in the mapping between the model and the library.

To fix it, there are two options:

  1. Customize the NerModelConfiguration object:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import TransformersNlpEngine, NerModelConfiguration


model_config = [{"lang_code": "en", "model_name": {
    "spacy": "en_core_web_sm",  # use a small spaCy model for lemmas, tokens etc.
    "transformers": "bigcode/starpii"
    }
}]

# bigcode/starpii entity mappings (model label -> Presidio entity name):
mapping = dict(
    USERNAME="USERNAME",
    EMAIL="EMAIL",
    KEY="KEY",
    PASSWORD="PASSWORD",
    IP_ADDRESS="IP_ADDRESS",
)

ner_model_configuration = NerModelConfiguration(model_to_presidio_entity_mapping=mapping)

nlp_engine = TransformersNlpEngine(models=model_config, ner_model_configuration=ner_model_configuration)
analyzer_engine = AnalyzerEngine(nlp_engine=nlp_engine)

  2. Add the requested entities to the transformers recognizer, which requires a bit of tweaking:

# nlp_engine = ... As defined before, just with the default mappings

analyzer_engine = AnalyzerEngine(nlp_engine=nlp_engine)
transformers_rec = [rec for rec in analyzer_engine.registry.recognizers if rec.name == "TransformersRecognizer"][0]
transformers_rec.supported_entities.append("USERNAME")


text = "my name is rageshkr and iam going to dubai"
results = analyzer_engine.analyze(text=text, language="en", return_decision_process=True)

This behavior (of not returning undefined entities) is a side effect of #1221 (I think). If you have any suggestions on how to improve the behavior here, please let us know!

@ragesh2000
Author

Method 1 seems to be working for me. But I would like to know what model_to_presidio_entity_mapping means. Is it the list of all entities in the transformers model? @omri374

@omri374
Contributor

omri374 commented Dec 28, 2023

Yes, it is used to translate the entities the model was trained on, to Presidio's. It is needed because there may be different ways to detect the same entity and this way you can achieve alignment. It is also used to be able to filter entities in or out, in a model agnostic way.

For example, you could have translated USERNAME to PERSON to conform with Presidio's built-in entities.
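
To make the translate-and-filter idea concrete, here is a minimal sketch in plain Python. It is not Presidio's implementation: the `map_entities` helper and the prediction dictionaries are hypothetical, shaped like the Hugging Face output seen in the warning earlier in this thread. Labels present in the mapping are renamed, and labels absent from it are dropped, which matches the behavior described above where an unmapped USERNAME was not returned.

```python
from typing import Dict, List


def map_entities(predictions: List[dict], mapping: Dict[str, str]) -> List[dict]:
    """Rename model labels to Presidio entity names; drop labels that have no mapping."""
    mapped = []
    for pred in predictions:
        label = pred["entity_group"]
        if label not in mapping:
            continue  # unmapped labels are filtered out in this sketch
        mapped.append({**pred, "entity_type": mapping[label]})
    return mapped


# Hypothetical model output for illustration only.
predictions = [
    {"entity_group": "USERNAME", "score": 0.97, "start": 11, "end": 19},
    {"entity_group": "KEY", "score": 0.80, "start": 30, "end": 40},
]

# Translate USERNAME to Presidio's PERSON; KEY is left unmapped on purpose.
mapping = {"USERNAME": "PERSON"}
print(map_entities(predictions, mapping))
```

Mapping a label to itself (for example `USERNAME` to `USERNAME`) keeps the model's own name while still making the entity available.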

@ragesh2000
Author

So if the model was trained on an entity that has no corresponding entity in Presidio, what should the mapping be?

@omri374
Contributor

omri374 commented Dec 29, 2023

Like the mapping in my previous example. The supported entities for this model are taken from this mapping. User name, for instance, is not a predefined entity in Presidio, but with this mapping it is returned.

@ragesh2000
Author

OK. Is it possible to give the input text to Presidio as a file? @omri374

@omri374
Contributor

omri374 commented Dec 31, 2023

Do you mean the configuration? Yes, through a yaml file: https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/#creating-a-configuration-file

@ragesh2000
Author

ragesh2000 commented Dec 31, 2023

Not the configuration. I mean instead of giving a string input to analyse, can we give a .txt or .json file to analyse? @omri374

@omri374
Contributor

omri374 commented Dec 31, 2023

I see. There is some support for json here. It shows some examples of using data frames or json as input.
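
For plain files, one simple approach is to read the content yourself and pass each string to the analyzer. The sketch below is an assumption-laden illustration, not a Presidio API: `analyze_file` and `iter_strings` are hypothetical helpers, and `analyze` stands for any callable that takes a string, for example `lambda t: analyzer.analyze(text=t, language="en")`.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, Tuple


def iter_strings(value: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, text) pairs for every string found in a nested JSON value."""
    if isinstance(value, str):
        yield path, value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(item, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from iter_strings(item, f"{path}[{index}]")


def analyze_file(path: str, analyze: Callable[[str], Any]) -> Dict[str, Any]:
    """Run `analyze` on a .txt file as a whole, or on each string value of a .json file."""
    file_path = Path(path)
    raw = file_path.read_text(encoding="utf-8")
    if file_path.suffix.lower() == ".json":
        return {key: analyze(text) for key, text in iter_strings(json.loads(raw))}
    return {"": analyze(raw)}  # the empty key stands for the whole document


# Example usage (the file name is hypothetical):
# results = analyze_file("notes.txt", lambda t: analyzer.analyze(text=t, language="en"))
```

Non-string JSON values (numbers, booleans, nulls) are skipped in this sketch; whether they should be analyzed depends on the data.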

@ragesh2000
Author

Thanks @omri374

@WithIbadKhan

WithIbadKhan commented May 7, 2024

Is there any need for a paid API, or is this fully open-source? @omri374

@omri374
Contributor

omri374 commented May 7, 2024

@WithIbadKhan there is no paid API for Presidio. Presidio is completely open-source.

@WithIbadKhan

And is it possible to build a pipeline that redacts the text of large PDF files? @omri374

@omri374
Contributor

omri374 commented May 7, 2024

Please see this as a starting point: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_pdf_annotation.ipynb

@WithIbadKhan

Please see this as a starting point: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_pdf_annotation.ipynb

You are the best. Thanks, @omri374

@WithIbadKhan

WithIbadKhan commented May 8, 2024

One more question, please: how do I add a special entity? For example, I also want to find money, because it is not currently detected in the text.

@omri374

@omri374
Contributor

omri374 commented May 8, 2024

@WithIbadKhan, this depends on how you want money to be detected. A good place to start is the tutorial for adding recognizers: https://microsoft.github.io/presidio/tutorial/. For example, you can create a regex pattern to detect a numeric value followed by a money sign.
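
As a starting point for such a pattern, here is a minimal sketch using only Python's `re` module. The regex, the `find_money` helper, and the choice of currency symbols and codes are illustrative assumptions; real-world amounts come in many more formats.

```python
import re
from typing import List, Tuple

# A number with optional thousands separators and decimals.
_NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"

# Either a currency symbol before the number, or a symbol/code after it.
MONEY_REGEX = (
    rf"[$€£¥]\s?{_NUMBER}"
    rf"|{_NUMBER}\s?(?:[$€£¥]|(?:USD|EUR|GBP)\b)"
)


def find_money(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for each money-like span in the text."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(MONEY_REGEX, text)]


sample = "It costs $1,200.50 or 300 EUR, maybe 45€."
for start, end, value in find_money(sample):
    print(start, end, value)
```

The same regex string could then be wrapped in a Presidio `Pattern` and passed to a `PatternRecognizer` with a custom entity name such as `MONEY`, as the tutorial on adding recognizers describes.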

@WithIbadKhan

WithIbadKhan commented May 15, 2024 via email

@omri374
Contributor

omri374 commented May 22, 2024

@WithIbadKhan please take a look at this tutorial: https://microsoft.github.io/presidio/tutorial/02_regex/
