
predict() fails with threshold probability 0 #2420

Closed
ADBond opened this issue Sep 24, 2024 · 4 comments
Labels
bug (Something isn't working) · good first issue (Good for newcomers)

Comments

@ADBond
Contributor

ADBond commented Sep 24, 2024

probability -> bayes factor -> log2 -> ValueError: math domain error

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ]
)

linker = Linker(df, settings, db_api)

pairwise_predictions = linker.inference.predict(0)
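The failure mechanism can be sketched in isolation. The helper below is a simplified, illustrative stand-in for Splink's internal probability-to-match-weight conversion (the real implementation lives in Splink's source): a threshold probability of 0 leads to log2(0), which raises the math domain error.

```python
import math

# Simplified stand-in for Splink's prob_to_match_weight conversion
# (illustrative assumption): match weight = log2(p / (1 - p)).
def prob_to_match_weight(p: float) -> float:
    return math.log2(p / (1 - p))

# p = 0 gives log2(0 / 1) = log2(0), hence "ValueError: math domain error"
try:
    prob_to_match_weight(0)
except ValueError as e:
    print(f"ValueError: {e}")
```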

Loosely related: #1716.

Similar issues: #2333, #2334.

ADBond added the bug and good first issue labels on Sep 24, 2024
@browo097302
Contributor

Hi, I was wondering if I might be able to help with this. I'm a beginner with open-source contributions, but I have had a look at why this math error may be occurring.

It seems that where the predict method is defined, it calls a function predict_from_comparison_vectors_sqls_using_settings(), which passes in threshold_match_probability. If this is zero, it gets passed through a few other functions until it reaches predict_from_comparison_vectors_sqls(), which uses a log2, so a 0 here causes a math error.

This could be a complete misunderstanding on my part, but I was thinking something along the lines of a check for threshold_match_probability > 0 within the predict_from_comparison_vectors_sqls_using_settings() function, so that it's validated every time the function is called.

Alternatively, a simple way would be to check for 0 just before predict_from_comparison_vectors_sqls_using_settings() is called, ensuring only valid values get passed in.

I would imagine having the check within predict_from_comparison_vectors_sqls_using_settings() would be better design, as it ensures every call is validated.

Please let me know if this approach seems useful, or if I'm on the right track. I'd love to try to code the solution and submit a PR if you think it makes sense.
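The validation described above could look something like the sketch below. The helper name is hypothetical and not Splink's actual API; the idea is just that a threshold probability of 0 excludes nothing, so it can be treated as "no threshold".

```python
from typing import Optional

# Hypothetical pre-call check (name is illustrative, not Splink's API):
# a threshold probability of 0 keeps every pairwise comparison, so it
# can safely be normalised to None before any log2 conversion happens.
def normalise_threshold(threshold_match_probability: Optional[float]) -> Optional[float]:
    if threshold_match_probability is not None and threshold_match_probability <= 0:
        return None  # filtering by probability >= 0 excludes nothing
    return threshold_match_probability
```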

Thanks

@RobinL
Member

RobinL commented Sep 25, 2024

Hello! Thanks, yeah, you're along the right lines.

The simplest solution is to explicitly treat a threshold match probability of 0 as a null match probability.

If you look here:

# In case user provided both, take the minimum of the two thresholds
if threshold_match_probability is not None:
    thres_prob_as_weight = prob_to_match_weight(threshold_match_probability)
else:
    thres_prob_as_weight = None
if threshold_match_probability is not None or threshold_match_weight is not None:
    thresholds = [
        thres_prob_as_weight,
        threshold_match_weight,
    ]
    threshold = max([t for t in thresholds if t is not None])
    threshold_expr = f" where log2({bayes_factor_expr}) >= {threshold} "
else:
    threshold_expr = ""

It's probably just a case of refactoring this logic so that if the threshold is 0, it's treated as if it were None.

@browo097302
Contributor

Hi RobinL,

Thanks for getting back to me. Yeah, that makes a lot of sense actually, and it's a clean way of doing it.

I'll look into refactoring the logic around threshold_match_probability so that if the value is 0, it's handled the same way as None. More specifically, adjusting that part in predict.py to ensure 0 values are treated as null.

Once it's ready, I'll submit a PR for review.

Cheers

RobinL added a commit that referenced this issue Sep 30, 2024
Handle threshold_match_probablity 0 in predict() #2420
@RobinL
Member

RobinL commented Sep 30, 2024

Closed by #2425

RobinL closed this as completed Sep 30, 2024