
predict() fails with threshold probability 0 #2420

Closed
ADBond opened this issue Sep 24, 2024 · 4 comments
Labels
bug (Something isn't working) · good first issue (Good for newcomers)

Comments

@ADBond
Contributor

ADBond commented Sep 24, 2024

probability -> bayes factor -> log2 -> ValueError: math domain error

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ]
)

linker = Linker(df, settings, db_api)

pairwise_predictions = linker.inference.predict(0)
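The failure mechanism can be sketched in isolation. The helper below is a simplified, illustrative stand-in for Splink's internal probability-to-match-weight conversion (the real implementation lives in Splink's source): a threshold probability of 0 leads to log2(0), which raises the math domain error.

```python
import math

# Simplified stand-in for Splink's prob_to_match_weight conversion
# (illustrative assumption): match weight = log2(p / (1 - p)).
def prob_to_match_weight(p: float) -> float:
    return math.log2(p / (1 - p))

# p = 0 gives log2(0 / 1) = log2(0), hence "ValueError: math domain error"
try:
    prob_to_match_weight(0)
except ValueError as e:
    print(f"ValueError: {e}")
```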

Loosely related: #1716.

Similar issues: #2333, #2334.

ADBond added the bug and good first issue labels on Sep 24, 2024
@browo097302
Contributor

Hi, I was wondering if I might be able to help with this. I'm a beginner with open-source contributions, but I have had a look at why this math error may be occurring.

It seems that where the predict method is defined, it calls a function predict_from_comparison_vectors_sqls_using_settings(), which passes in threshold_match_probability. If this is zero, it gets passed through a few other functions until it reaches predict_from_comparison_vectors_sqls(), which uses a log2, so a 0 here causes a math error.

This could be a complete misunderstanding on my part, but I was thinking something along the lines of a check for threshold_match_probability > 0 within the predict_from_comparison_vectors_sqls_using_settings() function, so that it's validated every time the function is called.

Alternatively, a simple way would be to check for 0 just before predict_from_comparison_vectors_sqls_using_settings() is called, ensuring only valid values get passed in.

I would imagine having the check within predict_from_comparison_vectors_sqls_using_settings() would be better design, as it ensures every call is validated.

Please let me know if this approach seems useful, or if I'm on the right track. I'd love to try to code the solution and submit a PR if you think it makes sense.
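The validation described above could look something like the sketch below. The helper name is hypothetical and not Splink's actual API; the idea is just that a threshold probability of 0 excludes nothing, so it can be treated as "no threshold".

```python
from typing import Optional

# Hypothetical pre-call check (name is illustrative, not Splink's API):
# a threshold probability of 0 keeps every pairwise comparison, so it
# can safely be normalised to None before any log2 conversion happens.
def normalise_threshold(threshold_match_probability: Optional[float]) -> Optional[float]:
    if threshold_match_probability is not None and threshold_match_probability <= 0:
        return None  # filtering by probability >= 0 excludes nothing
    return threshold_match_probability
```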

Thanks

@RobinL
Member

RobinL commented Sep 25, 2024

Hello! Thanks, yeah, you're along the right lines.

The simplest solution is to explicitly treat a threshold match probability of 0 as a null match probability.

If you look here:

# In case user provided both, take the minimum of the two thresholds
if threshold_match_probability is not None:
    thres_prob_as_weight = prob_to_match_weight(threshold_match_probability)
else:
    thres_prob_as_weight = None
if threshold_match_probability is not None or threshold_match_weight is not None:
    thresholds = [
        thres_prob_as_weight,
        threshold_match_weight,
    ]
    threshold = max([t for t in thresholds if t is not None])
    threshold_expr = f" where log2({bayes_factor_expr}) >= {threshold} "
else:
    threshold_expr = ""

It's probably just a case of refactoring this logic so that if the threshold is 0, it's treated as if it were None.

@browo097302
Contributor

Hi RobinL,

Thanks for getting back to me. Yeah, that makes a lot of sense actually, and it's a clean way of doing it.

I'll look into refactoring the logic around threshold_match_probability so that if the value is 0, it's handled the same way as None. More specifically, adjusting that part in predict.py to ensure 0 values are treated as null.

Once it's ready, I'll submit a PR for review.

Cheers

RobinL added a commit that referenced this issue Sep 30, 2024
Handle threshold_match_probablity 0 in predict() #2420
@RobinL
Member

RobinL commented Sep 30, 2024

Closed by #2425

RobinL closed this as completed Sep 30, 2024