-
Apologies, I realised I missed this when you posted it. I've done an example here: to show how to adjust the model post-training. I'm also working on a PR here that will allow the user to fix specific user-specified m and u values so that they are not changed during training.
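A minimal sketch of what post-training adjustment can look like (the column name, SQL conditions, and probability values below are hypothetical, not taken from this thread): a saved Splink model is plain JSON, so the m/u probabilities of individual comparison levels can be edited directly before the model is re-loaded for prediction.

```python
import json

# Hypothetical fragment of a saved Splink settings JSON (shape as produced
# by save_model_to_json; values invented for illustration).
settings = {
    "comparisons": [
        {
            "output_column_name": "birth_date",
            "comparison_levels": [
                {"sql_condition": '"birth_date_l" IS NULL OR "birth_date_r" IS NULL',
                 "label_for_charts": "Null", "is_null_level": True},
                {"sql_condition": '"birth_date_l" = "birth_date_r"',
                 "label_for_charts": "Exact match",
                 "m_probability": 0.95, "u_probability": 0.004},
                {"sql_condition": "ELSE",
                 "label_for_charts": "All other comparisons",
                 "m_probability": 0.05, "u_probability": 0.996},
            ],
        }
    ]
}

def set_m_probability(settings, column, label, new_m):
    """Overwrite the m probability of one comparison level in place."""
    for comparison in settings["comparisons"]:
        if comparison["output_column_name"] == column:
            for level in comparison["comparison_levels"]:
                if level.get("label_for_charts") == label:
                    level["m_probability"] = new_m
                    return
    raise KeyError(f"{column}/{label} not found")

# Penalise a birth_date mismatch more heavily by shrinking the else-level m.
set_m_probability(settings, "birth_date", "All other comparisons", 0.01)
```

The edited dict can then be written back out with `json.dump` and re-loaded for prediction; only the dict surgery is shown here, since the exact save/load calls depend on the Splink version in use.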
-
I want to know how to fix up a trained model, i.e. adjust the probabilities in its comparison levels. This has been discussed several times across multiple discussion topics, and its importance is documented here: https://moj-analytical-services.github.io/splink/demos/tutorials/04_Estimating_model_parameters.html#estimate-the-parameters-of-the-model. I do not fully understand how to correct the model training in order to get the weights I desire. Currently it feels like I am using splink as a starting point and then manually changing values to get the desired results, but maybe that is also a reasonable workflow.
Starting with the documentation:
I have comparison columns: first_name, last_name, birth_date, gender, postal_code, phone_number, email, and ssn (barely populated). My data contains families (~20%) and twins (already discussed in #2023 and #2168). Because of these families I get too many non-true matches, driven by the many columns that genuinely match between spouses, or between parent and child. I understand this will never be perfect, but even when there is a mismatch on birth_date I still get lots of matches, so my match weights are not good.
I first attempted to filter based on threshold_match_probability, but I have to set threshold_match_probability=0.9997. Besides this being very high, it still does not filter out the birth_date-mismatch pairs, while it does filter out true matches that merely mismatch on other, less important columns.
How does one correct this during training in order to get better-weighted comparisons? The difference in my scenario is that I do not want to adjust a single column for twins; I want to adjust almost all the columns. Depending on a column's importance, I change the probabilities on its "exact" comparison level and/or its "else" comparison level.
As an example of what I have done after training, I go in manually.
This would be an important column:
Original from training:
Manually changed to:
This would be a less important column, which I would slightly change; I also change all the other less important columns (less important based on my data):
Original from training:
Then manually changed to lower the match weight.
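For intuition on why these manual edits move the match weight: Splink reports each comparison level's match weight as log2(m / u), so lowering a level's m probability (or raising its u probability) directly lowers that level's weight. The m/u values below are illustrative only, not taken from the model in this thread.

```python
import math

def match_weight(m, u):
    # Splink's match weight for a comparison level is log2(m / u).
    return math.log2(m / u)

# Hypothetical "else" level of a column, before and after a manual edit:
before = match_weight(0.05, 0.996)  # about -4.3
after = match_weight(0.01, 0.996)   # about -6.6: a stronger mismatch penalty
```

Shrinking the else-level m from 0.05 to 0.01 roughly doubles the penalty (in bits) that a mismatch on that column contributes to the final score.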
I have already sanitized the columns; any bad or missing data is nulled. I have about 11 calls to linker.estimate_parameters_using_expectation_maximisation(), which is really just to populate the m and u probabilities. I pre-set some of the m probabilities:
cl.exact_match("sanitized_phone_number", m_probability_else=0.3),
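For reference, a comparison-library call like the one above is shorthand for an explicit comparison dict in the settings; writing it out (a sketch, with hypothetical SQL conditions and labels that may differ from what the library actually generates) makes clear which level a pre-set m probability attaches to. Note that unless such values are fixed, EM training can still overwrite them.

```python
# Hypothetical expanded form of
#   cl.exact_match("sanitized_phone_number", m_probability_else=0.3)
# as a raw comparison dict (a sketch; the exact SQL conditions and
# level labels the library emits may differ).
phone_comparison = {
    "output_column_name": "sanitized_phone_number",
    "comparison_levels": [
        {
            "sql_condition": '"sanitized_phone_number_l" IS NULL '
                             'OR "sanitized_phone_number_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": '"sanitized_phone_number_l" = "sanitized_phone_number_r"',
            "label_for_charts": "Exact match",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
            "m_probability": 0.3,  # the pre-set value from the call above
        },
    ],
}
```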
Some additional info which may be useful:
My training set size: ~9 million
Prediction set size: ~18 million
Training setup:
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.60)
linker.estimate_u_using_random_sampling(max_pairs=1e10)
My saved model
"probability_two_random_records_match": 7.569556080173639e-08
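That prior may explain why such a high threshold is needed: the prior (starting) match weight is log2(p / (1 - p)), so with this value every pair starts roughly 24 bits below even odds and needs that much positive evidence from the comparison columns just to reach 50% match probability. The arithmetic, using the value from the saved model:

```python
import math

# probability_two_random_records_match from the saved model above.
p = 7.569556080173639e-08

# Prior match weight in bits: log2 of the prior odds.
prior_match_weight = math.log2(p / (1 - p))  # about -23.7
```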
I am clearly missing a fundamental assumption about this model, but it is not clear to me what I do not understand, or whether this is simply what overtraining looks like.