-
Apologies, I realised I missed this when you posted it. I've done an example here: to show how to adjust the model post-training. I'm also working on a PR here that will allow the user to fix specific user-specified m and u values so that they are not changed during training.
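A minimal sketch of what post-training adjustment can look like (the column name, SQL conditions, and probability values below are hypothetical, not taken from this thread): a saved Splink model is plain JSON, so the m/u probabilities of individual comparison levels can be edited directly before the model is re-loaded for prediction.

```python
import json

# Hypothetical fragment of a saved Splink settings JSON (shape as produced
# by save_model_to_json; values invented for illustration).
settings = {
    "comparisons": [
        {
            "output_column_name": "birth_date",
            "comparison_levels": [
                {"sql_condition": '"birth_date_l" IS NULL OR "birth_date_r" IS NULL',
                 "label_for_charts": "Null", "is_null_level": True},
                {"sql_condition": '"birth_date_l" = "birth_date_r"',
                 "label_for_charts": "Exact match",
                 "m_probability": 0.95, "u_probability": 0.004},
                {"sql_condition": "ELSE",
                 "label_for_charts": "All other comparisons",
                 "m_probability": 0.05, "u_probability": 0.996},
            ],
        }
    ]
}

def set_m_probability(settings, column, label, new_m):
    """Overwrite the m probability of one comparison level in place."""
    for comparison in settings["comparisons"]:
        if comparison["output_column_name"] == column:
            for level in comparison["comparison_levels"]:
                if level.get("label_for_charts") == label:
                    level["m_probability"] = new_m
                    return
    raise KeyError(f"{column}/{label} not found")

# Penalise a birth_date mismatch more heavily by shrinking the else-level m.
set_m_probability(settings, "birth_date", "All other comparisons", 0.01)
```

The edited dict can then be written back out with `json.dump` and re-loaded for prediction; only the dict surgery is shown here, since the exact save/load calls depend on the Splink version in use.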
-
I want to know how to fix up a trained model, i.e. adjust the probabilities in its comparison levels. This has been discussed several times across multiple discussion topics, and its importance is documented here: https://moj-analytical-services.github.io/splink/demos/tutorials/04_Estimating_model_parameters.html#estimate-the-parameters-of-the-model. I do not fully understand how to correct the model training in order to get the weights I desire. Currently it feels like I am using splink as a starting point and then manually changing values to get the desired results, but maybe that is also a reasonable workflow.
Starting with the documentation:
I have comparison columns: first_name, last_name, birth_date, gender, postal_code, phone_number, email, and ssn (barely populated). My data contains families (~20%) and twins (already discussed in #2023 and #2168). Because of these families I get too many non-true matches, driven by the many columns that genuinely match between spouses, or between parent and child. I understand this will never be perfect, but even when there is a mismatch on birth_date I still get lots of matches, so my match weights are not good.
I first attempted to filter based on threshold_match_probability, but I have to set threshold_match_probability=0.9997. Besides this being very high, it still does not filter out the birth_date-mismatch pairs, while it does filter out true matches that merely mismatch on other, less important columns.
How does one correct this during training in order to get better-weighted comparisons? The difference in my scenario is that I do not want to adjust a single column for twins; I want to adjust almost all the columns. Depending on a column's importance, I change the probabilities on its "exact" comparison level and/or its "else" comparison level.
As an example of what I have done after training, I go in manually.
This would be an important column:
Original from training:
Manually changed to:
This would be a less important column, which I would slightly change; I also change all the other less important columns (less important based on my data):
Original from training:
Then manually changed to lower the match weight.
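For intuition on why these manual edits move the match weight: Splink reports each comparison level's match weight as log2(m / u), so lowering a level's m probability (or raising its u probability) directly lowers that level's weight. The m/u values below are illustrative only, not taken from the model in this thread.

```python
import math

def match_weight(m, u):
    # Splink's match weight for a comparison level is log2(m / u).
    return math.log2(m / u)

# Hypothetical "else" level of a column, before and after a manual edit:
before = match_weight(0.05, 0.996)  # about -4.3
after = match_weight(0.01, 0.996)   # about -6.6: a stronger mismatch penalty
```

Shrinking the else-level m from 0.05 to 0.01 roughly doubles the penalty (in bits) that a mismatch on that column contributes to the final score.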
I have already sanitized the columns; any bad or missing data is nulled. I have about 11 calls to linker.estimate_parameters_using_expectation_maximisation(), which is really just to populate the m and u probabilities. I pre-set some of the m probabilities:
cl.exact_match("sanitized_phone_number", m_probability_else=0.3),
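For reference, a comparison-library call like the one above is shorthand for an explicit comparison dict in the settings; writing it out (a sketch, with hypothetical SQL conditions and labels that may differ from what the library actually generates) makes clear which level a pre-set m probability attaches to. Note that unless such values are fixed, EM training can still overwrite them.

```python
# Hypothetical expanded form of
#   cl.exact_match("sanitized_phone_number", m_probability_else=0.3)
# as a raw comparison dict (a sketch; the exact SQL conditions and
# level labels the library emits may differ).
phone_comparison = {
    "output_column_name": "sanitized_phone_number",
    "comparison_levels": [
        {
            "sql_condition": '"sanitized_phone_number_l" IS NULL '
                             'OR "sanitized_phone_number_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": '"sanitized_phone_number_l" = "sanitized_phone_number_r"',
            "label_for_charts": "Exact match",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
            "m_probability": 0.3,  # the pre-set value from the call above
        },
    ],
}
```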
Some additional info which may be useful:
My training set size: ~9 million
Prediction set size: ~18 million
Training setup:
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.60)
linker.estimate_u_using_random_sampling(max_pairs=1e10)
My saved model
"probability_two_random_records_match": 7.569556080173639e-08
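That prior may explain why such a high threshold is needed: the prior (starting) match weight is log2(p / (1 - p)), so with this value every pair starts roughly 24 bits below even odds and needs that much positive evidence from the comparison columns just to reach 50% match probability. The arithmetic, using the value from the saved model:

```python
import math

# probability_two_random_records_match from the saved model above.
p = 7.569556080173639e-08

# Prior match weight in bits: log2 of the prior odds.
prior_match_weight = math.log2(p / (1 - p))  # about -23.7
```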
I am clearly missing a fundamental assumption about this model, but it is not clear to me what I do not understand, or whether this is simply what overtraining looks like.