How to handle array of first names comparisons? #2133

ymerouani · 2024-04-04T22:42:58Z

Hello everyone,
Wanted to start with a big thank you for all your hard work.

I'm trying to dedupe a dataset of inventors where the first-name field does not hold a single first name or a separate middle name. Even individuals I know to be the same person sometimes provide two first names, sometimes only one of their first names, and sometimes no first name at all.
Example:
row | first names          | surname
1   | Jane Josephine Maria | Doe
2   | Jane Maria           | Doe
3   |                      | Doe
4   | Augustine            | Doe

In the above example I assume row 1 = row 2 (same individual), and that row 4 is a different individual. Row 3 could belong to either individual; I would like other rules, such as year and sector of invention, to decide which one it most probably belongs to. I already solved the issue of father-son and mother-daughter record pairs by creating a column that helps me block them from being compared, and I even split the dataframe by gender and process each gender separately to make things less complicated.

Blocking didn't work out well for me because I have many null first names. This is the comparison I designed for first names:

comparison_first_name = {
    "comparison_levels": [
        cll.null_level("first_name"),
        cll.exact_match_level("first_name_array", term_frequency_adjustments=True, m_probability=1),
        cll.array_intersect_level("first_name_array", term_frequency_adjustments=True, min_intersection=2, m_probability=0.9),
        cll.array_intersect_level("first_name_array", term_frequency_adjustments=True, min_intersection=1, m_probability=0.8),
        cll.else_level(m_probability=0.0000000001),
    ],
}

Unfortunately, it is matching sibling pairs even when they don't share a first name. To make things worse, using the example above, even when I restrict the matching severely, rows 1-2 get clustered with row 4 through row 3. I think what is happening is that even if the match weight on this comparison is strongly negative, the other columns bring the probability back up. This is where I would like to use a blocking rule, I guess, but it doesn't handle null values well.
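To illustrate the point about other columns outweighing a negative first-name level, here is a minimal sketch of how match weights combine in a Fellegi-Sunter style model: each comparison contributes log2(m/u), the weights are added to the prior, and the total is converted back to a probability. All m, u and prior values below are hypothetical and chosen only for illustration; they are not taken from the model above.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight of one comparison level: log2 of the Bayes factor m/u."""
    return math.log2(m / u)


def match_probability(prior: float, weights: list[float]) -> float:
    """Combine a prior match probability with per-comparison match weights."""
    prior_weight = math.log2(prior / (1 - prior))
    total = prior_weight + sum(weights)
    return 2**total / (1 + 2**total)


# Hypothetical m and u values, for illustration only.
first_name_else = match_weight(m=0.01, u=0.5)   # negative: first names disagree
surname_exact = match_weight(m=0.95, u=0.001)   # positive: rare surname agrees
year_exact = match_weight(m=0.9, u=0.02)        # positive: same year
sector_exact = match_weight(m=0.9, u=0.05)      # positive: same sector

p = match_probability(
    prior=0.001,
    weights=[first_name_else, surname_exact, year_exact, sector_exact],
)
print(f"first-name weight: {first_name_else:.2f}, match probability: {p:.3f}")
```

With these made-up numbers the first-name weight is clearly negative, yet the pair still scores above 0.5 because surname, year and sector together contribute more evidence in the other direction. For siblings, who often share all three, this is the pattern described above.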

So I guess I really have two questions:
1- How do I more effectively prevent siblings from being matched through proper use of arrays?
2- When clustering, how do I force row 3 to link to either rows 1-2 or row 4, based on probability, so that it doesn't link rows 1-2 and row 4 together?

I'm writing this quite late, and hope I'm making sense at all :)
Cheers!

ymerouani (Author) · 2024-04-05T12:28:22Z

Note:
I just found the discussion on families and twins, which is very helpful, but I'm still unsure about question 2.

1 reply

RobinL Apr 5, 2024
Maintainer

This one is also relevant:
#2022

But we generally don't use arrays for name data; instead we split the names into forename_1, forename_2, etc. and surname.

If you do this, you should get stronger negative match weights for your else levels, which should help to split up siblings.

You can, to some extent, cover the problem of the names being out of order with a columns_reversed_level: https://moj-analytical-services.github.io/splink/comparison_level_library.html?h=reverse
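As a rough illustration of the splitting step, the sketch below turns a free-text first-name field into fixed forename_1, forename_2, forename_3 columns in plain Python. The helper name split_forenames and the choice of three columns are assumptions made for this example; they are not part of Splink.

```python
def split_forenames(first_names, max_names=3):
    """Split a free-text first-name field into fixed forename_N columns.

    Missing positions become None, so a null comparison level can treat
    them as "no evidence" rather than as a mismatch.
    """
    tokens = (first_names or "").split()
    padded = (tokens + [None] * max_names)[:max_names]
    return {f"forename_{i + 1}": name for i, name in enumerate(padded)}


# The four example rows from the question.
rows = [
    (1, "Jane Josephine Maria", "Doe"),
    (2, "Jane Maria", "Doe"),
    (3, None, "Doe"),
    (4, "Augustine", "Doe"),
]

for row_id, first_names, surname in rows:
    print(row_id, split_forenames(first_names), surname)
```

Note that "Maria" lands in forename_3 for row 1 but in forename_2 for row 2. That is the out-of-order situation a columns_reversed_level can partly cover.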

RobinL (Maintainer) · 2024-04-05T13:45:10Z

On (2), unless you have additional information, I don't know how you'd decide whether to link to (1,2) or (4). That is, how does the algorithm 'choose' which one to link to (and indeed, how do you know it links to either)?

If it's possible to write down the logical conditions, then it should be possible to build a model so that your match probabilities, when clustered, result in the clusters you want.

We might be able to help more if you're able to post a bit more detail about your code and data
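To make the clustering question concrete, here is a small sketch of why row 3 pulls rows 1-2 and row 4 into one cluster, together with one hypothetical post-processing rule: a record flagged as ambiguous (for example, one with no first name) keeps only its strongest link before clustering. The pairwise probabilities are invented, and keep_best_edge_for is not a Splink function. It only produces a meaningful choice if the model has real information to separate the candidates, as noted above.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters by transitive closure over the edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _prob in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return sorted(sorted(c) for c in clusters.values())


def keep_best_edge_for(ambiguous, edges):
    """Hypothetical rule: an ambiguous record keeps only its strongest link."""
    best = {}
    for edge in edges:
        a, b, prob = edge
        for node in (a, b):
            if node in ambiguous and (node not in best or prob > best[node][2]):
                best[node] = edge
    return [
        e for e in edges
        if all(n not in ambiguous or best[n] == e for n in e[:2])
    ]


nodes = [1, 2, 3, 4]
# Invented pairwise match probabilities that already passed a threshold.
edges = [(1, 2, 0.99), (1, 3, 0.80), (2, 3, 0.85), (3, 4, 0.90)]

print(connected_components(nodes, edges))
print(connected_components(nodes, keep_best_edge_for({3}, edges)))
```

With these numbers the plain clustering merges all four rows, while the pruned edge list yields two clusters, with row 3 attached to whichever side gave it the highest probability.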
