-
Note: |
-
On (2): unless you have additional information, I don't see how you'd decide whether record 3 should link to (1,2) or to (4). That is, how does the algorithm 'choose' which one to link to (and indeed, how do you know it links to either)? If you can write down the logical conditions, it should be possible to build a model whose match probabilities, when clustered, produce the clusters you want. We might be able to help more if you can post a bit more detail about your code and data.
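To make the 'choose' problem concrete, here is a minimal sketch in plain Python, not Splink code. It assumes clustering works by connected components over pairs above a threshold, and it invents the match probabilities, the threshold, and the helper name `best_edge_only` purely for illustration. One possible rule is that a record with no first name keeps only its single highest-scoring link.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes into clusters using union-find over the given edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return sorted(sorted(c) for c in clusters.values())

def best_edge_only(scores, ambiguous, threshold):
    """Hypothetical rule: an ambiguous record keeps only its best edge."""
    kept, best = [], {}
    for (a, b), p in scores.items():
        if p < threshold:
            continue
        amb = [r for r in (a, b) if r in ambiguous]
        if not amb:
            kept.append((a, b))  # edges between unambiguous records stay
            continue
        for r in amb:
            if r not in best or p > best[r][1]:
                best[r] = ((a, b), p)
    kept.extend(edge for edge, _ in best.values())
    return kept

# Made-up pairwise match probabilities for the four example records.
scores = {(1, 2): 0.99, (1, 3): 0.70, (2, 3): 0.72, (3, 4): 0.65}
records = [1, 2, 3, 4]
threshold = 0.6

# Plain thresholding: record 3 bridges both groups into one cluster.
naive = connected_components(
    records, [pair for pair, p in scores.items() if p >= threshold]
)

# With the rule applied, record 3 joins only its most probable neighbour.
pruned = connected_components(
    records, best_edge_only(scores, ambiguous={3}, threshold=threshold)
)
```

With these invented scores, `naive` puts all four records in one cluster, while `pruned` leaves record 4 on its own. The rule is only as good as the scores behind it: if 0.72 versus 0.65 is not a meaningful difference, the choice is still arbitrary, and two ambiguous records linked to each other could still form a chain.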
-
Hello everyone,
Wanted to start with a big thank you for all your hard work.
I'm trying to dedupe a dataset of inventors where people don't have a single first name or a separate middle name. Even individuals I know to be the same person sometimes provide two first names, sometimes only one of them, and sometimes no first name at all.
Example:
1 | Jane Josephine Maria | Doe
2 | Jane Maria | Doe
3 | | Doe
4 | Augustine | Doe
In the above example I assume row 1 = row 2 (same individual) and that row 4 is a different individual. Row 3 could belong to either the first individual or the second; I would like other rules, such as year and sector of invention, to decide which one it most probably belongs to. I have already solved the issue of father-son and mother-daughter record pairs by creating a column that lets me block them from being compared, and I even split the dataframe by gender and process each gender separately to keep things less complicated.
Blocking didn't work out well for me because I have many null first names. This is the comparison I designed for first names:
comparison_first_name = {
    "comparison_levels": [
        cll.null_level("first_name"),
        cll.exact_match_level("first_name_array", term_frequency_adjustments=True, m_probability=1),
        cll.array_intersect_level("first_name_array", term_frequency_adjustments=True, min_intersection=2, m_probability=0.9),
        cll.array_intersect_level("first_name_array", term_frequency_adjustments=True, min_intersection=1, m_probability=0.8),
        cll.else_level(m_probability=0.0000000001),
    ],
}
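For anyone following along, here is a rough plain-Python sketch of how I understand these levels to be evaluated: in order, with the first matching level winning. It is not Splink code. The function name `first_name_level` and the level labels are made up, and it assumes a missing or empty array counts as null (the comparison above checks the `first_name` column for null instead).

```python
def first_name_level(left, right):
    """Return the label of the first comparison level that applies.

    left/right are lists of first names, or None when no name was given.
    Assumption: an empty or missing list is treated as null.
    """
    if not left or not right:
        return "null"
    if left == right:
        return "exact"
    shared = len(set(left) & set(right))
    if shared >= 2:
        return "intersect>=2"
    if shared >= 1:
        return "intersect>=1"
    return "else"

# The four example rows, keyed by row number.
rows = {
    1: ["Jane", "Josephine", "Maria"],
    2: ["Jane", "Maria"],
    3: None,
    4: ["Augustine"],
}

level_1_2 = first_name_level(rows[1], rows[2])  # two shared names
level_1_4 = first_name_level(rows[1], rows[4])  # no shared names
level_3_1 = first_name_level(rows[3], rows[1])  # row 3 has no name
level_3_4 = first_name_level(rows[3], rows[4])  # row 3 has no name
```

Row 3 lands in the null level against every other row. As far as I understand, a null level contributes no evidence either way, so for row 3 the first-name comparison cannot separate rows 1-2 from row 4, and the other columns decide everything.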
But unfortunately it is matching sibling pairs even when they don't share a first name. To make things worse, using the example above, even when I restrict the matching severely, cases 1-2 get clustered with case 4 through case 3. I think what is happening is that even when the match weight on this comparison level is strongly negative, the other columns bring the probability back up. This is where I would like to use a blocking rule, I guess, but it doesn't handle null values well.
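To illustrate what I mean by other columns bringing the probability up, here is a small sketch of the Fellegi-Sunter arithmetic as I understand it: each comparison adds log2(m/u) to a prior weight, and the total is converted back to a probability. All m, u and prior values below are invented for illustration, and `match_probability` is a hypothetical helper, not a Splink function.

```python
import math

def match_probability(prior, levels):
    """Combine a prior with per-comparison (m, u) pairs.

    prior: probability that two random records match.
    levels: list of (m, u) for the level each comparison fell into.
    """
    weight = math.log2(prior / (1 - prior))
    for m, u in levels:
        weight += math.log2(m / u)  # match weight of this comparison
    return 1 / (1 + 2 ** -weight)

prior = 0.001  # made-up prior

# Made-up (m, u) values for the columns that agree on a sibling pair.
surname = (0.95, 0.001)
year = (0.9, 0.02)
sector = (0.9, 0.05)

# First names disagree. If the 'else' level ends up with a moderate m,
# the agreeing columns outweigh it.
p_moderate = match_probability(prior, [(0.01, 0.95), surname, year, sector])

# If the 'else' level really keeps m = 1e-10, the penalty dominates.
p_strict = match_probability(prior, [(1e-10, 0.95), surname, year, sector])
```

With these numbers `p_moderate` comes out well above 0.5 while `p_strict` is close to zero. So if sibling pairs still score highly, it may be worth checking what m value the 'else' level actually has after training, since I'm not sure the value I set by hand is the one the final model uses.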
So I guess I really have two questions:
1- How do I more efficiently hinder siblings from being matched through proper use of arrays?
2- When clustering, how do I force case 3 to either link to cases 1-2 or case 4 based on probability so that it doesn't link cases 1-2 and 4?
I'm writing this quite late, and hope I'm making sense at all :)
Cheers!