Only unique values from column per cluster #779

ajdinameris · 2022-09-19T15:26:29Z

Hi everyone,

I am working with Splink 3.0 and I am trying to deduplicate a data set so that each resulting cluster_id contains only unique values of the tournament_id column. To better illustrate the problem, I have created this small data example:

unique_id	first_name	surname	display_name	tournament_id
101	John	Jones	John Jones	101
102	John	Jones	John Jones	102
103	John	Wayne	John Wayne	102
104	John	Jones	John Jones	102

Running the cluster_pairwise_predictions_at_threshold would cluster together all 3 John Jones, but that would be incorrect as the same person can't enter the same tournament twice (defined as tournament_id).

cluster_id	unique_id	first_name	surname	display_name	tournament_id
101	101	John	Jones	John Jones	101
101	102	John	Jones	John Jones	102
103	103	John	Wayne	John Wayne	102
101	104	John	Jones	John Jones	102

I have been experimenting with blocking rules like:
l.display_name = r.display_name AND l.tournament_id <> r.tournament_id
but that doesn't work, and it seems far from a good solution given that I would be running this on a large data set.

Is there any way to do this efficiently?

Thank you all in advance, and special thanks to the Splink creators for a wonderful piece of software!

RobinL · 2022-09-19T16:45:59Z

Maintainer

Hiya. Interesting question. So if I understand the question, you're saying:

The same player can appear across multiple tournaments
But within a single tournament duplicates cannot exist.

I think your proposed solution is sensible (i.e. including the l.tournament_id <> r.tournament_id condition in all of your blocking rules). It should work in the sense that it will prohibit any comparisons being created within tournaments. Are you saying that's not working?

Do you have more columns than just first name, surname and display name? The potential problem here is that those columns might not be enough information to effectively detect duplicates across tournaments. E.g. it's quite likely you'll have many John Smiths in a large dataset, so unless you have other information (e.g. date of birth, some geographical information) it will be difficult to distinguish between them.

4 replies

ajdinameris Sep 20, 2022
Author

Hi Robin, thanks for answering.

Your assumptions are correct: the same player can appear across multiple tournaments, but at the same time it's impossible for a single player to appear in the same tournament twice.

As for other info, I do have dob, age bracket, city, region, email and some other unique identifiers... So you suggest that I try adding the l.tournament_id <> r.tournament_id as a condition to all blocking rules, in the fashion I mentioned above l.display_name = r.display_name AND l.tournament_id <>r.tournament_id?

I'll try this out and come back with the results.

RobinL Sep 20, 2022
Maintainer

Yep - exactly. The key thing is that the AND l.tournament_id <>r.tournament_id needs to be added to all blocking rules to ensure it's always imposed by Splink. Let me know how it goes!
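
For illustration, here is a minimal sketch of one way to make sure the condition ends up on every rule. The helper with_cross_tournament and the dob, city and email column names are assumptions made up for this example; blocking_rules_to_generate_predictions is the Splink 3 settings key for prediction blocking rules, and the rest of the settings are omitted.

```python
# Condition that forbids comparing two records from the same tournament.
CROSS_TOURNAMENT = "l.tournament_id <> r.tournament_id"


def with_cross_tournament(rules):
    """Hypothetical helper: AND the cross-tournament condition onto each rule."""
    # Parentheses keep the original rule intact even if it contains an OR.
    return [f"({rule}) AND {CROSS_TOURNAMENT}" for rule in rules]


# Example rules; the column names are assumed, not taken from a real schema.
base_rules = [
    "l.display_name = r.display_name",
    "l.dob = r.dob AND l.city = r.city",
    "l.email = r.email",
]

# Partial settings dictionary, showing only the blocking rules.
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": with_cross_tournament(base_rules),
}

for rule in settings["blocking_rules_to_generate_predictions"]:
    print(rule)
```

Building the list in one place makes it harder to forget the condition on a rule added later.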

Great that you have the other identifiers - it should be possible to get a good result

ajdinameris Sep 28, 2022
Author

Hiya @RobinL,
Still a work in progress.
The problem persists even after adding l.tournament_id <> r.tournament_id as a condition to all blocking rules, and going the extra mile by experimenting with this comparison in the settings.json:

{
    "output_column_name": "tournament_id",
    "comparison_levels": [
        {
            "sql_condition": "\"tournament_id_l\" IS NULL OR \"tournament_id_r\" IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True
        },
        {
            "sql_condition": "\"tournament_id_l\" = \"tournament_id_r\"",
            "label_for_charts": "Exact match",
            "m_probability": 0.000000000000001,
            "u_probability": 99999.933431980606
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
            "m_probability": 0.9944726056945643,
            "u_probability": 0.9665680193936341
        }
    ],
    "comparison_description": "Exact match vs. anything else"
}

The predictions generated from the pairwise comparisons are correct in that they do not include any pairs where the tournament_id is the same, but Splink is still clustering together people with the same tournament_id.

Could it be that it's just ignoring the imposed condition?
One example that I noticed: Splink compares unique_id 144401 with 155501 (different tournament_id) and also 144401 with 155502 (different tournament_id). Is it then reasoning that, since both 155501 and 155502 are linked to 144401, 155501 and 155502 (same tournament_id) must be the same entity, or should be compared?
What is your take on this? Any help would be much appreciated.
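
To make the symptom concrete, here is a minimal sketch of a check that lists clusters containing the same tournament_id more than once. It uses the four clustered rows from the example table in the first post; find_violations is a hypothetical helper written for this illustration, not a Splink function.

```python
from collections import Counter

# Rows as (cluster_id, unique_id, tournament_id), copied from the example table.
clustered = [
    (101, 101, 101),
    (101, 102, 102),
    (103, 103, 102),
    (101, 104, 102),
]


def find_violations(rows):
    """Return {(cluster_id, tournament_id): count} where a tournament repeats in a cluster."""
    counts = Counter(
        (cluster_id, tournament_id) for cluster_id, _unique_id, tournament_id in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


violations = find_violations(clustered)
print(violations)  # {(101, 102): 2}
```

Cluster 101 contains tournament 102 twice (unique_ids 102 and 104), which is exactly the case that should not occur.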

RobinL Sep 29, 2022
Maintainer

Suppose you have the following pairwise links:
tournament1_record1 -> tournament2_recorda
tournament2_recorda -> tournament1_record2

Then, when you cluster, the following chain of links will be made:

t1_1 -> t2_a -> t1_2

and hence t1_1 and t1_2 will be in the same cluster.
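
The chain above can be reproduced with a small sketch, assuming that clustering amounts to taking connected components of the pairwise links. The union-find code below is a generic illustration of that idea, not Splink's actual implementation.

```python
def connected_components(edges):
    """Group nodes into clusters by following links transitively (union-find)."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Only cross-tournament pairs are linked, as the blocking rules require.
links = [("t1_1", "t2_a"), ("t2_a", "t1_2")]
clusters = connected_components(links)
print(clusters)
```

No pair within tournament 1 was ever compared, yet t1_1 and t1_2 land in one cluster because both are linked to t2_a.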

It's not easy to stop this happening - it's not really obvious to me what algorithm you would use. One thing that would help would be increasing the match threshold I guess.
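
As a rough sketch of why a higher threshold can help, the example below drops links that score under the threshold before clustering. The match probabilities are invented for illustration, and clusters_at_threshold is a hypothetical stand-in for the clustering step, not Splink code.

```python
from collections import defaultdict

# Hypothetical scored pairs: (record_a, record_b, match_probability).
scored_pairs = [
    ("t1_1", "t2_a", 0.99),
    ("t2_a", "t1_2", 0.85),
]


def clusters_at_threshold(pairs, threshold):
    """Connected components using only the links scoring at or above the threshold."""
    nodes = set()
    graph = defaultdict(set)
    for a, b, prob in pairs:
        nodes.update((a, b))
        if prob >= threshold:
            graph[a].add(b)
            graph[b].add(a)

    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from this record.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


low = clusters_at_threshold(scored_pairs, 0.80)   # one cluster of three records
high = clusters_at_threshold(scored_pairs, 0.90)  # weaker link is dropped
print(len(low), len(high))  # 1 2
```

This only helps when one link in the chain is weaker than the other; if both links score above the threshold, the same-tournament records still end up together.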

In general, the problem you have is not one that Splink solves for you.
