Only unique values from column per cluster #779
Hiya. Interesting question. So if I understand the question you're saying:
I think your proposed solution is sensible (i.e. including the

Do you have more columns than just first name, surname and display name? The potential problem here is that those columns might not be enough information to effectively detect duplicates across tournaments. For example, it's quite likely you'll have many John Smiths in a large dataset, so unless you have other information (e.g. date of birth, some geographical information) it will be difficult to distinguish between them.
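To get a rough feel for how ambiguous names are before modelling, one option is to use the rule from the question that a person can't enter the same tournament twice: if a name appears more than once within one tournament, those records must be different people. The sketch below uses made-up records and plain Python rather than any Splink API, so treat the column layout and values as assumptions.

```python
from collections import Counter

# Hypothetical records (display_name, tournament_id); not data from this thread.
records = [
    ("John Smith", 1),
    ("John Smith", 1),
    ("John Smith", 2),
    ("Jane Doe", 1),
    ("Jane Doe", 2),
]

# Count how often each name appears within each tournament.
per_tournament = Counter(records)

# A name seen more than once in one tournament must belong to different people,
# so the largest per-tournament count is a lower bound on the number of
# distinct people sharing that name.
min_people = {}
for (name, _tournament), count in per_tournament.items():
    min_people[name] = max(min_people.get(name, 0), count)

# Names that provably refer to more than one person.
ambiguous = {name: n for name, n in min_people.items() if n > 1}
print(ambiguous)  # {'John Smith': 2}
```

If many names show up here, name columns alone are unlikely to separate people reliably, and extra fields such as date of birth would be worth adding.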
Hi everyone,
I am working with Splink 3.0 and I am trying to deduplicate a data set so that, in the end result, a single `cluster_id` only contains unique values from the column `tournament_id`. To better illustrate the problem, I have created this small data example:

Running `cluster_pairwise_predictions_at_threshold` would cluster together all 3 `John Jones` records, but that would be incorrect, as the same person can't enter the same tournament twice (defined as `tournament_id`).

I have been playing a bit with blocking rules like `l.display_name = r.display_name AND l.tournament_id <> r.tournament_id`, but that doesn't work and is quite far from a good solution, considering that I would be running this on a large data set.
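A likely reason the blocking rule alone is not enough is transitivity: the rule stops two records from the same tournament being compared directly, but both can still link to a third record from another tournament and end up in one cluster. The sketch below illustrates this with made-up records and scores in plain Python, and contrasts it with a greedy merge that refuses to combine clusters sharing a tournament. The function names and data are hypothetical, and the greedy merge is a post-processing idea, not a Splink feature.

```python
from collections import defaultdict

# Hypothetical records: record id -> tournament_id (not the data from the post).
tournament_of = {"a": 1, "b": 2, "c": 1}

# Hypothetical scored pairs (id_l, id_r, match_probability). Pair (a, c) is
# absent because the blocking rule excludes pairs within the same tournament.
pairs = [("a", "b", 0.97), ("b", "c", 0.95)]


def connected_components(ids, edges):
    """Plain clustering: every linked pair ends up in the same cluster."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, _ in edges:
        parent[find(left)] = find(right)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())


def constrained_clusters(ids, edges, group_of):
    """Greedy merge, best pairs first, skipping merges that repeat a group."""
    cluster_of = {i: i for i in ids}
    members = {i: {i} for i in ids}
    groups = {i: {group_of[i]} for i in ids}

    for left, right, _ in sorted(edges, key=lambda e: -e[2]):
        cl, cr = cluster_of[left], cluster_of[right]
        if cl == cr or groups[cl] & groups[cr]:
            continue  # already together, or merge would duplicate a tournament
        for i in members[cr]:
            cluster_of[i] = cl
        members[cl] |= members.pop(cr)
        groups[cl] |= groups.pop(cr)

    return sorted(sorted(m) for m in members.values())


plain = connected_components(tournament_of, pairs)
constrained = constrained_clusters(tournament_of, pairs, tournament_of)
print(plain)        # [['a', 'b', 'c']]  a and c share tournament 1
print(constrained)  # [['a', 'b'], ['c']]
```

The greedy version is order-dependent and not guaranteed to be optimal, but it only needs one pass over the scored pairs after sorting, so it may be workable on a large set of pairwise predictions.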
Is there any way to do this efficiently?
Thank you all in advance, and special thanks to the Splink creators for a wonderful piece of software!