Trying to understand the behaviour for clustering on results of find_matches_to_new_records
#2282
-
Replies: 4 comments 2 replies
-
Hi @RobinL - do you have any suggestions on how we can debug or proceed with this issue?
-
This looks like something of a bug. Having looked into it, it seems to happen because the clustering uses the input data set (in this case the golden records) as a starting point, and along the way assumes that these are the full set of nodes. We are planning to make some behind-the-scenes adjustments to clustering, as well as allowing an option to cluster without a linker, and will definitely keep this in mind so we can remove this issue.

In the meantime, as a workaround, you should be able to circumvent this by running the clustering with a new linker (for Splink 4 users reading, set this up with a new DatabaseAPI as well), with your new data df_new as input:

```python
# Score the new records against the existing linker's records (no blocking, very low threshold)
df_inc = linker.find_matches_to_new_records(df_new, blocking_rules=[], match_weight_threshold=-200)
# Optional: inspect the pairwise predictions, strongest matches first
df_i = df_inc.as_pandas_dataframe().sort_values("match_weight", ascending=False)
# New linker whose input is the new records, reusing the same connection
linker = DuckDBLinker(
    df_new, settings, connection=con
)
# Cluster the pairwise predictions using the new linker
cluster_df = linker.cluster_pairwise_predictions_at_threshold(df_inc, threshold_match_probability=0.3)
```

I'll also note that this approach can cluster together records in your golden set via transitive links: if some new record matches two records from your golden set with high confidence, these will all be clustered together at the end. If you really want to avoid such outcomes, you would need to prune the set of edges before clustering.
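As a rough illustration of what pruning the edges could look like, here is a minimal pandas sketch that keeps only the strongest edge for each new record, so that no new record can bridge two golden records. It is one possible pruning rule, not something Splink does for you. The helper name `keep_best_edge_per_new_record` is hypothetical, the toy ids are made up, and the assumption that the new record sits in the `unique_id_r` column should be checked against your own predictions table.

```python
import pandas as pd


def keep_best_edge_per_new_record(edges, new_id_col="unique_id_r", weight_col="match_weight"):
    """Keep only the highest-weight edge for each new record.

    Hypothetical helper: `new_id_col` is assumed to hold the new record's id;
    check which side (_l or _r) your own predictions table uses.
    """
    # Sort so the strongest edge for each new record comes first...
    ranked = edges.sort_values(weight_col, ascending=False)
    # ...then keep just that first (strongest) edge per new record.
    return ranked.drop_duplicates(subset=new_id_col, keep="first").reset_index(drop=True)


# Toy pairwise predictions (made-up values): new record "n1" matches golden records 1 and 10.
edges = pd.DataFrame(
    {
        "unique_id_l": [1, 10, 3],
        "unique_id_r": ["n1", "n1", "n2"],
        "match_weight": [12.5, 9.0, 7.2],
    }
)

pruned = keep_best_edge_per_new_record(edges)
print(pruned)
```

After pruning, "n1" is linked only to golden record 1, so golden records 1 and 10 are no longer joined through it. How the pruned table is handed back to the clustering step depends on your Splink version and is not shown here.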
-
Hi @ADBond, thanks for the clarification. On the transitive links, I tried out another example along with your suggested workaround, which looks like below
and then clustered it using another linker, initialized with the same connection
In this case, the new incoming records that have a matching record in the gold records do get clustered with the corresponding gold record id correctly (1, 3, 4), but unique_ids 1 and 10 should have been clustered transitively as well, as you mentioned and according to what we see in the predictions df. Would you happen to have any insights on this?
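One way to check which clusters a set of pairwise predictions implies, independently of any linker, is to compute connected components over the edges that pass the threshold. The sketch below does this with a small union-find in plain Python. The function name `connected_components` and the ids in the edge list are hypothetical and are not taken from the example above, and the comparison only makes sense on the assumption that clustering at a threshold amounts to connected components over the retained edges.

```python
def connected_components(edges):
    """Group node ids into clusters, given an iterable of (id_a, id_b) edges."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Hypothetical edges that already passed the threshold:
# new record "new_a" matches both "gold_1" and "gold_10".
edges = [("gold_1", "new_a"), ("gold_10", "new_a"), ("gold_3", "new_b")]
clusters = connected_components(edges)
print(clusters)
```

Here "gold_1" and "gold_10" land in the same cluster because both are linked to "new_a". Comparing output like this with the cluster ids returned by the linker can help narrow down whether the missing transitive link is lost at the thresholding step or in the clustering itself.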