Trying to understand the behaviour for clustering on results of `find_matches_to_new_records` #2282

bpandey-CS · 2024-07-23T15:56:48Z

Hi,

I have a fixed set of gold records, and I am trying to find matches against them for incoming incremental records. An example of the code I am looking at:

import duckdb
import pandas as pd

from splink.duckdb.comparison_library import exact_match, levenshtein_at_thresholds
from splink.duckdb.linker import DuckDBLinker

con = duckdb.connect(":memory:")

# let's assume this is my given set of gold records
data_golden = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1980-01-01"},
    {"unique_id": 2, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"}
]
df_golden = pd.DataFrame(data_golden)

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [],
    "comparisons": [
        exact_match("surname"),
        levenshtein_at_thresholds("dob", 1),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "sql_dialect": "duckdb",
    "linker_uid": "gpc7s0jw",
    "probability_two_random_records_match": 0.0001
}
linker = DuckDBLinker(
    df_golden, settings, connection=con
)

After initializing the linker, I define my incremental records and try to find matches for them against the gold records:

data_new = [
    {"unique_id": 3, "first_name": "John3", "surname": "Smith", "dob": "1980-01-01"},
    {"unique_id": 4, "first_name": "Jon1", "surname": "Smith", "dob": "1980-01-01"},
    {"unique_id": 5, "first_name": "Lucy1", "surname": "Jones", "dob": "1997-08-23"},
    {"unique_id": 6, "first_name": "Nancy", "surname": "Webber", "dob": "1997-08-23"},
    {"unique_id": 7, "first_name": "Martha", "surname": "Webber", "dob": "1997-08-23"},
]

df_new = pd.DataFrame(data_new)

df_inc = linker.find_matches_to_new_records(df_new, blocking_rules=[])
df_inc.as_pandas_dataframe().sort_values("match_weight", ascending=False)

Here the output df looks very similar to the .predict dataframe.

But when I cluster it, the results are not what I expect:

cluster_df = linker.cluster_pairwise_predictions_at_threshold(df_inc, threshold_match_probability=0.3)
cluster_df.as_pandas_dataframe()

Instead of showing clusters of incoming records together with gold records, as the output of find_matches_to_new_records suggests (it finds pairs with high match probabilities), the clustered output contains the original gold records. I am trying to understand how cluster_pairwise_predictions_at_threshold behaves on the output of find_matches_to_new_records.

pkandarpa-cs · 2024-07-31T11:35:58Z

Hi @RobinL - do you have any suggestions on how we can debug or proceed with this issue?

ADBond (Maintainer) · 2024-07-31T15:29:58Z

This looks like something of a bug. Having a look at this it seems to be due to the fact that the clustering uses the input data set (in this case the golden records) as a starting point, and along the way there is an assumption that these are the full set of nodes. We are planning to make some behind-the-scenes adjustments to clustering, as well as allowing an option to cluster without a linker, and will definitely keep this in mind so we can remove this issue.
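To make the mismatch concrete, here is a minimal sketch in plain Python (not Splink code) that lists ids referenced by the pairwise edges but absent from the linker's input records. The helper name `ids_missing_from_nodes` and the probabilities are made up for illustration; only the ids come from the example in the question.

```python
# Ids of the records the linker was initialised with (the golden set in the question).
golden_ids = {1, 2}

# Pairwise edges as (id_l, id_r, match_probability).
# The probabilities are invented for illustration only.
edges = [
    (1, 3, 0.95),
    (1, 4, 0.95),
    (2, 5, 0.95),
]


def ids_missing_from_nodes(node_ids, edges):
    """Return ids that appear in at least one edge but not in the node set."""
    edge_ids = {i for id_l, id_r, _ in edges for i in (id_l, id_r)}
    return edge_ids - set(node_ids)


missing = ids_missing_from_nodes(golden_ids, edges)
print(sorted(missing))  # [3, 4, 5]
```

Any id reported here is a record that a clustering step starting from the input records alone would have no node for.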

In the meantime as a workaround, you should be able to circumvent this by running the clustering with a new linker (for Splink 4 users reading, set this up with a new DatabaseAPI as well), with input of your new data df_new:

df_inc = linker.find_matches_to_new_records(df_new, blocking_rules=[], match_weight_threshold=-200)
df_i = df_inc.as_pandas_dataframe().sort_values("match_weight", ascending=False)

linker = DuckDBLinker(
    df_new, settings, connection=con
)

cluster_df = linker.cluster_pairwise_predictions_at_threshold(df_inc, threshold_match_probability=0.3)

which should give clusters that group the new records with the golden records they match.

I'll also note that this approach can have the effect of clustering together records in your golden set via transitive links - if some new record matches with high confidence two records from your golden set, these will all be clustered together at the end. If you really wanted to avoid such outcomes you would need to prune the set of edges before clustering.
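One possible way to prune, sketched in plain Python under some assumptions: keep only the highest-scoring edge for each new record, so that a single new record cannot bridge two golden records. The helper `keep_best_edge_per_new_record` is hypothetical, the probabilities are invented, and the assumption that the new record's id sits in `unique_id_r` should be checked against your own predictions output. Rows in this shape can be obtained from a pandas dataframe with `to_dict("records")`.

```python
def keep_best_edge_per_new_record(
    edges, new_id_key="unique_id_r", score_key="match_probability"
):
    """Keep one edge per new record: the one with the highest score.

    Assumes the new record's id is stored under `new_id_key`.
    On ties, the edge encountered first is kept.
    """
    best = {}
    for edge in edges:
        new_id = edge[new_id_key]
        if new_id not in best or edge[score_key] > best[new_id][score_key]:
            best[new_id] = edge
    return list(best.values())


# New record 3 matches golden records 1 and 10; only the stronger edge survives.
edges = [
    {"unique_id_l": 1, "unique_id_r": 3, "match_probability": 0.97},
    {"unique_id_l": 10, "unique_id_r": 3, "match_probability": 0.91},
    {"unique_id_l": 2, "unique_id_r": 5, "match_probability": 0.95},
]

pruned = keep_best_edge_per_new_record(edges)
for edge in pruned:
    print(edge["unique_id_l"], edge["unique_id_r"], edge["match_probability"])
```

This is only one pruning policy. It prevents a new record from merging golden records, at the cost of discarding genuine secondary matches.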

bpandey-CS (Author) · 2024-08-01T11:45:45Z

Hi @ADBond, thanks for the clarification. On the transitive links, I tried another example together with your suggested workaround. It looks like this:

import duckdb
import pandas as pd
import polars as pl

from splink.duckdb.comparison_library import exact_match, levenshtein_at_thresholds
from splink.duckdb.linker import DuckDBLinker

con = duckdb.connect("data/duckdb/duck.duckdb")

# let's assume this is my given set of gold records
data_golden = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1980-01-01"},
    {"unique_id": 2, "first_name": "Lucy", "surname": "Jones", "dob": "1997-08-23"},
    {"unique_id": 10, "first_name": "John4", "surname": "Smith", "dob": "1980-01-01"}
]
df_golden = pl.DataFrame(data_golden)

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [],
    "comparisons": [
        exact_match("surname"),
        levenshtein_at_thresholds("dob", 1),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "sql_dialect": "duckdb",
    "linker_uid": "gpc7s0jw",
    "probability_two_random_records_match": 0.0001
}
linker = DuckDBLinker(
    df_golden.to_arrow(), settings, connection=con
)

data_new = [
    {"unique_id": 3, "first_name": "John3", "surname": "Smith", "dob": "1980-01-01"},
    {"unique_id": 4, "first_name": "Jon1", "surname": "Smith", "dob": "1980-01-01"},
    {"unique_id": 5, "first_name": "Lucy1", "surname": "Jones", "dob": "1997-08-23"},
    {"unique_id": 6, "first_name": "Nancy", "surname": "Webber", "dob": "1997-08-23"},
    {"unique_id": 7, "first_name": "Martha", "surname": "Webber", "dob": "1997-08-23"}
]

df_new = pl.DataFrame(data_new)

df_inc = linker.find_matches_to_new_records(df_new.to_arrow(), blocking_rules=[])
df_inc.as_pandas_dataframe().sort_values("match_weight", ascending=False)

I then clustered the result using another linker, initialized with the same connection:

linker = DuckDBLinker(
    df_new.to_arrow(), settings, connection=con
)
cluster_df = linker.cluster_pairwise_predictions_at_threshold(df_inc, threshold_match_probability=0.3)
cluster_df.as_pandas_dataframe()

In this case, the new incoming records that have a match in the gold records are clustered correctly with the corresponding gold record (1, 3, 4). However, unique_ids 1 and 10 should also have been clustered together transitively, as you mentioned and as the predictions df indicates. Would you happen to have any insights on this?

ADBond Aug 6, 2024
Maintainer

This looks to be the same issue, but in reverse: now that the new records are the 'input' data, the linker does not recognise the previous nodes as relevant when it comes to clustering. I'm not able to check right now, but I think that getting a 'full' clustering would require setting up a linker whose input data is a concatenation of df_new and df_golden.
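As an illustration of why the full node set matters, here is a small connected-components sketch in plain Python. It is not Splink's clustering implementation; the function name `connected_components` and the probabilities are made up, and only the ids come from the example above. With the golden and new ids combined as nodes, records 1 and 10 end up in the same cluster via records 3 and 4.

```python
def connected_components(node_ids, edges, threshold):
    """Cluster nodes joined by edges scoring at or above the threshold.

    Uses a simple union-find; each cluster is keyed by its smallest id.
    Every id referenced by an edge must be present in `node_ids`.
    """
    parent = {n: n for n in node_ids}

    def find(n):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for id_l, id_r, prob in edges:
        if prob >= threshold:
            root_l, root_r = find(id_l), find(id_r)
            if root_l != root_r:
                # Attach the larger root to the smaller one.
                parent[max(root_l, root_r)] = min(root_l, root_r)

    clusters = {}
    for n in node_ids:
        clusters.setdefault(find(n), []).append(n)
    return {root: sorted(members) for root, members in clusters.items()}


golden_ids = [1, 2, 10]
new_ids = [3, 4, 5, 6, 7]

# Edges between golden and new records; probabilities are invented.
edges = [
    (1, 3, 0.95),
    (1, 4, 0.95),
    (10, 3, 0.95),
    (10, 4, 0.95),
    (2, 5, 0.95),
    (2, 6, 0.01),  # below the threshold, so it is ignored
]

clusters = connected_components(golden_ids + new_ids, edges, threshold=0.3)
print(clusters)  # {1: [1, 3, 4, 10], 2: [2, 5], 6: [6], 7: [7]}
```

Records without any edge above the threshold (6 and 7 here) remain as singleton clusters, which only happens because they are included in the node list.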

bpandey-CS (Author) · 2024-08-05T06:14:30Z

hi @ADBond @RobinL any thoughts on the above example?

nabebaye Sep 24, 2024

#2412 should help with this issue. You'll be able to use the nodes from both the golden and new datasets to cluster the results together.
