Replies: 2 comments
-
Incremental linkage with clustering is not fully supported by Splink. Fundamentally, this is because a new record B could link two distinct historical clusters A and C, meaning all records in A and C, plus the new record B, need to end up in the same cluster. So you can't simply have new records join existing clusters. With 10 million records, my recommendation would be to simply re-run the whole linkage and clustering (of all records, including the new ones) every day. If you're able to run this in DuckDB on a large machine (e.g. 32 cores or more), it shouldn't take too long. Another approach you could consider would be to:
and then use the results, combined with your existing clusters, to derive the result you want manually. This is quite challenging, however, because of the clustering aspect and the need to account for transitivity (e.g. a new record B joining two existing clusters). You could possibly simplify the problem by disallowing transitivity - meaning any new record joins only the highest-matching existing cluster, and is thus prevented from joining two existing clusters at once. That would give you greater stability in your clusters, but would introduce false negatives. There's possibly a little further detail here:
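To see why transitivity forces cluster merges, here's a minimal pure-Python sketch (not Splink code; the record IDs and clusters are invented for illustration) using union-find to track cluster membership:

```python
class UnionFind:
    """Minimal union-find (disjoint set) to track cluster membership."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Two distinct historical clusters, A and C.
uf = UnionFind()
uf.union("a1", "a2")  # cluster A
uf.union("c1", "c2")  # cluster C
assert uf.find("a1") != uf.find("c1")

# A new record B matches one record in each cluster...
uf.union("b", "a1")
uf.union("b", "c1")

# ...so by transitivity every record in A and C now shares one cluster.
assert uf.find("a2") == uf.find("c2")
```

This is exactly why incremental assignment is hard: a single new record can force two previously stable historical clusters to merge, changing cluster IDs for records that haven't themselves changed.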
-
Thank you @RobinL for your prompt response.
-
Hi Team,
I have been working with the Splink library for a couple of weeks. The potential of this library is really awesome. Here is the problem statement: I need to implement incremental data linkage against already loaded and scored historical data. After the incremental data is assigned cluster IDs, it needs to be added to the historical data, and the model retrained for subsequent loads; in effect, the model is trained again and again. I have managed to run the code and assign cluster IDs to the very first load (i.e., the historical load). Moving forward, every day I will receive an incremental data load, and I need to link it against the historical data.
Example: I have loaded historical data (assume 10 million records) and assigned cluster IDs using Splink. Every day I will receive 50k additional records that each need to be assigned a cluster ID: either an existing cluster ID where the record matches the historical data, or a new cluster ID if it matches nothing.
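Under the simplification of assigning each new record to its single best-matching existing cluster (or a fresh cluster ID when no match clears a threshold), the daily assignment step might look like this sketch. The function name, score structure, and threshold are invented placeholders, not Splink API; the scores would come from a Splink prediction step:

```python
import uuid

def assign_clusters(new_records, scored_matches, threshold=0.9):
    """Assign each new record the cluster ID of its highest-scoring match,
    or a brand-new cluster ID if no match clears the threshold.

    scored_matches: dict mapping record_id -> list of
                    (historical_cluster_id, match_probability) pairs,
                    e.g. derived from a Splink prediction step.
    """
    assignments = {}
    for rec_id in new_records:
        candidates = scored_matches.get(rec_id, [])
        if candidates:
            cluster_id, score = max(candidates, key=lambda c: c[1])
            if score >= threshold:
                assignments[rec_id] = cluster_id
                continue
        # No sufficiently strong match: mint a fresh cluster ID.
        assignments[rec_id] = f"new-{uuid.uuid4().hex[:8]}"
    return assignments

# Illustration: r1 joins existing cluster "C17"; r2 matches nothing.
result = assign_clusters(
    ["r1", "r2"],
    {"r1": [("C17", 0.97), ("C42", 0.55)]},
)
assert result["r1"] == "C17"
assert result["r2"].startswith("new-")
```

Note this deliberately disallows transitive merges (a new record can never join two historical clusters), which keeps existing cluster IDs stable at the cost of some false negatives, as discussed in the answer above.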
I have tried using find_matches_to_new_records, but it works for only one record at a time, and it links against the new data only; I want to match against the historical data, and that is not working as expected.
Any help would be appreciated.