Replies: 2 comments
-
Incremental linkage with clustering is not fully supported by Splink. Fundamentally, this is because a new record B could link two distinct historical clusters A and C, meaning all records in A and C, plus the new record B, need to end up in the same cluster. So you can't simply have new records join existing clusters. With 10 million records, my recommendation would be to simply re-run the whole linkage and clustering (of all records, including the new ones) every day. If you're able to run this in DuckDB on a large machine (e.g. 32 cores or more), it shouldn't take too long. Another approach you could consider would be to:
and then use the results, combined with your existing clusters, to derive the result you want manually. This is quite challenging, however, because of the clustering aspect and the need to account for transitivity (e.g. a new record B joining two existing clusters). You could possibly simplify the problem by disallowing transitivity - meaning any new record joins only the highest-matching existing cluster, and is thus prevented from joining two existing clusters at once. That would give you greater stability in your clusters, but would introduce false negatives. There's possibly a little further detail here:
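To see why transitivity forces cluster merges, here's a minimal pure-Python sketch (not Splink code; the record IDs and clusters are invented for illustration) using union-find to track cluster membership:

```python
class UnionFind:
    """Minimal union-find (disjoint set) to track cluster membership."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Two distinct historical clusters, A and C.
uf = UnionFind()
uf.union("a1", "a2")  # cluster A
uf.union("c1", "c2")  # cluster C
assert uf.find("a1") != uf.find("c1")

# A new record B matches one record in each cluster...
uf.union("b", "a1")
uf.union("b", "c1")

# ...so by transitivity every record in A and C now shares one cluster.
assert uf.find("a2") == uf.find("c2")
```

This is exactly why incremental assignment is hard: a single new record can force two previously stable historical clusters to merge, changing cluster IDs for records that haven't themselves changed.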
-
Thank you @RobinL for your prompt response.
-
Hi Team,
I have been working with the Splink library for a couple of weeks. The potential of this library is really awesome. Here is the problem statement: I need to implement incremental data linkage against already loaded and scored historical data. After the incremental data is assigned cluster IDs, it needs to be added to the historical data, and the model retrained for subsequent loads; in effect, the model is trained again and again. I have managed to run the code and assign cluster IDs to the very first load (i.e., the historical load). Moving forward, every day I will receive an incremental data load, and I need to link it against the historical data.
Example: I have loaded historical data (assume 10 million records) and assigned cluster IDs using Splink. Every day I will receive 50k additional records that each need to be assigned a cluster ID: either an existing cluster ID where the record matches the historical data, or a new cluster ID if it matches nothing.
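Under the simplification of assigning each new record to its single best-matching existing cluster (or a fresh cluster ID when no match clears a threshold), the daily assignment step might look like this sketch. The function name, score structure, and threshold are invented placeholders, not Splink API; the scores would come from a Splink prediction step:

```python
import uuid

def assign_clusters(new_records, scored_matches, threshold=0.9):
    """Assign each new record the cluster ID of its highest-scoring match,
    or a brand-new cluster ID if no match clears the threshold.

    scored_matches: dict mapping record_id -> list of
                    (historical_cluster_id, match_probability) pairs,
                    e.g. derived from a Splink prediction step.
    """
    assignments = {}
    for rec_id in new_records:
        candidates = scored_matches.get(rec_id, [])
        if candidates:
            cluster_id, score = max(candidates, key=lambda c: c[1])
            if score >= threshold:
                assignments[rec_id] = cluster_id
                continue
        # No sufficiently strong match: mint a fresh cluster ID.
        assignments[rec_id] = f"new-{uuid.uuid4().hex[:8]}"
    return assignments

# Illustration: r1 joins existing cluster "C17"; r2 matches nothing.
result = assign_clusters(
    ["r1", "r2"],
    {"r1": [("C17", 0.97), ("C42", 0.55)]},
)
assert result["r1"] == "C17"
assert result["r2"].startswith("new-")
```

Note this deliberately disallows transitive merges (a new record can never join two historical clusters), which keeps existing cluster IDs stable at the cost of some false negatives, as discussed in the answer above.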
I have tried using find_matches_to_new_records, but it works for only one record at a time, and it links against the new data only; I want to match against the historical data, and that is not working as expected.
Any help would be appreciated.