-
If you're using Spark, that's effectively what salting does, see here. Specifically, it will increase the number of tasks the Spark job is chunked into. Also worth noting: in practice I found that Spark scales better than DuckDB even in local mode (i.e. without a cluster), partly because it parallelizes across your CPU cores better.
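To illustrate, here is the salting idea in plain PySpark rather than through Splink's own configuration; a minimal sketch, where the input paths, the `blocking_key` column, and `NUM_SALTS` are assumptions for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

NUM_SALTS = 8  # assumed multiplier; tune to your core/cluster count

left = spark.read.parquet("left.parquet")    # hypothetical inputs
right = spark.read.parquet("right.parquet")

# Give each left row one random salt value in [0, NUM_SALTS).
left_salted = left.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate each right row once per salt value, so every (key, salt)
# combination on the left can still find all of its matches.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
right_salted = right.crossJoin(salts)

# Joining on (blocking_key, salt) splits a hot key across NUM_SALTS
# tasks instead of funnelling it through a single one.
pairs = left_salted.join(right_salted, on=["blocking_key", "salt"])
```

The trade-off is that the right-hand side is replicated `NUM_SALTS` times, so you buy parallelism at the cost of extra shuffle volume.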
-
I've been handling exceptionally large datasets, and I must say that the splink package stands out when it comes to fuzzy-merge efficiency. I truly appreciate its capabilities in this regard!
Nevertheless, I ran into memory constraints when attempting to merge the datasets in a single step. As a workaround, I devised a method that splits the datasets into smaller chunks, merges them chunk by chunk, and then removes duplicate matches using the `match_probability` score.
I'm curious whether splink provides any built-in support for this kind of chunked workflow, or is there something I may have overlooked?
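For reference, a rough sketch of the chunk-then-deduplicate workaround described above. It assumes Splink 3's DuckDB API (`DuckDBLinker`, `predict`) with an already-trained `settings` dict using `link_type: "link_only"`; the chunk size, threshold, and column names are illustrative, and none of this is a built-in splink feature:

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

CHUNK_SIZE = 200_000  # assumed; size chunks to fit your memory budget

def chunked(df: pd.DataFrame, size: int):
    """Yield successive row slices of df."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

all_pairs = []
for left_chunk in chunked(df_left, CHUNK_SIZE):
    for right_chunk in chunked(df_right, CHUNK_SIZE):
        # `settings` is a pre-trained Splink settings dict with
        # link_type "link_only" (trained once, outside this loop).
        linker = DuckDBLinker(
            [left_chunk, right_chunk],
            settings,
            input_table_aliases=["df_l", "df_r"],
        )
        preds = linker.predict(threshold_match_probability=0.5)
        all_pairs.append(preds.as_pandas_dataframe())

pairs = pd.concat(all_pairs, ignore_index=True)

# Deduplicate across chunk pairs: keep each left record's
# highest-probability match.
best = (
    pairs.sort_values("match_probability", ascending=False)
         .drop_duplicates(subset="unique_id_l")
)
```

Since each candidate pair is scored from its two records alone, the per-pair scores match a single-pass run; the cost is paying Splink's fixed setup overhead once per pair of chunks.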