-
If you're using Spark, that's effectively what salting does, see here. Specifically, it will increase the number of tasks the Spark job is chunked into. Also worth noting: in practice I found that Spark scales better than DuckDB even in local mode (i.e. without a cluster), partly because it parallelizes across your CPU cores better.
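To illustrate, here is the salting idea in plain PySpark rather than through Splink's own configuration; a minimal sketch, where the input paths, the `blocking_key` column, and `NUM_SALTS` are assumptions for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

NUM_SALTS = 8  # assumed multiplier; tune to your core/cluster count

left = spark.read.parquet("left.parquet")    # hypothetical inputs
right = spark.read.parquet("right.parquet")

# Give each left row one random salt value in [0, NUM_SALTS).
left_salted = left.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate each right row once per salt value, so every (key, salt)
# combination on the left can still find all of its matches.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
right_salted = right.crossJoin(salts)

# Joining on (blocking_key, salt) splits a hot key across NUM_SALTS
# tasks instead of funnelling it through a single one.
pairs = left_salted.join(right_salted, on=["blocking_key", "salt"])
```

The trade-off is that the right-hand side is replicated `NUM_SALTS` times, so you buy parallelism at the cost of extra shuffle volume.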
-
I've been handling exceptionally large datasets, and I must say that the splink package stands out when it comes to fuzzy-merge efficiency. I truly appreciate its capabilities in this regard!
Nevertheless, I ran into memory constraints when attempting to merge the datasets in a single step. As a workaround, I devised a method that splits the datasets into smaller chunks, merges them chunk by chunk, and then removes duplicate matches using the `match_probability` score.
I'm curious whether splink provides any built-in support for this kind of chunked workflow, or is there something I may have overlooked?
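For reference, a rough sketch of the chunk-then-deduplicate workaround described above. It assumes Splink 3's DuckDB API (`DuckDBLinker`, `predict`) with an already-trained `settings` dict using `link_type: "link_only"`; the chunk size, threshold, and column names are illustrative, and none of this is a built-in splink feature:

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

CHUNK_SIZE = 200_000  # assumed; size chunks to fit your memory budget

def chunked(df: pd.DataFrame, size: int):
    """Yield successive row slices of df."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

all_pairs = []
for left_chunk in chunked(df_left, CHUNK_SIZE):
    for right_chunk in chunked(df_right, CHUNK_SIZE):
        # `settings` is a pre-trained Splink settings dict with
        # link_type "link_only" (trained once, outside this loop).
        linker = DuckDBLinker(
            [left_chunk, right_chunk],
            settings,
            input_table_aliases=["df_l", "df_r"],
        )
        preds = linker.predict(threshold_match_probability=0.5)
        all_pairs.append(preds.as_pandas_dataframe())

pairs = pd.concat(all_pairs, ignore_index=True)

# Deduplicate across chunk pairs: keep each left record's
# highest-probability match.
best = (
    pairs.sort_values("match_probability", ascending=False)
         .drop_duplicates(subset="unique_id_l")
)
```

Since each candidate pair is scored from its two records alone, the per-pair scores match a single-pass run; the cost is paying Splink's fixed setup overhead once per pair of chunks.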