Those settings are breaking a tiny workload (100 rows) into a vast number of tasks (possibly 400k or more), so the runtime is almost entirely the Spark cluster splitting the job into parts and coordinating their execution. For 100 rows you'd want a value of 1 for all of those settings. For 500m comparisons they're probably still too high (I'd recommend about 200 for parallelism and 200 for shuffle partitions). If I recall correctly, the total number of tasks for the blocking stage is something like parallelism × salting × number of blocking rules.
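That rough formula can be sanity-checked with a few lines of Python. This is only a sketch: the function name and the example multipliers are illustrative, and the exact task count depends on your Splink and Spark versions.

```python
def blocking_stage_tasks(parallelism: int, salting: int, num_blocking_rules: int) -> int:
    """Approximate task count for the blocking stage, per the
    formula above: parallelism x salting x number of blocking rules."""
    return parallelism * salting * num_blocking_rules

# An over-tuned cluster: high parallelism and heavy salting for 100 rows
print(blocking_stage_tasks(20000, 10, 2))  # 400000 tasks

# The suggested settings for ~500m comparisons, same two blocking rules
print(blocking_stage_tasks(200, 1, 2))     # 400 tasks
```

The point of the comparison is that task counts multiply: halving any one factor halves the total, so small workloads need all three kept small.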
---
I've successfully trained a Splink model in PySpark to link two datasets, each containing 1.5 million rows. Now I'm experimenting with the predict() function's runtime using only 100 rows from each table, with 128 cores fully utilized and 512 GB of memory available. I expected a maximum runtime of 30 minutes, but it has been running for more than 3 hours.
To speed up the process, I've already taken several measures.
During training, the blocking rules generate 500 million comparisons, which cannot be reduced further. Additionally, I've investigated whether any single comparison is causing significant delays, as discussed in this post; however, no particular feature appears to be the bottleneck during experimentation.
Given these efforts, I'm now seeking further ways to enhance performance. Any suggestions?
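For context, a minimal sketch of the kind of Spark session configuration involved here. The configuration keys are standard Spark settings; the values of 200 are only illustrative and would need tuning to the actual workload.

```python
# Illustrative Spark tuning for a large pairwise-comparison job.
# Both keys are standard Spark configuration properties.
spark_conf = {
    "spark.default.parallelism": "200",
    "spark.sql.shuffle.partitions": "200",
}

# Applied when building the session (requires pyspark):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("splink-predict")
# for key, value in spark_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()

print(spark_conf["spark.sql.shuffle.partitions"])  # 200
```

If these are left at values sized for a full 1.5m-row run, a 100-row experiment inherits the same partition counts, which may explain disproportionate coordination overhead.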