I'm trying to link two datasets together with 7 blocking rules. The data is quite big (roughly 300K and 400K records), which is why estimating the parameters using EM with DuckDB takes a while. Ideally I'd like to run the training sessions in parallel. The current code:
Replies: 1 comment
DuckDB should, at least in theory, run in parallel within each individual training session, although I recognise in practice it doesn't always use all cores. I think we'll find that as DuckDB matures it becomes better and better at parallelizing queries.

Unfortunately there isn't a straightforward way to run multiple training sessions in parallel, due to the way Splink updates parameter estimates after each training session.

It's also worth noting, from a performance point of view, that the blocking rules for predictions and the blocking rules for EM training might want to be different. See https://moj-analytical-services.github.io/splink/topic_guides/blocking_rules.html

Finally, Spark seems to parallelise better even in local mode, so for larger data Spark can sometimes be faster than DuckDB even on a single machine.
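To sketch the point about keeping prediction and training blocking rules separate: in Splink, the prediction rules live in the settings dictionary, while each EM session is given its own rule when it is launched. The column names below (`surname`, `dob`, `first_name`) are hypothetical, and the commented-out linker calls assume the Splink 3 DuckDB API; adjust both to your data and Splink version.

```python
# Blocking rules for *predictions* belong in the settings dictionary.
# These control which candidate pairs are scored at predict time.
settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
        "l.dob = r.dob",
    ],
}

# Blocking rules for *EM training* are passed per session and can
# (often should) be different - e.g. tighter, to keep each session fast.
training_rules = [
    "l.first_name = r.first_name and l.surname = r.surname",
    "l.dob = r.dob",
]

# With Splink's DuckDB backend this would look roughly like (not run here):
# from splink.duckdb.linker import DuckDBLinker
# linker = DuckDBLinker([df_left, df_right], settings)
# for rule in training_rules:
#     # Sessions run one after another: Splink folds each session's
#     # parameter estimates into the model before the next starts,
#     # which is why they can't simply be run in parallel.
#     linker.estimate_parameters_using_expectation_maximisation(rule)
```

Note the training loop is inherently sequential: each call updates the shared parameter estimates, so the sessions cannot be farmed out to separate processes without changing how Splink combines them.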