I'm trying to link two datasets together with 7 blocking rules. The data is quite big (roughly 300K and 400K records), which is why estimating the parameters using EM with DuckDB takes a while. Ideally I'd like to run the training sessions in parallel. The current code:
Replies: 1 comment
DuckDB should, at least in theory, run in parallel within each individual training session, although I recognise in practice it doesn't always use all cores. I think we'll find that as DuckDB matures it becomes better and better at parallelizing queries.

Unfortunately there isn't a straightforward way to run multiple training sessions in parallel, due to the way Splink updates parameter estimates after each training session.

It's also worth noting, from a performance point of view, that the blocking rules for predictions and the blocking rules for EM training might want to be different. See https://moj-analytical-services.github.io/splink/topic_guides/blocking_rules.html

Finally, Spark seems to parallelise better even in local mode, so for larger data Spark can sometimes be faster than DuckDB even on a single machine.
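To sketch the point about keeping prediction and training blocking rules separate: in Splink, the prediction rules live in the settings dictionary, while each EM session is given its own rule when it is launched. The column names below (`surname`, `dob`, `first_name`) are hypothetical, and the commented-out linker calls assume the Splink 3 DuckDB API; adjust both to your data and Splink version.

```python
# Blocking rules for *predictions* belong in the settings dictionary.
# These control which candidate pairs are scored at predict time.
settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
        "l.dob = r.dob",
    ],
}

# Blocking rules for *EM training* are passed per session and can
# (often should) be different - e.g. tighter, to keep each session fast.
training_rules = [
    "l.first_name = r.first_name and l.surname = r.surname",
    "l.dob = r.dob",
]

# With Splink's DuckDB backend this would look roughly like (not run here):
# from splink.duckdb.linker import DuckDBLinker
# linker = DuckDBLinker([df_left, df_right], settings)
# for rule in training_rules:
#     # Sessions run one after another: Splink folds each session's
#     # parameter estimates into the model before the next starts,
#     # which is why they can't simply be run in parallel.
#     linker.estimate_parameters_using_expectation_maximisation(rule)
```

Note the training loop is inherently sequential: each call updates the shared parameter estimates, so the sessions cannot be farmed out to separate processes without changing how Splink combines them.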