Those settings are breaking a tiny workload (100 rows) into a vast number of tasks (possibly 400k or more), so the runtime is almost entirely the Spark cluster splitting the job into parts and coordinating their execution. For 100 rows you'd want a value of 1 for all of those settings. For 500m comparisons they're probably still too high (I'd recommend about 200 for parallelism and 200 for shuffle partitions). If I recall correctly, the total number of tasks for the blocking stage is something like parallelism × salting × number of blocking rules.
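That rough formula can be sanity-checked with a few lines of Python. This is only a sketch: the function name and the example multipliers are illustrative, and the exact task count depends on your Splink and Spark versions.

```python
def blocking_stage_tasks(parallelism: int, salting: int, num_blocking_rules: int) -> int:
    """Approximate task count for the blocking stage, per the
    formula above: parallelism x salting x number of blocking rules."""
    return parallelism * salting * num_blocking_rules

# An over-tuned cluster: high parallelism and heavy salting for 100 rows
print(blocking_stage_tasks(20000, 10, 2))  # 400000 tasks

# The suggested settings for ~500m comparisons, same two blocking rules
print(blocking_stage_tasks(200, 1, 2))     # 400 tasks
```

The point of the comparison is that task counts multiply: halving any one factor halves the total, so small workloads need all three kept small.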
---
I've successfully trained a Splink model in PySpark to link two datasets, each containing 1.5 million rows. Now I'm experimenting with the predict() function's runtime using only 100 rows from each table, with 128 cores fully utilized and 512 GB of memory available. I expected a maximum runtime of 30 minutes, but it has been running for more than 3 hours.
To speed up the process, I've already taken several measures.
During training, the blocking rules generate 500 million comparisons, which cannot be reduced further. Additionally, I've investigated whether any single comparison is causing significant delays, as discussed in this post; however, no particular feature appears to be the bottleneck during experimentation.
Given these efforts, I'm now seeking further ways to enhance performance. Any suggestions?
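For context, a minimal sketch of the kind of Spark session configuration involved here. The configuration keys are standard Spark settings; the values of 200 are only illustrative and would need tuning to the actual workload.

```python
# Illustrative Spark tuning for a large pairwise-comparison job.
# Both keys are standard Spark configuration properties.
spark_conf = {
    "spark.default.parallelism": "200",
    "spark.sql.shuffle.partitions": "200",
}

# Applied when building the session (requires pyspark):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("splink-predict")
# for key, value in spark_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()

print(spark_conf["spark.sql.shuffle.partitions"])  # 200
```

If these are left at values sized for a full 1.5m-row run, a 100-row experiment inherits the same partition counts, which may explain disproportionate coordination overhead.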