Error on PySpark deduping example: no jars directory in splink #864
---
I just moved from using splink with DuckDB to using it with Spark. I tried to follow the first notebook cell, but I get an error. I tried removing the problem line, but then I got a different error. So, I looked into what path `similarity_jar_location` gave me. However, looking inside the splink directory there, the `jars` directory does not exist. I also tried just copying the Jaro-Winkler jar to the working directory, but I got the same error. Any clue how I can fix this?

More context if it helps: I'm on Windows 10, Spark 3.3.0, and splink 3.4.1.

Update: even though I've been able to fix the secondary `jars` directory problem, I'm unable to set a checkpoint directory here, so I'm unable to run `linker.estimate_u_using_random_sampling(target_rows=5e5)`: I get a checkpointing error within the `spark_linker` script. It's even more confusing because `setCheckpointDir` succeeds in creating a directory and writing to it, so I'm not sure where the error is.
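For reference, a minimal sketch of the setup being described (not the exact failing cell; the import path follows splink 3.x's documented `similarity_jar_location` helper, and the checkpoint path is illustrative):

```python
# Sketch of the setup described above, under the stated assumptions.
from pyspark.sql import SparkSession
from splink.spark.jar_location import similarity_jar_location

jar_path = similarity_jar_location()
print(jar_path)  # on this machine, points inside site-packages/splink/jars/

spark = (
    SparkSession.builder
    .config("spark.jars", jar_path)  # fails if the jars/ directory is missing
    .getOrCreate()
)

# setCheckpointDir succeeds in creating and writing to the directory,
# yet estimate_u_using_random_sampling still raises a checkpointing error.
spark.sparkContext.setCheckpointDir("./tmp_checkpoints")
```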
---
Can you either:

- put the jar in the path `d:\Users\mmagoffin\AppData\Local\hatch\env\virtual\ds-misc-mmagoffin-S1xDGiIH\ds-misc-mmagoffin\lib\site-packages\splink\jars/`, or
- change the path (where you currently store the output of the function `similarity_jar_location`) to your working directory? A sketch of the second option follows below.

Also, a warning: when you define settings for the Spark session, you need to restart your kernel in order for these to be updated. If you just change the path and re-run the cell, you don't get the jars loaded into the session.
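A minimal sketch of the second option, assuming the jar has been copied into the working directory (the jar filename and app name are illustrative, not from the thread). Restart the kernel first, since an already-running session silently keeps its old `spark.jars` setting:

```python
# Sketch: load the similarity jar from the working directory instead of
# splink's site-packages jars/ folder. Run in a fresh kernel so the
# spark.jars config below actually takes effect.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("splink-dedupe")  # illustrative app name
    .config("spark.jars", "./scala-udf-similarity.jar")  # illustrative filename
    .getOrCreate()
)
```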