Error on PySpark deduping example: no jars directory in splink #864
---
I just moved from using splink with DuckDB to using it with Spark. I tried to follow the first notebook cell, but I get an error. I tried removing the problem line, but then I got a different error. So, I looked into what path `similarity_jar_location` gave me. However, looking inside the splink directory there, the `jars` directory does not exist. I also tried just copying the Jaro-Winkler jar to the working directory, but I got the same error. Any clue how I can fix this?

More context if it helps: I'm on Windows 10, Spark 3.3.0, and splink 3.4.1.

Update: even though I've been able to fix the secondary `jars` directory problem, I'm unable to set a checkpoint directory here, so I'm unable to run `linker.estimate_u_using_random_sampling(target_rows=5e5)`: I get a checkpointing error within the `spark_linker` script. It's even more confusing because `setCheckpointDir` succeeds in creating a directory and writing to it, so I'm not sure where the error is.
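For reference, a minimal sketch of the setup being described (not the exact failing cell; the import path follows splink 3.x's documented `similarity_jar_location` helper, and the checkpoint path is illustrative):

```python
# Sketch of the setup described above, under the stated assumptions.
from pyspark.sql import SparkSession
from splink.spark.jar_location import similarity_jar_location

jar_path = similarity_jar_location()
print(jar_path)  # on this machine, points inside site-packages/splink/jars/

spark = (
    SparkSession.builder
    .config("spark.jars", jar_path)  # fails if the jars/ directory is missing
    .getOrCreate()
)

# setCheckpointDir succeeds in creating and writing to the directory,
# yet estimate_u_using_random_sampling still raises a checkpointing error.
spark.sparkContext.setCheckpointDir("./tmp_checkpoints")
```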
---
Can you either:

- put the jar in the path `d:\Users\mmagoffin\AppData\Local\hatch\env\virtual\ds-misc-mmagoffin-S1xDGiIH\ds-misc-mmagoffin\lib\site-packages\splink\jars/`, or
- change the path (where you currently store the output of the function `similarity_jar_location`) to your working directory? A sketch of the second option follows below.

Also, a warning: when you define settings for the Spark session, you need to restart your kernel in order for these to be updated. If you just change the path and re-run the cell, you don't get the jars loaded into the session.
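A minimal sketch of the second option, assuming the jar has been copied into the working directory (the jar filename and app name are illustrative, not from the thread). Restart the kernel first, since an already-running session silently keeps its old `spark.jars` setting:

```python
# Sketch: load the similarity jar from the working directory instead of
# splink's site-packages jars/ folder. Run in a fresh kernel so the
# spark.jars config below actually takes effect.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("splink-dedupe")  # illustrative app name
    .config("spark.jars", "./scala-udf-similarity.jar")  # illustrative filename
    .getOrCreate()
)
```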