-
Is there a way to change where Splink writes its data files when using the Spark backend? Currently the files are written to DBFS, which on Azure is outside my company's control plane. I would like to write these files to ADLS2. If Splink can't use ADLS2 directly, I can mount it as a file mount and write to it as a regular file path. Is this possible? I haven't seen a configuration option that allows this. Any help would be appreciated.
-
I'm not sure I know enough about the difference to understand the question, but Splink should only write out to whatever directory you set as your checkpoint dir, e.g. `spark.sparkContext.setCheckpointDir("./tmp_checkpoints")`. Can you point that to the desired location? For example, in our production jobs we point it to a location in AWS S3.
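For illustration, a minimal sketch of what that could look like; the bucket, container, storage account and mount names are all placeholders, and it assumes the cluster already has credentials to write to the location you pick:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Splink's Spark backend writes its intermediate files to Spark's checkpoint
# directory, so point that at storage you control (placeholder paths below):

# AWS S3:
# spark.sparkContext.setCheckpointDir("s3://my-bucket/splink_checkpoints")

# ADLS2 directly, if the cluster is configured with credentials for the account:
# spark.sparkContext.setCheckpointDir(
#     "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/splink_checkpoints"
# )

# ADLS2 via a DBFS mount:
spark.sparkContext.setCheckpointDir("/mnt/my_adls_mount/splink_checkpoints")
```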
Unfortunately there isn't a way within Splink to do that. I'm not a Databricks user, so some of this is outside my expertise.
Whilst it's not supported by the released versions of Splink, if you manually edit the source code here:
`splink/splink/spark/linker.py`, line 373 (at commit 615a1ba)
then you could hard-code a path wherever you wanted, and presumably it should work ('presumably' because I don't really know anything about abfss and dbfs, so I'm guessing a bit).
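If editing the source isn't appealing, the mount idea from the original question may be enough on its own, since the checkpoint directory is configurable without touching Splink. A rough sketch of the Databricks side, intended to run in a notebook where `dbutils` is available; the container, storage account, service principal and secret scope names are all placeholders for whatever your workspace uses:

```python
# Mount the ADLS2 container into DBFS using a service principal
# (all angle-bracket values are placeholders).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/splink_data",
    extra_configs=configs,
)

# Then point Splink's checkpoint location at the mounted ADLS2 storage,
# so its intermediate files land inside your control plane rather than DBFS root.
spark.sparkContext.setCheckpointDir("/mnt/splink_data/splink_checkpoints")
```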