Change splink default dbfs locations #1825

eric6204 · 2024-01-05T20:08:50Z

eric6204
Jan 5, 2024

Is there a way to change where splink writes its data files when using the spark backend? Currently the files are written to dbfs which on Azure is outside my companies control plane. I would like to write these files to ADLS2. If splink cant use ADLS2 directly I can mount it a file mount and write to it as a regular file mount. Is this possible? I haven't seen a configuration that allows this. Any help would be appreciated.

Answered by RobinL

Jan 18, 2024

Unfortunately there isn't a way within Splink to do that. I'm not a databricks user so some of this stuff is outside my expertise.

Whilst it's not supported by the released versions of Splink if you manually edit the source code here:

splink/splink/spark/linker.py

Line 373 in 615a1ba

checkpoint_dir = self._get_checkpoint_dir_path(spark_df)

then you could hard code a path wherever you wanted and presumably it should work ('presumably' because I don't really know anything about abfss and dbfs, so i'm guessing a bit)

View full answer

RobinL · 2024-01-05T21:45:26Z

RobinL
Jan 5, 2024
Maintainer

I'm not sure I know enough about the difference to understand the question. But Splink should only be writing out to whatever directory you set as your checkpoint dir e.g.

spark.sparkContext.setCheckpointDir("./tmp_checkpoints")

Can you point that to the desired location?

E.g. In our production jobs we point that to a location in AWS S3

6 replies

eric6204 Jan 17, 2024
Author

Changing the checkpoint dir did not change the output path of the data. It still wrote to dbfs.

I tried using the ADLS path but it is not understood by the spark context. The directory must be an HDFS path and not a path to Azure Blob Storage. I created a mount point that maps ADLS as a file path. When I use it the context creates a filepath in dbfs instead of using the mount point.

RobinL Jan 18, 2024
Maintainer

Setting spark.sparkContext.setCheckpointDir() is a spark thing as opposed to Splink - i.e. Splink (should) just use whatever Spark checkpoint dir is set by the user.

Could you check whether the checkpoint dir has been set correctly in Spark.

one way to do this would be to run something like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CheckpointTest").getOrCreate()

spark.sparkContext.setCheckpointDir("./hello")

data = [("Alice", 1), ("Bob", 2), ("Carol", 3)]
columns = ["Name", "ID"]
dummy_df = spark.createDataFrame(data, columns)

checkpoint_df = dummy_df.checkpoint()

checkpoint_df.show()

and see if the data is written where you expect

eric6204 Jan 18, 2024
Author

It appears that there is a setting somewhere that is placing all checkpointed data in dbfs. It may be something specific to Azure Databricks. Your code above writes the data to dbfs

I typically have to explicitly write my data to ADLS and then create my table in the hive metastore using that path. I guess that's the question I am asking can I explicitly do something like this:

`CHECKPOINT_PATH=f"abfss://{splink_container}@{storage_account}/delta/splink/_checkpoints"
PATH=f"abfss://{splink_container}@{storage_account}/delta/splink"

(df.write.format('delta')
.option('checkpointLocation', CHECKPOINT_PATH)
.option('overwriteSchema', 'true')
.mode('overwrite')
.save(PATH))`

RobinL Jan 18, 2024
Maintainer

Unfortunately there isn't a way within Splink to do that. I'm not a databricks user so some of this stuff is outside my expertise.

Whilst it's not supported by the released versions of Splink if you manually edit the source code here:

splink/splink/spark/linker.py

Line 373 in 615a1ba

checkpoint_dir = self._get_checkpoint_dir_path(spark_df)

then you could hard code a path wherever you wanted and presumably it should work ('presumably' because I don't really know anything about abfss and dbfs, so i'm guessing a bit)

Answer selected by eric6204

eric6204 Feb 14, 2024
Author

You’re suggestion worked. I added a new attribute to the init for the checkpoint_dir.

Then I can use to pass in the location that I want to write data out. I changed the parquet and delta data writes. I use mostly delta, but parquet is acceptable.

I also used the save method that takes a path:
spark_df.write.mode("overwrite").format("delta").save(write_path)

Lastly, I found that the spark_df: SplinkDataFrame is able to understand adsl and dfbs urls but not dbfs mounts point. So, I added a method to check my input checkpoint_dir and if it was a mount, I converted it to a dbfs url.

Now any of the following URLs types can be used with the parquet or delta_lake_files break lineage methods.

/dbfs/mnt/splink/eric/splink/test1/
dbfs:/mnt/splink/eric/splink/test2/
abfss://splink@<storage_account>/eric/splink/test3

Use any of this you need to give additional functionality to your tool. And thank so much with your help!

sthamodh Mar 7, 2024

Hi @eric6204 ,

Thank you for this explanation. I believe this is connected to the the discussion page I have opened: https://github.com/moj-analytical-services/splink/discussions/2039

Would you happen to know why splink would fail on a step where it is looking for a given table?

Thanks,

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change splink default dbfs locations #1825

{{title}}

Replies: 1 comment 6 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Change splink default dbfs locations #1825

eric6204 Jan 5, 2024

Replies: 1 comment · 6 replies

RobinL Jan 5, 2024 Maintainer

eric6204 Jan 17, 2024 Author

RobinL Jan 18, 2024 Maintainer

eric6204 Jan 18, 2024 Author

RobinL Jan 18, 2024 Maintainer

eric6204 Feb 14, 2024 Author

sthamodh Mar 7, 2024

eric6204
Jan 5, 2024

Replies: 1 comment 6 replies

RobinL
Jan 5, 2024
Maintainer

eric6204 Jan 17, 2024
Author

RobinL Jan 18, 2024
Maintainer

eric6204 Jan 18, 2024
Author

RobinL Jan 18, 2024
Maintainer

eric6204 Feb 14, 2024
Author