Best way to deal with outputs? #2387
-
TL;DR: What's the best way to deal with the predictions/clusters produced as SplinkDataFrames?

So, I've used Splink 4.0.0 with the duckdb backend to do some deduplication on a data set at work. I've produced predictions and clusters and I'm fairly happy with those results. What's the best way to export those outputs (be it the predictions or clusters) to a database? Do I convert both to Pandas DataFrames (even though this is very inefficient)? Or do I do some cleansing and filtering beforehand?

This isn't a question related to syntax or logic. I'm just curious as to how everyone is using these outputs. Apologies if this is a stupid question...
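(For reference, the pandas route described above would look something like this. A sketch only: `predictions` stands in for the SplinkDataFrame returned by `linker.inference.predict()`, and the connection string and table name are hypothetical.)

```python
from sqlalchemy import create_engine

# `predictions` is the SplinkDataFrame returned by linker.inference.predict().
# Converting to pandas materialises the whole result set in memory first.
predictions_df = predictions.as_pandas_dataframe()

# Hypothetical target database and table name.
engine = create_engine("postgresql://user:password@host:5432/mydb")
predictions_df.to_sql("splink_predictions", engine, if_exists="replace", index=False)
```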
-
It's generally most efficient to extract the results from the `SplinkDataFrame` in its native format - as in, native to the backend you've chosen.

So if you're using the DuckDBLinker, extract them as a `DuckDBPyRelation` using `splink_dataframe.as_duckdbpyrelation()`; or, if you're using the Spark backend, as a Spark dataframe (using `predictions_splink_dataframe.as_spark_dataframe()`).
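For the DuckDB case, a minimal sketch of what that could look like (the database file name and the `df`/`settings` objects are assumptions standing in for your existing pipeline):

```python
import duckdb
from splink import DuckDBAPI, Linker

# Hypothetical: a persistent DuckDB file shared with Splink, so anything we
# materialise lands in the database we want to export to.
con = duckdb.connect("my_database.duckdb")
db_api = DuckDBAPI(connection=con)
linker = Linker(df, settings, db_api=db_api)  # df/settings from your pipeline

predictions = linker.inference.predict()

# Extract natively as a DuckDBPyRelation: no pandas round-trip.
rel = predictions.as_duckdbpyrelation()
rel.create("predictions_export")  # materialise the relation as a table
```

On the Spark backend, the equivalent export would be something like `predictions.as_spark_dataframe().write.saveAsTable("predictions_export")`.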
You can also get the table/view name from the `SplinkDataFrame` using `predictions_splink_dataframe.physical_name` and then just query the database directly, e.g.