Best way to deal with outputs? #2387
-
TL;DR: What's the best way to deal with the predictions/clusters produced as SplinkDataFrames?

So, I've used Splink 4.0.0 with the duckdb backend to do some deduplication on a data set at work. I've produced predictions and clusters and I'm fairly happy with those results. What's the best way to export those outputs (be it the predictions or clusters) to a database? Do I convert both to Pandas DataFrames (even though this is very inefficient)? Or do I do some cleansing and filtering beforehand?

This isn't a question related to syntax or logic. I'm just curious as to how everyone is using these outputs. Apologies if this is a stupid question...
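(For reference, the pandas route described above would look something like this. A sketch only: `predictions` stands in for the SplinkDataFrame returned by `linker.inference.predict()`, and the connection string and table name are hypothetical.)

```python
from sqlalchemy import create_engine

# `predictions` is the SplinkDataFrame returned by linker.inference.predict().
# Converting to pandas materialises the whole result set in memory first.
predictions_df = predictions.as_pandas_dataframe()

# Hypothetical target database and table name.
engine = create_engine("postgresql://user:password@host:5432/mydb")
predictions_df.to_sql("splink_predictions", engine, if_exists="replace", index=False)
```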
-
It's generally most efficient to extract the results from the `SplinkDataFrame` in its native format - as in, native to the backend you've chosen.

So if you're using the DuckDBLinker, extract them as a `DuckDBPyRelation` using `splink_dataframe.as_duckdbpyrelation()`; or, if you're using the Spark backend, as a Spark dataframe (using `predictions_splink_dataframe.as_spark_dataframe()`).
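For the DuckDB case, a minimal sketch of what that could look like (the database file name and the `df`/`settings` objects are assumptions standing in for your existing pipeline):

```python
import duckdb
from splink import DuckDBAPI, Linker

# Hypothetical: a persistent DuckDB file shared with Splink, so anything we
# materialise lands in the database we want to export to.
con = duckdb.connect("my_database.duckdb")
db_api = DuckDBAPI(connection=con)
linker = Linker(df, settings, db_api=db_api)  # df/settings from your pipeline

predictions = linker.inference.predict()

# Extract natively as a DuckDBPyRelation: no pandas round-trip.
rel = predictions.as_duckdbpyrelation()
rel.create("predictions_export")  # materialise the relation as a table
```

On the Spark backend, the equivalent export would be something like `predictions.as_spark_dataframe().write.saveAsTable("predictions_export")`.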
You can also get the table/view name from the `SplinkDataFrame` using `predictions_splink_dataframe.physical_name` and then just query the database directly, e.g.