Running Splink WITHOUT duckdb installed on Spark EMR #1239
-
Hello. For reasons that are confidential, my organization is not able to install duckdb on AWS EMR, but we still need to run Splink using PySpark. For the most part everything works fine, except the estimate_u_using_random_sampling function, which fails with an error (Traceback (most recent call last): ...). I am wondering if there is a workaround for this? Does duckdb HAVE to be installed for the estimate_u_using_random_sampling function to work?
-
Unfortunately the answer is currently yes. But if I recall correctly, the only dependency on duckdb if you're using Spark is here: splink/splink/expectation_maximisation.py, line 88 (commit 6e3a2e1). If you rewrote that to run using another SQL engine you do have access to (sqlite, polars etc.), or rewrote it to work in pandas, then I believe everything else would work. Note that the same function is also used when training m values using the EM algorithm, i.e. at the moment you'd get the same error on some of the other training steps, but if you modify it to not use duckdb, it should fix those as well. I haven't checked the following code, but ChatGPT-4 thinks this might work:

import sqlite3
def compute_proportions_for_new_parameters(m_u_df):
"""Using the results from compute_new_parameters_sql, compute
m and u
"""
# Create an in-memory SQLite database
conn = sqlite3.connect(":memory:")
m_u_df.to_sql("m_u_df", conn, if_exists="replace", index=False)
sql = """
SELECT
comparison_vector_value,
output_column_name,
m_count * 1.0 / SUM(m_count) OVER (PARTITION BY output_column_name)
AS m_probability,
u_count * 1.0 / SUM(u_count) OVER (PARTITION BY output_column_name)
AS u_probability
FROM m_u_df
WHERE comparison_vector_value != -1
AND output_column_name != '_probability_two_random_records_match'
UNION ALL
SELECT
comparison_vector_value,
output_column_name,
m_count * 1.0 AS m_probability,
u_count * 1.0 AS u_probability
FROM m_u_df
WHERE output_column_name = '_probability_two_random_records_match'
ORDER BY output_column_name, comparison_vector_value ASC
"""
result = conn.execute(sql).fetchall()
# Convert the result to a list of dictionaries
column_names = ["comparison_vector_value", "output_column_name", "m_probability", "u_probability"]
result_dicts = [dict(zip(column_names, row)) for row in result]
# Close the SQLite connection
conn.close()
return result_dicts
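
Alternatively, since a pandas rewrite was mentioned above, here is a rough, untested sketch of the same calculation in pandas. It assumes m_u_df is a pandas DataFrame with comparison_vector_value, output_column_name, m_count and u_count columns; the function name is purely illustrative:

import pandas as pd

def compute_proportions_for_new_parameters_pandas(m_u_df):
    """Pandas-only sketch: normalise m_count and u_count within each
    output_column_name, passing the _probability_two_random_records_match
    row through with its raw counts.
    """
    prob_col = "_probability_two_random_records_match"

    # Ordinary comparison levels: divide each count by its column total
    main = m_u_df[
        (m_u_df["comparison_vector_value"] != -1)
        & (m_u_df["output_column_name"] != prob_col)
    ].copy()
    main["m_probability"] = main["m_count"] / main.groupby("output_column_name")["m_count"].transform("sum")
    main["u_probability"] = main["u_count"] / main.groupby("output_column_name")["u_count"].transform("sum")

    # The match-probability row keeps its raw counts
    prob = m_u_df[m_u_df["output_column_name"] == prob_col].copy()
    prob["m_probability"] = prob["m_count"] * 1.0
    prob["u_probability"] = prob["u_count"] * 1.0

    cols = ["comparison_vector_value", "output_column_name", "m_probability", "u_probability"]
    result = pd.concat([main[cols], prob[cols]]).sort_values(
        ["output_column_name", "comparison_vector_value"]
    )
    return result.to_dict("records")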
If you're able to get this to work, I'd be grateful if you'd confirm, as we'd consider accepting it as a PR, especially if it didn't introduce any new dependencies. We know of at least one other user with the same problem as yourselves.
-
👋 Just to check: are you trying to run splink on locked-down machines that have pre-installed packages, or are you able to run pip installs yourself? I've got a fix to remove the duckdb dependency, but how we manage dependencies for this depends on the above.
-
We ended up using spark.sql and settled on this code, and everything seems to be working. Robin, thanks for leading us in the right direction and for all of your work on this amazing package!

from pyspark.sql import SparkSession

# Get the active SparkSession (on EMR this is typically already available as `spark`)
spark = SparkSession.builder.getOrCreate()
def compute_proportions_for_new_parameters(m_u_df):
"""Using the results from compute_new_parameters_sql, compute
m and u
"""
# create temporary view from Spark DataFrame
m_u_df.createOrReplaceTempView("m_u_df")
# Now we can use Spark SQL
result_df = spark.sql("""
SELECT
comparison_vector_value,
output_column_name,
m_count/SUM(m_count) OVER (PARTITION BY output_column_name) AS m_probability,
u_count/SUM(u_count) OVER (PARTITION BY output_column_name) AS u_probability
FROM m_u_df
WHERE comparison_vector_value != -1
AND output_column_name != '_probability_two_random_records_match'
UNION ALL
SELECT
comparison_vector_value,
output_column_name,
m_count AS m_probability,
u_count AS u_probability
FROM m_u_df
WHERE output_column_name = '_probability_two_random_records_match'
ORDER BY output_column_name, comparison_vector_value ASC
""")
    # Convert the Spark result to a list of dicts, matching the return
    # type of the original function (requires pandas on the driver)
    return result_df.toPandas().to_dict("records")
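
For illustration, a hypothetical call of that function, assuming the spark session above and a toy DataFrame with the columns the query expects (the values here are made up), would look something like:

# Hypothetical input: one comparison column plus the match-probability row
rows = [
    (0, "first_name", 5.0, 95.0),
    (1, "first_name", 95.0, 5.0),
    (-1, "_probability_two_random_records_match", 0.01, 0.99),
]
m_u_df = spark.createDataFrame(
    rows, ["comparison_vector_value", "output_column_name", "m_count", "u_count"]
)

records = compute_proportions_for_new_parameters(m_u_df)
# records is a list of dicts, one per row, with m_probability and u_probability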
-
@JustinWinthers @JoeGanser To update, and for future readers of this discussion - it's now possible 'officially' to install Splink without duckdb: |
https://moj-analytical-services.github.io/splink/installations.html#duckdb-less-installation