Specifying custom comparisons across multiple columns #2174
Replies: 2 comments
-
Take a look at this part of the docs. For example, something like:

```json
{
    "sql_condition": "ein_l = ein_r or (jaro_winkler_sim(AccountName_l, AccountName_r) > 0.8 and jaro_winkler_sim(Address_l, Address_r) > 0.8)",
    "label_for_charts": "ein acc add"
},
```
-
Thank you for your help. I have updated the settings accordingly, but I am now running into an issue where there are no matches at all, even where the AccountName and Address are exactly the same. What could I be doing wrong?

```python
conf = SparkConf()
sc = SparkContext.getOrCreate(conf=conf)
comparisons_dict = {
settings = {
input_table_aliases = ["df_sf", "df_jd"]
```
-
I am struggling to correctly create a comparisons dictionary that considers matching across multiple columns.
```python
comparisons_dict = {
    "output_column_name": "AccountName",
    "comparison_description": "AccountName Comparison",
    "comparison_levels": [
        {
            "sql_condition": '"jaro_winkler_sim(AccountName_l, AccountName_r)" > 0.8 AND "jaro_winkler_sim(Address_l, Address_r)" > 0.8 '
        },
        {"sql_condition": "EIN_l = EIN_r"}
    ]
}
```
Ideally, I want to link records where both the Account Name and the Address meet the similarity threshold, or where the EIN is an exact match. Any help here would be greatly appreciated.
Additionally, I am very lost on when and how to use blocking rules. My dataset is only around 20k rows, so do I even need them in this scenario?
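For scale, a quick back-of-the-envelope sketch of why blocking can still matter at 20k rows, plus a hypothetical blocking-rule fragment. The rule strings and the settings key here are assumptions modelled on Splink's settings dictionary, not taken from this thread:

```python
# Without blocking, every record is compared against every other one,
# so even a modest dataset produces a very large number of pairs.
n = 20_000
total_pairs = n * (n - 1) // 2  # n-choose-2 candidate pairs
print(total_pairs)  # 199,990,000 candidate pairs with no blocking

# A blocking rule restricts comparisons to pairs agreeing on some
# cheap-to-compute field. These example rules are hypothetical.
blocking_rules = [
    "l.EIN = r.EIN",
    "l.Address = r.Address",
]
settings_fragment = {"blocking_rules_to_generate_predictions": blocking_rules}
```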