Specifying custom comparisons across multiple columns #2174
Replies: 2 comments
-
Take a look at this part of the docs. For example, something like:

```json
{
    "sql_condition": "ein_l = ein_r or (jaro_winkler_sim(AccountName_l, AccountName_r) > 0.8 and jaro_winkler_sim(Address_l, Address_r) > 0.8)",
    "label_for_charts": "ein acc add"
},
```
-
Thank you for your help. I have updated the settings accordingly, but I am now running into an issue where there are no matches at all, even where the AccountName and Address are exactly the same. What could I be doing wrong?

```python
conf = SparkConf()
sc = SparkContext.getOrCreate(conf=conf)
comparisons_dict = {
settings = {
input_table_aliases = ["df_sf", "df_jd"]
```
-
I am struggling to correctly create a comparisons dictionary that considers matching across multiple columns.
```python
comparisons_dict = {
    "output_column_name": "AccountName",
    "comparison_description": "AccountName Comparison",
    "comparison_levels": [
        {
            "sql_condition": '"jaro_winkler_sim(AccountName_l, AccountName_r)" > 0.8 AND "jaro_winkler_sim(Address_l, Address_r)" > 0.8 '
        },
        {"sql_condition": "EIN_l = EIN_r"}
    ]
}
```
Ideally, I want to link records where both the Account Name and the Address meet the similarity threshold, or where the EIN is an exact match. Any help here would be greatly appreciated.
Additionally, I am very lost on when and how to use blocking rules. My dataset is only around 20k rows, so do I even need them in this scenario?
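For scale, a quick back-of-the-envelope sketch of why blocking can still matter at 20k rows, plus a hypothetical blocking-rule fragment. The rule strings and the settings key here are assumptions modelled on Splink's settings dictionary, not taken from this thread:

```python
# Without blocking, every record is compared against every other one,
# so even a modest dataset produces a very large number of pairs.
n = 20_000
total_pairs = n * (n - 1) // 2  # n-choose-2 candidate pairs
print(total_pairs)  # 199,990,000 candidate pairs with no blocking

# A blocking rule restricts comparisons to pairs agreeing on some
# cheap-to-compute field. These example rules are hypothetical.
blocking_rules = [
    "l.EIN = r.EIN",
    "l.Address = r.Address",
]
settings_fragment = {"blocking_rules_to_generate_predictions": blocking_rules}
```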