How to create blocking rules across differently named columns? #1960

JosephKuchar · 2024-02-12T19:26:56Z

JosephKuchar
Feb 12, 2024

I haven't seen this addressed in the documentation, so apologies if it's there and I've just missed it!

I have two datasets with four different columns in each dataset that contain names, and also four different corresponding address/city/postal code columns. I'd like to add blocking rules, basically
city_1_left = city_1_right
city_1_left = city_2_right
city_2_left = city_1_right
city_2_left = city_2_right
etc

Is there a way to reference different column names on each side when setting up blocking rules? All the examples I see in the documentation refer to instances where the columns in the different dataframes have the same names.

This will eventually extend to the comparisons, as I'll need to compare multiple addresses and names against each other, but at the moment I'm just trying to figure out the blocking.

Thanks!

Joseph

Answered by RobinL

Feb 12, 2024

RobinL · 2024-02-12T20:40:37Z

RobinL
Feb 12, 2024
Maintainer

Hiya,

Yes, this is possible. Here's a runnable example:

(Note that when you provide a blocking rule as a string, e.g.
"l.city = r.city2", under the hood it becomes the condition of a SQL inner join between the left and right records, i.e. INNER JOIN ... ON l.city = r.city2.)
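
As a rough picture of that join, separate from the Splink example further down: the sketch below uses Python's built-in sqlite3 and a few made-up rows, not Splink's own generated SQL. The table name `df`, the rows, and the `l.unique_id < r.unique_id` filter (which keeps each pair once when deduplicating) are assumptions for illustration.

```python
import sqlite3

# Made-up rows for illustration only: (unique_id, city, city2).
rows = [
    (1, "London", "Leeds"),
    (2, "Leeds", "London"),
    (3, "Bristol", "Bristol"),
    (4, "York", "Leeds"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (unique_id INTEGER, city TEXT, city2 TEXT)")
con.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)

# The blocking rule string is dropped in as the join condition.
blocking_rule = "l.city = r.city2"

# Self-join of the table against itself; the WHERE clause (an assumption
# about the dedupe case) removes self-pairs and mirrored duplicates.
sql = f"""
    SELECT l.unique_id, r.unique_id
    FROM df AS l
    INNER JOIN df AS r
    ON {blocking_rule}
    WHERE l.unique_id < r.unique_id
    ORDER BY l.unique_id, r.unique_id
"""
candidate_pairs = con.execute(sql).fetchall()
print(candidate_pairs)
```

Only the pairs whose left-hand `city` equals the right-hand `city2` come back as candidates, which is all a cross-column blocking rule is doing.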

```python
import random

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000


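# Create a second, differently named city column by sampling random cities,
# so there is something to block on across column names.
df["city2"] = df["city"].apply(lambda x: random.choice(df["city"]))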

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.city = r.city",
        "l.city = r.city2",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("city2", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings)

linker.estimate_probability_two_random_records_match([block_on(["first_name", "surname"])], recall=0.8)
linker.estimate_u_using_random_sampling(target_rows=1e6)


df_predict = linker.predict()

```
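
For the case in the question, where every city column on the left should be tried against every city column on the right, the rule strings can be generated rather than typed out. This is a minimal sketch in plain Python; the column names `city_1` and `city_2` are taken from the question and are assumptions about your data, and I haven't run the resulting rules through Splink here.

```python
from itertools import product

# Hypothetical column names from the question; extend the list as needed.
city_columns = ["city_1", "city_2"]

# One rule per (left column, right column) combination,
# e.g. "l.city_1 = r.city_2".
cross_column_rules = [
    f"l.{left} = r.{right}" for left, right in product(city_columns, repeat=2)
]

for rule in cross_column_rules:
    print(rule)
```

The resulting list of strings has the same shape as the "blocking_rules_to_generate_predictions" entry in the settings above. Bear in mind that the number of rules grows with the square of the number of columns, so with four columns per side you would get sixteen rules and correspondingly more candidate pairs.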

1 reply

JosephKuchar Feb 13, 2024
Author

Wonderful, thanks!
