
How to create blocking rules across differently named columns? #1960

Answered by RobinL
JosephKuchar asked this question in Q&A

Hiya,

Yes, this is possible. Here's a runnable example:

(Note that when you provide a blocking rule as a string, e.g. "l.city = r.city2", under the hood it becomes the condition of a SQL join: INNER JOIN ... ON l.city = r.city2.)

import random

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000


# Create a second, differently named column by sampling values at random from "city"
df["city2"] = df["city"].apply(lambda x: random.choice(df["city"]))

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_…

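To make the note about the join condition concrete, here is a minimal sketch using only Python's built-in sqlite3 module rather than Splink itself. The toy rows and the `l.unique_id < r.unique_id` filter are assumptions for illustration, and the SQL that Splink actually generates will differ in detail. It only shows how a rule such as "l.city = r.city2" behaves when used as the ON clause of a self-join.

```python
import sqlite3

# Hypothetical toy records: only the column names city and city2
# come from the example above, the values are made up.
rows = [
    (1, "London", "Leeds"),
    (2, "Leeds", "London"),
    (3, "Bath", "Bath"),
    (4, "York", "Leeds"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (unique_id INTEGER, city TEXT, city2 TEXT)")
con.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)

# The blocking rule string is used verbatim as the join condition.
blocking_rule = "l.city = r.city2"

# Self-join of the table against itself. The unique_id filter is an
# assumption here: it drops self-pairs and mirrored duplicates, as you
# would want when deduplicating a single table.
sql = f"""
SELECT l.unique_id AS id_l, r.unique_id AS id_r
FROM df AS l
INNER JOIN df AS r
  ON {blocking_rule}
WHERE l.unique_id < r.unique_id
ORDER BY id_l, id_r
"""

candidate_pairs = con.execute(sql).fetchall()
print(candidate_pairs)
```

Only pairs where the left record's city equals the right record's city2 survive the join, so the rule compares across the two differently named columns.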
Answer selected by JosephKuchar