Evaluation from ground truth column does not work without blocking rules specified #2274

jfoster17 · 2024-07-19T21:28:33Z

What happens?

I was attempting to follow the "Evaluation from ground truth column" tutorial with some of my own data. My data is relatively small and has no easy-to-express blocking rules, so I set up my linker without any blocking_rules. The model seemed to train just fine, but when I attempted to evaluate against my ground truth column, I got an SQL error that was initially opaque to me: Error was: Binder Error: Referenced column "match_key" not found in FROM clause!

It would be nice if the evaluation function did not strictly require blocking rules.

To Reproduce

This can be reproduced from the tutorial data by simply removing the blocking_rules, i.e. leaving blocking_rules_to_generate_predictions empty.

from splink.datasets import splink_datasets
import altair as alt
alt.renderers.enable("html")

df = splink_datasets.fake_1000

df.head(2)

from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        # Both tutorial rules are commented out to reproduce the error:
        # block_on("first_name"),
        # block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings, set_up_basic_logging=False)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)

linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)

session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(block_on("email"))

linker.truth_space_table_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1
).as_pandas_dataframe(limit=5)

This generates a long error trace, including the SQL used for the calculation, that ends with:

Error was: Binder Error: Referenced column "match_key" not found in FROM clause!
Candidate bindings: "__splink__df_predict_4ff203160.match_weight"
LINE 10:     not (cast(match_key as int) = 0)
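
The failing expression can be illustrated outside Splink. The sketch below is an analogy only: it uses Python's built-in sqlite3 rather than DuckDB, and the table name df_predict and the helper evaluate are hypothetical, not Splink's own. It shows that the expression from LINE 10 fails when the table has only a match_weight column and runs once a match_key column is present. The SQLite error text differs from DuckDB's Binder Error.

```python
import sqlite3


def evaluate(columns):
    """Run the LINE 10 expression against a one-row table with the given columns.

    Returns "ok" if the query runs, otherwise the database error message.
    Hypothetical helper for illustration; this is not Splink code.
    """
    conn = sqlite3.connect(":memory:")
    try:
        col_defs = ", ".join(f"{name} REAL" for name in columns)
        conn.execute(f"CREATE TABLE df_predict ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f"INSERT INTO df_predict VALUES ({placeholders})",
            [0] * len(columns),
        )
        # Same shape as the expression quoted in the error trace.
        conn.execute(
            "SELECT NOT (CAST(match_key AS INT) = 0) FROM df_predict"
        ).fetchall()
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()


# Only match_weight is available, as in the "Candidate bindings" line above.
print(evaluate(["match_weight"]))
# With a match_key column present, the same expression runs.
print(evaluate(["match_weight", "match_key"]))
```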

OS:

Mac OS 13.5

Splink version:

3.9.14

Have you tried this on the latest `master` branch?

I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I agree

RobinL · 2024-07-20T07:55:48Z

Thanks for the report - yeah, this def looks like something we should fix
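
Until a fix is available, one possible workaround is to pass an explicit catch-all blocking rule instead of an empty list. This is an untested sketch: it assumes that Splink accepts the raw SQL string "1=1" as a blocking rule and that having at least one rule causes the match_key column to be produced. Neither assumption is confirmed in this thread. The helper name with_fallback_blocking_rule and the constant CATCH_ALL_RULE are hypothetical.

```python
# Assumption: a condition that is always true compares every pair of records,
# which is what an empty rule list was meant to achieve on small data.
CATCH_ALL_RULE = "1=1"


def with_fallback_blocking_rule(settings):
    """Return a copy of a settings dict with a non-empty prediction rule list.

    Hypothetical helper, not part of Splink. If the settings already contain
    blocking rules, they are kept unchanged.
    """
    patched = dict(settings)  # shallow copy; the caller's dict is not mutated
    rules = list(patched.get("blocking_rules_to_generate_predictions") or [])
    if not rules:
        rules = [CATCH_ALL_RULE]
    patched["blocking_rules_to_generate_predictions"] = rules
    return patched


settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [],
}
patched_settings = with_fallback_blocking_rule(settings)
print(patched_settings["blocking_rules_to_generate_predictions"])
```

The patched dict would then be passed to DuckDBLinker in place of the original settings. A catch-all rule scores every pair of records, so this is only reasonable for small datasets such as the one described above.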

jfoster17 added the bug label (Something isn't working) on Jul 19, 2024