Evaluation from ground truth column does not work without blocking rules specified #2274

Open
2 tasks done
jfoster17 opened this issue Jul 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@jfoster17

What happens?

I was attempting to follow the "Evaluation from ground truth column" tutorial with some of my own data. My data is relatively small and has no easy-to-express blocking rules, so I set up my linker with an empty blocking_rules_to_generate_predictions list. The model seemed to train just fine, but when I attempted to evaluate against my ground truth column, I got an SQL error that was initially opaque to me: Error was: Binder Error: Referenced column "match_key" not found in FROM clause!

It would be nice if the evaluation function did not strictly require blocking rules.
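Until that is supported, a small pre-check can turn the opaque SQL error into a readable one. The sketch below is only an illustration: `check_blocking_rules_present` is a hypothetical helper, not part of Splink, and it assumes the failure is triggered exactly when `blocking_rules_to_generate_predictions` is empty, as reported here.

```python
def check_blocking_rules_present(settings: dict) -> None:
    """Raise a readable error if no prediction blocking rules are configured.

    Hypothetical helper for illustration only; it is not a Splink function.
    """
    rules = settings.get("blocking_rules_to_generate_predictions") or []
    if len(rules) == 0:
        raise ValueError(
            "blocking_rules_to_generate_predictions is empty; "
            "evaluation against a ground truth column is reported to fail "
            "in this configuration"
        )


# Settings shaped like the ones in this report, with the rules removed.
settings_without_rules = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [],
}

try:
    check_blocking_rules_present(settings_without_rules)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Calling the helper on the settings dictionary before the evaluation step would surface the configuration problem without the SQL trace.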

To Reproduce

This can be reproduced from the tutorial data by simply removing the blocking_rules.

from splink.datasets import splink_datasets
import altair as alt
alt.renderers.enable("html")

df = splink_datasets.fake_1000

df.head(2)

from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    # Blocking rules from the tutorial are commented out to reproduce the error
    "blocking_rules_to_generate_predictions": [
        #block_on("first_name"),
        #block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings, set_up_basic_logging=False)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)

linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)

session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(block_on("email"))

linker.truth_space_table_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1
).as_pandas_dataframe(limit=5)

This generates a long error trace, including the SQL used for the calculation, that ends with:

Error was: Binder Error: Referenced column "match_key" not found in FROM clause!
Candidate bindings: "__splink__df_predict_4ff203160.match_weight"
LINE 10:     not (cast(match_key as int) = 0)
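The shape of this failure can be shown outside Splink. The following is a minimal sketch, not Splink's actual SQL: it uses SQLite from the Python standard library as a stand-in for DuckDB, and the table name `df_predict` is hypothetical. Only the filter expression is taken from the trace above. A query that filters on `match_key` cannot be bound when the table it reads from has no such column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the predictions table: it has match_weight
# but no match_key column, mirroring the candidate bindings in the trace.
conn.execute("create table df_predict (match_weight real)")
conn.execute("insert into df_predict values (3.2), (-1.5)")

# Filter expression copied from the error trace.
query = (
    "select match_weight from df_predict "
    "where not (cast(match_key as int) = 0)"
)

try:
    conn.execute(query).fetchall()
    error_message = None
except sqlite3.OperationalError as exc:
    # SQLite reports the unresolved column by name, as DuckDB does.
    error_message = str(exc)

print(error_message)

# Once the column exists, the same query binds and runs.
conn.execute("alter table df_predict add column match_key text default '1'")
rows = conn.execute(query).fetchall()
print(rows)
```

This only illustrates why the binder rejects the query; it says nothing about where in Splink the `match_key` column is normally produced.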

OS:

Mac OS 13.5

Splink version:

3.9.14

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@jfoster17 jfoster17 added the bug Something isn't working label Jul 19, 2024
@RobinL
Member

RobinL commented Jul 20, 2024

Thanks for the report - yeah, this def looks like something we should fix

2 participants