Splink 4.0.0: How to debug value error in training.py? #2316
On my test data, this code worked in Splink 3.9.15:
The output was:
But this equivalent "training" code fails in Splink 4.0.0:
The error message is:
This is the affected code in training.py:
I arbitrarily tried several other "deterministic_rules", but the error message is the same in Splink 4.0.0. How can I easily calculate
Replies: 1 comment 20 replies
Thanks for the report. I'm not sure. The error message suggests there are no observed matches since that appears to be the only way you can get a value of I would start by adding these lines
to here:
Here's a testing script I wrote to try and produce a reprex (minimal reproducible example), but I can't reproduce the error, and it seems to give the right answer:
testing script
import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

# Two small input tables; data_2 repeats the names of two people in data_1
# fmt: off
data_1 = [
    {"unique_id":1, "first_name": "Alice", "surname": "Smith", "dob": "1970-01-01", "city": "London"},
    {"unique_id":2, "first_name": "Bob", "surname": "Jones", "dob": "1984-02-02", "city": "London"},
    {"unique_id":3, "first_name": "Dave", "surname": "Smith", "dob": "1970-01-02", "city": "London"},
]
data_2 = [
    {"unique_id":4, "first_name": "Alice", "surname": "Smith", "dob": "1971-05-05", "city": "London"},
    {"unique_id":5, "first_name": "Bob", "surname": "Jones", "dob": "1990-10-10", "city": "London"},
]
# fmt: on

df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
df_all = pd.concat([df_1, df_2])

# Case 1: dedupe_only on the combined table, deterministic rule on first_name
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
    ],
)
linker = Linker(df_all, settings, db_api)
deterministic_rules = [
    block_on("first_name"),
]
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules,
    recall=1.0,
)

# Case 2: link_only across the two tables, same rule on first_name
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
    ],
)
linker = Linker([df_1, df_2], settings, db_api)
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules,
    recall=1.0,
)

# Case 3: link_and_dedupe across the two tables, same rule on first_name
settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
    ],
)
linker = Linker([df_1, df_2], settings, db_api)
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules,
    recall=1.0,
)

# Case 4: link_and_dedupe, deterministic rule on dob, recall below 1
settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
    ],
)
linker = Linker([df_1, df_2], settings, db_api)
deterministic_rules = [
    block_on("dob"),
]
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules,
    recall=0.85,
)

# Case 5: link_and_dedupe, deterministic rule on city, recall below 1
settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
    ],
)
linker = Linker([df_1, df_2], settings, db_api)
deterministic_rules = [
    block_on("city"),
]
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules,
    recall=0.85,
)
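
To sanity-check what the testing script should report, the arithmetic can be reproduced by hand without Splink. The following is only a rough sketch: it assumes the estimate is (pairs matched by the deterministic rule) / recall / (total pairs compared), that link_only compares pairs across the two tables, and that dedupe_only and link_and_dedupe compare every pair in the combined records. `candidate_pairs` and `estimate` are hypothetical helpers written for this illustration, not Splink functions, and Splink's own implementation may differ in detail.

```python
from itertools import combinations, product

# Toy records taken from the testing script (surname omitted for brevity).
data_1 = [
    {"unique_id": 1, "first_name": "Alice", "dob": "1970-01-01", "city": "London"},
    {"unique_id": 2, "first_name": "Bob", "dob": "1984-02-02", "city": "London"},
    {"unique_id": 3, "first_name": "Dave", "dob": "1970-01-02", "city": "London"},
]
data_2 = [
    {"unique_id": 4, "first_name": "Alice", "dob": "1971-05-05", "city": "London"},
    {"unique_id": 5, "first_name": "Bob", "dob": "1990-10-10", "city": "London"},
]


def candidate_pairs(link_type):
    """Hypothetical helper: list the record pairs assumed to be compared."""
    if link_type == "link_only":
        # Assumption: only cross-table pairs are compared.
        return list(product(data_1, data_2))
    # Assumption: dedupe_only and link_and_dedupe compare all pairs
    # in the combined set of records.
    return list(combinations(data_1 + data_2, 2))


def estimate(link_type, column, recall):
    """Return (observed matches, total pairs, estimated match probability)."""
    pairs = candidate_pairs(link_type)
    observed = sum(1 for a, b in pairs if a[column] == b[column])
    return observed, len(pairs), observed / recall / len(pairs)


for link_type, column, recall in [
    ("dedupe_only", "first_name", 1.0),
    ("link_only", "first_name", 1.0),
    ("link_and_dedupe", "first_name", 1.0),
    ("link_and_dedupe", "dob", 0.85),
]:
    observed, total, prob = estimate(link_type, column, recall)
    print(f"{link_type:16} {column:11} observed={observed} total={total} prob={prob:.4f}")
```

Under these assumptions, the first_name rule finds two matching pairs (the two Alices and the two Bobs) in every link type, while the dob rule finds none because all five dates of birth differ, so its estimate is zero. Running the same kind of count on the real data is a quick way to check whether a deterministic rule observes any matches at all.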
I have tried my test script above (the one that creates
ssn_1 = np.random.randint(1, 670251, size=670251)
etc.) on requirements.txt
.requirements.txt
because it's pretty heavyweight and there's not much space on my old Windows laptop)
I can't replicate the error on either.
What happens if you create a fresh venv, install only splink, and try the test script? Do you still get the same error? How about on a different version of Python?
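
When comparing a fresh venv against the environment that fails, it can help to print the interpreter and package versions in each and diff the output. Here is a small sketch using only the standard library; `environment_report` is a hypothetical helper, and the package list is an assumption about what is relevant here.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return the Python version, interpreter path, and installed package versions.

    A package that is not installed is reported as None.
    """
    report = {"python": platform.python_version(), "executable": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Assumed list of packages worth comparing; adjust as needed.
for key, value in environment_report(["splink", "duckdb", "pandas", "numpy"]).items():
    print(f"{key}: {value}")
```

Run it once in each environment (and under each Python version you try) and compare the two printouts.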
A bit stumped on this one!
If you run the test script in the VS Code debugger (Debug -> Start Debugging), then it'd be useful to narrow d…