Using temporal attributes and manually change ELSE #2362

niquola · 2024-08-28T21:41:01Z


In our use case, records have temporal attributes (attributes that are only valid for a relatively short period).
Disagreement on these attributes should not incur a large penalty.
Is it normal (common) to manually adjust the ELSE level in the model to reduce that penalty?
If so, what is the best way to do it? Change the value in the JSON and load the model again?

RobinL · 2024-09-04T11:39:08Z

Maintainer

Edit: This is now possible in splink==4.0.1, see example here:
#2379

Previous answer:

Yes - it's reasonably common to override the ELSE level of the model. Disagreement penalties are driven by the model's m probabilities, which are particularly prone to being poorly estimated, because they have to be estimated with expectation maximisation and that doesn't always work well.
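
To make the link between m probabilities and the disagreement penalty concrete, here is a minimal sketch. It relies on the standard Fellegi-Sunter relationship that Splink uses, where a comparison level's match weight is log2(m / u). The function name and the example m and u values are illustrative assumptions, not values taken from any trained model.

```python
import math


def match_weight(m_probability: float, u_probability: float) -> float:
    """Match weight of a comparison level: log2 of the Bayes factor m / u."""
    return math.log2(m_probability / u_probability)


# Illustrative (made-up) values for an ELSE level, i.e. "the values disagree".
u_else = 0.95             # non-matching pairs almost always disagree
m_else_estimated = 0.05   # matching pairs rarely disagree -> strong penalty
m_else_override = 0.30    # matching pairs often disagree -> weaker penalty

print("estimated m:", match_weight(m_else_estimated, u_else))
print("overridden m:", match_weight(m_else_override, u_else))
```

For a temporal attribute, where true matches often disagree, moving the ELSE level's m probability closer to its u probability pulls the match weight towards zero, so disagreement counts for less.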

At the moment there isn't a great way to manually set the m values. We hope to add better support for this in future. For the moment, the easiest way is probably to change the JSON and load again, as you suggest.
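
As a rough sketch of the JSON route, the helper below edits the ELSE level's m probability in a settings dictionary. The helper name `set_else_m_probability` is hypothetical, and the key names (`comparisons`, `output_column_name`, `comparison_levels`, `sql_condition`, `m_probability`) are assumptions about the layout of a saved Splink model, so check them against your own JSON file. The `model` dictionary is a heavily trimmed stand-in, not real output.

```python
import copy
import json


def set_else_m_probability(settings: dict, output_column_name: str, m_probability: float) -> dict:
    """Return a copy of `settings` with the ELSE level's m probability replaced."""
    if not 0.0 < m_probability < 1.0:
        raise ValueError("m_probability must be strictly between 0 and 1")
    updated = copy.deepcopy(settings)  # leave the caller's dict untouched
    for comparison in updated["comparisons"]:
        if comparison.get("output_column_name") != output_column_name:
            continue
        for level in comparison["comparison_levels"]:
            # Assumption: the ELSE level is stored with sql_condition "ELSE".
            if level.get("sql_condition", "").strip().upper() == "ELSE":
                level["m_probability"] = m_probability
                return updated
    raise KeyError(f"No ELSE level found for comparison {output_column_name!r}")


# Hypothetical, trimmed stand-in for a saved model (values are made up).
model = {
    "link_type": "dedupe_only",
    "comparisons": [
        {
            "output_column_name": "surname",
            "comparison_levels": [
                {"sql_condition": '"surname_l" = "surname_r"', "m_probability": 0.9, "u_probability": 0.01},
                {"sql_condition": "ELSE", "m_probability": 0.1, "u_probability": 0.99},
            ],
        }
    ],
}

edited = set_else_m_probability(model, "surname", 0.2)
edited_json = json.dumps(edited, indent=2)  # text that could be written back to disk
```

In practice the input dictionary would come from the trained model (for example via `linker.misc.save_model_to_json(...)` in Splink 4, assuming that method is available in your version), and the edited dictionary or JSON file would be passed as the settings when constructing a new `Linker`.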

Here's an alternative that will work for now, but is not part of the public API:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

# Each ExactMatch comparison has two non-null levels: exact match and ELSE.
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    retain_matching_columns=True,
    retain_intermediate_calculation_columns=True,
)

linker = Linker(splink_datasets.fake_1000, settings, db_api)

# Train the model as usual: prior, then u probabilities, then m probabilities via EM.
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")],
    recall=0.7,
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

# Override the trained m probability of the ELSE level of the surname comparison.
# These underscore-prefixed methods and attributes are private and may change.
surname_comparison = linker._settings_obj._get_comparison_by_output_column_name("surname")
# Comparison vector value 0 is the ELSE level.
else_comparison_level = surname_comparison._get_comparison_level_by_comparison_vector_value(0)
# The match weight is log2(m / u): a very small m gives a strong penalty,
# while an m closer to the level's u probability gives a weaker one.
else_comparison_level._m_probability = 0.00001

# Inspect the effect of the override on the match weights.
linker.visualisations.match_weights_chart()
