Distinguishing between families and twins persons (multi-modal distributions) #2023

RobinL · 2024-03-04T11:39:21Z

RobinL
Mar 4, 2024
Maintainer

A common challenge in Splink is linking data where the entity falls into a hierarchy that makes the entity type unclear. For example, a person is a member of a family.

This is a problem because there's no way of explicitly telling Splink that the entity type is a person, a family, or anything else.

How can we solve this problem?

RobinL · 2024-03-04T11:39:25Z

RobinL
Mar 4, 2024
Maintainer Author

Theoretical background

The underlying likelihood function behind Splink assumes a bimodal distribution of data (two peaks), as described in this article.

However, in the case of families and persons, the distribution is multimodal: the likelihood function probably has two peaks, one where the model is matching families (and so doesn't punish a mismatch on first name), and a second where the model is matching persons (and therefore does punish mismatches on first name).
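For reference, the standard two-class Fellegi-Sunter mixture log-likelihood can be written roughly as below. The notation is an assumption for illustration and is not taken from the linked article: $\lambda$ is the proportion of matching pairs, $\gamma_{ik}$ is the comparison level observed for pair $i$ on comparison $k$, and $m_k$, $u_k$ are the level probabilities among matches and non-matches.

```latex
\ell(\lambda, m, u) = \sum_{i=1}^{N} \log \Big[
    \lambda \prod_{k} m_k(\gamma_{ik})
    + (1 - \lambda) \prod_{k} u_k(\gamma_{ik})
\Big]
```

With only two components, a "family" population has no term of its own, so the match component has to settle on either a person-like or a family-like set of $m$ probabilities.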

A second important theoretical point is that the grouping of records (persons) into higher level entities (twins, families) generally breaks the Fellegi-Sunter assumption of independence of columns conditional on match status.

For example, the Fellegi-Sunter model assumes that the distribution of surname conditional on address is the same as the overall distribution of surname. This is obviously violated when people live in families.

As a result, even if we could solve the problem of 'telling' Splink we're interested in the 'person' entity and not the 'family' entity during training, the Fellegi-Sunter methodology will likely find it difficult to distinguish between family members, especially if address is modelled as a separate comparison to surname.
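To make the independence violation concrete, here is a minimal sketch on a tiny made-up dataset (the records and names are invented for illustration). Every record is a distinct person, so all pairs are non-matches, yet surname agreement is far more likely among pairs that share an address.

```python
from itertools import combinations

# Hypothetical records: (surname, address). Each record is a distinct person.
records = [
    ("Smith", "1 High St"), ("Smith", "1 High St"),
    ("Jones", "2 Low Rd"), ("Jones", "2 Low Rd"),
    ("Patel", "3 Mill Ln"), ("Patel", "3 Mill Ln"),
    ("Smith", "4 Oak Ave"), ("Khan", "5 Elm Cl"),
]

pairs = list(combinations(records, 2))
same_address = [(a, b) for a, b in pairs if a[1] == b[1]]

# Share of pairs agreeing on surname, overall and among same-address pairs
p_surname = sum(a[0] == b[0] for a, b in pairs) / len(pairs)
p_surname_given_address = sum(a[0] == b[0] for a, b in same_address) / len(same_address)

print(f"P(surname agrees)                  = {p_surname:.3f}")
print(f"P(surname agrees | address agrees) = {p_surname_given_address:.3f}")
```

If surname and address were independent among non-matches, the two figures would be roughly equal. Treating them as separate comparisons therefore counts the same evidence twice.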

The twins problem

The problem is most acute in the case of twins, who will share the same:

surname
date of birth
address

Generally the only distinguishing characteristic will be the first name(s).

The problem is that the dataset is likely to contain many other true matches which match on everything except forename.

As a result, there's no perfect solution. You either:

Excessively punish mismatches on first name, in which case you resolve the twins, but miss true matches that have typos in first names
Have a modest punishment for mismatches on first name, which results in false positive matches on twins, but caters better for genuine typos in first names.
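The trade-off can be sketched numerically. The Bayes factors and prior below are made-up illustrative values, not estimates from any real model. A pair of twins and a single person with a badly mistyped first name produce the same comparison vector (first name disagrees, everything else agrees), so whichever penalty is chosen applies to both.

```python
import math

def match_probability(prior, bayes_factors):
    """Combine a prior match probability with per-comparison Bayes factors."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * math.prod(bayes_factors)
    return posterior_odds / (1 + posterior_odds)

prior = 0.01                       # hypothetical prior probability of a match
agreeing = [50.0, 200.0, 100.0]    # hypothetical Bayes factors: surname, dob, address agree

modest_penalty = 0.1               # first-name mismatch is mild evidence against a match
harsh_penalty = 1e-6 / 0.9         # first-name mismatch is overwhelming evidence against

p_modest = match_probability(prior, agreeing + [modest_penalty])
p_harsh = match_probability(prior, agreeing + [harsh_penalty])

print(f"modest penalty: {p_modest:.4f}")  # twins and typo pairs both score high
print(f"harsh penalty:  {p_harsh:.4f}")   # twins and typo pairs both score low
```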

Possible solution

I'm not aware of any perfect solution to this problem but I think there are broadly two techniques to help mitigate the problem:

1. Where columns are hopelessly correlated for your entity type (e.g. address and surname when you have lots of families in your dataset), model them as a single Comparison, rather than two separate Comparisons, to avoid double counting.
2. Manually override match weights on specific columns to manually 'separate' problematic links.
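As a rough sketch of the first technique, surname and address could be folded into one comparison using Splink's dictionary format for custom comparisons. The `address` column, the level labels and the choice of levels are assumptions for illustration, and the exact keys should be checked against the documentation for your Splink version.

```python
# One comparison covering surname AND address, so the model learns a single
# weight for each joint outcome instead of adding two correlated weights.
surname_address_comparison = {
    "output_column_name": "surname_address",
    "comparison_description": "Surname and address modelled jointly",
    "comparison_levels": [
        {
            "sql_condition": (
                "surname_l IS NULL OR surname_r IS NULL "
                "OR address_l IS NULL OR address_r IS NULL"
            ),
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "surname_l = surname_r AND address_l = address_r",
            "label_for_charts": "Surname and address match",
        },
        {
            "sql_condition": "surname_l = surname_r",
            "label_for_charts": "Surname match only",
        },
        {
            "sql_condition": "address_l = address_r",
            "label_for_charts": "Address match only",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}

# The dictionary would take the place of separate surname and address
# entries in the "comparisons" list of the settings.
```

Levels are evaluated in order, so "Surname match only" is reached only by pairs that did not satisfy the joint level above it.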

On (2), that could involve manually inserting a very negative match weight on a mismatch on first name.
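To see what "very negative" means in practice, recall that the match weight for a comparison level is log2(m / u). A minimal sketch, using illustrative m and u values rather than trained ones:

```python
import math

def match_weight(m_probability, u_probability):
    """Match weight of a comparison level: log2 of its Bayes factor m/u."""
    return math.log2(m_probability / u_probability)

# Illustrative values: a mismatch is almost never seen among true matches (m)
# but is very common among non-matches (u).
weight = match_weight(1e-6, 0.9)
print(f"match weight = {weight:.2f}")
```

A weight of around -20 on this scale is enough to outweigh strong agreement on several other columns.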

I have also heard there may be ways of extending the likelihood function to explicitly account for a multi-modal (non-binary) classification of entities, which would allow a model to simultaneously model families and persons. But I don't know how that would be done in practice.

2 replies

RobinL Mar 4, 2024
Maintainer Author

Example of manually overriding match weights:

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
}


linker = DuckDBLinker(df, settings)


linker.estimate_probability_two_random_records_match(
    [
        block_on(["first_name", "surname"]),
    ], recall=0.7
)

linker.estimate_u_using_random_sampling(target_rows=1e6)


blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

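# Manually override the trained m and u probabilities on the first_name comparison.
# Assuming the default level order (0 = null, 1 = exact match, 2 = Levenshtein <= 2,
# 3 = all other comparisons), levels 2 and 3 cover every non-exact first name.
# Note: _settings_obj is a private attribute, so this may change between versions.
first_name_levels = linker._settings_obj.comparisons[0].comparison_levels

for level in (first_name_levels[2], first_name_levels[3]):
    level.m_probability = 1e-6
    level.u_probability = 0.9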

linker.match_weights_chart()

df_predict = linker.predict()

zmbc Mar 11, 2024

While there is no perfect solution, I'm not sure the situation is always quite as bad as this example. A partial match on first name still seems like it could be quite helpful -- true matches are unlikely to have totally different first names, especially if you are already accounting for nicknames and name switches. And if one of your datasets is known to have no duplicates, that could help substantially with pruning out improper matches between twins!

ianiredan · 2024-03-11T16:02:34Z

ianiredan
Mar 11, 2024

Your example on manually setting an m and u probability is helpful for the dataset I am working on to try to separate out twins from duplicates, thank you.

0 replies

samkodes · 2024-03-15T22:38:06Z

samkodes
Mar 15, 2024

Neat example - a couple of thoughts.

First, I have had a thought or two about setting prior distributions on m-probabilities for comparisons. This would be a little more squishy than manually over-riding after model fitting, but could be used to inform the EM.

Dirichlet distributions are typically used for this sort of thing (https://en.wikipedia.org/wiki/Dirichlet_distribution). A Dirichlet distribution of order K is a probability distribution on the (K-1)-simplex; for K=2 you get a Beta distribution.

The nice property of Bayesian inference with Dirichlet distributions is that there is an easily interpretable conjugate posterior.
So, for example, if our prior is Dirichlet(a_1, ..., a_m, ..., a_K) and we observe one data point in category m, the posterior is Dirichlet(a_1, ..., a_m + 1, ..., a_K). This equation generalizes to any number of observations - we simply increment the corresponding entry in the parameter vector by the number of observations with that value. These parameter vectors can be normalized to get the mean of the distribution, i.e. the mean is (a_1/sum(a_i), ..., a_K/sum(a_i)).
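The conjugate update described above is short enough to sketch directly. The prior and the counts are made-up numbers, and the three categories are assumed to be comparison levels such as exact match, fuzzy match and everything else.

```python
def dirichlet_posterior(alpha, counts):
    """Add observed category counts to the Dirichlet parameter vector."""
    return [a + c for a, c in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Normalise the parameter vector to get the mean of the distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [8.0, 1.0, 1.0]   # hypothetical prior, worth 10 pseudo-observations
counts = [2, 0, 8]        # hypothetical observed counts per category

posterior = dirichlet_posterior(prior, counts)
print(posterior)                  # parameter vector after the update
print(dirichlet_mean(prior))      # prior mean
print(dirichlet_mean(posterior))  # posterior mean
```

With as many observations as pseudo-observations, the posterior mean lands halfway between the prior mean and the observed frequencies.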

So in Splink, for each comparison the settings object could specify a simple vector of positive numbers, or more interpretably a probability vector with a positive "weight" used to scale it. The "weight" would be the strength of the prior measured in terms of equivalent observation points. This would parametrize the Dirichlet prior used in EM. At each round of the EM the posterior mean could be used to predict, and then the posterior mean could be updated using the expression above where observations are weighted by their match probability.
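A minimal sketch of what one such update could look like for the m-probabilities of a single comparison. This is not an existing Splink feature; the function name, its arguments and the numbers are all hypothetical.

```python
def m_step_with_prior(prior_probs, prior_weight, observed_levels, match_probs):
    """Posterior-mean update of m-probabilities for one comparison.

    prior_probs:     prior probability of each comparison level among matches
    prior_weight:    strength of the prior, in equivalent observations
    observed_levels: comparison level index observed for each record pair
    match_probs:     current match probability of each pair (from the E-step)
    """
    # Dirichlet parameters implied by the prior
    alpha = [prior_weight * p for p in prior_probs]
    # Each pair contributes a fractional observation equal to its match probability
    for level, weight in zip(observed_levels, match_probs):
        alpha[level] += weight
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical prior: among matches, first names almost always agree exactly
prior_probs = [0.90, 0.09, 0.01]   # exact, fuzzy, everything else
m_probs = m_step_with_prior(
    prior_probs,
    prior_weight=100,
    observed_levels=[0, 0, 2, 1],
    match_probs=[1.0, 0.5, 0.5, 1.0],
)
print([round(p, 4) for p in m_probs])
```

A large `prior_weight` keeps the estimate close to the prior even when the data disagree, while a small one lets EM move freely.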

This would allow us to pre-specify the confidence we have in the m-parameters having a certain structure.

Second. Your idea of multi-modal extensions is very interesting!!!! Conceptually this would not be a big challenge to implement (though given the existing architecture it may be tricky). FS models are latent class models - the idea is that there is a discrete latent variable (in FS, "match" or "non-match") for each pair, and that we fit a model for each class (m-probabilities and u-probabilities, respectively). Because we don't know the latent variables, the EM process allows us to estimate them iteratively using Bayes' Rule. But there is nothing special about a 2-class model; any number of classes could be used, and the math is essentially the same. You can google "Latent Class Analysis" to get a sense.

In the case of families, I could imagine three classes of pairs: Matches, family-matches, and non-matches. The trick would be steering the model to actually use the three classes in the way we would like; all EM is going to do is find a locally-optimal best use of the classes to describe the data, and there's no reason the classes will have the interpretation we want. The non-match class is the easiest one to deal with though because it is the most distinct, especially if we train the u-probabilities on random pairs and keep them fixed during EM.
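A sketch of the E-step for such a three-class model is below. All class proportions and level probabilities are invented for illustration; in a real model they would be estimated, with the non-match values possibly held fixed as suggested above.

```python
import math

def class_posteriors(class_priors, level_probs, gamma):
    """Bayes' rule over latent classes for one record pair.

    class_priors: class -> prior proportion of pairs in that class
    level_probs:  class -> comparison -> level -> probability
    gamma:        comparison -> observed level for this pair
    """
    joint = {
        cls: prior * math.prod(level_probs[cls][comp][lvl] for comp, lvl in gamma.items())
        for cls, prior in class_priors.items()
    }
    total = sum(joint.values())
    return {cls: value / total for cls, value in joint.items()}

def levels(p_agree):
    return {"agree": p_agree, "disagree": 1 - p_agree}

# Hypothetical parameters for the three classes
class_priors = {"match": 0.01, "family_match": 0.02, "non_match": 0.97}
level_probs = {
    "match":        {"first_name": levels(0.90), "surname": levels(0.95), "address": levels(0.80)},
    "family_match": {"first_name": levels(0.02), "surname": levels(0.90), "address": levels(0.90)},
    "non_match":    {"first_name": levels(0.01), "surname": levels(0.01), "address": levels(0.001)},
}

twin_like = {"first_name": "disagree", "surname": "agree", "address": "agree"}
same_person = {"first_name": "agree", "surname": "agree", "address": "agree"}

twin_post = class_posteriors(class_priors, level_probs, twin_like)
person_post = class_posteriors(class_priors, level_probs, same_person)
print(max(twin_post, key=twin_post.get))      # most probable class for the twin-like pair
print(max(person_post, key=person_post.get))  # most probable class for the same-person pair
```

The math is the same as in the two-class case, with one extra term in the normalising sum. A same-person pair with a badly mistyped first name would still look like a family match here, so a third class does not by itself remove the twins trade-off.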

As for distinguishing between matches and family-matches, I can think of at least three options.

First, we might exploit the dependence you describe by specifying different comparisons for the family-match and match classes. Splink doesn't allow this currently, but it would be an interesting extension - though the engine would have to change a bit because you couldn't get per-comparison Bayes factors if the classes had different comparisons. For example, if we combine name and address comparisons into a dependent comparison for the family-match class, we may find that the EM uses this class more for family-matches. However, deciding what the dependence structure should be is a little tricky - these dependence assumptions are conditional on latent class. For example, non-matches that share an address probably have different first names AND last names. Family matches that share an address probably have different first names but have a good chance of having the same last name. Exact matches probably have the same first and last names regardless of address. What does it mean to have a family match if the address doesn't match? Should this be possible?

Second, if we implement the Dirichlet prior approach described above, we could specify a prior on the family-match class that has a much lower probability of first name match compared to the match class, but a much higher probability of last name match compared to the non-match class. Priors on other comparisons could also be specified on a per-class basis, or priors on joint comparisons could be used.

Third, simply specifying reasonable initial values for comparison value probabilities for each class to be used in EM can often help a latent class model to land on the interpretation you're intending.

1 reply

samkodes Mar 26, 2024

FYI, for another problem I was skimming through Herzog, Scheuren, and Winkler, "Data Quality and Record Linkage Techniques" (2007), and found this nugget (section 9.6.1, p. 103): "To address the natural partitioning problem, A×B is partitioned into three sets or classes: C1, C2, and C3. The three-class EM algorithm works best when we are matching persons within households and there are multiple persons per household."
So a three-class approach to this problem has good precedent!
