-
Theoretical background

The likelihood function underlying Splink assumes a bimodal distribution of data (two peaks), as described in this article. However, in the case of families and persons, the distribution is multimodal: the likelihood function probably has two peaks, one where the model is matching families (and so doesn't punish a mismatch on first name), and a second where the model is matching persons (and therefore does punish mismatches on first name).

A second important theoretical point is that grouping records (persons) into higher-level entities (twins, families) generally breaks the Fellegi-Sunter assumption that columns are independent conditional on match status. For example, the Fellegi-Sunter model assumes that the distribution of surname conditional on address is the same as the overall distribution of surname. This is obviously violated when people live in families.

As a result, even if we could solve the problem of 'telling' Splink that we're interested in the 'person' entity and not the 'family' entity during training, the Fellegi-Sunter methodology will likely find it difficult to distinguish between family members, especially if address is modelled as a separate comparison to surname.

The twins problem

The problem is most acute in the case of twins, who will share the same:
Generally, the only distinguishing characteristic will be the first name(s). The problem is that the dataset is likely to contain many other true matches that agree on everything except forename. As a result, there's no perfect solution: you either:
Possible solution

I'm not aware of any perfect solution to this problem, but I think there are broadly two techniques to help mitigate it:
On (2), that could involve manually inserting a very negative match weight on a mismatch on first name.

I have also heard there may be ways of extending the likelihood function to explicitly account for a multimodal (non-binary) classification of entities, which would allow a model to simultaneously model families and persons. But I don't know how that would be done in practice.
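To illustrate the manual override in (2), here is a minimal sketch in plain Python of how a match weight relates to m and u probabilities, and what happens to a twin-like pair when the first-name mismatch weight is forced to be very negative. It does not use the Splink API; the function names, the m and u values, the prior, and the override value of -15 are all made up for illustration.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight for one comparison level: log2 of the Bayes factor m/u."""
    return math.log2(m / u)


def match_probability(prior: float, weights: list[float]) -> float:
    """Combine a prior match probability with per-comparison match weights."""
    prior_weight = math.log2(prior / (1 - prior))
    total = prior_weight + sum(weights)
    return 2 ** total / (1 + 2 ** total)


# Hypothetical prior: 1 in 10,000 candidate pairs is a true match.
prior = 1e-4

# A twin-like pair agrees on surname, date of birth and address.
# The m and u values below are illustrative assumptions, not trained values.
agreeing_weights = [
    match_weight(m=0.90, u=0.01),   # surname matches
    match_weight(m=0.95, u=0.001),  # date of birth matches
    match_weight(m=0.90, u=0.001),  # address matches
]

# If many true matches differ on first name, training yields only a mild penalty.
trained_mismatch = match_weight(m=0.20, u=0.95)

# Manual override: a very negative weight on a first-name mismatch.
overridden_mismatch = -15.0

p_trained = match_probability(prior, agreeing_weights + [trained_mismatch])
p_overridden = match_probability(prior, agreeing_weights + [overridden_mismatch])

print(f"trained penalty:    {p_trained:.4f}")
print(f"overridden penalty: {p_overridden:.4f}")
```

With these invented numbers, the mild trained penalty leaves the twin pair scored as a near-certain match, while the override pulls it below 0.5. The cost is the one described above: genuine duplicates with a mistyped or missing first name are penalised just as heavily.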
-
Your example of manually setting m and u probabilities is helpful for the dataset I am working on, where I am trying to separate twins from duplicates. Thank you.
-
Neat example - a couple of thoughts.

First, I have had a thought or two about setting prior distributions on m-probabilities for comparisons. This would be a little more squishy than manually overriding after model fitting, but could be used to inform the EM. Dirichlet distributions are typically used for this sort of thing (https://en.wikipedia.org/wiki/Dirichlet_distribution). A Dirichlet distribution of order K is a probability distribution on the (K-1)-simplex; for K=2 you get a Beta distribution. The nice property of Bayesian inference with Dirichlet distributions is that there is an easily interpretable conjugate posterior.

So in Splink, for each comparison the settings object could specify a simple vector of positive numbers, or more interpretably a probability vector with a positive "weight" used to scale it. The "weight" would be the strength of the prior, measured in equivalent observation points. This would parametrize the Dirichlet prior used in EM. At each round of the EM the posterior mean could be used to predict, and then the posterior mean could be updated with the conjugate update, where observations are weighted by their match probability. This would allow us to pre-specify the confidence we have in the m-parameters having a certain structure.

Second, your idea of multi-modal extensions is very interesting! Conceptually this would not be a big challenge to implement (though given the existing architecture it may be tricky). FS models are latent class models: the idea is that there is a discrete latent variable (in FS, "match" or "non-match") for each pair, and that we fit a model for each class (m-probabilities and u-probabilities, respectively). Because we don't know the latent variables, the EM process allows us to estimate them iteratively using Bayes' rule. But there is nothing special about a 2-class model; any number of classes could be used, and the math is essentially the same.
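To make the "any number of classes" point concrete, here is a minimal sketch of the E-step for a K-class latent class model in plain Python. It assumes comparisons are independent conditional on class, as in FS; the function name, the three classes, and every probability below are hypothetical values chosen for illustration, not anything Splink provides.

```python
def e_step(pair_levels, class_priors, level_probs):
    """Posterior class probabilities for one record pair via Bayes' rule.

    pair_levels:  observed level of each comparison, e.g. [0, 1, 1]
    class_priors: prior probability of each latent class (sums to 1)
    level_probs:  level_probs[c][j][l] = P(comparison j at level l | class c)
    """
    joint = []
    for prior, probs in zip(class_priors, level_probs):
        likelihood = prior
        # Conditional independence: multiply per-comparison probabilities.
        for j, level in enumerate(pair_levels):
            likelihood *= probs[j][level]
        joint.append(likelihood)
    total = sum(joint)
    return [x / total for x in joint]


# Comparisons: first name, surname, address. Level 0 = mismatch, 1 = match.
# Classes: match, family-match, non-match. All numbers are illustrative.
class_priors = [0.001, 0.002, 0.997]
level_probs = [
    [[0.05, 0.95], [0.05, 0.95], [0.20, 0.80]],     # match
    [[0.95, 0.05], [0.20, 0.80], [0.10, 0.90]],     # family-match
    [[0.99, 0.01], [0.99, 0.01], [0.999, 0.001]],   # non-match
]

# A twin-like pair: first name differs, surname and address agree.
twin_like = [0, 1, 1]
posterior = e_step(twin_like, class_priors, level_probs)
print(dict(zip(["match", "family-match", "non-match"], posterior)))
```

With two classes this reduces to the usual FS match probability; adding a class only adds one more row of parameters and one more term in the normalising sum.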
You can google "Latent Class Analysis" to get a sense. In the case of families, I could imagine three classes of pairs: matches, family-matches, and non-matches. The trick would be steering the model to actually use the three classes in the way we would like; all EM is going to do is find a locally optimal use of the classes to describe the data, and there's no reason the classes will have the interpretation we want. The non-match class is the easiest one to deal with, though, because it is the most distinct, especially if we train the u-probabilities on random pairs and keep them fixed during EM.

As for distinguishing between matches and family-matches, I can think of at least three options.

First, we might exploit the dependence you describe by specifying different comparisons for the family-match and match classes. Splink doesn't allow this currently, but it would be an interesting extension - though the engine would have to change a bit, because you couldn't get per-comparison Bayes factors if the classes had different comparisons. For example, if we combine name and address comparisons into a dependent comparison for the family-match class, we may find that the EM uses this class more for family-matches. However, deciding what the dependence structure should be is a little tricky, because these dependence assumptions are conditional on latent class. For example, non-matches that share an address probably have different first names AND last names. Family matches that share an address probably have different first names but have a good chance of having the same last name. Exact matches probably have the same first and last names regardless of address. What does it mean to have a family match if the address doesn't match? Should this be possible?
Second, if we implement the Dirichlet prior approach described above, we could specify a prior on the family-match class that has a much lower probability of first name match compared to the match class, but a much higher probability of last name match compared to the non-match class. Priors on other comparisons could also be specified on a per-class basis, or priors on joint comparisons could be used.

Third, simply specifying reasonable initial values for comparison value probabilities for each class to be used in EM can often help a latent class model to land on the interpretation you're intending.
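As a rough sketch of the Dirichlet prior idea, the M-step for one comparison in one class could use the posterior mean: prior pseudo-counts plus expected counts, where each pair contributes its current class probability. This is plain Python, not Splink code; the function name `dirichlet_m_step` and all the numbers are assumptions for illustration.

```python
def dirichlet_m_step(levels, class_probs, prior_probs, prior_weight):
    """Posterior-mean update of level probabilities for one comparison.

    levels:       observed comparison level (0..K-1) for each record pair
    class_probs:  current E-step probability that each pair is in this class
    prior_probs:  prior guess at the level probabilities (sums to 1)
    prior_weight: prior strength, in equivalent observations
    """
    # Dirichlet pseudo-counts: alpha_k = prior_weight * prior_probs[k]
    counts = [prior_weight * p for p in prior_probs]
    # Expected counts: each pair is weighted by its class probability.
    for level, gamma in zip(levels, class_probs):
        counts[level] += gamma
    total = sum(counts)
    return [c / total for c in counts]


# First-name comparison for the "match" class: level 0 = mismatch, 1 = match.
levels = [1, 1, 0, 1, 0]
class_probs = [0.9, 0.8, 0.7, 0.1, 0.05]

# Strong prior: true matches almost always agree on first name.
strong = dirichlet_m_step(levels, class_probs, prior_probs=[0.01, 0.99], prior_weight=100.0)

# Zero prior weight recovers the ordinary weighted-frequency EM update.
no_prior = dirichlet_m_step(levels, class_probs, prior_probs=[0.01, 0.99], prior_weight=0.0)

print("with prior:   ", strong)
print("without prior:", no_prior)
```

For the family-match class, the same function would be called with a prior that puts most of its mass on first-name mismatch, which is one way of nudging EM toward the intended interpretation of each class. As more data accumulates, the expected counts dominate the pseudo-counts, so the prior weight directly controls how long the prior holds.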
-
A common challenge in Splink is linking data where the entity falls into a hierarchy that makes the entity type unclear. For example, a person is a member of a family.
This is a problem because there's no way of explicitly telling Splink that the entity type is a person, a family, or anything else.
How can we solve this problem?