Using sets to identify duplicates - Jaccard distance #2376

RichardMcAteer · 2024-09-03T18:36:14Z


There may be better ways to solve this, but I'm hoping to use Splink to deduplicate fairly large datasets of pharmacovigilance data (> 10 million records).
I've pulled a recent, better-quality subset of records (~40,000), and have the following fields available for blocking:
I have ages on a portion of the data (~80%) and if I use age groupings (infant, adult, etc.) I can get that up to 88%.
I have sex on 96% of the data. Between sex and the ages I can cover about 98% of the data. Blocking on these is probably a decent way to compare most of the data.

The problem I have is that matches will likely be on the drugs involved in the reaction and the set of observed reactions to those drugs. But two accounts of a drug reaction can describe effects in varying ways, and the drugs listed can also differ, as people don't always describe things the same way. One-hot encoding the drugs would result in thousands of columns, and similarly for reactions. If two people of the same age and sex report the same drug combination with the same reactions, these are likely matches.

An exact match is easy enough, and I have turned the drugs on a reaction into their base ingredients, and have these as delimited sets. This gets around problems like one reporter saying they took "Extra strength Tylenol" and their pharmacist reporting the same reaction as being to "acetaminophen", for example. Similarly, where an account (say, a case study) references only an ingredient, different manufacturers may have a duty to report, so while the drug might be reported as a generic name, each reporter may use THEIR brand name or registered generic (e.g., APO-XXX vs TEVA-XXX) so reducing all drugs to the base active ingredients is necessary.

Where it gets tricky is if in one case the reporter lists X, Y, Z and vitamin C, but another report omits vitamin C as being unlikely to be a suspect product. Here I was hoping to use my delimited drug lists to compare sets, with a Jaccard similarity as a way of measuring the overlap between the drugs on a case.

E.g., the two lists of suspect drugs below could be from a pair of reports that are in fact matches, but one reporter used the name of the salt when describing the ondansetron. I suppose I could try an edit distance on these (the two strings would be 92% similar), but it isn't as easy to explain: "they matched on 9 of 10 drugs" is clear, whereas "if you shuffle a bunch of letters they are similar" isn't a great explanation. Also, some drugs are spelled similarly but are very different, which is why I'm trying to treat them as tokens rather than as spelled words. For example, a DTAP VACCINE and a TDAP VACCINE are different vaccines given to different populations despite the similar spelling, and 'Hydroxyzine' and 'Hydralazine' have a high edit-distance similarity, especially within a string of several drugs, but aren't the same drug.

acetylsalicylic acid@betamethasone@furosemide@gabapentin@heparin sodium@labetalol hydrochloride@metoclopramide@mirtazapine@nifedipine@ondansetron

acetylsalicylic acid@betamethasone@furosemide@gabapentin@heparin sodium@labetalol hydrochloride@metoclopramide@mirtazapine@nifedipine@ondansetron hydrochloride

Almost the same thing is true of drug reactions; they are fortunately drawn from an international standard, but the same conditions can be described in a variety of ways. A common reaction might be something like "Nausea", but this could be described in other ways, such as Vomiting. If two reports match on 5 reaction terms, but one has "Vomiting" while the other has "Nausea", it may just be a slight difference in the accounts/reporting. Hence wanting to do a Jaccard or similar across the terms. For these I may have an approach I can use of walking a hierarchy to group like terms together, to prevent treating "Nausea" and "Vomiting" as completely separate tokens, since they have the same parent term.
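The hierarchy idea above can be sketched roughly as follows. This is only an illustration: the `PARENT_TERM` mapping, the parent label, and the two example reports are made up for the sketch, and a real mapping would come from the reaction dictionary's own hierarchy rather than a hand-written dict.

```python
# Hypothetical child -> parent mapping (placeholder values for illustration;
# a real one would be loaded from the reaction terminology's hierarchy).
PARENT_TERM = {
    "Nausea": "Nausea and vomiting symptoms",
    "Vomiting": "Nausea and vomiting symptoms",
}


def to_parent_terms(react_terms: str, sep: str = "@") -> set[str]:
    """Replace each reaction term with its parent group when one is known."""
    return {PARENT_TERM.get(term, term) for term in react_terms.split(sep)}


# Two made-up reports that differ only in "Nausea" vs "Vomiting".
report_a = "Nausea@Headache@Rash"
report_b = "Vomiting@Headache@Rash"

raw_a, raw_b = set(report_a.split("@")), set(report_b.split("@"))
grouped_a, grouped_b = to_parent_terms(report_a), to_parent_terms(report_b)

# Overlap on the raw terms: 2 shared terms out of 4 distinct terms.
raw_overlap = len(raw_a & raw_b) / len(raw_a | raw_b)
# Overlap after grouping: both reports collapse to the same 3 tokens.
grouped_overlap = len(grouped_a & grouped_b) / len(grouped_a | grouped_b)

print(raw_overlap, grouped_overlap)
```

Terms without a known parent are left unchanged, so the grouping can be introduced gradually as the mapping is filled in.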

Has anyone done this kind of set-based comparison? I can define a function to take the strings, break them into sets, and calculate the Jaccard similarity, but looking at the package I'm not sure how to use a custom calculation like this in it.
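For reference, the function described above might look something like this minimal sketch. The names `parse_terms` and `jaccard` are illustrative, and lower-casing the tokens and returning 0.0 for two empty lists are assumptions, not something stated in the thread. It is applied to the two ondansetron lists shown earlier.

```python
def parse_terms(delimited: str, sep: str = "@") -> set[str]:
    """Split a delimited string into a set of stripped, lower-cased tokens."""
    return {tok.strip().lower() for tok in delimited.split(sep) if tok.strip()}


def jaccard(left: str, right: str, sep: str = "@") -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two delimited term lists."""
    a, b = parse_terms(left, sep), parse_terms(right, sep)
    if not a and not b:
        # Assumption: two empty lists carry no evidence of a match.
        return 0.0
    return len(a & b) / len(a | b)


drugs_1 = (
    "acetylsalicylic acid@betamethasone@furosemide@gabapentin@heparin sodium@"
    "labetalol hydrochloride@metoclopramide@mirtazapine@nifedipine@ondansetron"
)
drugs_2 = (
    "acetylsalicylic acid@betamethasone@furosemide@gabapentin@heparin sodium@"
    "labetalol hydrochloride@metoclopramide@mirtazapine@nifedipine@"
    "ondansetron hydrochloride"
)

# 9 shared tokens out of 11 distinct tokens across both lists.
similarity = jaccard(drugs_1, drugs_2)
print(round(similarity, 3))
```

Note that "matched on 9 of 10 drugs" corresponds to a Jaccard value of 9/11 here, because the union counts both `ondansetron` and `ondansetron hydrochloride`.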

RobinL · 2024-09-03T20:49:52Z

RobinL
Sep 3, 2024
Maintainer

Thanks for the question. I'm not hugely familiar with this sort of data. Are you able to post a couple of fake sample records so I can get a better sense of what it looks like? Hopefully that will help us answer your question better. Perhaps two which are different but you think should match, and two which are different and do not match, or something like that.

3 replies

RichardMcAteer Sep 4, 2024
Author

Thanks so much - Sure thing; I have other columns that are more sparsely populated; for example, I have DOB on 43% of my cases, but at 43% I won't block with it. Where two records agree on DOB it's a strong probability of a match; where they disagree it's a non-match, and where one is blank, who knows!? I could try to convert DOBs to ages, but age is calculated at the time of the adverse drug reaction, and might be reported later, so receiving a report in a given year doesn't necessarily mean that the reaction happened that year (though it's likely).

These are 4 examples of cases that should match up, as they derive from the same published case study. I know this because I have extra data (narrative fields, paper forms, faxes, etc.) that I can use to derive more information, and the text of each of these reports mentions the journal article. It's time-consuming to look that up for each case and do all the comparisons, though, which is why I'm hoping to identify likely duplicates first and do the more in-depth comparisons afterwards (e.g., embed the narratives and check whether suspected duplicates are near each other).

One thing I need to do at some point is set up a table matching active moieties, as the drugs fludarabine and fludarabine phosphate would be the same drug, but treating them as tokens doesn't match them. I've used the @ symbol as a delimiter as it's not found in any drug names or reactions. These would normally be stored relationally; I've flattened the data. I could one-hot encode them, but then you end up with large sparse matrices, or you have to use a compressed sparse format like CSR, which doesn't play well with many tools.

From the journal article: Gabarin N, Dadak R, Roy M, Kaplan AJ, Haider S, Khalaf D. Antiviral therapy defiant mixed viral retinitis post hematopoietic allogeneic stem cell transplant. Clin Case Rep. 2023;11(3). e7095. DOI: 10.1002/ccr3.7095

unique_id	SEX_CATEGORY	AGE_NORM	AGE_GROUP_CALC	DRUGS_ACTIVE_SUBSTANCE_LIST	REACT_TERMS
987613663917	Female	21	ADULT	busulfan@cyclophosphamide@cytarabine@daunorubicin@fludarabine phosphate@ganciclovir sodium@rituximab@valganciclovir	Cytomegalovirus infection@Retinitis viral@Off label use@Varicella zoster virus infection
987612996491	Female	21	ADULT	busulfan@cyclophosphamide@cytarabine@daunorubicin@fludarabine@ganciclovir sodium@rituximab@valganciclovir hydrochloride	Cytomegalovirus infection@Retinitis viral@Off label use@Varicella zoster virus infection
987613011984	Female	21	ADULT	busulfan@cyclophosphamide@fludarabine@lymphocyte immune globulin anti-thymocyte globulin@methotrexate	Epstein-Barr virus infection reactivation@Retinitis viral@Cytomegalovirus infection reactivation
987613041883	Female	21	ADULT	anti-thymocyte globulin (rabbit)@busulfan@cyclophosphamide@cytarabine@daunorubicin hydrochloride@fludarabine@methotrexate sodium	Epstein-Barr virus infection@Retinitis viral@Cytomegalovirus infection reactivation@Infection reactivation@Varicella zoster virus infection

As you can see, we have matching age and sex which is nice and drops these all in the same block.

l.unique_id	r.unique_id	DRUGS_ACTIVE_SUBSTANCE_LIST Jaccard	REACT_TERMS Jaccard	Edit Distance Similarity Drugs	Edit Distance Similarity Reactions
987613663917	987612996491	0.6	1.0	80	100
987613663917	987613011984	0.18	0.17	40	37
987613663917	987613041883	0.25	0.29	39	50
987612996491	987613011984	0.18	0.17	42	37
987612996491	987613041883	0.36	0.29	30	50
987613041883	987613011984	0.33	0.33	34	49

Without moiety matching or synonyms across reactions I end up with mismatches like Epstein-Barr virus infection reactivation not matching Epstein-Barr virus infection, but the edit distance helps counter that to some extent. I could come up with a more complex way of trying to match up similar terms, but with tens of thousands of terms for diseases and outcomes it's a lot to work on.
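To illustrate what moiety matching would change, here is a small sketch using the drug lists of the first two cases above. The `ACTIVE_MOIETY` lookup is hypothetical and covers only the two salt/ester forms that differ between these records; a real table would need to be curated for the full drug dictionary.

```python
# Hypothetical lookup from a reported substance to its active moiety.
# Only the entries needed for this example are included.
ACTIVE_MOIETY = {
    "fludarabine phosphate": "fludarabine",
    "valganciclovir hydrochloride": "valganciclovir",
}


def moiety_set(drug_list: str, sep: str = "@") -> set[str]:
    """Split a delimited drug list and map each entry to its moiety if known."""
    return {ACTIVE_MOIETY.get(drug, drug) for drug in drug_list.split(sep)}


def jaccard_sets(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


case_987613663917 = (
    "busulfan@cyclophosphamide@cytarabine@daunorubicin@fludarabine phosphate@"
    "ganciclovir sodium@rituximab@valganciclovir"
)
case_987612996491 = (
    "busulfan@cyclophosphamide@cytarabine@daunorubicin@fludarabine@"
    "ganciclovir sodium@rituximab@valganciclovir hydrochloride"
)

# Raw tokens: 6 shared out of 10 distinct, matching the 0.6 in the table above.
before = jaccard_sets(
    set(case_987613663917.split("@")), set(case_987612996491.split("@"))
)
# After mapping to moieties the two drug sets are identical.
after = jaccard_sets(moiety_set(case_987613663917), moiety_set(case_987612996491))

print(before, after)
```

The same lookup could in principle be applied before building the arrays passed to the linkage model, so that exact token comparison sees the normalised names.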

Some examples of reactions that don't match would be those with obvious differences - I'm not sure it's worth making up a table to describe that a 44 year old man taking aspirin is different from a 23 year old woman taking paroxetine; they don't agree on any attributes. Where it gets harder is where there are potential matches - maybe matching reactions but age and sex are unknown, and similar drugs? Or age and sex match, and the drug list is short.

What I probably need to do is to have it be more sensitive to more drugs matching, and less sensitive where the drug list is short. There are many more single drug reactions than multi-drug reactions. I know that a really robust way to do all of this is in the paper Duplicate detection in adverse drug reaction surveillance by Noren et al., but it requires having estimates of several probabilities that can only really be assessed with identified duplicates (and it's tricky to implement, but it's a more long-term goal).

Norén, G.N., Orre, R., Bate, A. et al. Duplicate detection in adverse drug reaction surveillance. Data Min Knowl Disc 14, 305–328 (2007). https://doi.org/10.1007/s10618-006-0052-8

RobinL Sep 4, 2024
Maintainer

OK, here's a start, with plenty of room for improvement, but hopefully it will give you some ideas.

import duckdb
import pandas as pd

import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

# Define the first table (cases data)
cases_data = [
    {
        "unique_id": "987613663917",
        "sex_category": "Female",
        "age_norm": 21,
        "age_group_calc": "ADULT",
        "dob": "2003-01-01",
        "drugs_active_substance_list": "busulfan@cyclophosphamide@cytarabine@daunorubicin@fludarabine phosphate@ganciclovir sodium@rituximab@valganciclovir",
        "react_terms": "Cytomegalovirus infection@Retinitis viral@Off label use@Varicella zoster virus infection",
    },
    {
        "unique_id": "987612996491",
        "sex_category": "Female",
        "age_norm": 21,
        "age_group_calc": "ADULT",
        "dob": None,
        "drugs_active_substance_list": "busulfan@cyclophosphamide@cytarabine@daunorubicin@fludarabine@ganciclovir sodium@rituximab@valganciclovir hydrochloride",
        "react_terms": "Cytomegalovirus infection@Retinitis viral@Off label use@Varicella zoster virus infection",
    },
    {
        "unique_id": "987613011984",
        "sex_category": "Female",
        "age_norm": 21,
        "age_group_calc": "ADULT",
        "dob": "2003-02-12",
        "drugs_active_substance_list": "busulfan@cyclophosphamide@fludarabine@lymphocyte immune globulin anti-thymocyte globulin@methotrexate",
        "react_terms": "Epstein-Barr virus infection reactivation@Retinitis viral@Cytomegalovirus infection reactivation",
    },
    {
        "unique_id": "987613041883",
        "sex_category": "Female",
        "age_norm": 21,
        "age_group_calc": "ADULT",
        "dob": "2003-04-02",
        "drugs_active_substance_list": "anti-thymocyte globulin (rabbit)@busulfan@cyclophosphamide@cytarabine@daunorubicin hydrochloride@fludarabine@methotrexate sodium",
        "react_terms": "Epstein-Barr virus infection@Retinitis viral@Cytomegalovirus infection reactivation@Infection reactivation@Varicella zoster virus infection",
    },
]


cases_df = pd.DataFrame(cases_data)

duckdb.register("cases_df", cases_df)

sql = """
SELECT
    unique_id,
    sex_category,
    age_norm,
    age_group_calc,
    dob,
    string_split(drugs_active_substance_list, '@') AS drugs_active_substance_array,
    string_split(react_terms, '@') AS react_terms_array
FROM cases_df;
"""

cleaned_splink_data = duckdb.sql(sql).df()
cleaned_splink_data
db_api = DuckDBAPI()


settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("age_norm"),
        block_on("drugs_active_substance_array[1]"),
    ],
    comparisons=[
        cl.ExactMatch("sex_category"),
        cl.CustomComparison(
            comparison_description="Age comparison",
            comparison_levels=[
                cll.And(
                    cll.NullLevel("dob"),
                    cll.NullLevel("age_norm"),
                    cll.NullLevel("age_group_calc"),
                ).configure(is_null_level=True),
                cll.ExactMatchLevel("dob").configure(tf_adjustment_column="dob"),
                cll.ExactMatchLevel("age_norm").configure(
                    tf_adjustment_column="age_norm"
                ),
                cll.ExactMatchLevel("age_group_calc").configure(
                    tf_adjustment_column="age_group_calc"
                ),
                cll.ElseLevel(),
            ],
        ),
        cl.ArrayIntersectAtSizes("drugs_active_substance_array", [4, 3, 2, 1]),
        cl.ArrayIntersectAtSizes("react_terms_array", [4, 3, 2, 1]),
    ],
)

linker = Linker(cleaned_splink_data, settings, db_api)

deterministic_rules = [block_on("age_norm", "drugs_active_substance_array")]
linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.5)

linker.training.estimate_u_using_random_sampling(max_pairs=1e7)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
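linker.training.estimate_parameters_using_expectation_maximisation(
    # The cleaned table only contains the array column, not the original
    # delimited string column, so block on the array instead.
    block_on("drugs_active_substance_array")
)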

linker.inference.predict(threshold_match_probability=0.9)

There are a variety of avenues to explore for improving this:

Rather than just looking at the size of the array intersection, you could incorporate term frequencies of the individual array elements. See here for an example of this kind of thing. I suspect this could get you substantially higher accuracy.
You may be interested in a measure of the relative size of the array intersection (because if two arrays are large, some elements are likely to match just by chance). See here. Maybe something like 'intersection as a percentage of size'. It's tricky, though, because the arrays can differ in size. Perhaps we also need to account for the number of comparisons. For example, if you're comparing an array of 10 elements with an array of 8 elements, there are 10 × 8 = 80 pairwise element comparisons, so it's likely that (say) 1 of them matches by chance. So you want categories like 'array intersection of 2 with < 10 comparisons', 'array intersection of 2 with < 100 comparisons', and so on.
You can allow for fuzzy matches within an array comparison by asking questions like 'when two array elements are compared, if they're not equal, are they within e.g. a Levenshtein distance of 2?'. See here.
You could define a fuzzy age comparison that allows for (say) an absolute difference in age of 1 year. There is a cll.PercentageDifferenceLevel but not yet a cll.AbsoluteDifferenceLevel; we should probably implement the latter!
Similarly, you may want a Levenshtein level on dob if typos are possible.
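As a rough sketch of the 'intersection as a percentage of size' idea, the helper below builds Jaccard-threshold conditions as DuckDB SQL strings. The helper names and the thresholds are made up for illustration. It assumes the DuckDB list functions `list_intersect`, `list_distinct`, `list_concat` and `len`, and Splink's convention of suffixing column names with `_l` and `_r` inside comparison SQL. The intent is that each string could be supplied as the `sql_condition` of a custom comparison level (for example `cll.CustomLevel`) inside a `cl.CustomComparison`, but that wiring is not shown or tested here.

```python
def jaccard_sql(col: str) -> str:
    """DuckDB SQL expression for the Jaccard similarity of <col>_l and <col>_r."""
    left, right = f"{col}_l", f"{col}_r"
    return (
        f"len(list_intersect({left}, {right})) / "
        f"len(list_distinct(list_concat({left}, {right})))"
    )


def jaccard_level_conditions(col: str, thresholds: list[float]) -> list[str]:
    """One SQL condition per threshold, ordered from strictest to loosest."""
    expr = jaccard_sql(col)
    return [f"{expr} >= {t}" for t in sorted(thresholds, reverse=True)]


# Illustrative thresholds only; sensible cut-offs would need to be tuned on
# real data.
conditions = jaccard_level_conditions(
    "drugs_active_substance_array", [0.3, 0.9, 0.6]
)
for condition in conditions:
    print(condition)
```

Ordering the conditions from strictest to loosest matters because comparison levels are evaluated in order, with each record pair falling into the first level whose condition is true.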

If you work in the government or not-for-profit sector, feel free to reach out at robinlinacre@hotmail.com (my spam address, I'll then reply from my work address), and we can hopefully help out a bit more. If you're working in the for-profit sector, feel free to post follow ups here and we'll do our best to try and get round to answering

Answer selected by RichardMcAteer

RichardMcAteer Sep 9, 2024
Author

Thanks so much - this is a side project for me supporting another group, so I didn't have time to write back last week. I'll give this a try and look into some of the options you present at the bottom of this. I may also drop you a line on your spam address as I do in fact work for government.
