Using sets to identify duplicates - Jaccard distance #2376
Perhaps there are better ways to solve this. I'm hoping to use Splink to deduplicate fairly large datasets of pharmacovigilance data (> 10 million records). The problem is that matches will likely depend on the drugs involved in the reaction and on the set of observed reactions to those drugs, but two accounts of the same drug reaction can describe the effects in varying ways, and the drugs listed can also differ, as people don't always describe things the same way. One-hot encoding the drugs to get a column per drug would result in thousands of columns, and the same goes for reactions.

If two people of the same age and sex report the same drug combination with the same reactions, these are likely matches. An exact match is easy enough. I have reduced the drugs on each report to their base ingredients and stored these as delimited sets. This gets around problems like one reporter saying they took "Extra strength Tylenol" and their pharmacist reporting the same reaction as being to "acetaminophen". Similarly, where an account (say, a case study) references only an ingredient, different manufacturers may each have a duty to report, so while the drug might be described by its generic name, each reporter may use THEIR brand name or registered generic (e.g., APO-XXX vs TEVA-XXX). Reducing all drugs to the base active ingredients is therefore necessary.

Where it gets tricky is if in one case the reporter lists X, Y, Z and vitamin C, but another report omits vitamin C as being unlikely to be a suspect product. Here I was hoping to use my delimited drug lists to compare sets, with a Jaccard similarity as a way of measuring the overlap between the drugs on a case. E.g., these two lists of suspect drugs could be from a pair of reports that are in fact matches, but one reporter used the name of the salt when describing the ondansetron:

acetylsalicylic acid@betamethasone@furosemide@gabapentin@heparin sodium@labetalol hydrochloride@metoclopramide@mirtazapine@nifedipine@ondansetron

acetylsalicylic acid@betamethasone@furosemide@gabapentin@heparin sodium@labetalol hydrochloride@metoclopramide@mirtazapine@nifedipine@ondansetron hydrochloride

I suppose I could try an edit distance on these (these two strings would be 92% similar), but the result isn't as easy to explain. "They matched on 9 of 10 drugs" is easy to explain, but "if you shuffle a bunch of letters they are similar" isn't a great explanation. Also, some drugs are spelled similarly but are very different, which is why I'm trying to treat them as tokens rather than as spelled words. For example, a DTAP VACCINE and a TDAP VACCINE are different vaccines given to different populations despite the similar spelling, and 'Hydroxyzine' and 'Hydralazine' have high edit distance similarity, especially within a string of several drugs, but aren't the same.

Almost the same thing is true of drug reactions. They are fortunately drawn from an international standard, but the same conditions can be described in a variety of ways. A common reaction might be something like "Nausea", but this could be described in other ways, such as "Vomiting". If two reports match on 5 reaction terms, but one has "Vomiting" while the other has "Nausea", it may just be a slight difference in the accounts/reporting. Hence wanting to do a Jaccard or similar across the terms. For these I may be able to walk a hierarchy to group like terms together, to avoid treating "Nausea" and "Vomiting" as completely separate tokens, since they have the same parent term.

Has anyone done this kind of set-based comparison? I can define a function to take the strings, break them into sets, and calculate the Jaccard similarity, but looking at the package I'm not sure how to use a custom calculation like this in it.
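A minimal sketch of the set-based calculation described in the question above, in plain Python. It only covers the calculation itself, not how to plug it into Splink. The function names, the "@" delimiter default, and the `parents` mapping (a stand-in for walking the reaction-term hierarchy) are assumptions made for illustration, and the parent label used in the example is made up.

```python
def to_token_set(delimited, sep="@", parents=None):
    """Split a delimited string into a set of normalised tokens.

    `parents` is an optional, hypothetical mapping from a term to the
    label of its parent group; terms not in the mapping are kept as-is.
    """
    if not delimited:
        return set()
    tokens = {t.strip().lower() for t in delimited.split(sep) if t.strip()}
    if parents:
        tokens = {parents.get(t, t) for t in tokens}
    return tokens


def jaccard_similarity(a, b, sep="@", parents=None):
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a = to_token_set(a, sep, parents)
    set_b = to_token_set(b, sep, parents)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists: treated as no evidence of a match
    return len(set_a & set_b) / len(union)


# The two suspect-drug lists from the question.
drugs_a = ("acetylsalicylic acid@betamethasone@furosemide@gabapentin@"
           "heparin sodium@labetalol hydrochloride@metoclopramide@"
           "mirtazapine@nifedipine@ondansetron")
drugs_b = ("acetylsalicylic acid@betamethasone@furosemide@gabapentin@"
           "heparin sodium@labetalol hydrochloride@metoclopramide@"
           "mirtazapine@nifedipine@ondansetron hydrochloride")
print(jaccard_similarity(drugs_a, drugs_b))  # 9 shared / 11 distinct

# Grouping reaction terms under a shared (made-up) parent label.
reaction_parents = {"nausea": "nausea/vomiting", "vomiting": "nausea/vomiting"}
print(jaccard_similarity("nausea@headache", "vomiting@headache"))
print(jaccard_similarity("nausea@headache", "vomiting@headache",
                         parents=reaction_parents))
```

Note that the two drug lists share 9 tokens out of 11 distinct ones, so the Jaccard similarity is 9/11 (about 0.82) rather than 9/10, because the union counts both "ondansetron" and "ondansetron hydrochloride". If "9 of 10" is the figure that needs explaining, dividing the intersection by the size of the larger list may be closer to that intuition.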
Replies: 1 comment 3 replies
Thanks for the question. I'm not hugely familiar with this sort of data. Are you able to post a couple of fake sample records so I can get a better sense of what it looks like? Hopefully that will help us answer your question better. Perhaps two which are different but you think should match, and two which are different and should not match, or something like that.
OK, here's a start, with plenty of room for improvement. Hopefully it will give you some ideas.