Using sets to identify duplicates - Jaccard distance #2376

Answered by RobinL
RichardMcAteer asked this question in Q&A
OK, here's a start, with plenty of room for improvement, but hopefully it will give you some ideas.

import duckdb
import pandas as pd

import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

# Define the first table (cases data)
cases_data = [
    {
        "unique_id": "987613663917",
        "sex_category": "Female",
        "age_norm": 21,
        "age_group_calc": "ADULT",
        "dob": "2003-01-01",
        "drugs_active_substance_list": "busulfan@cyclophosphamide@cytarabine@daunorubicin@fludarabine phosphate@ganciclovir sodium@rituximab@valganciclovir",
        "react_terms": "…

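For reference, since the snippet above is cut off: the Jaccard idea from the title can be sketched in plain Python, independently of Splink. This is only an illustration and not necessarily what the answer does. The helper names `split_to_set`, `jaccard_similarity` and `jaccard_distance` are made up for this sketch, and it assumes the `@`-delimited format seen in `drugs_active_substance_list`.

```python
def split_to_set(value, sep="@"):
    """Split a delimited string into a set of stripped, lower-cased tokens."""
    if not value:
        return set()
    return {tok.strip().lower() for tok in value.split(sep) if tok.strip()}


def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B|.

    Assumption for this sketch: two empty sets score 0.0, so records with
    no listed items are never treated as a match on this field alone.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Two hypothetical drug lists sharing two of four distinct substances.
drugs_a = split_to_set("busulfan@cyclophosphamide@cytarabine")
drugs_b = split_to_set("busulfan@cytarabine@rituximab")

similarity = jaccard_similarity(drugs_a, drugs_b)
distance = jaccard_distance(drugs_a, drugs_b)
print(similarity, distance)
```

A score like this could then be bucketed into thresholds to form comparison levels, though which thresholds are sensible depends on the data.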
Answer selected by RichardMcAteer