How to use splink to fuzzy match requests to a known data set? #1345

richard-a-lott · 2023-06-20T13:32:01Z

Firstly, it's worth mentioning that I'm a Data Engineer, not a Data Scientist, so please forgive any obvious mistakes with this!

I have a set of target data with around 20M unique rows. I also have a web UI where users can enter requests. I want to match each individual request to the target data. I expect the entered request data to contain mistakes, omissions, etc., so I am looking at using a fuzzy match to identify the most likely target row.

E.g. A user enters a request, and splink returns the row from the target data with the highest probability of matching that request. If the probability is over a set threshold I accept that as an actual match, otherwise it gets manually reviewed.
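The accept-or-review rule described above could be sketched roughly as follows. This is only an illustration, not splink code: the function name `triage`, the `(row_id, probability)` input shape, and the 0.95 threshold are all assumptions made for the example.

```python
def triage(candidates, threshold=0.95):
    """Pick the most probable target row for one request and decide what to do.

    candidates: list of (target_row_id, match_probability) pairs, as might be
                produced by a fuzzy matcher (hypothetical input shape).
    threshold:  hypothetical cut-off; tune it against manually reviewed cases.
    """
    if not candidates:
        # No candidate at all: nothing to accept, so send to manual review.
        return ("manual_review", None, None)

    # Keep only the candidate with the highest match probability.
    best_id, best_probability = max(candidates, key=lambda pair: pair[1])

    # Accept only when the best probability is strictly over the threshold.
    decision = "accept" if best_probability > threshold else "manual_review"
    return (decision, best_id, best_probability)
```

For instance, `triage([("row_17", 0.99), ("row_42", 0.40)])` would accept `row_17`, while a single candidate at 0.60 would be routed to manual review.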

Currently, however, I only have the target data, as the requests UI is yet to go live.

I've read through the splink documentation and it seems like it could do what I need. However, I'm unsure exactly how to train the model when I only have a single data set of unique rows to start with.

My questions are:

1. Is splink suitable for what I'm trying to achieve? (Or is there anything better/simpler?)
2. How can I train the model when I have nothing for it to match against? One suggestion is to fake some requests (including typos and omissions) as an initial training set, then re-train after collecting real requests.
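The "fake some requests" idea could look something like the sketch below, which copies a target row and randomly blanks fields or injects a single-character typo. Everything here is an assumption for illustration: the function names, the field names, and the `typo_rate` / `omit_rate` values are hypothetical, and real user errors will likely be distributed differently.

```python
import random
import string


def add_typo(value, rng):
    """Return `value` with one random edit: delete, swap, or replace a character."""
    i = rng.randrange(len(value))
    op = rng.choice(["delete", "swap", "replace"])
    if op == "delete":
        return value[:i] + value[i + 1:]
    if op == "swap" and len(value) > 1:
        i = min(i, len(value) - 2)  # make sure there is a right-hand neighbour
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    # "replace", or "swap" on a one-character string
    return value[:i] + rng.choice(string.ascii_lowercase) + value[i + 1:]


def fake_request(record, rng, typo_rate=0.3, omit_rate=0.1):
    """Build a noisy copy of one target row (a dict of field -> value)."""
    noisy = {}
    for field, value in record.items():
        if rng.random() < omit_rate:
            noisy[field] = None  # simulate a field the user left blank
        elif isinstance(value, str) and value and rng.random() < typo_rate:
            noisy[field] = add_typo(value, rng)
        else:
            noisy[field] = value
    return noisy
```

A seeded generator such as `random.Random(42)` keeps the fake requests reproducible, e.g. `fake_request({"first_name": "richard", "surname": "smith"}, random.Random(42))`.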

Any help would be greatly appreciated, thanks!

richard-a-lott · 2023-06-21T09:34:37Z

Author

Actually, never mind. Thanks!

2 replies

RobinL Jun 21, 2023
Maintainer

Fwiw the function you're after is https://moj-analytical-services.github.io/splink/linker.html?h=find#splink.linker.Linker.find_matches_to_new_records

For your use case I'd train the u values but not bother training the m values (in which case the default m values are used).
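As background on why this works with only the target data: in the Fellegi-Sunter model that splink is based on, u is the probability that a field agrees between two records that are not a match, and m is the probability that it agrees between two records that are a match. Random pairs drawn from a large table of unique rows are almost all non-matches, so u can be estimated without any labelled requests. The sketch below illustrates that idea in plain Python; it is not splink's implementation, the function names are made up for the example, and the m value of 0.9 mentioned afterwards is an illustrative placeholder, not splink's actual default.

```python
import math
import random


def estimate_u(values, rng, n_pairs=10_000):
    """Estimate u for one field: the share of random row pairs that agree exactly.

    values: the field's value for every row of the (unique) target data.
    Random pairs are treated as non-matches, which is the key assumption.
    """
    agreements = 0
    for _ in range(n_pairs):
        a, b = rng.sample(range(len(values)), 2)  # two distinct row positions
        agreements += values[a] == values[b]
    return agreements / n_pairs


def match_weight(m, u):
    """log2 of the Bayes factor m / u; positive values are evidence for a match."""
    return math.log2(m / u)


def posterior_probability(prior, bayes_factors):
    """Combine a prior match probability with per-field Bayes factors (m / u)."""
    odds = prior / (1 - prior)
    for bayes_factor in bayes_factors:
        odds *= bayes_factor
    return odds / (1 + odds)
```

With a placeholder m of 0.9 and an estimated u of 0.01 for some field, agreement on that field would contribute a Bayes factor of 90, i.e. a match weight of `match_weight(0.9, 0.01)`.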

richard-a-lott Jun 21, 2023
Author

Thanks for the response Robin! I'll give that a try and see how it works out!
