How to use splink to fuzzy match requests to a known data set? #1345
Actually, never mind. Thanks!
First, it's worth mentioning that I'm a data engineer, not a data scientist, so please forgive any obvious mistakes!
I have a set of target data with around 20M unique rows. I also have a web UI where users can enter requests. I want to match each individual request to the target data. I expect the request data to contain mistakes, omissions and so on, so I am looking at using a fuzzy match to identify the most likely target row.
For example, a user enters a request, and splink returns the row from the target data with the highest probability of matching that request. If the probability is over a set threshold, I accept that as an actual match; otherwise it gets manually reviewed.
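To make the accept-or-review rule concrete, here is a minimal sketch in plain Python. It does not call splink at all: it assumes the candidate rows for one request have already been scored, and that each carries a `match_probability` value. The names `pick_match`, `target_id` and `ACCEPT_THRESHOLD`, and the value 0.95, are hypothetical and only there for illustration.

```python
# Hypothetical cut-off; a real value would need tuning against reviewed matches.
ACCEPT_THRESHOLD = 0.95


def pick_match(candidates, threshold=ACCEPT_THRESHOLD):
    """Pick the highest-scoring candidate for one request and decide what to do.

    `candidates` is a list of dicts, one per scored target row, each with a
    "match_probability" key (an assumed column name).
    Returns (best_candidate_or_None, decision).
    """
    if not candidates:
        # Nothing was scored for this request, so a person has to look at it.
        return None, "manual_review"

    best = max(candidates, key=lambda c: c["match_probability"])

    # "Over a set threshold" is read as strictly greater than.
    if best["match_probability"] > threshold:
        return best, "accept"
    return best, "manual_review"


if __name__ == "__main__":
    scored = [
        {"target_id": 101, "match_probability": 0.62},
        {"target_id": 202, "match_probability": 0.98},
    ]
    print(pick_match(scored))
```

The best candidate is still returned when the decision is `manual_review`, so a reviewer can start from the most likely row rather than from nothing.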
Currently, however, I only have the target data, as the requests UI is yet to go live.
I've read through the splink documentation and it seems like it could do what I need. However, I'm unsure exactly how to train the model when all I have to start with is a single data set of unique rows.
My questions are:
Any help would be greatly appreciated, thanks!