Evaluating blocking rules on a test set, goal of maximizing AUC metric #1298
jkginfinite asked this question in Q&A · Unanswered
I am trying to find a way to evaluate splink's performance for a given set of blocking rules. The goal is to find the best set of blocking rules, i.e. the one that maximizes a metric such as the AUC score. The task is de-duplication.
To do this, I am building a test set to evaluate the linker against.
I saw your page https://moj-analytical-services.github.io/splink/demos/07_Quality_assurance.html and was hoping for some clarification.
Regarding the rows of the "labels set" in your example: were these labels generated with splink, or were they made by hand?
My team has a dataset of known duplicates, where a specific "master_id" value is linked to multiple rows (each row being another record of the same person). We propose to use this as a "test set": for each pair of rows we would MANUALLY set the "clerical_match_score" to 1 or 0 depending on whether the rows are duplicates of each other.
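For reference, here is a rough sketch of how I am thinking of expanding the master_id groups into pairwise labels. I am assuming labels are wanted as pairs with columns unique_id_l, unique_id_r and clerical_match_score (that is my reading of the demo page, not something I have confirmed), and that each input row has a row-level unique_id alongside its master_id:
```python
from itertools import combinations

import pandas as pd

def master_id_to_pairwise_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Expand rows sharing a master_id into positive pairwise labels."""
    pairs = []
    for _, group in df.groupby("master_id"):
        # every pair of rows within a master_id group is a known duplicate
        for id_l, id_r in combinations(group["unique_id"], 2):
            pairs.append(
                {"unique_id_l": id_l, "unique_id_r": id_r, "clerical_match_score": 1}
            )
    return pd.DataFrame(pairs)
```
(This only produces the positive pairs; presumably we would also need to add some clerical_match_score = 0 pairs drawn from different master_id groups.)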
My questions are:
1. What would the input structure of a "test set" need to look like? The example fake_1000_labels.csv from the 07_Quality_assurance.html page just looks like labels, with no predictive features. Wouldn't we need the predictive features for test evaluation?
2. I assume the unique identifying column in the test/train set should be unique for every row, not a master_id that can be shared by several rows?
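To make question 2 concrete, here is roughly how I am setting up the linker, with a row-level unique_id rather than master_id as the id column. This is a minimal sketch only: the comparisons and the blocking rule are placeholders, not my real config.
```python
from splink.duckdb.duckdb_linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "unique_id",  # row-level id, not master_id
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("surname"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",  # one candidate blocking rule to evaluate
    ],
}

linker = DuckDBLinker(df, settings)
```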
Once the test set is in that structure, I assume the evaluation code would be something like:
```python
from sklearn.metrics import auc

# ...train the splink linker first...

# register the labelled pairs as a table the linker can query
linker.register_table(test_set, "test")
linker._initialise_df_concat_with_tf()  # note: a private method

# TP/FP rates at each match-probability threshold
roc_table = linker.truth_space_table_from_labels_table("test")
roc_df = roc_table.as_pandas_dataframe()

auc_score = auc(roc_df["FP_rate"], roc_df["TP_rate"])
print(auc_score)
```
right?
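And then, for the original goal, I imagine wrapping this in a loop over candidate blocking rule sets and keeping the one with the highest AUC. A sketch of what I mean, where make_settings and candidate_rule_sets are placeholders of my own, not splink API:
```python
from sklearn.metrics import auc
from splink.duckdb.duckdb_linker import DuckDBLinker

def auc_for_blocking_rules(df, test_set, rules):
    """Train a fresh linker with the given blocking rules, score it on the labels."""
    settings = make_settings(rules)  # hypothetical helper returning a settings dict
    linker = DuckDBLinker(df, settings)
    # ...estimate the model parameters here, as in the training step above...
    linker.register_table(test_set, "test")
    roc_df = linker.truth_space_table_from_labels_table("test").as_pandas_dataframe()
    return auc(roc_df["FP_rate"], roc_df["TP_rate"])

candidate_rule_sets = [
    ["l.surname = r.surname"],
    ["l.surname = r.surname", "l.dob = r.dob"],
]
scores = {tuple(rules): auc_for_blocking_rules(df, test_set, rules)
          for rules in candidate_rule_sets}
best_rules = max(scores, key=scores.get)
```
(I realise retraining per candidate could be slow on a large dataset; this is just to illustrate the intent.)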