Is your proposal related to a problem?
I want to better understand the clustering I get after estimating pairwise match probabilities, thresholding them, and taking connected components.
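For context, the clustering step I mean can be sketched roughly as below. This is only an illustration in plain Python, not how any particular library does it: `cluster_pairs`, its arguments, and the toy record IDs and probabilities are all made up for the example, and it uses a simple union-find to get connected components.

```python
def cluster_pairs(record_ids, scored_pairs, threshold):
    """Return {record_id: cluster_id} from pairs scored with a match probability.

    Hypothetical helper: pairs at or above `threshold` become edges, and each
    connected component of the resulting graph is one cluster.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in scored_pairs:
        if prob >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                parent[root_r] = root_l  # merge the two components

    return {r: find(r) for r in record_ids}


# Toy data (made up): a-b and b-c clear the threshold, c-d does not.
clusters = cluster_pairs(
    record_ids=["a", "b", "c", "d"],
    scored_pairs=[("a", "b", 0.97), ("b", "c", 0.91), ("c", "d", 0.20)],
    threshold=0.9,
)
```

Here "a", "b" and "c" end up in one cluster even though "a" and "c" were never compared directly, which is exactly why I want summary metrics on the resulting clusters.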
Describe the solution you'd like
It's useful to consider a quasi-identifier such as a name, and to compute the following two metrics:
For instance, if I know that names are quite clean in my data, then I want the name variation rate to be very low.
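To make the request concrete, here is a minimal pandas-free sketch of a name variation rate. It assumes the rate is the share of clusters that contain more than one distinct name; that definition is my shorthand for this example, so please check it against the formulas in the paper linked below. The function name, argument names, and toy records are hypothetical.

```python
from collections import defaultdict


def name_variation_rate(cluster_of, name_of):
    """Share of clusters containing more than one distinct name.

    Assumed definition for illustration only.
    cluster_of: {record_id: cluster_id}
    name_of:    {record_id: name}
    """
    names_in_cluster = defaultdict(set)
    for record_id, cluster_id in cluster_of.items():
        names_in_cluster[cluster_id].add(name_of[record_id])

    if not names_in_cluster:
        return 0.0  # no clusters, so nothing varies

    varied = sum(1 for names in names_in_cluster.values() if len(names) > 1)
    return varied / len(names_in_cluster)


# Toy data (made up): only cluster 0 mixes two spellings of a name.
cluster_of = {"r1": 0, "r2": 0, "r3": 1, "r4": 2}
name_of = {"r1": "Ann Lee", "r2": "Anne Lee", "r3": "Bo Chen", "r4": "Bo Chen"}
rate = name_variation_rate(cluster_of, name_of)
```

With clean names, a high value from something like this would suggest the threshold is too permissive and distinct entities are being merged.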
Describe alternatives you've considered
The er-evaluation package implements both metrics, but it relies on pandas and is quite slow. The formulas are given in my paper: https://arxiv.org/pdf/2404.05622