Fix incorrect PESEL checksum validation in PlPeselRecognizer #1520

Open
wants to merge 1 commit into
base: main
@BlaiseCz BlaiseCz commented Jan 31, 2025

Bug Description

The PESEL checksum validation in PlPeselRecognizer.validate_result() is incorrect. The current implementation compares the weighted sum modulo 10 directly with the last digit instead of deriving the control digit from it, so valid PESEL numbers are rejected (false negatives).

This affects Presidio's ability to correctly recognize and validate PESEL numbers, impacting anonymization and sensitive data detection.


To Reproduce

Run the following test:

from presidio_analyzer.predefined_recognizers import PlPeselRecognizer

pesel_recognizer = PlPeselRecognizer()

valid_pesel = "44051401359"  # This is a valid PESEL
print(pesel_recognizer.validate_result(valid_pesel))  # Expected: True, Actual: False

Note: if unsure, verify the number with this checker: https://kalkulatory.gofin.pl/kalkulatory/sprawdzanie-pesel-weryfikacja-pesel

Observed Behavior

  • The function returns False for a valid PESEL due to incorrect checksum computation.

Expected Behavior

  • A valid PESEL (with the correct checksum) should return True.

Root Cause: Incorrect Checksum Calculation

The issue lies in the final checksum validation step. The existing code:

checksum = sum(digit * weight for digit, weight in zip(digits[:10], weights))
checksum %= 10

return checksum == digits[10]  # ❌ Incorrect final checksum check!

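This compares checksum (the weighted sum modulo 10) directly with the last digit of the PESEL. The control digit is defined as (10 - checksum) % 10, so the two values agree only when checksum is 0 or 5; every other valid PESEL is rejected.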
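The arithmetic for the PESEL from the reproduction step can be traced with a small standalone sketch. The helper names `weighted_checksum`, `old_check`, and `new_check` are hypothetical and not part of Presidio; `old_check` only mirrors the comparison quoted above, and the weights are assumed to be the standard PESEL weights used in the proposed fix.

```python
# Standard PESEL weights for the first ten digits (assumed, as in the proposed fix).
WEIGHTS = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]


def weighted_checksum(pesel: str) -> int:
    """Weighted sum of the first ten digits, modulo 10."""
    digits = [int(ch) for ch in pesel]
    return sum(d * w for d, w in zip(digits[:10], WEIGHTS)) % 10


def old_check(pesel: str) -> bool:
    """Hypothetical stand-in for the comparison quoted above."""
    return weighted_checksum(pesel) == int(pesel[10])


def new_check(pesel: str) -> bool:
    """Compare the last digit with the control digit (10 - checksum) % 10."""
    return (10 - weighted_checksum(pesel)) % 10 == int(pesel[10])


pesel = "44051401359"
# Weighted sum: 4 + 12 + 0 + 45 + 1 + 12 + 0 + 9 + 3 + 15 = 101
print(weighted_checksum(pesel))  # 1
print(old_check(pesel))          # False, because 1 != 9
print(new_check(pesel))          # True, because (10 - 1) % 10 == 9
```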


Proposed Fix

The corrected validate_result() derives the control digit from the weighted sum before comparing it with the last digit:

def validate_result(self, pattern_text: str) -> bool:  # noqa D102
    if len(pattern_text) != 11 or not pattern_text.isdigit():
        return False  # Ensure the input is a valid 11-digit number

    digits = [int(digit) for digit in pattern_text]
    weights = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]  # Correct weights

    checksum = sum(digit * weight for digit, weight in zip(digits[:10], weights)) % 10
    check_digit = (10 - checksum) % 10  # ✅ Corrected final checksum computation

    return check_digit == digits[10]  # ✅ Now correctly compares with the last digit

Why This Fix Works

  • Derives the control digit as (10 - checksum) % 10 instead of using the raw modulo 10 result.
  • Accepts a number only if it has exactly 11 digits and its last digit equals the computed control digit.
  • Fixes the false-negative issue without introducing false positives.
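One way to sanity-check the claim about false positives is to confirm that, for any ten-digit prefix, exactly one final digit passes. The sketch below uses a standalone copy of the proposed logic; `is_valid_checksum` and `accepted_last_digits` are hypothetical helper names, and the random prefixes are arbitrary digit strings rather than real birth dates, so this exercises only the checksum step.

```python
import random

WEIGHTS = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]


def is_valid_checksum(pesel: str) -> bool:
    """Standalone copy of the proposed check (hypothetical helper name)."""
    if len(pesel) != 11 or not pesel.isdigit():
        return False
    digits = [int(ch) for ch in pesel]
    checksum = sum(d * w for d, w in zip(digits[:10], WEIGHTS)) % 10
    return (10 - checksum) % 10 == digits[10]


def accepted_last_digits(prefix: str) -> list:
    """Return every final digit that makes prefix + digit pass the check."""
    return [d for d in range(10) if is_valid_checksum(f"{prefix}{d}")]


rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(1000):
    prefix = "".join(rng.choice("0123456789") for _ in range(10))
    # Exactly one of the ten candidate final digits should be accepted.
    assert len(accepted_last_digits(prefix)) == 1

print(accepted_last_digits("4405140135"))  # [9]
```

A check of this kind could be adapted into a unit test against the recognizer itself, alongside a few known-valid and known-invalid PESEL numbers.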

Additional Context

  • This issue impacts Polish users relying on PESEL validation in Presidio.
  • The bug affects data masking and validation accuracy.
  • Fixing this aligns the validation with the official PESEL control digit algorithm.
