Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise an exception on receiving duplicate filepaths #242

Merged
merged 3 commits into from
Nov 22, 2023

Conversation

sanjanag
Copy link
Member

Duplicate filepaths caused out of memory error when doing joins and merges in pandas dataframes.

@sanjanag sanjanag requested a review from elisno November 21, 2023 22:43
Copy link

codecov bot commented Nov 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f653207) 96.04% compared to head (7b43c65) 95.84%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #242      +/-   ##
==========================================
- Coverage   96.04%   95.84%   -0.20%     
==========================================
  Files          16       16              
  Lines         986      987       +1     
  Branches      194      194              
==========================================
- Hits          947      946       -1     
- Misses         20       21       +1     
- Partials       19       20       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

tests/test_run.py Outdated Show resolved Hide resolved
Copy link
Member

@jwmueller jwmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, made a small suggestion you can consider, but feel free to merge.

Does this fix warrant releasing a new version of CleanVision? I'm not sure how you encountered the bug you faced that inspired this simple fix (why did you have duplicate filepaths in the first place before?)

@jwmueller
Copy link
Member

Fixes #236

sanjanag and others added 2 commits November 22, 2023 11:22
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
@sanjanag sanjanag linked an issue Nov 22, 2023 that may be closed by this pull request
@sanjanag
Copy link
Member Author

LGTM, made a small suggestion you can consider, but feel free to merge.

Does this fix warrant releasing a new version of CleanVision? I'm not sure how you encountered the bug you faced that inspired this simple fix (why did you have duplicate filepaths in the first place before?)

I would hold off on a new release, as this is an odd occurrence. Also this #222 is another bug I want to fix before doing it . I was working with Amazon Berkeley Objects dataset, and it has a separate file for product metadata and images separately. I wrote some code to map the two of them, and the duplicates filepaths got introduced in that (not because of some error in code, but how the metadata file is structured).

@sanjanag sanjanag merged commit d1334e0 into cleanlab:main Nov 22, 2023
21 of 22 checks passed
@sanjanag sanjanag deleted the catch-dup-files branch November 22, 2023 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Catch duplicate entries in filepaths argument
3 participants