
Dataset Verification #22

Merged (1 commit), Sep 19, 2018
Conversation

@ejhumphrey (Contributor):
Adds a script to verify that the dataset contains the expected files (audio, VGGish features, and sparse labels), along with the necessary checksums for each.

Potentially blocked on #21, will close #6 when it's back.

Open question: Does it add value to also verify label statistics if we're at least matching on the checksum? Same could be said for the duration and VGGish shapes, though those have been added to prevent regressions in the future.
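The checksum half of that verification can be sketched roughly as follows. This is a minimal sketch, not the script in this PR: the manifest format, function names, and mismatch reporting here are all assumptions.

```python
import hashlib
import os


def md5_checksum(fname, blocksize=65536):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    md5 = hashlib.md5()
    with open(fname, 'rb') as fh:
        for block in iter(lambda: fh.read(blocksize), b''):
            md5.update(block)
    return md5.hexdigest()


def verify_files(manifest):
    """Check that every file in `manifest` (filename -> expected MD5
    hex digest) exists and matches its stored checksum."""
    success = True
    for fname, expected in manifest.items():
        if not os.path.exists(fname):
            print('missing file: {}'.format(fname))
            success = False
        elif md5_checksum(fname) != expected:
            print('checksum mismatch: {}'.format(fname))
            success = False
    return success
```

Chunked reading keeps memory flat regardless of audio file size, which matters when iterating over a full dataset.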

@coveralls:
Pull Request Test Coverage Report for Build 16

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 94.068%

Totals Coverage Status
Change from base Build 14: 0.0%
Covered Lines: 222
Relevant Lines: 236

💛 - Coveralls

1 similar comment from @coveralls, Aug 28, 2018.

@ejhumphrey ejhumphrey mentioned this pull request Aug 30, 2018
@ejhumphrey (Contributor, Author):

Because of #24, it feels like some kind of duplicate check should happen in this script now. Not sure how computationally intensive it would be to iterate over a brute-force nearest-neighbor model, but the checksums give us something, at least.
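For the byte-identical case, the checksums already catch duplicates cheaply: group files by digest and flag any group larger than one. A hedged sketch (function name and interface are hypothetical; near-duplicates with differing bytes would still need an audio-similarity pass):

```python
import collections
import hashlib


def find_exact_duplicates(fnames):
    """Return lists of filenames that share an MD5 digest,
    i.e. groups of byte-identical files."""
    by_digest = collections.defaultdict(list)
    for fname in fnames:
        with open(fname, 'rb') as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        by_digest[digest].append(fname)
    return [group for group in by_digest.values() if len(group) > 1]
```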

@ejhumphrey ejhumphrey requested a review from bmcfee September 17, 2018 11:54
@ejhumphrey (Contributor, Author):

@bmcfee sorry, just realized no one got tagged on this

@bmcfee (Collaborator) left a comment:
minor comments, but lgtm otherwise



def _check_duration(fname, expected_duration, tolerance):
    dur = sf.info(fname).duration
@bmcfee (Collaborator):

Just a heads-up on this one: sf.info() uses header data to infer duration, and this isn't always reliable. See librosa/librosa#686 .

I think it's fine to leave it as is here, but maybe put a comment in place noting that this isn't 100% foolproof.

@ejhumphrey (Contributor, Author):

gah, dammit ... this is the same problem where sox bit me on the bogus FMA headers.

I'd rather load the audio to compute duration post-decoding and eat that computational cost, since this is supposed to be an infrequent but robust guarantee that things are legit. MD5 checksums would catch deltas, but this would help identify this specific (known) issue.

if act_shape != shape:
    warnings.warn('{}:{} has mismatched shapes: {} != {}'
                  .format(json_file, key, act_shape, shape))
    success &= False
@bmcfee (Collaborator):
could just be success = False

@ejhumphrey (Contributor, Author):

👍 ya old habits die hard for not re-initing a var

@ejhumphrey ejhumphrey merged commit 3b0957a into master Sep 19, 2018
@ejhumphrey (Contributor, Author):

closes #6

Successfully merging this pull request may close these issues.

Dataset integrity checks
3 participants