Dataset Verification #22
Because of #24, it feels like some kind of duplicate check should happen in this script now. Not sure how computationally intensive it would be to iterate over a brute-force nearest-neighbor model, but the checksums give us something, at least.
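A checksum-based duplicate check could be sketched roughly as below. This is only an illustration, not the script in this PR: `md5_of_file` and `find_duplicates` are hypothetical helper names, and it only catches byte-identical files, not near-duplicates that a nearest-neighbor search would find.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths by checksum and return only groups with more than one file."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[md5_of_file(path)].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

The cost is one pass over each file, so it scales linearly with dataset size rather than quadratically like a brute-force pairwise comparison.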
@bmcfee sorry, just realized no one got tagged on this
minor comments, but lgtm otherwise
def _check_duration(fname, expected_duration, tolerance):
    dur = sf.info(fname).duration
Just a heads-up on this one: sf.info() uses header data to infer duration, and this isn't always reliable. See librosa/librosa#686.
I think it's fine to leave it as is here, but maybe put a comment in place that this isn't 100% foolproof.
gah, dammit ... this is the same problem that bit me with sox on the bogus FMA headers.
I'd rather load the audio to compute duration post-decoding and eat that computational cost, since this is supposed to be an infrequent but robust guarantee that things are legit. MD5 checksums would catch deltas, but this would help identify this specific (known) issue.
if act_shape != shape:
    raise warnings.warn('{}:{} has mismatched shapes: {} != {}'
                        .format(json_file, key, act_shape, shape))
    success &= False
could just be success = False
👍 ya old habits die hard for not re-initing a var
closes #6
Adds a script to verify the dataset contains the expected files (audio, vggish, and sparse labels), along with the necessary checksums for each.
Potentially blocked on #21, will close #6 when it's back.
Open question: Does it add value to also verify label statistics if we're at least matching on the checksum? Same could be said for the duration and VGGish shapes, though those have been added to prevent regressions in the future.