
Dataset Verification #22

Merged (1 commit), Sep 19, 2018
Conversation

@ejhumphrey (Contributor):
Adds a script to verify that the dataset contains the expected files (audio, VGGish features, and sparse labels), along with the necessary checksums for each.

Potentially blocked on #21, will close #6 when it's back.

Open question: Does it add value to also verify label statistics if we're at least matching on the checksum? Same could be said for the duration and VGGish shapes, though those have been added to prevent regressions in the future.
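The checksum half of that verification can be sketched roughly as follows. This is a minimal sketch, not the script in this PR: the manifest format, function names, and mismatch reporting here are all assumptions.

```python
import hashlib
import os


def md5_checksum(fname, blocksize=65536):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    md5 = hashlib.md5()
    with open(fname, 'rb') as fh:
        for block in iter(lambda: fh.read(blocksize), b''):
            md5.update(block)
    return md5.hexdigest()


def verify_files(manifest):
    """Check that every file in `manifest` (filename -> expected MD5
    hex digest) exists and matches its stored checksum."""
    success = True
    for fname, expected in manifest.items():
        if not os.path.exists(fname):
            print('missing file: {}'.format(fname))
            success = False
        elif md5_checksum(fname) != expected:
            print('checksum mismatch: {}'.format(fname))
            success = False
    return success
```

Chunked reading keeps memory flat regardless of audio file size, which matters when iterating over a full dataset.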

@coveralls:
Pull Request Test Coverage Report for Build 16

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 94.068%

Totals Coverage Status
Change from base Build 14: 0.0%
Covered Lines: 222
Relevant Lines: 236

💛 - Coveralls

1 similar comment from @coveralls, Aug 28, 2018.

@ejhumphrey ejhumphrey mentioned this pull request Aug 30, 2018
@ejhumphrey (Contributor, Author):

Because of #24, it feels like some kind of duplicate check should happen in this script now. Not sure how computationally intensive it would be to iterate over a brute-force nearest-neighbor model, but the checksums give us something, at least.
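For the byte-identical case, the checksums already catch duplicates cheaply: group files by digest and flag any group larger than one. A hedged sketch (function name and interface are hypothetical; near-duplicates with differing bytes would still need an audio-similarity pass):

```python
import collections
import hashlib


def find_exact_duplicates(fnames):
    """Return lists of filenames that share an MD5 digest,
    i.e. groups of byte-identical files."""
    by_digest = collections.defaultdict(list)
    for fname in fnames:
        with open(fname, 'rb') as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        by_digest[digest].append(fname)
    return [group for group in by_digest.values() if len(group) > 1]
```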

@ejhumphrey ejhumphrey requested a review from bmcfee September 17, 2018 11:54
@ejhumphrey (Contributor, Author):

@bmcfee sorry, just realized no one got tagged on this

@bmcfee (Collaborator) left a comment:
minor comments, but lgtm otherwise



def _check_duration(fname, expected_duration, tolerance):
    dur = sf.info(fname).duration
@bmcfee (Collaborator):

Just a heads-up on this one: sf.info() uses header data to infer duration, and this isn't always reliable. See librosa/librosa#686 .

I think it's fine to leave it as is here, but maybe put a comment in place noting that this isn't 100% foolproof.

@ejhumphrey (Contributor, Author):

gah, dammit ... this is the same problem where sox bit me on the bogus FMA headers.

I'd rather load the audio to compute duration post-decoding and eat that computational cost, since this is supposed to be an infrequent but robust guarantee that things are legit. MD5 checksums would catch deltas, but this would help identify this specific (known) issue.

if act_shape != shape:
    warnings.warn('{}:{} has mismatched shapes: {} != {}'
                  .format(json_file, key, act_shape, shape))
    success &= False
@bmcfee (Collaborator):
could just be success = False

@ejhumphrey (Contributor, Author):

👍 ya old habits die hard for not re-initing a var

@ejhumphrey ejhumphrey merged commit 3b0957a into master Sep 19, 2018
@ejhumphrey (Contributor, Author):

closes #6

Successfully merging this pull request may close these issues.

Dataset integrity checks
3 participants