
Replicate scores for major papers using Frechet Audio Distance #3

Closed
gudgud96 opened this issue Oct 26, 2022 · 7 comments

gudgud96 commented Oct 26, 2022

As mentioned in fcaspe/ddx7#1, I am not yet able to replicate the FAD score reported in the paper to a satisfactory level.

This needs further investigation to determine whether the gap comes from inherent implementation differences relative to the Google version, or from factors outside the FAD calculation itself. I have therefore decided to benchmark the FAD scores reported in several major works against the scores calculated here. Candidates to be listed (I will start with DDX7); paper suggestions are welcome.
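For context, FAD fits a multivariate Gaussian to the embeddings of each audio set and takes the Frechet distance between the two Gaussians. A minimal numpy/scipy sketch of that final step (function and variable names are mine, not taken from either implementation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two embedding sets.

    FAD = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    where each row of emb_a / emb_b is one embedding frame.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    sigma_a = np.cov(emb_a, rowvar=False)
    sigma_b = np.cov(emb_b, rowvar=False)
    covmean = linalg.sqrtm(sigma_a @ sigma_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from sqrtm noise
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))
```

Any embedding-level discrepancy between the two VGGish ports feeds straight into the mean/covariance estimates here, which is why small model diffs can move the final score.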

gudgud96 self-assigned this Oct 26, 2022
@yoyolicoris

I also ran into the same issue recently.
Maybe Google has re-trained their VGGish model, so the outputs differ?
torchvggish hasn't been updated in a long time.

@gudgud96
Owner Author

It doesn't look like there was a re-train, judging from the last-modified dates at https://storage.googleapis.com/audioset:

[screenshot: last-modified dates of the VGGish checkpoint files]

Let me do a further diff check on both models.

@gudgud96
Owner Author

Did a check based on the test_audio/ files provided in google-research/frechet-audio-distance, which are distorted sine waves:

https://github.com/google-research/google-research/blob/master/frechet_audio_distance/gen_test_files.py#L86

My results are pretty close to the originals:

                         baseline vs test1   baseline vs test2
google-research          12.4375             4.7680
frechet_audio_distance   12.7398             4.9815

@yoyololicon do you have a failing case that you could share? I could do more digging to see if there are any failure modes.
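For anyone who wants a quick local stand-in for those test files without cloning the Google repo, here is a hypothetical clipped-sine generator. This is NOT the actual gen_test_files.py logic (its distortion model may differ); the function name, hard-clipping scheme, and parameters below are illustrative assumptions only:

```python
import numpy as np

def distorted_sine(freq_hz=440.0, dur_s=1.0, sr=16000, clip_level=1.0):
    """Sine tone hard-clipped at clip_level, then renormalized.

    clip_level=1.0 leaves the tone clean; lower values flatten the
    peaks, adding harmonic distortion (hypothetical distortion model).
    """
    t = np.arange(int(dur_s * sr)) / sr
    x = np.sin(2.0 * np.pi * freq_hz * t)
    return np.clip(x, -clip_level, clip_level) / clip_level

baseline = distorted_sine(clip_level=1.0)  # clean reference tone
test1 = distorted_sine(clip_level=0.2)     # heavily clipped variant
```

Signals like these make a convenient smoke test: the more distorted variant should land measurably farther from the baseline under any correct FAD implementation.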

@yoyolicoris

@gudgud96
Here is a test script I got from a colleague.
It compares the two models' embeddings directly, and we think the difference is significant.

import numpy as np
import torch
import tensorflow as tf
import tensorflow_hub as hub

# PyTorch port of VGGish
model_torch = torch.hub.load('harritaylor/torchvggish', 'vggish')
model_torch.postprocess = False  # compare raw embeddings, skip PCA/quantization
model_torch.eval()

# Reference TF-Hub VGGish
model_tf = hub.load("https://tfhub.dev/google/vggish/1")

# 5 seconds of random noise at 16 kHz (float32 for the TF-Hub model)
sample = np.random.uniform(-1, 1, size=16000 * 5).astype(np.float32)

with torch.no_grad():
    torch_embeddings = model_torch(sample, 16000).cpu().numpy()
tf_embeddings = model_tf(sample).numpy()

# Mean L2 distance between corresponding embedding frames
print(np.linalg.norm(torch_embeddings - tf_embeddings, axis=1).mean())

@gudgud96
Owner Author

@yoyololicon I ran your script; the issue is that torchvggish has an extra final ReLU layer compared to the original google-research implementation, see harritaylor/torchvggish#24.

As suggested by @brentspell (credits to Brent once again!), we can disable the final ReLU layer in torchvggish. I use this in my implementation as well:

import torch
from torch import nn

model_torch = torch.hub.load('harritaylor/torchvggish', 'vggish')
model_torch.postprocess = False
# Drop the trailing ReLU from the embedding head
model_torch.embeddings = nn.Sequential(*list(model_torch.embeddings.children())[:-1])
model_torch.eval()

You should be able to see that the difference is very minimal in this case. Hope it helps!
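To see why that stray ReLU matters for FAD specifically, here is a small numpy-only illustration (no model download needed). Zeroing the negative half of each embedding dimension shifts the mean of the fitted Gaussian, which feeds directly into the Frechet distance; the array here is a synthetic stand-in for embeddings, not real VGGish output:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 4))   # synthetic stand-in for raw embeddings
emb_relu = np.maximum(emb, 0.0)      # what the extra final ReLU does

print(emb.mean(axis=0))       # ~0 per dimension
print(emb_relu.mean(axis=0))  # ~0.4 per dimension (E[max(Z,0)] = 1/sqrt(2*pi))
```

Since FAD compares the means and covariances of embedding distributions, only one side having this extra ReLU systematically biases the score.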

@yoyolicoris

@gudgud96 Oh, I completely missed this. Thanks for pointing it out!

@gudgud96
Owner Author

Closing this issue for now, as the basic test on sine tones passes. If there are questions about accuracy or interest in replicating the exact numbers from papers, we can re-open it.
