
AV2AV

Official PyTorch implementation for the following paper:

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, Yong Man Ro
CVPR 2024 (Highlight)
[Paper] [Demo]

Method

(Figure: overview of the AV2AV framework.)

Setup

  • Python >=3.7,<3.11
git clone -b main --single-branch https://github.com/choijeongsoo/av2av
cd av2av
git submodule init
git submodule update
pip install -e fairseq
pip install -r requirements.txt
conda install "ffmpeg<5" -c conda-forge
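
To confirm the environment is usable (a quick sanity check we suggest here, not part of the official setup), verify that the editable fairseq install and PyTorch both import cleanly:

# print the installed fairseq and torch versions
$ python -c "import fairseq, torch; print(fairseq.__version__, torch.__version__)"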

Dataset

Name     Language                                     Link
LRS3     English                                      here
mTEDx    Spanish, French, Italian, and Portuguese     here
  • We use the curated lists from this work to filter the mTEDx data.
  • For more details, please refer to the 'Dataset' section in our paper.

Data Preprocessing

  • We follow the Auto-AVSR pipeline to preprocess the audio-visual data; a minimal sketch of the rate-normalization step is shown below.
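
As a rough illustration only (the actual mouth-ROI cropping and alignment follow the Auto-AVSR tooling, not this repository), inputs are typically normalized to 25 fps video and 16 kHz mono audio before the mouth region is cropped. A minimal ffmpeg sketch of that normalization, with hypothetical input and output paths:

# normalize to 25 fps video and 16 kHz mono audio (paths are placeholders)
$ ffmpeg -i raw/clip.mp4 -vf fps=25 -ar 16000 -ac 1 preprocessed/clip.mp4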

Model Checkpoints

Stage                            Download Link
AV Speech Unit Extraction        mavhubert_large_noise.pt
Multilingual AV2AV Translation   utut_sts_ft.pt
Zero-shot AV-Renderer            unit_av_renderer.pt
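
After downloading, a quick way to check that a checkpoint file is intact is to load it on CPU and list its top-level keys. This is a generic PyTorch snippet, not a script from this repository (on recent PyTorch versions you may additionally need weights_only=False):

# load a downloaded checkpoint on CPU and print its top-level keys
$ python -c "import torch; print(list(torch.load('/path/to/utut_sts_ft.pt', map_location='cpu').keys()))"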

Inference

Pipeline for Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV)

$ cd av2av
$ PYTHONPATH=fairseq python inference.py \
  --in-vid-path samples/en/TRajLqEaWhQ_00002.mp4 \
  --out-vid-path samples/es/TRajLqEaWhQ_00002.mp4 \
  --src-lang en --tgt-lang es \
  --av2unit-path /path/to/mavhubert_large_noise.pt \
  --utut-path /path/to/utut_sts_ft.pt \
  --unit2av-path /path/to/unit_av_renderer.pt

  • Our model supports 5 languages: en (English), es (Spanish), fr (French), it (Italian), and pt (Portuguese).
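
To translate the same clip into every supported target language, one straightforward extension (a sketch built from the flags shown above, not a script shipped with the repository) is a shell loop over the target-language codes:

# translate one English clip into each of the four other supported languages
$ for tgt in es fr it pt; do
    PYTHONPATH=fairseq python inference.py \
      --in-vid-path samples/en/TRajLqEaWhQ_00002.mp4 \
      --out-vid-path samples/$tgt/TRajLqEaWhQ_00002.mp4 \
      --src-lang en --tgt-lang $tgt \
      --av2unit-path /path/to/mavhubert_large_noise.pt \
      --utut-path /path/to/utut_sts_ft.pt \
      --unit2av-path /path/to/unit_av_renderer.pt
  done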

Acknowledgement

This repository is built upon AV-HuBERT, UTUT, speech-resynthesis, Wav2Lip, and Fairseq. We thank the authors for open-sourcing these projects.

Citation

If our work is useful for your research, please consider citing the following papers:

@inproceedings{choi2024av2av,
  title={AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation},
  author={Choi, Jeongsoo and Park, Se Jin and Kim, Minsu and Ro, Yong Man},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
@article{kim2024textless,
  title={Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation},
  author={Kim, Minsu and Choi, Jeongsoo and Kim, Dahun and Ro, Yong Man},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2024}
}
