Official PyTorch implementation for the following paper:
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, Yong Man Ro
CVPR 2024 (Highlight)
[Paper] [Demo]
- Python >=3.7,<3.11
git clone -b main --single-branch https://github.com/choijeongsoo/av2av
cd av2av
git submodule init
git submodule update
pip install -e fairseq
pip install -r requirements.txt
conda install "ffmpeg<5" -c conda-forge
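After installation, a quick import check like the sketch below (hypothetical, not part of this repo) can confirm that PyTorch and the editable fairseq build are visible in the environment:

```python
# Optional sanity check (not part of this repo): verify that PyTorch and the
# editable fairseq install from the steps above are importable.
import torch
import fairseq

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq", fairseq.__version__)
```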
Name | Language | Link |
---|---|---|
LRS3 | English | here |
mTEDx | Spanish, French, Italian, and Portuguese | here |
- We use the curated lists from this work to filter mTEDx.
- For more details, please refer to the 'Dataset' section in our paper.
- We follow Auto-AVSR to preprocess the audio-visual data; a minimal format-normalization sketch is given below.
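The sketch below is illustrative only and is not the Auto-AVSR pipeline itself: it just re-encodes a raw clip to the 25 fps video / 16 kHz mono audio format commonly used by AV-HuBERT-style models. The input path and target rates are assumptions, and mouth-ROI cropping is handled by the Auto-AVSR tools, not shown here.

```python
# Illustrative only; not the Auto-AVSR preprocessing pipeline itself.
# Assumes 25 fps video and 16 kHz mono audio as the target format.
import subprocess

def normalize_av(in_path: str, out_path: str) -> None:
    """Re-encode a clip to 25 fps video with 16 kHz mono audio via ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", in_path,
            "-r", "25",                   # video frame rate
            "-ar", "16000", "-ac", "1",   # audio sample rate / channels
            out_path,
        ],
        check=True,
    )

# hypothetical raw input path
normalize_av("samples/raw/TRajLqEaWhQ_00002.mp4", "samples/en/TRajLqEaWhQ_00002.mp4")
```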
Stage | Download Link |
---|---|
AV Speech Unit Extraction | mavhubert_large_noise.pt |
Multilingual AV2AV Translation | utut_sts_ft.pt |
Zero-shot AV-Renderer | unit_av_renderer.pt |
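If you want to inspect a downloaded checkpoint before running inference, something like the sketch below works for typical fairseq-style checkpoints (assumed here to be plain torch pickles with a top-level "model" state dict; key names may differ):

```python
# Rough checkpoint inspection; assumes a fairseq-style checkpoint layout.
import torch

ckpt = torch.load("/path/to/utut_sts_ft.pt", map_location="cpu")
print(sorted(ckpt.keys()))  # typically includes "model" and "cfg"/"args"

if "model" in ckpt:
    n_params = sum(p.numel() for p in ckpt["model"].values())
    print(f"parameters: {n_params / 1e6:.1f}M")
```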
$ cd av2av
$ PYTHONPATH=fairseq python inference.py \
--in-vid-path samples/en/TRajLqEaWhQ_00002.mp4 \
--out-vid-path samples/es/TRajLqEaWhQ_00002.mp4 \
--src-lang en --tgt-lang es \
--av2unit-path /path/to/mavhubert_large_noise.pt \
--utut-path /path/to/utut_sts_ft.pt \
--unit2av-path /path/to/unit_av_renderer.pt
- Our model supports 5 languages: en (English), es (Spanish), fr (French), it (Italian), and pt (Portuguese).
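To translate the same input clip into several target languages, a small wrapper around the command above can loop over languages. This is a convenience sketch, not a script shipped with the repo, and the checkpoint paths are placeholders:

```python
# Hypothetical batch wrapper around inference.py; not part of this repo.
import os
import subprocess

SRC_VID = "samples/en/TRajLqEaWhQ_00002.mp4"
CKPTS = [
    "--av2unit-path", "/path/to/mavhubert_large_noise.pt",
    "--utut-path", "/path/to/utut_sts_ft.pt",
    "--unit2av-path", "/path/to/unit_av_renderer.pt",
]

env = dict(os.environ, PYTHONPATH="fairseq")
for tgt in ["es", "fr", "it", "pt"]:
    cmd = [
        "python", "inference.py",
        "--in-vid-path", SRC_VID,
        "--out-vid-path", f"samples/{tgt}/TRajLqEaWhQ_00002.mp4",
        "--src-lang", "en", "--tgt-lang", tgt,
    ] + CKPTS
    subprocess.run(cmd, check=True, env=env)
```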
This repository is built upon AV-HuBERT, UTUT, speech-resynthesis, Wav2Lip, and Fairseq. We thank the authors for open-sourcing their projects.
If our work is useful for your research, please consider citing the following papers:
@inproceedings{choi2024av2av,
title={AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation},
author={Choi, Jeongsoo and Park, Se Jin and Kim, Minsu and Ro, Yong Man},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
@article{kim2024textless,
title={Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation},
author={Kim, Minsu and Choi, Jeongsoo and Kim, Dahun and Ro, Yong Man},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2024}
}