Official PyTorch implementation for the following paper:
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, Yong Man Ro
CVPR 2024 (Highlight)
[Paper] [Demo]
- Python >=3.7,<3.11
git clone -b main --single-branch https://github.com/choijeongsoo/av2av
cd av2av
git submodule init
git submodule update
pip install -e fairseq
pip install -r requirements.txt
conda install "ffmpeg<5" -c conda-forge
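After installation, a quick import check like the sketch below (hypothetical, not part of this repo) can confirm that PyTorch and the editable fairseq build are visible in the environment:

```python
# Optional sanity check (not part of this repo): verify that PyTorch and the
# editable fairseq install from the steps above are importable.
import torch
import fairseq

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq", fairseq.__version__)
```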
Name | Language | Link |
---|---|---|
LRS3 | English | here |
mTEDx | Spanish, French, Italian, and Portuguese | here |
- We use the curated lists from this work to filter mTEDx.
- For more details, please refer to the 'Dataset' section in our paper.
- We follow Auto-AVSR to preprocess the audio-visual data; a minimal format-normalization sketch is given below.
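The sketch below is illustrative only and is not the Auto-AVSR pipeline itself: it just re-encodes a raw clip to the 25 fps video / 16 kHz mono audio format commonly used by AV-HuBERT-style models. The input path and target rates are assumptions, and mouth-ROI cropping is handled by the Auto-AVSR tools, not shown here.

```python
# Illustrative only; not the Auto-AVSR preprocessing pipeline itself.
# Assumes 25 fps video and 16 kHz mono audio as the target format.
import subprocess

def normalize_av(in_path: str, out_path: str) -> None:
    """Re-encode a clip to 25 fps video with 16 kHz mono audio via ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", in_path,
            "-r", "25",                   # video frame rate
            "-ar", "16000", "-ac", "1",   # audio sample rate / channels
            out_path,
        ],
        check=True,
    )

# hypothetical raw input path
normalize_av("samples/raw/TRajLqEaWhQ_00002.mp4", "samples/en/TRajLqEaWhQ_00002.mp4")
```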
Stage | Download Link |
---|---|
AV Speech Unit Extraction | mavhubert_large_noise.pt |
Multilingual AV2AV Translation | utut_sts_ft.pt |
Zero-shot AV-Renderer | unit_av_renderer.pt |
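If you want to inspect a downloaded checkpoint before running inference, something like the sketch below works for typical fairseq-style checkpoints (assumed here to be plain torch pickles with a top-level "model" state dict; key names may differ):

```python
# Rough checkpoint inspection; assumes a fairseq-style checkpoint layout.
import torch

ckpt = torch.load("/path/to/utut_sts_ft.pt", map_location="cpu")
print(sorted(ckpt.keys()))  # typically includes "model" and "cfg"/"args"

if "model" in ckpt:
    n_params = sum(p.numel() for p in ckpt["model"].values())
    print(f"parameters: {n_params / 1e6:.1f}M")
```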
$ cd av2av
$ PYTHONPATH=fairseq python inference.py \
--in-vid-path samples/en/TRajLqEaWhQ_00002.mp4 \
--out-vid-path samples/es/TRajLqEaWhQ_00002.mp4 \
--src-lang en --tgt-lang es \
--av2unit-path /path/to/mavhubert_large_noise.pt \
--utut-path /path/to/utut_sts_ft.pt \
--unit2av-path /path/to/unit_av_renderer.pt
- Our model supports 5 languages: en (English), es (Spanish), fr (French), it (Italian), and pt (Portuguese).
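To translate the same input clip into several target languages, a small wrapper around the command above can loop over languages. This is a convenience sketch, not a script shipped with the repo, and the checkpoint paths are placeholders:

```python
# Hypothetical batch wrapper around inference.py; not part of this repo.
import os
import subprocess

SRC_VID = "samples/en/TRajLqEaWhQ_00002.mp4"
CKPTS = [
    "--av2unit-path", "/path/to/mavhubert_large_noise.pt",
    "--utut-path", "/path/to/utut_sts_ft.pt",
    "--unit2av-path", "/path/to/unit_av_renderer.pt",
]

env = dict(os.environ, PYTHONPATH="fairseq")
for tgt in ["es", "fr", "it", "pt"]:
    cmd = [
        "python", "inference.py",
        "--in-vid-path", SRC_VID,
        "--out-vid-path", f"samples/{tgt}/TRajLqEaWhQ_00002.mp4",
        "--src-lang", "en", "--tgt-lang", tgt,
    ] + CKPTS
    subprocess.run(cmd, check=True, env=env)
```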
This repository is built upon AV-HuBERT, UTUT, speech-resynthesis, Wav2Lip, and Fairseq. We thank the authors for open-sourcing their projects.
If our work is useful for your research, please consider citing the following papers:
@inproceedings{choi2024av2av,
title={AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation},
author={Choi, Jeongsoo and Park, Se Jin and Kim, Minsu and Ro, Yong Man},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
@article{kim2024textless,
title={Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation},
author={Kim, Minsu and Choi, Jeongsoo and Kim, Dahun and Ro, Yong Man},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2024}
}