F5-TTS: Diffusion Transformer with ConvNeXt V2, faster trained and inference.
E2 TTS: Flat-UNet Transformer, closest reproduction from paper.
Sway Sampling: Inference-time flow step sampling strategy, greatly improves performance
- 2024/10/08: F5-TTS & E2 TTS base models on 🤗 Hugging Face, 🤖 Model Scope, 🟣 Wisemodel.
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n f5-tts python=3.10
conda activate f5-tts
# Install pytorch with your CUDA version, e.g.
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
Then you can choose from a few options below:
pip install git+https://github.com/SWivid/F5-TTS.git
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
# git submodule update --init --recursive # (optional, if need bigvgan)
pip install -e .
If initialize submodule, you should add the following code at the beginning of src/third_party/BigVGAN/bigvgan.py
.
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Build from Dockerfile
docker build -t f5tts:v1 .
# Or pull from GitHub Container Registry
docker pull ghcr.io/swivid/f5-tts:main
Currently supported features:
- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct
# Launch a Gradio app (web interface)
f5-tts_infer-gradio
# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0
# Launch a share link
f5-tts_infer-gradio --share
# Run with flags
# Leave --ref_text "" will have ASR model transcribe (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."
# Run with default setting. src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml
# Multi voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
- In order to have better generation results, take a moment to read detailed guidance.
- The Issues are very useful, please try to find the solution by properly searching the keywords of problem encountered. If no answer found, then feel free to open an issue.
Read training & finetuning guidance for more instructions.
# Quick start with Gradio web interface
f5-tts_finetune-gradio
Use pre-commit to ensure code quality (will run linters and formatters automatically)
pip install pre-commit
pre-commit install
When making a pull request, before each commit, run:
pre-commit run --all-files
Note: Some model components have linting exceptions for E722 to accommodate tensor notation
- E2-TTS brilliant work, simple and effective
- Emilia, WenetSpeech4TTS valuable datasets
- lucidrains initial CFM structure with also bfs18 for discussion
- SD3 & Hugging Face diffusers DiT and MMDiT code structure
- torchdiffeq as ODE solver, Vocos as vocoder
- FunASR, faster-whisper, UniSpeech for evaluation tools
- ctc-forced-aligner for speech edit test
- mrfakename huggingface space demo ~
- f5-tts-mlx Implementation with MLX framework by Lucas Newman
- F5-TTS-ONNX ONNX Runtime version by DakeQQ
If our work and codebase is useful for you, please cite as:
@article{chen-etal-2024-f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
journal={arXiv preprint arXiv:2410.06885},
year={2024},
}
Our code is released under MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.