#
GPT-SoVITS
Last update: Mar 8, 2024
#
#
Introduction
GPT-SoVITS is an open-source repository focused on TTS & cross-language inference, with a Colab port coming soon. Credits to RVC-Boss.
Currently it only supports Chinese, English & Japanese. More languages are coming soon.
You'll need a capable PC & an NVIDIA GPU with at least 6GB of VRAM to run it smoothly. Otherwise, use the Colab.
This guide is a translation of the original one with a few tweaks, made by Delik. [ Discord: @delik - Wechat: Dellikk ]
#
#
Installation
#
#
1. Download prezip
- Download the prezip of the latest version here.
#
#
2. Extract
- Unzip the folder. It's advisable to use 7-Zip to do so.
#
#
3. Launch
- Open the folder & run go-web.bat to open the WebUI.
#
#
Training
#
#
1. Prepare dataset
The dataset should be between 1 & 30 minutes long, but prioritize quality over quantity.
For the best results, ensure the audio is properly cleaned, free of undesired noises & distortions.
GPT-SoVITS is made for TTS only, so it's also best to remove any singing or muffled voice parts.
# Learn how to clean datasets.
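If you want to check the total length of your dataset before moving on, a minimal Python sketch like the one below could help. It is not part of GPT-SoVITS: the function names are made up for illustration, and it assumes your clips are uncompressed PCM .wav files sitting directly in one folder.

```python
import wave
from pathlib import Path

MIN_SECONDS = 1 * 60   # lower bound from the guide: 1 minute
MAX_SECONDS = 30 * 60  # upper bound from the guide: 30 minutes


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_duration_seconds(folder):
    """Sum the durations of all .wav files directly inside `folder`."""
    return sum(wav_duration_seconds(p) for p in sorted(Path(folder).glob("*.wav")))


def is_within_recommended_length(total_seconds):
    """True if the total length falls inside the 1-30 minute range."""
    return MIN_SECONDS <= total_seconds <= MAX_SECONDS
```

Call dataset_duration_seconds on your dataset folder & pass the result to is_within_recommended_length. Length is only a rough check; the quality of the audio still matters more.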
#
#
2. Audio Slicer
Copy the file path of your dataset & paste it in the Audio slicer input bar.
Create a new folder somewhere. Copy its folder path & paste it in Audio slicer output. This is where the sliced clips will be stored.
Adjust the parameters if needed.
Finally, click Start Audio Slicer to complete this step.
#
#
3. ASR
The Input folder path should be the same as Audio slicer output. Just copy the path & paste it inside the bar.
If the dataset is in English/Japanese, use Faster-Whisper large v3. If it's in Chinese, use 达摩ASR (Damo ASR). Then click Start batch ASR.
If you run GPT-SoVITS for the first time, you might need to wait a few minutes for it to download the ASR model you selected.
Finally, locate the .list file & copy its path. It will be in output/asr_opt, if you didn't change the Output folder path.
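To get a quick idea of which clips need attention before labelling, you could scan the .list file with a small script. The sketch below is only an illustration: it assumes each line has the form audio_path|speaker|language|text, which this guide does not state, so open your own .list file & confirm the layout first. The names ListEntry, parse_list_line & find_empty_transcripts are hypothetical.

```python
from typing import NamedTuple


class ListEntry(NamedTuple):
    audio_path: str
    speaker: str
    language: str
    text: str


def parse_list_line(line):
    """Split one line into four fields (assumed layout: path|speaker|language|text)."""
    # maxsplit=3 keeps any '|' inside the transcript as part of the text field.
    audio_path, speaker, language, text = line.rstrip("\n").split("|", 3)
    return ListEntry(audio_path, speaker, language, text)


def find_empty_transcripts(lines):
    """Return the entries whose transcript is blank; blank lines are skipped."""
    entries = (parse_list_line(line) for line in lines if line.strip())
    return [entry for entry in entries if not entry.text.strip()]
```

For example, read the file with open(path, encoding="utf-8") & pass the file object to find_empty_transcripts to list the clips that came back without any text.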
#
#
4. Text Labelling (optional)
Paste the .list file path into .list annotation file path. Tick Open labelling WebUI to open the Text Labelling WebUI. A new tab will open.
Listen to each clip & edit the text if it's not transcribed properly.
The functions are self-explanatory. Use next index & previous index to check the next/previous page.
If you make changes, remember to save file & submit text.
#
#
5. Formatting
Click 1-GPT-SOVITS-TTS & 1A-Dataset formatting to enter the training page.
Input the name of your model in Experiment/model name, & the .list file path in Text labelling file. Scroll down to the end & start One-click formatting to begin formatting.
#
#
6. SoVITS Training
- Scroll up then click 1B-Fine-tuned training.
#
# Recommended settings for SoVITS training:
- Batch size: 2 (use 1 if the GPU has 6GB VRAM)
- Total epochs: 8
- Text model learning rate weighting: <= 0.4
- Save frequency: 4
After that, click Start SoVITS training.
#
#
7. GPT Training
# Recommended settings for GPT training:
- Batch size: 2 (use 1 if the GPU has 6GB VRAM)
- Total epochs: 10
- Save frequency: 5
- DPO training: disabled (explained later)
After that, click Start GPT training
You can't train both simultaneously unless you have 2 or more GPUs.
#
#
DPO training (optional)
DPO training greatly improves the performance (not audio quality) & stability of the model.
It can infer more text at once without slicing & it's less prone to errors (like repeating/skipping words) when inferring.
# For this, you'll require:
- A GPU with 12G VRAM or more.
- A very high quality dataset (you need to do text labelling).
- A batch size of 1. Keep the other settings the same as above.
If any of these is missing, DPO training will worsen the model.
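The recommended training settings from the sections above can be summed up in a short Python sketch. This is only a way to keep the numbers in one place, not something GPT-SoVITS reads: the function names & dictionary keys are made up for illustration, & "6GB" is interpreted here as 6 GB or less.

```python
DPO_MIN_VRAM_GB = 12  # the guide asks for 12G VRAM or more for DPO training


def pick_batch_size(vram_gb, use_dpo=False):
    """Batch size 2 by default; 1 on a 6 GB GPU or whenever DPO is enabled."""
    return 1 if (use_dpo or vram_gb <= 6) else 2


def sovits_settings(vram_gb):
    """Recommended SoVITS training settings from the guide."""
    return {
        "batch_size": pick_batch_size(vram_gb),
        "total_epochs": 8,
        "text_lr_weighting_max": 0.4,  # keep the weighting at or below this
        "save_frequency": 4,
    }


def gpt_settings(vram_gb, use_dpo=False):
    """Recommended GPT training settings from the guide, with optional DPO."""
    if use_dpo and vram_gb < DPO_MIN_VRAM_GB:
        raise ValueError("DPO training needs a GPU with 12 GB of VRAM or more")
    return {
        "batch_size": pick_batch_size(vram_gb, use_dpo),
        "total_epochs": 10,
        "save_frequency": 5,
        "dpo_training": use_dpo,
    }
```

For instance, gpt_settings(8) gives a batch size of 2 with DPO disabled, while gpt_settings(12, use_dpo=True) drops the batch size to 1. The dataset quality requirement for DPO can't be checked by code, so that part is still up to you.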
#
#
Inference
Go to the 1C-inference tab.
Press refreshing model paths & select your GPT & SoVITS models from their respective dropdowns.
Tick Open TTS inference WEBUI.
Upload a clip for reference audio (must be 3-10 seconds). Then fill in the Text for reference audio, which is what the character says in the audio. Choose the language on the right.
The reference audio is very important as it determines the speed & emotion of the output. Try different ones to polish your output.
You can reopen the text proofreading tool to download the reference audio, and copy & paste the text for reference audio.
Hover above the "duration" to adjust the length of the reference audio, & hover above "it" to delete the current reference audio.
A no-reference-text mode exists, but it's not advisable to use it. It affects the quality a lot.
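Since the reference clip must be 3-10 seconds long, you may want to check a candidate clip before uploading it. Here is a minimal sketch, assuming the clip is an uncompressed PCM .wav file; reference_clip_ok is a hypothetical helper, not a GPT-SoVITS function.

```python
import wave


def reference_clip_ok(path, min_seconds=3.0, max_seconds=10.0):
    """Return True if the WAV clip is within the 3-10 second range from the guide."""
    with wave.open(str(path), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return min_seconds <= duration <= max_seconds
```

If it returns False, trim the clip or pick another one before trying inference.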
Fill in the Inference text & set the Inference language, then click Start inference. If the text is too long, choose one of the options in How to slice the sentence.
If you did not get your desired output, you can infer it again or change reference audio and/or adjust GPT parameters.