# GPT-SoVITS

Last update: Mar 8, 2024

# Introduction

GPT-SoVITS is an open-source repository focused on TTS & cross-language inference, with a Colab port coming soon. Credits to RVC-Boss.
It currently supports only Chinese, English & Japanese. More languages are coming soon.
You'll need a capable machine & an NVIDIA GPU with at least 6 GB of VRAM to run it smoothly. Otherwise, use the Colab.
This guide is a translation of the original one with a few tweaks, made by Delik (Discord: @delik, WeChat: Dellikk).
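
To check the 6 GB VRAM requirement before installing, a minimal sketch along these lines may help. It assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on your PATH. The function names and the 6000 MiB threshold are illustrative choices, not part of GPT-SoVITS.

```python
import subprocess

# ASSUMPTION: a "6 GB" card may report slightly under 6144 MiB, so allow a small margin.
MIN_VRAM_MIB = 6000


def parse_vram_mib(smi_output: str) -> list[int]:
    """Return one total-memory value (MiB) per GPU from nvidia-smi CSV output."""
    values = []
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():  # skip blank lines and placeholders such as "[N/A]"
            values.append(int(line))
    return values


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for each GPU's total memory; return [] if it cannot be run."""
    command = ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_vram_mib(result.stdout)


def meets_requirement(vram_mib: list[int]) -> bool:
    """True if at least one GPU reaches the minimum."""
    return any(value >= MIN_VRAM_MIB for value in vram_mib)


if __name__ == "__main__":
    gpus = query_vram_mib()
    if meets_requirement(gpus):
        print(f"OK: GPU memory found (MiB): {gpus}")
    else:
        print("No NVIDIA GPU with about 6 GB of VRAM detected; consider the Colab.")
```

If the script reports no suitable GPU, local training is likely to be slow or to fail.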

# Installation

# 1. Download prezip

Download the prezip of the latest version here.

# 2. Extract

Unzip the archive. 7-Zip is recommended for this.

# 3. Launch

Open the extracted folder & run go-webui.bat to launch the WebUI.

# Training

# 1. Prepare dataset

The dataset should be between 1 and 30 minutes long, but prioritize quality over quantity.
For the best results, ensure the audio is properly cleaned & free of unwanted noise & distortion.
GPT-SoVITS is made for TTS only, so it's also best to remove any singing or muffled parts.
Learn how to clean datasets.
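
To confirm that a dataset falls inside the 1 to 30 minute range, a small sketch using Python's standard `wave` module is shown here. It assumes the dataset is a folder of uncompressed PCM WAV files, since `wave` cannot read MP3 or FLAC. The function names are illustrative.

```python
import wave
from pathlib import Path


def wav_seconds(path: Path) -> float:
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_minutes(folder: Path) -> float:
    """Total duration, in minutes, of every .wav file directly inside the folder."""
    return sum(wav_seconds(clip) for clip in sorted(folder.glob("*.wav"))) / 60


def within_recommended_range(minutes: float, low: float = 1.0, high: float = 30.0) -> bool:
    """True if the total length sits in the 1-30 minute range given in this guide."""
    return low <= minutes <= high
```

For example, `within_recommended_range(dataset_minutes(Path("my_dataset")))` checks a hypothetical `my_dataset` folder. Length alone says nothing about quality, so clean audio still matters more.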

# 2. Audio Slicer

Copy the file path of your dataset & paste it into the Audio slicer input field.
Create a new folder anywhere, copy its path & paste it into Audio slicer output. The sliced clips will be stored there.
Adjust the parameters if needed.
Finally, click Start Audio Slicer to complete this step.

# 3. ASR

The Input folder path should be the same as the Audio slicer output. Just copy that path & paste it into the field.
If the dataset is in English/Japanese, use Faster-Whisper large v3.

If it's in Chinese, use 达摩ASR (Damo ASR).
Then click Start batch ASR.

The first time you run GPT-SoVITS, you may need to wait a few minutes while it downloads the ASR model you selected.
Finally, locate the .list file & copy its path. It will be in output/asr_opt unless you changed the Output folder path.

# 4. Text Labelling (optional)

Paste the .list file path into .list annotation file path.
Tick Open labelling WebUI to open Text Labelling WebUI. A new tab will open.

Listen to each clip & edit the text if it's not transcribed properly.
The functions are self-explanatory. Use next index & previous index to move to the next/previous page.
If you make changes, remember to click save file & submit text.
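
Before or after manual proofreading, a quick scan of the .list file can flag entries that need attention. The sketch below assumes each line has the form `audio_path|speaker|language|text`. That layout is not stated in this guide, so open your own .list file to confirm it first. The function name and messages are illustrative.

```python
# ASSUMPTION: each line looks like "audio_path|speaker|language|text".
EXPECTED_FIELDS = 4
WRONG_FIELD_COUNT = "expected 4 '|'-separated fields"
EMPTY_TEXT = "empty transcript"


def find_bad_lines(list_text: str) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for lines that look malformed."""
    problems = []
    for number, line in enumerate(list_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        # Split at most three times so a '|' inside the transcript stays in the text field.
        fields = line.split("|", EXPECTED_FIELDS - 1)
        if len(fields) != EXPECTED_FIELDS:
            problems.append((number, WRONG_FIELD_COUNT))
        elif not fields[3].strip():
            problems.append((number, EMPTY_TEXT))
    return problems
```

To use it, read the file with `Path(...).read_text(encoding="utf-8")` and pass the result in. It only catches structural problems. Wrong words in a transcript still have to be found by listening.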

# 5. Formatting

Click 1-GPT-SOVITS-TTS, then 1A-Dataset formatting, to enter the training page.
Enter the name of your model in Experiment/model name & the .list file path in Text labelling file.
Scroll to the bottom & click One-click formatting to begin formatting.

# 6. SoVITS Training

Scroll up, then click 1B-Fine-tuned training.

# Recommended settings for SoVITS training:

- Batch size: 2 (use 1 if the GPU has 6 GB of VRAM)
- Total epochs: 8
- Text model learning rate weighting: <=0.4
- Save frequency: 4

After that, click Start SoVITS training.

# 7. GPT Training

# Recommended settings for GPT training:

- Batch size: 2 (use 1 if the GPU has 6 GB of VRAM)
- Total epochs: 10
- Save frequency: 5
- DPO training: disabled (explained below)

After that, click Start GPT training.

You can't run SoVITS & GPT training simultaneously unless you have two or more GPUs.

# DPO training (optional)

DPO training greatly improves the model's performance (not its audio quality) & stability.
A model trained with it can infer more text at once without slicing & is less prone to errors (like repeating/skipping words) during inference.
# For this, you'll require:
- A GPU with 12 GB of VRAM or more.
- A very high-quality dataset with proofread text labels.
- A batch size of 1. Keep the other settings the same as above.

If these requirements aren't met, enabling DPO will worsen the model.
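
The recommended settings for both training stages, including the DPO constraints, can be summarized in one helper. This is only a sketch of the guide's recommendations. The dictionary keys are made-up labels, not GPT-SoVITS configuration names, and the values are still entered by hand in the WebUI.

```python
def recommended_settings(vram_gb: float, use_dpo: bool = False) -> dict:
    """Return this guide's suggested values for SoVITS and GPT training.

    Key names are hypothetical labels for the WebUI fields.
    """
    if vram_gb < 6:
        raise ValueError("The guide asks for at least 6 GB of VRAM.")
    if use_dpo and vram_gb < 12:
        raise ValueError("DPO training needs 12 GB of VRAM or more.")

    # Batch size 2 by default, 1 on a 6 GB card.
    batch_size = 1 if vram_gb <= 6 else 2

    sovits = {
        "batch_size": batch_size,
        "total_epochs": 8,
        "text_lr_weighting": 0.4,  # upper bound; the guide says <=0.4
        "save_frequency": 4,
    }
    gpt = {
        "batch_size": 1 if use_dpo else batch_size,  # DPO requires batch size 1
        "total_epochs": 10,
        "save_frequency": 5,
        "dpo": use_dpo,
    }
    return {"sovits": sovits, "gpt": gpt}
```

For example, `recommended_settings(8)` gives a batch size of 2 for both stages, while `recommended_settings(12, use_dpo=True)` drops the GPT batch size to 1.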

# Inference

Go to the 1C-inference tab.
Click refreshing model paths & select your GPT & SoVITS models from their respective dropdowns.
Tick Open TTS inference WEBUI.
Upload a clip as the reference audio (it must be 3-10 seconds long). Then fill in Text for reference audio with what the speaker says in the clip, & choose its language on the right.
- The reference audio is very important, as it determines the speed & emotion of the output. Try different ones to polish your output.
- You can reopen the text labelling (proofreading) tool to download a clip as reference audio & copy its text into Text for reference audio.
- Hover over the "duration" to adjust the length of the reference audio, & hover over "it" to delete the current reference audio.
- A no-reference-text mode exists, but using it isn't advisable because it reduces quality considerably.
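
To shortlist clips that satisfy the 3-10 second limit, you could scan the Audio slicer output folder. The sketch below assumes the sliced clips are PCM WAV files readable by Python's standard `wave` module. The function names are illustrative.

```python
import wave
from pathlib import Path

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # limits stated in this guide


def clip_seconds(path: Path) -> float:
    """Duration of one PCM WAV clip in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_candidates(folder: Path) -> list[str]:
    """Names of WAV clips whose length fits the reference-audio limits."""
    return [
        clip.name
        for clip in sorted(folder.glob("*.wav"))
        if MIN_SECONDS <= clip_seconds(clip) <= MAX_SECONDS
    ]
```

The result is only a list of clips with a valid length. Which one gives the best speed & emotion still has to be judged by ear.
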
Fill in the Inference text & set the Inference language, then click Start inference.
- If the text is too long, choose one of the options under How to slice the sentence.
- If you don't get the desired output, you can run inference again, change the reference audio and/or adjust the GPT parameters.
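
To illustrate what slicing long text means, here is a rough sketch that splits text at sentence-ending punctuation & groups the sentences into chunks of limited length. It is not the WebUI's actual slicing code. The function names, the 50-character default & the punctuation set are all assumptions made for the example.

```python
import re

# A run of non-terminal characters followed by any sentence-ending punctuation.
# Covers ASCII and full-width (Chinese/Japanese) terminators.
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def chunk_text(text: str, max_chars: int = 50) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be inferred separately. This naive splitter also breaks on decimal points such as "3.14" & joins sentences with a space, which is unnecessary for Chinese or Japanese text.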
