Last update: Mar 8, 2024


# Introduction

See original guide

  • GPT-SoVITS is an open-source repository focused on TTS & cross-language inference, with a Colab port coming soon. Credits to RVC-Boss.

  • Currently it only supports Chinese, English & Japanese. More languages are coming soon.

  • You'll need a capable PC with an NVIDIA GPU that has at least 6 GB of VRAM to run it smoothly. Otherwise, use the Colab.

  • This guide is a translation of the original, with a few tweaks, made by Delik. [ Discord: @delik - Wechat: Dellikk ]
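As a rough pre-flight check for the VRAM requirement above, the sketch below asks `nvidia-smi` (installed with the NVIDIA driver) for each GPU's total memory and compares it with a threshold. The helper names and the constant are illustrative and not part of GPT-SoVITS, and reading "6G" as 6144 MiB is an assumption.

```python
import shutil
import subprocess

# Assumption: the guide's "6G" is read as 6 GiB = 6144 MiB.
MIN_VRAM_MIB = 6 * 1024


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer MiB value per line, skipping non-numeric lines."""
    values = []
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def meets_requirement(vram_mib: list[int], minimum: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU reports the minimum amount of memory."""
    return any(v >= minimum for v in vram_mib)


def query_vram_mib() -> list[int]:
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_vram_mib(result.stdout)


if __name__ == "__main__":
    gpus = query_vram_mib()
    if not gpus:
        print("No NVIDIA GPU detected; consider the Colab instead.")
    elif meets_requirement(gpus):
        print(f"OK: detected GPU memory (MiB): {gpus}")
    else:
        print(f"Below the recommended minimum: {gpus}")
```

If the script reports no GPU or too little memory, local training is likely to be slow or to fail, which is the case where the guide points to the Colab.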


# Installation


# 1. Download prezip

  • Download the prezip of the latest version here.


# 2. Extract

  • Unzip the archive. It's advisable to use 7-Zip to do so.


# 3. Launch

  • Open the folder & run go-web.bat to open WebUI.


# Training


# 1. Prepare dataset

  • The dataset should be between 1 and 30 minutes long, but prioritize quality over quantity.

  • For the best results, ensure the audio is properly cleaned & free of unwanted noise & distortion.

  • GPT-SoVITS is made for TTS only, so it's also best to remove any singing or muffled voice parts.

  • Learn how to clean datasets.
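To see whether a dataset falls inside the 1 to 30 minute range, a small sketch like the one below can total the length of the clips in a folder. It uses only Python's standard `wave` module, so it assumes the files are uncompressed PCM `.wav`; the function names are illustrative and not part of GPT-SoVITS.

```python
import wave
from pathlib import Path

# Range recommended in this guide, in minutes.
MIN_MINUTES = 1
MAX_MINUTES = 30


def wav_seconds(path: Path) -> float:
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_minutes(folder: Path) -> float:
    """Total duration of all .wav files directly inside the folder, in minutes."""
    if not folder.is_dir():
        return 0.0
    return sum(wav_seconds(p) for p in sorted(folder.glob("*.wav"))) / 60


def within_recommended_range(minutes: float) -> bool:
    """True if the total length sits inside the guide's 1 to 30 minute range."""
    return MIN_MINUTES <= minutes <= MAX_MINUTES
```

For example, `within_recommended_range(dataset_minutes(Path("my_dataset")))` checks a hypothetical `my_dataset` folder. Length is only part of the picture: a short, clean dataset is still preferable to a long, noisy one.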


# 2. Audio Slicer

  1. Copy the file path of your dataset & paste it into the Audio slicer input bar.

  2. Create a new folder somewhere. Copy its folder path & paste it into Audio slicer output. This is where the sliced clips will be stored.

  3. Adjust the parameters if needed.

  4. Finally, click Start Audio Slicer to complete this step.


# 3. ASR

  1. The Input folder path should be the same as the Audio slicer output. Just copy that path & paste it into the bar.

  2. If the dataset is in English/Japanese, use Faster-Whisper large v3.

    If it's in Chinese, use 达摩ASR (Damo ASR).

  3. Then click Start batch ASR.

    If you're running GPT-SoVITS for the first time, you may need to wait a few minutes while it downloads the ASR model you selected.

  4. Finally, locate the .list file & copy its path. It will be in output/asr_opt, unless you changed the Output folder path.
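If it helps, the last step can be done with a few lines of Python instead of browsing the folder. This is a minimal sketch that assumes the default `output/asr_opt` location mentioned above and that the `.list` file sits directly inside it; `find_list_files` is a hypothetical helper, not part of GPT-SoVITS.

```python
from pathlib import Path


def find_list_files(asr_output: Path) -> list[Path]:
    """Return the .list files directly inside the folder, newest first."""
    if not asr_output.is_dir():
        return []
    files = [p for p in asr_output.glob("*.list") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Default ASR output folder, relative to the GPT-SoVITS folder.
    for path in find_list_files(Path("output/asr_opt")):
        # Print the absolute path so it can be pasted into the WebUI.
        print(path.resolve())
```

Run it from inside the GPT-SoVITS folder; the first path printed is the most recently written `.list` file.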


# 4. Text Labelling (optional)

  1. Paste the .list file path into .list annotation file path.

  2. Tick Open labelling WebUI to open Text Labelling WebUI. A new tab will open.

  • Listen to each clip & edit the text if it's not transcribed properly.

  • The functions are self-explanatory. Use next index & previous index to check the next/previous page.

  • If you make changes, remember to save file & submit text.
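Before listening to every clip, it can help to flag the lines most likely to need attention. The sketch below assumes each line of the `.list` file has the form `audio_path|speaker|language|text`; that layout is an assumption here, so check it against your own file first. The names `Entry`, `parse_list_line` and `entries_needing_review` are illustrative.

```python
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    audio_path: str
    speaker: str
    language: str
    text: str


def parse_list_line(line: str) -> Entry:
    """Split one annotation line (assumed format: path|speaker|language|text)."""
    # maxsplit=3 keeps any "|" inside the transcript itself intact.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"Unexpected line format: {line!r}")
    return Entry(*parts)


def entries_needing_review(lines: Iterable[str], min_chars: int = 2) -> list[Entry]:
    """Return entries whose transcript is empty or suspiciously short."""
    flagged = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = parse_list_line(line)
        if len(entry.text.strip()) < min_chars:
            flagged.append(entry)
    return flagged
```

To use it, read the file with `Path(list_path).read_text(encoding="utf-8").splitlines()` and pass the lines in. It only catches empty or near-empty transcripts; wrong words still have to be found by ear in the labelling WebUI.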


# 5. Formatting

  1. Click 1-GPT-SOVITS-TTS & 1A-Dataset formatting to enter the training page.

  2. Enter the name of your model in Experiment/model name, & paste the .list file path into Text labelling file.

  3. Scroll down to the bottom & run One-click formatting.


# 6. SoVITS Training

  1. Scroll up, then click 1B-Fine-tuned training.

  • Recommended settings for SoVITS training:
    • Batch size: 2 (use 1 if the GPU has 6 GB of VRAM)
    • Total epochs
    • Text model learning rate weighting
    • Save frequency

  2. After that, click Start SoVITS training.


# 7. GPT Training

  • Recommended settings for GPT training:
    • Batch size: 2 (use 1 if the GPU has 6 GB of VRAM)
    • Total epochs
    • Save frequency
    • DPO training: disabled (explained below)

  • After that, click Start GPT training.


# DPO training (optional)

  • DPO training greatly improves the performance (not audio quality) & stability of the model.

  • It can infer more text at once without slicing & it's less prone to errors (like repeating/skipping words) when inferring.

  • For this, you'll require:
    • A GPU with 12 GB of VRAM or more.
    • A very high-quality dataset (you need to do the text labelling step).
    • A batch size of 1. Keep the other settings the same as above.

  • If these requirements aren't met, enabling DPO will make the model worse.
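The batch size and DPO rules from the training sections can be summed up in a small decision helper. This is only a sketch of how this guide's recommendations fit together: `plan_training` and `TrainingPlan` are hypothetical names, and treating "6 GB or less" as the cut-off for batch size 1 is one interpretation of the advice above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPlan:
    batch_size: int
    dpo_enabled: bool


def plan_training(
    vram_gb: float,
    want_dpo: bool = False,
    labels_proofread: bool = False,
) -> TrainingPlan:
    """Pick a batch size and DPO setting from this guide's recommendations."""
    # DPO needs 12 GB+ of VRAM and a carefully labelled dataset.
    dpo_ok = want_dpo and vram_gb >= 12 and labels_proofread
    if dpo_ok:
        # With DPO enabled, the guide recommends a batch size of 1.
        return TrainingPlan(batch_size=1, dpo_enabled=True)
    # Without DPO: batch size 2, or 1 on a 6 GB card (assumed: 6 GB or less).
    batch_size = 1 if vram_gb <= 6 else 2
    return TrainingPlan(batch_size=batch_size, dpo_enabled=False)
```

For instance, `plan_training(8, want_dpo=True, labels_proofread=True)` falls back to a batch size of 2 with DPO disabled, because 8 GB is below the 12 GB requirement.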


# Inference

  1. Go to the 1C-inference tab.

  2. Click refreshing model paths, then select your SoVITS & GPT models from their respective dropdowns.

  3. Tick Open TTS inference WEBUI.

  4. Upload a clip for the reference audio (it must be 3-10 seconds long). Then fill in the Text for reference audio, which is what the speaker says in the clip. Choose the language on the right.

    • The reference audio is very important as it determines the speed & emotion of the output. Try different ones to polish your output.

    • You can reopen the text proofreading tool to download the reference audio, and copy & paste the text for reference audio.

    • Hover over the "duration" to adjust the length of the reference audio, & hover over "it" to delete the current reference audio.

    • A no-reference-text mode exists, but using it isn't advisable because it noticeably lowers the output quality.

  5. Fill the Inference text & set the Inference language, then click Start inference.

    • If the text is too long, choose one of the options under How to slice the sentence.

    • If you don't get the output you want, you can run inference again, change the reference audio, and/or adjust the GPT parameters.
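Since the reference clip has to be 3-10 seconds long, a quick length check before uploading can save a failed attempt. The sketch below uses Python's standard `wave` module, so it assumes an uncompressed PCM `.wav` file; the names are illustrative and not part of GPT-SoVITS.

```python
import wave

# Limits stated in this guide for the reference audio, in seconds.
MIN_REF_SECONDS = 3.0
MAX_REF_SECONDS = 10.0


def clip_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference(path: str) -> bool:
    """True if the clip length is within the 3-10 second window."""
    return MIN_REF_SECONDS <= clip_seconds(path) <= MAX_REF_SECONDS
```

A clip that passes this check can still be a poor reference: the speed and emotion of the clip carry over to the output, so it is worth trying a few different ones.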


# You have reached the end.
