⚡Ultra Compact Text-to-Speech: A Quantized F5TTS


How small can an F5-TTS model get? Very small!

Now you can generate synthetic voices with AI at pretty high quality on very constrained hardware. To run the quantized version of F5-TTS, you only need around 400 MB of VRAM. (Spoiler: Mac only, for now.)

Yes, that's all it takes to do voice generation (and voice cloning) on a MacBook in under 30 seconds. The entire model pipeline includes the following weights:

  • F5-TTS weight (223 MB)
  • Duration predictor (86.2 MB)
  • Vocos (53.4 MB)

Total: ~363 MB
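
As a quick sanity check on that total, the component sizes listed above add up as follows (plain arithmetic, nothing model-specific):

```python
# Component sizes in MB, taken from the list above.
components_mb = {"F5-TTS weights": 223.0, "Duration predictor": 86.2, "Vocos": 53.4}
print(sum(components_mb.values()))  # 362.6 -> roughly 363 MB
```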

How to install Quantized F5-TTS

I have quantized the model and made an all-in-one package for you to just directly install from pip:

pip install f5-tts-mlx-quantized

GitHub

This is a fork of the original f5-tts-mlx in which I refactored the model loading pipeline to load the quantized version of F5-TTS directly instead of the full-precision one.
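
Generation should work the same way as in the upstream f5-tts-mlx package. Assuming the fork keeps the upstream CLI entry point and its --text flag (the sentence below is just a placeholder), a quick test looks like:

python -m f5_tts_mlx.generate --text "Hello from a quantized F5-TTS on a MacBook."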

How much VRAM has been saved?

Here is a table comparing the model sizes before and after quantization:

| Component          | Original Size | Quantized Size |
|--------------------|---------------|----------------|
| F5-TTS Weight      | 1.3 GB        | 223 MB (4-bit) |
| Duration Predictor | 86.2 MB       | 86.2 MB        |
| Vocos              | 53.4 MB       | 53.4 MB        |
| Total Size         | 1.4396 GB     | 362.6 MB       |

How was the quantization done?

You can find the details of how the quantization was done in the code. To save you the effort, I'll quote the important part here.

```python
# cfm.py
...
    @classmethod
    def from_pretrained(
        cls, hf_model_name_or_path: str, convert_weights=False, bit=None
    ) -> F5TTS:
        # Infer the quantization level from the checkpoint name if not given explicitly
        if bit is None:
            if "8bit" in hf_model_name_or_path:
                print("Loading model with 8bit quantization")
                bit = 8
            elif "4bit" in hf_model_name_or_path:
                print("Loading model with 4bit quantization")
                bit = 4
        path = fetch_from_hub(hf_model_name_or_path)

...
        # Quantize only Linear layers whose input dimension is a multiple of 64
        if bit is not None:
            nn.quantize(
                f5tts,
                bits=bit,
                class_predicate=lambda p, m: isinstance(m, nn.Linear)
                and m.weight.shape[1] % 64 == 0,
            )
...
```

One part of the quantization may look strange: `class_predicate=lambda p, m: isinstance(m, nn.Linear) and m.weight.shape[1] % 64 == 0`. The predicate filters out layers whose input dimension is not divisible by 64, which would violate the group size MLX needs for quantization; the original F5-TTS contains a layer like that, so it has to be left unquantized.
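
To see what the predicate does in isolation, here is a tiny sketch with made-up layer names and sizes. MLX's nn.Linear stores its weight as (output_dims, input_dims), so shape[1] is the input dimension being checked:

```python
import mlx.nn as nn

# Hypothetical layers for illustration: one input dim divisible by 64, one not.
layers = {
    "proj": nn.Linear(512, 1024),  # weight shape (1024, 512) -> 512 % 64 == 0
    "odd": nn.Linear(100, 256),    # weight shape (256, 100)  -> 100 % 64 != 0
}

predicate = lambda path, module: (
    isinstance(module, nn.Linear) and module.weight.shape[1] % 64 == 0
)

for name, layer in layers.items():
    print(name, "quantize?", predicate(name, layer))
# proj quantize? True
# odd quantize? False
```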

So the steps are (sketched in code below):

  1. Load the model.
  2. Build the filter condition for the group size.
  3. Quantize the model.
  4. Extract the quantized model weights for later use.
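
Here is a minimal end-to-end sketch of those four steps. It is not the exact repo code: the checkpoint name and output filename are placeholders, and it assumes MLX's nn.quantize and mx.save_safetensors as the quantization and export primitives.

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_flatten

from f5_tts_mlx.cfm import F5TTS

# 1. Load the full-precision model (checkpoint name is a placeholder).
f5tts = F5TTS.from_pretrained("lucasnewman/f5-tts-mlx")

# 2. Filter condition: only Linear layers whose input dim fits the 64-wide groups.
predicate = lambda path, module: (
    isinstance(module, nn.Linear) and module.weight.shape[1] % 64 == 0
)

# 3. Quantize the eligible layers in place to 4 bits.
nn.quantize(f5tts, bits=4, class_predicate=predicate)

# 4. Extract the quantized weights and save them for later use.
weights = dict(tree_flatten(f5tts.parameters()))
mx.save_safetensors("f5-tts-4bit.safetensors", weights)
```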

What can I do with text-to-speech?

A few things:

  • Clone your own voice?
  • Make an AI podcast? (like this VoiceOver)
  • Make a voice chatbot (like our paper 🍓Ichigo?)

There is an example of using the voiceover repo in my Substack post to demonstrate the model's ability to generate podcasts/voiceovers. The voice is pretty natural and does not have the "robotic" feel of other models at the same size.

Best part: you can do it on any MacBook, totally free. No need for expensive online text-to-speech services anymore.

Appreciation

  • Lucas Newman for the original implementation of F5 TTS on MLX.
  • Yushen Chen for the original PyTorch implementation of F5 TTS and the pretrained model.
  • Phil Wang for the E2 TTS implementation that this model is based on.