Kokoro-82M is a new text-to-speech (TTS) model that delivers exceptional performance with only 82 million parameters.

This article explores its technical implementation and performance characteristics, and provides sample voice generations so you can judge the output quality for yourself.

💪 TL;DR: The model achieves state-of-the-art performance with only 82M parameters, outperforming models up to 14x larger.

Important links:

- Model: https://huggingface.co/hexgrad/Kokoro-82M
- Online demo: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

Architecture overview

Kokoro-82M is built on two key components:

- StyleTTS 2, which provides the core architecture
- ISTFTNet, which serves as the vocoder

The model uses a decoder-only architecture, eliminating the need for diffusion or a separate encoder. This choice keeps the parameter count minimal while maintaining high output quality, as you can see in the examples section.

Performance metrics

The model achieved first place in the TTS Spaces Arena benchmark, outperforming significantly larger models, as the leaderboard screenshot below shows:

Kokoro 82M leaderboard


XTTS v2, with 467M parameters and training on over 10,000 hours of audio, places second. MetaVoice, at 1.2B parameters and 100,000 hours of training data, ranks lower. Even Fish Speech, trained on roughly a million hours of audio with approximately 500M parameters, doesn’t match Kokoro-82M’s performance.

Training

The training process itself reveals interesting efficiency characteristics. On a single A100 80GB GPU, training completed in approximately 500 GPU hours at $0.80 per hour, totaling around $400. The model reached optimal performance in under 20 epochs on a curated dataset of less than 100 hours of audio.
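
As a quick sanity check, those figures are consistent:

# Back-of-the-envelope training cost from the numbers above
gpu_hours = 500       # approximate total A100 80GB hours
usd_per_hour = 0.80   # quoted rental rate
print(f"Estimated cost: ${gpu_hours * usd_per_hour:.0f}")  # Estimated cost: $400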

Usage

You can run the following code in Google Colab or Lightning.AI to try the model:


# 1️⃣ Install dependencies silently
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch

# 2️⃣ Build the model and load the default voicepack
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][0]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

# 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
from kokoro import generate
text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb

# 4️⃣ Display the 24 kHz audio and print the output phonemes
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)
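
If you want to keep the result rather than only play it inline, the scipy package installed above can write the waveform to a WAV file. A minimal sketch; the kokoro_sample.wav filename is just an example:

# Optional: write the 24 kHz waveform returned by generate() to disk
from scipy.io import wavfile
import numpy as np

wavfile.write('kokoro_sample.wav', 24000, np.asarray(audio, dtype=np.float32))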

Or if you prefer, you can use the online demo: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

Kokoro TTS demo


And here is the produced sample, so you can hear the speech generated from the text shown in the screenshot above:

Example generation and voices

To generate the samples, I took this article: ModernBERT, extracted its content as Markdown with Monkt, removed some header and footer elements, and ran the text through the Kokoro TTS online demo (Long form 0.19).



The quality for an open-source model is pretty good, not to mention the generation speed for long-form content.
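
The demo’s long-form mode handles the chunking for you. If you run the model locally instead, one simple way to approximate long-form generation is to split the text into sentences, synthesize each chunk, and concatenate the audio. The sketch below assumes generate returns a NumPy array, as in the usage code above; the naive period split is a placeholder (it will mishandle abbreviations, for example), not the demo’s actual implementation:

# Naive long-form generation: synthesize sentence-sized chunks and join them
import numpy as np
from kokoro import generate  # MODEL and VOICEPACK come from the setup above

def generate_long(model, text, voicepack, lang='a'):
    chunks = [s.strip() for s in text.split('.') if s.strip()]
    parts = []
    for chunk in chunks:
        audio, _ = generate(model, chunk + '.', voicepack, lang=lang)
        parts.append(audio)
    return np.concatenate(parts)

# long_audio = generate_long(MODEL, long_text, VOICEPACK)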

Voice mixing

The voice that achieved top ranking in the TTS Spaces Arena is actually a combined voice profile created by mixing two base voices – Bella and Sarah – in equal proportions.

While this mix is conveniently provided as af.pt in the repository, it can be reproduced programmatically with PyTorch’s tensor operations. The mixing process involves loading both voice tensors with the weights_only=True flag, stacking them, and computing their mean. This straightforward averaging effectively combines the vocal characteristics of both voices, as verified by comparing the computed tensor against the provided af.pt using torch.equal(). See the code snippet below:


import torch

# Load the two base voicepacks as tensors
bella = torch.load('voices/af_bella.pt', weights_only=True)
sarah = torch.load('voices/af_sarah.pt', weights_only=True)

# Average them element-wise to recreate the default 'af' voice
af = torch.mean(torch.stack([bella, sarah]), dim=0)

# Verify the result matches the voicepack shipped in the repo
assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
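
Nothing limits you to a 50-50 blend, either. Since voicepacks are plain tensors, any weighted combination works mechanically; whether an uneven mix sounds pleasant is something you would have to judge by ear. A hypothetical 70/30 blend:

# Hypothetical weighted mix; the 0.7/0.3 split is arbitrary, not from the repo
af_custom = 0.7 * bella + 0.3 * sarah
torch.save(af_custom, 'voices/af_custom.pt')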

Limitations

The model’s limitations come from both architectural decisions and training data constraints. It lacks voice cloning capabilities because its training dataset is under 100 hours of audio, and it depends on espeak-ng for grapheme-to-phoneme conversion, which introduces a potential failure point in the text processing pipeline.
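
You can inspect that stage directly: the phonemizer package installed in the usage section drives espeak-ng, so the snippet below shows roughly what the grapheme-to-phoneme step produces (the exact transcription depends on your espeak-ng version):

# Grapheme-to-phoneme conversion through espeak-ng via phonemizer
from phonemizer import phonemize

print(phonemize("How could I know?", language='en-us', backend='espeak'))
# prints an IPA string such as 'haʊ kʊd aɪ noʊ '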

The training data’s focus on long-form content results in better performance on narrative text than conversational speech. While the 82M parameter count enables efficient deployment, it may not match the capabilities of billion-parameter diffusion transformers or large language models with TTS capabilities.

Though the architecture could theoretically support multiple languages, the current English-centric training data restricts its multilingual applications.

Future directions

The project maintains active development through its GitHub repository and Hugging Face model hub presence. Current development focuses on improving multilingual support and reducing external dependencies while maintaining the model’s efficient parameter utilization.

Kokoro-82M demonstrates that efficient architecture design and focused training strategies can outperform larger models while maintaining minimal resource requirements. Its success challenges conventional scaling assumptions in text-to-speech technology and opens new possibilities for efficient, production-ready TTS systems.

My final conclusion is that Kokoro is great for many reasons: fast generation, good quality, long-form content support, and, yes, it is open source under the Apache 2.0 license.

Last Update: 20/01/2025