Speech & Voice AI (TTS / STT)
Open models for text-to-speech (TTS), voice cloning, and speech-to-text (STT) transcription, ranging from tiny CPU-friendly models to expressive voice cloners.
Alternatives (7)
★ Kokoro-82M
Best for: Lightweight, high-quality TTS
A tiny 82M-parameter open TTS model whose quality is well above what its size suggests; a good default for lightweight speech.
- +Very small & fast
- +Surprisingly good quality
- +Permissive license
- −Fewer voices than big models
Pocket TTS
Best for: On-device / CPU TTS
Kyutai's ~100M-parameter TTS model, which runs several times faster than real time on a CPU and is small enough for on-device use.
- +Runs on CPU
- +Faster than real-time
- +Voice cloning
- −Limited by its small model size
F5-TTS
Best for: Voice cloning
Fast, high-quality open TTS with strong zero-shot voice cloning.
- +Great cloning
- +Fast inference
- −Model weights are licensed differently from the code
VibeVoice
Best for: Long-form, multi-speaker
Microsoft's open frontier TTS model for long-form, multi-speaker audio such as podcasts and dialogue.
- +Long-form audio
- +Multiple speakers
- +Microsoft-backed
- −Heavier model
Zonos
Best for: Expressive cloning
Zyphra's expressive open TTS model, trained on 200k+ hours of speech, with high-fidelity voice cloning from a 5-second sample and emotion control.
- +Very expressive
- +Emotion control
- +5s cloning
- −Heavier to run than the small models
Higgs Audio
Best for: Multilingual TTS
Boson AI's text-audio foundation model with strong multilingual TTS (100+ languages).
- +100+ languages
- +Foundation model
- −Research / non-commercial license
whisper.cpp
Best for: On-device transcription
Fast, portable C/C++ port of OpenAI Whisper for on-device speech-to-text (transcription).
- +Runs anywhere
- +No GPU needed
- +Many languages
- −STT only, not TTS
Compare
| Alternative | Task | Best for | License |
|---|---|---|---|
| ★ Kokoro-82M | TTS | Lightweight TTS | Apache-2.0 |
| Pocket TTS | TTS | CPU / on-device | Open |
| F5-TTS | TTS + cloning | Cloning a voice | Open |
| VibeVoice | TTS (long-form) | Podcasts / dialogue | MIT |
| Zonos | TTS + cloning | Expressive voices | Apache-2.0 |
| Higgs Audio | TTS (foundation) | Multilingual speech | Research / NC |
| whisper.cpp | STT (transcribe) | Local transcription | MIT |
Building voice features? For text-to-speech, Kokoro and Pocket TTS are small enough to run locally, F5-TTS and Zonos excel at voice cloning, VibeVoice handles long multi-speaker audio, and Higgs Audio covers many languages. For speech-to-text, whisper.cpp transcribes on-device. Compare by task, strength, and license, and watch for restrictive terms: Higgs Audio is research / non-commercial, and F5-TTS's model license differs from its code license.
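The advice to compare by task and license can be sketched in code. This is a minimal Python illustration and not part of any of the listed projects: the `Model` record and the `is_non_commercial` and `shortlist` helpers are hypothetical names. The rows copy the comparison table as given, including its vague "Open" label, so check each project's actual license terms before relying on the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """One row of the comparison table (hypothetical record type)."""
    name: str
    task: str
    best_for: str
    license: str


# Rows copied from the comparison table; license labels are the page's own.
MODELS = [
    Model("Kokoro-82M", "TTS", "Lightweight TTS", "Apache-2.0"),
    Model("Pocket TTS", "TTS", "CPU / on-device", "Open"),
    Model("F5-TTS", "TTS + cloning", "Cloning a voice", "Open"),
    Model("VibeVoice", "TTS (long-form)", "Podcasts / dialogue", "MIT"),
    Model("Zonos", "TTS + cloning", "Expressive voices", "Apache-2.0"),
    Model("Higgs Audio", "TTS (foundation)", "Multilingual speech", "Research / NC"),
    Model("whisper.cpp", "STT (transcribe)", "Local transcription", "MIT"),
]


def is_non_commercial(model: Model) -> bool:
    """Flag rows whose license label mentions NC / non-commercial terms."""
    label = model.license.lower()
    return "nc" in label.split() or "non-commercial" in label


def shortlist(models, task_keyword, allow_non_commercial=False):
    """Return names of models whose task label contains task_keyword.

    Matching is a plain case-insensitive substring test on the table's
    task column; non-commercial rows are dropped unless explicitly allowed.
    """
    keyword = task_keyword.lower()
    return [
        m.name
        for m in models
        if keyword in m.task.lower()
        and (allow_non_commercial or not is_non_commercial(m))
    ]


if __name__ == "__main__":
    print(shortlist(MODELS, "cloning"))  # ['F5-TTS', 'Zonos']
    print(shortlist(MODELS, "STT"))      # ['whisper.cpp']
```

The filter only sees the table's labels. It will not catch cases such as F5-TTS, where the model license differs from the code license, so treat the shortlist as a starting point and not as a license check.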