Qwen3-Omni is a natively end-to-end, omni-modal LLM
State-of-the-art TTS model under 25 MB
On-device Speech Recognition for Apple Silicon
An open-source text-to-speech system built by inverting Whisper
Towards Human-Sounding Speech
Uses Qwen3-ASR, a local LLM, Whisper, and TEN-VAD
Python library and CLI tool to interface with Google Translate
Synchronized Translation for Videos
kaldi-asr/kaldi is the official location of the Kaldi project
Foundational model for human-like, expressive TTS
Capable of understanding text, audio, vision, and video
Framework for building real-time multimodal voice AI agent apps
GLM-4-Voice | End-to-End Chinese-English Conversational Model
A lightweight text-to-speech model with zero-shot voice cloning
MARS5 speech model (TTS) from CAMB.AI
Free, high-quality text-to-speech API endpoint to replace OpenAI's
A speech-text foundation model for real time dialogue
Voice Recognition to Text Tool
Framework for building real-time voice and multimodal AI agents
Open source text-to-speech tool, supports extra-long text
Cross-platform software for text translation and recognition
SOTA discrete acoustic codec models with 40/75 tokens per second
A nearly-live implementation of OpenAI's Whisper
The free, Open Source alternative to OpenAI, Claude and others
AI teacher that lives as a buddy next to your cursor