Showing 709 open source projects for "speech"

View related business solutions
  • Earn up to 15% annual interest with Nexo. Icon
    Earn up to 15% annual interest with Nexo.

    Let your crypto work for you

    Put idle assets to work with competitive interest rates, borrow without selling, and trade with precision. All in one platform. Geographic restrictions, eligibility, and terms apply.
    Get started with Nexo.
  • Earn up to 15% annual interest with Nexo. Icon
    Earn up to 15% annual interest with Nexo.

    Access competitive interest rates on your digital assets.

    Generate interest, borrow against your crypto, and trade a range of cryptocurrencies — all in one platform. Geographic restrictions, eligibility, and terms apply.
    Get started with Nexo.
  • 1
    GPT-SoVITS

    GPT-SoVITS

    1 min voice data can also be used to train a good TTS model

    GPT‑SoVITS is a state-of-the-art voice conversion and TTS system that enables zero‑shot and few‑shot synthesis based on a short vocal sample (e.g., 5 seconds). It supports cross‑lingual speech synthesis across English, Chinese, Japanese, Korean, Cantonese, and more. It's powered by VITS architecture enhanced for few‑sample adaptation and real‑time usability.
    Downloads: 65 This Week
    Last Update:
    See Project
  • 2
    BrowserAI

    BrowserAI

    Run local LLMs like llama, deepseek, kokoro etc. inside your browser

    ...The platform provides a developer-friendly SDK with pre-configured popular models, and it allows for seamless switching between MLC and Transformer engines. Additionally, it supports features such as speech recognition, text-to-speech, structured output generation, and Web Worker support for non-blocking UI performance.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 3
    OpenVINO

    OpenVINO

    OpenVINO™ Toolkit repository

    OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. Boost deep learning performance in computer vision, automatic speech recognition, natural language processing and other common tasks. Use models trained with popular frameworks like TensorFlow, PyTorch and more. Reduce resource demands and efficiently deploy on a range of Intel® platforms from edge to cloud. This open-source version includes several components: namely Model Optimizer, OpenVINO™ Runtime, Post-Training Optimization Tool, as well as CPU, GPU, MYRIAD, multi device and heterogeneous plugins to accelerate deep learning inferencing on Intel® CPUs and Intel® Processor Graphics. ...
    Downloads: 34 This Week
    Last Update:
    See Project
  • 4
    Qwen3-ASR

    Qwen3-ASR

    Qwen3-ASR is an open-source series of ASR models

    ...The architecture combines advanced neural acoustic modeling with context-aware language prediction so that outputs maintain both fidelity to the original speech and grammatical coherence. This makes Qwen3-ASR suitable for voice-driven applications like AI assistants, dictation tools, speech analytics pipelines, and accessibility features, where accurate and fluid transcription is critical.
    Downloads: 1 This Week
    Last Update:
    See Project
  • Fully managed relational database service for MySQL, PostgreSQL, and SQL Server Icon
    Fully managed relational database service for MySQL, PostgreSQL, and SQL Server

    Focus on your application, and leave the database to us

    Cloud SQL manages your databases so you don't have to, so your business can run without disruption. It automates all your backups, replication, patches, encryption, and storage capacity increases to give your applications the reliability, scalability, and security they need.
    Try for free
  • 5
    GLM-TTS

    GLM-TTS

    Controllable & emotion-expressive zero-shot TTS

    GLM-TTS is an advanced text-to-speech synthesis system built on large language model technologies that focuses on producing high-quality, expressive, and controllable spoken output, including features like emotion modulation and zero-shot voice cloning. It uses a two-stage architecture where a generative LLM first converts text into intermediate speech token sequences and then a Flow-based neural model converts those tokens into natural audio waveforms, enabling rich prosody and voice character even for unseen speakers. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 6
    ElevenLabs Python

    ElevenLabs Python

    The official Python SDK for the ElevenLabs API

    elevenlabs-python is the official Python SDK for the ElevenLabs API, giving developers a convenient way to access ElevenLabs’ high-quality, lifelike voices. The library wraps the HTTP API into a typed Python client, so you can perform text-to-speech, streaming, voice cloning, voice management, and agents-related operations with simple method calls. It exposes ElevenLabs’ main models such as Eleven Multilingual v2, Eleven Flash v2.5, and Eleven Turbo v2.5, each targeting different trade-offs between latency, cost, and quality. The SDK is designed for quick setup: after installing the package and setting an API key, you can generate speech in multiple languages and play or process the resulting audio bytes. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 7
    VideoChat

    VideoChat

    Real-time voice interactive digital human

    VideoChat is a real-time voice-interactive “digital human” system that combines automatic speech recognition, large language models, text-to-speech, and talking-head generation into a single conversational pipeline. It supports both pure end-to-end voice solutions based on multimodal large language models (GLM-4-Voice feeding directly into talking-head generation) and a more traditional cascaded pipeline using ASR → LLM → TTS → talking head.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 8
    KrillinAI

    KrillinAI

    Video translation and dubbing tool powered by LLMs

    KrillinAI is an end-to-end content localization, translation, and dubbing tool aimed at helping creators transform videos into multiple languages with minimal manual effort. It integrates several stages of the pipeline: video acquisition (either from local files or remote via download tools), speech recognition (ASR), subtitle segmentation and alignment, machine translation (with context-aware translation to preserve semantics), and voice cloning + text-to-speech (TTS) to produce dubbed audio tracks. KrillinAI supports both landscape and portrait videos, which makes it suitable for a wide range of platforms — from YouTube to TikTok or other vertical-video sites — and ensures correct formatting and layout for the final video. ...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 9
    RealtimeTTS

    RealtimeTTS

    Converts text to speech in realtime

    RealtimeTTS is a low-latency text-to-speech library built for real-time applications such as voice chat with LLMs, assistants, and interactive tools. It is designed around a streaming model: you can feed it text incrementally (for example, as an LLM responds) and get audio output almost immediately, which keeps end-to-end latency very low. The library is engine-agnostic and plugs into a wide range of cloud and local TTS systems, including OpenAI, ElevenLabs, Azure, Coqui, Piper, StyleTTS2, Edge TTS, Google TTS, system TTS and others, so you can swap providers without rewriting your pipeline. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • Power through agendas and documents, make more informed decisions and conduct board meetings faster. Icon
    Power through agendas and documents, make more informed decisions and conduct board meetings faster.

    For team managers searching for a solution to manage their meetings

    iBabs not only captures the entire decision-making process – it takes all the paperwork out of meetings. iBabs empowers everyone who has ever organized or attended, a meeting. With a seemingly simple app that offers complete control and a comprehensive overview of all those fiddly details. With about 3000 organizations and over 300,000 users, iBabs gives you peace of mind. So you can quickly organize effective meetings, and good decisions can be made with confidence. iBabs didn’t just happen overnight. We started analyzing and simplifying board meeting processes many years ago. We understand all the work that goes into meetings, and how to streamline everything so it all flows smoothly. On any device, confidentially, securely and automatically. Make good decisions with confidence.
    Learn More
  • 10
    ebook2audiobook

    ebook2audiobook

    Generate audiobooks from e-books, voice cloning & 1107+ languages

    ebook2audiobook is a tool to convert legally obtained eBooks (non-DRM) into fully narrated audiobooks, complete with chapters and metadata. It automates the pipeline: it reads the eBook file, splits it into appropriate segments (chapters, paragraphs), uses text-to-speech (TTS) models to synthesize audio, optionally applies voice cloning, and outputs a final audiobook — ideal for people who prefer listening over reading, or for accessibility purposes. The tool supports a wide array of underlying TTS backends (XTTSv2, Bark, VITS, Fairseq, Tacotron2, YourTTS and more), which gives flexibility depending on hardware availability, voice preference, and language. ...
    Downloads: 30 This Week
    Last Update:
    See Project
  • 11
    AI Runner

    AI Runner

    Offline inference engine for art, real-time voice conversations

    AI Runner is an offline inference engine designed to run a collection of AI workloads on your own machine, including image generation for art, real-time voice conversations, LLM-powered chatbots and automated workflows. It is implemented as a desktop-oriented Python application and emphasizes privacy and self-hosting, allowing users to work with text-to-speech, speech-to-text, text-to-image and multimodal models without sending data to external services. At the core of its LLM stack is a mode-based architecture with specialized “modes” such as Author, Code, Research, QA and General, and a workflow manager that automatically routes user requests to the right agent based on the task. The project has a strong focus on developer ergonomics, with thorough development guidelines, environment configuration using .env variables, and a clear structure for tests, tools and agents.
    Downloads: 13 This Week
    Last Update:
    See Project
  • 12
    MegaTTS 3

    MegaTTS 3

    Official PyTorch Implementation

    MegaTTS3 is an open-source text-to-speech (TTS) and voice-cloning system from ByteDance that aims to deliver high-quality, expressive speech synthesis, including zero-shot voice cloning of previously unseen speakers. Its backbone is a lightweight diffusion-transformer (on the order of ~0.45 B parameters), which enables efficient inference while still producing high-fidelity audio.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    Style-Bert-VITS2

    Style-Bert-VITS2

    Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles

    Style-Bert-VITS2 is a text-to-speech system based on Bert-VITS2 that focuses on highly controllable voice styles and emotional expression. It takes the original Bert-VITS2 v2.1 and its Japanese-Extra variant and extends them so you can control emotion and speaking style with fine-grained intensity, not just choose a generic tone. The project targets both power users and beginners: Windows users without Git or Python can install and run it using bundled .bat scripts, while advanced users can work with virtual environments, uv, and Python tooling. ...
    Downloads: 13 This Week
    Last Update:
    See Project
  • 14
    Speakr

    Speakr

    Speakr is a personal, self-hosted web application

    Speakr is an open-source, real-time text-to-speech (TTS) web application that allows users to convert written text into natural-sounding speech in just a few clicks. It provides a clean, user-friendly interface where users can input text, choose a voice style or language, and immediately hear the output, making it ideal for accessibility, content creation, and learning applications.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Qwen2-Audio

    Qwen2-Audio

    Repo of Qwen2-Audio chat & pretrained large audio language model

    ...Code & examples provided with Hugging Face transformers, and usage via AutoProcessor, model classes etc. High performance on many standard benchmarks: ASR, speech-emotion recognition, vocal sound classification, speech translation etc.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    FastRTC

    FastRTC

    The python library for real-time communication

    ...FastRTC also integrates nicely with UI frameworks (e.g. via a web demo using Gradio), so developers can rapidly prototype and deploy real-time streaming applications without deep knowledge of low-level WebRTC internals. Because voice-enabled AI agents often involve many moving parts (speech-to-text, text processing, text-to-speech, streaming, session/chat management), FastRTC helps by handling the streaming aspect, leaving the rest to be plugged in modularly.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 17
    Open-LLM-VTuber

    Open-LLM-VTuber

    Open source AI VTuber platform with voice chat and Live2D avatars

    Open-LLM-VTuber is an open source platform designed to create AI-powered VTuber characters that can interact with users through voice and animated avatars. It enables hands-free conversations with large language models by combining speech recognition, language processing, and text-to-speech synthesis into a single system. Users can speak directly to the AI character, and the system can respond with a generated voice while animating a Live2D avatar to simulate a talking virtual personality. Open-LLM-VTuber is modular, allowing developers to swap or configure different language models, speech recognition engines, and voice synthesis systems depending on their needs. ...
    Downloads: 21 This Week
    Last Update:
    See Project
  • 18
    RunAnywhere

    RunAnywhere

    Production ready toolkit to run AI locally

    ...It also includes integrated pipelines that combine speech-to-text, large language models, and text-to-speech into a complete conversational system.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Lingvo

    Lingvo

    Framework for building neural networks

    Lingvo is a TensorFlow based framework focused on building and training sequence models, especially for language and speech tasks. It was originally developed for internal research and later open sourced to support reproducible experiments and shared model implementations. The framework provides a structured way to define models, input pipelines, and training configurations using a common interface for layers, which encourages reuse across different tasks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    OuteTTS

    OuteTTS

    Interface for OuteTTS models

    OuteTTS is an interface library for running OuteTTS text-to-speech models across a range of backends, making it easier to deploy the same model on different hardware and runtimes. It provides a high-level Interface API that wraps model configuration, speaker handling, and audio generation so you can focus on integrating speech into your application rather than wiring up low-level engines. The project supports multiple backends including llama.cpp (Python bindings and server), Hugging Face Transformers, ExLlamaV2, VLLM and a JavaScript interface via Transformers.js, allowing it to run on CPUs, NVIDIA CUDA GPUs, AMD ROCm, Vulkan-capable GPUs, and Apple Metal. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 21
    VideoCaptioner

    VideoCaptioner

    AI-powered tool for generating, optimizing, and translating subtitles

    VideoCaptioner is an open source AI-powered subtitle processing tool designed to simplify the workflow of creating subtitles for videos. It integrates speech recognition, language processing, and translation technologies to automatically generate and refine subtitles from video or audio sources. VideoCaptioner uses speech-to-text engines such as Whisper variants to transcribe spoken content and convert it into subtitle text with accurate timestamps. After transcription, large language models are used to intelligently restructure subtitles into natural sentences, correct wording, and improve readability for viewers. ...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 22
    comfyui-mixlab-nodes

    comfyui-mixlab-nodes

    Workflow and speech recognition app

    ...The project also brings Real-time Design features like screen capture and floating video nodes, enabling creative pipelines that mix live screen content, generative models, and visual effects. For audio and speech, it provides nodes for SpeechRecognition and SpeechSynthesis, plus workflows that combine voice generation with real-time face swapping and other audio-visual effects. On the AI side, it integrates multiple LLM providers (cloud and local), supports OpenAI-compatible endpoints, Siliconflow models, and includes prompt-focused utilities for random prompt generation, Chinese prompts, clip interrogation.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 23
    OpenVoice

    OpenVoice

    Instant voice cloning by MIT and MyShell. Audio foundation model

    OpenVoice is a versatile instant voice cloning system that can replicate a speaker’s tone color from just a short audio clip and then generate speech in multiple languages. It is designed not only to match the timbre of the reference voice, but also to give granular control over style parameters such as emotion, accent, rhythm, pauses, and intonation. The model supports cross-lingual and even zero-shot cross-lingual voice cloning, so a speaker recorded in one language can be made to speak naturally in others. ...
    Downloads: 21 This Week
    Last Update:
    See Project
  • 24
    Streamer-Sales

    Streamer-Sales

    LLM Large Model of Selling Anchor

    ...By analyzing product characteristics and marketing information, the model can produce engaging explanations that emphasize benefits, features, and emotional appeal to encourage viewers to make purchasing decisions. The system integrates multiple AI technologies including retrieval-augmented generation to incorporate product knowledge, speech synthesis to convert generated scripts into voice output, and digital human generation to create virtual hosts. It also supports automatic speech recognition and agent-based tools that can retrieve additional information such as logistics or product details during live sessions.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 25
    CosyVoice

    CosyVoice

    Multi-lingual large voice generation model, providing inference

    ...The model supports multiple languages, including Chinese, English, Japanese, Korean, and a range of Chinese dialects such as Cantonese, Sichuanese, Shanghainese, Tianjinese, and Wuhanese. It is designed for zero-shot voice cloning and cross-lingual or mix-lingual scenarios, so a single reference voice can be used to synthesize speech across languages and in code-switching contexts. CosyVoice 2.0 significantly improves on version 1.0 by boosting accuracy, stability, speed, and overall speech quality, making it more suitable for production environments. The repository contains training recipes, inference pipelines, deployment scripts, and integration examples, positioning it as a comprehensive toolkit rather than just a set of model weights.
    Downloads: 0 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB