The Speech AI platform that powers Voxist — and your product.
Production-grade speech recognition, neural text-to-speech, machine translation, speaker diarization, and language identification. The same engines that run a Fortune 100 knowledge platform and a Fortune 500 contact center. Transparent EUR pricing. Sovereign deployment from day one.
$ curl https://api.voxist.com/v1/transcribe \
    -H "Authorization: Bearer $VOXIST_API_KEY" \
    -F audio=@call.wav \
    -F language=fr -F diarize=true
{ "text": "bonjour, je vous appelle…", "language": "fr", "words": [{ "w":"bonjour", "t":0.12, "c":0.99 }] }

$ curl https://api.voxist.com/v1/synthesize \
    -F text="Votre colis arrive demain." \
    -F voice=fr_neural_1 -F format=wav

$ curl https://api.voxist.com/v1/translate \
    -F from=fr -F to=de \
    -F stream=true -F audio=@call.wav
The gap that compounds quietly
Your Speech AI vendor is American by default
Deepgram, AssemblyAI, ElevenLabs, Speechmatics (now US-owned), Cartesia. Every major Speech AI API is built and hosted in the United States. The cheapest path to "voice AI in our product" is the one that puts your customers' audio on US cloud, under the CLOUD Act, with the next compliance memo waiting to be written. You've shipped that way because there was no real alternative. Now there is.
English is the only language they actually optimized for
Most Speech AI APIs were built English-first and added other languages later. The benchmark numbers show it: word error rates that double when you switch from English to French, double again on regional accents, and break down entirely on code-switching. If your product serves European users in any language other than English, you're paying for a model that loses to a specialized engine on the conversations your users actually have.
Your infrastructure bill grows with your usage. It shouldn't.
Mainstream Speech AI APIs are priced per minute, and the per-minute number assumes a GPU somewhere on the other end. At scale, you're funding a GPU rental at meaningful margin. Voxist's ASR runs on CPU — 8.9 concurrent streams per vCPU at real-time factor under 1.05. The economics are different. The bill is different. The deployment is different.
Six endpoints, one platform, three deployment models
| Endpoint | What it does |
|---|---|
| /v1/transcribe | Speech-to-text. Streaming or batch. 40+ languages. |
| /v1/synthesize | Text-to-speech. Neural voices, voice cloning available. |
| /v1/translate | Machine translation. Text or streaming voice-to-voice. |
| /v1/diarize | Speaker separation. Often combined with transcription. |
| /v1/detect-language | Sub-100ms language identification on audio or text. |
| /v1/vad | Voice activity detection. Edge-deployable. |
Three deployment models
SaaS
The default. Hosted in Europe (OVHcloud, Scaleway), GDPR-native, transparent EUR pricing per second of audio.
Private cloud
Your own VPC on the cloud of your choice. Bring-your-own-key encryption. Data residency you control.
On-premise
The full platform inside your data center, on your hardware. Air-gapped option for defense and public sector.
One SDK per language you actually code in. Python, Node.js, Go, Rust, Java, .NET. OpenAPI 3.0 for everything else, WebRTC and gRPC for streaming, runnable samples on every doc page.
Built to do the hard things well
40+ languages with European depth
French at 4.2% WER. German, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Hungarian, and growing. Each language is production-grade and benchmarked publicly.
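For readers less familiar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below shows that standard definition only. It is not Voxist's benchmark harness, and real evaluations usually normalize casing and punctuation first, which this version skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a four-word reference, for example `word_error_rate("bonjour je vous appelle", "bonjour je vous rappelle")`, gives 0.25, or 25% WER.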
Streaming and batch
Sub-200ms first-audio latency for real-time use cases, batch processing for high-throughput document and recording workloads.
Speaker diarization
Automatic speaker separation, with per-speaker timestamps and consistent speaker labels across long-form audio.
Translation across 40+ languages
VoxTranslate, ranked COMET #2 globally on French-centric EU pairs, beating DeepL on 17 of 20 pairs.
Neural TTS with voice cloning
Natural-sounding speech synthesis. Custom voice cloning available on request. Voice preservation (real-time speaker voice cloning) on roadmap for late 2026.
Domain vocabularies
Medical, legal, technical, and financial vocabularies pre-loaded. Per-customer custom vocabulary supported on enterprise contracts.
Real-time WebSocket and gRPC streaming
For use cases where round-trip HTTP latency is the constraint.
Word-level timestamps and confidence scores
Every transcription returns word-level offsets and per-word confidence, suitable for synchronized captions, search index construction, and quality monitoring.
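As a rough illustration of how word-level output could feed captions and quality monitoring, the sketch below works on a response shaped like the sample at the top of the page (`words` entries with `w`, `t`, and `c`). It assumes `t` is the word's start offset in seconds and `c` a confidence between 0 and 1. The helper names and the 0.8 threshold are hypothetical, not part of any Voxist SDK.

```python
def low_confidence_words(response: dict, threshold: float = 0.8) -> list[dict]:
    """Return the words whose confidence falls below the threshold."""
    return [word for word in response.get("words", []) if word["c"] < threshold]


def caption_cues(response: dict, max_words: int = 7) -> list[tuple[float, str]]:
    """Group words into caption cues of at most max_words each.

    Each cue is (start_offset, text), where start_offset is the "t"
    value of the cue's first word.
    """
    words = response.get("words", [])
    cues = []
    for start in range(0, len(words), max_words):
        chunk = words[start:start + max_words]
        cues.append((chunk[0]["t"], " ".join(word["w"] for word in chunk)))
    return cues
```

Words flagged by `low_confidence_words` are natural candidates for human review, while the cue list can be written out in whichever caption format the player expects.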
CPU-first deployment
Voxist ASR runs at 8.9 concurrent streams per vCPU with RTF under 1.05. Most competitors require GPU infrastructure for real-time work.
The same engines that power our enterprise products are the engines you call
There is no separate "developer-grade" ASR or TTS that runs on the Voxist API. The exact same models — the same checkpoints, the same training data, the same engineering — power Voxcept's Dynamic AI Interview at a Fortune 100 FMCG, Voxlive's contact center, and the 30,000-user Voxreply consumer base. When you call /v1/transcribe, you're calling the engine that handles enterprise traffic at scale.
Four things, every time
CPU-first ASR economics
Voxist's ASR engine sustains 8.9 concurrent streams per vCPU at real-time factor under 1.05 — roughly 3.5× the CPU density of Speechmatics, and unique among major Speech AI vendors in not requiring GPU infrastructure for real-time work. The cost per audio-hour, on commodity Intel CPUs, runs at €0.0047. The economics translate directly into our public pricing and into your infrastructure budget if you deploy on-premise.
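To make the density and cost figures concrete, here is a back-of-the-envelope sizing sketch. It uses only the two numbers quoted above (8.9 streams per vCPU, €0.0047 per audio-hour on commodity CPUs) and assumes they scale linearly, with no headroom for traffic spikes or redundancy. The function names are illustrative, and the result is a compute estimate, not Voxist's public pricing.

```python
import math

STREAMS_PER_VCPU = 8.9            # figure quoted in the text
COST_PER_AUDIO_HOUR_EUR = 0.0047  # figure quoted in the text


def vcpus_needed(concurrent_streams: int) -> int:
    """Smallest whole number of vCPUs that covers the given concurrency."""
    return math.ceil(concurrent_streams / STREAMS_PER_VCPU)


def compute_cost_eur(audio_hours: float) -> float:
    """Estimated compute cost in EUR, assuming linear scaling."""
    return audio_hours * COST_PER_AUDIO_HOUR_EUR
```

Under these assumptions, 100 concurrent streams need `vcpus_needed(100)`, which is 12 vCPUs, and 10,000 audio-hours come to roughly €47 of compute.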
French and European depth, not a localization
French ASR at 4.2% WER. Translation COMET #2 globally on French-centric pairs, with the largest deltas over DeepL on the pairs you'd expect a specialized engine to win — French ↔ German, French ↔ Polish, French ↔ Hungarian. Every language is benchmarked publicly on the Voxist leaderboard, with full methodology, updated monthly.
Sovereign deployment from day one
Voxist API is the only major Speech AI platform offering all three deployment models with full feature parity: SaaS on European cloud, private cloud on your VPC, fully on-premise (including the models) inside your data center. For regulated industries — healthcare, defense, public sector, finance — this is the deployment matrix that doesn't exist anywhere else.
Outcomes the docs don't lie about
Every claim on this page links to a public benchmark, a customer deployment, or a documented number. The COMET ranking is independent. The CPU efficiency benchmark is published with full methodology. The latency numbers are P95 from production traffic, not lab-clean conditions. Voxist Status (status.voxist.com) shows real-time platform availability and historical incidents.
A short, honest comparison
| Voxist API | Deepgram | AssemblyAI | Speechmatics | ElevenLabs |
|---|---|---|---|---|
| Built and hosted in Europe | — | — | UK (US-owned) | — |
| French at 4.2% WER | English-first | English-first | Strong | N/A |
| CPU-first deployment | GPU required | GPU required | Partial | GPU required |
| On-premise option | Limited | — | Limited | — |
| Transparent EUR pricing | ❌ (USD) | ❌ (USD) | ❌ (USD) | ❌ (USD) |
| 1000 minutes free, no credit card | Limited free | Free tier | Limited free | Free tier |
One platform, six products, one flywheel
Use /v1/transcribe + /v1/diarize + your own LLM and retrieval stack.
Use streaming /v1/transcribe + /v1/translate + intent extraction on your side.
Chain /v1/transcribe → /v1/translate → /v1/synthesize over WebSocket.
Combine streaming ASR, TTS, and your LLM of choice into a SIP-attached agent.
Batch /v1/transcribe + /v1/diarize + your own summarization layer.
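The transcribe, translate, synthesize chain can be sketched as three dependent calls. This is a simplified request/response version, not the WebSocket flow, and it leans on several assumptions: `post` is a hypothetical transport you supply, `/v1/translate` is assumed to accept a `text` field and return `text`, and `/v1/synthesize` is assumed to return the audio under an `audio` key. Only the endpoint paths and the `language`, `from`, `to`, `voice`, and `format` fields come from the samples above.

```python
from typing import Callable

# Hypothetical transport: takes an endpoint path and form fields,
# returns the decoded response as a dict.
Transport = Callable[[str, dict], dict]


def speech_to_speech(audio: bytes, source: str, target: str,
                     voice: str, post: Transport) -> bytes:
    """Chain the three endpoints; each step consumes the previous output."""
    transcript = post("/v1/transcribe", {"audio": audio, "language": source})
    translated = post("/v1/translate", {
        "text": transcript["text"],  # assumed field name
        "from": source,
        "to": target,
    })
    speech = post("/v1/synthesize", {
        "text": translated["text"],  # assumed response key
        "voice": voice,
        "format": "wav",
    })
    return speech["audio"]           # assumed response key
```

Injecting the transport keeps the chain independent of any HTTP or WebSocket client, so the same function can be exercised against a stub before it is wired to the real API.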
Questions, answered
How does Voxist API compare to Deepgram or AssemblyAI?
Can I run Voxist API on-premise, including the models?
What's the SLA?
Do you train on my audio?
What languages do you support, exactly?
What's the difference between Voxist API and OpenAI's Whisper / Azure / Google Cloud Speech?
Do you offer voice cloning?
Can I use Voxist API for medical, legal, or financial dictation?
How do I get started?
Is there a status page?
Do you have a community / open-source presence?
Build with the Speech AI platform that runs production traffic.
English & French · EU-hosted · no audio used for model training