Speech AI Developer Platform

The Speech AI platform that powers Voxist — and your product.

Production-grade speech recognition, neural text-to-speech, machine translation, speaker diarization, and language identification. The same engines that run a Fortune 100 knowledge platform and a Fortune 500 contact center. Transparent EUR pricing. Sovereign deployment from day one.

api.voxist.com · 200 OK
$ curl https://api.voxist.com/v1/transcribe \
  -H "Authorization: Bearer $VOXIST_API_KEY" \
  -F audio=@call.wav \
  -F language=fr -F diarize=true

{
  "text": "bonjour, je vous appelle…",
  "language": "fr",
  "words": [{ "w":"bonjour", "t":0.12, "c":0.99 }]
}
word timestamps · EUR / second · 8.9 streams/vCPU
The problem

The gap that compounds quietly

The pain

Your Speech AI vendor is American by default

Deepgram, AssemblyAI, ElevenLabs, Speechmatics (now US-owned), Cartesia. Every major Speech AI API is built and hosted in the United States. The cheapest path to "voice AI in our product" is the one that puts your customers' audio on US cloud, under the CLOUD Act, with the next compliance memo waiting to be written. You've shipped that way because there was no real alternative. Now there is.

The cost

English is the only language they actually optimized for

Most Speech AI APIs were built English-first and added other languages later. The benchmark numbers show it: word error rates that double when you switch from English to French, double again on regional accents, and break entirely on code-switching. If your product serves European users in any language other than English, you're paying for a model that loses to a specialized engine on the conversations your users actually have.

With Voxist

Your infrastructure bill grows with your usage. It shouldn't

Mainstream Speech AI APIs are priced per minute, and the per-minute number assumes a GPU somewhere on the other end. At scale, you're funding a GPU rental at a meaningful margin. Voxist's ASR runs on CPU: 8.9 concurrent streams per vCPU at a real-time factor under 1.05. The economics are different. The bill is different. The deployment is different.

How it works

Six endpoints, one platform, three deployment models

Endpoint              What it does
/v1/transcribe        Speech-to-text. Streaming or batch. 40+ languages.
/v1/synthesize        Text-to-speech. Neural voices, voice cloning available.
/v1/translate         Machine translation. Text or streaming voice-to-voice.
/v1/diarize           Speaker separation. Often combined with transcription.
/v1/detect-language   Sub-100ms language identification on audio or text.
/v1/vad               Voice activity detection. Edge-deployable.

Three deployment models

Voxist Cloud

The default. Hosted in Europe (OVHcloud, Scaleway), GDPR-native, transparent EUR pricing per second of audio.

Voxist Private Cloud

Your own VPC on the cloud of your choice. Bring-your-own-key encryption. Data residency you control.

Voxist On-Premise

The full platform inside your data center, on your hardware. Air-gapped option for defense and public sector.

One SDK per language you actually code in. Python, Node.js, Go, Rust, Java, .NET. OpenAPI 3.0 for everything else, WebRTC and gRPC for streaming, runnable samples on every doc page.

Capabilities

Built to do the hard things well

40+ languages with European depth

French at 4.2% WER. German, Spanish, Italian, Portuguese, Dutch, Polish, Czech, and Hungarian are covered, and the list is growing. Each language is production-grade and benchmarked publicly.

Streaming and batch

Sub-200ms first-audio latency for real-time use cases, batch processing for high-throughput document and recording workloads.

Speaker diarization

Automatic speaker separation, with per-speaker timestamps and consistent speaker labels across long-form audio.

Translation across 40+ languages

VoxTranslate, ranked COMET #2 globally on French-centric EU pairs, beating DeepL in 17/20 pairs.

Neural TTS with voice cloning

Natural-sounding speech synthesis. Custom voice cloning available on request. Voice preservation (real-time speaker voice cloning) is on the roadmap for late 2026.

Domain vocabularies

Medical, legal, technical, and financial vocabularies pre-loaded. Per-customer custom vocabulary supported on enterprise contracts.

Real-time WebSocket and gRPC streaming

For use cases where round-trip HTTP latency is the constraint.

Word-level timestamps and confidence scores

Every transcription returns word-level offsets and per-word confidence, suitable for synchronized captions, search index construction, and quality monitoring.
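As a rough illustration of the caption and quality-monitoring use cases, the sketch below groups word entries into caption cues and flags low-confidence words. It assumes the response shape shown in the sample at the top of this page (`w` for the word, `t` for its start offset in seconds, `c` for confidence). The function names, the pause and length limits, and the 0.6 threshold are illustrative choices, not part of the Voxist API.

```python
from typing import Dict, List

# Assumed word shape, taken from the sample response: {"w": str, "t": float, "c": float}
Word = Dict[str, object]


def group_into_cues(words: List[Word], max_gap: float = 0.8, max_words: int = 7) -> List[dict]:
    """Group words into caption cues.

    A new cue starts after a pause longer than max_gap seconds,
    or once the current cue already holds max_words words.
    """
    cues: List[dict] = []
    for word in words:
        start = float(word["t"])
        current = cues[-1] if cues else None
        if (
            current is None
            or start - current["last_start"] > max_gap
            or len(current["words"]) >= max_words
        ):
            current = {"start": start, "last_start": start, "words": []}
            cues.append(current)
        current["words"].append(str(word["w"]))
        current["last_start"] = start
    return [{"start": c["start"], "text": " ".join(c["words"])} for c in cues]


def low_confidence(words: List[Word], threshold: float = 0.6) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [str(w["w"]) for w in words if float(w["c"]) < threshold]


# The first entry mirrors the sample response; the remaining entries are invented for the example.
sample = [
    {"w": "bonjour", "t": 0.12, "c": 0.99},
    {"w": "je", "t": 0.61, "c": 0.97},
    {"w": "vous", "t": 0.74, "c": 0.55},
    {"w": "appelle", "t": 0.95, "c": 0.98},
    {"w": "merci", "t": 3.40, "c": 0.93},
]

cues = group_into_cues(sample)
flagged = low_confidence(sample)
print(cues)     # two cues: one starting at 0.12, one starting at 3.4
print(flagged)  # ['vous']
```

Because the sample response only shows start offsets, a caption renderer built this way would have to treat the start of the next cue as the end of the current one.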

CPU-first deployment

Voxist ASR runs at 8.9 concurrent streams per vCPU with RTF under 1.05. Most competitors require GPU infrastructure for real-time work.

Proof

The same engines that power our enterprise products are the engines you call

There is no separate "developer-grade" ASR or TTS that runs on the Voxist API. The exact same models — the same checkpoints, the same training data, the same engineering — power Voxcept's Dynamic AI Interview at a Fortune 100 FMCG, Voxlive's contact center, and the 30,000-user Voxreply consumer base. When you call /v1/transcribe, you're calling the engine that handles enterprise traffic at scale.

4.2%
French ASR WER
<200ms
first-audio latency P95
8.9
concurrent streams / vCPU
#2
COMET · French-centric pairs
17/20
pairs ahead of DeepL
18/20
pairs ahead of GPT-4o
40+
languages supported
1000 min
free / month · no card
What makes it Voxist

Four things, every time

Latency

CPU-first ASR economics

Voxist's ASR engine sustains 8.9 concurrent streams per vCPU at real-time factor under 1.05 — roughly 3.5× the CPU density of Speechmatics, and unique among major Speech AI vendors in not requiring GPU infrastructure for real-time work. The cost per audio-hour, on commodity Intel CPUs, runs at €0.0047. The economics translate directly into our public pricing and into your infrastructure budget if you deploy on-premise.
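The density figure turns into capacity planning with simple arithmetic. The sketch below takes the two numbers stated above (8.9 streams per vCPU and €0.0047 per audio-hour) and derives the vCPU-hour cost they imply, plus the vCPU count for a given concurrency. It assumes that the per-audio-hour figure is pure compute cost and that one stream running for one hour consumes one audio-hour; the function names and the 500-stream example are illustrative.

```python
import math

STREAMS_PER_VCPU = 8.9            # stated on this page
COST_PER_AUDIO_HOUR_EUR = 0.0047  # stated on this page


def implied_vcpu_hour_cost(
    cost_per_audio_hour: float = COST_PER_AUDIO_HOUR_EUR,
    streams_per_vcpu: float = STREAMS_PER_VCPU,
) -> float:
    """One vCPU-hour serves streams_per_vcpu audio-hours, so multiply back."""
    return cost_per_audio_hour * streams_per_vcpu


def vcpus_needed(concurrent_streams: int, streams_per_vcpu: float = STREAMS_PER_VCPU) -> int:
    """Round up: a fractional vCPU cannot be provisioned."""
    return math.ceil(concurrent_streams / streams_per_vcpu)


print(round(implied_vcpu_hour_cost(), 4))  # 0.0418 EUR per vCPU-hour
print(vcpus_needed(500))                   # 57
```

Real sizing would add headroom for traffic peaks and for the other endpoints, which this sketch does not model.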

Languages

French and European depth, not a localization

French ASR at 4.2% WER. Translation COMET #2 globally on French-centric pairs, with the largest deltas over DeepL on the pairs you'd expect a specialized engine to win — French ↔ German, French ↔ Polish, French ↔ Hungarian. Every language is benchmarked publicly on the Voxist leaderboard, with full methodology, updated monthly.

Sovereignty

Sovereign deployment from day one

Voxist API is the only major Speech AI platform offering all three deployment models with full feature parity: SaaS on European cloud, private cloud on your VPC, fully on-premise (including the models) inside your data center. For regulated industries — healthcare, defense, public sector, finance — this is the deployment matrix that doesn't exist anywhere else.

Outcomes

Outcomes the docs don't lie about

Every claim on this page links to a public benchmark, a customer deployment, or a documented number. The COMET ranking is independent. The CPU efficiency benchmark is published with full methodology. The latency numbers are P95 from production traffic, not lab-clean conditions. Voxist Status (status.voxist.com) shows real-time platform availability and historical incidents.

How it compares

A short, honest comparison

Voxist API · Deepgram · AssemblyAI · Speechmatics · ElevenLabs
Built and hosted in Europe · Speechmatics: UK (US-owned)
French at 4.2% WER · English-first · English-first · Strong · N/A
CPU-first deployment · GPU required · GPU required · Partial · GPU required
On-premise option · Limited · Limited
Transparent EUR pricing · ❌ (USD) · ❌ (USD) · ❌ (USD) · ❌ (USD)
1000 minutes free, no credit card · Limited free · Free tier · Limited free · Free tier
Works with

One platform, six products, one flywheel

Replicate Voxcept's Dynamic AI Interview

Use /v1/transcribe + /v1/diarize + your own LLM and retrieval stack.

Replicate Voxlive's agent assist

Use streaming /v1/transcribe + /v1/translate + intent extraction on your side.

Replicate Voxlingo's voice-to-voice translation

Chain /v1/transcribe → /v1/translate → /v1/synthesize over WebSocket.

Replicate Voxreply's AI receptionist

Combine streaming ASR, TTS, and your LLM of choice into a SIP-attached agent.

Replicate Voxmemo's meeting capture

Batch /v1/transcribe + /v1/diarize + your own summarization layer.

Compliance & trust
GDPR-native · EU AI Act ready · SecNumCloud roadmap · SOC 2 Type II (in progress) · ISO 27001 (in progress) · HDS-hosted · No model training on customer audio · On-premise option · Air-gapped option
FAQ

Questions, answered

How does Voxist API compare to Deepgram or AssemblyAI?
Deepgram and AssemblyAI are both excellent US-built Speech AI platforms with strong English performance and extensive developer ecosystems. Voxist API outperforms both on French and European-language accuracy (4.2% WER on French), runs CPU-first (8.9 streams/vCPU) where they require GPU, offers on-premise deployment that they don't, and bills in EUR with no FX exposure. The right choice depends on your language priority, your deployment requirements, and your data-residency posture. See the full comparison.
Can I run Voxist API on-premise, including the models?
Yes. Voxist is one of the very few Speech AI platforms with a fully on-premise deployment option, including the ASR, TTS, and translation models. Sovereign cloud (OVHcloud, Scaleway), private cloud (your own VPC on any major provider), and air-gapped deployments are all supported. Deployment economics are transparent: at our published CPU efficiency, the on-premise total cost of ownership crosses below cloud at roughly 50k–100k minutes of monthly traffic.
What's the SLA?
99.95% platform availability on cloud deployments. Five-nines (99.999%) on dedicated enterprise contracts. P95 first-audio latency under 200ms on streaming ASR. P99 under 500ms. Real-time status at status.voxist.com.
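To make the availability figures concrete, the short sketch below converts an availability percentage into a downtime budget. It assumes a 30-day month; the function name is illustrative, and the actual measurement window would be defined by the contract.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed over the period at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


cloud = downtime_budget_minutes(99.95)
enterprise = downtime_budget_minutes(99.999)

print(round(cloud, 2))            # 21.6 minutes per 30 days
print(round(enterprise * 60, 1))  # 25.9 seconds per 30 days
```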
Do you train on my audio?
No. Voxist does not use customer audio to train models. This is a non-negotiable design principle of the platform, on every tier including the free tier. Detailed data handling policy at /company/security/.
What languages do you support, exactly?
40+ languages with production-grade depth on European languages. The full matrix, with WER and latency per language, is published on the Voxist ASR leaderboard. For translation, the COMET-scored language pair matrix is published on the Voxist translation leaderboard. Both leaderboards are updated monthly.
What's the difference between Voxist API and OpenAI's Whisper / Azure / Google Cloud Speech?
Whisper is an open-source model: you can run it yourself, call it through OpenAI's own API, or buy it through one of the many hosted Whisper services. Azure and Google Cloud Speech are general-purpose Speech APIs from US hyperscalers, hosted on US cloud (with European regions available, but operating under the CLOUD Act). Voxist API is purpose-built in Europe, with French and European-language depth that those generalist APIs don't optimize for, and with sovereign deployment options that they cannot offer.
Do you offer voice cloning?
Yes — Voxist TTS supports custom voice cloning on enterprise contracts. Sixty seconds of high-quality source audio is sufficient. Voice preservation in real-time translation (rendering the translated audio in the original speaker's own voice) is on roadmap for late 2026.
Can I use Voxist API for medical, legal, or financial dictation?
Yes — domain vocabularies for medical, legal, technical, and financial contexts are pre-loaded. Per-customer custom vocabulary is supported on enterprise contracts. Health data is hosted on HDS-certified infrastructure in France.
How do I get started?
Sign up for a free API key at voxist.com/api/signup. 1000 minutes per month free, no credit card. Documentation, SDKs, and code samples at developers.voxist.com. Community support via Discord; paid plans include direct support and SLA-backed response times.
Is there a status page?
Yes — status.voxist.com. Real-time platform health, regional availability, incident history, and SLA reporting.
Do you have a community / open-source presence?
Voxist contributes to several open-source projects in the Speech AI space (see github.com/voxist) and participates in ELLIOT (Horizon Europe) research collaborations. The Voxist team publishes on arXiv and at speech/translation conferences. The Voxist Benchmarks pages publish methodology and code for our public benchmarks so they can be reproduced and challenged.

Build with the Speech AI platform that runs production traffic.

Get an API key

English & French · EU-hosted · no audio used for model training