Your voice is biometric data. It identifies you uniquely. It reveals your accent, your mood, your health. And most voice tools send it to someone else's server.
dictare doesn't. Every piece of the pipeline — audio capture, voice activity detection, speech-to-text, text-to-speech — runs locally on your hardware. No cloud. No API keys. No subscription. No audio data leaving your computer.
This isn't a privacy feature. It's the architecture.
Speech-to-text: local engines only¶
dictare supports three STT backends. All of them run on your machine.
Whisper (via faster-whisper and MLX)¶
OpenAI's Whisper models, running locally through optimized runtimes:
- faster-whisper (CTranslate2) — works on Linux and Intel Macs. Fast inference with INT8 quantization.
- MLX Whisper — native Apple Silicon acceleration on macOS. Uses the Metal GPU, runs at near-real-time on M1 and faster on M2/M3/M4.
Models range from tiny (39M parameters, ~1GB RAM) to large-v3-turbo (809M parameters, ~3GB RAM). The large-v3-turbo model gives you near-cloud accuracy with local inference.
[stt]
model = "large-v3-turbo"
language = "en"
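If you want to poke at the same engine directly, faster-whisper exposes this path in a few lines of Python. A minimal sketch, assuming a recent faster-whisper install and a local sample.wav; dictare's own wiring may differ:
# Local transcription with faster-whisper (CTranslate2), INT8 quantized.
from faster_whisper import WhisperModel
model = WhisperModel("large-v3-turbo", device="auto", compute_type="int8")
segments, info = model.transcribe("sample.wav", language="en")
print(f"language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s]{segment.text}")
The model downloads once into a local cache on first run; after that, everything works offline.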
Parakeet-v3¶
NVIDIA's Parakeet model running via ONNX Runtime. Works on any platform with ONNX support. A strong alternative if you want something outside the Whisper family.
[stt]
model = "parakeet-v3"
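The ONNX path is engine-agnostic: ONNX Runtime loads the exported model and walks a list of execution providers until one is available on your hardware. A rough sketch of the loading side, with a placeholder model path (Parakeet's real pre- and post-processing is more involved):
# Load an exported model with ONNX Runtime and inspect what's available.
import onnxruntime as ort
print("providers:", ort.get_available_providers())
# The session falls back left-to-right, e.g. CUDA first, then CPU.
session = ort.InferenceSession("parakeet-v3.onnx",
                               providers=["CPUExecutionProvider"])
print("inputs:", [(i.name, i.shape) for i in session.get_inputs()])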
Text-to-speech: local too¶
When your agent talks back, that synthesis also happens locally. dictare supports multiple TTS engines:
- Kokoro — high-quality neural TTS with multiple voices. Runs as a subprocess worker for isolation (see the sketch below).
- Piper — lightweight, fast, great for low-latency feedback. Dozens of voices available.
- espeak — the classic. Not pretty, but it's everywhere and it's instant.
- macOS say — uses the built-in macOS speech synthesis. Zero setup.
- OuteTTS — neural TTS with voice cloning capabilities.
[tts]
engine = "kokoro"
voice = "af_heart"
All local. All offline-capable.
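The subprocess isolation mentioned for Kokoro is a pattern worth seeing: the heavy engine lives in its own process, so a crash or memory spike there can't take down the capture loop. A minimal sketch of the idea, where tts_worker.py is a hypothetical worker that reads one utterance per line from stdin, not dictare's actual protocol:
# Keep the TTS engine in a worker process; restart it if it dies.
import subprocess

def start_worker():
    return subprocess.Popen(["python", "tts_worker.py"],
                            stdin=subprocess.PIPE, text=True)

worker = start_worker()

def speak(text):
    global worker
    if worker.poll() is not None:  # worker crashed; only it restarts
        worker = start_worker()
    worker.stdin.write(text + "\n")
    worker.stdin.flush()

speak("Build finished.")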
No API keys, no subscription¶
This is worth emphasizing. dictare depends on no external service at runtime. You install it, you run it, it works. No account creation, no API key management, no "free tier" with limits, no monthly bill.
Your voice data stays on your machine because there's nowhere else for it to go. The architecture has no cloud component to send data to, even if you wanted to.
Performance¶
"Local" used to mean "slow." Not anymore.
On Apple Silicon (M1 and newer), MLX Whisper with the large-v3-turbo model transcribes in near-real-time. You finish speaking and the text appears within a second.
On Linux with CUDA, faster-whisper with GPU acceleration is similarly fast. Even on CPU, the smaller models (small, medium) are responsive enough for interactive use.
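These claims are easy to check on your own hardware. A quick timing sketch, assuming faster-whisper is installed and clip.wav is a short local recording:
# Measure end-to-end transcription latency for a short clip.
import time
from faster_whisper import WhisperModel
model = WhisperModel("small", device="auto", compute_type="int8")
start = time.perf_counter()
segments, _ = model.transcribe("clip.wav")
text = "".join(s.text for s in segments)  # lazy generator: consume it to finish
print(f"{time.perf_counter() - start:.2f}s:{text}")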
The TTS engines are fast too. Kokoro produces natural speech with barely perceptible latency. Piper is even faster if you need instant feedback.
The tradeoff¶
Let's be honest: local STT isn't quite as accurate as the best cloud APIs. Cloud services have massive models and server-grade hardware. The gap is narrowing fast — large-v3-turbo is remarkably good — but it exists.
For coding conversations, it barely matters. You're speaking in short, clear sentences about technical topics. The context is constrained. The vocabulary is predictable. Local models handle this well.
And what you get in return — complete privacy, predictable latency with no network in the loop, offline capability, no recurring cost — is worth it.
How to verify¶
Don't take my word for it. Run dictare and check:
# List dictare's open network sockets while it runs
sudo lsof -i -P -n | grep dictare
You'll see the local OpenVIP server listening on localhost:8770, talking to agents on the loopback interface. Nothing goes out.
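If you'd rather script the check, the same assertion fits in a few lines of Python with psutil; matching on the process name "dictare" is an assumption about how you launched it:
# Verify that every socket held by dictare-named processes stays on loopback.
import psutil
for proc in psutil.process_iter(["name"]):
    if "dictare" not in (proc.info["name"] or ""):
        continue
    # psutil >= 6.0 names this net_connections(); older releases, connections()
    conn_fn = getattr(proc, "net_connections", None) or proc.connections
    try:
        conns = conn_fn(kind="inet")
    except psutil.AccessDenied:
        continue
    for conn in conns:
        remote = conn.raddr.ip if conn.raddr else None
        assert remote in (None, "127.0.0.1", "::1"), f"leak: {conn}"
        print(proc.pid, conn.laddr, "->", remote or "(listening)")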
Your voice stays yours.
