Every voice tool today is a silo. Wispr Flow works with Wispr Flow. Serenade works with Serenade. If you build an app that accepts voice input, you either build your own stack from scratch or lock yourself into someone else's.
This is where we were with messaging before XMPP, with web APIs before REST. Voice interaction needs a common language. That's what OpenVIP is.
## The problem is interoperability
Say you build a coding agent. You want voice support. Today your options are:
- Build everything yourself. Audio capture, VAD, transcription, intent parsing, action routing. Months of work before you ship anything.
- Use a proprietary service. Fast to integrate, but now your users' audio goes through someone else's servers. And when that service changes its API or pricing, you're stuck.
- Give up on voice. Most teams pick this one.
OpenVIP is option 4: an open protocol that separates the voice layer from the application. Your agent subscribes to voice events. A voice provider (like dictare) handles the hard parts — capture, transcription, delivery. Everyone speaks the same protocol.
## What OpenVIP defines
OpenVIP is deliberately simple. It defines:
- **SSE-based event streaming.** Agents connect via Server-Sent Events and receive transcription events in real time. No WebSockets, no polling, no complex handshake. SSE is HTTP — it goes through every proxy, every firewall, every load balancer.
- **REST endpoints for control.** Subscribe an agent, unsubscribe it, check status, send TTS responses. Standard REST, standard JSON.
- **Agent subscription model.** Agents register with the voice server and declare their capabilities. The server routes transcriptions to the right agent based on context, focus, or explicit voice commands.
- **Bidirectional communication.** Agents can send text back through the protocol for TTS synthesis. The conversation goes both ways.
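To make the event stream concrete, here is a sketch of parsing raw SSE frames into event dicts. The payload schema (`type`, `text`) is an assumption for illustration — the actual event shape is defined by the spec at openvip.dev.

```python
import json

def parse_sse_stream(raw: str):
    """Parse a raw SSE stream into a list of event dicts.

    Assumes each event carries a JSON payload in its `data:` lines,
    as an OpenVIP-style server might send. The field names used here
    are illustrative, not normative.
    """
    events = []
    for frame in raw.split("\n\n"):  # SSE events are separated by blank lines
        data_lines = [line[len("data:"):].strip()
                      for line in frame.splitlines()
                      if line.startswith("data:")]
        if data_lines:
            # Per the SSE format, multiple data: lines join with newlines
            events.append(json.loads("\n".join(data_lines)))
    return events

stream = (
    'data: {"type": "transcription", "text": "open the settings"}\n\n'
    'data: {"type": "transcription", "text": "run the tests"}\n\n'
)
for event in parse_sse_stream(stream):
    print(event["type"], "->", event["text"])
```

In practice the SDK hides this parsing, but it shows how little machinery sits between an agent and the wire: plain HTTP, blank-line-delimited frames, JSON payloads.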
## How dictare implements it
dictare is the reference implementation of OpenVIP. It runs the voice server that agents connect to. The flow:
1. dictare captures audio from your microphone
2. VAD detects speech boundaries
3. The local STT engine transcribes the speech
4. The transcription is delivered via SSE to subscribed agents
5. Agents respond via REST, and dictare speaks the response via TTS
All local. The OpenVIP server runs on localhost. Agents connect over the loopback interface. No traffic leaves your machine.
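To illustrate the VAD step above, here is a toy energy-threshold detector. This is not dictare's implementation (real VADs use trained models, not a fixed cutoff); it just shows the core idea: a speech segment is wherever frame energy stays above a threshold.

```python
def detect_speech_segments(frames, threshold=0.1):
    """Toy VAD: return (start, end) frame-index pairs where per-frame
    energy exceeds `threshold`. Illustrative only, not dictare's VAD."""
    segments = []
    start = None
    for i, energy in enumerate(frames):
        if energy > threshold and start is None:
            start = i                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:                  # stream ended mid-speech
        segments.append((start, len(frames)))
    return segments

# Silence, a burst of speech, silence, another burst
energies = [0.0, 0.0, 0.5, 0.6, 0.4, 0.0, 0.0, 0.7, 0.0]
print(detect_speech_segments(energies))    # [(2, 5), (7, 8)]
```

Each detected segment is what gets handed to the STT engine, so boundary quality directly shapes transcription quality.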
## Build your own integration
The simplest way to add voice to any tool is the pipe pattern:
```shell
# Voice input to any program
dictare transcribe | your-tool

# Voice input, voice output
dictare transcribe | your-tool | dictare speak

# Example: voice-powered LLM chat
dictare transcribe --auto-submit | llm | dictare speak
```
`dictare transcribe` outputs transcriptions as text lines. `dictare speak` reads text lines and speaks them. Unix pipes do the rest.
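Any line-oriented program slots into the middle of that pipe. Here is a minimal hypothetical `your-tool`: it reads transcribed lines from stdin and writes one reply per line to stdout, which `dictare speak` would then voice. The `reply` logic is a placeholder.

```python
import sys

def reply(line: str) -> str:
    """Turn one transcribed line into one spoken reply.
    Placeholder logic; a real tool would do something useful here."""
    return f"You said: {line.strip()}"

def main() -> None:
    for line in sys.stdin:                  # one transcription per line
        if line.strip():
            print(reply(line), flush=True)  # flush so TTS isn't delayed
if __name__ == "__main__":
    main()
```

The only contract is line-oriented text on stdin and stdout — flushing each line matters, because block buffering would make the spoken responses lag behind the input.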
For deeper integration, use the `openvip` Python SDK:
```python
from openvip import Client

client = Client("http://localhost:8770/openvip")
for event in client.subscribe("my-agent"):
    if event.type == "transcription":
        result = process(event.text)
        client.speak(result)
```
That's a complete voice integration in a handful of lines.
## Why open matters
Proprietary voice APIs come and go. Open protocols survive. HTTP is 35 years old. SMTP is 43. They work because anyone can implement them and no single company controls them.
OpenVIP aims for the same thing in voice interaction. If dictare disappears tomorrow, the protocol lives on. Your integrations keep working with any compatible voice server.
## Current status
The protocol spec is at openvip.dev. The Python SDK is on PyPI. dictare is the reference implementation.
OpenVIP is still evolving. If you believe voice interaction should be open and interoperable, check out the spec and tell me what's missing.
