asr/v1 — Automatic Speech Recognition Backends¶
Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (whisper-cpp) · Implementations: in-tree only until v1 is frozen.
The asr/v1 surface sits between segment/v1 (which produces speech
segments — chunks of audio with detected speech boundaries) and router/v1
(which consumes partially-filled IntentEnvelopes). An ASR backend
takes a speech segment and produces a final transcript: text, language,
confidence, optional word-level alignment.
Vox supports both streaming (partial transcripts emitted as audio flows; final on segment close) and offline (single-shot transcription of a complete segment) modes. Backends declare which they support; the orchestrator picks based on source kind and configuration.
This document is the contract. Backends conforming to it can be loaded by
any version of the Vox core that supports asr/v1.
Scope¶
asr/v1 covers:
- The backend interface (streaming + offline modes)
- Mode selection (per source kind, overridable per instance)
- Input contract (audio frames from capture/v1, segment metadata from segment/v1)
- Output contract (FinalTranscript → partially-filled IntentEnvelope)
- Per-stream backend instances + per-source-kind routing
- BYOK authentication + shared credential support
- Cost controls (budgets, rate limits, pre-flight estimation)
- Fallback chain for backend failure
- Word timestamps, formatting, language detection
- Speaker label precedence + refinement rules
- Custom vocabulary / boost words
- Profanity filtering posture
- Local model management (download, storage, verification, versioning)
- Error model + latency budgets
- Versioning and stability rules
asr/v1 does not cover:
- Audio capture (capture/v1)
- Speech detection / segmentation (segment/v1)
- Intent classification or sink routing (router/v1)
- Sink delivery (sink/v1)
- Speaker diarization at the segmentation level (segment/v1); ASR-backend diarization is optional refinement, not the primary path
- Translation (separate future surface if needed)
Input — what the backend consumes¶
When segment/v1 detects a complete speech segment, it hands the ASR
backend:
- Audio buffer or stream of capture/v1 Frames belonging to the segment (PCM f32 or i16, the rate the backend negotiated)
- SessionInfo with identity + source metadata:
SessionInfo {
SessionID string
StreamID string
SegmentID string # unique per segment within the stream
SourceKind SourceKind # "self" | "in-person" | "online" | "file"
PreferredLang string? # BCP-47; backend may auto-detect if empty
FallbackLangs []string? # additional acceptable languages
SpeakerHint string? # opaque label; refined per precedence rules below
StartedAt Timestamp
ApproxEndedAt Timestamp # offline mode; streaming computes its own
}
Modes — streaming vs offline¶
Backends declare which modes they support via Capabilities:
Capabilities {
# Modes
SupportsStreaming bool
SupportsOffline bool
StreamingLatencyMS uint32 # typical time-to-first-partial; informational
OfflineRTF float # offline real-time factor: processing time / audio duration (e.g., 0.3 ≈ 3.3× faster than realtime)
# Output
SupportedLanguages []string # BCP-47 tags
SupportsLanguageDetection bool
SupportsWordTimestamps bool
SupportsPunctuation bool
SupportsVerboseFormatting bool # filler / pause markers
SupportsDiarization bool
SupportsVocabularyBoost bool
SupportsProfanityFilter bool
SupportsCustomModels bool # user can drop in a model_path
# Cost
CostPerAudioMinuteUSD float # 0.0 for local; > 0 for cloud
IsLocal bool
# Limits
MaxSegmentAudioMinutes uint32 # hard cap; longer segments rejected
}
Mode selection¶
Source-kind defaults; overridable per-instance.
| Source kind | Default mode | Reasoning |
|---|---|---|
| self | streaming | Dictation needs live UI feedback; user sees words as they speak |
| in-person | offline | Meetings don't need real-time UI; offline is cheaper and higher accuracy |
| online | offline | Call capture; user-facing transcription latency isn't a constraint |
| file | offline | Batch; never streaming |
asr:
by_source:
self: streaming
in-person: offline
online: offline
file: offline
instances:
- name: dictation-fast
backend: whisper-cpp
mode: streaming # override source-kind default
- name: meeting-accurate
backend: deepgram
mode: offline
If a backend doesn't support the requested mode, Open() returns
ErrUnsupported with a clear message naming the missing capability.
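The selection rule above can be sketched in Go. This is a minimal illustration, not the orchestrator's actual code: `ResolveMode`, `Caps`, and the sentinel `ErrUnsupported` variable are hypothetical names, and an empty override is assumed to mean "no per-instance override".

```go
package main

import (
	"errors"
	"fmt"
)

type Mode string

const (
	Streaming Mode = "streaming"
	Offline   Mode = "offline"
)

// ErrUnsupported stands in for the ErrUnsupported kind of the error model.
var ErrUnsupported = errors.New("unsupported")

// Caps is the subset of Capabilities that mode selection needs.
type Caps struct {
	SupportsStreaming bool
	SupportsOffline   bool
}

// defaultModes mirrors the source-kind defaults in the table above.
var defaultModes = map[string]Mode{
	"self": Streaming, "in-person": Offline, "online": Offline, "file": Offline,
}

// ResolveMode applies the instance override (empty means "none") on top of
// the source-kind default, then checks the result against the capabilities.
func ResolveMode(sourceKind string, override Mode, caps Caps) (Mode, error) {
	mode, ok := defaultModes[sourceKind]
	if !ok {
		return "", fmt.Errorf("unknown source kind %q", sourceKind)
	}
	if override != "" {
		mode = override
	}
	if mode == Streaming && !caps.SupportsStreaming {
		return "", fmt.Errorf("%w: mode %q needs SupportsStreaming", ErrUnsupported, mode)
	}
	if mode == Offline && !caps.SupportsOffline {
		return "", fmt.Errorf("%w: mode %q needs SupportsOffline", ErrUnsupported, mode)
	}
	return mode, nil
}

func main() {
	offlineOnly := Caps{SupportsOffline: true}
	mode, err := ResolveMode("in-person", "", offlineOnly)
	fmt.Println(mode, err)
	_, err = ResolveMode("self", "", offlineOnly)
	fmt.Println(errors.Is(err, ErrUnsupported), err)
}
```

Wrapping the sentinel with `%w` keeps the message specific (it names the missing capability) while still letting callers match on the error kind.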
Backend Interface¶
ASRBackend {
# Identity
Name() -> string
Capabilities() -> Capabilities
# Lifecycle
Open(config) -> Error
Close() -> Error
# Streaming mode
StreamOpen(ctx, sessionInfo) -> StreamHandle | Error
StreamFeed(handle, frames) -> Error
StreamPartials(handle) -> <-chan PartialTranscript
StreamClose(handle) -> FinalTranscript | Error
# Offline mode
Transcribe(ctx, audio, options) -> FinalTranscript | Error
# Diagnostics
Stats() -> Stats
Health() -> Health
}
Streaming-mode wire types¶
PartialTranscript {
Text string # current accumulated transcript
IsFinal bool # true on the final assembly
StableUntil uint32 # char index of stable prefix; everything before is committed
Confidence float
StartedAt Timestamp # wall-clock of segment start
CapturedAt Timestamp # wall-clock when this partial was emitted
}
FinalTranscript {
Text string
Language string # BCP-47, detected or echo-configured
Confidence float
StartedAt Timestamp
EndedAt Timestamp
Words []Word? # optional; populated if SupportsWordTimestamps
SpeakerHint string? # backend-assigned diarization label (if any)
Custom map<string, any> # backend-specific extras
}
Word {
Text string
StartMS uint32 # offset from segment start in milliseconds
EndMS uint32
Confidence float
}
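A UI consumer would typically render the committed prefix differently from the still-revisable tail. The sketch below shows one way to split a partial at StableUntil. It assumes "char index" means Unicode code points rather than bytes, which the contract does not state; `SplitStable` is a hypothetical helper.

```go
package main

import "fmt"

// SplitStable splits a partial's text at StableUntil.
// Assumption: StableUntil counts runes (code points), not bytes.
func SplitStable(text string, stableUntil uint32) (committed, volatile string) {
	runes := []rune(text)
	n := int(stableUntil)
	if n > len(runes) {
		n = len(runes) // clamp an out-of-range index instead of panicking
	}
	return string(runes[:n]), string(runes[n:])
}

func main() {
	committed, volatile := SplitStable("create a bd issue for the deck", 18)
	fmt.Printf("committed=%q volatile=%q\n", committed, volatile)
}
```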
Offline-mode wire types¶
TranscribeOptions {
Language string? # preferred / locked
FallbackLangs []string?
EmitWordTimestamps bool
Formatting string # "formatted" | "raw" | "verbose"
NumberFormatting string # "auto" | "spell-out" | "digits"
VocabularyBoost *VocabularyBoost
ProfanityFilter *ProfanityFilter
}
VocabularyBoost {
Words []string
Weight float
CustomWords []string # tell backend these ARE valid words
}
ProfanityFilter {
Enabled bool
Action string # "mask" | "drop"
}
Lifecycle state machines¶
Streaming:
           StreamOpen()               StreamFeed()×N
[closed] -----------------> [open] ----------------> [open]
                                                        |
                                                   StreamClose()
                                                        |
                                                        v
                                           [final transcript returned]
StreamFeed MUST be safe to call concurrently with StreamPartials
draining. StreamClose blocks until the final transcript is assembled.
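The concurrency requirement on the streaming path can be illustrated with a toy in-memory stream. Everything here is hypothetical (`fakeStream`, word-sized "frames", the `Run` driver); it only demonstrates the calling pattern: one goroutine drains partials while another feeds, and the close call returns the final text.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type PartialTranscript struct {
	Text string
}

// fakeStream is a stand-in for a StreamHandle: each fed "frame" is a word,
// and every feed emits a new accumulated partial.
type fakeStream struct {
	mu       sync.Mutex
	words    []string
	partials chan PartialTranscript
}

func StreamOpen() *fakeStream {
	return &fakeStream{partials: make(chan PartialTranscript, 16)}
}

func (s *fakeStream) StreamFeed(frame string) {
	s.mu.Lock()
	s.words = append(s.words, frame)
	text := strings.Join(s.words, " ")
	s.mu.Unlock()
	s.partials <- PartialTranscript{Text: text}
}

func (s *fakeStream) StreamPartials() <-chan PartialTranscript { return s.partials }

// StreamClose closes the partials channel and returns the final text.
func (s *fakeStream) StreamClose() string {
	close(s.partials)
	s.mu.Lock()
	defer s.mu.Unlock()
	return strings.Join(s.words, " ")
}

// Run feeds frames on the calling goroutine while a second goroutine
// drains partials, then closes the stream and waits for the drainer.
func Run(frames []string) (final string, partialCount int) {
	s := StreamOpen()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for range s.StreamPartials() {
			partialCount++
		}
	}()
	for _, f := range frames {
		s.StreamFeed(f)
	}
	final = s.StreamClose()
	wg.Wait()
	return final, partialCount
}

func main() {
	final, n := Run([]string{"let's", "create", "a", "bd", "issue"})
	fmt.Printf("final=%q partials=%d\n", final, n)
}
```

Closing the partials channel inside the close call gives the drainer a clean termination signal, so the consumer needs no separate "done" flag.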
Offline:
Transcribe()
[open] --------------------> [final transcript returned]
Single-shot. No streaming state to manage.
Concurrency¶
- Open/Close are called from a single control thread per backend instance.
- Transcribe is safe to call concurrently from multiple goroutines per instance; the orchestrator may issue multiple offline jobs in parallel (subject to rate_limit_per_minute config).
- Streaming sessions are independent: multiple StreamHandles may be open concurrently against the same backend instance.
Partial transcripts — internal to ASR¶
Partial transcripts do NOT become envelopes. Only the final transcript
becomes a partially-filled envelope that flows to router/v1.
Reasoning:
- Partials are revisions: text that may change as more audio arrives. Routing and sink delivery are committed work; you can't un-route a partial later proven wrong.
- Routing every partial would multiply downstream work for no real gain (5–20 partials per envelope, most replaced by the next).
- The router and sink contracts are designed for committed envelopes, not provisional ones.
The orchestrator exposes a partials subscriber channel for callers who
want them (UI components, debug surfaces, live-captioning sinks):
orchestrator.SubscribePartials(uiSubscriber)
This decouples the ASR-internal stream from the envelope pipeline.
Multiple Backend Instances¶
A single asr/v1 deployment can load multiple backends simultaneously,
each scoped to specific source kinds. Useful for "fast local ASR for my
voice, accurate cloud ASR for meetings."
asr:
instances:
- name: fast-local
backend: whisper-cpp
mode: streaming
model: large-v3-q5_0
handles_source_kinds: [self]
- name: accurate-cloud
backend: deepgram
mode: offline
handles_source_kinds: [in-person, online]
auth:
method: keychain
credential_name: deepgram-api-key
fallback_chains:
self: [fast-local, whisper-cpp-fallback]
in-person: [accurate-cloud, faster-whisper-local]
online: [accurate-cloud, faster-whisper-local]
file: [faster-whisper-local]
If multiple instances claim the same source kind, the FIRST in declared order handles it. If no instance claims a source kind, the segment is dropped with a structured warning.
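The "first in declared order" rule is simple enough to sketch directly. `Instance` and `PickInstance` are hypothetical names; the field mirrors the handles_source_kinds config key.

```go
package main

import "fmt"

type Instance struct {
	Name               string
	HandlesSourceKinds []string
}

// PickInstance returns the first instance, in declared order, that claims
// the source kind. ok is false when no instance claims it, in which case
// the caller drops the segment with a structured warning.
func PickInstance(instances []Instance, kind string) (name string, ok bool) {
	for _, inst := range instances {
		for _, k := range inst.HandlesSourceKinds {
			if k == kind {
				return inst.Name, true
			}
		}
	}
	return "", false
}

func main() {
	instances := []Instance{
		{Name: "fast-local", HandlesSourceKinds: []string{"self"}},
		{Name: "accurate-cloud", HandlesSourceKinds: []string{"in-person", "online"}},
	}
	for _, kind := range []string{"self", "online", "file"} {
		name, ok := PickInstance(instances, kind)
		fmt.Printf("%s -> %q (claimed=%v)\n", kind, name, ok)
	}
}
```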
Latency Budgets¶
Default budgets:
| Mode | Budget | On overrun |
|---|---|---|
| Streaming partial | < 500 ms time-to-first | Mark backend degraded; asr.partial_latency_exceeded event |
| Streaming final | < 1.5× segment duration | Mark backend degraded; fall back; asr.final_latency_exceeded event |
| Offline transcribe | < 2× segment duration | Same: degrade + fallback |
| Hard cutoff | 30 s for any segment | Abort; emit envelope with empty Transcript + Provenance.Custom["asr.failed"] = "timeout" |
All configurable per instance. The pipeline never blocks waiting for ASR.
Tier-1 Backends (ship with v1)¶
| Backend | Local/Cloud | Modes | Why |
|---|---|---|---|
| whisper-cpp | Local | Offline (streaming patches in some forks) | C/C++ port of OpenAI Whisper; GGML quantized models; works on every tier-1 OS; the local default |
| faster-whisper | Local | Offline + experimental streaming | CTranslate2-based; 4–8× faster than whisper-cpp on CPU/GPU; same model family |
| vosk | Local | Streaming + offline | Smaller models; lower latency; lower accuracy. The "I want streaming locally" path |
| openai-whisper-api | Cloud | Offline | OpenAI's hosted Whisper; shares credential with llm-openai sink |
| deepgram | Cloud | Streaming + offline | Streaming-first; live captioning use case |
| assemblyai | Cloud | Streaming + offline | Strong diarization + word timestamps |
Tier-2 backends (community-contributable, same contract)¶
ollama-whisper, google-speech, azure-speech, aws-transcribe,
gladia, replicate-whisper.
BYOK Authentication¶
Same credential precedence as LLM / S3 / email sinks:
- Explicit env var (e.g., DEEPGRAM_API_KEY, ASSEMBLYAI_API_KEY)
- OS keychain (default)
- Config file (deprecated; warns)
- External secrets manager (future secrets/v1)
Onboarding: vox auth set <backend> (mirrors LLM sink pattern).
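A precedence chain like this is commonly implemented as an ordered list of lookups where the first hit wins. The sketch below assumes the list above is in precedence order (env var first) and uses in-memory maps in place of the real environment, keychain, and config file; `Source`, `Resolve`, and `fromMap` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Lookup is one credential source; ok is false when it has no value.
type Lookup func(name string) (secret string, ok bool)

type Source struct {
	Label string
	Get   Lookup
}

var ErrNoCredential = errors.New("no credential found")

// Resolve walks the sources in precedence order and returns the first
// non-empty hit together with the label of the source that supplied it.
func Resolve(name string, sources []Source) (secret, from string, err error) {
	for _, s := range sources {
		if v, ok := s.Get(name); ok && v != "" {
			return v, s.Label, nil
		}
	}
	return "", "", fmt.Errorf("%w: %s", ErrNoCredential, name)
}

// fromMap adapts a map to a Lookup; it stands in for a real backing store.
func fromMap(m map[string]string) Lookup {
	return func(name string) (string, bool) {
		v, ok := m[name]
		return v, ok
	}
}

func main() {
	sources := []Source{
		{Label: "env", Get: fromMap(map[string]string{})},
		{Label: "keychain", Get: fromMap(map[string]string{"deepgram-api-key": "dg_example"})},
		{Label: "config-file", Get: fromMap(map[string]string{"deepgram-api-key": "dg_stale"})},
	}
	_, from, err := Resolve("deepgram-api-key", sources)
	fmt.Println(from, err)
}
```

Returning the source label alongside the secret makes it straightforward to emit the deprecation warning when the config file is the source that answered.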
Shared credentials¶
Backends that share auth with other Vox surfaces declare it:
asr:
instances:
- name: openai-asr
backend: openai-whisper-api
auth:
shares_credential_with: llm-openai # reuse existing keychain entry
No double prompt at onboarding; the credential is set once via the relevant LLM sink and consumed here too.
Cost Controls (Cloud Backends)¶
Audio-minute–metered cloud ASR can rack up real money fast. First-class guardrails:
asr:
instances:
- name: accurate-cloud
backend: deepgram
cost_controls:
budget_daily_usd: 5.00
budget_monthly_usd: 100.00
on_budget_warn_pct: 80 # warn at 80% of budget
on_budget_exceed: halt # halt | warn | fallback
fallback_backend: whisper-cpp # required if on_budget_exceed: fallback
rate_limit_per_minute: 60 # max segments per minute
max_segment_audio_minutes: 30 # reject implausibly long segments
Behavior¶
| Setting | What it does |
|---|---|
| budget_daily_usd / budget_monthly_usd | Hard caps. Vox tracks consumed audio-minutes × CostPerAudioMinuteUSD |
| on_budget_warn_pct | Emit asr.budget_warning event + structured log + audit event at this threshold |
| on_budget_exceed | halt = stop transcribing; warn = continue + log; fallback = route to fallback_backend |
| rate_limit_per_minute | Soft throttle. Segments queue if exceeded |
| max_segment_audio_minutes | Reject implausibly long segments (silence-VAD bug usually) |
Spend tracking persisted at ~/.vox/state/cost-tracker.db (sqlite).
Session start displays current spend: "Today's ASR spend: $0.47 of $5.00 budget".
For local backends, cost_controls is ignored except for
rate_limit_per_minute (CPU budgeting).
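The budget settings can be read as a small decision function. This is a sketch under assumptions the document does not settle: the check runs before the segment is sent, only the daily budget is modeled, and an unrecognized on_budget_exceed value is treated as halt. `CostControls`, `Decision`, and `Decide` are hypothetical names.

```go
package main

import "fmt"

type CostControls struct {
	BudgetDailyUSD float64
	WarnPct        float64 // on_budget_warn_pct
	OnExceed       string  // on_budget_exceed: "halt" | "warn" | "fallback"
}

type Decision string

const (
	Proceed     Decision = "proceed"
	ProceedWarn Decision = "proceed+warn"
	Halt        Decision = "halt"
	UseFallback Decision = "fallback"
)

// Decide checks whether spending next USD on top of spent stays within the
// daily budget and maps the outcome to the configured behavior.
func Decide(c CostControls, spent, next float64) Decision {
	total := spent + next
	if total > c.BudgetDailyUSD {
		switch c.OnExceed {
		case "warn":
			return ProceedWarn
		case "fallback":
			return UseFallback
		default: // "halt", or anything unrecognized
			return Halt
		}
	}
	if total >= c.BudgetDailyUSD*c.WarnPct/100 {
		return ProceedWarn
	}
	return Proceed
}

func main() {
	c := CostControls{BudgetDailyUSD: 5.00, WarnPct: 80, OnExceed: "fallback"}
	fmt.Println(Decide(c, 0.47, 0.01))
	fmt.Println(Decide(c, 4.10, 0.05))
	fmt.Println(Decide(c, 4.99, 0.05))
}
```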
Pre-flight cost estimation¶
asr:
preflight_cost_estimate: true # default true for paid backends
preflight_cost_log_threshold_usd: 0.10
estimated_cost = segment_duration_minutes × Capabilities.CostPerAudioMinuteUSD.
Cheap to compute. Logged at INFO when over threshold, which is useful for spotting
runaway long segments before they bill.
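The estimate itself is a single multiplication. A minimal sketch follows; `EstimateUSD` and `ShouldLog` are hypothetical names, and the per-minute rate in `main` is an illustrative number, not a quoted provider price.

```go
package main

import "fmt"

// EstimateUSD returns segment duration (in minutes) times the backend's
// per-audio-minute cost.
func EstimateUSD(segmentMinutes, costPerMinuteUSD float64) float64 {
	return segmentMinutes * costPerMinuteUSD
}

// ShouldLog reports whether the estimate exceeds
// preflight_cost_log_threshold_usd.
func ShouldLog(estimateUSD, thresholdUSD float64) bool {
	return estimateUSD > thresholdUSD
}

func main() {
	const rate = 0.0043 // illustrative per-minute rate (assumption)
	for _, minutes := range []float64{0.85, 45} {
		est := EstimateUSD(minutes, rate)
		fmt.Printf("%.2f min: $%.4f log=%v\n", minutes, est, ShouldLog(est, 0.10))
	}
}
```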
Cost transparency on the envelope¶
Optional Provenance.Custom["asr.*"] cost metadata:
Provenance.Custom {
"asr.backend": "deepgram",
"asr.audio_minutes": "0.85",
"asr.cost_usd": "0.0036",
"asr.model": "nova-2",
"asr.mode": "streaming"
}
Controlled by asr.emit_cost_metadata: true|false (default true for
cloud, false for local). Flows into audit/v1 when loaded — "how much
did the Q3 board meeting transcription cost?" is one query away.
Fallback Chain¶
When a backend goes unhealthy (network error, auth failure, quota /
budget exceeded with on_budget_exceed: fallback, timeout), the
orchestrator routes the next segment to the next healthy backend in the
chain for that source kind.
asr:
fallback_chains:
online: [deepgram, assemblyai, whisper-cpp]
Semantics¶
- A failed backend is taken out of rotation for health_recovery_interval (default 5 min)
- Subsequent segments route to the next healthy backend in the chain
- After recovery interval, Vox probes the primary; if healthy, segments resume routing to it
- Fallback events emit asr.backend_fallback telemetry + audit event
- User-visible: structured log at WARN level the first time fallback fires per session
Emergency local fallback¶
If the entire chain fails, Vox uses an always-available emergency local
fallback (the smallest bundled whisper-cpp quant — tiny or base).
If that ALSO fails: envelope with empty Transcript, Intent.Kind =
unclassified, Provenance.Custom["asr.failed"] = "all_backends_exhausted".
The pipeline never blocks.
Output Details¶
Word timestamps¶
asr:
emit_word_timestamps: auto # auto | always | never
auto checks Capabilities.SupportsWordTimestamps; emits if true,
silently skips otherwise. always errors at startup when a configured
backend can't produce them.
Word slice goes on FinalTranscript.Words. Sinks that don't care
ignore it.
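The three settings reduce to a small startup-time check. A sketch, with `ResolveWordTimestamps` as a hypothetical helper name:

```go
package main

import (
	"errors"
	"fmt"
)

// ResolveWordTimestamps maps the emit_word_timestamps setting and the
// backend's SupportsWordTimestamps capability to an on/off decision.
func ResolveWordTimestamps(setting string, supported bool) (bool, error) {
	switch setting {
	case "auto":
		return supported, nil // emit if possible, silently skip otherwise
	case "always":
		if !supported {
			return false, errors.New("emit_word_timestamps: always, but backend lacks SupportsWordTimestamps")
		}
		return true, nil
	case "never":
		return false, nil
	default:
		return false, fmt.Errorf("unknown emit_word_timestamps value %q", setting)
	}
}

func main() {
	for _, s := range []string{"auto", "always", "never"} {
		on, err := ResolveWordTimestamps(s, false)
		fmt.Println(s, on, err)
	}
}
```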
Formatting¶
| Mode | Output style | Use case |
|---|---|---|
| formatted (default) | "Let's create a bd issue for the deck refresh." | Normal; downstream router classification works best with punctuation |
| raw | "lets create a bd issue for the deck refresh" | Low-bandwidth; some backends only support raw |
| verbose | "Let's [pause] create a bd issue for the deck refresh [pause] um yeah." | Research workflows that need filler / hesitation markers |
asr:
formatting: formatted # formatted | raw | verbose
number_formatting: auto # auto | spell-out | digits
Language¶
asr:
language:
mode: auto-with-hint # auto | auto-with-hint | locked
preferred: en-US # BCP-47; hint or lock
fallback_languages: [es, zh-CN] # also acceptable in auto modes
Code-switching (mixed languages within a segment): backend picks a dominant language; output is in that language's script with any switch words transcribed best-effort. Vox does NOT split segments by language.
The envelope's Language is the primary detected language. Multilingual
content: configure multiple instances scoped by source / session metadata.
Speaker label precedence¶
Speaker.Label can come from three places. Precedence:
- segment/v1 diarization (highest; sees full audio context)
- ASR backend diarization (refinement; secondary)
- Capture-side hint (default; coarse)
Refinement rule: if segment/v1 says "single speaker, unknown
identity" AND the ASR backend produces a more specific label (e.g.,
Speaker B), the ASR label wins. In general, the more specific
identification wins, except where it would break cross-segment label
stability, which takes priority.
Envelope's Speaker.Label is the final resolved value.
Provenance.Custom["asr.speaker_hint"] retains the ASR backend's raw
output for audit.
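One simple reading of the precedence list is "first known label wins". The sketch below assumes an unknown identity is represented as an empty string, which the contract does not specify, and it does not model cross-segment stability. `ResolveSpeaker` is a hypothetical helper.

```go
package main

import "fmt"

// ResolveSpeaker returns the first non-empty label in precedence order:
// segment/v1 diarization, then ASR backend diarization, then capture hint.
// Assumption: "" means the source has no specific identity for the speaker.
func ResolveSpeaker(segmentLabel, asrLabel, captureHint string) string {
	for _, label := range []string{segmentLabel, asrLabel, captureHint} {
		if label != "" {
			return label
		}
	}
	return ""
}

func main() {
	fmt.Println(ResolveSpeaker("Speaker A", "Speaker B", "mic-1"))
	fmt.Println(ResolveSpeaker("", "Speaker B", "mic-1"))
	fmt.Println(ResolveSpeaker("", "", "mic-1"))
}
```

Under this reading the refinement rule falls out naturally: when segment/v1 has no specific label, the ASR backend's label is the first non-empty candidate.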
Vocabulary boost¶
Domain terms (product names, jargon) are routinely mis-transcribed. Most backends support boost:
asr:
instances:
- name: engineering-meetings
backend: deepgram
vocabulary:
boost_words: [Kubernetes, gRPC, blackrim, vox, sageox, anthropic]
boost_weight: 1.5
custom_words: [oxledger, voxsink] # tell backend these ARE words
Backends without boost support warn once at Open() and ignore the block.
whisper-cli now honors vocabulary boost (v1.x, shipped). The whisper-cli
backend passes boost words to whisper-cli via the --prompt flag, which biases
the recognizer toward those tokens:
asr:
instances:
- name: engineering-local
backend: whisper-cli
vocabulary_boost_words: [Kubernetes, gRPC, blackrim, vox, sageox]
vocabulary_boost_max_chars: 500 # default; prompt is truncated at this byte length
Config map keys: vocabulary_boost_words (list of strings) and
vocabulary_boost_max_chars (int, default 500 — safe limit for whisper.cpp's
n_text_ctx/2 token budget).
When the prompt is passed, asr.vocab_applied = true is stamped in
FinalTranscript.Custom for audit traceability.
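Building the prompt from the boost list and truncating it at the byte limit might look like the following. The ", " separator is an assumption (the document does not say how words are joined), and `BuildPrompt` is a hypothetical helper. The truncation backs up to a rune boundary so a multi-byte character is never split.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// BuildPrompt joins boost words and truncates the result to at most
// maxBytes bytes without cutting a UTF-8 sequence in half.
func BuildPrompt(words []string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	prompt := strings.Join(words, ", ") // separator is an assumption
	if len(prompt) <= maxBytes {
		return prompt
	}
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(prompt[cut]) {
		cut-- // step back until the cut lands on the start of a rune
	}
	return prompt[:cut]
}

func main() {
	words := []string{"Kubernetes", "gRPC", "blackrim", "vox", "sageox"}
	fmt.Printf("%q\n", BuildPrompt(words, 500))
	fmt.Printf("%q\n", BuildPrompt(words, 14))
}
```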
Profanity filtering¶
OFF by default. Vox captures what was said.
asr:
profanity:
enabled: false # default off
action: mask # mask | drop
Reasoning: a transcription tool that silently censors is untrustworthy.
Enterprise compliance scenarios that demand redaction belong in
audit/v1 / report/v1 — the raw transcript stays unmodified.
Local Model Management¶
Local backends need model files (75 MB to 1.5 GB+ per model).
Storage layout¶
~/.vox/models/
whisper-cpp/
base.en.bin
large-v3-q5_0.bin
.checksums
faster-whisper/
large-v3/
model.bin
config.json
vocabulary.txt
.checksums
vosk/
vosk-model-en-us-0.22/
.checksums
CLI¶
vox model list # show available + installed
vox model download whisper-cpp:large-v3-q5_0
vox model verify whisper-cpp
vox model remove whisper-cpp:base.en
Auto-download¶
By default, first reference to an uninstalled model prompts:
Model whisper-cpp:large-v3-q5_0 (1.0 GB) is not installed.
Download from https://huggingface.co/ggerganov/whisper.cpp ? [y/N]
Configurable:
asr:
auto_download: prompt # prompt (default) | silent | never
Custom models¶
Users drop a model file into the storage layout and reference by path:
asr:
instances:
- name: my-finetuned
backend: faster-whisper
model_path: ~/.vox/models/faster-whisper/my-finetuned/
Verification¶
Every model has a .checksums file with SHA-256s. Vox verifies at install
and on first load each session. Mismatch ⇒ refuse to load + structured
error pointing at vox model verify.
Versioning¶
Model names include explicit version (large-v3-q5_0, not large). New
versions are new models; old versions stay usable until removed. No
silent upgrades.
Error Model¶
Typed errors, mirroring other Vox surfaces:
ASRError {
Kind ASRErrorKind
Backend string
Op string # "open" | "stream-open" | "transcribe" | etc.
Message string
Cause Error?
}
ASRErrorKind {
ErrInvalidConfig
ErrAuthFailed
ErrQuotaExceeded # rate-limit / quota / budget
ErrUnsupported # mode / language / feature not supported
ErrModelNotFound # local model file missing
ErrModelCorrupt # checksum mismatch
ErrTimeout
ErrBackendUnavailable # provider down, network unreachable
ErrTransient # retry may help
ErrPersistent # retry won't help
ErrInternal
}
The orchestrator handles failures via the fallback chain. ASR backends
NEVER block the pipeline — worst case is an empty-transcript envelope
with Provenance.Custom["asr.failed"] set.
Versioning and Stability¶
asr/v1 is the contract above. Once frozen:
- Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, SessionInfo, FinalTranscript, PartialTranscript, TranscribeOptions, or Stats; adding new ASRErrorKind values; adding new built-in backends; adding new model-management options.
- Breaking changes (require v2): changing the ASRBackend interface signature; changing the meaning of any existing field; changing mode semantics; changing speaker-precedence rules.
The core supports one major version (vN) of asr/ at a time, except during
migrations, when the outgoing and incoming versions overlap.
Reference Build Order¶
Per the project's build-order principle (validate test substrate first, then add complexity incrementally):
| Order | Backend | Why this order |
|---|---|---|
| 1 | whisper-cpp | First. Local, offline, single binary, no auth, no network. Unblocks end-to-end pipeline tests with deterministic input |
| 2 | faster-whisper | Same model family; validates the speed-tier alternative |
| 3 | deepgram | First cloud backend; validates BYOK + cost controls + streaming mode |
| 4 | openai-whisper-api | Validates shared-credential reuse with llm-openai sink |
| 5 | vosk | Validates the local streaming path |
| 6 | assemblyai | Diarization + word-timestamp validation |
| 7+ | tier-2 backends | Same contract, different wire protocols |
Build the orchestrator with whisper-cpp alone first; add backends one
at a time. The base interface and the partials channel stabilize before
any network-bound backend lands.
Project Principle: Opinionated Defaults, Every Default Configurable¶
This contract continues the principle from capture/v1, sink/v1, and
router/v1. Every behavior with a defensible default (mode per source
kind, formatting: formatted, emit_word_timestamps: auto,
profanity.enabled: false, auto_download: prompt, health_recovery_interval: 5m,
etc.) is exposed as a config knob. Defaults reflect a considered
recommendation for the typical voice-to-LLM use case; the knobs exist so
specialized workflows can tune them.