`asr/v1` — Automatic Speech Recognition Backends¶

Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (whisper-cpp) · Implementations: in-tree only until v1 is frozen.

The asr/v1 surface sits between segment/v1 (which produces speech segments — chunks of audio with detected speech boundaries) and router/v1 (which consumes partially-filled IntentEnvelopes). An ASR backend takes a speech segment and produces a final transcript: text, language, confidence, optional word-level alignment.

Vox supports both streaming (partial transcripts emitted as audio flows; final on segment close) and offline (single-shot transcription of a complete segment) modes. Backends declare which they support; the orchestrator picks based on source kind and configuration.

This document is the contract. Backends conforming to it can be loaded by any version of the Vox core that supports asr/v1.

Scope¶

asr/v1 covers:

The backend interface (streaming + offline modes)
Mode selection (per source kind, overridable per instance)
Input contract (audio frames from capture/v1, segment metadata from segment/v1)
Output contract (FinalTranscript → partially-filled IntentEnvelope)
Per-stream backend instances + per-source-kind routing
BYOK authentication + shared credential support
Cost controls (budgets, rate limits, pre-flight estimation)
Fallback chain for backend failure
Word timestamps, formatting, language detection
Speaker label precedence + refinement rules
Custom vocabulary / boost words
Profanity filtering posture
Local model management (download, storage, verification, versioning)
Error model + latency budgets
Versioning and stability rules

asr/v1 does not cover:

Audio capture (capture/v1)
Speech detection / segmentation (segment/v1)
Intent classification or sink routing (router/v1)
Sink delivery (sink/v1)
Speaker diarization at the segmentation level (segment/v1); ASR-backend diarization is optional refinement, not the primary path
Translation (separate future surface if needed)

Input — what the backend consumes¶

When segment/v1 detects a complete speech segment, it hands the ASR backend:

Audio buffer or stream of capture/v1 Frames belonging to the segment (PCM f32 or i16, the rate the backend negotiated)
SessionInfo with identity + source metadata:

SessionInfo {
  SessionID       string
  StreamID        string
  SegmentID       string                  # unique per segment within the stream
  SourceKind      SourceKind              # "self" | "in-person" | "online" | "file"
  PreferredLang   string?                 # BCP-47; backend may auto-detect if empty
  FallbackLangs   []string?               # additional acceptable languages
  SpeakerHint     string?                 # opaque label; refined per precedence rules below
  StartedAt       Timestamp
  ApproxEndedAt   Timestamp               # offline mode; streaming computes its own
}

Modes — streaming vs offline¶

Backends declare which modes they support via Capabilities:

Capabilities {
  # Modes
  SupportsStreaming         bool
  SupportsOffline           bool
  StreamingLatencyMS        uint32           # typical time-to-first-partial; informational
  OfflineRTF                float             # offline real-time factor (e.g., 0.3 ≈ 3.3× faster than realtime)

  # Output
  SupportedLanguages        []string          # BCP-47 tags
  SupportsLanguageDetection bool
  SupportsWordTimestamps    bool
  SupportsPunctuation       bool
  SupportsVerboseFormatting bool              # filler / pause markers
  SupportsDiarization       bool
  SupportsVocabularyBoost   bool
  SupportsProfanityFilter   bool
  SupportsCustomModels      bool              # user can drop in a model_path

  # Cost
  CostPerAudioMinuteUSD     float             # 0.0 for local; > 0 for cloud
  IsLocal                   bool

  # Limits
  MaxSegmentAudioMinutes    uint32            # hard cap; longer segments rejected
}

Mode selection¶

Source-kind defaults; overridable per-instance.

Source kind	Default mode	Reasoning
`self`	streaming	Dictation needs live UI feedback; user sees words as they speak
`in-person`	offline	Meetings don't need real-time UI; offline is cheaper and higher accuracy
`online`	offline	Call capture; user-facing transcription latency isn't a constraint
`file`	offline	Batch; never streaming

asr:
  by_source:
    self:        streaming
    in-person:   offline
    online:      offline
    file:        offline

  instances:
    - name: dictation-fast
      backend: whisper-cpp
      mode: streaming                  # override source-kind default
    - name: meeting-accurate
      backend: deepgram
      mode: offline

If a backend doesn't support the requested mode, Open() returns ErrUnsupported with a clear message naming the missing capability.
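
The defaults table and the ErrUnsupported rule can be sketched in Go as follows. This is a minimal illustration, not the orchestrator's actual code: `ResolveMode`, the trimmed `Capabilities` struct, and the sentinel `ErrUnsupported` variable are names assumed for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is the transcription mode an instance runs in.
type Mode string

const (
	Streaming Mode = "streaming"
	Offline   Mode = "offline"
)

// ErrUnsupported stands in for the contract's ErrUnsupported error kind.
var ErrUnsupported = errors.New("unsupported")

// Capabilities carries only the two mode flags this sketch needs.
type Capabilities struct {
	SupportsStreaming bool
	SupportsOffline   bool
}

// defaultModes mirrors the source-kind table above.
var defaultModes = map[string]Mode{
	"self":      Streaming,
	"in-person": Offline,
	"online":    Offline,
	"file":      Offline,
}

// ResolveMode applies the per-instance override (if any) on top of the
// source-kind default, then checks that the backend can serve the result.
func ResolveMode(sourceKind string, override Mode, caps Capabilities) (Mode, error) {
	mode, ok := defaultModes[sourceKind]
	if !ok {
		return "", fmt.Errorf("unknown source kind %q", sourceKind)
	}
	if override != "" {
		mode = override
	}
	switch {
	case mode == Streaming && !caps.SupportsStreaming:
		return "", fmt.Errorf("%w: backend lacks SupportsStreaming", ErrUnsupported)
	case mode == Offline && !caps.SupportsOffline:
		return "", fmt.Errorf("%w: backend lacks SupportsOffline", ErrUnsupported)
	}
	return mode, nil
}

func main() {
	offlineOnly := Capabilities{SupportsOffline: true}
	if _, err := ResolveMode("self", "", offlineOnly); err != nil {
		fmt.Println("open failed:", err)
	}
	mode, _ := ResolveMode("in-person", "", offlineOnly)
	fmt.Println("in-person ->", mode)
}
```

Wrapping the sentinel with `%w` keeps the missing capability in the message while still letting callers match on the error kind.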

Backend Interface¶

ASRBackend {
  # Identity
  Name()          -> string
  Capabilities()  -> Capabilities

  # Lifecycle
  Open(config)    -> Error
  Close()         -> Error

  # Streaming mode
  StreamOpen(ctx, sessionInfo)  -> StreamHandle | Error
  StreamFeed(handle, frames)    -> Error
  StreamPartials(handle)        -> <-chan PartialTranscript
  StreamClose(handle)           -> FinalTranscript | Error

  # Offline mode
  Transcribe(ctx, audio, options) -> FinalTranscript | Error

  # Diagnostics
  Stats()         -> Stats
  Health()        -> Health
}
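
One possible Go rendering of this interface is shown below, with a hypothetical offline-only `stubBackend` to show what conformance looks like. The wire types are trimmed to a few fields, `X | Error` returns are mapped to `(X, error)`, and `Stats()` / `Health()` are omitted because their types are not defined in this document. All of these are assumptions of the sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Trimmed stand-ins for the wire types; only the fields this sketch touches.
type (
	Capabilities      struct{ SupportsStreaming, SupportsOffline bool }
	SessionInfo       struct{ SessionID, StreamID, SegmentID string }
	Frame             struct{ Samples []float32 }
	PartialTranscript struct{ Text string }
	FinalTranscript   struct{ Text, Language string }
	TranscribeOptions struct{ Language string }
	StreamHandle      uint64
)

// ASRBackend transliterates the contract into Go method signatures.
type ASRBackend interface {
	Name() string
	Capabilities() Capabilities
	Open(config map[string]any) error
	Close() error
	StreamOpen(ctx context.Context, info SessionInfo) (StreamHandle, error)
	StreamFeed(h StreamHandle, frames []Frame) error
	StreamPartials(h StreamHandle) <-chan PartialTranscript
	StreamClose(h StreamHandle) (FinalTranscript, error)
	Transcribe(ctx context.Context, audio []Frame, opts TranscribeOptions) (FinalTranscript, error)
}

var errNoStreaming = errors.New("stub: streaming not supported")

// stubBackend is a hypothetical offline-only backend for pipeline tests.
type stubBackend struct{}

func (stubBackend) Name() string               { return "stub" }
func (stubBackend) Capabilities() Capabilities { return Capabilities{SupportsOffline: true} }
func (stubBackend) Open(map[string]any) error  { return nil }
func (stubBackend) Close() error               { return nil }
func (stubBackend) StreamOpen(context.Context, SessionInfo) (StreamHandle, error) {
	return 0, errNoStreaming
}
func (stubBackend) StreamFeed(StreamHandle, []Frame) error { return errNoStreaming }
func (stubBackend) StreamPartials(StreamHandle) <-chan PartialTranscript {
	ch := make(chan PartialTranscript)
	close(ch) // no partials will ever arrive
	return ch
}
func (stubBackend) StreamClose(StreamHandle) (FinalTranscript, error) {
	return FinalTranscript{}, errNoStreaming
}
func (stubBackend) Transcribe(ctx context.Context, audio []Frame, opts TranscribeOptions) (FinalTranscript, error) {
	if err := ctx.Err(); err != nil {
		return FinalTranscript{}, err // honor cancellation before doing work
	}
	return FinalTranscript{Text: fmt.Sprintf("<%d frames>", len(audio)), Language: opts.Language}, nil
}

// Compile-time check that the stub satisfies the interface.
var _ ASRBackend = stubBackend{}

// transcribeOnce runs a single offline job against any conforming backend.
func transcribeOnce(b ASRBackend, frames int) (FinalTranscript, error) {
	return b.Transcribe(context.Background(), make([]Frame, frames), TranscribeOptions{Language: "en-US"})
}

func main() {
	ft, err := transcribeOnce(stubBackend{}, 3)
	fmt.Println(ft.Text, ft.Language, err)
}
```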

Streaming-mode wire types¶

PartialTranscript {
  Text         string                    # current accumulated transcript
  IsFinal      bool                      # true on the final assembly
  StableUntil  uint32                    # char index of stable prefix; everything before is committed
  Confidence   float
  StartedAt    Timestamp                 # wall-clock of segment start
  CapturedAt   Timestamp                 # wall-clock when this partial was emitted
}

FinalTranscript {
  Text         string
  Language     string                    # BCP-47, detected or echo-configured
  Confidence   float
  StartedAt    Timestamp
  EndedAt      Timestamp
  Words        []Word?                   # optional; populated if SupportsWordTimestamps
  SpeakerHint  string?                   # backend-assigned diarization label (if any)
  Custom       map<string, any>          # backend-specific extras
}

Word {
  Text         string
  StartMS      uint32                    # offset from segment start in milliseconds
  EndMS        uint32
  Confidence   float
}
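
A UI consumer would typically use StableUntil to render the committed prefix differently from the still-revisable tail. The sketch below assumes StableUntil counts Unicode code points rather than bytes. The contract says only "char index", so that interpretation and the `SplitStable` helper are assumptions.

```go
package main

import "fmt"

// PartialTranscript is trimmed to the two fields used here.
type PartialTranscript struct {
	Text        string
	StableUntil uint32
}

// SplitStable returns the committed prefix and the tentative tail.
// Assumption: StableUntil is a rune index; an out-of-range value is clamped.
func SplitStable(p PartialTranscript) (committed, tentative string) {
	runes := []rune(p.Text)
	n := int(p.StableUntil)
	if n > len(runes) {
		n = len(runes)
	}
	return string(runes[:n]), string(runes[n:])
}

func main() {
	p := PartialTranscript{Text: "let's create a bead", StableUntil: 13}
	committed, tentative := SplitStable(p)
	fmt.Printf("committed=%q tentative=%q\n", committed, tentative)
}
```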

Offline-mode wire types¶

TranscribeOptions {
  Language          string?               # preferred / locked
  FallbackLangs     []string?
  EmitWordTimestamps bool
  Formatting        string                # "formatted" | "raw" | "verbose"
  NumberFormatting  string                # "auto" | "spell-out" | "digits"
  VocabularyBoost   *VocabularyBoost
  ProfanityFilter   *ProfanityFilter
}

VocabularyBoost {
  Words            []string
  Weight           float
  CustomWords      []string                # tell backend these ARE valid words
}

ProfanityFilter {
  Enabled  bool
  Action   string                          # "mask" | "drop"
}

Lifecycle state machines¶

Streaming:

                StreamOpen()             StreamFeed()×N
   [closed]  ----------------->  [open]  ---------------->  [open]
                                   |
                              StreamClose()
                                   |
                                   v
                              [final transcript returned]

StreamFeed MUST be safe to call concurrently with StreamPartials draining. StreamClose blocks until the final transcript is assembled.
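
The concurrency requirement is easiest to see with a toy stream. In this sketch, `toyStream` is a hypothetical in-memory stand-in (words instead of audio frames) whose `Feed` publishes on an unbuffered channel, so feeding makes progress only while another goroutine drains partials.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type PartialTranscript struct{ Text string }

// toyStream is a hypothetical in-memory stream: every Feed emits one partial.
type toyStream struct {
	mu       sync.Mutex
	words    []string
	partials chan PartialTranscript
}

func newToyStream() *toyStream {
	return &toyStream{partials: make(chan PartialTranscript)} // unbuffered on purpose
}

// Feed appends a "frame" (a word here) and publishes the accumulated text.
// It blocks until the partial is consumed, so a drainer must be running.
func (s *toyStream) Feed(word string) {
	s.mu.Lock()
	s.words = append(s.words, word)
	text := strings.Join(s.words, " ")
	s.mu.Unlock()
	s.partials <- PartialTranscript{Text: text}
}

func (s *toyStream) Partials() <-chan PartialTranscript { return s.partials }

// Close ends the partial stream and returns the final text.
func (s *toyStream) Close() string {
	close(s.partials)
	s.mu.Lock()
	defer s.mu.Unlock()
	return strings.Join(s.words, " ")
}

// runSession feeds on the calling goroutine while a second goroutine drains.
func runSession(words []string) (partials int, final string) {
	s := newToyStream()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for range s.Partials() {
			partials++
		}
	}()
	for _, w := range words {
		s.Feed(w)
	}
	final = s.Close()
	wg.Wait() // drainer has observed the close; partials is now safe to read
	return partials, final
}

func main() {
	n, final := runSession([]string{"let's", "create", "a", "bead"})
	fmt.Println(n, final)
}
```

A real backend would more likely buffer or coalesce partials than block the feeder. The unbuffered channel is used here only to make the interaction visible.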

Offline:

                  Transcribe()
   [open]  -------------------->  [final transcript returned]

Single-shot. No streaming state to manage.

Concurrency¶

Open / Close are called from a single control thread per backend instance.
Transcribe is safe to call concurrently from multiple goroutines per instance — the orchestrator may issue multiple offline jobs in parallel (subject to rate_limit_per_minute config).
Streaming sessions are independent — multiple StreamHandles may be open concurrently against the same backend instance.

Partial transcripts — internal to ASR¶

Partial transcripts do NOT become envelopes. Only the final transcript becomes a partially-filled envelope that flows to router/v1.

Reasoning:

Partials are revisions: text that may change as more audio arrives. Routing and sink delivery are committed work; you can't un-route a partial later proven wrong.
Routing every partial would multiply downstream work for no real gain (5–20 partials per envelope, most replaced by the next).
The router and sink contracts are designed for committed envelopes, not provisional ones.

The orchestrator exposes a partials subscriber channel for callers who want them (UI components, debug surfaces, live-captioning sinks):

orchestrator.SubscribePartials(uiSubscriber)

This decouples the ASR-internal stream from the envelope pipeline.

Multiple Backend Instances¶

A single asr/v1 deployment can load multiple backends simultaneously, each scoped to specific source kinds. Useful for "fast local ASR for my voice, accurate cloud ASR for meetings."

asr:
  instances:
    - name: fast-local
      backend: whisper-cpp
      mode: streaming
      model: large-v3-q5_0
      handles_source_kinds: [self]

    - name: accurate-cloud
      backend: deepgram
      mode: offline
      handles_source_kinds: [in-person, online]
      auth:
        method: keychain
        credential_name: deepgram-api-key

  fallback_chains:
    self:        [fast-local, whisper-cpp-fallback]
    in-person:   [accurate-cloud, faster-whisper-local]
    online:      [accurate-cloud, faster-whisper-local]
    file:        [faster-whisper-local]

If multiple instances claim the same source kind, the FIRST in declared order handles it. If no instance claims a source kind, the segment is dropped with a structured warning.
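
The first-claim-wins rule amounts to a linear scan in declared order. A minimal sketch, with `Instance` and `SelectInstance` as illustrative names:

```go
package main

import "fmt"

// Instance is trimmed to the fields that matter for routing.
type Instance struct {
	Name               string
	HandlesSourceKinds []string
}

// SelectInstance returns the first instance, in declared order, that claims
// the source kind. ok is false when no instance claims it, in which case the
// caller drops the segment with a structured warning.
func SelectInstance(instances []Instance, sourceKind string) (inst Instance, ok bool) {
	for _, in := range instances {
		for _, k := range in.HandlesSourceKinds {
			if k == sourceKind {
				return in, true
			}
		}
	}
	return Instance{}, false
}

func main() {
	instances := []Instance{
		{Name: "fast-local", HandlesSourceKinds: []string{"self"}},
		{Name: "accurate-cloud", HandlesSourceKinds: []string{"in-person", "online"}},
	}
	for _, kind := range []string{"online", "file"} {
		inst, ok := SelectInstance(instances, kind)
		fmt.Printf("%s -> %q claimed=%v\n", kind, inst.Name, ok)
	}
}
```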

Latency Budgets¶

Default budgets:

Mode	Budget	On overrun
Streaming partial	< 500 ms time-to-first	Mark backend degraded; `asr.partial_latency_exceeded` event
Streaming final	< 1.5× segment duration	Mark backend degraded; fall back; `asr.final_latency_exceeded` event
Offline transcribe	< 2× segment duration	Same: degrade + fallback
Hard cutoff	30 s for any segment	Abort; emit envelope with empty `Transcript` + `Provenance.Custom["asr.failed"] = "timeout"`

All configurable per instance. The pipeline never blocks waiting for ASR.
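
The table distinguishes two thresholds: a soft budget proportional to segment duration (degrade and fall back) and an absolute hard cutoff (abort). The sketch below classifies an elapsed time against both. The `Budget` struct and `Classify` method are illustrative names, and treating "elapsed equals budget" as an overrun is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

type Verdict string

const (
	WithinBudget Verdict = "ok"
	Degraded     Verdict = "degraded"
	Aborted      Verdict = "aborted"
)

// Budget holds the per-instance knobs; field names are illustrative.
type Budget struct {
	Factor     float64       // 1.5 for streaming final, 2.0 for offline
	HardCutoff time.Duration // 30 s by default
}

// Classify compares elapsed processing time with the soft budget
// (Factor × segment duration) and with the hard cutoff.
func (b Budget) Classify(elapsed, segment time.Duration) Verdict {
	limit := time.Duration(float64(segment) * b.Factor)
	switch {
	case elapsed >= b.HardCutoff:
		return Aborted
	case elapsed >= limit:
		return Degraded
	default:
		return WithinBudget
	}
}

func main() {
	offline := Budget{Factor: 2.0, HardCutoff: 30 * time.Second}
	segment := 10 * time.Second
	for _, elapsed := range []time.Duration{5 * time.Second, 25 * time.Second, 31 * time.Second} {
		fmt.Printf("%v -> %s\n", elapsed, offline.Classify(elapsed, segment))
	}
}
```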

Tier-1 Backends (ship with v1)¶

Backend	Local/Cloud	Modes	Why
`whisper-cpp`	Local	Offline (streaming patches in some forks)	OpenAI Whisper ported to C/C++; GGML quantized models; works on every tier-1 OS; the local default
`faster-whisper`	Local	Offline + experimental streaming	CTranslate2-based; 4–8× faster than whisper-cpp on CPU/GPU; same models
`vosk`	Local	Streaming + offline	Smaller models; lower latency; lower accuracy. The "I want streaming locally" path
`openai-whisper-api`	Cloud	Offline	OpenAI's hosted Whisper; shares credential with `llm-openai` sink
`deepgram`	Cloud	Streaming + offline	Streaming-first; live captioning use case
`assemblyai`	Cloud	Streaming + offline	Strong diarization + word timestamps

Tier-2 backends (community-contributable, same contract)¶

ollama-whisper, google-speech, azure-speech, aws-transcribe, gladia, replicate-whisper.

BYOK Authentication¶

Same credential precedence as LLM / S3 / email sinks:

Explicit env var (e.g., DEEPGRAM_API_KEY, ASSEMBLYAI_API_KEY)
OS keychain (default)
Config file (deprecated; warns)
External secrets manager (future secrets/v1)
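
The precedence list can be read as "walk the sources in order, take the first hit". The sketch below injects each source as a lookup function because the keychain and config-file readers are not specified here. `Source`, `Resolve`, and the `<BACKEND>_API_KEY` environment naming (inferred from the two examples above) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// Source is one place a credential may live.
type Source struct {
	Name       string
	Deprecated bool
	Lookup     func(backend string) (string, bool)
}

var ErrNoCredential = errors.New("no credential found")

// Resolve walks sources in precedence order and returns the first hit,
// warning on stderr when the hit comes from a deprecated source.
func Resolve(backend string, sources []Source) (secret, from string, err error) {
	for _, s := range sources {
		if v, ok := s.Lookup(backend); ok && v != "" {
			if s.Deprecated {
				fmt.Fprintf(os.Stderr, "warning: %s credentials are deprecated\n", s.Name)
			}
			return v, s.Name, nil
		}
	}
	return "", "", fmt.Errorf("%s: %w", backend, ErrNoCredential)
}

// envLookup assumes the <BACKEND>_API_KEY naming seen in the examples above.
func envLookup(backend string) (string, bool) {
	name := strings.ToUpper(strings.ReplaceAll(backend, "-", "_")) + "_API_KEY"
	return os.LookupEnv(name)
}

func main() {
	keychain := map[string]string{"deepgram": "kc-secret"} // stand-in for the OS keychain
	sources := []Source{
		{Name: "env", Lookup: envLookup},
		{Name: "keychain", Lookup: func(b string) (string, bool) { v, ok := keychain[b]; return v, ok }},
		{Name: "config-file", Deprecated: true, Lookup: func(string) (string, bool) { return "", false }},
	}
	_, from, err := Resolve("deepgram", sources)
	fmt.Println(from, err)
}
```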

Onboarding: vox auth set <backend> (mirrors LLM sink pattern).

Shared credentials¶

Backends that share auth with other Vox surfaces declare it:

asr:
  instances:
    - name: openai-asr
      backend: openai-whisper-api
      auth:
        shares_credential_with: llm-openai      # reuse existing keychain entry

No double prompt at onboarding; the credential is set once via the relevant LLM sink and consumed here too.

Cost Controls (Cloud Backends)¶

Audio-minute–metered cloud ASR can rack up real money fast. First-class guardrails:

asr:
  instances:
    - name: accurate-cloud
      backend: deepgram
      cost_controls:
        budget_daily_usd: 5.00
        budget_monthly_usd: 100.00
        on_budget_warn_pct: 80                # warn at 80% of budget
        on_budget_exceed: halt                # halt | warn | fallback
        fallback_backend: whisper-cpp          # required if on_budget_exceed: fallback
        rate_limit_per_minute: 60              # max segments per minute
        max_segment_audio_minutes: 30          # reject implausibly long segments

Behavior¶

Setting	What it does
`budget_daily_usd` / `budget_monthly_usd`	Hard caps. Vox tracks consumed audio-minutes × `CostPerAudioMinuteUSD`
`on_budget_warn_pct`	Emit `asr.budget_warning` event + structured log + audit event at this threshold
`on_budget_exceed`	`halt` = stop transcribing; `warn` = continue + log; `fallback` = route to `fallback_backend`
`rate_limit_per_minute`	Soft throttle. Segments queue if exceeded
`max_segment_audio_minutes`	Reject implausibly long segments (silence-VAD bug usually)

Spend tracking is persisted at ~/.vox/state/cost-tracker.db (SQLite). At session start, Vox displays current spend: "Today's ASR spend: $0.47 of $5.00 budget".

For local backends, cost_controls is ignored except for rate_limit_per_minute (CPU budgeting).
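
The interaction between the warn threshold and `on_budget_exceed` can be sketched as a single decision function. This version checks the spend a segment would bring the day to before transcribing. Whether Vox checks projected or already-consumed spend is not stated above, so that choice, along with the `Decide` and `Decision` names, is an assumption. The monthly budget would follow the same shape.

```go
package main

import "fmt"

// CostControls is trimmed to the daily budget and its two policies.
type CostControls struct {
	BudgetDailyUSD float64
	WarnPct        float64 // on_budget_warn_pct
	OnExceed       string  // "halt" | "warn" | "fallback"
}

type Decision string

const (
	Proceed     Decision = "proceed"
	ProceedWarn Decision = "proceed+warn"
	Halt        Decision = "halt"
	Fallback    Decision = "fallback"
)

// Decide evaluates the projected daily spend for one more segment.
func Decide(cc CostControls, spentUSD, segmentMinutes, costPerMinuteUSD float64) Decision {
	projected := spentUSD + segmentMinutes*costPerMinuteUSD
	switch {
	case projected > cc.BudgetDailyUSD:
		switch cc.OnExceed {
		case "fallback":
			return Fallback
		case "warn":
			return ProceedWarn // continue + log
		default:
			return Halt
		}
	case projected >= cc.BudgetDailyUSD*cc.WarnPct/100:
		return ProceedWarn // asr.budget_warning threshold reached
	default:
		return Proceed
	}
}

func main() {
	cc := CostControls{BudgetDailyUSD: 5.00, WarnPct: 80, OnExceed: "fallback"}
	const rate = 0.0043 // hypothetical per-minute price, not a quoted rate
	for _, spent := range []float64{0.47, 4.10, 4.999} {
		fmt.Printf("spent=$%.3f -> %s\n", spent, Decide(cc, spent, 1.0, rate))
	}
}
```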

Pre-flight cost estimation¶

asr:
  preflight_cost_estimate: true                # default true for paid backends
  preflight_cost_log_threshold_usd: 0.10

estimated_cost = segment_duration_minutes × CostPerAudioMinuteUSD (from the backend's Capabilities). Cheap to compute. Logged at INFO when the estimate exceeds preflight_cost_log_threshold_usd; useful for spotting runaway long segments before they bill.
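
As a worked illustration of the estimate, the sketch below multiplies segment duration in minutes by a per-minute rate and compares the result with the log threshold. The function names and the 0.0043 rate are placeholders for the example, not values taken from any provider's price list.

```go
package main

import (
	"fmt"
	"time"
)

// EstimateCostUSD is segment duration in minutes × cost per audio minute.
func EstimateCostUSD(segment time.Duration, costPerAudioMinuteUSD float64) float64 {
	return segment.Minutes() * costPerAudioMinuteUSD
}

// ShouldLog reports whether the estimate exceeds the log threshold.
func ShouldLog(estimateUSD, thresholdUSD float64) bool {
	return estimateUSD > thresholdUSD
}

func main() {
	const rate = 0.0043     // hypothetical per-minute price
	const threshold = 0.10  // preflight_cost_log_threshold_usd
	for _, d := range []time.Duration{51 * time.Second, 30 * time.Minute} {
		est := EstimateCostUSD(d, rate)
		fmt.Printf("%v -> $%.4f log=%v\n", d, est, ShouldLog(est, threshold))
	}
}
```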

Cost transparency on the envelope¶

Optional Provenance.Custom["asr.*"] cost metadata:

Provenance.Custom {
  "asr.backend":       "deepgram",
  "asr.audio_minutes": "0.85",
  "asr.cost_usd":      "0.0036",
  "asr.model":         "nova-2",
  "asr.mode":          "streaming"
}

Controlled by asr.emit_cost_metadata: true|false (default true for cloud, false for local). Flows into audit/v1 when loaded — "how much did the Q3 board meeting transcription cost?" is one query away.

Fallback Chain¶

When a backend goes unhealthy (network error, auth failure, quota / budget exceeded with on_budget_exceed: fallback, timeout), the orchestrator routes the next segment to the next healthy backend in the chain for that source kind.

asr:
  fallback_chains:
    online: [deepgram, assemblyai, whisper-cpp]

Semantics¶

A failed backend is taken out of rotation for health_recovery_interval (default 5 min)
Subsequent segments route to the next healthy backend in the chain
After recovery interval, Vox probes the primary; if healthy, segments resume routing to it
Fallback events emit asr.backend_fallback telemetry + audit event
User-visible: structured log at WARN level the first time fallback fires per session
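
These semantics can be modeled with a per-backend "failed at" timestamp. In the sketch below a backend becomes eligible again once health_recovery_interval has elapsed, which corresponds to the point where Vox would probe it. The `Chain` type, its methods, and the fixed `epoch` reference time are illustrative assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is a fixed reference time so the demo is reproducible.
var epoch = time.Unix(0, 0)

// Chain tracks when each backend in a fallback chain last failed.
type Chain struct {
	Order    []string
	Recovery time.Duration // health_recovery_interval
	failedAt map[string]time.Time
}

func NewChain(recovery time.Duration, order ...string) *Chain {
	return &Chain{Order: order, Recovery: recovery, failedAt: map[string]time.Time{}}
}

// MarkFailed takes a backend out of rotation starting at now.
func (c *Chain) MarkFailed(name string, now time.Time) { c.failedAt[name] = now }

// MarkHealthy records a successful probe or transcription.
func (c *Chain) MarkHealthy(name string) { delete(c.failedAt, name) }

// Pick returns the first backend that is healthy or whose recovery interval
// has elapsed (making it due for a probe). ok is false when the whole chain
// is out of rotation, which is where the emergency fallback would take over.
func (c *Chain) Pick(now time.Time) (name string, ok bool) {
	for _, b := range c.Order {
		t, failed := c.failedAt[b]
		if !failed || now.Sub(t) >= c.Recovery {
			return b, true
		}
	}
	return "", false
}

func main() {
	c := NewChain(5*time.Minute, "deepgram", "assemblyai", "whisper-cpp")
	c.MarkFailed("deepgram", epoch)
	early, _ := c.Pick(epoch.Add(1 * time.Minute))
	late, _ := c.Pick(epoch.Add(6 * time.Minute))
	fmt.Println(early, late)
}
```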

Emergency local fallback¶

If the entire chain fails, Vox uses an always-available emergency local fallback (the smallest bundled whisper-cpp quant: tiny or base). If that also fails, Vox emits an envelope with an empty Transcript, Intent.Kind = unclassified, and Provenance.Custom["asr.failed"] = "all_backends_exhausted".

The pipeline never blocks.

Output Details¶

Word timestamps¶

asr:
  emit_word_timestamps: auto                   # auto | always | never

auto checks Capabilities.SupportsWordTimestamps; emits if true, silently skips otherwise. always errors at startup when a configured backend can't produce them.

Word slice goes on FinalTranscript.Words. Sinks that don't care ignore it.

Formatting¶

Mode	Output style	Use case
`formatted` (default)	"Let's create a bd issue for the deck refresh."	Normal; downstream router classification works best with punctuation
`raw`	"lets create a bd issue for the deck refresh"	Low-bandwidth; some backends only support raw
`verbose`	"Let's [pause] create a bd issue for the deck refresh [pause] um yeah."	Research workflows that need filler / hesitation markers

asr:
  formatting: formatted                        # formatted | raw | verbose
  number_formatting: auto                      # auto | spell-out | digits

Language¶

asr:
  language:
    mode: auto-with-hint                       # auto | auto-with-hint | locked
    preferred: en-US                           # BCP-47; hint or lock
    fallback_languages: [es, zh-CN]            # also acceptable in auto modes

Code-switching (mixed languages within a segment): backend picks a dominant language; output is in that language's script with any switch words transcribed best-effort. Vox does NOT split segments by language.

The envelope's Language is the primary detected language. For multilingual content, configure multiple instances scoped by source kind or session metadata.

Speaker label precedence¶

Speaker.Label can come from three places. Precedence:

segment/v1 diarization (highest — sees full audio context)
ASR backend diarization (refinement — secondary)
Capture-side hint (default — coarse)

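Refinement rule: if segment/v1 reports "single speaker, unknown identity" AND the ASR backend produces a more specific label (e.g., Speaker B), the ASR label wins. A more specific identification wins only when segment/v1 has no identity of its own; a label segment/v1 has already assigned is kept, because cross-segment stability matters more than specificity.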

Envelope's Speaker.Label is the final resolved value. Provenance.Custom["asr.speaker_hint"] retains the ASR backend's raw output for audit.
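
One way to read the precedence and refinement rules together is the small resolver below. It is an interpretation, not the normative algorithm: the `SegmentIsGeneric` flag is a hypothetical way of representing "single speaker, unknown identity", and the choice to let a generic segment label still outrank the capture hint is an assumption.

```go
package main

import "fmt"

// Labels gathers the three candidate sources for Speaker.Label.
type Labels struct {
	Segment          string // from segment/v1 diarization
	SegmentIsGeneric bool   // hypothetical flag: "single speaker, unknown identity"
	ASR              string // backend diarization label, if any
	Capture          string // capture-side hint
}

// ResolveSpeaker picks the final label for the envelope.
func ResolveSpeaker(l Labels) string {
	switch {
	case l.Segment != "" && !l.SegmentIsGeneric:
		return l.Segment // highest precedence; keeps labels stable across segments
	case l.ASR != "":
		return l.ASR // refinement: more specific than a generic or missing label
	case l.Segment != "":
		return l.Segment // generic label still outranks the coarse capture hint
	default:
		return l.Capture
	}
}

func main() {
	fmt.Println(ResolveSpeaker(Labels{Segment: "Speaker A", ASR: "Speaker B", Capture: "mic-1"}))
	fmt.Println(ResolveSpeaker(Labels{Segment: "unknown", SegmentIsGeneric: true, ASR: "Speaker B"}))
	fmt.Println(ResolveSpeaker(Labels{Capture: "mic-1"}))
}
```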

Vocabulary boost¶

Domain terms (product names, jargon) are routinely mis-transcribed. Most backends support boost:

asr:
  instances:
    - name: engineering-meetings
      backend: deepgram
      vocabulary:
        boost_words: [Kubernetes, gRPC, blackrim, vox, sageox, anthropic]
        boost_weight: 1.5
        custom_words: [oxledger, voxsink]      # tell backend these ARE words

Backends without boost support warn once at Open() and ignore the block.

The whisper-cli backend honors vocabulary boost (v1.x, shipped): it passes the boost words to the whisper-cli binary via the --prompt flag, which biases the recognizer toward those tokens:

asr:
  instances:
    - name: engineering-local
      backend: whisper-cli
      vocabulary_boost_words: [Kubernetes, gRPC, blackrim, vox, sageox]
      vocabulary_boost_max_chars: 500          # default; prompt is truncated at this byte length

Config map keys: vocabulary_boost_words (list of strings) and vocabulary_boost_max_chars (int, default 500). Despite the name, the limit is applied to the prompt's byte length; 500 is a safe limit for whisper.cpp's n_text_ctx/2 token budget.

When the prompt is passed, asr.vocab_applied = true is stamped in FinalTranscript.Custom for audit traceability.
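
Prompt construction might look like the sketch below. How the backend actually joins and truncates words is not specified here, so the comma-separated format, the whole-word truncation, and the `BuildPrompt` / `Args` helpers are assumptions. Only the `--prompt` flag and the byte-length cap come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// BuildPrompt joins boost words into one prompt string of at most maxBytes.
// Assumptions: words are comma-separated, and truncation drops whole trailing
// words rather than cutting one in half.
func BuildPrompt(words []string, maxBytes int) string {
	var b strings.Builder
	for _, w := range words {
		sep := ""
		if b.Len() > 0 {
			sep = ", "
		}
		if b.Len()+len(sep)+len(w) > maxBytes {
			break
		}
		b.WriteString(sep)
		b.WriteString(w)
	}
	return b.String()
}

// Args returns the extra whisper-cli arguments, or nil when there is
// nothing to boost.
func Args(words []string, maxBytes int) []string {
	prompt := BuildPrompt(words, maxBytes)
	if prompt == "" {
		return nil
	}
	return []string{"--prompt", prompt}
}

func main() {
	words := []string{"Kubernetes", "gRPC", "blackrim", "vox", "sageox"}
	fmt.Printf("%q\n", Args(words, 500))
	fmt.Printf("%q\n", Args(words, 16))
}
```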

Profanity filtering¶

OFF by default. Vox captures what was said.

asr:
  profanity:
    enabled: false                             # default off
    action: mask                               # mask | drop

Reasoning: a transcription tool that silently censors is untrustworthy. Enterprise compliance scenarios that demand redaction belong in audit/v1 / report/v1 — the raw transcript stays unmodified.

Local Model Management¶

Local backends need model files (75 MB to 1.5 GB+ per model).

Storage layout¶

~/.vox/models/
  whisper-cpp/
    base.en.bin
    large-v3-q5_0.bin
    .checksums
  faster-whisper/
    large-v3/
      model.bin
      config.json
      vocabulary.txt
    .checksums
  vosk/
    vosk-model-en-us-0.22/
    .checksums

CLI¶

vox model list                                 # show available + installed
vox model download whisper-cpp:large-v3-q5_0
vox model verify whisper-cpp
vox model remove whisper-cpp:base.en

Auto-download¶

By default, first reference to an uninstalled model prompts:

Model whisper-cpp:large-v3-q5_0 (1.0 GB) is not installed.
Download from https://huggingface.co/ggerganov/whisper.cpp ? [y/N]

Configurable:

asr:
  auto_download: prompt                        # prompt (default) | silent | never

Custom models¶

Users drop a model file into the storage layout and reference by path:

asr:
  instances:
    - name: my-finetuned
      backend: faster-whisper
      model_path: ~/.vox/models/faster-whisper/my-finetuned/

Verification¶

Every model has a .checksums file with SHA-256s. Vox verifies at install and on first load each session. Mismatch ⇒ refuse to load + structured error pointing at vox model verify.
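
A verification pass could be structured as below. The layout of the .checksums file is not specified in this document; the sketch assumes `sha256sum`-style lines (`<hex digest>  <file name>`), and `ParseChecksums`, `HashReader`, and `Check` are illustrative names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// ParseChecksums reads "<hex digest>  <file name>" lines into a map keyed
// by file name. Lines that do not have exactly two fields are skipped.
func ParseChecksums(text string) map[string]string {
	sums := make(map[string]string)
	for _, line := range strings.Split(text, "\n") {
		if f := strings.Fields(line); len(f) == 2 {
			sums[f[1]] = strings.ToLower(f[0])
		}
	}
	return sums
}

// HashReader streams r through SHA-256 so that multi-gigabyte model files
// never have to be held in memory.
func HashReader(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// Check compares a computed digest with the recorded one.
func Check(name, gotHex string, sums map[string]string) error {
	want, ok := sums[name]
	switch {
	case !ok:
		return fmt.Errorf("%s: no checksum recorded", name)
	case want != gotHex:
		return fmt.Errorf("%s: checksum mismatch; run `vox model verify`", name)
	}
	return nil
}

func main() {
	// SHA-256 of the three bytes "abc", standing in for a model file.
	sums := ParseChecksums("ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad  demo.bin\n")
	got, err := HashReader(strings.NewReader("abc"))
	if err != nil {
		panic(err)
	}
	fmt.Println("verify:", Check("demo.bin", got, sums))
}
```

In a real backend the reader would be the opened model file, and a mismatch would map to ErrModelCorrupt.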

Versioning¶

Model names include explicit version (large-v3-q5_0, not large). New versions are new models; old versions stay usable until removed. No silent upgrades.

Error Model¶

Typed errors, mirroring other Vox surfaces:

ASRError {
  Kind     ASRErrorKind
  Backend  string
  Op       string                              # "open" | "stream-open" | "transcribe" | etc.
  Message  string
  Cause    Error?
}

ASRErrorKind {
  ErrInvalidConfig
  ErrAuthFailed
  ErrQuotaExceeded                             # rate-limit / quota / budget
  ErrUnsupported                               # mode / language / feature not supported
  ErrModelNotFound                             # local model file missing
  ErrModelCorrupt                              # checksum mismatch
  ErrTimeout
  ErrBackendUnavailable                        # provider down, network unreachable
  ErrTransient                                 # retry may help
  ErrPersistent                                # retry won't help
  ErrInternal
}

The orchestrator handles failures via the fallback chain. ASR backends NEVER block the pipeline — worst case is an empty-transcript envelope with Provenance.Custom["asr.failed"] set.
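
In Go, the error model maps naturally onto a struct that implements `error` and `Unwrap`, so callers can recover the kind from a wrapped chain with `errors.As`. The sketch below is one possible rendering. `KindOf` and `ShouldRetry` are hypothetical helpers, and treating only ErrTransient as retryable is a conservative assumption based on its "retry may help" comment.

```go
package main

import (
	"errors"
	"fmt"
)

type ASRErrorKind int

const (
	ErrInvalidConfig ASRErrorKind = iota
	ErrAuthFailed
	ErrQuotaExceeded
	ErrUnsupported
	ErrModelNotFound
	ErrModelCorrupt
	ErrTimeout
	ErrBackendUnavailable
	ErrTransient
	ErrPersistent
	ErrInternal
)

// ASRError mirrors the typed error described above.
type ASRError struct {
	Kind    ASRErrorKind
	Backend string
	Op      string
	Message string
	Cause   error
}

func (e *ASRError) Error() string {
	return fmt.Sprintf("asr %s [%s]: %s", e.Backend, e.Op, e.Message)
}

// Unwrap exposes Cause to errors.Is and errors.As.
func (e *ASRError) Unwrap() error { return e.Cause }

// KindOf digs an ASRError out of a (possibly wrapped) error chain.
func KindOf(err error) (ASRErrorKind, bool) {
	var ae *ASRError
	if errors.As(err, &ae) {
		return ae.Kind, true
	}
	return 0, false
}

// ShouldRetry treats only ErrTransient as retryable; whether other kinds
// deserve a retry is a policy choice this sketch does not make.
func ShouldRetry(err error) bool {
	k, ok := KindOf(err)
	return ok && k == ErrTransient
}

func main() {
	base := &ASRError{Kind: ErrTransient, Backend: "deepgram", Op: "transcribe", Message: "connection reset"}
	wrapped := fmt.Errorf("segment 42: %w", base)
	fmt.Println(ShouldRetry(wrapped), wrapped)
}
```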

Versioning and Stability¶

asr/v1 is the contract above. Once frozen:

Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, SessionInfo, FinalTranscript, PartialTranscript, TranscribeOptions, or Stats; adding new ASRErrorKind values; adding new built-in backends; adding new model-management options.
Breaking changes (require v2): changing the ASRBackend interface signature; changing the meaning of any existing field; changing mode semantics; changing speaker-precedence rules.

The core supports one major version (vN) of asr/ at a time, except during migrations, when the outgoing and incoming versions are both supported for an overlap period.

Reference Build Order¶

Per the project's build-order principle (validate test substrate first, then add complexity incrementally):

Order	Backend	Why this order
1	`whisper-cpp`	First. Local, offline, single binary, no auth, no network. Unblocks end-to-end pipeline tests with deterministic input
2	`faster-whisper`	Same model family; validates the speed-tier alternative
3	`deepgram`	First cloud backend; validates BYOK + cost controls + streaming mode
4	`openai-whisper-api`	Validates shared-credential reuse with `llm-openai` sink
5	`vosk`	Validates a local backend with native streaming
6	`assemblyai`	Diarization + word-timestamp validation
7+	tier-2 backends	Same contract, different wire protocols

Build the orchestrator with whisper-cpp alone first; add backends one at a time. The base interface and the partials channel stabilize before any network-bound backend lands.

Project Principle: Opinionated Defaults, Every Default Configurable¶

This contract continues the principle from capture/v1, sink/v1, and router/v1. Every behavior with a defensible default (mode per source kind, formatting: formatted, emit_word_timestamps: auto, profanity.enabled: false, auto_download: prompt, health_recovery_interval: 5m, etc.) is exposed as a config knob. Defaults reflect a considered recommendation for the typical voice-to-LLM use case; the knobs exist so specialized workflows can tune them.
