
asr/v1 — Automatic Speech Recognition Backends

Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (whisper-cpp) · Implementations: in-tree only until v1 is frozen.

The asr/v1 surface sits between segment/v1 (which produces speech segments — chunks of audio with detected speech boundaries) and router/v1 (which consumes partially-filled IntentEnvelopes). An ASR backend takes a speech segment and produces a final transcript: text, language, confidence, optional word-level alignment.

Vox supports both streaming (partial transcripts emitted as audio flows; final on segment close) and offline (single-shot transcription of a complete segment) modes. Backends declare which they support; the orchestrator picks based on source kind and configuration.

This document is the contract. Backends conforming to it can be loaded by any version of the Vox core that supports asr/v1.


Scope

asr/v1 covers:

  • The backend interface (streaming + offline modes)
  • Mode selection (per source kind, overridable per instance)
  • Input contract (audio frames from capture/v1, segment metadata from segment/v1)
  • Output contract (FinalTranscript → partially-filled IntentEnvelope)
  • Per-stream backend instances + per-source-kind routing
  • BYOK authentication + shared credential support
  • Cost controls (budgets, rate limits, pre-flight estimation)
  • Fallback chain for backend failure
  • Word timestamps, formatting, language detection
  • Speaker label precedence + refinement rules
  • Custom vocabulary / boost words
  • Profanity filtering posture
  • Local model management (download, storage, verification, versioning)
  • Error model + latency budgets
  • Versioning and stability rules

asr/v1 does not cover:

  • Audio capture (capture/v1)
  • Speech detection / segmentation (segment/v1)
  • Intent classification or sink routing (router/v1)
  • Sink delivery (sink/v1)
  • Speaker diarization at the segmentation level (segment/v1); ASR-backend diarization is optional refinement, not the primary path
  • Translation (separate future surface if needed)

Input — what the backend consumes

When segment/v1 detects a complete speech segment, it hands the ASR backend:

  1. An audio buffer or a stream of capture/v1 Frames belonging to the segment (PCM f32 or i16, at the sample rate the backend negotiated)
  2. SessionInfo with identity + source metadata:
SessionInfo {
  SessionID       string
  StreamID        string
  SegmentID       string                  # unique per segment within the stream
  SourceKind      SourceKind              # "self" | "in-person" | "online" | "file"
  PreferredLang   string?                 # BCP-47; backend may auto-detect if empty
  FallbackLangs   []string?               # additional acceptable languages
  SpeakerHint     string?                 # opaque label; refined per precedence rules below
  StartedAt       Timestamp
  ApproxEndedAt   Timestamp               # offline mode; streaming computes its own
}

Modes — streaming vs offline

Backends declare which modes they support via Capabilities:

Capabilities {
  # Modes
  SupportsStreaming         bool
  SupportsOffline           bool
  StreamingLatencyMS        uint32           # typical time-to-first-partial; informational
  OfflineRTF                float             # offline real-time factor: processing time ÷ audio duration (e.g., 0.3 ≈ 3.3× faster than realtime)

  # Output
  SupportedLanguages        []string          # BCP-47 tags
  SupportsLanguageDetection bool
  SupportsWordTimestamps    bool
  SupportsPunctuation       bool
  SupportsVerboseFormatting bool              # filler / pause markers
  SupportsDiarization       bool
  SupportsVocabularyBoost   bool
  SupportsProfanityFilter   bool
  SupportsCustomModels      bool              # user can drop in a model_path

  # Cost
  CostPerAudioMinuteUSD     float             # 0.0 for local; > 0 for cloud
  IsLocal                   bool

  # Limits
  MaxSegmentAudioMinutes    uint32            # hard cap; longer segments rejected
}

Mode selection

Source-kind defaults; overridable per-instance.

| Source kind | Default mode | Reasoning |
|-------------|--------------|-----------|
| self        | streaming    | Dictation needs live UI feedback; user sees words as they speak |
| in-person   | offline      | Meetings don't need real-time UI; offline is cheaper and higher accuracy |
| online      | offline      | Call capture; user-facing transcription latency isn't a constraint |
| file        | offline      | Batch; never streaming |

asr:
  by_source:
    self:        streaming
    in-person:   offline
    online:      offline
    file:        offline

  instances:
    - name: dictation-fast
      backend: whisper-cpp
      mode: streaming                  # override source-kind default
    - name: meeting-accurate
      backend: deepgram
      mode: offline

If a backend doesn't support the requested mode, Open() returns ErrUnsupported with a clear message naming the missing capability.


Backend Interface

ASRBackend {
  # Identity
  Name()          -> string
  Capabilities()  -> Capabilities

  # Lifecycle
  Open(config)    -> Error
  Close()         -> Error

  # Streaming mode
  StreamOpen(ctx, sessionInfo)  -> StreamHandle | Error
  StreamFeed(handle, frames)    -> Error
  StreamPartials(handle)        -> <-chan PartialTranscript
  StreamClose(handle)           -> FinalTranscript | Error

  # Offline mode
  Transcribe(ctx, audio, options) -> FinalTranscript | Error

  # Diagnostics
  Stats()         -> Stats
  Health()        -> Health
}

Streaming-mode wire types

PartialTranscript {
  Text         string                    # current accumulated transcript
  IsFinal      bool                      # true on the final assembly
  StableUntil  uint32                    # char index of stable prefix; everything before is committed
  Confidence   float
  StartedAt    Timestamp                 # wall-clock of segment start
  CapturedAt   Timestamp                 # wall-clock when this partial was emitted
}

FinalTranscript {
  Text         string
  Language     string                    # BCP-47, detected or echo-configured
  Confidence   float
  StartedAt    Timestamp
  EndedAt      Timestamp
  Words        []Word?                   # optional; populated if SupportsWordTimestamps
  SpeakerHint  string?                   # backend-assigned diarization label (if any)
  Custom       map<string, any>          # backend-specific extras
}

Word {
  Text         string
  StartMS      uint32                    # offset from segment start in milliseconds
  EndMS        uint32
  Confidence   float
}
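A consumer of partials typically renders the committed prefix differently from the still-revisable tail. The sketch below shows one way to split on StableUntil. It assumes the index counts Unicode code points rather than bytes (the contract only says "char index"), and `splitPartial` is a hypothetical helper, not part of the interface.

```go
package main

import "fmt"

// splitPartial divides a partial transcript into the committed prefix and the
// still-revisable tail. ASSUMPTION: stableUntil counts runes, not bytes.
func splitPartial(text string, stableUntil uint32) (committed, volatile string) {
	runes := []rune(text)
	n := int(stableUntil)
	if n > len(runes) {
		n = len(runes) // clamp defensively if the index overshoots
	}
	return string(runes[:n]), string(runes[n:])
}

func main() {
	c, v := splitPartial("let's create a bd issue", 12)
	fmt.Printf("committed=%q volatile=%q\n", c, v)
}
```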

Offline-mode wire types

TranscribeOptions {
  Language          string?               # preferred / locked
  FallbackLangs     []string?
  EmitWordTimestamps bool
  Formatting        string                # "formatted" | "raw" | "verbose"
  NumberFormatting  string                # "auto" | "spell-out" | "digits"
  VocabularyBoost   *VocabularyBoost
  ProfanityFilter   *ProfanityFilter
}

VocabularyBoost {
  Words            []string
  Weight           float
  CustomWords      []string                # tell backend these ARE valid words
}

ProfanityFilter {
  Enabled  bool
  Action   string                          # "mask" | "drop"
}

Lifecycle state machines

Streaming:

                StreamOpen()             StreamFeed()×N
   [closed]  ----------------->  [open]  ---------------->  [open]
                                   |
                              StreamClose()
                                   |
                                   v
                              [final transcript returned]

StreamFeed MUST be safe to call concurrently with StreamPartials draining. StreamClose blocks until the final transcript is assembled.
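The concurrency requirement can be illustrated with a toy stream. This is a sketch under stated assumptions: `echoStream` is a hypothetical stand-in for a StreamHandle in which each "frame" is a word, and it omits contexts, errors, and real audio. It only shows feeding on one goroutine while another drains partials, followed by a close that returns the final transcript.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type PartialTranscript struct{ Text string }
type FinalTranscript struct{ Text string }

// echoStream is a hypothetical stream: every Feed emits a partial holding the
// text accumulated so far.
type echoStream struct {
	mu       sync.Mutex
	words    []string
	partials chan PartialTranscript
}

func streamOpen() *echoStream {
	return &echoStream{partials: make(chan PartialTranscript, 16)}
}

// Feed appends a word under the lock, then publishes a partial.
func (s *echoStream) Feed(word string) {
	s.mu.Lock()
	s.words = append(s.words, word)
	text := strings.Join(s.words, " ")
	s.mu.Unlock()
	s.partials <- PartialTranscript{Text: text}
}

// Close ends the partials channel and returns the assembled final transcript.
func (s *echoStream) Close() FinalTranscript {
	close(s.partials)
	s.mu.Lock()
	defer s.mu.Unlock()
	return FinalTranscript{Text: strings.Join(s.words, " ")}
}

// run feeds words while a separate goroutine drains partials, then closes.
func run(words []string) (int, FinalTranscript) {
	s := streamOpen()
	var wg sync.WaitGroup
	count := 0
	wg.Add(1)
	go func() { // consumer draining partials concurrently with Feed
		defer wg.Done()
		for range s.partials {
			count++
		}
	}()
	for _, w := range words {
		s.Feed(w)
	}
	final := s.Close()
	wg.Wait() // count is safe to read once the consumer has exited
	return count, final
}

func main() {
	n, final := run([]string{"open", "the", "deck"})
	fmt.Println(n, final.Text)
}
```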

Offline:

                  Transcribe()
   [open]  -------------------->  [final transcript returned]

Single-shot. No streaming state to manage.

Concurrency

  • Open / Close are called from a single control thread per backend instance.
  • Transcribe is safe to call concurrently from multiple goroutines per instance — the orchestrator may issue multiple offline jobs in parallel (subject to rate_limit_per_minute config).
  • Streaming sessions are independent — multiple StreamHandles may be open concurrently against the same backend instance.

Partial transcripts — internal to ASR

Partial transcripts do NOT become envelopes. Only the final transcript becomes a partially-filled envelope that flows to router/v1.

Reasoning:

  • Partials are revisions: text that may change as more audio arrives. Routing and sink delivery are committed work; you can't un-route a partial later proven wrong.
  • Routing every partial would multiply downstream work for no real gain (5–20 partials per envelope, most replaced by the next).
  • The router and sink contracts are designed for committed envelopes, not provisional ones.

The orchestrator exposes a partials subscriber channel for callers who want them (UI components, debug surfaces, live-captioning sinks):

orchestrator.SubscribePartials(uiSubscriber)

This decouples the ASR-internal stream from the envelope pipeline.


Multiple Backend Instances

A single asr/v1 deployment can load multiple backends simultaneously, each scoped to specific source kinds. Useful for "fast local ASR for my voice, accurate cloud ASR for meetings."

asr:
  instances:
    - name: fast-local
      backend: whisper-cpp
      mode: streaming
      model: large-v3-q5_0
      handles_source_kinds: [self]

    - name: accurate-cloud
      backend: deepgram
      mode: offline
      handles_source_kinds: [in-person, online]
      auth:
        method: keychain
        credential_name: deepgram-api-key

  fallback_chains:
    self:        [fast-local, whisper-cpp-fallback]
    in-person:   [accurate-cloud, faster-whisper-local]
    online:      [accurate-cloud, faster-whisper-local]
    file:        [faster-whisper-local]

If multiple instances claim the same source kind, the FIRST in declared order handles it. If no instance claims a source kind, the segment is dropped with a structured warning.
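A minimal sketch of that routing rule follows, assuming a hypothetical `Instance` type that carries only the name and the handles_source_kinds list. The boolean result tells the caller to drop the segment and log the structured warning.

```go
package main

import "fmt"

// Instance is a hypothetical, trimmed-down view of one configured instance.
type Instance struct {
	Name        string
	SourceKinds []string // handles_source_kinds
}

// pickInstance returns the first declared instance that claims sourceKind.
// ok is false when no instance claims it.
func pickInstance(instances []Instance, sourceKind string) (inst Instance, ok bool) {
	for _, candidate := range instances {
		for _, k := range candidate.SourceKinds {
			if k == sourceKind {
				return candidate, true
			}
		}
	}
	return Instance{}, false
}

func main() {
	instances := []Instance{
		{Name: "fast-local", SourceKinds: []string{"self"}},
		{Name: "accurate-cloud", SourceKinds: []string{"in-person", "online"}},
	}
	inst, ok := pickInstance(instances, "online")
	fmt.Println(inst.Name, ok)
	_, ok = pickInstance(instances, "file")
	fmt.Println(ok) // no claimant: the segment would be dropped with a warning
}
```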


Latency Budgets

Default budgets:

| Mode               | Budget                  | On overrun |
|--------------------|-------------------------|------------|
| Streaming partial  | < 500 ms time-to-first  | Mark backend degraded; asr.partial_latency_exceeded event |
| Streaming final    | < 1.5× segment duration | Mark backend degraded; fall back; asr.final_latency_exceeded event |
| Offline transcribe | < 2× segment duration   | Same: degrade + fallback |
| Hard cutoff        | 30 s for any segment    | Abort; emit envelope with empty Transcript + Provenance.Custom["asr.failed"] = "timeout" |

All budgets are configurable per instance. The pipeline never blocks waiting for ASR.


Tier-1 Backends (ship with v1)

| Backend            | Local/Cloud | Modes | Why |
|--------------------|-------------|-------|-----|
| whisper-cpp        | Local | Offline (streaming patches in some forks) | C/C++ port of OpenAI Whisper; GGML quantized models; works on every tier-1 OS; the local default |
| faster-whisper     | Local | Offline + experimental streaming | CTranslate2-based reimplementation; its README reports up to 4× faster inference than openai/whisper at the same accuracy; same model family |
| vosk               | Local | Streaming + offline | Smaller models; lower latency; lower accuracy. The "I want streaming locally" path |
| openai-whisper-api | Cloud | Offline | OpenAI's hosted Whisper; shares credential with llm-openai sink |
| deepgram           | Cloud | Streaming + offline | Streaming-first; live captioning use case |
| assemblyai         | Cloud | Streaming + offline | Strong diarization + word timestamps |

Tier-2 backends (community-contributable, same contract)

ollama-whisper, google-speech, azure-speech, aws-transcribe, gladia, replicate-whisper.


BYOK Authentication

Same credential precedence as LLM / S3 / email sinks:

  1. Explicit env var (e.g., DEEPGRAM_API_KEY, ASSEMBLYAI_API_KEY)
  2. OS keychain (default)
  3. Config file (deprecated; warns)
  4. External secrets manager (future secrets/v1)

Onboarding: vox auth set <backend> (mirrors LLM sink pattern).

Shared credentials

Backends that share auth with other Vox surfaces declare it:

asr:
  instances:
    - name: openai-asr
      backend: openai-whisper-api
      auth:
        shares_credential_with: llm-openai      # reuse existing keychain entry

No double prompt at onboarding; the credential is set once via the relevant LLM sink and consumed here too.


Cost Controls (Cloud Backends)

Audio-minute–metered cloud ASR can rack up real money fast. First-class guardrails:

asr:
  instances:
    - name: accurate-cloud
      backend: deepgram
      cost_controls:
        budget_daily_usd: 5.00
        budget_monthly_usd: 100.00
        on_budget_warn_pct: 80                # warn at 80% of budget
        on_budget_exceed: halt                # halt | warn | fallback
        fallback_backend: whisper-cpp          # required if on_budget_exceed: fallback
        rate_limit_per_minute: 60              # max segments per minute
        max_segment_audio_minutes: 30          # reject implausibly long segments

Behavior

| Setting | What it does |
|---------|--------------|
| budget_daily_usd / budget_monthly_usd | Hard caps. Vox tracks consumed audio-minutes × CostPerAudioMinuteUSD |
| on_budget_warn_pct | Emit asr.budget_warning event + structured log + audit event at this threshold |
| on_budget_exceed | halt = stop transcribing; warn = continue + log; fallback = route to fallback_backend |
| rate_limit_per_minute | Soft throttle. Segments queue if exceeded |
| max_segment_audio_minutes | Reject implausibly long segments (silence-VAD bug usually) |

Spend tracking is persisted in ~/.vox/state/cost-tracker.db (SQLite). Session start displays current spend: "Today's ASR spend: $0.47 of $5.00 budget".

For local backends, cost_controls is ignored except for rate_limit_per_minute (CPU budgeting).
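The budget settings can be read as a small decision function applied per segment. The sketch below is illustrative only: `Budget`, `Verdict`, and the event name "asr.budget_exceeded" are hypothetical (only asr.budget_warning appears in this contract), persistence is omitted, and it assumes the cap is checked before the segment is charged.

```go
package main

import "fmt"

// Verdict is what the orchestrator does with one segment.
type Verdict struct {
	Transcribe  bool   // send the segment to this backend
	UseFallback bool   // route to fallback_backend instead
	Event       string // "", "asr.budget_warning", or "asr.budget_exceeded" (hypothetical name)
}

// Budget tracks spend against one cap (daily or monthly).
type Budget struct {
	LimitUSD float64 // budget_daily_usd or budget_monthly_usd
	WarnPct  float64 // on_budget_warn_pct
	OnExceed string  // "halt" | "warn" | "fallback"
	SpentUSD float64
}

// Charge prices a segment as audio-minutes × cost-per-minute and applies the
// configured policy. ASSUMPTION: a segment that would cross the cap counts as
// exceeding it.
func (b *Budget) Charge(audioMinutes, costPerMinuteUSD float64) Verdict {
	cost := audioMinutes * costPerMinuteUSD
	if b.SpentUSD+cost > b.LimitUSD {
		switch b.OnExceed {
		case "warn": // continue + log
			b.SpentUSD += cost
			return Verdict{Transcribe: true, Event: "asr.budget_exceeded"}
		case "fallback":
			return Verdict{UseFallback: true, Event: "asr.budget_exceeded"}
		default: // "halt"
			return Verdict{Event: "asr.budget_exceeded"}
		}
	}
	b.SpentUSD += cost
	if b.SpentUSD >= b.LimitUSD*b.WarnPct/100 {
		return Verdict{Transcribe: true, Event: "asr.budget_warning"}
	}
	return Verdict{Transcribe: true}
}

func main() {
	b := &Budget{LimitUSD: 5, WarnPct: 80, OnExceed: "fallback"}
	for i := 0; i < 3; i++ {
		// 0.25 USD/min is a made-up rate chosen for round numbers.
		fmt.Printf("%+v spent=%.2f\n", b.Charge(8, 0.25), b.SpentUSD)
	}
}
```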

Pre-flight cost estimation

asr:
  preflight_cost_estimate: true                # default true for paid backends
  preflight_cost_log_threshold_usd: 0.10

The estimate is segment_duration_minutes × the backend's CostPerAudioMinuteUSD, so it is cheap to compute. Estimates above preflight_cost_log_threshold_usd are logged at INFO, which is useful for spotting runaway long segments before they bill.

Cost transparency on the envelope

Optional Provenance.Custom["asr.*"] cost metadata:

Provenance.Custom {
  "asr.backend":       "deepgram",
  "asr.audio_minutes": "0.85",
  "asr.cost_usd":      "0.0036",
  "asr.model":         "nova-2",
  "asr.mode":          "streaming"
}

Controlled by asr.emit_cost_metadata: true|false (default true for cloud, false for local). Flows into audit/v1 when loaded — "how much did the Q3 board meeting transcription cost?" is one query away.


Fallback Chain

When a backend goes unhealthy (network error, auth failure, quota / budget exceeded with on_budget_exceed: fallback, timeout), the orchestrator routes the next segment to the next healthy backend in the chain for that source kind.

asr:
  fallback_chains:
    online: [deepgram, assemblyai, whisper-cpp]

Semantics

  • A failed backend is taken out of rotation for health_recovery_interval (default 5 min)
  • Subsequent segments route to the next healthy backend in the chain
  • After recovery interval, Vox probes the primary; if healthy, segments resume routing to it
  • Fallback events emit asr.backend_fallback telemetry + audit event
  • User-visible: structured log at WARN level the first time fallback fires per session

Emergency local fallback

If the entire chain fails, Vox uses an always-available emergency local fallback (the smallest bundled whisper-cpp quant, tiny or base). If that also fails, the orchestrator emits an envelope with an empty Transcript, Intent.Kind = unclassified, and Provenance.Custom["asr.failed"] = "all_backends_exhausted".

The pipeline never blocks.


Output Details

Word timestamps

asr:
  emit_word_timestamps: auto                   # auto | always | never

auto checks Capabilities.SupportsWordTimestamps; emits if true, silently skips otherwise. always errors at startup when a configured backend can't produce them.

The word slice is attached to FinalTranscript.Words; sinks that don't need it ignore it.
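The three settings reduce to a small startup check. A minimal sketch, with `wordTimestamps` as a hypothetical helper name:

```go
package main

import (
	"errors"
	"fmt"
)

// wordTimestamps resolves emit_word_timestamps against the backend's
// SupportsWordTimestamps capability. An error is meant to surface at startup.
func wordTimestamps(setting string, supported bool) (bool, error) {
	switch setting {
	case "auto":
		return supported, nil // emit if possible, silently skip otherwise
	case "always":
		if !supported {
			return false, errors.New("emit_word_timestamps: always, but backend lacks SupportsWordTimestamps")
		}
		return true, nil
	case "never":
		return false, nil
	}
	return false, fmt.Errorf("invalid emit_word_timestamps value %q", setting)
}

func main() {
	emit, err := wordTimestamps("auto", false)
	fmt.Println(emit, err)
	_, err = wordTimestamps("always", false)
	fmt.Println(err)
}
```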

Formatting

| Mode | Output style | Use case |
|------|--------------|----------|
| formatted (default) | "Let's create a bd issue for the deck refresh." | Normal; downstream router classification works best with punctuation |
| raw | "lets create a bd issue for the deck refresh" | Low-bandwidth; some backends only support raw |
| verbose | "Let's [pause] create a bd issue for the deck refresh [pause] um yeah." | Research workflows that need filler / hesitation markers |

asr:
  formatting: formatted                        # formatted | raw | verbose
  number_formatting: auto                      # auto | spell-out | digits

Language

asr:
  language:
    mode: auto-with-hint                       # auto | auto-with-hint | locked
    preferred: en-US                           # BCP-47; hint or lock
    fallback_languages: [es, zh-CN]            # also acceptable in auto modes

Code-switching (mixed languages within a segment): backend picks a dominant language; output is in that language's script with any switch words transcribed best-effort. Vox does NOT split segments by language.

The envelope's Language is the primary detected language. For multilingual content, configure multiple instances scoped by source / session metadata.

Speaker label precedence

Speaker.Label can come from three places. Precedence:

  1. segment/v1 diarization (highest — sees full audio context)
  2. ASR backend diarization (refinement — secondary)
  3. Capture-side hint (default — coarse)

Refinement rule: if segment/v1 says "single speaker, unknown identity" AND the ASR backend produces a more specific label (e.g., Speaker B), the ASR label wins. A more specific identification wins, but cross-segment stability takes priority over it.

Envelope's Speaker.Label is the final resolved value. Provenance.Custom["asr.speaker_hint"] retains the ASR backend's raw output for audit.
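The precedence and refinement rule can be sketched for a single segment as follows. The `Labels` struct and `resolveSpeaker` are hypothetical, an empty string is assumed to mean "no label from this source", and the cross-segment stability check is deliberately left out because it needs state across segments.

```go
package main

import "fmt"

// Labels holds the three candidate sources for one segment.
type Labels struct {
	Segment        string // from segment/v1 diarization
	SegmentUnknown bool   // segment/v1 reported "single speaker, unknown identity"
	ASR            string // from ASR backend diarization
	Capture        string // capture-side hint
}

// resolveSpeaker applies the refinement rule first, then plain precedence:
// segment/v1, then ASR backend, then capture-side hint.
func resolveSpeaker(l Labels) string {
	if l.SegmentUnknown && l.ASR != "" {
		return l.ASR // a specific ASR label refines an unknown identity
	}
	if l.Segment != "" {
		return l.Segment
	}
	if l.ASR != "" {
		return l.ASR
	}
	return l.Capture
}

func main() {
	fmt.Println(resolveSpeaker(Labels{Segment: "Speaker A", ASR: "Speaker B", Capture: "mic-1"}))
	fmt.Println(resolveSpeaker(Labels{Segment: "unknown", SegmentUnknown: true, ASR: "Speaker B"}))
	fmt.Println(resolveSpeaker(Labels{Capture: "mic-1"}))
}
```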

Vocabulary boost

Domain terms (product names, jargon) are routinely mis-transcribed. Most backends support boost:

asr:
  instances:
    - name: engineering-meetings
      backend: deepgram
      vocabulary:
        boost_words: [Kubernetes, gRPC, blackrim, vox, sageox, anthropic]
        boost_weight: 1.5
        custom_words: [oxledger, voxsink]      # tell backend these ARE words

Backends without boost support warn once at Open() and ignore the block.

whisper-cli now honors vocabulary boost (v1.x, shipped). The whisper-cli backend passes boost words to whisper-cli via the --prompt flag, which biases the recognizer toward those tokens:

asr:
  instances:
    - name: engineering-local
      backend: whisper-cli
      vocabulary_boost_words: [Kubernetes, gRPC, blackrim, vox, sageox]
      vocabulary_boost_max_chars: 500          # default; prompt is truncated at this byte length

Config map keys: vocabulary_boost_words (list of strings) and vocabulary_boost_max_chars (int, default 500 — safe limit for whisper.cpp's n_text_ctx/2 token budget).

When the prompt is passed, asr.vocab_applied = true is stamped in FinalTranscript.Custom for audit traceability.
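How the boost words become a prompt string is not specified here, so the following is only a sketch under assumptions: words are joined with ", ", the limit is measured in bytes, and the prompt stops at the last whole word that fits rather than cutting mid-word. `buildPrompt` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// buildPrompt joins boost words with ", " and stops before the first word
// that would push the prompt past maxBytes.
func buildPrompt(words []string, maxBytes int) string {
	var b strings.Builder
	for _, w := range words {
		sep := ""
		if b.Len() > 0 {
			sep = ", "
		}
		if b.Len()+len(sep)+len(w) > maxBytes {
			break
		}
		b.WriteString(sep)
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	words := []string{"Kubernetes", "gRPC", "blackrim", "vox", "sageox"}
	fmt.Printf("%q\n", buildPrompt(words, 500))
	fmt.Printf("%q\n", buildPrompt(words, 16))
}
```

Stopping on a word boundary avoids handing the recognizer a half-word, at the cost of using slightly less than the configured limit.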

Profanity filtering

OFF by default. Vox captures what was said.

asr:
  profanity:
    enabled: false                             # default off
    action: mask                               # mask | drop

Reasoning: a transcription tool that silently censors is untrustworthy. Enterprise compliance scenarios that demand redaction belong in audit/v1 / report/v1 — the raw transcript stays unmodified.


Local Model Management

Local backends need model files (75 MB to 1.5 GB+ per model).

Storage layout

~/.vox/models/
  whisper-cpp/
    base.en.bin
    large-v3-q5_0.bin
    .checksums
  faster-whisper/
    large-v3/
      model.bin
      config.json
      vocabulary.txt
    .checksums
  vosk/
    vosk-model-en-us-0.22/
    .checksums

CLI

vox model list                                 # show available + installed
vox model download whisper-cpp:large-v3-q5_0
vox model verify whisper-cpp
vox model remove whisper-cpp:base.en

Auto-download

By default, first reference to an uninstalled model prompts:

Model whisper-cpp:large-v3-q5_0 (1.0 GB) is not installed.
Download from https://huggingface.co/ggerganov/whisper.cpp ? [y/N]

Configurable:

asr:
  auto_download: prompt                        # prompt (default) | silent | never

Custom models

Users drop a model file into the storage layout and reference by path:

asr:
  instances:
    - name: my-finetuned
      backend: faster-whisper
      model_path: ~/.vox/models/faster-whisper/my-finetuned/

Verification

Each backend directory has a .checksums file holding the SHA-256 of every installed model file. Vox verifies at install and on first load each session. On mismatch, Vox refuses to load the model and returns a structured error (ErrModelCorrupt) pointing at vox model verify.

Versioning

Model names include explicit version (large-v3-q5_0, not large). New versions are new models; old versions stay usable until removed. No silent upgrades.


Error Model

Typed errors, mirroring other Vox surfaces:

ASRError {
  Kind     ASRErrorKind
  Backend  string
  Op       string                              # "open" | "stream-open" | "transcribe" | etc.
  Message  string
  Cause    Error?
}

ASRErrorKind {
  ErrInvalidConfig
  ErrAuthFailed
  ErrQuotaExceeded                             # rate-limit / quota / budget
  ErrUnsupported                               # mode / language / feature not supported
  ErrModelNotFound                             # local model file missing
  ErrModelCorrupt                              # checksum mismatch
  ErrTimeout
  ErrBackendUnavailable                        # provider down, network unreachable
  ErrTransient                                 # retry may help
  ErrPersistent                                # retry won't help
  ErrInternal
}

The orchestrator handles failures via the fallback chain. ASR backends NEVER block the pipeline — worst case is an empty-transcript envelope with Provenance.Custom["asr.failed"] set.


Versioning and Stability

asr/v1 is the contract above. Once frozen:

  • Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, SessionInfo, FinalTranscript, PartialTranscript, TranscribeOptions, or Stats; adding new ASRErrorKind values; adding new built-in backends; adding new model-management options.
  • Breaking changes (require v2): changing the ASRBackend interface signature; changing the meaning of any existing field; changing mode semantics; changing speaker-precedence rules.

The core supports one vN of asr/ at a time, with overlap during migrations.


Reference Build Order

Per the project's build-order principle (validate test substrate first, then add complexity incrementally):

| Order | Backend | Why this order |
|-------|---------|----------------|
| 1  | whisper-cpp | First. Local, offline, single binary, no auth, no network. Unblocks end-to-end pipeline tests with deterministic input |
| 2  | faster-whisper | Same model family; validates the speed-tier alternative |
| 3  | deepgram | First cloud backend; validates BYOK + cost controls + streaming mode |
| 4  | openai-whisper-api | Validates shared-credential reuse with llm-openai sink |
| 5  | vosk | Validates the local streaming backend path |
| 6  | assemblyai | Diarization + word-timestamp validation |
| 7+ | tier-2 backends | Same contract, different wire protocols |

Build the orchestrator with whisper-cpp alone first; add backends one at a time. The base interface and the partials channel stabilize before any network-bound backend lands.


Project Principle: Opinionated Defaults, Every Default Configurable

This contract continues the principle from capture/v1, sink/v1, and router/v1. Every behavior with a defensible default (mode per source kind, formatting: formatted, emit_word_timestamps: auto, profanity.enabled: false, auto_download: prompt, health_recovery_interval: 5m, etc.) is exposed as a config knob. Defaults reflect a considered recommendation for the typical voice-to-LLM use case; the knobs exist so specialized workflows can tune them.