capture/v1: Audio Source Adapter Contract
Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference adapter (file-wav) · Implementations: in-tree only until v1 is frozen.
The capture/v1 surface is the entry point of every Vox session. A
capture adapter produces a stream of audio frames from a specific source
(your microphone, a room mic, system audio from a call client, an audio file,
etc.) and hands them to the rest of the pipeline. Everything downstream —
segmentation, transcription, routing, sinks — is source-agnostic; the only
place that knows what kind of input is coming in is the adapter itself.
This document is the contract. An implementation that conforms to it can be
loaded by any version of the Vox core that supports capture/v1.
Scope
capture/v1 covers:
- The lifecycle of an audio source (open, start, stop, close)
- The wire format of audio frames
- Configuration negotiation (sample rate, channel count, frame size)
- Backpressure / drop policy at the adapter→consumer boundary
- Disconnect / hot-swap behavior
- Error reporting and permission errors
- Source-type metadata so downstream stages can adjust behavior
capture/v1 does not cover:
- Voice activity detection (segment/v1)
- Speaker diarization (segment/v1)
- Transcription (asr/v1)
- Sample-rate conversion (handled in core, between the adapter channel and the consumer; see below)
- Multi-source orchestration / session grouping (a higher layer; capture/v1 only exposes the SessionID correlation hook)
- Mixing multiple sources into one stream
Adapters MUST do one thing: pull audio out of a source and emit frames. Anything more belongs in another surface.
Source kinds
Four source kinds are first-class. Adapters MUST declare which kind they implement, and the core uses the kind to set sensible downstream defaults (diarization on/off, default ASR backend, disconnect policy, etc.).
| Kind | Examples | Typical channel count | Typical sample rate |
|---|---|---|---|
| self | Default mic, USB headset, AirPods | 1 | 16 kHz or 48 kHz |
| in-person | Lapel, conference array, room mic, paired phone | 1–8 | 48 kHz |
| online | System audio (loopback) for remote-participant capture | 2 | 48 kHz |
| file | Pre-recorded audio (testing, batch ingest, replay) | any | any |
Adapters are single-source. An online adapter captures system-audio
loopback only; the user's own mic comes from a parallel self adapter
loaded alongside. The two streams are correlated by SessionID (see Frame
format below). Adapters that internally mix multiple sources are out of
contract for v1.
Custom kinds MAY be added in v2. For v1, adapters that don't fit one of
the four MUST pick the closest match and document the discrepancy.
Frame format
A capture stream emits a sequence of frames through a bounded channel (see Transport). Each frame is a fixed-size buffer of audio samples with metadata.
Sample encoding
- Samples are 32-bit IEEE-754 floats in the range [-1.0, +1.0] (f32, little-endian on the wire when serialized).
- Adapters that read native int16 or int24 MUST convert to f32 before emitting. The cost of conversion belongs at the source edge so downstream code can assume a single format.
- Adapters MAY support int16 mode for memory-constrained scenarios. That is opt-in via configuration and not the default.
- The Encoding enum is reserved for additive future growth. Values beyond f32 and i16 (e.g., opus, aac, flac) are out of scope for v1 but explicitly anticipated; introducing them later requires no v1 breaking change.
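The int16-to-f32 conversion at the source edge can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `int16ToF32` is hypothetical, and dividing by 32768 is one common scaling convention (it maps -32768 to exactly -1.0 and the largest positive sample to just under +1.0).

```go
package main

import "fmt"

// int16ToF32 converts signed 16-bit PCM samples to float32.
// Assumption: scale by 1/32768, so the output lies in [-1.0, +1.0).
func int16ToF32(in []int16) []float32 {
	out := make([]float32, len(in))
	for i, s := range in {
		out[i] = float32(s) / 32768.0
	}
	return out
}

func main() {
	// Silence, half scale, and negative full scale.
	fmt.Println(int16ToF32([]int16{0, 16384, -32768}))
}
```

Doing this once in the adapter is what lets every downstream stage assume a single sample format.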
Channel layout
- Channels are emitted interleaved (frame[0]=ch0, frame[1]=ch1, frame[2]=ch0, ...).
- Channel ordering follows WAV / FFmpeg conventions: front-left, front-right, center, low-frequency, back-left, back-right, side-left, side-right.
- For self sources, mono (1 channel) is the default and recommended.
- For online sources, stereo (2 channels) is the default, because loopback capture is typically stereo system audio. Both channels represent remote-participant audio; the user's voice is NOT mixed in.
- For in-person sources with multiple mics, channel count equals mic count. Spatial / array math (if any) is NOT done in capture/v1; emit the raw channels.
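A consumer that needs per-channel buffers (for example, one buffer per mic) has to undo the interleaving itself. The sketch below shows one way to do that; `deinterleave` is a hypothetical helper, not part of the contract.

```go
package main

import "fmt"

// deinterleave splits an interleaved buffer (ch0, ch1, ch0, ch1, ...) into
// one slice per channel. It returns nil if the buffer length is not a
// multiple of the channel count.
func deinterleave(samples []float32, channels int) [][]float32 {
	if channels <= 0 || len(samples)%channels != 0 {
		return nil
	}
	perChannel := len(samples) / channels
	out := make([][]float32, channels)
	for c := range out {
		out[c] = make([]float32, perChannel)
	}
	for i, s := range samples {
		out[i%channels][i/channels] = s
	}
	return out
}

func main() {
	stereo := []float32{0.1, -0.1, 0.2, -0.2}
	fmt.Println(deinterleave(stereo, 2))
}
```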
Frame structure
Frame {
  StreamID           string         # uuid4, stable for the lifetime of the stream;
                                    # re-issued on device change (see Hot-swap)
  SessionID          string         # optional; correlation hook for multi-adapter
                                    # session grouping (see Multi-stream). Empty if
                                    # the stream isn't part of a session group.
  Sequence           uint64         # monotonically increasing from 0
  CapturedAt         Timestamp      # wall-clock at the start of this frame
  Duration           Duration       # frame duration (derived from samples + rate, included for convenience)
  SampleRate         uint32         # Hz; constant for the lifetime of the stream
  Channels           uint8          # constant for the lifetime of the stream
  Encoding           Encoding       # "f32" (default) or "i16"
  Samples            []f32 or []i16 # interleaved; length == FrameSize * Channels

  # Special markers (mutually exclusive with Samples)
  IsDropMarker       bool           # true when this frame represents a gap from dropped frames
  DroppedCount       uint32         # number of frames dropped before this one (if IsDropMarker)
  IsDeviceMarker     bool           # true when a transient device disconnect was recovered
  DeviceMarkerReason string         # e.g., "transient_reconnect"
}
StreamID is set by the adapter at Start() and stays constant until either
a clean stream end OR a fallback-to-different-device event (see Hot-swap).
SessionID is set by the orchestrator when the adapter is opened as part of
a multi-adapter session group; empty otherwise. Sequence starts at 0 for
the first frame and increments by one per frame; consumers detect drops by
sequence gaps. CapturedAt is the adapter's best estimate of the wall-clock
time of the first sample in the frame.
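As an illustration of how a consumer might use these fields, the sketch below renders a subset of the frame in Go and detects drops from sequence gaps. The Go type, the `Duration` method (a field in the pseudocode above), and `missingBetween` are assumptions made for this example; the reference binding may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Frame is a hypothetical Go rendering of a subset of the pseudocode fields.
type Frame struct {
	StreamID   string
	SessionID  string
	Sequence   uint64
	CapturedAt time.Time
	SampleRate uint32
	Channels   uint8
	Samples    []float32
}

// Duration derives the frame duration from the sample count and rate.
func (f Frame) Duration() time.Duration {
	if f.SampleRate == 0 || f.Channels == 0 {
		return 0
	}
	perChannel := len(f.Samples) / int(f.Channels)
	return time.Duration(perChannel) * time.Second / time.Duration(f.SampleRate)
}

// missingBetween reports how many frames were lost between two frames that
// the consumer received back to back, based on the Sequence gap.
func missingBetween(prev, cur Frame) uint64 {
	if cur.Sequence <= prev.Sequence {
		return 0
	}
	return cur.Sequence - prev.Sequence - 1
}

func main() {
	f := Frame{SampleRate: 16000, Channels: 1, Samples: make([]float32, 320)}
	fmt.Println(f.Duration())
	fmt.Println(missingBetween(Frame{Sequence: 4}, Frame{Sequence: 7}))
}
```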
Frame size
The frame size (samples per channel per frame) is negotiated at Open().
The consumer requests a preferred frame size; the adapter MAY honor it
exactly, or MAY round to the nearest device-natural value and return the
actual size to the consumer.
- Default request: 20 ms worth of samples (320 samples at 16 kHz, 960 at 48 kHz). This matches the WebRTC VAD frame size and most low-latency ASR backends.
- Adapters MUST emit frames at a steady cadence matching the negotiated frame size, modulo unavoidable device jitter.
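The default frame sizes quoted above follow directly from the sample rate and the frame length. A small sketch of the arithmetic, with `samplesPerFrame` as a hypothetical helper name:

```go
package main

import "fmt"

// samplesPerFrame returns the number of samples per channel in a frame of
// frameMillis milliseconds at the given sample rate.
func samplesPerFrame(sampleRate, frameMillis uint32) uint32 {
	return sampleRate * frameMillis / 1000
}

func main() {
	fmt.Println(samplesPerFrame(16000, 20)) // 320
	fmt.Println(samplesPerFrame(48000, 20)) // 960
}
```

An adapter that rounds to a device-natural size would report the rounded value as the actual frame size, so the consumer should use that rather than recomputing it.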
Time base
CapturedAt is wall-clock from the host system. Adapters MUST NOT
back-correct timestamps for buffering or processing delay observed inside the
adapter — the consumer assumes the timestamp is the source-side capture
moment, not the moment the frame arrived in user space.
If the adapter cannot get a precise capture time from the OS (some platforms don't expose it cleanly), it MAY use the wall-clock at the moment the frame was assembled in user space and document that limitation.
Transport: bounded channel, adapter writes, consumer reads
Frames flow from adapter to consumer through a bounded channel (Go channel, equivalent in other languages: a bounded queue). The adapter owns the writer side; the consumer owns the reader side.
Adapter.Start(ctx, frames chan<- Frame, errs chan<- Error) -> Error
The adapter MUST:
- Write each captured frame to frames using a non-blocking try-write.
- Close frames on graceful Stop().
- Apply the configured drop policy (below) when the channel is full.
The consumer MUST:
- Read from frames continuously.
- Treat a closed channel as graceful end-of-stream.
- Surface drop markers and device markers to its own downstream stages.
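In Go, the two sides of this boundary might look like the sketch below: a `select` with a `default` branch gives the non-blocking try-write, and ranging over the channel gives the consumer a natural end-of-stream when the adapter closes it. The reduced `Frame` type and the helper names `tryWrite` and `consume` are assumptions for illustration.

```go
package main

import "fmt"

// Frame is a stand-in for the capture/v1 frame; only Sequence is modeled.
type Frame struct {
	Sequence uint64
}

// tryWrite attempts a non-blocking send and reports whether it succeeded.
func tryWrite(frames chan<- Frame, f Frame) bool {
	select {
	case frames <- f:
		return true
	default:
		return false // channel full; the drop policy decides what happens next
	}
}

// consume reads until the channel is closed and returns the sequences seen.
func consume(frames <-chan Frame) []uint64 {
	var seen []uint64
	for f := range frames {
		seen = append(seen, f.Sequence)
	}
	return seen
}

func main() {
	frames := make(chan Frame, 2)
	for seq := uint64(0); seq < 4; seq++ {
		if !tryWrite(frames, Frame{Sequence: seq}) {
			fmt.Println("channel full, frame", seq, "not enqueued")
		}
	}
	close(frames) // what a graceful Stop() would do
	fmt.Println(consume(frames))
}
```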
Why a channel, not a callback
- Decouples capture pace from consumer pace. Capture is real-time; the rest of the pipeline (VAD, ASR, routing, sinks) is variable-latency. Forcing the adapter thread to wait on consumer work is the wrong coupling.
- Cross-language FFI hygiene. Function-pointer callbacks across CGo (CoreAudio / WASAPI / PipeWire bindings) are a known pain point — context loss, GC pinning, deadlock risk. A channel boundary keeps FFI complexity contained in one short writer function.
- Backpressure is policy, not code. Channel-full behavior is selectable at config time (see below).
Channel capacity (default: 64 frames)
- Default: 64 frames (~1.28 seconds at 20 ms / 48 kHz mono).
- Rationale: longer than any reasonable GC pause or OS scheduler hiccup, but short enough that a real problem surfaces within ~2 seconds rather than being silently absorbed.
- Override: capture.buffer_frames: <int>. The file source kind defaults higher (256) because there's no real-time constraint and faster drain is desirable.
Backpressure / drop policy
The contract: never block the adapter's capture thread on a slow consumer; drop a frame instead, and surface the drop loudly so the engineering effort can go into preventing it.
Drop policy when channel is full (default: drop-newest)
| Policy | Behavior | Use case |
|---|---|---|
| drop-newest (default) | Don't enqueue the new frame; record the gap in Stats; emit a DropMarker frame after the gap | Recommended for ASR: the queued frames are about to be processed; losing the next word is visible and recoverable |
| drop-oldest | Pop one frame off the channel, enqueue the new one | Real-time monitoring where freshness > completeness |
| drop-window | Drop a contiguous batch of frames, emit a single marker | Aggressive recovery; experimental |
| block | Block the adapter until space frees up | Diagnostic only; violates the "never block" contract; do not use in production |
Configurable via capture.drop_policy.
Drop markers
When a drop occurs, the adapter MUST enqueue a synthetic Frame with
IsDropMarker: true and DroppedCount: k after the dropped range. This
makes the discontinuity explicit to downstream consumers without forcing
them to poll Stats().
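One way the drop-newest policy and its marker could fit together is sketched below. The writer counts frames it could not enqueue and, as soon as the channel has room again, enqueues a single marker carrying that count before resuming normal frames. The `dropNewestWriter` type is hypothetical, and the marker's Sequence value is left unset here because the contract text does not say how markers are numbered.

```go
package main

import "fmt"

// Frame models only the fields needed to show drop markers.
type Frame struct {
	Sequence     uint64
	IsDropMarker bool
	DroppedCount uint32
}

// dropNewestWriter wraps the adapter side of the bounded channel.
type dropNewestWriter struct {
	out     chan<- Frame
	pending uint32 // frames dropped since the last marker was enqueued
}

// write enqueues f without blocking. If earlier frames were dropped, a marker
// describing the gap is enqueued first so it lands after the dropped range.
func (w *dropNewestWriter) write(f Frame) {
	if w.pending > 0 {
		marker := Frame{IsDropMarker: true, DroppedCount: w.pending}
		select {
		case w.out <- marker:
			w.pending = 0
		default:
			w.pending++ // still full: f joins the dropped range
			return
		}
	}
	select {
	case w.out <- f:
	default:
		w.pending++ // channel full: drop the newest frame
	}
}

func main() {
	ch := make(chan Frame, 2)
	w := &dropNewestWriter{out: ch}
	for seq := uint64(0); seq < 5; seq++ {
		w.write(Frame{Sequence: seq}) // 0 and 1 fit; 2, 3 and 4 are dropped
	}
	<-ch // consumer drains frame 0
	<-ch // consumer drains frame 1
	w.write(Frame{Sequence: 5})
	fmt.Println(<-ch, <-ch) // the marker, then frame 5
}
```

A real adapter would also update its Stats counters at the two points where `pending` is incremented.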
Drop telemetry: four levels
| Level | Mechanism | Default behavior |
|---|---|---|
| 1. Counters | Adapter.Stats(): running counters (frames_emitted, frames_dropped, drop_events, last_drop_at, longest_drop_burst) | Always on; lock-free; pollable at any time |
| 2. Logs | Structured log line per drop event (not per dropped frame): stream_id, sequence range, duration | WARN level by default |
| 3. Audit | audit/v1 event on drop bursts > N frames | Emitted when audit/v1 is loaded |
| 4. Loud escalation | Drop rate exceeds threshold over a window → callback, ERROR log, optional UI indicator | Default trigger: > 1% of frames over 60 s (both configurable) |
Engineering effort goes into preventing drops; telemetry exists to ensure drops can't hide.
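Levels 1 and 4 can be illustrated with atomic counters and a simple rate check. This is a sketch under stated assumptions: the `counters` type is hypothetical, frames_emitted is taken to exclude dropped frames, and the 60-second window is not modeled (a real implementation would compare snapshots taken at the window boundaries). It uses `atomic.Uint64`, which requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counters holds running totals that any goroutine may read without locks.
type counters struct {
	framesEmitted atomic.Uint64
	framesDropped atomic.Uint64
}

// dropRatePct returns dropped frames as a percentage of all frames offered.
func (c *counters) dropRatePct() float64 {
	emitted := c.framesEmitted.Load()
	dropped := c.framesDropped.Load()
	total := emitted + dropped
	if total == 0 {
		return 0
	}
	return 100 * float64(dropped) / float64(total)
}

// shouldEscalate applies the loud-escalation threshold (default 1.0 percent).
func (c *counters) shouldEscalate(thresholdPct float64) bool {
	return c.dropRatePct() > thresholdPct
}

func main() {
	var c counters
	c.framesEmitted.Add(990)
	c.framesDropped.Add(10)
	fmt.Println(c.dropRatePct(), c.shouldEscalate(1.0))
}
```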
Sample-rate conversion: in core, not in adapters
Resampling lives between the adapter channel and the consumer. Adapters do NOT resample.
Rules
- Adapter declares its native sample rates in Capabilities.SampleRates.
- Core opens the adapter at the native rate closest to what was requested (not necessarily the requested rate itself). E.g., if the consumer wants 16 kHz and the device natively does {48, 44.1, 16} kHz, the adapter opens at 16 kHz and no conversion is needed.
- If consumer's requested rate is not natively supported, the core's resampler converts the adapter's native-rate output to the requested rate.
- No-op fast path: when requested == native, the resampler is a zero-copy pass-through. No allocation, no math.
- Per-consumer subscription. Multiple consumers may subscribe to the same adapter stream at different rates; the core resampler handles each.
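The rate-selection rule can be sketched as a nearest-match search over the adapter's declared native rates. `pickNativeRate` is a hypothetical helper; the tie-breaking rule (first listed rate wins) is an assumption, since the contract does not specify one.

```go
package main

import "fmt"

// absDiff returns the absolute difference between two rates.
func absDiff(a, b uint32) uint32 {
	if a > b {
		return a - b
	}
	return b - a
}

// pickNativeRate returns the native rate closest to the requested one and
// whether the core resampler is needed. With no native rates it returns 0.
func pickNativeRate(requested uint32, native []uint32) (rate uint32, needsResample bool) {
	if len(native) == 0 {
		return 0, false
	}
	best := native[0]
	for _, r := range native[1:] {
		if absDiff(r, requested) < absDiff(best, requested) {
			best = r
		}
	}
	return best, best != requested
}

func main() {
	native := []uint32{48000, 44100, 16000}
	fmt.Println(pickNativeRate(16000, native)) // native match: no resampling
	fmt.Println(pickNativeRate(22050, native)) // nearest is 16000: resample in core
}
```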
Library choice (v1)
- Pure-Go resampler (no CGo) to preserve the single-static-binary story across macOS / Windows / Linux on Intel + Apple Silicon + amd64 + arm64.
- Quality target: equivalent to libsamplerate SINC_FASTEST for speech.
- If quality benchmarks show real degradation on speech vs. libsamplerate, ship a CGo fallback via build tag (not the default).
Multi-stream: one adapter per source, SessionID for correlation
Each adapter produces exactly one stream. Multi-source scenarios (online
call = system audio + your mic; hybrid meeting = remote dial-in + in-room
lapels) are handled by loading multiple adapters, correlated via SessionID
on each Frame.
Why two adapters instead of one multi-channel stream
- Uniform pipeline: every stream is a stream. No source-kind-dependent channel-layout conventions for consumers to encode.
- Cross-platform parity: Windows (WASAPI loopback) and Linux (PipeWire monitor) already require two captures regardless of approach. Two adapters matches reality on every platform.
- Per-stream routing is one config away — your voice → fast local ASR; remote voices → accurate cloud ASR. This is the product (intent routing); multi-channel single-stream would force a split downstream anyway.
- Generalizes to N sources (hybrid meetings, multi-mic rooms).
- Independent drop accounting per source.
- Per-stream ASR backend selection falls out for free.
Correlation
The orchestrator (a higher-level surface, out of scope for capture/v1)
opens both adapters together with the same SessionID. Each frame carries
that SessionID. Downstream stages correlate by it. Time-base alignment
between streams uses CapturedAt wall-clock; sub-millisecond drift between
two OS audio clocks is acceptable for ASR / intent routing.
Hot-swap and disconnect: pause-then-fallback by default
When a device disappears mid-stream (USB unplug, Bluetooth drop, OS device removal), behavior is pause-then-fallback by default, with per-source-kind overrides.
Default policy per source kind
| Source kind | Default policy | Reasoning |
|---|---|---|
| self | pause-fallback (10 s wait → default device → hard fail) | Dictation should survive a headset reconnect; if the cable's truly gone, falling back to the laptop mic keeps the user productive |
| in-person | pause-fallback (same) | Default device is usually an acceptable fallback |
| online | pause-only (10 s wait → hard fail, no fallback) | "Fallback to default" doesn't make sense for system-audio loopback; silently capturing the wrong source is worse than failing |
| file | hard-fail | File disconnect = something broken (disk error); don't recover |
All configurable: capture.disconnect_policy: pause-fallback | pause-only | hard-fail | auto-restart and capture.disconnect_timeout: <duration> (default 10 s).
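The per-kind defaults and the config override can be expressed as a small lookup. The Go type and function names here are hypothetical, and the behavior for an unrecognized kind (hard-fail) is an assumption made for the sketch.

```go
package main

import "fmt"

type SourceKind string
type DisconnectPolicy string

const (
	KindSelf     SourceKind = "self"
	KindInPerson SourceKind = "in-person"
	KindOnline   SourceKind = "online"
	KindFile     SourceKind = "file"

	PauseFallback DisconnectPolicy = "pause-fallback"
	PauseOnly     DisconnectPolicy = "pause-only"
	HardFail      DisconnectPolicy = "hard-fail"
)

// defaultDisconnectPolicy mirrors the per-kind defaults in the table above.
func defaultDisconnectPolicy(kind SourceKind) DisconnectPolicy {
	switch kind {
	case KindSelf, KindInPerson:
		return PauseFallback
	case KindOnline:
		return PauseOnly
	default:
		return HardFail // file, and (as an assumption) any unknown kind
	}
}

// resolvePolicy prefers an explicit capture.disconnect_policy value and
// falls back to the source-kind default when none is configured.
func resolvePolicy(kind SourceKind, configured DisconnectPolicy) DisconnectPolicy {
	if configured != "" {
		return configured
	}
	return defaultDisconnectPolicy(kind)
}

func main() {
	fmt.Println(resolvePolicy(KindOnline, ""))
	fmt.Println(resolvePolicy(KindSelf, HardFail))
}
```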
Stream identity on disconnect
| Event | StreamID | Sequence | Action |
|---|---|---|---|
| Transient disconnect → same device returns within timeout | Same | Continues with gap | Emit Frame{IsDeviceMarker: true, DeviceMarkerReason: "transient_reconnect"} |
| Fallback to different device | New | Restarts at 0 | Close old channel; new channel with new StreamID; orchestrator stitches via SessionID |
| Timeout expired, no fallback available | StreamID closes | — | Close channel; return ErrDeviceNotFound via the async error path |
Device change == new StreamID because downstream caches (ASR acoustic baseline, speaker embeddings, VAD threshold) are stream-scoped and become stale on device change. A new StreamID forces a clean reset.
Telemetry: four event types
Every disconnect emits a structured event (same telemetry path as drop events):
- capture.device_disconnected: adapter detected loss
- capture.device_reconnected: same device returned within timeout
- capture.device_changed: fell back to a different device (new StreamID)
- capture.device_failed: timeout expired, no fallback, stream closed
Default log level for device_changed: WARN (silent fallback would be
the wrong UX — the user must know their mic switched).
Adapter interface
The interface is presented in language-neutral pseudocode. The reference binding will be Go (the open core is Go); bindings for other languages are derivative.
Adapter {
  // Identity ----------------------------------------------------------------
  Kind()         -> SourceKind       # "self" | "in-person" | "online" | "file"
  Name()         -> string           # adapter identifier (e.g., "coreaudio", "wasapi")
  Capabilities() -> Capabilities     # what this adapter supports

  // Lifecycle ---------------------------------------------------------------
  Open(req: OpenRequest) -> OpenResult | Error
  Start(ctx: Context, frames: chan<- Frame, errs: chan<- Error) -> Error
  Pause()  -> Error                  # optional; capabilities.Pausable
  Resume() -> Error                  # optional; capabilities.Pausable
  Stop()   -> Error                  # closes `frames` channel
  Close()  -> Error

  // Diagnostics -------------------------------------------------------------
  Stats() -> Stats                   # frames_emitted, frames_dropped, drop_events,
                                     # last_drop_at, longest_drop_burst,
                                     # device_disconnect_events, etc.
}
Capabilities {
  SampleRates   []uint32      # rates the adapter can produce natively
  ChannelCounts []uint8       # channel counts supported
  Encodings     []Encoding    # always includes "f32"; MAY include "i16"
  Pausable      bool          # supports Pause/Resume mid-stream
  PTT           bool          # supports push-to-talk (silent by default, audio while the key is held)
  DeviceList    []DeviceInfo  # available devices (for kinds where this is meaningful)
  HotSwap       bool          # supports pause-then-resume on device reconnect
}
OpenRequest {
  DeviceID          string            # optional; specific device from Capabilities.DeviceList
  SampleRate        uint32            # requested rate; core selects nearest native at Open(),
                                      # then resamples in core if needed
  Channels          uint8             # requested channel count
  Encoding          Encoding          # requested encoding; "f32" default
  FrameSizeHint     uint32            # preferred samples per channel per frame (e.g., 320 = 20ms@16k)
  Mode              CaptureMode       # "always-on" | "push-to-talk"
  PTTKey            string            # for "push-to-talk" mode; platform-specific
  SessionID         string            # optional; correlation hook stamped onto every emitted Frame

  # Backpressure + disconnect policy (defaults from source kind)
  BufferFrames      uint32            # channel capacity (default 64)
  DropPolicy        DropPolicy        # "drop-newest" (default) | "drop-oldest" | "drop-window" | "block"
  DisconnectPolicy  DisconnectPolicy  # "pause-fallback" | "pause-only" | "hard-fail" | "auto-restart"
  DisconnectTimeout Duration          # default 10s
}
OpenResult {
  StreamID   string    # uuid4 for this stream
  SampleRate uint32    # actual native rate the adapter opened at
  Channels   uint8     # actual
  Encoding   Encoding  # actual
  FrameSize  uint32    # actual samples per channel per frame
  DeviceID   string    # actual device chosen
  DeviceName string    # human-readable device name
}
Lifecycle state machine
              Open()              Start(ctx, frames, errs)
[Closed] ---------------> [Opened] ----------------> [Running]
    ^                        |                           |
    |                        |                           | Pause()
    |                     Close()                        v
    |                        |                        [Paused]
    |                        v                           |
    +--- Close() --------[Closing] <--- Stop() ----+     |
                                                         | Resume()
                                                         |
                                              [Running] <+
Calling out of order is a programming error and the adapter MUST return an
explicit ErrInvalidState rather than crash. Close() is idempotent.
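A table-free way to enforce these transitions is sketched below. The `lifecycle` type is hypothetical; allowing Stop() from Paused is an assumption (the diagram does not show it explicitly), and Close() is treated as a no-op when the adapter is already Closed to satisfy the idempotence requirement.

```go
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Closed State = iota // zero value: a new adapter starts Closed
	Opened
	Running
	Paused
	Closing
)

var ErrInvalidState = errors.New("capture: invalid state")

// lifecycle tracks the adapter state; it assumes a single control thread.
type lifecycle struct{ state State }

// transition moves to `to` only if the current state is one of `from`.
func (l *lifecycle) transition(to State, from ...State) error {
	for _, s := range from {
		if l.state == s {
			l.state = to
			return nil
		}
	}
	return ErrInvalidState
}

func (l *lifecycle) Open() error   { return l.transition(Opened, Closed) }
func (l *lifecycle) Start() error  { return l.transition(Running, Opened) }
func (l *lifecycle) Pause() error  { return l.transition(Paused, Running) }
func (l *lifecycle) Resume() error { return l.transition(Running, Paused) }

// Stop from Paused is an assumption made for this sketch.
func (l *lifecycle) Stop() error { return l.transition(Closing, Running, Paused) }

// Close is idempotent: closing an already Closed adapter succeeds.
func (l *lifecycle) Close() error {
	if l.state == Closed {
		return nil
	}
	return l.transition(Closed, Opened, Closing)
}

func main() {
	l := &lifecycle{}
	fmt.Println(l.Start()) // out of order: ErrInvalidState
	fmt.Println(l.Open(), l.Start(), l.Stop(), l.Close(), l.Close())
}
```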
Concurrency
- Open, Start, Stop, Close, Pause, Resume are called from a single control thread. Adapters MAY assume serialization.
- Frames are written to the frames channel from a single adapter-owned thread. Adapters MUST NOT write from multiple threads concurrently.
- Async errors (e.g., device disconnect after Start()) are written to the errs channel from the same adapter-owned thread.
- Stats() is safe to call from any thread at any time.
Configuration schema
Configuration is per-adapter; the core just routes the values through. Each adapter MUST publish its schema. A common skeleton:
capture:
  adapter: coreaudio                  # or "wasapi", "pipewire", "loopback-zoom", etc.
  source_kind: self                   # "self" | "in-person" | "online" | "file"
  device_id: ""                       # empty = default device
  sample_rate: 16000                  # 0 = adapter default
  channels: 1                         # 0 = adapter default
  encoding: f32                       # "f32" | "i16"
  frame_size_hint: 0                  # 0 = let adapter pick (~20ms)
  mode: always-on                     # or "push-to-talk"
  ptt_key: ""                         # required if mode=push-to-talk; platform-specific
  session_id: ""                      # set by orchestrator for multi-adapter sessions
  # Backpressure
  buffer_frames: 64                   # channel capacity
  drop_policy: drop-newest            # drop-newest | drop-oldest | drop-window | block
  drop_alert_threshold_pct: 1.0       # loud-escalation trigger
  drop_alert_window_sec: 60
  # Disconnect
  disconnect_policy: pause-fallback   # pause-fallback | pause-only | hard-fail | auto-restart
  disconnect_timeout: 10s
  # adapter-specific knobs go here:
  options:
    foo: bar
The core validates the generic fields. Anything under options is passed
through to the adapter verbatim.
Error model
All errors are typed:
Error {
  Kind    ErrorKind  # see below
  Adapter string     # adapter name
  Op      string     # which method was being called
  Message string     # human-readable, no PII
  Cause   Error?     # optional wrapped cause
}
ErrorKind {
  ErrInvalidState         # called out of order
  ErrInvalidConfig        # config rejected at Open()
  ErrUnsupported          # capability requested that this adapter doesn't have
  ErrDeviceNotFound       # named device not available
  ErrDeviceBusy           # device exists but is in use by another process
  ErrDeviceDisconnected   # device disappeared after Start() (async error channel)
  ErrPermissionDenied     # OS-level permission missing (see Permissions below)
  ErrPlatformUnsupported  # adapter cannot run on this OS / OS version
  ErrIO                   # transient device error
  ErrInternal             # bug in the adapter; the consumer SHOULD restart it
}
ErrPermissionDenied MUST include a Cause or message that names the
specific OS permission missing (e.g., "missing macOS microphone privacy
permission", "missing Windows microphone privacy setting"). The core
surfaces this directly to the user so they can fix it.
ErrDeviceDisconnected is the async error emitted on the errs channel
when a device disappears after Start(). The disconnect-policy machinery
then decides whether to pause-and-wait, fall back, or hard-fail.
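In a Go binding, the typed error could implement the standard `error` and `Unwrap` conventions so callers can inspect the kind anywhere in a wrapped chain. The `CaptureError` type and `isKind` helper below are hypothetical names for illustration, and only two kinds are declared.

```go
package main

import (
	"errors"
	"fmt"
)

type ErrorKind string

const (
	ErrPermissionDenied   ErrorKind = "ErrPermissionDenied"
	ErrDeviceDisconnected ErrorKind = "ErrDeviceDisconnected"
)

// CaptureError is a hypothetical Go rendering of the Error pseudocode.
type CaptureError struct {
	Kind    ErrorKind
	Adapter string
	Op      string
	Message string
	Cause   error
}

func (e *CaptureError) Error() string {
	return fmt.Sprintf("%s: %s: %s: %s", e.Adapter, e.Op, e.Kind, e.Message)
}

// Unwrap exposes the wrapped cause to errors.Is and errors.As.
func (e *CaptureError) Unwrap() error { return e.Cause }

// isKind reports whether err, or any CaptureError in its chain, has the kind.
func isKind(err error, kind ErrorKind) bool {
	var ce *CaptureError
	for errors.As(err, &ce) {
		if ce.Kind == kind {
			return true
		}
		err = ce.Cause // keep searching below this CaptureError
	}
	return false
}

func main() {
	inner := &CaptureError{
		Kind:    ErrPermissionDenied,
		Adapter: "coreaudio",
		Op:      "Open",
		Message: "missing macOS microphone privacy permission",
	}
	wrapped := fmt.Errorf("session start failed: %w", inner)
	fmt.Println(isKind(wrapped, ErrPermissionDenied))
	fmt.Println(wrapped)
}
```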
Permissions
Capture is a permission-sensitive surface. Adapters MUST declare which OS
permissions they require and check them eagerly at Open().
| OS | Permission | Required by |
|---|---|---|
| macOS | Microphone (Privacy & Security) | Any local mic adapter |
| macOS | Screen Recording | System-audio (online) adapters using ScreenCaptureKit |
| Windows | Microphone privacy setting | Any local mic adapter |
| Windows | (none specific) | WASAPI loopback |
| Linux | PipeWire / PulseAudio access | Any local mic adapter |
| Linux | PipeWire monitor source | System-audio (online) |
If a permission is missing, Open() MUST return ErrPermissionDenied. The
adapter MUST NOT block waiting for the user to grant permission — the host
application is responsible for prompting.
Per-source guidance
self: your own voice
Recommended adapters:
| OS | Adapter | Notes |
|---|---|---|
| macOS | coreaudio | CoreAudio HAL; works on Intel + Apple Silicon |
| Windows | wasapi | WASAPI shared mode for low latency |
| Linux | pipewire | PipeWire is the modern default; PulseAudio fallback |
Defaults: mono, 16 kHz, f32, 20 ms frame size, always-on,
disconnect_policy: pause-fallback. Push-to-talk adapters MAY ship later;
the contract supports it via Mode = push-to-talk and PTTKey.
in-person: meetings, 1:1s, room audio
Same OS adapters as self, but typically with a different device chosen
(lapel mic, conference array, paired phone). Channel count > 1 is common.
Defaults: 1–4 channels (device-determined), 48 kHz, f32,
disconnect_policy: pause-fallback. Downstream diarization (in
segment/v1) does the speaker math.
online: Zoom, Meet, Teams, Discord (remote-participant audio only)
The online source captures system audio loopback — what's coming out
of your speakers / headphones, where the remote participants' voices are.
The user's own voice is captured by a parallel self adapter; the two are
correlated via SessionID.
| OS | Adapter | Mechanism |
|---|---|---|
| macOS | screencapturekit-audio | macOS 13+ ScreenCaptureKit audio-only capture |
| macOS | blackhole-loopback | Fallback for older macOS or when SCK is unavailable; requires user-installed BlackHole driver |
| Windows | wasapi-loopback | WASAPI loopback (built into Windows) |
| Linux | pipewire-monitor | PipeWire monitor source on the default sink |
Defaults: 2 channels (system audio is typically stereo), 48 kHz, f32,
disconnect_policy: pause-only (no fallback).
file: recorded audio
A file adapter reads WAV / FLAC / Ogg / MP3 / Opus / WebM (subset of formats per adapter implementation) and emits frames. Decoding happens inside the adapter at the file boundary; the pipeline still sees PCM frames.
pace_mode is a first-class config on file adapters:
| pace_mode | Behavior | Use case |
|---|---|---|
| realtime (default) | Sleep between frames to maintain the file's natural duration cadence | Integration tests; mimics live capture timing |
| asap | Stream frames as fast as the consumer can drain | Batch processing; bulk re-transcription; CI |
| accelerated | Configurable speedup factor (2x, 4x, 10x) | Long-meeting replay during development |
Defaults: match the file's native format, pace_mode: realtime,
buffer_frames: 256, disconnect_policy: hard-fail.
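The three pace modes differ only in how long the file adapter waits between frames. A minimal sketch of that calculation follows; `frameInterval` and its `speedup` parameter are hypothetical names, and treating unknown modes as realtime is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

type PaceMode string

const (
	PaceRealtime    PaceMode = "realtime"
	PaceASAP        PaceMode = "asap"
	PaceAccelerated PaceMode = "accelerated"
)

// frameInterval returns how long to wait between frames for a given mode.
// frameSize is samples per channel per frame; speedup applies only to the
// accelerated mode (e.g., 4 means four times faster than real time).
func frameInterval(mode PaceMode, frameSize, sampleRate uint32, speedup float64) time.Duration {
	if sampleRate == 0 || mode == PaceASAP {
		return 0 // no pacing: emit as fast as the consumer drains
	}
	natural := time.Duration(frameSize) * time.Second / time.Duration(sampleRate)
	if mode == PaceAccelerated && speedup > 0 {
		return time.Duration(float64(natural) / speedup)
	}
	return natural
}

func main() {
	fmt.Println(frameInterval(PaceRealtime, 320, 16000, 0))
	fmt.Println(frameInterval(PaceAccelerated, 320, 16000, 4))
	fmt.Println(frameInterval(PaceASAP, 320, 16000, 0))
}
```

Because asap removes the real-time constraint, it pairs naturally with the larger default buffer for file sources.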
Loader registration
Adapters are loaded at startup based on the capture.adapter config value.
Each adapter ships with a registration function:
RegisterCaptureAdapter(name: string, factory: () -> Adapter)
Registration is package-init in Go, equivalent in other languages. The core maintains a single registry; duplicate names panic at startup (intentional).
Enterprise plugins register the same way against the same registry — the core does not distinguish open vs. enterprise adapters at the loader level.
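A registry with these semantics could be sketched in Go as below. Only `RegisterCaptureAdapter` comes from the contract; the reduced `Adapter` interface, the `newAdapter` lookup, the `fileWAV` stub, and `registerPanics` are assumptions for illustration. The map is not synchronized, which is acceptable only because registration is assumed to happen during package initialization.

```go
package main

import "fmt"

// Adapter is reduced to a single method for illustration.
type Adapter interface {
	Name() string
}

// registry maps adapter names to factories; populated at init time.
var registry = map[string]func() Adapter{}

// RegisterCaptureAdapter adds a factory; duplicate names panic at startup.
func RegisterCaptureAdapter(name string, factory func() Adapter) {
	if _, exists := registry[name]; exists {
		panic("capture: duplicate adapter name " + name)
	}
	registry[name] = factory
}

// newAdapter builds the adapter named by the capture.adapter config value.
func newAdapter(name string) (Adapter, bool) {
	factory, ok := registry[name]
	if !ok {
		return nil, false
	}
	return factory(), true
}

// fileWAV is a stub standing in for a real adapter implementation.
type fileWAV struct{}

func (fileWAV) Name() string { return "file-wav" }

func init() {
	RegisterCaptureAdapter("file-wav", func() Adapter { return fileWAV{} })
}

// registerPanics reports whether registering under name panics.
func registerPanics(name string) (panicked bool) {
	defer func() { panicked = recover() != nil }()
	RegisterCaptureAdapter(name, func() Adapter { return fileWAV{} })
	return false
}

func main() {
	a, ok := newAdapter("file-wav")
	fmt.Println(a.Name(), ok)
	fmt.Println(registerPanics("file-wav")) // duplicate name
}
```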
Versioning and stability
capture/v1 is the contract above. Once frozen:
- Non-breaking changes (allowed in v1.x): adding optional fields with sensible defaults to OpenRequest, OpenResult, Capabilities, Stats, or Frame; adding new ErrorKind values; adding new Encoding values (e.g., opus, flac); adding new source kinds via the "custom kind" escape hatch.
- Breaking changes (require v2): removing or renaming any existing field or method; changing the meaning of an existing field; changing Frame.Encoding semantics for existing values; changing channel layout convention.
The core supports one vN of capture/ at a time, with overlap during
migrations. Adapters declare which version they target via their Name()
return value or a parallel SupportedVersions() method (TBD before freeze).
Reference implementations (planned)
| Order | Adapter | Source kind | OS | Why |
|---|---|---|---|---|
| 1 | file-wav | file | all | First. Unblocks tests for every downstream surface with deterministic, reproducible input. Required for CI without audio hardware. Supports pace_mode to exercise real-time pipeline timing without a live mic. |
| 2 | coreaudio | self | macOS | First live-capture adapter; primary dev platform. |
| 3 | wasapi | self | Windows | Second live-capture; verifies cross-platform contract. |
| 4 | pipewire | self | Linux | Third live-capture; closes Linux tier-1 support. |
| 5 | screencapturekit-audio | online | macOS 13+ | First online adapter; pairs with coreaudio via SessionID. |
| 6 | wasapi-loopback | online | Windows | Cross-platform parity. |
| 7 | pipewire-monitor | online | Linux | Cross-platform parity. |
The build order matches the contract drafting order: get a file substrate working end-to-end with the full pipeline first, then add live capture once the downstream stages are stable.
Project principle: opinionated defaults, every default configurable
Throughout this contract, every behavior with a defensible default
(buffer_frames: 64, drop_policy: drop-newest, disconnect_timeout: 10s,
etc.) is exposed as a config knob. The defaults reflect a considered
recommendation for the typical voice-to-LLM use case; the knobs exist so
specialized workflows can tune them.
This principle applies to all future Vox extension surfaces.