Pipeline Overview

Blackrim Vox processes audio through five stages: capture, segment, ASR, router, and sinks. Each stage is a pluggable extension point with a versioned contract; operators configure which backend runs at each stage without touching the core.

flowchart LR
    A([Audio Source]) --> B[Capture]
    B --> C[Segment]
    C --> D[ASR]
    D --> E[Router]
    E --> F1[Sink: local-file]
    E --> F2[Sink: LLM]
    E --> F3[Sink: S3 / email / bd]

    style B fill:#1e3a5f,color:#e0e0e0
    style C fill:#1e3a5f,color:#e0e0e0
    style D fill:#1e3a5f,color:#e0e0e0
    style E fill:#1e3a5f,color:#e0e0e0
    style F1 fill:#1a3a2a,color:#e0e0e0
    style F2 fill:#1a3a2a,color:#e0e0e0
    style F3 fill:#1a3a2a,color:#e0e0e0

Audio arrives at the Capture stage from any registered source — microphone, system audio tap, a WAV file, or a streaming RTP feed. Capture emits raw PCM frames and nothing else; it has no opinion about what speech is in them.

The Segment stage watches those frames for speech boundaries. When it detects the end of an utterance — by energy silence, voice activity detection, or a configurable maximum duration — it closes the segment and hands a bounded audio buffer to ASR.

The ASR (automatic speech recognition) backend transcribes the segment into text plus metadata: transcript, language tag, confidence score, timing, and speaker label. Vox ships adapters for local Whisper, Deepgram, AssemblyAI, and Azure; the asr/v1 extension point lets you register any backend.

The Router receives a completed transcript envelope and classifies the intent. Based on a policy manifest, it dispatches the envelope — in parallel if needed — to one or more Sinks. Sinks are the output destinations: the local JSONL archive, an LLM API, S3-compatible object storage, email, or the bd task tracker.

Every stage emits structured audit events. Every inter-stage boundary enforces the role-action matrix from RBAC before the payload moves forward. Neither behavior is optional and neither requires configuration — the defaults are secure.

Capture

Capture is the entry point of the pipeline. Its sole responsibility is to present a continuous stream of raw PCM audio frames to the rest of the pipeline — normalized to 16 kHz mono 16-bit signed integers — regardless of the physical source.

The stage abstracts over hardware diversity. On macOS, capture reads from the Core Audio HAL. On Linux, it reads from ALSA or PulseAudio. For file-based input, it decodes WAV/FLAC/OGG and feeds the decoded samples. For streaming input — RTP, HLS, or a TCP socket — it buffers and resamples to the canonical format. All of this is transparent to downstream stages.
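As a rough illustration of the normalization step, the sketch below downmixes interleaved stereo float samples to mono 16-bit signed PCM. The function name and the float32 input layout are assumptions made for this example, not part of the capture/v1 contract, and resampling to 16 kHz is left out.

```go
package main

import "fmt"

// downmixToMono16 converts interleaved stereo float32 samples in [-1, 1]
// to mono 16-bit signed PCM by averaging the two channels.
// Hypothetical helper: resampling to 16 kHz is not covered here.
func downmixToMono16(stereo []float32) []int16 {
	mono := make([]int16, 0, len(stereo)/2)
	for i := 0; i+1 < len(stereo); i += 2 {
		avg := (stereo[i] + stereo[i+1]) / 2
		// Clamp before scaling so out-of-range input cannot overflow int16.
		if avg > 1 {
			avg = 1
		} else if avg < -1 {
			avg = -1
		}
		mono = append(mono, int16(avg*32767))
	}
	return mono
}

func main() {
	frames := downmixToMono16([]float32{1, 1, -1, -1, 0.5, -0.5})
	fmt.Println(frames)
}
```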

Capture is stateful in exactly one way: it maintains the session and stream identifiers that propagate through every envelope the pipeline produces. A session begins when capture opens a source and ends when it closes it; a stream is a logical subdivision within a session (e.g. a meeting's participant track). Both IDs appear in every audit event and every sink payload.

The capture stage gates on the capture:start and capture:stop RBAC actions. An operator whose role does not carry capture:start cannot open a capture session; the attempt is rejected before any audio is read and an audit event is emitted with the denial reason.
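One way to picture the gate is a lookup in a role-action matrix before the source is opened. The sketch below is only an illustration: the `Matrix` type, the `Check` method, the role names, and the error value are all hypothetical, and only the action names `capture:start` and `capture:stop` come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDenied is a hypothetical sentinel error for a failed RBAC check.
var ErrDenied = errors.New("rbac: action denied")

// Matrix maps a role name to the set of actions that role carries.
type Matrix map[string]map[string]bool

// Check returns nil when role carries action, and an error wrapping
// ErrDenied (with the denial reason) otherwise.
func (m Matrix) Check(role, action string) error {
	if m[role][action] {
		return nil
	}
	return fmt.Errorf("%w: role %q lacks %q", ErrDenied, role, action)
}

func main() {
	m := Matrix{
		"operator": {"capture:start": true, "capture:stop": true},
		"auditor":  {},
	}
	// The check runs before any audio is read; a denial would be audited.
	fmt.Println(m.Check("operator", "capture:start"))
	fmt.Println(m.Check("auditor", "capture:start"))
}
```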

The capture/v1 extension contract is fully documented in Extension Points. Implementing the interface and registering the adapter with the binary is enough to make a custom source available to the rest of the pipeline with no changes to downstream stages.

Segment

Segmentation turns a continuous PCM stream into discrete, bounded audio buffers — one per utterance — that ASR can transcribe in a single call. Without segmentation, a live microphone feed would be unbounded and untranscribable.

The default segmenter is an energy-based VAD (voice activity detection) algorithm. It tracks the rolling RMS of incoming frames. When the RMS stays below a calibrated silence threshold for a minimum silence duration, it closes the current segment. On startup, the segmenter auto-calibrates the threshold against a short ambient noise sample; operators can override it via --vad-threshold or in ship.toml.
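The idea behind energy-based segmentation can be sketched in a few lines. This is a simplified illustration, not the shipped segmenter: it uses per-frame RMS rather than a rolling window, and the function names, threshold value, and frame-count parameter are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// rms returns the root-mean-square amplitude of one frame of 16-bit PCM.
func rms(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// segmentEnd returns the index of the frame that completes minSilent
// consecutive frames below threshold, or -1 if the segment stays open.
func segmentEnd(frames [][]int16, threshold float64, minSilent int) int {
	silent := 0
	for i, f := range frames {
		if rms(f) < threshold {
			silent++
			if silent >= minSilent {
				return i
			}
		} else {
			silent = 0 // speech resumed, so restart the silence count
		}
	}
	return -1
}

func main() {
	loud := []int16{4000, -4000, 4000, -4000}
	quiet := []int16{50, -50, 50, -50}
	frames := [][]int16{loud, loud, quiet, loud, quiet, quiet, quiet}
	fmt.Println(segmentEnd(frames, 500, 3))
}
```

A single quiet frame between loud ones does not close the segment here, which is the reason for requiring a minimum silence duration rather than reacting to the first frame below the threshold.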

Segments are bounded by a configurable maximum duration (--max-segment-ms, default 30000 ms, i.e. 30 s). If an utterance exceeds the maximum (a long monologue, a song, a recording played back through the mic), the segmenter emits a partial segment and immediately opens a new one. The partial is flagged in the envelope metadata so downstream consumers can decide whether to stitch or discard.
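The maximum-duration rule can be pictured as slicing a long utterance into fixed-size pieces, with every piece except the last marked partial. The `Segment` struct and `splitByMax` function below are hypothetical names for illustration; the real envelope metadata may look different.

```go
package main

import "fmt"

// Segment is a hypothetical bounded slice of the audio timeline.
type Segment struct {
	StartMs, EndMs int
	Partial        bool // true when the segment was cut by the maximum duration
}

// splitByMax cuts the utterance [startMs, endMs) into pieces no longer
// than maxMs. Every piece that was cut short is flagged Partial; the
// final piece, which ends at the real utterance boundary, is not.
func splitByMax(startMs, endMs, maxMs int) []Segment {
	if maxMs <= 0 {
		return []Segment{{startMs, endMs, false}}
	}
	var out []Segment
	for startMs+maxMs < endMs {
		out = append(out, Segment{startMs, startMs + maxMs, true})
		startMs += maxMs
	}
	return append(out, Segment{startMs, endMs, false})
}

func main() {
	// A 70-second monologue with a 30-second maximum.
	for _, s := range splitByMax(0, 70000, 30000) {
		fmt.Printf("%+v\n", s)
	}
}
```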

Speaker diarization is an optional second pass within the segment stage. When enabled, the segmenter runs a lightweight embedding model against the closed segment, assigns a speaker label (self, other, or a numeric cluster ID), and attaches it to the buffer. The label propagates through ASR and appears in every sink payload. Diarization is off by default because it adds ~50 ms of latency per segment on typical hardware.

The segment boundary also marks where intent-boundary detection fires. An intent-boundary detector is a classifier that looks at the partial transcript of the segment in flight (or the trailing acoustic features) and signals "this segment contains a command boundary." The router uses that signal to decide whether to wait for the next segment before dispatching or to act immediately. Custom intent-boundary detectors implement the segment/v1 extension contract.

ASR

The ASR stage takes a closed audio segment and returns a transcript envelope: the recognized text, the language tag, a confidence score, start and end timestamps, and the speaker label passed through from segmentation.

Vox ships five ASR adapters out of the box. The echo adapter does no speech recognition: it measures segment duration and RMS and returns a placeholder, so the full pipeline can be exercised without installing any model or supplying any API key. The whisper-cli adapter shells out to a locally installed whisper-cli binary and a ggml-*.bin model file; transcription is fully on-device and air-gap safe. The Deepgram and AssemblyAI adapters call the respective cloud APIs over HTTPS with the operator's BYOK credentials; no credential ever passes through a Vox proxy. The Azure adapter calls the Azure AI Speech REST API with the same BYOK model.

All adapters implement the asr/v1 interface, which takes (context.Context, AudioBuffer) and returns (Transcript, error). The interface is intentionally minimal — it imposes no opinion on streaming vs. batch recognition, on language detection, or on word-level timing. Adapters that support richer outputs (word timestamps, alternative transcripts, raw confidence vectors) populate the extensions map on the Transcript struct; downstream stages that don't need those fields ignore them.
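To make the shape of that contract concrete, here is a minimal sketch of an echo-style adapter. Only the (context.Context, AudioBuffer) to (Transcript, error) signature comes from the text; the method name `Transcribe`, every struct field, and the `transcribeWithTimeout` helper are assumptions for illustration. See Extension Points for the actual definitions.

```go
package main

import (
	"context"
	"fmt"
	"math"
	"time"
)

// AudioBuffer and Transcript are hypothetical stand-ins for the real types.
type AudioBuffer struct {
	Samples    []int16
	SampleRate int
	Speaker    string
}

type Transcript struct {
	Text       string
	Confidence float64
	Duration   time.Duration
	Speaker    string
	Extensions map[string]any // optional richer outputs
}

// Backend mirrors the described asr/v1 shape; the method name is assumed.
type Backend interface {
	Transcribe(ctx context.Context, buf AudioBuffer) (Transcript, error)
}

// echo performs no recognition: it reports duration and RMS only.
type echo struct{}

func (echo) Transcribe(ctx context.Context, buf AudioBuffer) (Transcript, error) {
	if err := ctx.Err(); err != nil {
		return Transcript{}, err
	}
	if buf.SampleRate <= 0 {
		return Transcript{}, fmt.Errorf("invalid sample rate %d", buf.SampleRate)
	}
	var sum float64
	for _, s := range buf.Samples {
		sum += float64(s) * float64(s)
	}
	rms := 0.0
	if n := len(buf.Samples); n > 0 {
		rms = math.Sqrt(sum / float64(n))
	}
	dur := time.Duration(len(buf.Samples)) * time.Second / time.Duration(buf.SampleRate)
	return Transcript{
		Text:       fmt.Sprintf("[echo %v rms=%.0f]", dur, rms),
		Duration:   dur,
		Speaker:    buf.Speaker, // passed through from segmentation
		Extensions: map[string]any{"rms": rms},
	}, nil
}

// transcribeWithTimeout shows how a caller might bound a single ASR call.
func transcribeWithTimeout(b Backend, buf AudioBuffer, d time.Duration) (Transcript, error) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return b.Transcribe(ctx, buf)
}

func main() {
	buf := AudioBuffer{Samples: make([]int16, 16000), SampleRate: 16000, Speaker: "self"}
	tr, err := transcribeWithTimeout(echo{}, buf, time.Second)
	fmt.Println(tr.Text, err)
}
```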

ASR gates on the asr:transcribe RBAC action. A session whose associated operator role does not carry asr:transcribe will have every segment returned as a zero-confidence empty transcript and an audit event recording the denial. The session is not terminated; capture and segmentation continue so that partial audio is not silently dropped when a permission boundary changes mid-session.

Latency at the ASR stage dominates end-to-end pipeline latency. The whisper-cli adapter takes 200–800 ms per segment depending on model size and hardware; cloud adapters typically return in 100–300 ms depending on network RTT and segment length. The pipeline does not serialize segments through ASR: if the segmenter closes two segments before the first ASR call returns, both calls are in flight concurrently, and the resulting envelopes are put back into their original segment order before routing.
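The concurrency-plus-reordering behavior can be sketched with a small reorder buffer keyed by sequence number. This is an illustrative pattern under assumed names (`result`, `reorder`, `transcribeConcurrently`), not the pipeline's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// result is a hypothetical transcript tagged with its segment sequence number.
type result struct {
	Seq  int
	Text string
}

// reorder reads results that may arrive out of order and emits them in
// Seq order starting at 0, holding back any result that arrives early.
func reorder(in <-chan result, out chan<- result) {
	pending := map[int]result{}
	next := 0
	for r := range in {
		pending[r.Seq] = r
		for {
			ready, ok := pending[next]
			if !ok {
				break
			}
			delete(pending, next)
			out <- ready
			next++
		}
	}
	close(out)
}

// transcribeConcurrently starts one ASR call per segment without waiting
// for earlier calls, then returns the results in segment order.
func transcribeConcurrently(segments []string, asr func(string) string) []result {
	in := make(chan result)
	out := make(chan result)
	var wg sync.WaitGroup
	for i, seg := range segments {
		wg.Add(1)
		go func(seq int, seg string) {
			defer wg.Done()
			in <- result{Seq: seq, Text: asr(seg)}
		}(i, seg)
	}
	go func() { wg.Wait(); close(in) }()
	go reorder(in, out)

	var ordered []result
	for r := range out {
		ordered = append(ordered, r)
	}
	return ordered
}

func main() {
	// The first segment is deliberately the slowest to transcribe.
	slowFirst := func(seg string) string {
		if seg == "seg-0" {
			time.Sleep(20 * time.Millisecond)
		}
		return "text of " + seg
	}
	for _, r := range transcribeConcurrently([]string{"seg-0", "seg-1", "seg-2"}, slowFirst) {
		fmt.Println(r.Seq, r.Text)
	}
}
```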

Router

The router receives a completed transcript envelope and decides where it goes. It applies a policy manifest — a YAML file that maps intent patterns to sink lists — and dispatches the envelope to every matching sink in parallel.

Intent classification happens first. The router runs the envelope's transcript through a registered intent classifier; the default classifier is a lightweight keyword-and-regex model that ships with the binary and requires no API key. When an LLM intent classifier is configured, the router sends the transcript to the LLM and waits for a structured intent response before proceeding. The intent tag is attached to the envelope and used for routing decisions and for audit.
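A keyword-and-regex classifier of the kind described can be very small. The sketch below is an assumption-laden illustration: the intent names, the patterns, and the first-match-wins policy are invented for the example and are not the rules that ship with the binary.

```go
package main

import (
	"fmt"
	"regexp"
)

// intentRule pairs a hypothetical intent tag with the pattern that triggers it.
type intentRule struct {
	Intent  string
	Pattern *regexp.Regexp
}

// rules are evaluated in order; the first matching rule wins.
var rules = []intentRule{
	{"task.create", regexp.MustCompile(`(?i)\b(remind me|todo|create (a )?task)\b`)},
	{"note.save", regexp.MustCompile(`(?i)\b(note|write down)\b`)},
}

// classify returns the intent tag for a transcript, or "unknown" when
// no rule matches.
func classify(transcript string) string {
	for _, r := range rules {
		if r.Pattern.MatchString(transcript) {
			return r.Intent
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(classify("Remind me to call Sam tomorrow"))
	fmt.Println(classify("Please write down the address"))
	fmt.Println(classify("Nice weather today"))
}
```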

Routing rules are additive. A single envelope can match multiple rules and be dispatched to multiple sinks simultaneously. Each sink dispatch is independent — a failure in one sink does not affect delivery to others. The router tracks the delivery outcome of each dispatch and includes the full routing ledger (matched rules, sink identities, delivery status) in the audit event it emits.
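The additive matching and independent dispatch described above might look roughly like the following. The `Rule` and `LedgerEntry` types, the `"*"` wildcard, and the status strings are hypothetical; the real policy manifest is a YAML file whose schema is not shown on this page.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSinkDown is a hypothetical delivery error used by the demo.
var errSinkDown = errors.New("sink unavailable")

// Rule maps an intent (or "*" for any intent) to a list of sink IDs.
type Rule struct {
	Name   string
	Intent string
	Sinks  []string
}

// LedgerEntry records the outcome of one sink dispatch.
type LedgerEntry struct {
	Rule, Sink, Status string
}

// route collects every (rule, sink) pair that matches intent, dispatches
// them in parallel, and records each outcome independently.
func route(intent string, rules []Rule, deliver func(sink string) error) []LedgerEntry {
	var ledger []LedgerEntry
	for _, r := range rules {
		if r.Intent != intent && r.Intent != "*" {
			continue
		}
		for _, s := range r.Sinks {
			ledger = append(ledger, LedgerEntry{Rule: r.Name, Sink: s})
		}
	}
	var wg sync.WaitGroup
	for i := range ledger {
		wg.Add(1)
		go func(e *LedgerEntry) {
			defer wg.Done()
			// A failure here affects only this entry, not its siblings.
			if err := deliver(e.Sink); err != nil {
				e.Status = "failed: " + err.Error()
				return
			}
			e.Status = "delivered"
		}(&ledger[i])
	}
	wg.Wait()
	return ledger
}

func main() {
	rules := []Rule{
		{"archive-all", "*", []string{"local-file"}},
		{"tasks", "task.create", []string{"bd", "llm"}},
	}
	deliver := func(sink string) error {
		if sink == "llm" {
			return errSinkDown
		}
		return nil
	}
	for _, e := range route("task.create", rules, deliver) {
		fmt.Printf("%+v\n", e)
	}
}
```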

The router gates on the router:dispatch RBAC action. Envelopes from sessions whose role does not carry that action are logged — an audit event records the intent and the routing block — and then dropped. Partial delivery (some sinks permitted, some blocked by their own RBAC gate) is recorded per-sink in the routing ledger.

Burst limiting lives at the router boundary. The router enforces the operator's burst policy before dispatching to any sink. If the session has exceeded its tokens-per-minute or requests-per-minute budget, the router queues or drops the envelope according to the burst:policy configuration (queue, drop, or error). The burst event is always audited regardless of policy.
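A requests-per-minute budget with the three policies can be sketched as a sliding-window counter. Everything below other than the policy names queue, drop, and error is an assumption: the `Limiter` type, the `Decision` values, and the use of second-resolution timestamps are for illustration only, and token budgets are not modeled.

```go
package main

import "fmt"

// Policy names follow the burst:policy values mentioned in the text.
type Policy string

const (
	PolicyQueue Policy = "queue"
	PolicyDrop  Policy = "drop"
	PolicyError Policy = "error"
)

// Decision is a hypothetical outcome of one admission check.
type Decision string

const (
	Dispatch Decision = "dispatch"
	Queued   Decision = "queued"
	Dropped  Decision = "dropped"
	Rejected Decision = "rejected"
)

// Limiter tracks admitted requests over a sliding 60-second window.
type Limiter struct {
	PerMinute int
	Policy    Policy
	stamps    []int64 // admission times, in seconds
}

// Admit decides what happens to an envelope arriving at nowSec.
func (l *Limiter) Admit(nowSec int64) Decision {
	// Forget admissions that have left the 60-second window.
	kept := l.stamps[:0]
	for _, t := range l.stamps {
		if nowSec-t < 60 {
			kept = append(kept, t)
		}
	}
	l.stamps = kept

	if len(l.stamps) < l.PerMinute {
		l.stamps = append(l.stamps, nowSec)
		return Dispatch
	}
	// Over budget: the configured policy picks the outcome.
	switch l.Policy {
	case PolicyQueue:
		return Queued
	case PolicyDrop:
		return Dropped
	default:
		return Rejected
	}
}

func main() {
	l := &Limiter{PerMinute: 2, Policy: PolicyDrop}
	for _, now := range []int64{0, 1, 2, 61} {
		fmt.Println(now, l.Admit(now))
	}
}
```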

The router/v1 extension contract supports custom intent classifiers and custom routing predicates. A custom predicate takes the envelope and returns a list of sink IDs; the built-in keyword model is itself implemented as a predicate. See Extension Points for the full interface.

Sinks

Sinks are the pipeline's output layer. Every sink receives a complete transcript envelope — with intent, speaker, session context, and routing metadata attached — and is responsible for delivering or storing it.

Vox ships five sink adapters in the open-source edition. The local-file sink appends JSONL to a rotating archive under ~/.vox/archive/. The LLM sink forwards the transcript to a configured language model API (OpenAI, Anthropic, or any OpenAI-compatible endpoint) and captures the response in a follow-up envelope. The S3-compatible sink writes JSONL or Parquet to any S3-compatible bucket, such as AWS S3, MinIO, Cloudflare R2, or Backblaze B2. The email sink formats the transcript as a structured email and delivers it via SMTP. The bd sink creates or updates a beads issue with the transcript and intent as the body.

All sinks implement sink/v1 — a simple Deliver(context.Context, Envelope) error interface. The router calls each sink in its own goroutine with a copy of the envelope; sinks are not permitted to mutate the envelope. Failed deliveries are retried according to the sink's configured retry policy (default: 3 attempts, exponential backoff with jitter). After the retry budget is exhausted, the router records a permanent delivery failure in the audit log.
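The retry behavior can be sketched as a wrapper around Deliver. The `Deliver(context.Context, Envelope) error` shape and the default of 3 attempts come from the text; the `Envelope` fields, the 10 ms base delay, the full-jitter formula, and the `flaky` test sink are assumptions made for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Envelope is a hypothetical, trimmed-down transcript envelope.
type Envelope struct {
	Transcript, Intent string
}

// Sink mirrors the sink/v1 shape described above.
type Sink interface {
	Deliver(ctx context.Context, env Envelope) error
}

// deliverWithRetry calls Deliver up to attempts times, sleeping a random
// duration in [0, base*2^i) after the i-th failure (full jitter).
func deliverWithRetry(ctx context.Context, s Sink, env Envelope, attempts int, base time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	if base <= 0 {
		base = time.Millisecond
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = s.Deliver(ctx, env); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // retry budget exhausted, no point in sleeping
		}
		wait := time.Duration(rand.Int63n(int64(base << uint(i))))
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("permanent delivery failure after %d attempts: %w", attempts, err)
}

// deliverDefault applies the documented default of 3 attempts.
func deliverDefault(s Sink, env Envelope) error {
	return deliverWithRetry(context.Background(), s, env, 3, 10*time.Millisecond)
}

// flaky is a demo sink that fails a fixed number of times, then succeeds.
type flaky struct {
	failures int
	calls    int
}

func (f *flaky) Deliver(ctx context.Context, env Envelope) error {
	f.calls++
	if f.calls <= f.failures {
		return errors.New("transient sink error")
	}
	return nil
}

func main() {
	ok := &flaky{failures: 2}
	fmt.Println(deliverDefault(ok, Envelope{Transcript: "hello"}), ok.calls)
	bad := &flaky{failures: 5}
	fmt.Println(deliverDefault(bad, Envelope{Transcript: "hello"}), bad.calls)
}
```

Because the envelope is passed by value, each sink works on its own copy of the struct, which matches the rule that sinks may not mutate what the router holds.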

Sinks gate on their own RBAC action: sink:write:<sink-id>. An operator whose role does not carry the action for a specific sink is treated the same as if that sink did not match any routing rule for that envelope — the envelope is not delivered to that sink, and the audit event records the block. This means fine-grained sink-level access control is available without any changes to the routing manifest.

The open-core sink/v1 interface is the boundary between editions. Enterprise sinks — managed LLM subscription, centralized audit forwarders for Splunk/Datadog/Loki/Elasticsearch/syslog — implement the same Deliver interface and are drop-in extensions. Upgrading from open-source to enterprise does not require changes to capture, segmentation, ASR, or routing configuration.

Editions hook

The pipeline described on this page is the complete open-source edition. Every stage, every built-in adapter, and every extension point is Apache-2.0 licensed and ships in this repository.

The enterprise edition adds capabilities at the IAM, audit, and sink layers — SSO federation, centralized credential pooling, managed-LLM subscriptions, and audit forwarders — without touching or replacing any pipeline stage. Because enterprise features implement the same versioned extension contracts as any open-source plugin, they are load-time additions, not forks.

For the full edition comparison and upgrade path, see Editions.