Getting Started

Get Blackrim Vox running on your machine in about five minutes. This guide walks through cloning, building, your first file-based transcription, live mic input, and installing the local Whisper model for on-device speech recognition.

For platform-specific quick paths (macOS / Linux / Windows), see Platform Quick-Start.


Prerequisites

Requirement | Version | Install
Go | 1.23+ | macOS: brew install go · Linux: apt install golang-go / dnf install golang · Windows: go.dev/dl
git | any recent | macOS: built-in via Xcode CLT · Linux: apt install git · Windows: git-scm.com
make | GNU make | macOS: Xcode CLT · Linux: apt install make · Windows: MSYS2 or choco install make
Disk space | ~150 MB | For the local Whisper model (downloaded on demand in step 4)

macOS microphone permission

The first time you run --live, macOS prompts for microphone access. Grant it once; macOS remembers the choice and does not prompt again.

Already have Go?

Skip ahead to step 1. Running go version should print 1.23 or later.
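
If you would rather script the version check than eyeball it, the following is a minimal sketch. It assumes `go version` prints a token of the form `go<major>.<minor>` (as in `go version go1.23.4 darwin/arm64`); the function names are illustrative and not part of Vox.

```python
import re
import subprocess

MIN_GO = (1, 23)  # minimum version listed in the prerequisites


def parse_go_version(text):
    """Extract (major, minor) from `go version` output, or None if absent."""
    match = re.search(r"\bgo(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def go_is_new_enough(text, minimum=MIN_GO):
    """True when the parsed version is at least `minimum`."""
    version = parse_go_version(text)
    return version is not None and version >= minimum


if __name__ == "__main__":
    try:
        out = subprocess.run(
            ["go", "version"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("go not found on PATH")
    else:
        if go_is_new_enough(out):
            print("ok")
        else:
            print(f"need Go {MIN_GO[0]}.{MIN_GO[1]}+, found: {out.strip()}")
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where "1.9" sorts after "1.23".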


1 — Clone and build

git clone https://github.com/Blackrim-Vox/blackrim-vox.git
cd blackrim-vox
make build

This compiles cmd/vox with version metadata stamped in and writes the binary to ./bin/vox. No CGo, no external C libraries required for the base build.

./bin/vox --version

Expected output (tag varies):

vox dev (commit abc1234, built 2026-05-17T00:00:00Z)

2 — First file transcription (no API key, no network)

Vox ships a 2-second sine-wave WAV in testdata/ specifically for smoke-testing the pipeline end to end.

./bin/vox transcribe --asr echo testdata/sine-440hz-2s-16k-mono.wav

The echo backend doesn't do real speech recognition — it measures the segment's duration and RMS level and returns a placeholder. That's intentional: this step verifies the capture → segment → ASR → router → sink pipeline wires together cleanly before you install any model.
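
To make the idea concrete, here is a rough Python sketch of the kind of measurement the echo backend is described as doing: duration plus RMS level. It is not Vox's implementation. It assumes 16-bit PCM input and reports the level as RMS relative to a full-scale amplitude of 1.0; the function name `describe_wav` is hypothetical, and the output format imitates the sample transcript shown in this step.

```python
import math
import struct
import wave


def describe_wav(path):
    """Return a placeholder string with duration and RMS level in dBFS.

    Assumption: 16-bit little-endian PCM, level measured as RMS relative
    to full scale (so a full-scale sine reads about -3 dBFS).
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_frames = wav.getnframes()
        rate = wav.getframerate()
        raw = wav.readframes(n_frames)

    # Unpack every 16-bit sample across all channels.
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}h", raw[: count * 2])
    duration = n_frames / rate

    if count == 0:
        rms = 0.0
    else:
        rms = math.sqrt(sum(s * s for s in samples) / count) / 32768.0
    level = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return f"[{duration:.2f}s of audio at {level:.1f}dBFS]"
```

Because the placeholder depends only on length and loudness, a pure tone exercises the pipeline just as well as recorded speech.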

Expected terminal output:

ok: session=<session-id> stream=<stream-id>
    1 frames → 1 segments → 1 envelopes routed → 1 delivered (0 rejected)
    wall elapsed: <duration>
    sink dir: ~/.vox/archive

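The transcription result lands in ~/.vox/archive/sessions/<YYYY-MM-DD>/<session-id>.jsonl. The file holds one JSON object per line, one line per segment, with the schema pinned to internal/envelope. A single record, pretty-printed here for readability: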

{
  "envelope_id": "env-...",
  "session_id": "sess-...",
  "stream_id": "stream-...",
  "started_at": "2026-05-17T00:00:00.000Z",
  "ended_at": "2026-05-17T00:00:02.000Z",
  "duration_ms": 2000,
  "transcript": "[2.00s of audio at -12.3dBFS]",
  "language": "en",
  "confidence": 0,
  "speaker": { "label": "self", "source_kind": "file" },
  "intent": { ... },
  "routing": { ... },
  "provenance": { ... }
}

Inspect the output

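# --json-lines (Python 3.8+) parses one JSON object per line, so multi-segment files work too
cat ~/.vox/archive/sessions/$(date +%Y-%m-%d)/*.jsonl | python3 -m json.tool --json-lines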
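
For anything beyond a quick look, reading the archive programmatically is straightforward. The sketch below relies only on the directory layout and the `session_id` and `transcript` field names shown in this step; the helper names are illustrative and not part of Vox.

```python
import json
from pathlib import Path


def read_envelopes(path):
    """Yield one dict per non-empty line of a JSONL session file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def transcripts_for_day(archive_dir, day):
    """Collect (session_id, transcript) pairs for one YYYY-MM-DD directory."""
    day_dir = Path(archive_dir).expanduser() / "sessions" / day
    pairs = []
    for session_file in sorted(day_dir.glob("*.jsonl")):
        for envelope in read_envelopes(session_file):
            pairs.append((envelope.get("session_id"), envelope.get("transcript")))
    return pairs
```

For example, `transcripts_for_day("~/.vox/archive", "2026-05-17")` would return every segment recorded on that date, in session-file order.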

3 — First listen (live mic, echo stub)

The quickest way to verify the microphone path works — no model download required:

./bin/vox transcribe --live --asr echo

What happens, in order:

  1. Vox probes the microphone for ~1.5 s to auto-calibrate the energy-VAD threshold.
  2. Speak a sentence, then pause. The echo placeholder appears in the terminal summary.
  3. Press Ctrl-C to stop.
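
The calibration in step 1 can be pictured with a small sketch. This is not Vox's algorithm: the rule used here (a multiple of the median ambient RMS, with a floor) and the parameters `margin` and `floor` are assumptions chosen for illustration, and frames are taken to be lists of float samples in [-1.0, 1.0].

```python
import math


def frame_rms(frame):
    """RMS of one frame of float samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def calibrate_threshold(probe_frames, margin=3.0, floor=1e-4):
    """Pick a speech threshold from frames captured during the probe window.

    Hypothetical rule: `margin` times the median ambient RMS, never below
    `floor`. The median is used so one loud click does not skew the result.
    """
    levels = sorted(frame_rms(f) for f in probe_frames)
    if not levels:
        return floor
    median = levels[len(levels) // 2]
    return max(floor, margin * median)


def is_speech(frame, threshold):
    """A frame counts as speech when its RMS exceeds the threshold."""
    return frame_rms(frame) > threshold
```

Under this rule, a quiet room yields a low threshold and a noisy one a higher threshold, which is the general point of probing before listening.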

What you're testing here

The echo backend does not transcribe speech. You're validating the full mic capture → VAD → segment close → pipeline delivery path. If this runs without error, the hardware and OS audio permissions are wired correctly, and you're ready for step 4 (local Whisper).

Linux live mic

Live mic capture on Linux routes through the ALSA/PulseAudio adapter, which is still best-effort. File-based transcription (--asr echo path/to/file.wav) works reliably on all platforms.


4 — Local Whisper (real on-device transcription)

For production-quality transcription with no network dependency, Vox shells out to whisper.cpp via its CLI.

4a — Install the model

./scripts/install-whisper-model.sh

This downloads ggml-base.en.bin (~150 MB) to ~/.vox/models/whisper-cpp/. The script is idempotent — safe to run again if interrupted.
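
For readers curious what "idempotent" means in practice, one common pattern is sketched below. It does not describe how install-whisper-model.sh is actually written. The `fetch` callable is a hypothetical stand-in for whatever performs the download, and no real URL is assumed.

```python
import os
import tempfile
from pathlib import Path


def ensure_file(dest, fetch):
    """Create `dest` via `fetch(tmp_path)` unless it already exists.

    Returns True if a download happened, False if the file was already there.
    `fetch` is a hypothetical callable that writes the payload to the path
    it is given.
    """
    dest = Path(dest).expanduser()
    if dest.is_file() and dest.stat().st_size > 0:
        return False  # already installed, so a rerun is a no-op

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Download next to the destination so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    os.close(fd)
    try:
        fetch(tmp_name)
        os.replace(tmp_name, dest)  # atomic swap into place
    except BaseException:
        # An interrupted run leaves no partial file at `dest`.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return True
```

Writing to a temporary file and renaming it means an interrupted download never leaves a truncated model at the final path, so rerunning is safe.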

whisper-cli not on your PATH?

Pass --install-deps yes to have the script build and install the whisper-cli binary as well:

./scripts/install-whisper-model.sh --install-deps yes

You'll need make, cmake, and g++ (or clang++) on Linux. On macOS, Xcode Command Line Tools are sufficient.

4b — Transcribe a file

./bin/vox transcribe --asr whisper-cli testdata/sine-440hz-2s-16k-mono.wav

4c — Live mic with real transcription

./bin/vox transcribe --live --asr whisper-cli

Speak naturally. Vox segments on silence, sends each segment to whisper-cli, and appends the resulting transcript to the JSONL archive. Press Ctrl-C to stop.
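
Segmenting on silence can be sketched in a few lines. The version below is an illustration under stated assumptions, not Vox's segmenter: it takes one energy value per frame, and the `hangover` parameter (how many consecutive quiet frames close a segment) is a hypothetical knob.

```python
def segment_on_silence(levels, threshold, hangover=3):
    """Group frame indices into (start, end) speech segments.

    `levels` holds one energy value per frame. A segment opens on the first
    frame above `threshold` and closes after `hangover` consecutive quiet
    frames. `end` is exclusive and points just past the last loud frame.
    """
    segments = []
    start = None      # index where the open segment began, if any
    last_loud = None  # most recent frame above the threshold
    quiet = 0         # consecutive quiet frames inside the open segment
    for i, level in enumerate(levels):
        if level > threshold:
            if start is None:
                start = i
            last_loud = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:
                segments.append((start, last_loud + 1))
                start = None
    # Flush a segment still open when the stream ends (for example on Ctrl-C).
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments
```

The hangover keeps short pauses between words from splitting one sentence into several segments; each closed segment would then be handed to the ASR backend.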

Larger models

ggml-base.en.bin balances speed and accuracy well on most hardware. To use a larger model, download it manually and pass --whisper-model /path/to/ggml-medium.en.bin.


5 — Cloud ASR backends

Cloud backends (Deepgram, AssemblyAI, Azure) require a provider API key. Vox never touches those keys except to pass them to the provider. See Guides → BYOK for setup instructions for each provider.


Next steps

Where to go | What you'll find
Concepts | The pipeline model: capture, segment, ASR, router, sinks
Guides | Cloud ASR setup (BYOK), continuous listen mode, policy manifests
Reference | Full flag reference for vox transcribe and vox listen
Architecture | ADRs and the extension-point contracts

./bin/vox --help        # full subcommand surface
./bin/vox transcribe -h # transcribe flag reference