Getting Started
Get Blackrim Vox running on your machine in about five minutes. This guide walks through cloning, building, your first file-based transcription, live mic input, and installing the local Whisper model for on-device speech recognition.
For platform-specific quick paths (macOS / Linux / Windows), see Platform Quick-Start.
Prerequisites
| Requirement | Version | Install |
|---|---|---|
| Go | 1.23+ | macOS: brew install go · Linux: apt install golang-go / dnf install golang (distro packages can lag behind; if go version reports older than 1.23, install from go.dev/dl) · Windows: go.dev/dl |
| git | any recent | macOS: built-in via Xcode CLT · Linux: apt install git · Windows: git-scm.com |
| make | GNU make | macOS: Xcode CLT · Linux: apt install make · Windows: MSYS2 or choco install make |
| Disk space | ~150 MB | for the local Whisper model (downloaded on demand in step 4) |
macOS microphone permission
The first time you run with --live, macOS prompts for microphone access. Grant it once and you will not be asked again.
Already have Go?
Run go version. If it prints 1.23 or later, skip ahead to step 1.
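If you want to script that check (for a setup script or CI job), a minimal Python sketch follows. It assumes the usual `go version` output shape, such as `go version go1.23.4 linux/amd64`; the helper names are hypothetical and not part of Vox.

```python
import re

def go_version_tuple(output):
    """Extract (major, minor) from `go version` output, or None if it does not match."""
    match = re.search(r"\bgo(\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None

def meets_minimum(output, minimum=(1, 23)):
    """True when the reported Go version is at least `minimum`."""
    found = go_version_tuple(output)
    return found is not None and found >= minimum

# Example strings; on a real machine, feed in the captured output of `go version`.
print(meets_minimum("go version go1.23.4 linux/amd64"))
print(meets_minimum("go version go1.19.8 linux/amd64"))
```

The first call reports a new enough toolchain and the second does not. Tuple comparison keeps the check correct for two-digit minor versions.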
1 — Clone and build
git clone https://github.com/Blackrim-Vox/blackrim-vox.git
cd blackrim-vox
make build
This compiles cmd/vox with version metadata stamped in and writes the binary to ./bin/vox. No CGo, no external C libraries required for the base build.
./bin/vox --version
Expected output (the version tag, commit hash, and build timestamp will differ):
vox dev (commit abc1234, built 2026-05-17T00:00:00Z)
2 — First file transcription (no API key, no network)
Vox ships a 2-second, 440 Hz sine-wave WAV (16 kHz, mono) in testdata/ for smoke-testing the pipeline end to end.
./bin/vox transcribe --asr echo testdata/sine-440hz-2s-16k-mono.wav
The echo backend doesn't do real speech recognition — it measures the segment's duration and RMS level and returns a placeholder. That's intentional: this step verifies the capture → segment → ASR → router → sink pipeline wires together cleanly before you install any model.
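To make the placeholder concrete, here is a minimal Python sketch of the kind of measurement described: duration from the sample count, and RMS level expressed in dBFS. It is an illustration only, not Vox's implementation. The function name `describe_segment`, the float sample representation, and the 0.5 peak amplitude of the synthetic tone are all assumptions.

```python
import math

def describe_segment(samples, sample_rate):
    """Return an echo-style placeholder for float samples in [-1.0, 1.0].

    Hypothetical helper for illustration; not part of Vox.
    """
    if not samples:
        return "[0.00s of audio at -inf dBFS]"
    duration_s = len(samples) / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # 0 dBFS corresponds to an RMS of 1.0; digital silence has no finite level.
    level = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return f"[{duration_s:.2f}s of audio at {level:.1f}dBFS]"

# Synthetic 2-second, 440 Hz tone at 16 kHz. The 0.5 peak amplitude is an
# arbitrary choice and is not the level of the bundled test file.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(2 * rate)]
print(describe_segment(tone, rate))  # [2.00s of audio at -9.0dBFS]
```

A sine wave's RMS is its peak divided by √2, so a 0.5 peak gives about 0.354, or roughly -9.0 dBFS.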
Expected terminal output:
ok: session=<session-id> stream=<stream-id>
1 frames → 1 segments → 1 envelopes routed → 1 delivered (0 rejected)
wall elapsed: <duration>
sink dir: ~/.vox/archive
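The transcription result lands in ~/.vox/archive/sessions/<YYYY-MM-DD>/<session-id>.jsonl, one line per segment, with the schema pinned to internal/envelope. A record looks like this (pretty-printed here for readability; in the file it occupies a single line, and { ... } marks elided fields):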
{
"envelope_id": "env-...",
"session_id": "sess-...",
"stream_id": "stream-...",
"started_at": "2026-05-17T00:00:00.000Z",
"ended_at": "2026-05-17T00:00:02.000Z",
"duration_ms": 2000,
"transcript": "[2.00s of audio at -12.3dBFS]",
"language": "en",
"confidence": 0,
"speaker": { "label": "self", "source_kind": "file" },
"intent": { ... },
"routing": { ... },
"provenance": { ... }
}
Inspect the output
cat ~/.vox/archive/sessions/$(date +%Y-%m-%d)/*.jsonl | python3 -m json.tool --json-lines
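For anything beyond eyeballing, the archive is easy to read programmatically because each line is an independent JSON object. The sketch below assumes only the field names shown in the sample record (`session_id`, `started_at`, `ended_at`, `transcript`) and a timestamp format inferred from that sample. If real archives use a different timestamp precision, `TIMESTAMP_FORMAT` would need adjusting.

```python
import json
from datetime import datetime

# Assumed from the sample record: millisecond precision with a trailing "Z".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

def parse_jsonl(lines):
    """Yield one dict per non-empty line. Accepts any iterable of str, including an open file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def span_seconds(envelope):
    """Segment length in seconds, computed from started_at and ended_at."""
    start = datetime.strptime(envelope["started_at"], TIMESTAMP_FORMAT)
    end = datetime.strptime(envelope["ended_at"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()

# Stand-in for file contents, so the sketch runs without a real archive.
sample = [
    '{"session_id": "sess-demo", "started_at": "2026-05-17T00:00:00.000Z", '
    '"ended_at": "2026-05-17T00:00:02.000Z", '
    '"transcript": "[2.00s of audio at -12.3dBFS]"}',
    "",
]
for env in parse_jsonl(sample):
    print(env["session_id"], f"{span_seconds(env):.2f}s", env["transcript"])
```

To read a real session, open the .jsonl file and pass the file handle to `parse_jsonl` in place of `sample`.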
3 — First listen (live mic, echo stub)
The quickest way to verify the microphone path works — no model download required:
./bin/vox transcribe --live --asr echo
What happens, in order:
- Vox probes the microphone for ~1.5 s to auto-calibrate the energy-VAD threshold.
- Speak a sentence, then pause. The echo placeholder appears in the terminal summary.
- Press Ctrl-C to stop.
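The calibration step in that sequence can be pictured with a small sketch. One common approach to energy-based voice activity detection is to measure the ambient noise level during the probe window and place the speech threshold a fixed margin above it. The Python below shows that general idea only; the median statistic, the 10 dB margin, and every function name are assumptions, not a description of how Vox calibrates.

```python
import math

def frame_db(frame, floor_db=-100.0):
    """RMS level of one frame of float samples in dBFS, floored at `floor_db`."""
    if not frame:
        return floor_db
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms <= 0.0:
        return floor_db
    return max(20.0 * math.log10(rms), floor_db)

def calibrate_threshold(probe_frames, margin_db=10.0):
    """Place the threshold `margin_db` above the median level of the probe frames.

    The median and the 10 dB default are illustrative assumptions.
    """
    levels = sorted(frame_db(f) for f in probe_frames)
    if not levels:
        raise ValueError("need at least one probe frame")
    noise_floor = levels[len(levels) // 2]
    return noise_floor + margin_db

def is_speech(frame, threshold_db):
    """A frame counts as speech when its level exceeds the calibrated threshold."""
    return frame_db(frame) > threshold_db

# Ten quiet probe frames at about -60 dBFS, then one loud and one quiet frame.
probe = [[0.001] * 160 for _ in range(10)]
threshold = calibrate_threshold(probe)
print(round(threshold, 1))
print(is_speech([0.1] * 160, threshold), is_speech([0.001] * 160, threshold))
```

Using the median rather than the mean keeps a single cough or door slam during the probe from inflating the threshold.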
What you're testing here
The echo backend does not transcribe speech. You're validating the full mic capture → VAD → segment close → pipeline delivery path. If this runs without error, the hardware and OS audio permissions are wired correctly, and you're ready for step 4 (local Whisper).
Linux live mic
Live mic capture on Linux routes through the ALSA/PulseAudio adapter, which is still best-effort. File-based transcription (--asr echo path/to/file.wav) works reliably on all platforms.
4 — Local Whisper (real on-device transcription)
For production-quality transcription with no network dependency, Vox shells out to whisper.cpp via its CLI.
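As a rough picture of what "shelling out" means here, the Python sketch below builds and runs a whisper-cli command line. The `-m` (model), `-f` (input file), and `-nt` (no timestamps) options are standard whisper.cpp CLI flags, and the default model path is the one the install script in step 4a uses. How Vox itself invokes the binary and parses its output is not documented on this page, so treat the helper names and the exact argument set as assumptions.

```python
import shutil
import subprocess
from pathlib import Path

# Location used by the install script described in step 4a.
DEFAULT_MODEL = Path.home() / ".vox" / "models" / "whisper-cpp" / "ggml-base.en.bin"

def whisper_command(wav_path, model_path=DEFAULT_MODEL, binary="whisper-cli"):
    """Build the argv list for one transcription run (hypothetical helper)."""
    return [binary, "-m", str(model_path), "-f", str(wav_path), "-nt"]

def transcribe(wav_path, model_path=DEFAULT_MODEL):
    """Run whisper-cli and return whatever it prints to stdout, stripped."""
    if shutil.which("whisper-cli") is None:
        raise FileNotFoundError("whisper-cli is not on PATH")
    result = subprocess.run(
        whisper_command(wav_path, model_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()

# Only builds the command, so this line runs even without whisper-cli installed.
print(whisper_command("testdata/sine-440hz-2s-16k-mono.wav"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.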
4a — Install the model
./scripts/install-whisper-model.sh
This downloads ggml-base.en.bin (~150 MB) to ~/.vox/models/whisper-cpp/. The script is idempotent — safe to run again if interrupted.
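"Idempotent" here means a second run changes nothing if the first one finished, and an interrupted run leaves nothing half-written. The install script's internals are not shown on this page, so the following Python sketch only illustrates one common way to get that property: skip when the target exists, otherwise write to a temporary file and rename it into place. `ensure_file` and the `fetch` callback are hypothetical names.

```python
import os
import tempfile
from pathlib import Path

def ensure_file(dest, fetch):
    """Create `dest` by calling `fetch(tmp_path)` unless it already exists.

    Returns True if the file was fetched, False if it was already present.
    """
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already installed; nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    os.close(fd)
    try:
        fetch(tmp)             # hypothetical downloader writes to the temp path
        os.replace(tmp, dest)  # atomic rename, so dest is never half-written
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)     # clean up after a failed or interrupted fetch
    return True
```

Calling `ensure_file` twice with the same destination invokes `fetch` only once, which is the behavior that makes a re-run after an interruption safe.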
whisper-cli not on your PATH?
Pass --install-deps yes to have the script build and install the whisper-cli binary as well:
./scripts/install-whisper-model.sh --install-deps yes
You'll need make, cmake, and g++ (or clang++) on Linux. On macOS, Xcode Command Line Tools are sufficient.
4b — Transcribe a file
./bin/vox transcribe --asr whisper-cli testdata/sine-440hz-2s-16k-mono.wav
4c — Live mic with real transcription
./bin/vox transcribe --live --asr whisper-cli
Speak naturally. Vox segments on silence, sends each segment to whisper-cli, and appends the resulting transcript to the JSONL archive. Press Ctrl-C to stop.
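Segmenting on silence can be sketched as a small state machine over per-frame speech decisions: open a segment at the first speech frame and close it once enough consecutive silent frames follow. The Python below is a generic illustration of that technique. The `hangover` count and the function name are assumptions, and Vox's actual segmenter may apply different rules.

```python
def segment_on_silence(flags, hangover=3):
    """Group per-frame speech flags into (start, end) frame ranges, end exclusive.

    A segment closes once `hangover` consecutive non-speech frames follow speech.
    """
    segments = []
    start = None        # first speech frame of the open segment, if any
    last_speech = None  # most recent speech frame seen in the open segment
    silent_run = 0
    for i, speaking in enumerate(flags):
        if speaking:
            if start is None:
                start = i
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover:
                segments.append((start, last_speech + 1))
                start = None
    if start is not None:
        # The stream ended (for example on Ctrl-C) with a segment still open.
        segments.append((start, last_speech + 1))
    return segments

# A short pause (1 frame) stays inside a segment; a long pause (3 frames) splits.
flags = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
print(segment_on_silence(flags))  # [(1, 5), (8, 10)]
```

The hangover is what stops a brief mid-sentence pause from cutting one utterance into two segments.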
Larger models
ggml-base.en.bin balances speed and accuracy well on most hardware. To use a larger model, download it manually and pass --whisper-model /path/to/ggml-medium.en.bin.
5 — Cloud ASR backends
Cloud backends (Deepgram, AssemblyAI, Azure) require a provider API key. Vox never touches those keys except to pass them to the provider. See Guides → BYOK for setup instructions for each provider.
Next steps
| Where to go | What you'll find |
|---|---|
| Concepts | The pipeline model — capture, segment, ASR, router, sinks |
| Guides | Cloud ASR setup (BYOK), continuous listen mode, policy manifests |
| Reference | Full flag reference for vox transcribe and vox listen |
| Architecture | ADRs and the extension-point contracts |
./bin/vox --help # full subcommand surface
./bin/vox transcribe -h # transcribe flag reference