segment/v1 — Speech Segmentation + Diarization

Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (energy-vad) · Implementations: in-tree only until v1 is frozen.

The segment/v1 surface sits between capture/v1 (audio frames) and asr/v1 (transcription). A segment backend consumes a continuous stream of audio frames and emits discrete speech segments — bounded audio chunks containing detected speech, with optional speaker attribution and multi-speaker flags. Each segment becomes the input to exactly one ASR transcription call.

This is the most-constrained surface in the pipeline: both endpoints (capture/v1 and asr/v1) are locked. Segment/v1's design space lives entirely between those contracts.


Scope

segment/v1 covers:

  • The backend interface (ProcessFrame, OpenStream, CloseStream)
  • Voice Activity Detection (VAD) with a pluggable chain
  • Segment boundary rules (silence duration, max duration, padding)
  • Diarization layers (capture-hint passthrough → multi-speaker detection → speaker change splits → optional full speaker identification)
  • Multi-stream coordination (independent vs cross-stream suppression)
  • Optional preprocessing (denoise, AGC, high-pass, echo cancellation)
  • Per-source VAD overrides
  • Backpressure to ASR (drop-oldest queue policy)
  • Error model + audit hook
  • Versioning and stability rules

segment/v1 does not cover:

  • Audio capture (capture/v1)
  • Transcription (asr/v1)
  • Intent classification or sink routing (router/v1, sink/v1)
  • Speaker identification across sessions (identity/v1, Enterprise)
  • Real-time UI surfaces (orchestrator subscriber concern, not a segment/v1 responsibility)

Input — what the backend consumes

The backend receives:

  1. A continuous stream of capture/v1 Frames via ProcessFrame(handle, frame) calls, where handle is the StreamHandle returned by OpenStream
  2. SessionInfo at OpenStream time:
SessionInfo {
  SessionID    string
  StreamID     string
  SourceKind   SourceKind                # "self" | "in-person" | "online" | "file"
  CaptureHint  string?                   # speaker label from capture layer (e.g., "self")
  SampleRate   uint32
  Channels     uint8
}

Backend state is per-stream. OpenStream returns a StreamHandle that owns the VAD state, the current in-progress segment, drop counters, and so on. Multi-stream sessions are independent at the backend level unless cross-stream coordination is configured (see Multi-stream Coordination below).


Output — what flows to ASR

Each closed segment becomes a Segment:

Segment {
  SegmentID            string                  # unique within stream
  SessionID            string
  StreamID             string
  StartedAt            Timestamp               # wall-clock of first speech sample (after pad_start_ms)
  EndedAt              Timestamp               # wall-clock of last sample (after pad_end_ms)
  Frames               []Frame                 # the audio
  SpeakerHint          string?                 # capture-hint or diarization label
  HasMultipleSpeakers  bool                    # multi-speaker boolean
  SpeakerEmbedding     []float?                # optional; when full identification enabled
  Confidence           float                   # segmentation confidence (0..1)
  Reason               SegmentEndReason        # why this segment closed
  Custom               map<string, any>        # provenance flags from preprocessing, diarization, etc.
}

SegmentEndReason {
  EndedSilenceTimeout                          # min_silence_for_split reached
  EndedMaxDuration                             # max_segment_duration hit
  EndedSpeakerChange                           # diarization detected speaker change
  EndedStreamClose                             # stream closing
  EndedSuppressed                              # cross-stream suppression withheld this segment from ASR (still emitted for audit)
}

Segment is the unit of input to asr/v1's Transcribe (offline) or the audio fed into StreamFeed (streaming).


Voice Activity Detection — Pluggable Chain

VAD determines speech start/end. Three VADs ship in v1, run in a configurable chain.

The three VADs

| VAD | What it is | Latency | Accuracy | Cost |
| --- | --- | --- | --- | --- |
| energy | RMS / spectral-flux threshold + smoothing | < 1 ms | Decent in clean audio; bad in noise | Free, no model |
| webrtc | Google's WebRTC VAD (battle-tested, fast) | < 5 ms | Good on clean speech; struggles with music / overlapping speakers | Free, no model |
| silero | Small neural VAD (ONNX, ~1 MB) | 10–50 ms | Best, especially in noise / mixed audio | Free, bundled model |

Chain semantics

All VADs in the chain run in parallel on every audio frame (not sequentially). Each emits a speech probability. The voting rule decides the final speech / no-speech decision.

segment:
  vad:
    chain:
      - type: webrtc
        aggressiveness: 2                       # WebRTC VAD has 0-3 aggressiveness levels
        frame_ms: 20                            # match capture's default frame size
      - type: silero
        model_path: ~/.vox/models/silero/silero_vad.onnx
        threshold: 0.5
      - type: energy
        threshold_db: -40
        smoothing_ms: 50

    voting: priority                            # priority | majority | weighted
    # priority   = first VAD with high confidence decides
    # majority   = at least N-of-M must agree
    # weighted   = sum(probability × weight) ≥ threshold

    # Per-source overrides
    by_source:
      online:
        chain: [silero, webrtc]                 # noisier; ML first
      self:
        chain: [webrtc, energy]                 # clean; cheap first

Failure handling within the VAD chain

| Failure | Action |
| --- | --- |
| One VAD in chain fails (model load error, runtime crash) | Mark unhealthy; chain continues with remaining VADs; voting rule adapts |
| All VADs fail | Fall back to fixed-length chunking (10 s chunks); emit segment.degraded audit event; pipeline keeps moving |

The pipeline never blocks on VAD failure.
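
A rough Go sketch of this degrade path, assuming a hypothetical vadStatus health record and helper names that do not appear in the contract: unhealthy VADs are filtered out of the vote, and an empty result switches the stream to fixed 10 s chunks.

```go
package main

import "fmt"

// vadStatus is a hypothetical health record for one VAD in the chain.
type vadStatus struct {
	Name    string
	Healthy bool
}

// activeVADs returns the VADs that should keep voting. An empty result means
// the whole chain failed and the caller should switch to fixed-length chunks.
func activeVADs(chain []vadStatus) []string {
	var active []string
	for _, v := range chain {
		if v.Healthy {
			active = append(active, v.Name)
		}
	}
	return active
}

// fixedChunkEnds returns the frame counts at which a segment is closed when
// chunking by a fixed length instead of by VAD.
func fixedChunkEnds(totalFrames, frameMS, chunkMS int) []int {
	if frameMS <= 0 || chunkMS < frameMS {
		return nil
	}
	perChunk := chunkMS / frameMS
	var ends []int
	for end := perChunk; end <= totalFrames; end += perChunk {
		ends = append(ends, end)
	}
	return ends
}

func main() {
	chain := []vadStatus{{"webrtc", true}, {"silero", false}, {"energy", true}}
	fmt.Println("voting with:", activeVADs(chain))

	dead := []vadStatus{{"webrtc", false}, {"silero", false}, {"energy", false}}
	if len(activeVADs(dead)) == 0 {
		// 24 s of 20 ms frames cut every 10 s; the tail waits for CloseStream.
		fmt.Println("fixed chunk ends:", fixedChunkEnds(1200, 20, 10000))
	}
}
```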


Segment Boundaries

Speech start / end is detected by VAD; segment start / end is VAD plus boundary rules. The rules:

segment:
  boundaries:
    min_speech_duration: 250ms                  # ignore micro-utterances ("uh", coughs)
    min_silence_for_split: 600ms                # silence this long → segment break
    max_segment_duration: 30s                   # force-cut at this point (prevents runaway)
    pad_start_ms: 200                           # include 200ms before detected speech start
    pad_end_ms: 300                             # include 300ms after detected speech end
    speaker_change_splits: true                 # split at diarization-detected speaker changes

Why these defaults

  • min_speech_duration: 250ms — Skip filler sounds. Saves ASR cost on cloud backends and noise on the ASR output.
  • min_silence_for_split: 600ms — Natural speech pauses are 150–400 ms; conversational turn boundaries are typically > 500 ms. 600 ms threads the needle.
  • max_segment_duration: 30s — Hard cap. Prevents a "stuck VAD" scenario from producing a 6-hour segment. Aligns with ASR backend preferences (Whisper-class models are tuned on ~30 s windows).
  • pad_start_ms: 200 / pad_end_ms: 300 — Critical for ASR accuracy. Whisper-class models lose the first/last phoneme if you cut too tight. Asymmetric (more trailing padding) because word offsets typically trail more than they lead.
  • speaker_change_splits: true — When diarization detects a speaker change mid-segment, split. Downstream sees one segment per speaker turn, which is the right granularity for IntentEnvelope per speaker.

All values configurable.
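
To make the interaction of these rules concrete, here is a simplified Go sketch that turns per-frame speech decisions into padded segments. The Rules, Span, segmentFrames, and pattern names are illustrative assumptions; speaker-change splits are left out, and padding is allowed to overlap a neighbouring segment after a forced cut, which a real backend would need to decide how to handle.

```go
package main

import "fmt"

// Rules mirrors the boundary config, with every duration in milliseconds.
type Rules struct {
	FrameMS, MinSpeechMS, MinSilenceMS, MaxSegmentMS, PadStartMS, PadEndMS int
}

// Span is a closed segment in frame indices (End exclusive, padding included).
type Span struct {
	Start, End int
	Reason     string
}

func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// segmentFrames applies the boundary rules to one speech decision per frame.
func segmentFrames(speech []bool, r Rules) []Span {
	minSpeech := r.MinSpeechMS / r.FrameMS
	minSilence := r.MinSilenceMS / r.FrameMS
	maxLen := r.MaxSegmentMS / r.FrameMS
	padStart := r.PadStartMS / r.FrameMS
	padEnd := r.PadEndMS / r.FrameMS

	var out []Span
	start, lastSpeech, voiced := -1, -1, 0
	closeSegment := func(reason string) {
		if voiced >= minSpeech { // shorter utterances are discarded
			out = append(out, Span{
				Start:  clamp(start-padStart, 0, len(speech)),
				End:    clamp(lastSpeech+1+padEnd, 0, len(speech)),
				Reason: reason,
			})
		}
		start, lastSpeech, voiced = -1, -1, 0
	}

	for i, isSpeech := range speech {
		if isSpeech {
			if start < 0 {
				start = i
			}
			lastSpeech = i
			voiced++
		}
		if start < 0 {
			continue
		}
		if i-lastSpeech >= minSilence {
			closeSegment("EndedSilenceTimeout")
		} else if i-start+1 >= maxLen {
			closeSegment("EndedMaxDuration")
		}
	}
	if start >= 0 {
		closeSegment("EndedStreamClose")
	}
	return out
}

// pattern builds decisions from alternating run lengths, starting with silence.
func pattern(runs ...int) []bool {
	var out []bool
	for i, n := range runs {
		for j := 0; j < n; j++ {
			out = append(out, i%2 == 1)
		}
	}
	return out
}

func main() {
	defaults := Rules{FrameMS: 20, MinSpeechMS: 250, MinSilenceMS: 600,
		MaxSegmentMS: 30000, PadStartMS: 200, PadEndMS: 300}
	// 0.4 s silence, 1 s speech, 0.8 s silence, 0.1 s cough, 0.8 s silence.
	fmt.Println(segmentFrames(pattern(20, 50, 40, 5, 40), defaults))
}
```

With the default values the 100 ms cough is dropped by min_speech_duration, and the one-second utterance is emitted with 10 frames of leading and 15 frames of trailing padding.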


Diarization Layers

Per the precedence rule locked in router/v1 (segment > ASR backend > capture hint), segment/v1 must produce something for Speaker.Label. Four layers, three default-on:

| Layer | Status | Behavior |
| --- | --- | --- |
| Capture-hint passthrough | Always on | The self adapter stamps Speaker.Label = "self" at capture; segment passes it through |
| Multi-speaker detection | Default on | Boolean signal: "this segment contains > 1 voice". Cheap; uses any VAD's secondary output |
| Speaker change detection within segment | Default on for online + in-person | Mid-segment speaker changes split the segment |
| Full speaker identification | Opt-in | Stable labels (speaker-0, speaker-1, …) across segments; uses a pluggable model backend |

segment:
  diarization:
    capture_hint: passthrough                   # always passes through
    multi_speaker_detection: enabled            # boolean signal
    speaker_change_within_segment: enabled      # splits per boundary rules

    speaker_identification:
      enabled: false                             # opt-in
      backend: pyannote-onnx                     # pyannote-onnx | nemo-ecapa | sherpa
      model_path: ~/.vox/models/pyannote/segmentation.onnx
      embedding_path: ~/.vox/models/pyannote/ecapa-tdnn.onnx
      similarity_threshold: 0.7                  # cosine similarity for label reuse
      window_size: 30s                           # how far back to compare speakers

With speaker_identification enabled, each segment carries:

  • SpeakerHint: "speaker-0" / "speaker-1" / … (stable within session)
  • Optional SpeakerEmbedding: []float (passed to ASR for backend-side diarization refinement)

Without speaker_identification, segments still carry the capture hint and the multi-speaker boolean — sufficient for most use cases.
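
One way the similarity_threshold could drive label reuse is sketched below in Go. The Labeler type and its nearest-match policy are assumptions for illustration, not the behavior of any listed backend, and the window_size limit on how far back to compare is not modeled.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 when the lengths differ or either vector is all zeros.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / math.Sqrt(na*nb)
}

// Labeler is a hypothetical per-session store of one embedding per speaker.
type Labeler struct {
	Threshold float64
	known     [][]float64
}

// Label reuses the most similar known speaker when the similarity reaches
// the threshold; otherwise it registers a new speaker.
func (l *Labeler) Label(embedding []float64) string {
	best, bestSim := -1, l.Threshold
	for i, k := range l.known {
		if s := cosine(embedding, k); s >= bestSim {
			best, bestSim = i, s
		}
	}
	if best < 0 {
		l.known = append(l.known, embedding)
		best = len(l.known) - 1
	}
	return fmt.Sprintf("speaker-%d", best)
}

func main() {
	l := &Labeler{Threshold: 0.7}
	fmt.Println(l.Label([]float64{1, 0}))     // first voice heard
	fmt.Println(l.Label([]float64{0, 1}))     // orthogonal embedding, new speaker
	fmt.Println(l.Label([]float64{0.9, 0.1})) // close to the first voice
}
```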


Multi-stream Coordination

When two streams run in parallel (your mic + system audio from an online call), how do their VAD pipelines interact?

| Mode | Behavior | Use case |
| --- | --- | --- |
| independent (default) | Each stream segments independently. Downstream correlates via SessionID | Default; covers the majority case where call clients already do echo cancellation |
| cross-stream-suppression (opt-in) | segment/v1 sees both streams; suppresses segments likely to be echo of another stream | Aggressive de-duplication for captures without native AEC |

segment:
  multi_stream:
    mode: independent
    cross_stream:
      enabled: false
      echo_detection: cross-correlation         # cross-correlation | spectral
      suppress_threshold_ms: 50                 # correlation peak within 50ms → echo
      suppress_stream: self                     # self loses to system-audio "ground truth"

Suppressed segments are still emitted with Reason: EndedSuppressed and flow to the audit log; they just don't go to ASR. This preserves the audit trail without re-paying transcription costs.
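
The cross-correlation option can be illustrated with a small Go sketch that finds the lag at which one stream best matches the other and compares it with suppress_threshold_ms. The function names, the minimum correlation score, and the synthetic test signal are assumptions of the example; a real backend would work on windowed audio and likely use an FFT-based correlation.

```go
package main

import (
	"fmt"
	"math"
)

// peakLag returns the lag (in samples) at which sig best matches ref, plus the
// normalized correlation at that lag. A positive lag means sig trails ref.
func peakLag(ref, sig []float64, maxLag int) (int, float64) {
	bestLag, best := 0, 0.0
	for lag := -maxLag; lag <= maxLag; lag++ {
		var dot, nr, ns float64
		for i := range ref {
			j := i + lag
			if j < 0 || j >= len(sig) {
				continue
			}
			dot += ref[i] * sig[j]
			nr += ref[i] * ref[i]
			ns += sig[j] * sig[j]
		}
		if nr == 0 || ns == 0 {
			continue
		}
		if c := dot / math.Sqrt(nr*ns); c > best {
			bestLag, best = lag, c
		}
	}
	return bestLag, best
}

// isEcho applies the rule "strong correlation peak within thresholdMS".
// minScore is an assumed extra guard against weak, accidental peaks.
func isEcho(lagSamples, sampleRate int, score, minScore float64, thresholdMS int) bool {
	if lagSamples < 0 {
		lagSamples = -lagSamples
	}
	lagMS := lagSamples * 1000 / sampleRate
	return score >= minScore && lagMS <= thresholdMS
}

// noise returns deterministic pseudo-random samples in [-1, 1).
func noise(n int) []float64 {
	out := make([]float64, n)
	var x uint32 = 1
	for i := range out {
		x = x*1664525 + 1013904223 // linear congruential step, wraps on overflow
		out[i] = float64(x>>16)/32768 - 1
	}
	return out
}

// delayed returns ref shifted right by delay samples and scaled by gain.
func delayed(ref []float64, delay int, gain float64) []float64 {
	out := make([]float64, len(ref))
	for i := delay; i < len(ref); i++ {
		out[i] = gain * ref[i-delay]
	}
	return out
}

func main() {
	system := noise(400)             // stand-in for system audio
	mic := delayed(system, 8, 0.5)   // the same audio, quieter and 8 samples late
	lag, score := peakLag(system, mic, 20)
	fmt.Println("lag:", lag, "echo:", isEcho(lag, 16000, score, 0.8, 50))
}
```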


Preprocessing — Opt-in, Off by Default

Audio preprocessing (denoise / AGC / high-pass / echo cancellation) lives between capture and ASR. All stages are opt-in, all off by default.

Reasoning:

  • Modern ASR backends (especially Whisper) are trained on diverse raw audio; preprocessing can hurt accuracy
  • Cloud ASR backends do their own preprocessing internally; doubling it is wasteful
  • Online sources (call clients) usually already do echo cancellation

segment:
  preprocessing:
    denoise:
      enabled: false
      method: rnnoise                           # rnnoise | spectral-subtraction
    agc:                                        # automatic gain control
      enabled: false
      target_db: -16
    high_pass:
      enabled: false
      cutoff_hz: 80                             # remove HVAC rumble + low-frequency noise
    echo_cancellation:
      enabled: false

Each preprocessing stage emits a provenance flag into the segment's Custom map when applied:

Custom {
  "segment.preprocessing.rnnoise_applied":      "true",
  "segment.preprocessing.agc_applied":          "true",
  ...
}

ASR sees these flags via the envelope's Provenance chain and can adapt if it knows certain stages help or hurt its specific backend.


Backend Interface

SegmentBackend {
  # Identity
  Name()           -> string
  Capabilities()   -> Capabilities

  # Lifecycle
  Open(config)     -> Error
  Close()          -> Error

  # Stream lifecycle
  OpenStream(sessionInfo)        -> StreamHandle | Error
  CloseStream(handle)            -> []Segment | Error
    # Final flush — emits any in-progress segment

  # Hot path
  ProcessFrame(handle, frame)    -> []Segment | Error
    # Returns 0 or more complete segments. Most frames return [] (no
    # segment boundary reached). A frame that closes one or more segments
    # returns them.

  # Diagnostics
  Stats()          -> Stats
  Health()         -> Health
}

Capabilities {
  SupportedVADs                  []string       # "energy" | "webrtc" | "silero"
  SupportsDiarization            bool
  SupportsMultiSpeaker           bool
  SupportsCrossStreamSuppression bool
  SupportedPreprocessing         []string       # "rnnoise" | "agc" | "high-pass" | "echo-cancel"
}
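
As an illustration of how Capabilities might be used, the Go sketch below compares what a config requests with what a backend declares; anything left over is what would be reported as ErrUnsupported. The helper name and the trimmed-down Capabilities struct are assumptions for the example.

```go
package main

import "fmt"

// Capabilities is a reduced copy of the contract's struct for this sketch.
type Capabilities struct {
	SupportedVADs          []string
	SupportedPreprocessing []string
}

// unsupported returns the requested names that the backend did not declare.
func unsupported(requested, declared []string) []string {
	have := make(map[string]bool, len(declared))
	for _, d := range declared {
		have[d] = true
	}
	var missing []string
	for _, r := range requested {
		if !have[r] {
			missing = append(missing, r)
		}
	}
	return missing
}

func main() {
	caps := Capabilities{
		SupportedVADs:          []string{"energy", "webrtc"},
		SupportedPreprocessing: []string{"agc"},
	}
	// The config asks for a chain the backend only partly supports.
	fmt.Println("unsupported VADs:", unsupported([]string{"webrtc", "silero"}, caps.SupportedVADs))
	fmt.Println("unsupported stages:", unsupported([]string{"agc"}, caps.SupportedPreprocessing))
}
```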

Orchestrator loop

  1. Capture adapter emits frames on its channel
  2. Orchestrator forwards each frame to segment.ProcessFrame(handle, frame)
  3. ProcessFrame returns []Segment — usually empty, occasionally one or more
  4. For each returned segment, orchestrator hands it to asr/v1 per the source-kind routing
  5. On stream close, orchestrator calls segment.CloseStream(handle) to flush any in-progress segment
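
The loop above could look roughly like this in Go. The types are cut down to the two hot-path methods, and everyN is a stand-in backend that closes a segment every n frames; both are assumptions made to keep the sketch runnable.

```go
package main

import "fmt"

type Frame struct{ Seq int }

type Segment struct {
	ID     string
	Frames []Frame
}

type StreamHandle int

// SegmentBackend is reduced to the two methods the forwarder needs.
type SegmentBackend interface {
	ProcessFrame(h StreamHandle, f Frame) ([]Segment, error)
	CloseStream(h StreamHandle) ([]Segment, error)
}

// everyN is a stand-in backend that closes a segment every n frames.
type everyN struct {
	n       int
	pending []Frame
	next    int
}

func (b *everyN) flush() []Segment {
	seg := Segment{ID: fmt.Sprintf("seg-%d", b.next), Frames: b.pending}
	b.next++
	b.pending = nil
	return []Segment{seg}
}

func (b *everyN) ProcessFrame(_ StreamHandle, f Frame) ([]Segment, error) {
	b.pending = append(b.pending, f)
	if len(b.pending) < b.n {
		return nil, nil // most frames close nothing
	}
	return b.flush(), nil
}

func (b *everyN) CloseStream(_ StreamHandle) ([]Segment, error) {
	if len(b.pending) == 0 {
		return nil, nil
	}
	return b.flush(), nil
}

// forward is the per-stream loop: feed frames, hand off closed segments, and
// flush the in-progress segment once the capture channel is closed.
func forward(frames <-chan Frame, backend SegmentBackend, h StreamHandle, toASR func(Segment)) error {
	for f := range frames {
		segs, err := backend.ProcessFrame(h, f)
		if err != nil {
			return err
		}
		for _, s := range segs {
			toASR(s)
		}
	}
	segs, err := backend.CloseStream(h)
	if err != nil {
		return err
	}
	for _, s := range segs {
		toASR(s)
	}
	return nil
}

func main() {
	frames := make(chan Frame, 7)
	for i := 0; i < 7; i++ {
		frames <- Frame{Seq: i}
	}
	close(frames)
	_ = forward(frames, &everyN{n: 3}, 1, func(s Segment) {
		fmt.Println(s.ID, "with", len(s.Frames), "frames")
	})
}
```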

Concurrency

  • One StreamHandle per stream; segments from different streams flow independently
  • ProcessFrame is called from a single goroutine per stream (the orchestrator's stream-forwarder)
  • Cross-stream suppression (when enabled) uses a shared lock-protected recent-segments structure; locking is internal to the backend

Backpressure (segment → ASR)

The orchestrator buffers segments destined for ASR. When ASR can't keep up:

segment:
  output_queue:
    buffer_segments: 32                         # ~10 min of typical segments
    drop_policy: drop-oldest
    drop_alert_threshold_pct: 5.0
    drop_alert_window_sec: 60

Default policy: drop-oldest — opposite of capture's drop-newest. Reasoning:

  • In capture, frames are sequential audio samples — losing the newest preserves continuity of the recent past, which is what live processing needs
  • In segment, each segment is an independent meaning-unit — when the queue is backed up, the oldest segment is the most stale and least valuable to the user

Drop telemetry:

  • segment.dropped counter: running total
  • Structured WARN log per drop event (not per dropped segment)
  • audit/v1 event when audit is loaded
  • Loud escalation if drop rate > 5% over 60 s (configurable)
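
A minimal Go sketch of a drop-oldest queue with a drop-rate check follows. OutputQueue and its methods are hypothetical names; the queue holds segment IDs rather than segments, assumes a capacity of at least 1, and measures the drop rate over the queue's lifetime instead of the sliding drop_alert_window_sec window.

```go
package main

import "fmt"

// OutputQueue is a bounded FIFO of segment IDs with a drop-oldest policy.
type OutputQueue struct {
	capacity int
	items    []string
	pushed   int
	dropped  int
}

func NewOutputQueue(capacity int) *OutputQueue {
	return &OutputQueue{capacity: capacity}
}

// Push enqueues id. When the queue is full the oldest entry is evicted and
// returned so the caller can log or audit the drop.
func (q *OutputQueue) Push(id string) (evicted string, didDrop bool) {
	q.pushed++
	if len(q.items) > 0 && len(q.items) >= q.capacity {
		evicted, didDrop = q.items[0], true
		q.items = q.items[1:]
		q.dropped++
	}
	q.items = append(q.items, id)
	return evicted, didDrop
}

// Pop removes and returns the oldest queued ID.
func (q *OutputQueue) Pop() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	id := q.items[0]
	q.items = q.items[1:]
	return id, true
}

// DropRatePct is the share of pushed segments that were evicted, in percent.
func (q *OutputQueue) DropRatePct() float64 {
	if q.pushed == 0 {
		return 0
	}
	return 100 * float64(q.dropped) / float64(q.pushed)
}

// ShouldAlert reports whether the drop rate exceeds the alert threshold.
func (q *OutputQueue) ShouldAlert(thresholdPct float64) bool {
	return q.DropRatePct() > thresholdPct
}

func main() {
	q := NewOutputQueue(2)
	for _, id := range []string{"seg-0", "seg-1", "seg-2"} {
		if old, dropped := q.Push(id); dropped {
			fmt.Println("dropped oldest:", old)
		}
	}
	fmt.Printf("drop rate %.1f%%, alert: %v\n", q.DropRatePct(), q.ShouldAlert(5.0))
}
```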


Error Model

Typed errors, mirroring other Vox surfaces:

SegmentError {
  Kind     SegmentErrorKind
  Stage    string                                # "vad:webrtc" | "vad:silero" | "diarization" | "preprocessing:rnnoise" | etc.
  Message  string
  Cause    Error?
}

SegmentErrorKind {
  ErrInvalidConfig
  ErrVADUnavailable                             # one VAD in chain failed; chain continues
  ErrAllVADsFailed                              # entire chain failed; falls back to fixed-length chunks
  ErrModelNotFound                              # diarization / preprocessing model missing
  ErrModelCorrupt                               # checksum mismatch
  ErrUnsupported                                # capability requested not declared
  ErrInternal
}
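
In Go, this error model maps naturally onto a struct that implements error and Unwrap, so callers can recover the kind with errors.As even after wrapping. The sketch below assumes that mapping; kindOf and loadSilero are illustrative helpers, not part of the contract.

```go
package main

import (
	"errors"
	"fmt"
)

type SegmentErrorKind int

const (
	ErrInvalidConfig SegmentErrorKind = iota
	ErrVADUnavailable
	ErrAllVADsFailed
	ErrModelNotFound
	ErrModelCorrupt
	ErrUnsupported
	ErrInternal
)

// SegmentError carries the kind, the failing stage, and an optional cause.
type SegmentError struct {
	Kind    SegmentErrorKind
	Stage   string
	Message string
	Cause   error
}

func (e *SegmentError) Error() string {
	return fmt.Sprintf("segment [%s]: %s", e.Stage, e.Message)
}

// Unwrap exposes the cause to errors.Is and errors.As.
func (e *SegmentError) Unwrap() error { return e.Cause }

// kindOf extracts the SegmentErrorKind from anywhere in an error chain.
func kindOf(err error) (SegmentErrorKind, bool) {
	var se *SegmentError
	if errors.As(err, &se) {
		return se.Kind, true
	}
	return 0, false
}

// loadSilero is a stand-in that fails the way a missing model file might.
func loadSilero() error {
	return fmt.Errorf("open stream: %w", &SegmentError{
		Kind:    ErrVADUnavailable,
		Stage:   "vad:silero",
		Message: "model load failed",
		Cause:   errors.New("model file missing"),
	})
}

func main() {
	err := loadSilero()
	if kind, ok := kindOf(err); ok && kind == ErrVADUnavailable {
		// One VAD is down; the chain continues with the remaining VADs.
		fmt.Println("degraded:", err)
	}
}
```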

Failure handling

| Failure | Action |
| --- | --- |
| One VAD in chain fails | Mark unhealthy; chain continues; voting adapts |
| All VADs fail | Fall back to fixed-length chunking (10 s); emit segment.degraded audit event |
| Diarization model missing | Disable diarization features for the session; segments still emit with capture-hint only |
| Preprocessing failure | Skip the failing stage; pass audio through unchanged |
| ProcessFrame crashes | Orchestrator-side recover; current segment lost; backend restarted; segment.crashed audit event |
| Segment exceeds max_segment_duration | Force-close at the cap with EndedMaxDuration reason; not an error |

Key principle: segment/v1 NEVER blocks the pipeline. Worst case is fixed-length chunking, which trades accuracy for continuity. The pipeline keeps moving.


Audit Hook

When audit/v1 is loaded, the segment backend MUST emit a SegmentDecisionEvent per segment:

SegmentDecisionEvent {
  Timestamp           Timestamp
  SessionID           string
  StreamID            string
  SegmentID           string
  StartedAt           Timestamp
  EndedAt             Timestamp
  Duration            Duration
  EndReason           SegmentEndReason
  VADChainTrace       []VADStep                  # which VADs fired, with speech probabilities
  Diarization {
    SpeakerHint            string?
    HasMultipleSpeakers    bool
    Source                 string                  # "capture-hint" | "identification:pyannote-onnx" | "splitting"
  }
  PreprocessingApplied []string                   # ["rnnoise", "agc"] etc., empty if none
  ConfidenceScore     float
  FrameCount          uint32
  AudioBytes          uint64                      # approximate
}

VADStep {
  Name              string                        # "webrtc" | "silero" | "energy"
  SpeechProbability float
  Triggered         bool
  LatencyMS         uint32
}

This lets an auditor reconstruct: "this segment was closed because Silero detected silence for 700 ms, the capture-hint said self, no preprocessing was applied, segmentation confidence was 0.92, and it covered 47 frames."

Combined with RouterDecisionEvent (from router/v1) and downstream sink audit events, the complete fate of every utterance is traceable — from microphone vibration to LLM response to email summary.

When audit/v1 is NOT loaded, the same data flows as structured logs at DEBUG (configurable to INFO via segment.log_decisions_at: info).


Bundling

| Component | Bundling decision |
| --- | --- |
| energy VAD | Compiled into binary. Pure algorithm, no model |
| webrtc VAD | Compiled into binary (CGo binding to libwebrtcvad). No model file |
| silero VAD | Bundled ONNX model (~1 MB). Critical to "noise-robust out of the box" |
| Diarization models (pyannote-onnx, NeMo ECAPA, sherpa) | Downloaded via vox model download |
| Preprocessing models (RNNoise) | Downloaded for opt-in stages |

Bundling Silero specifically: ~1 MB binary overhead is cheap compared to the "works in a noisy office on first install" UX win.

Diarization and preprocessing models are NOT bundled because:

  • They're opt-in features (most use cases don't need them)
  • Sizes are larger (6-30 MB each)
  • Cross-platform installer size matters

Model lifecycle (download, storage, verification, versioning) follows the same pattern as asr/v1: ~/.vox/models/<backend>/<model>, SHA-256 checksums on every load, explicit version names, no silent upgrades.
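
The checksum-on-load step might be implemented along these lines in Go, using the standard crypto/sha256 package. verifyModel and writeTempModel are hypothetical helpers, and where the expected digest comes from (a manifest, a lock file) is left open here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// verifyModel streams the file through SHA-256 and compares the digest with
// the expected hex string. A mismatch would surface as ErrModelCorrupt.
func verifyModel(path, wantHex string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	got := hex.EncodeToString(h.Sum(nil))
	if !strings.EqualFold(got, wantHex) {
		return fmt.Errorf("checksum mismatch for %s: got %s", path, got)
	}
	return nil
}

// writeTempModel writes data to a temporary file so the example can run
// without a real model on disk.
func writeTempModel(data []byte) (string, error) {
	f, err := os.CreateTemp("", "model-*.onnx")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	path, err := writeTempModel([]byte("abc"))
	if err != nil {
		fmt.Println("temp file:", err)
		return
	}
	defer os.Remove(path)

	// SHA-256 of the three bytes "abc" (the standard test vector).
	const want = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println("intact:", verifyModel(path, want) == nil)
	fmt.Println("tampered detected:", verifyModel(path, "00") != nil)
}
```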


Reference Build Order

| Order | Component | Why |
| --- | --- | --- |
| 1 | energy-vad | Simplest; no dependencies. Builds the segment pipeline scaffolding (ProcessFrame, OpenStream/CloseStream, queue, drop policy). End-to-end pipeline tests pass with energy-VAD + file-wav capture + whisper-cpp ASR |
| 2 | webrtc-vad | Stable, fast, well-tested. CGo binding to libwebrtcvad |
| 3 | silero-vad | ONNX runtime integration; bundling the model. After this lands, "works in noise out of the box" is true |
| 4 | diarization scaffolding | Plumbing for SpeakerHint and SpeakerEmbedding; fake/synthetic speakers in test audio prove the path before real models land |
| 5 | pyannote-onnx diarization | First real speaker-identification backend |
| 6 | rnnoise preprocessing | First preprocessing stage; validates the optional layer |
| 7+ | additional backends | Same contract, different models |

The pipeline is usable after step 1 (energy VAD alone) and good after step 3 (Silero VAD). Diarization + preprocessing layer onto a proven core.


Versioning and Stability

segment/v1 is the contract above. Once frozen:

  • Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, Segment, SegmentDecisionEvent; adding new VAD types; adding new diarization backends; adding new preprocessing stages; adding new SegmentEndReason values; adding new SegmentErrorKind values.
  • Breaking changes (require v2): changing the SegmentBackend interface signature; changing boundary-rule semantics; removing or repurposing any existing field; changing what frames a segment includes.

The core supports one major version (vN) of segment/ at a time; two versions coexist only temporarily, during a migration.


Project Principle: Opinionated Defaults, Every Default Configurable

This contract continues the principle from the rest of v1. Every behavior with a defensible default (min_silence_for_split: 600ms, pad_end_ms: 300, max_segment_duration: 30s, voting: priority, drop_policy: drop-oldest, auto_download: prompt, etc.) is exposed as a config knob. Defaults reflect a considered recommendation for the typical voice-to-LLM use case; the knobs exist so specialized workflows can tune them.