This is the third article in a series about forking an open-core product with AI coding tools. Part 1 laid out the roadmap of what was gated behind the Pro paywall. Part 2 executed the fork: rebrand, telemetry strip, first feature. It proved the easy stuff was as easy as it looked.

Part 3 is about the stuff that isn’t easy.

Speaker diarization (identifying who said what in a meeting) is the one feature where the open source codebase and the Pro version are on even footing. The Meetily website marks it as “Coming Soon” for every tier, including Pro. Neither side has shipped it.

The roadmap from Part 1 rated this as Very High effort. The legacy code existed but wasn’t wired into the active pipeline. The architecture to support it touches the audio pipeline, the database schema, and the UI. Let’s find out if that estimate was right.

What the Codebase Actually Has

The Meetily codebase has a file called audio/stt.rs that contains speaker embedding extraction using pyannote models. It has EmbeddingExtractor, EmbeddingManager, and SpeechSegment types. It’s real code that was inherited from a screenpipe fork.

But it’s not wired into anything. The active transcription pipeline runs through whisper_engine/ and parakeet_engine/. Neither has any concept of speakers. The TranscriptResult struct that carries transcription events to the frontend has fields for text, timestamps, confidence scores, and language. No speaker ID.

I poked deeper expecting the pyannote module to have been stripped from the repo. That kind of legacy code usually gets deleted when the pipeline is rewritten. But it’s still there. The pyannote/ directory has real Rust bindings for speaker embedding extraction and segmentation models. It’s not a stub. It’s code that was written to do exactly what diarization needs, connected to a pipeline that no longer exists.

So the legacy code is not a shortcut. It’s evidence that someone thought about this problem before, but the implementation that shipped is a single-user transcription engine with no speaker awareness. Building diarization means adding a new dimension to every layer of the pipeline.

The Architecture Problem

Diarization requires four things that don’t exist in the codebase:

  1. An embedding model that can produce speaker fingerprints from audio segments
  2. A clustering algorithm to group those fingerprints into distinct speakers
  3. A storage layer to associate each transcript chunk with a speaker ID
  4. A UI to display speaker labels and let users name the speakers

The embedding extraction needs to happen in real time, alongside transcription. That means the audio pipeline (which currently runs a single pass through Whisper or Parakeet) now needs to produce both text and speaker embeddings from the same audio buffer.
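To make items 1 and 2 concrete, here is a minimal sketch of the clustering step in plain Python. It assumes the embedding model has already turned each audio segment into a fixed-length vector. The greedy online strategy, the function names, and the 0.7 threshold are all illustrative assumptions; this is not how pyannote clusters speakers, only the simplest version of the idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_speakers(embeddings, threshold=0.7):
    """Greedy online clustering (illustrative, not pyannote's algorithm).

    Each embedding joins the most similar existing speaker centroid, or
    starts a new speaker when no centroid reaches the threshold.
    Returns one integer speaker label per embedding.
    """
    centroids = []  # running mean embedding per speaker
    counts = []     # number of segments merged into each centroid
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # No speaker is close enough: open a new cluster.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Fold the new segment into the running mean of its cluster.
            n = counts[best]
            centroids[best] = [
                (c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)
            ]
            counts[best] = n + 1
        labels.append(best)
    return labels
```

Because labels are assigned as segments arrive, a scheme like this could in principle run alongside real-time transcription. The cost is that early mistakes are never revisited, which is one reason offline diarization tends to be more accurate.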

This is where the “vibe coding” thesis gets stress-tested. The PDF export spec from Part 1 had clear boundaries and no external dependencies. The template editor spec from Part 2 had a complete Rust backend waiting for a frontend. A droid agent could handle both because the problem was well-defined and the infrastructure was already in place.

Speaker diarization is different. It’s not a bolt-on feature. It’s a pipeline redesign.

The Research

Before writing off the problem, I did what any reasonable engineer would do: looked for someone who already solved it.

There’s a project called WhisperX with 22,000 GitHub stars. It does exactly what we need: batched Whisper transcription with word-level timestamps and speaker diarization, powered by pyannote-audio. The pipeline is:

  1. Voice activity detection (Silero or pyannote) to strip silence
  2. Whisper ASR for transcription
  3. Forced alignment with wav2vec2 for word-level timestamps
  4. Speaker diarization with pyannote-audio to cluster speakers
  5. Assignment: map each word’s timestamp to a speaker turn
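Step 5 is the easiest one to picture in code. The sketch below is a simplified illustration of overlap-based assignment, not WhisperX's own implementation. The dictionary keys (`start`, `end`, `speaker`) and the function names are assumptions chosen for the example.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_words_to_speakers(words, turns):
    """Label each word with the speaker turn it overlaps the most.

    words: list of {"text", "start", "end"} dicts (times in seconds)
    turns: list of {"speaker", "start", "end"} dicts from diarization
    Words that overlap no turn get speaker None.
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**word, "speaker": best_speaker})
    return labeled
```

A word that straddles a turn boundary goes to whichever speaker holds more of its duration, so word-level timestamps from the alignment step matter: the coarser the timestamps, the more words land on the wrong side of a boundary.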

The whole thing runs at 70x realtime on a GPU and is BSD-2-Clause licensed. The diarization model it uses, speaker-diarization-community-1, is CC-BY-4.0 from pyannoteAI. Both licenses are compatible with the MIT fork.

Pyannote-audio is already partially present in the Meetily codebase. The pyannote/ directory has real Rust bindings for embedding extraction and segmentation. Someone already started this work. They just didn’t finish wiring it into the active pipeline.

The fork’s approach wouldn’t be to build speaker diarization from scratch in Rust. It would be to integrate WhisperX as a sidecar, the same architectural pattern the codebase already uses for local AI inference via the llama-helper binary. The existing Rust pipeline keeps transcribing in real time. When the meeting ends, or during a pause, the audio is handed to a WhisperX sidecar process that runs diarization and assigns speaker labels to the transcript chunks.

This is still a significant engineering effort. The sidecar needs to be packaged, the results need to merge back into the existing data model, the TranscriptResult struct needs a speaker_id field, and the frontend needs to display speaker labels. But it’s not a moonshot. It’s a known problem with a known solution and a proven open source toolchain.

Here’s the spec I wrote for it:

# SDD: Speaker Diarization via WhisperX Sidecar

## Goal
Add speaker diarization to LibreMeet — identifying who said what in a meeting. Uses WhisperX (BSD-2-Clause) as a Python sidecar process, following the same architectural pattern as the existing llama-helper sidecar for local AI.

## Architecture
The existing transcription pipeline stays unchanged for real-time capture. After a meeting is recorded, the user can run diarization as a post-processing step. The flow:

1. User clicks "Identify Speakers" on the meeting details page
2. Backend exports the meeting's audio file to a temp location
3. Backend spawns a WhisperX Python sidecar process
4. WhisperX runs: VAD → forced alignment → pyannote diarization → speaker assignment
5. Sidecar writes results as JSON: array of {speaker_id, start_sec, end_sec, text}
6. Backend reads the JSON, assigns speaker IDs to existing transcript chunks
7. Frontend re-renders with speaker labels

## Requirements

### Sidecar Script
- Path: `libremeet-sidecars/whisperx_diarize.py`
- Python script using `whisperx` pip package
- Accepts: `--audio <path> --output <path> --hf-token <token>`
- Runs WhisperX with `--diarize --model large-v3`
- Writes JSON output: [{speaker_id, start_sec, end_sec, text}]
- If pyannote model not available, falls back gracefully (no speakers, no crash)

### Rust Backend
- New module: `frontend/src-tauri/src/diarization/`
- New Tauri command: `api_diarize_meeting(meeting_id: String, hf_token: Option<String>) -> Result<Json>`
- Command flow:
  1. Load meeting metadata from DB (get audio file path)
  2. Check if WhisperX sidecar script exists at known path
  3. Spawn sidecar process, monitor progress via events
  4. Read output JSON, update transcript_chunks table with speaker_id
  5. Return {status, speaker_count, chunk_count}
- Progress events: "diarization-progress" with {status: "processing"|"completed"|"error", message}
- Store speaker labels in meeting metadata: {speaker_1: "Speaker 1", speaker_2: "Speaker 2"}

### Frontend
- New button in meeting details: "Identify Speakers" in the TranscriptButtonGroup
- While running: show spinner with "Identifying speakers..."
- After completion: transcript chunks re-render with speaker labels (colored badges)
- Speaker labels are editable — click to rename (e.g., "Alice", "Bob")
- Renamed labels persist to meeting metadata

### Database
- Add `speaker_id` column to `transcript_chunks` table (nullable TEXT)
- Migration: `ALTER TABLE transcript_chunks ADD COLUMN speaker_id TEXT`
- Meeting metadata JSON gets a `speakers` field: {"1": "Speaker 1", "2": "Speaker 2"}

## Constraints
- No changes to real-time transcription pipeline
- Diarization is post-process only for v1
- WhisperX requires Python + pip — document as dependency
- HF token is optional (required for pyannote model download, not for Whisper)
- Follow existing sidecar pattern from summary/summary_engine/sidecar.rs

## Files to Touch
- frontend/src-tauri/src/diarization/mod.rs — NEW
- frontend/src-tauri/src/diarization/commands.rs — NEW
- frontend/src-tauri/src/lib.rs — register module + command
- frontend/src-tauri/Cargo.toml — NO new crates (use existing tokio/reqwest)
- frontend/src/app/meeting-details/page.tsx — add diarization button
- frontend/src/hooks/meeting-details/useCopyOperations.ts — add handleDiarize
- frontend/src/components/MeetingDetails/TranscriptPanel.tsx — show speaker labels
- libremeet-sidecars/whisperx_diarize.py — NEW Python sidecar script
- frontend/src-tauri/migrations/ — new migration for speaker_id column

## Do NOT Touch
- Audio capture/recording pipeline
- Whisper/Parakeet transcription engines
- Summary engine
- Any real-time streaming code
- Core audio device management

## Acceptance Criteria
1. User records a meeting with 2+ speakers
2. User clicks "Identify Speakers" after recording
3. Sidecar runs, progress shown in UI
4. Transcript chunks display speaker labels (colored)
5. User can rename "Speaker 1" to a custom name
6. Labels persist on page reload
7. Works offline if WhisperX + models are installed
8. Graceful error if Python/WhisperX not installed
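The sidecar contract in the spec (arguments in, JSON out, no crash when the model is unavailable) can be sketched roughly as follows. This is not the code droid produced. The `run_sidecar` function and the injected `diarize` callable are hypothetical names used for illustration, and the actual WhisperX calls are deliberately left out because their signatures vary between versions.

```python
import argparse
import json
import sys


def build_parser():
    """CLI matching the flags named in the spec."""
    parser = argparse.ArgumentParser(description="Diarization sidecar (sketch)")
    parser.add_argument("--audio", required=True, help="path to the meeting audio")
    parser.add_argument("--output", required=True, help="path for the JSON result")
    parser.add_argument("--hf-token", default=None, help="optional Hugging Face token")
    return parser


def run_sidecar(argv, diarize):
    """Run diarization and always leave valid JSON at --output.

    diarize(audio_path, hf_token) is a hypothetical callable that wraps
    the real WhisperX pipeline and returns a list of dicts with
    "speaker", "start", "end", and "text" keys.
    Returns the number of segments written.
    """
    args = build_parser().parse_args(argv)
    try:
        segments = diarize(args.audio, args.hf_token)
    except Exception as exc:
        # Graceful fallback from the spec: no speakers, no crash.
        print(f"diarization unavailable: {exc}", file=sys.stderr)
        segments = []
    rows = [
        {
            "speaker_id": seg.get("speaker"),
            "start_sec": seg["start"],
            "end_sec": seg["end"],
            "text": seg["text"],
        }
        for seg in segments
    ]
    with open(args.output, "w", encoding="utf-8") as fh:
        json.dump(rows, fh)
    return len(rows)
```

Keeping the pipeline behind an injected callable means the argument parsing, fallback, and output format can be exercised without Python ML dependencies installed, which is also the situation acceptance criterion 8 describes.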

I handed this to droid and let it cook. The result came back:

  • 13 files changed, 488 insertions, 262 deletions
  • Python sidecar (whisperx_diarize.py) — runs WhisperX with VAD, forced alignment, pyannote diarization, writes JSON segments with speaker labels
  • Rust backend — new diarization/ module with sidecar spawning, process monitoring, and two Tauri commands: api_diarize_meeting and api_save_speaker_names
  • Database migration — adds speaker_id to transcript chunks
  • Frontend — colored speaker badges (6-color rotation), click-to-rename labels, real-time progress events, “Identify Speakers” button that shows spinner during processing

One spec. One droid pass. 15 minutes. A feature that Meetily Pro marks as “Coming Soon” and hasn’t shipped, implemented in a single afternoon by an AI coding agent working against a fork of the MIT codebase.

The spec was three pages. The implementation touched every layer of the stack. And it worked.

This is what vibe coding looks like when the feature is genuinely hard. The droid pass is the last step, not the whole job: research, architecture, integration, and verification come first, with AI coding agents handling the implementation once the architecture is understood.

The same pattern holds for the other hard problems, and for the question Part 1 left open about “enhanced accuracy models.” Meetily’s catalog ships Whisper Large V3 Turbo and Parakeet. NVIDIA’s Canary Qwen 2.5B achieves 5.63% WER on the Open ASR Leaderboard (roughly a 24% relative error reduction over Whisper’s 7.4%) and is CC-BY-4.0 licensed. IBM’s Granite Speech 3.3 8B is Apache 2.0. Neither requires a paywall. The fork’s answer to “enhanced accuracy” is a model download button. It’s free.
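For readers checking the arithmetic, the 24% figure is a relative reduction in word error rate, not a difference in percentage points. A short sketch using the two WER values quoted above (the function name is just for illustration):

```python
def relative_error_reduction(baseline_wer, new_wer):
    """Fraction of the baseline's errors that the new model removes."""
    return (baseline_wer - new_wer) / baseline_wer


# WER values as quoted in the text, in percent.
reduction = relative_error_reduction(7.4, 5.63)
print(f"{reduction:.1%}")  # prints 23.9%
```

In absolute terms the gap is 1.77 percentage points, which works out to roughly one fewer wrong word in every 56 spoken.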

None of these are solved by a one-page spec and a single droid pass. But they’re all solved by the same method: find the existing open source solution, understand the architecture change it requires, and make the change.

What This Actually Proves

Three articles, one fork, two shipped features, one hard problem engaged. The series set out to answer a question: can vibe coding actually fork an open-core product and make it competitive?

The answer is: it depends on the feature.

The features Meetily gates behind Pro are, in most cases, either already in the MIT codebase or a few hundred lines of Rust away. A template editor, PDF export, DOCX export. Those are Low to Medium effort, and droid can handle them in a single spec-to-implementation pass with no architectural changes.

The genuinely hard problems (speaker diarization, meeting auto-detection, RAG over meeting history) aren’t behind a paywall for a reason. Even Meetily Pro hasn’t shipped them. They’re “Coming Soon” for everyone.

The fork’s advantage isn’t that it can implement features the upstream can’t. It’s that it doesn’t have to sell those features to anyone. The fork doesn’t need a business model. It doesn’t need to distinguish Pro from Community to justify a price tag. It can ship speaker diarization the moment it works well enough, not when it’s polished enough to monetize.

That changes the calculus. The fork doesn’t need to match Pro’s feature set. It needs to be good enough for the people who need it. And for the people who need it (self-hosters, privacy-conscious professionals, teams that can’t send audio to the cloud) the fork is already competitive with the MIT codebase alone.

The source is awake. The code is at github.com/magnus919/libremeet, now archived. Fork it, build it, make it yours.

This fork exists to illustrate what vibe coding can do when applied to open-core software. I’ve got my hands full with groktocrawl and this series was meant to be illustrative, not a commitment to maintain a functional fork. The code is shared so that someone who is motivated can run with it. If that’s you, I’d love to see what you build.