add examples/whisper.linux by nnnet · Pull Request #3656 · ggml-org/whisper.cpp · GitHub

add examples/whisper.linux#3656

Open
nnnet wants to merge 1 commit into ggml-org:master from nnnet:whisper-linux

nnnet commented Feb 8, 2026

PR: Add examples/whisper.linux — Voice typing for Linux desktop

Summary

Add a new example application for voice typing on Linux desktop. The app runs as a system tray icon, records speech from the microphone via pw-record/arecord, transcribes it using whisper-cli, and injects the resulting text at the cursor position via xdotool/wtype/clipboard.

  • Works on both X11 and Wayland (including GNOME, KDE, Sway)
  • Two input modes: hotkey (push-to-talk) and listen (wake word activation)
  • Two output modes: batch (transcribe all at once) and stream (transcribe per speech segment via VAD)
  • Voice commands: spoken words like "enter", "backspace", "tab" trigger key presses instead of being typed
  • Hallucination filter: rejects known training data leaks and checks speech rate sanity
  • Full system tray menu: language, model (with download), GPU, audio device, all settings
  • 204 unit tests (all mocked, no hardware needed)
  • Install script: builds whisper-cli, downloads model, configures system
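The record, transcribe, inject flow described above can be sketched as three command builders. This is a minimal illustration, not the code in this PR: the function names are hypothetical, the commands are only constructed (never executed), and the choice of `arecord` and `xdotool` is one assumed path through the options listed. The flags shown are standard options of `arecord`, `whisper-cli`, and `xdotool`.

```python
def record_cmd(wav_path: str) -> list[str]:
    """Build an arecord command: 16-bit mono at 16 kHz, the WAV format whisper.cpp reads."""
    return ["arecord", "-q", "-f", "S16_LE", "-r", "16000", "-c", "1", wav_path]


def transcribe_cmd(model_path: str, wav_path: str, language: str = "en") -> list[str]:
    """Build a whisper-cli command that prints plain text (no timestamps, no log output)."""
    return ["whisper-cli", "-m", model_path, "-f", wav_path, "-l", language, "-nt", "-np"]


def inject_cmd(text: str) -> list[str]:
    """Build an xdotool command that types the text at the current cursor position."""
    # "--" ends option parsing so text starting with "-" is not read as a flag.
    return ["xdotool", "type", "--clearmodifiers", "--", text]


if __name__ == "__main__":
    # Print the three stages in order; a real app would run them with subprocess.
    for cmd in (
        record_cmd("/tmp/speech.wav"),
        transcribe_cmd("models/ggml-base.bin", "/tmp/speech.wav"),
        inject_cmd("hello world"),
    ):
        print(" ".join(cmd))
```

Keeping each stage as a plain argument list makes it easy to unit-test the pipeline without a microphone or display.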

Features

Input / Output Modes

Two independent axes give 4 combinations:

| | batch | stream |
|---|---|---|
| hotkey | Record → transcribe all → inject | Each speech segment transcribed and injected live |
| listen | Wake word → accumulate text → inject on stop | Wake word → each segment injected live |
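One way to model the two independent axes is a pair of enums whose product yields the four combinations. The names `InputMode`, `OutputMode`, and `describe` are hypothetical and chosen for illustration; the PR's own `config.py` may represent modes differently.

```python
from enum import Enum
from itertools import product


class InputMode(Enum):
    HOTKEY = "hotkey"   # push-to-talk
    LISTEN = "listen"   # wake word activation


class OutputMode(Enum):
    BATCH = "batch"     # transcribe everything at once
    STREAM = "stream"   # transcribe per speech segment


def describe(inp: InputMode, out: OutputMode) -> str:
    """Summarize how a given mode pair starts recording and delivers text."""
    start = "wake word" if inp is InputMode.LISTEN else "hotkey press"
    if out is OutputMode.STREAM:
        delivery = "inject each speech segment live"
    else:
        delivery = "inject all text on stop"
    return f"{start} -> {delivery}"


# Every (input, output) pair; the axes are independent, so there are 2 x 2 of them.
COMBINATIONS = list(product(InputMode, OutputMode))

if __name__ == "__main__":
    for inp, out in COMBINATIONS:
        print(f"{inp.value}+{out.value}: {describe(inp, out)}")
```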

Voice Commands

| Word (EN) | Word (RU) | Action |
|---|---|---|
| enter | энтер, ввод | Press Enter |
| backspace | бэкспейс, назад | Delete previous word |
| tab | таб, табуляция | Press Tab |
| escape, stop | эскейп, стоп | Press Escape |

Commands are matched fuzzily (similarity threshold 0.75), and the command list can be edited from the tray menu.
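As a rough illustration of threshold-based fuzzy matching, the sketch below scores a spoken word against each command word with `difflib.SequenceMatcher` and accepts the best score at or above 0.75. The similarity metric, the `COMMANDS` mapping, and the action labels are assumptions made for this example; the PR does not state which metric `commands.py` uses.

```python
from difflib import SequenceMatcher
from typing import Optional

# Command words from the table above, mapped to hypothetical action labels.
COMMANDS = {
    "enter": "press_enter", "энтер": "press_enter", "ввод": "press_enter",
    "backspace": "delete_previous_word", "бэкспейс": "delete_previous_word",
    "назад": "delete_previous_word",
    "tab": "press_tab", "таб": "press_tab", "табуляция": "press_tab",
    "escape": "press_escape", "stop": "press_escape",
    "эскейп": "press_escape", "стоп": "press_escape",
}


def match_command(word: str, threshold: float = 0.75) -> Optional[str]:
    """Return the action for the closest command word, or None if nothing is close enough."""
    spoken = word.strip().lower()
    best_action, best_score = None, 0.0
    for phrase, action in COMMANDS.items():
        score = SequenceMatcher(None, spoken, phrase).ratio()
        if score > best_score:
            best_action, best_score = action, score
    return best_action if best_score >= threshold else None


if __name__ == "__main__":
    for spoken in ("enter", "Entr", "табуляция", "hello"):
        print(spoken, "->", match_command(spoken))
```

A threshold below 1.0 tolerates small transcription slips such as a dropped letter, while unrelated words fall well under it and are typed as ordinary text.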

Hallucination Filter

Two layers:

  1. Pattern matching — rejects known whisper training data leaks ("subtitles by...", "спасибо за просмотр" (Russian for "thanks for watching"), etc.)
  2. Speech rate check — rejects text that is impossibly long for the audio duration (max 5 words/sec, 25 chars/sec)
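The two layers can be sketched as a single predicate. The two patterns and both rate limits come from the description above; the function name `is_hallucination`, the case-insensitive matching, and the handling of zero-length audio are assumptions for this example.

```python
import re

# Layer 1: phrases that whisper is known to emit from its training data.
LEAK_PATTERNS = [
    re.compile(r"subtitles by", re.IGNORECASE),
    re.compile(r"спасибо за просмотр", re.IGNORECASE),
]

# Layer 2: upper bounds on plausible speech rate.
MAX_WORDS_PER_SEC = 5.0
MAX_CHARS_PER_SEC = 25.0


def is_hallucination(text: str, audio_seconds: float) -> bool:
    """Return True if the transcript should be rejected."""
    if any(pattern.search(text) for pattern in LEAK_PATTERNS):
        return True
    stripped = text.strip()
    if audio_seconds <= 0:
        # Assumption: any text attributed to zero-length audio is suspect.
        return bool(stripped)
    words_per_sec = len(stripped.split()) / audio_seconds
    chars_per_sec = len(stripped) / audio_seconds
    return words_per_sec > MAX_WORDS_PER_SEC or chars_per_sec > MAX_CHARS_PER_SEC


if __name__ == "__main__":
    print(is_hallucination("Subtitles by the community", 3.0))
    print(is_hallucination("hello world", 1.0))
```

For instance, six words attributed to one second of audio exceed the 5 words/sec limit and are rejected even though no pattern matches.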

Text Injection (Wayland + X11)

Fallback chain: wtype → ydotool (with evdev key name translation) → xdotool (via XWayland) → clipboard paste.
Non-ASCII text (e.g. Cyrillic) always uses clipboard paste to avoid encoding issues.
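The selection logic above might look roughly like the following. `choose_backend` is a hypothetical name, and the `which` parameter is injectable only so the example can be tested without the tools installed; how `injector.py` actually probes for tools is not stated in this PR.

```python
import shutil
from typing import Callable, Optional

# Tools tried in order before falling back to clipboard paste.
FALLBACK_CHAIN = ("wtype", "ydotool", "xdotool")


def choose_backend(
    text: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Pick the injection backend for a piece of text."""
    # Non-ASCII text (e.g. Cyrillic) always goes through the clipboard.
    if not text.isascii():
        return "clipboard"
    # Otherwise use the first tool in the chain that is found on PATH.
    for tool in FALLBACK_CHAIN:
        if which(tool):
            return tool
    return "clipboard"


if __name__ == "__main__":
    print(choose_backend("hello"))
    print(choose_backend("привет"))
```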

System Tray

Full settings menu: language, model selection (with one-click download from Hugging Face), GPU device, audio device, input/output mode, wake word, silence timeout, voice commands editor.

Files

examples/whisper.linux/
  whisper-linux               # Launcher script
  install.sh                  # One-command install (deps, build, model, config)
  whisper-linux.desktop       # Desktop entry for autostart
  README.md                   # Documentation
  .gitignore                  # Exclude __pycache__, .pytest_cache
  app/                        # Python package
    __init__.py               # Re-exports public API
    __main__.py               # Entry point
    config.py                 # Config, AppState, constants, helpers
    audio.py                  # AudioRecorder, AudioStream, SimpleVAD
    transcriber.py            # Transcriber, WakeWordDetector
    injector.py               # TextInjector (xdotool/wtype/clipboard)
    commands.py               # VoiceCommands (fuzzy matching)
    tray.py                   # TrayIcon, system tray menu, settings UI
    app.py                    # WhisperLinuxApp, state machine, CLI
  tests/
    conftest.py               # Fixtures
    test_whisper_linux.py     # 204 unit tests (all mocked)
  run_tests.sh                # Run tests with one command

Dependencies

  • Python 3.10+, PyQt5
  • whisper-cli (built from this repo)
  • xdotool or wtype (text injection)
  • xclip or wl-copy (clipboard fallback)
  • arecord or pw-record (audio capture)

No additional Python packages beyond PyQt5.
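Because several dependencies are "one of" alternatives, a quick check can report which purposes have no tool available. This is an illustrative sketch only: `REQUIRED_GROUPS` and `missing_groups` are hypothetical names, and `install.sh` may verify dependencies in a different way.

```python
import shutil
from typing import Callable, Optional

# Each purpose is satisfied if at least one of its tools is on PATH.
REQUIRED_GROUPS = {
    "transcription": ("whisper-cli",),
    "text injection": ("xdotool", "wtype"),
    "clipboard fallback": ("xclip", "wl-copy"),
    "audio capture": ("arecord", "pw-record"),
}


def missing_groups(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the purposes for which no candidate tool was found."""
    return [
        purpose
        for purpose, tools in REQUIRED_GROUPS.items()
        if not any(which(tool) for tool in tools)
    ]


if __name__ == "__main__":
    missing = missing_groups()
    print("All dependencies found" if not missing else f"Missing: {', '.join(missing)}")
```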

Test plan

  • All 204 unit tests pass (python3 -m pytest tests/ -v)
  • Manual test: hotkey+batch mode (record → stop → text appears)
  • Manual test: hotkey+stream mode (speak → text appears per segment)
  • Manual test: listen+stream mode (wake word → dictate → text appears)
  • Manual test: voice commands (say "enter", "tab", "backspace")
  • Manual test: model download via tray menu
  • Manual test: keyboard shortcut setup (GNOME/KDE)
  • Test on X11
  • Test on Wayland (GNOME)
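To show what "all mocked, no hardware needed" can mean in practice, here is a minimal pytest-style test that patches `subprocess.run`, so no `xdotool` binary or display is required. The helper `type_text` is hypothetical and stands in for the real injector; the actual tests in `tests/test_whisper_linux.py` may be structured differently.

```python
import subprocess
from unittest import mock


def type_text(text: str) -> bool:
    """Hypothetical helper: type text via xdotool and report whether it succeeded."""
    result = subprocess.run(["xdotool", "type", "--", text], check=False)
    return result.returncode == 0


def test_type_text_invokes_xdotool():
    # Replace subprocess.run so the test never launches a real process.
    with mock.patch("subprocess.run") as fake_run:
        fake_run.return_value = mock.Mock(returncode=0)
        assert type_text("hello") is True
    fake_run.assert_called_once_with(["xdotool", "type", "--", "hello"], check=False)


def test_type_text_reports_failure():
    # A non-zero exit code should be surfaced as False.
    with mock.patch("subprocess.run") as fake_run:
        fake_run.return_value = mock.Mock(returncode=1)
        assert type_text("hello") is False
```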
