VoiceInk: Building a Privacy-First Voice Transcription System with On-Device Whisper

Hook

Your voice transcription app is probably sending your dictation to the cloud right now. VoiceInk processes everything locally, achieving 99% accuracy without your data ever leaving your Mac.

Context

Voice-to-text transcription has become table stakes for productivity tools, but the market is dominated by subscription services that require cloud connectivity. Products like Superwhisper and Wispr Flow offer excellent accuracy, but at the cost of your privacy and $10-15 monthly fees. Meanwhile, macOS's built-in dictation falls short: the offline version is barely usable, and the enhanced mode requires sending audio to Apple's servers.

OpenAI's release of Whisper changed the game by providing state-of-the-art speech recognition that could run locally. But there was a gap between the raw Whisper models and a polished, system-integrated tool that developers and writers could actually use throughout their day. VoiceInk fills this gap as a native macOS app that embeds Whisper entirely on-device, using whisper.cpp bindings and the newer Parakeet models through FluidAudio. It's GPL v3 licensed, meaning you can audit exactly what happens to your voice data, or you can purchase a license for automatic updates and support.

Technical Insight

VoiceInk's architecture centers on three core components: audio capture, local ML inference, and system-wide text injection. Unlike Electron-based alternatives, it's pure Swift leveraging macOS 14.4+ APIs, which means direct access to AudioToolbox for low-latency capture and Accessibility APIs for seamless text insertion.
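
To make the three-stage split concrete, here is a minimal sketch of how such components could be composed. The protocol and class names (`AudioCapturing`, `Transcribing`, `TextInjecting`, `DictationPipeline`) are hypothetical and chosen for illustration; VoiceInk's actual types may look quite different.

```swift
import Foundation

// Hypothetical protocol names; VoiceInk's real types may differ.
protocol AudioCapturing {
    func start()
    func stop() -> [Float]   // returns the samples recorded since start()
}

protocol Transcribing {
    func transcribe(audioData: [Float]) -> String
}

protocol TextInjecting {
    func insert(_ text: String)
}

/// Wires the three stages together: record while the shortcut is held,
/// then transcribe and inject once it is released.
final class DictationPipeline {
    private let capture: AudioCapturing
    private let engine: Transcribing
    private let injector: TextInjecting

    init(capture: AudioCapturing, engine: Transcribing, injector: TextInjecting) {
        self.capture = capture
        self.engine = engine
        self.injector = injector
    }

    func shortcutPressed() {
        capture.start()
    }

    func shortcutReleased() {
        let samples = capture.stop()
        guard !samples.isEmpty else { return }   // nothing was recorded
        let text = engine.transcribe(audioData: samples)
            .trimmingCharacters(in: .whitespacesAndNewlines)
        if !text.isEmpty {
            injector.insert(text)
        }
    }
}
```

Keeping each stage behind a protocol is one way to swap inference backends (for example, whisper.cpp versus Parakeet) without touching capture or injection code.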

The audio capture pipeline uses AVAudioEngine to record from the selected input device. When you trigger the global keyboard shortcut (default Cmd+Shift+Space), VoiceInk initializes a recording session that captures PCM audio buffers. These buffers are stored in memory until you release the shortcut, at which point they're passed to the inference engine. The app also intelligently pauses media playback during recording using the MediaPlayer framework, preventing your podcast or music from contaminating the audio stream.
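
The "buffer in memory until release" step can be sketched as a small thread-safe accumulator. This is an illustration under stated assumptions, not VoiceInk's code: `RecordingBuffer` is a hypothetical name, and the wiring that would feed it (for example, an input tap installed with `installTap(onBus:bufferSize:format:block:)` on the engine's input node) is omitted. The 16 kHz default reflects the mono sample rate Whisper models expect.

```swift
import Foundation

/// Collects PCM chunks while recording is active.
/// In a real app, `append` would be called from the audio tap callback,
/// which runs off the main thread, hence the lock.
final class RecordingBuffer {
    private var samples: [Float] = []
    private let lock = NSLock()
    let sampleRate: Double

    init(sampleRate: Double = 16_000) {
        self.sampleRate = sampleRate
    }

    /// Appends one chunk of mono samples.
    func append(_ chunk: [Float]) {
        lock.lock(); defer { lock.unlock() }
        samples.append(contentsOf: chunk)
    }

    /// Returns everything recorded so far and resets the buffer.
    func drain() -> [Float] {
        lock.lock(); defer { lock.unlock() }
        let result = samples
        samples.removeAll(keepingCapacity: true)
        return result
    }

    /// Length of the buffered audio in seconds.
    var duration: TimeInterval {
        lock.lock(); defer { lock.unlock() }
        return Double(samples.count) / sampleRate
    }
}
```

Calling `drain()` when the shortcut is released hands the full recording to the inference engine and leaves the buffer empty for the next session.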

The inference layer is where things get interesting. VoiceInk doesn't use Apple's Speech framework. Instead, it embeds whisper.cpp, a C/C++ implementation of OpenAI's Whisper that's optimized for local execution and exposes a plain C API. Here's a simplified example of how the Swift bridge to that API might work:

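import Foundation
import whisper  // Module name is an assumption; it depends on how whisper.cpp is bridged

class TranscriptionEngine {
    private var whisperContext: OpaquePointer?

    func initialize(modelPath: String) {
        let params = whisper_context_default_params()
        whisperContext = whisper_init_from_file_with_params(modelPath, params)
    }

    // audioData must be 16 kHz mono Float32 PCM, the format Whisper expects
    func transcribe(audioData: [Float]) -> String {
        guard let context = whisperContext else { return "" }

        var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
        params.translate = false
        params.print_realtime = false

        // Run inference. `language` is a C string pointer, so it is only
        // valid inside the withCString closure.
        let status: Int32 = "en".withCString { language in
            params.language = language
            return audioData.withUnsafeBufferPointer { buffer in
                whisper_full(context, params, buffer.baseAddress, Int32(buffer.count))
            }
        }
        guard status == 0 else { return "" }  // Non-zero means inference failed

        // Extract transcription
        let segmentCount = whisper_full_n_segments(context)
        var fullText = ""
        for i in 0..<segmentCount {
            if let text = whisper_full_get_segment_text(context, i) {
                fullText += String(cString: text)
            }
        }

        return fullText
    }

    deinit {
        // Release the native context when the engine goes away
        if let context = whisperContext {
            whisper_free(context)
        }
    }
}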

The real innovation is Power Mode: context-aware transcription that adapts based on what app you're using. VoiceInk uses the NSWorkspace API to detect the frontmost application, and for browsers, it can even extract the current URL using accessibility APIs. This enables rules like "when I'm in Slack, use casual tone and don't capitalize, but when I'm in my code editor, expect technical terminology." The system maintains a dictionary of app bundle identifiers mapped to transcription profiles, which might look like this:

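import AppKit

enum WhisperModel { case tiny, base, small, medium }
enum PunctuationMode { case automatic, minimal }  // Illustrative cases

struct TranscriptionProfile {
    let modelVariant: WhisperModel
    let customDictionary: [String: String]  // "kubernetes" -> "Kubernetes"
    let autoCapitalize: Bool
    let punctuationStyle: PunctuationMode
    let promptContext: String?  // Primes the model with domain context

    // Fallback used when no app-specific profile matches
    static let `default` = TranscriptionProfile(
        modelVariant: .base,
        customDictionary: [:],
        autoCapitalize: true,
        punctuationStyle: .automatic,
        promptContext: nil
    )
}

class PowerMode {
    // Keyed by bundle identifier, e.g. "com.tinyspeck.slackmacgap" for Slack
    private var profiles: [String: TranscriptionProfile] = [:]

    func getProfileForContext() -> TranscriptionProfile {
        // No frontmost app or no bundle identifier: use the default profile
        guard let bundleId = NSWorkspace.shared.frontmostApplication?.bundleIdentifier else {
            return .default
        }
        return profiles[bundleId] ?? .default
    }
}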
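
The browser case adds one more lookup level: a rule for the current website should take precedence over the rule for the browser itself. A rough sketch of that precedence follows. `ProfileResolver` and its fields are hypothetical, the URL is assumed to have already been extracted via accessibility APIs, and the type is generic so the example stays self-contained.

```swift
import Foundation

/// Hypothetical resolver: website rules win over app rules,
/// which win over the fallback profile.
struct ProfileResolver<Profile> {
    var byHost: [String: Profile] = [:]       // e.g. "github.com"
    var byBundleId: [String: Profile] = [:]   // e.g. "com.apple.dt.Xcode"
    let fallback: Profile

    func resolve(bundleId: String?, browserURL: String?) -> Profile {
        if let urlString = browserURL,
           let host = URLComponents(string: urlString)?.host?.lowercased() {
            // Strip a leading "www." so "www.github.com" matches "github.com"
            let key = host.hasPrefix("www.") ? String(host.dropFirst(4)) : host
            if let profile = byHost[key] {
                return profile
            }
        }
        if let id = bundleId, let profile = byBundleId[id] {
            return profile
        }
        return fallback
    }
}
```

With this shape, dictating into a GitHub tab could select a technical profile even though the frontmost app is a general-purpose browser.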

Text injection is perhaps the trickiest part. VoiceInk uses macOS Accessibility APIs to insert transcribed text into the active application, which requires granting accessibility permissions in System Settings. The app creates a CGEvent for keyboard input and posts it to the system event stream, simulating typing. For performance, it batches characters and uses paste events for longer transcriptions. The personal dictionary feature adds another layer: before injection, the transcribed text passes through a replacement engine that swaps recognized patterns with user-defined alternatives, handling cases like "kubernetes" autocorrecting to "Kubernetes" or "thx" expanding to "thanks."
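
The replacement step is easy to picture in isolation. The function below is a minimal sketch, assuming whole-word, case-insensitive matching; `applyDictionary` is a hypothetical name and VoiceInk's actual matching rules may differ.

```swift
import Foundation

/// Applies user-defined replacements to whole words, ignoring case.
/// Keys are matched case-insensitively; values are inserted verbatim.
func applyDictionary(_ replacements: [String: String], to text: String) -> String {
    var result = text
    // Longest keys first, so a multi-word entry is tried before a shorter entry it contains
    for key in replacements.keys.sorted(by: { $0.count > $1.count }) {
        guard let value = replacements[key] else { continue }
        // \b anchors the match to word boundaries; the key is escaped so it is taken literally
        let pattern = "\\b" + NSRegularExpression.escapedPattern(for: key) + "\\b"
        guard let regex = try? NSRegularExpression(pattern: pattern,
                                                   options: [.caseInsensitive]) else {
            continue
        }
        let range = NSRange(result.startIndex..., in: result)
        result = regex.stringByReplacingMatches(
            in: result,
            options: [],
            range: range,
            withTemplate: NSRegularExpression.escapedTemplate(for: value)
        )
    }
    return result
}

let cleaned = applyDictionary(
    ["kubernetes": "Kubernetes", "thx": "thanks"],
    to: "thx for the kubernetes tip"
)
print(cleaned)
```

Because replacements run one after another, a value that itself contains another key would be rewritten again; a production engine would likely need a single-pass strategy to avoid that.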

Gotcha

The macOS 14.4+ requirement is a significant barrier. This version shipped in March 2024, and many developers and organizations are still on Monterey or Ventura for stability reasons. The decision likely stems from using newer AVFoundation APIs or Swift concurrency features, but it immediately excludes a large potential user base. There's no fallback mode or graceful degradation for older systems.

The "not accepting pull requests" stance is puzzling for an open-source project. While the code is GPL v3 and you can fork freely, the upstream repository won't merge community contributions. This creates a fragmented ecosystem where improvements live in scattered forks rather than consolidating in the main project. The author's open-core business model—selling licenses for updates and support—may explain this choice, but it means you're essentially dealing with source-available software rather than true community-driven open source. If you need a feature or bug fix, you're either building and maintaining your own fork or hoping the solo maintainer prioritizes your issue. For production use in a team environment, this creates maintenance risk.

Verdict

Use if: You're on macOS 14.4+ and privacy is non-negotiable, whether due to regulatory requirements, handling sensitive information, or philosophical stance. The offline-first architecture means your audio never has to leave the machine, and the GPL license lets you verify that claim. It's also ideal if you frequently switch contexts between different apps and would benefit from automatic transcription profile switching. The personal dictionary makes it valuable for technical writing where domain-specific terminology matters.

Skip if: You need cross-platform support, are on older macOS versions, or require active community development. The lack of pull request acceptance means you're dependent on a single maintainer's roadmap and availability. Also skip if you need real-time streaming transcription: VoiceInk works on completed audio segments, so there's latency between stopping recording and seeing text. Cloud-based alternatives may also give you better accuracy on accented speech and in noisy environments, since they can leverage larger, continuously updated models.
