Building Granola.ai clone macOS desktop app
To get a clean implementation plan, I need decisions on these points:
- What stack do you want for the desktop app shell: Electron, Tauri, native Swift/SwiftUI, or “use whatever fits this repo best”?
- For the macOS target, should we support only a single version (macOS 15.x, for example) or “latest major + previous”?
- For “capture system audio input & output,” do you mean:
- microphone + system output mixed together into one stream, or
- two separate streams (mic vs output) that we can tag in transcript?
- Is system-output capture allowed to require a virtual audio device (e.g. BlackHole), or must it work with no external driver install?
- Should capture be global (all system output) or limited to a selected app/window?
- Do you want live transcription while recording, or only transcription after stopping capture?
- For voxtral.c, do you already have a model/runtime choice, or should I pick defaults and design a pluggable subprocess wrapper?
- Which summarization path do you want:
- local model,
- API model (OpenAI/Anthropic/etc.),
- or placeholder summarizer first?
- What summary format do you want exactly: fixed Markdown template (sections like Key Points, Decisions, Action Items) or free-form?
- Should notes/transcripts be persisted locally (history list), or only in-memory for now?
- Any hard constraints on privacy/security: fully offline only, no cloud calls, encrypted local storage, etc.?
- What is the “good enough” UX for v1:
- single screen with Start/Stop + transcript + markdown summary,
- or include device selectors, status indicators, and error/permission flows from day one?
- Do you want speaker/source attribution in transcript (Mic, System) and timestamps?
- Should summary regenerate continuously as transcript grows, or only once at the end?
- For v1, are we okay with unsigned/dev build runs, or do you want proper app signing/notarization setup early?
If you answer these, I can turn it into a concrete build plan and start implementing immediately.
Perfect — with your answers, we can move fast.
Given the repo is currently empty, my recommendation is:
- Swift + SwiftUI desktop app (most natural for macOS-only audio permissions/capture).
- Global capture of system output using Apple APIs (ScreenCaptureKit) + mic capture (AVFoundation), no external driver.
- Source attribution as Mic vs System in transcript (true per-person diarization from mixed system audio is not reliably possible in v1).
- Record first, then transcribe once on stop (simpler and more stable).
- Summarize at the end via OpenAI API behind a tiny provider interface (so we can swap providers later).
Only 4 final decisions needed from you:
- Is Mic vs System attribution acceptable instead of per-human speaker IDs?
- Should we target English-only first or auto-detect language?
- Which OpenAI model do you want for summarization (or should I pick a sensible default)?
- Are you okay with granting Screen Recording permission (required by macOS path for system-audio capture)?
If yes, I’ll start scaffolding the app immediately.
Implementation Plan (Review Draft)
- Build a macOS-only SwiftUI app with one screen:
- Start/Stop capture button
- Live status area (idle, capturing, transcribing, summarizing, done, error)
- Transcript panel
- Markdown summary panel
- Capture audio from two sources in parallel:
- Mic via AVAudioEngine/AVFoundation
- System output via ScreenCaptureKit stream audio
- Keep streams logically separated so transcript segments can be tagged Mic or System
- Buffer capture to files (simplest stable path):
- Write mic.wav and system.wav during recording
- On stop, finalize both files
- Keep everything in-memory from app perspective (no note history), but temporary audio files are used for processing
- Transcription pipeline via voxtral.c CLI subprocess:
- Add a small Swift Process wrapper (TranscriptionService)
- Invoke CLI separately for each source file (mic.wav, system.wav)
- Parse stdout/stderr and normalize into transcript segments
- Merge by timestamps when available; otherwise append with source tags in capture order
- Speaker/source labeling strategy:
- v1 label segments as Mic and System
- If voxtral provides diarization markers, preserve them in segment metadata
- Display format example:
[00:01:12] Mic: ...
[00:01:15] System: ...
- Summarization at end using OpenAI API:
- SummarizationService with a provider interface (AIProvider) so we can swap APIs later
- Default model: a sensible cost/quality middle ground (gpt-4.1-mini unless you want different)
- Prompt takes full transcript and asks for free-form Markdown summary
- Return raw Markdown and render in app as plain text (or rich markdown view if easy)
- Permissions and UX/error handling:
- Request/check Microphone permission
- Request/check Screen Recording permission for system audio path
- Clear action guidance when permission is missing (including “open System Settings”)
- Show subprocess/API errors inline on the same screen
- Architecture (small but clean):
- CaptureService (mic + system)
- TranscriptionService (voxtral process wrapper)
- SummarizationService (OpenAI call)
- AppViewModel as orchestrator/state machine
- Lightweight models: TranscriptSegment, SessionState, SummaryResult
- Verification strategy:
- Manual smoke flow: start capture → stop → transcription appears with source tags → summary appears
- Edge tests:
- stop quickly (very short audio)
- missing permissions
- voxtral binary not found/fails
- OpenAI key missing/network failure
- Validate no app crash on cancellation and repeated runs
- Delivery phases:
- Phase 1: UI + permissions + capture to files
- Phase 2: voxtral integration + transcript rendering
- Phase 3: OpenAI summarization + markdown output
- Phase 4: stabilization and quality pass
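The subprocess-wrapper step in the plan above could look roughly like the sketch below. It relies only on Foundation's Process and Pipe. The names runCLI and CLIError are hypothetical, and no voxtral flags are assumed: the caller supplies the executable path and arguments.

```swift
import Foundation

/// Hypothetical error type for this sketch.
struct CLIError: Error {
    let exitCode: Int32
    let stderr: String
}

/// Runs a command-line tool and returns its stdout as text.
/// Throws CLIError when the tool exits with a non-zero status.
func runCLI(executable: URL, arguments: [String]) throws -> String {
    let process = Process()
    process.executableURL = executable
    process.arguments = arguments

    let stdoutPipe = Pipe()
    let stderrPipe = Pipe()
    process.standardOutput = stdoutPipe
    process.standardError = stderrPipe

    try process.run()

    // Simplification: the pipes are drained one after the other. A tool that
    // writes a large amount to stderr would need concurrent reads instead.
    let outData = stdoutPipe.fileHandleForReading.readDataToEndOfFile()
    let errData = stderrPipe.fileHandleForReading.readDataToEndOfFile()
    process.waitUntilExit()

    let out = String(decoding: outData, as: UTF8.self)
    let err = String(decoding: errData, as: UTF8.self)
    guard process.terminationStatus == 0 else {
        throw CLIError(exitCode: process.terminationStatus, stderr: err)
    }
    return out
}
```

Keeping the wrapper generic means the same function can be invoked once per source file, with the transcript parsing layered on top of the returned text.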
Answer to your diarization question
voxtral may expose speaker diarization depending on model/mode, but for this flow we should treat that as a best-effort enhancement. The guaranteed v1 labeling is source-based (Mic vs System), which is robust and immediately useful.
If this plan looks good, I’ll implement Phase 1 first and keep each phase reviewable.
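As a rough illustration of the orchestrator/state machine idea in the plan, the status values (idle, capturing, transcribing, summarizing, done, error) can be modeled as an enum with an explicit transition check. This is a minimal sketch: the set of allowed transitions is an assumption, not something the plan specifies.

```swift
/// Session states, mirroring the status labels in the plan.
enum SessionState: Equatable {
    case idle, capturing, transcribing, summarizing, done
    case error(String)

    /// Whether moving to `next` is an allowed step in the capture flow.
    /// The transition table is an assumption made for this sketch.
    func canTransition(to next: SessionState) -> Bool {
        switch (self, next) {
        case (_, .error):
            return true                       // any step may fail
        case (.idle, .capturing),
             (.capturing, .transcribing),
             (.transcribing, .summarizing),
             (.summarizing, .done),
             (.done, .capturing),             // start a new session
             (.error, .capturing),            // retry after a failure
             (.error, .idle):
            return true
        default:
            return false
        }
    }
}
```

A view model could consult canTransition(to:) before each status change, so that a stray Stop tap while idle cannot push the session into transcribing.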
Yes — using OpenAI for transcription too will be simpler than integrating voxtral.c right now.
Quick comparison:
- OpenAI transcription API (whisper-1 / newer speech models):
- Simpler integration: one HTTP call per audio file
- No local binary/model management
- Fewer failure modes (no subprocess parsing, no local model setup)
- Easy to keep Mic vs System by transcribing each file separately
- voxtral.c local CLI:
- More complex integration (process lifecycle, stdout parsing, model/runtime setup)
- Better for offline/local-first goals later
- More moving parts during early MVP
Main tradeoff:
- OpenAI path is easiest for v1, but cloud-dependent and paid per usage.
- Local path is harder now, better for offline/privacy/cost control long-term.
For your MVP, I’d switch to OpenAI transcription + OpenAI summarization.
It will let us ship the end-to-end flow faster.
If you want, I can update the implementation plan to this architecture before we start building.
Updated Implementation Plan (OpenAI for Transcription + Summary)
- Build a macOS-only SwiftUI app with one screen:
- Start/Stop recording button
- Status indicator: Idle, Capturing, Transcribing, Summarizing, Done, Error
- Transcript text area
- Markdown summary text area
- Capture two audio sources in parallel:
- Mic capture via AVFoundation (AVAudioEngine/AVAudioFile)
- System output capture via ScreenCaptureKit audio stream
- Keep streams separate to preserve attribution (Mic vs System)
- Save capture to temporary files:
- Write mic.wav and system.wav in temp dir for the active session
- On stop, finalize files and pass them to API pipeline
- No persistent history in v1; clear session state on next recording
- Transcription with OpenAI API:
- Create TranscriptionService that uploads each file separately
- Use a sensible default model (e.g. gpt-4o-mini-transcribe or latest equivalent available)
- Request timestamped output if supported; fallback to plain text if not
- Build transcript segments tagged by source:
[timestamp] Mic: ...
[timestamp] System: ...
- Segment merge strategy:
- If both sources have timestamps, merge chronologically
- If one/both don’t, render as grouped sections (Mic, System) or append in stop-order
- Keep logic deterministic and simple
- Summarization with OpenAI API:
- Create SummarizationService (provider interface included for easy future swap)
- Default summary model: gpt-4.1-mini (balanced cost/quality)
- Prompt: generate free-form Markdown summary from full transcript
- Show returned Markdown directly in UI
- Permissions and onboarding UX:
- Ask/check Microphone permission
- Ask/check Screen Recording permission
- Provide inline guidance for denied permissions and shortcut to System Settings instructions
- Block Start until required permissions are granted
- App architecture:
- CaptureService (mic + system capture lifecycle)
- TranscriptionService (OpenAI speech-to-text)
- SummarizationService (OpenAI text generation)
- SessionOrchestrator/ViewModel for state machine
- Models: TranscriptSegment, SessionStatus, CaptureArtifacts
- Config and secrets:
- Read OPENAI_API_KEY from environment or local config file for dev
- Centralize API client config (timeouts, retry for transient failures)
- Never log raw secrets
- Error handling:
- Friendly errors for:
- missing permissions
- no audio captured / too short recording
- API auth/network/rate-limit failures
- Allow retry from same screen without restarting app
- Verification plan:
- Manual E2E: capture both sources → stop → transcript with source tags → markdown summary appears
- Edge checks:
- missing API key
- denied permissions
- repeated start/stop sessions
- low-duration clip
- offline network
- Delivery phases:
- Phase 1: app skeleton + single-screen UI + permissions + dual-source capture to files
- Phase 2: OpenAI transcription integration + source-tagged transcript rendering
- Phase 3: OpenAI summarization + markdown output
- Phase 4: stabilization, retries, and UX polish
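The segment merge strategy in the plan (chronological when both sources carry timestamps, grouped by source otherwise) could be sketched as follows. The field layout of TranscriptSegment and the tie-breaking rule are assumptions made for illustration.

```swift
/// Where a segment came from; the raw value doubles as the display label.
enum TranscriptSource: String {
    case mic = "Mic"
    case system = "System"
}

/// Assumed shape of a transcript segment for this sketch.
struct TranscriptSegment {
    let source: TranscriptSource
    let start: Double?   // seconds from recording start; nil when unavailable
    let text: String
}

/// Merges per-source segments deterministically.
/// - If every segment has a timestamp: chronological order, with ties broken
///   by input position (so Mic comes before System at the same instant).
/// - Otherwise: grouped fallback, all Mic segments followed by all System ones.
func mergeSegments(mic: [TranscriptSegment],
                   system: [TranscriptSegment]) -> [TranscriptSegment] {
    let all = mic + system
    guard all.allSatisfy({ $0.start != nil }) else { return all }

    return all.enumerated().sorted { a, b in
        // Force-unwraps are safe: the guard above ensured every start is set.
        let (ta, tb) = (a.element.start!, b.element.start!)
        if ta != tb { return ta < tb }
        return a.offset < b.offset
    }.map { $0.element }
}
```

The explicit offset tie-break keeps the output identical across runs without depending on the sort being stable.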
If you approve, I’ll implement Phase 1 first and keep each phase separately reviewable.
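To make the "provider interface for easy future swap" idea concrete, here is a minimal sketch. AIProvider is the name used in the earlier plan; its method shape, the prompt wording, and the EchoProvider stand-in are all assumptions for illustration, and no real API call is shown.

```swift
/// Provider abstraction named in the plan; the method shape is an assumption.
protocol AIProvider {
    func complete(prompt: String) async throws -> String
}

struct SummarizationService {
    let provider: AIProvider

    /// Builds the free-form Markdown summary prompt from the full transcript.
    static func prompt(for transcript: String) -> String {
        """
        Summarize the following meeting transcript as free-form Markdown.

        Transcript:
        \(transcript)
        """
    }

    /// Sends the prompt to whichever provider was injected.
    func summarize(_ transcript: String) async throws -> String {
        try await provider.complete(prompt: Self.prompt(for: transcript))
    }
}

/// Stand-in provider for local testing; a real one would call the chosen API.
struct EchoProvider: AIProvider {
    func complete(prompt: String) async throws -> String {
        "# Summary\n\n(\(prompt.count) prompt characters received)"
    }
}
```

Because the service depends only on the protocol, swapping OpenAI for another backend would mean adding one more conforming type rather than touching the view model.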
Phase 1 is implemented and buildable.
Implemented:
- Single-screen macOS UI with Start/Stop, status, transcript pane, and markdown summary pane in ContentView.swift.
- Session orchestration/state machine in RecorderViewModel.swift with permission checks, start/stop flow, and phase placeholder states (transcribing/done).
- Dual-source capture service in AudioCaptureService.swift:
- mic capture via AVAudioEngine to mic.wav
- system audio capture via ScreenCaptureKit to system.m4a
- temporary per-session output directory under macOS temp
- Session/permission models in Models.swift.
- Sandbox entitlements updated for mic + network in Softwarola.entitlements.
Verification:
- xcodebuildmcp macos build --workspace-path ./Softwarola.xcworkspace --scheme Softwarola passes.
- xcodebuildmcp swift-package test --package-path ./SoftwarolaPackage passes.
Notes:
- There is one compile warning in AudioCaptureService.swift about AVAssetWriter Sendable capture in finishWriting completion; non-blocking for now.
- This workspace is not initialized as a git repo, so there’s no git status to show.
Next logical step is Phase 2: add OpenAI transcription API integration and replace the placeholder transcript text with real source-tagged segments.
- From the repo root, build and run with:
xcodebuildmcp macos build-and-run --workspace-path "./Softwarola.xcworkspace" --scheme "Softwarola"
- Or open in Xcode and run manually:
- Open Softwarola.xcworkspace
- Select scheme Softwarola
- Press Cmd+R
- First launch permissions you must allow:
- Microphone
- Screen Recording (for system audio)
- If macOS says permissions changed but capture still fails:
- Quit app
- Re-open app (sometimes required after Screen Recording permission grant)
For now, app flow is: Start Capture → Stop Capture → shows paths to captured mic.wav and system.m4a files in the transcript panel.
App is running locally.
Executed:
xcodebuildmcp macos build-and-run --workspace-path "./Softwarola.xcworkspace" --scheme "Softwarola"
Result:
- Build succeeded
- App launched at:
/Users/mk/Library/Developer/Xcode/DerivedData/Softwarola-afkrvxnymwnoghexqsptaeuygalg/Build/Products/Debug/Softwarola.app
If you want, I can proceed directly with Phase 2 (OpenAI transcription integration) now.
Crash fixed: it was a TCC privacy kill due to missing NSMicrophoneUsageDescription.
I added the usage description key in Config/Shared.xcconfig, which feeds the generated Info.plist:
INFOPLIST_KEY_NSMicrophoneUsageDescription = Softwarola needs microphone access to capture your voice during recording.
Then I rebuilt and relaunched with:
xcodebuildmcp macos build-and-run --workspace-path "./Softwarola.xcworkspace" --scheme "Softwarola"
Please try Start Capture again. You should now get a proper microphone permission prompt instead of a crash.
App relaunched successfully with the latest build.
Run command used:
xcodebuildmcp macos build-and-run --workspace-path "./Softwarola.xcworkspace" --scheme "Softwarola"
Please try Start Capture now and tell me exactly what happens (prompt appears, error text in app, or another crash).
Phase 2 is in and working build-wise: Stop Capture now performs real OpenAI transcription for both audio files and renders source-tagged transcript text.
What I changed:
- Added transcription domain models in Models.swift:
- TranscriptSource (Mic/System)
- TranscriptSegment
- TranscriptionResult with ordering mode
- Added OpenAI transcription client in OpenAITranscriptionService.swift:
- Calls POST /v1/audio/transcriptions
- Uploads mic.wav and system.m4a in parallel
- Uses whisper-1 with verbose_json segment timestamps
- Handles missing key and API errors with readable messages
- Wired the view model to transcribe after recording stop in RecorderViewModel.swift:
- status = .transcribing during API calls
- Timestamped output format: [mm:ss] Mic/System: ...
- Falls back to grouped output when timestamps are unavailable
Already verified:
- xcodebuildmcp macos build --workspace-path "./Softwarola.xcworkspace" --scheme "Softwarola" succeeds.
- xcodebuildmcp swift-package test --package-path ./SoftwarolaPackage succeeds.
- App launches successfully.
To test transcription now:
- Ensure OPENAI_API_KEY is set in the shell before launching.
- Run:
xcodebuildmcp macos build-and-run --workspace-path "./Softwarola.xcworkspace" --scheme "Softwarola"
- Record a short sample and stop capture. Transcript pane should populate with Mic/System lines.
Natural next step:
- Implement Phase 3 summarization (OpenAI markdown summary from transcript).
Implemented — .env support is now in.
What changed:
- Added a lightweight parser in DotEnv.swift.
- Updated OpenAI key lookup in OpenAITranscriptionService.swift:
- first OPENAI_API_KEY from process env
- fallback to .env (OPENAI_API_KEY=...)
- Added .env.example.
How to use:
- Create .env in repo root:
OPENAI_API_KEY=sk-...
- Launch app normally (xcodebuildmcp ... build-and-run or Xcode run).
- It will pick up the key automatically from .env if not present in shell env.
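The lookup order described above (process environment first, .env as fallback) can be illustrated with a small parser. This is a sketch under assumptions, not the contents of DotEnv.swift: the function names are hypothetical, and only plain KEY=VALUE lines, # comments, and one pair of surrounding quotes are handled.

```swift
import Foundation

/// Parses KEY=VALUE lines; blank lines and lines starting with # are ignored.
func parseDotEnv(_ contents: String) -> [String: String] {
    var values: [String: String] = [:]
    for rawLine in contents.split(whereSeparator: \.isNewline) {
        let line = rawLine.trimmingCharacters(in: .whitespaces)
        guard !line.isEmpty, !line.hasPrefix("#"),
              let eq = line.firstIndex(of: "=") else { continue }
        let key = line[..<eq].trimmingCharacters(in: .whitespaces)
        var value = line[line.index(after: eq)...].trimmingCharacters(in: .whitespaces)
        // Strip one pair of matching surrounding quotes, if present.
        if value.count >= 2, let first = value.first, first == value.last,
           first == "\"" || first == "'" {
            value = String(value.dropFirst().dropLast())
        }
        if !key.isEmpty { values[key] = value }
    }
    return values
}

/// The process environment wins; the .env contents are only a fallback.
func resolveAPIKey(environment: [String: String], dotEnvContents: String?) -> String? {
    if let key = environment["OPENAI_API_KEY"], !key.isEmpty { return key }
    guard let contents = dotEnvContents else { return nil }
    let key = parseDotEnv(contents)["OPENAI_API_KEY"]
    return (key?.isEmpty ?? true) ? nil : key
}
```

In the app, environment would typically be ProcessInfo.processInfo.environment, and dotEnvContents the text of the .env file when it can be read.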
Validation:
- macOS build passes.
- Swift package tests pass.
I’ll add a minimal Makefile with a run target that builds and launches via xcodebuildmcp, plus a build helper target for convenience.
I added the Makefile; I’m quickly validating the targets by running make build now.
Added a simple Makefile with:
- make build → builds the app
- make run → builds and runs the app
File: Makefile
I also verified make build succeeds locally.