iPhone & Desktop Companion TTS/SST Talking App

A Mac-hosted Python voice assistant that streams WebRTC audio to an iPhone and uses a hybrid FSM workflow engine for complex research queries.

https://github.com/davidbmar/iphone-and-desktop-companion-TTS-SST-talking-app  ·  public  ·  shipped

What it is

A local-first voice agent that runs on macOS and connects to an iPhone via Safari. It uses Whisper for speech-to-text, routes queries through a keyword-based router to either a fast LLM path or a Finite State Machine (FSM) engine for deep research/comparison, and streams Piper TTS audio back to the phone via WebRTC.
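The routing step described above can be sketched as follows. This is a minimal illustration, not the project's actual router: the pattern table, the workflow names, and the `route` function are all hypothetical, chosen only to show how a keyword match might select a workflow and how everything else might fall through to the fast LLM path.

```python
import re

# Hypothetical keyword patterns; the project's real patterns are not shown here.
WORKFLOW_PATTERNS = {
    "research": re.compile(r"\b(research|look into|find out about)\b", re.IGNORECASE),
    "compare": re.compile(r"\b(compare|versus|vs\.?)\b", re.IGNORECASE),
    "fact_check": re.compile(r"\b(fact[- ]check|is it true)\b", re.IGNORECASE),
}


def route(transcript: str) -> str:
    """Return the first workflow whose keywords match, else 'fast_llm'."""
    for name, pattern in WORKFLOW_PATTERNS.items():
        if pattern.search(transcript):
            return name
    # No keyword matched: treat the utterance as a simple chat query.
    return "fast_llm"
```

A keyword table like this keeps routing cheap and predictable, since no LLM call is needed to decide which path a transcript takes. The trade-off is that phrasing outside the table always lands on the fast path.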

Features

Quickstart

pip install -r requirements.txt
python main.py

Architecture

flowchart TD
    subgraph Mac_Host["Mac Host"]
        Engine["Engine\nWorkflowRunner\nRouter\nOrchestrator"]
        Gateway["Gateway\naiohttp Server\nWebSocket\nRTCPeerConnection"]
        TTS["Piper TTS"]
        STT["Whisper STT"]
        LLM["LLM Provider\nOllama/Claude/OpenAI"]
    end
    subgraph iPhone["iPhone Safari"]
        UI["Voice Agent UI"]
        Mic["Microphone Input"]
        Speaker["Speaker Output"]
    end
    UI -->|Hold to Talk| Mic
    Mic -->|Audio Stream| Gateway
    Gateway -->|Signaling + Audio| Engine
    Engine -->|Query| LLM
    LLM -->|Response Text| Engine
    Engine -->|Text to Speak| TTS
    TTS -->|Audio Chunks| Gateway
    Gateway -->|WebRTC Audio Track| Speaker
    Engine -->|Transcribe| STT
    STT -->|Text| Engine

How it's built

The backend is Python. An aiohttp server handles HTTP and WebSocket signaling and hosts the RTCPeerConnection used for WebRTC. The core logic comprises a regex-based keyword router, an FSM executor for multi-step workflows (Research, Compare, Fact Check), and adapters for LLM providers (Ollama, Claude, OpenAI). Audio moves over WebRTC data channels and media tracks, with Whisper handling STT on incoming speech and Piper generating the TTS reply.
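To make the FSM executor idea concrete, here is a small sketch under stated assumptions. `Context`, `run_fsm`, the three state handlers, and `fake_llm` are hypothetical names invented for illustration; the project's real state definitions and LLM adapters are not reproduced here. Each state does one unit of work and names the next state, and the executor loops until a state returns `None`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Context:
    """Data carried from state to state during one workflow run."""
    query: str
    notes: Dict[str, str] = field(default_factory=dict)


# A handler does its work and returns the next state's name, or None to stop.
Handler = Callable[[Context], Optional[str]]


def run_fsm(states: Dict[str, Handler], start: str, ctx: Context,
            max_steps: int = 20) -> Context:
    """Run handlers until one returns None; guard against endless loops."""
    current: Optional[str] = start
    steps = 0
    while current is not None:
        if steps >= max_steps:
            raise RuntimeError("FSM did not terminate within max_steps")
        current = states[current](ctx)
        steps += 1
    return ctx


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM adapter call (assumption, for illustration)."""
    return f"[stub answer to: {prompt}]"


def plan(ctx: Context) -> Optional[str]:
    ctx.notes["plan"] = fake_llm(f"List sub-questions for: {ctx.query}")
    return "gather"


def gather(ctx: Context) -> Optional[str]:
    ctx.notes["findings"] = fake_llm(f"Answer: {ctx.notes['plan']}")
    return "summarize"


def summarize(ctx: Context) -> Optional[str]:
    ctx.notes["answer"] = fake_llm(f"Summarize: {ctx.notes['findings']}")
    return None  # terminal state


# Hypothetical three-state research workflow.
RESEARCH = {"plan": plan, "gather": gather, "summarize": summarize}
```

Calling `run_fsm(RESEARCH, "plan", Context(query="heat pumps"))` walks the three states in order and leaves the final text in `ctx.notes["answer"]`. The step cap is a design choice for safety: a mis-wired transition fails loudly instead of spinning forever.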

How it runs

sequenceDiagram
    participant User as iPhone User
    participant Browser as Safari Browser
    participant Gateway as Mac Gateway
    participant Engine as Workflow Engine
    participant LLM as LLM Provider
    participant TTS as Piper TTS
    
    User->>Browser: Hold button & Speak
    Browser->>Gateway: Send Audio via WebRTC
    Gateway->>Engine: Forward Audio Data
    Engine->>Engine: Whisper STT
    Engine->>Engine: Keyword Router
    alt Complex Query
        Engine->>Engine: Execute FSM State
        Engine->>LLM: Request Reasoning/Search
        LLM-->>Engine: Return Structured Data
        Engine->>Engine: Next FSM State
    else Simple Query
        Engine->>LLM: Direct Chat Completion
        LLM-->>Engine: Return Response Text
    end
    Engine->>TTS: Generate Audio from Text
    TTS-->>Engine: Audio Chunks
    Engine->>Gateway: Stream Audio Chunks
    Gateway->>Browser: WebRTC Audio Track
    Browser->>User: Play Response
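The simple-query path of the sequence above can be approximated with a short asyncio sketch. Everything here is a stand-in: `transcribe` and `synthesize` only mimic the shape of the Whisper and Piper steps, and `LLMAdapter`, `EchoLLM`, and `handle_turn` are hypothetical names, not the project's API. The WebRTC transport and the FSM branch are left out to keep the flow readable.

```python
import asyncio
from typing import AsyncIterator, List, Protocol


class LLMAdapter(Protocol):
    """Minimal interface that each provider adapter is assumed to expose."""
    async def complete(self, prompt: str) -> str: ...


class EchoLLM:
    """Toy adapter used in place of Ollama, Claude, or OpenAI."""
    async def complete(self, prompt: str) -> str:
        return f"You said: {prompt}"


async def transcribe(audio: bytes) -> str:
    # Stand-in for Whisper: pretend the audio bytes are already UTF-8 text.
    return audio.decode("utf-8")


async def synthesize(text: str, chunk_size: int = 8) -> AsyncIterator[bytes]:
    # Stand-in for Piper: yield the reply in small chunks, as a streamer would.
    data = text.encode("utf-8")
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]


async def handle_turn(audio: bytes, llm: LLMAdapter) -> List[bytes]:
    """One turn: speech in, text through the LLM, audio chunks out."""
    text = await transcribe(audio)
    reply = await llm.complete(text)
    return [chunk async for chunk in synthesize(reply)]
```

In a real gateway the chunks would presumably be pushed to the audio track as they are produced instead of being collected into a list; the list is used here only so the result is easy to inspect with `asyncio.run(handle_turn(b"hello", EchoLLM()))`.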

How to apply & reuse

Ideal for developers building local AI assistants who need structured reasoning (FSMs) rather than just chat, and want low-latency audio streaming to mobile devices without native app store deployment.

At a glance

Capabilities: Speech-to-Text, Text-to-Speech, Web Search Integration, Finite State Machine Execution, Real-time Audio Streaming, Multi-LLM Support
Components: Workflow Engine, Keyword Router, WebRTC Gateway, Input Filter, Conversation History, Fast Path Intent Matcher
Tech: Python, aiohttp, WebRTC, Whisper, Piper TTS, Mermaid
Depends on: Ollama, Anthropic API, OpenAI API, Node.js (for build tools if any), Safari Browser
Integrates with: Ollama, Claude, OpenAI, Web Search Tools
Patterns: Finite State Machine, Router Pattern, Adapter Pattern, Observer Pattern (WebRTC), Chain of Responsibility
Reuse tags: voice-assistant, webrtc-audio, fsm-engine, local-llm, python-backend, mobile-companion
