iphone-webrtc-TURN-speaker-streaming-machost-iphonebrowser

What it is

A real-time voice agent system that runs on a Mac host and streams audio responses to an iPhone browser. It captures microphone input via WebRTC, transcribes it using Whisper, processes it with an LLM (Ollama, Claude, or OpenAI), synthesizes speech with Piper TTS, and streams the audio back to the iPhone speaker. It includes robust NAT traversal using TURN servers for cellular connectivity.

Features

Real-time voice agent loop: Mic -> STT -> LLM -> TTS -> Speaker
Supports local Ollama, Anthropic Claude, and OpenAI GPT models
WebRTC audio streaming with TURN relay for cellular/NAT traversal
Hold-to-talk UI with live transcription and chat history
Self-signed HTTPS support for LAN access and Cloudflare Tunnel for remote access

Quickstart

pip install -r requirements.txt
cp .env.example .env
python3 -m gateway.server
open http://localhost:8080

Architecture

flowchart TD
    subgraph Mac_Host["Mac Host"]
        Engine["Engine\n(Whisper/LLM/Piper)"]
        Gateway["Gateway\n(aiohttp :8080)\nWebSocket + RTCPeerConnection"]
        Engine -->|Audio/Data| Gateway
    end
    subgraph iPhone["iPhone Browser"]
        App["web/app.js\nRTCPeerConnection\ngetUserMedia"]
    end
    Gateway <-->|WebRTC UDP\n(via TURN)| App
    Gateway <-->|WebSocket Signaling| App
    TURN["TURN Server\n(Twilio)"]
    App <-->|Relay| TURN
    TURN <-->|Relay| Gateway

How it's built

The backend is a Python aiohttp server acting as a signaling gateway and media engine. It uses faster-whisper for STT, various LLM APIs for reasoning, and Piper for TTS. The frontend is a TypeScript/JavaScript web app that handles WebRTC peer connections, microphone access, and UI state. Communication between client and server uses WebSocket for signaling and WebRTC data/audio channels for media.

How it runs

sequenceDiagram
    participant Client as iPhone Browser
    participant Server as Mac Gateway
    participant Engine as Voice Engine
    participant LLM as LLM Provider
    
    Client->>Server: WebSocket hello {token}
    Server-->>Client: hello_ack {voices, ice_servers}
    
    Client->>Server: webrtc_offer {sdp}
    Server->>Server: Create RTCPeerConnection
    Server-->>Client: webrtc_answer {sdp}
    
    Note over Client,Server: Media Stream Established
    
    Client->>Server: mic_start (Audio Track)
    Server->>Engine: Buffer Audio
    Engine->>Engine: Whisper STT
    Engine->>LLM: Transcribed Text
    LLM-->>Engine: Reply Text
    Engine->>Engine: Piper TTS
    Engine->>Server: Audio Chunks
    Server->>Client: WebRTC Audio Track
    Client->>Client: Play Speaker

How to apply & reuse

Use this project to build low-latency voice assistants that work on mobile browsers without native apps. It serves as a reference for handling WebRTC audio streaming, TURN relay configuration for cellular networks, and integrating local LLMs like Ollama with real-time speech pipelines.

At a glance

CapabilitiesSpeech-to-TextText-to-SpeechLLM IntegrationWebRTC StreamingNAT Traversal

Componentsaiohttp GatewayWhisper STTPiper TTSLLM AdapterWeb Frontend

TechPythonTypeScriptWebRTCWebSocketaiohttpfaster-whisperPiper

Depends onOllamaTwilio TURNCloudflaredNode.jsPython 3.9+

Integrates withAnthropic ClaudeOpenAI GPTOllamaTavily SearchBrave Search

PatternsSignaling ServerMedia RelayVoice Agent LoopEdge AI

Reuse tagsvoice-assistantwebrtc-audiolocal-llmmobile-browserturn-server

⚠ Needs attention

unmerged_branch: dependabot/npm_and_yarn/web-app/npm_and_yarn-d1f9cb5775 is 1 commit ahead of the default branch
open_pr: PR #1: Bump the npm_and_yarn group across 1 directory with 5 updates