nano-claw voice loop

A local, voice-powered AI agent that you talk to through your browser, with Metal-accelerated speech recognition and tool execution.

https://github.com/davidbmar/2026-nano-claw-voice-loop-tts-stt  ·  public  ·  shipped


What it is

nano-claw is a personal AI assistant designed to run locally. It forms a continuous voice loop: you speak to the browser, Whisper transcribes the audio (running natively on the Mac for GPU speed), Claude processes the text through its API, and Kokoro TTS speaks the reply. Tool approval is interactive: the AI requests permission before it executes shell commands or file operations on the host machine.
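
One turn of that loop can be pictured as three stages chained together. The sketch below is only an illustration, not the project's code: `voice_turn` and the three callable types are hypothetical names, and the real stages are separate services rather than in-process functions.

```python
from typing import Callable

# Hypothetical stage signatures; in the project these are separate services.
Transcribe = Callable[[bytes], str]   # audio bytes -> text (Whisper)
Respond = Callable[[str], str]        # user text -> reply text (Claude)
Synthesize = Callable[[str], bytes]   # reply text -> audio bytes (Kokoro TTS)


def voice_turn(
    audio: bytes,
    transcribe: Transcribe,
    respond: Respond,
    synthesize: Synthesize,
) -> bytes:
    """Run one turn of the loop: speech in, speech out."""
    text = transcribe(audio)
    if not text.strip():
        # Nothing intelligible was said, so skip the LLM call entirely.
        return b""
    reply = respond(text)
    return synthesize(reply)


# Stub stages stand in for the real services.
audio_out = voice_turn(
    b"fake-audio",
    transcribe=lambda audio: "hello",
    respond=lambda text: text.upper(),
    synthesize=lambda reply: reply.encode("utf-8"),
)
```

Passing the stages in as callables keeps the loop testable without a microphone, an API key, or a TTS model.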

Features

Quickstart

git clone https://github.com/davidbmar/2026-nano-claw-voice-loop-tts-stt.git
cd 2026-nano-claw-voice-loop-tts-stt
export ANTHROPIC_API_KEY=sk-ant-...
./run.sh

Architecture

flowchart TD
    User[User Browser] <-->|WebRTC Audio / WebSocket| VS[Voice Server: Python/Docker]
    VS <-->|HTTP POST /transcribe| STT[STT Service: faster-whisper/Native Mac]
    VS <-->|HTTP API| API[nano-claw API: TypeScript/Docker]
    API <-->|LLM Request| Claude[Anthropic Claude API]
    API -->|Execute| Tools[Local Tools: Shell/File]
    subgraph Docker Container
        VS
        API
    end
    subgraph Host Mac
        STT
        Tools
    end

How it's built

The system uses a hybrid architecture to work around Docker's lack of GPU access on macOS. A native Python service runs `faster-whisper` on port 8200 using Apple Metal acceleration. The core logic runs inside a Docker container: the TypeScript API server and the Python-based Voice Server, which handles WebRTC and Kokoro TTS. The browser client streams audio to the Voice Server over WebRTC and uses a WebSocket connection for UI updates.
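
As a rough sketch of how the containerized Voice Server might call the native STT service, the example below builds a POST request to `/transcribe` on port 8200. Several details are assumptions rather than facts about this project: the `host.docker.internal` hostname (which Docker Desktop provides for reaching the host), the raw-bytes request body, the `{"text": ...}` response shape, and the function names.

```python
import json
import urllib.request

# Assumption: the container reaches the Mac host as host.docker.internal.
# Port 8200 and the /transcribe path come from the description above.
STT_URL = "http://host.docker.internal:8200/transcribe"


def build_transcribe_request(audio: bytes, url: str = STT_URL) -> urllib.request.Request:
    """Build a POST request carrying raw audio bytes (assumed body format)."""
    return urllib.request.Request(
        url,
        data=audio,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def transcribe(audio: bytes, url: str = STT_URL, timeout: float = 30.0) -> str:
    """Send audio to the STT service and return the transcript.

    Assumes a JSON reply shaped like {"text": "..."}; adjust to the real schema.
    """
    request = build_transcribe_request(audio, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["text"]
```

Keeping request construction separate from the network call makes the boundary between the container and the host easy to inspect and to swap out if the real endpoint expects multipart form data instead.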

How it runs

sequenceDiagram
    participant U as User
    participant B as Browser Client
    participant V as Voice Server (Docker)
    participant S as STT Service (Native)
    participant A as Agent API (Docker)
    participant L as Claude API
    
    U->>B: Hold button & Speak
    B->>V: Stream Audio (WebRTC)
    U->>B: Release button
    V->>S: POST audio bytes
    S-->>V: Transcribed Text
    V->>A: POST text message
    A->>L: Send prompt + context
    L-->>A: Response + Tool Calls?
    alt Tool Call Required
        A-->>V: Pending tool request
        V-->>B: Show approval card
        B->>U: Display approval
        U->>B: Approve/Reject
        B->>V: Send decision
        V->>A: Confirm execution
        A->>A: Execute Tool
        A->>L: Send tool result
        L-->>A: Final response
    end
    A-->>V: Final text response
    V->>V: Generate Speech (Kokoro TTS)
    V->>B: Stream Audio Response
    B->>U: Play audio
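
The approval branch in the diagram above amounts to a gate in front of tool execution. The following is a minimal sketch under stated assumptions: `ToolCall`, `run_with_approval`, and the callback signatures are hypothetical, and the rejection message returned to the model is an illustrative choice, since the diagram only shows the approved path.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    """A tool request from the model (hypothetical shape)."""
    name: str
    arguments: dict


def run_with_approval(
    call: ToolCall,
    ask_user: Callable[[ToolCall], bool],
    execute: Callable[[ToolCall], str],
) -> str:
    """Execute the tool only if the user approves.

    The returned string is what would be sent back to the model
    as the tool result, whether the call ran or was rejected.
    """
    if not ask_user(call):
        # The tool never runs; the model is told why.
        return f"User rejected tool call: {call.name}"
    return execute(call)


# Stubs stand in for the approval card and the real tool runner.
call = ToolCall(name="shell", arguments={"command": "ls"})
result = run_with_approval(
    call,
    ask_user=lambda c: False,
    execute=lambda c: "should not run",
)
```

Because the execute callback is only reached after `ask_user` returns true, a rejected command has no side effects on the host.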

How to apply & reuse

Use this project as a foundation for building private, voice-first AI agents that require low-latency speech-to-text on Apple Silicon. It demonstrates how to bridge native high-performance ML services with containerized application logic and secure tool execution patterns.

At a glance

Capabilities: Voice-to-Voice Interaction, Local Tool Execution, Contextual Memory, Skill Extension, Real-time Transcription
Components: Browser Client, Voice Server, Agent API, STT Service, Tool Registry, Memory Store
Tech: TypeScript, Python, Docker, WebRTC, Whisper, Kokoro TTS, Claude API
Depends on: Docker Desktop, Python 3.10+, Anthropic API Key, macOS (for Metal STT)
Integrates with: Anthropic Claude, Local File System, Shell Environment
Patterns: Voice Loop, Human-in-the-Loop Approval, Hybrid Native/Container Architecture, Skill-based Prompting
Reuse tags: ai-agent, voice-interface, local-llm, typescript, python, docker, webrtc, whisper, tool-use
