Voice agent platform that generates deterministic FSM-based voice flows from plain English business descriptions.
https://github.com/davidbmar/riff · private · shipped
riff is a Python-based voice agent framework that combines Large Language Models (LLMs) for natural language understanding with declared Finite State Machines (FSMs) for strict workflow enforcement. Developers describe a business process in plain English, and riff generates a YAML-defined state graph that drives the phone conversation (ordering, scheduling, intake). To keep behavior predictable, all state transitions, slot validation, and guardrails run as deterministic pure functions; the LLM handles only language generation and intent recognition within those bounds.
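To make the "deterministic pure function" idea concrete, here is a minimal sketch of a transition lookup over a declared graph. The flow shape (state mapped to intent mapped to next state), the state names, and the `next_state` helper are all assumptions for illustration, not riff's actual schema or API.

```python
# Hypothetical flow shape: state -> {intent: next_state}.
# This is an illustration, not riff's real YAML schema.
FLOW = {
    "greeting": {"order_pizza": "collect_size"},
    "collect_size": {"give_size": "confirm", "cancel": "goodbye"},
    "confirm": {"yes": "goodbye", "no": "collect_size"},
    "goodbye": {},
}


def next_state(flow: dict[str, dict[str, str]], state: str, intent: str) -> str:
    """Pure function: the same (flow, state, intent) always gives the same result.

    An intent that is not declared for the current state leaves the state
    unchanged, so an intent proposed by the LLM cannot move the conversation
    to a state the graph does not allow.
    """
    return flow[state].get(intent, state)
```

Because the function touches no external state, a transition can be unit-tested without an LLM or any audio in the loop.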
cd ~/src/riff
.venv/bin/python3 -m pytest tests/ -q
.venv/bin/python3 -m riff.web_server
flowchart TD
Caller[Caller] -->|Speaks| STT[STT]
STT -->|Text| RunTurn[run_turn]
subgraph CorePipeline [Core Pipeline]
RunTurn -->|Context| Guardrails[Guardrails<br/>8 layers pure fns]
RunTurn -->|Prompt| LLM[LLM Call<br/>Gemini/Gemma/Claude]
RunTurn -->|State| StateMgr[State Manager<br/>Declared Graph]
end
Guardrails -->|Validated| RunTurn
LLM -->|Response| RunTurn
StateMgr -->|Next State| RunTurn
RunTurn -->|Result| SlotExt[slot_extractor]
RunTurn -->|Result| Eval[eval framework]
RunTurn -->|Result| Logger[turn_logger JSONL]
RunTurn -->|Audio| TTS[TTS]
TTS -->|Hears| Caller
The core engine uses a `StateManager` to enforce the transitions defined in YAML flow files. Caller speech is converted to text (STT) and passed to `run_turn()`, which consults the LLM (via adapters for Gemini, Gemma, or Claude) and the deterministic guardrails; the response text is then converted to speech (TTS). Key components include `slot_extractor` for deterministic data capture, `state_manager` for graph traversal, and an event bus for internal signaling. The architecture isolates the non-deterministic LLM calls, so slot handling and transitions stay behind pure-function interfaces.
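As an illustration of deterministic data capture, the sketch below pulls two slots out of an utterance with regular expressions. The slot names, patterns, and the `extract_slots` function are hypothetical; the real `slot_extractor` may work quite differently.

```python
import re

# Hypothetical slot patterns for a pizza order; not taken from riff.
SIZE_PATTERN = re.compile(r"\b(small|medium|large)\b", re.IGNORECASE)
QUANTITY_PATTERN = re.compile(r"\b(\d+|one|two|three)\b", re.IGNORECASE)
QUANTITY_WORDS = {"one": 1, "two": 2, "three": 3}


def extract_slots(utterance: str) -> dict[str, object]:
    """Return the slots found in the utterance; no LLM call is involved."""
    slots: dict[str, object] = {}
    size = SIZE_PATTERN.search(utterance)
    if size:
        slots["size"] = size.group(1).lower()
    quantity = QUANTITY_PATTERN.search(utterance)
    if quantity:
        token = quantity.group(1).lower()
        # Digits are converted directly; number words go through the lookup table.
        slots["quantity"] = int(token) if token.isdigit() else QUANTITY_WORDS[token]
    return slots
```

The same transcript always yields the same slots, which is the property that lets slot validation sit outside the LLM.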
sequenceDiagram
participant C as Caller
participant S as STT
participant RT as run_turn()
participant SM as StateManager
participant L as LLM Adapter
participant T as TTS
C->>S: Speaks audio
S->>RT: Transcribed text
RT->>SM: Get current state & valid transitions
SM-->>RT: State definition & guards
RT->>L: Generate response based on state/context
L-->>RT: Raw text response
RT->>SM: Validate transition & extract slots
SM-->>RT: Updated state & validated slots
RT->>T: Synthesize audio response
T->>C: Plays audio
RT->>RT: Log turn result (JSONL)
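The sequence above can be sketched as a single function, with STT and TTS left out (text in, text out). Everything here is an assumption made for illustration: the flow shape, the `run_turn_sketch` name, the adapter signature, and the log fields are not riff's actual interfaces.

```python
import io
import json
from typing import Callable, TextIO

# Hypothetical flow: each state declares the intents it accepts.
FLOW = {
    "ask_time": {"transitions": {"give_time": "confirm"}},
    "confirm": {"transitions": {"yes": "done"}},
    "done": {"transitions": {}},
}

# Assumed adapter signature: (state, text, allowed_intents) -> (reply, intent).
LLMFn = Callable[[str, str, list[str]], tuple[str, str]]


def run_turn_sketch(state: str, text: str, llm: LLMFn, log: TextIO) -> tuple[str, str]:
    allowed = FLOW[state]["transitions"]
    # 1. Ask the LLM for a reply and an intent, given the declared options.
    reply, intent = llm(state, text, sorted(allowed))
    # 2. Deterministic check: an undeclared intent keeps the current state.
    new_state = allowed.get(intent, state)
    # 3. Append one JSON object per turn (JSONL).
    record = {"state": state, "text": text, "intent": intent, "next": new_state}
    log.write(json.dumps(record) + "\n")
    return reply, new_state


def fake_llm(state: str, text: str, intents: list[str]) -> tuple[str, str]:
    """Stand-in for a real adapter so the sketch runs offline."""
    if "tuesday" in text.lower():
        return "Tuesday works. Shall I book it?", "give_time"
    # Deliberately returns an intent the graph does not declare.
    return "Sorry, when would you like to come in?", "jump_to_done"
```

Passing an `io.StringIO()` as `log` keeps the example in memory; a file opened in append mode would give a persistent JSONL log.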
Use riff to build robust customer service voice agents where hallucination prevention is critical. Ideal for scheduling services (plumbing, dental), retail orders (pizza, coffee), or complex intakes (apartment viewings). Developers define business logic via YAML or generate it via the `/api/flows/generate` endpoint, then integrate the web server or MCP server into their telephony infrastructure.
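Whether a flow is hand-written or generated, it can be useful to check that the graph is well-formed before wiring it to telephony. The following is a sketch under stated assumptions: the flow is already parsed into a dict of state to intent to target (YAML parsing is not shown), and `validate_flow` is a hypothetical helper, not part of riff.

```python
def validate_flow(flow: dict[str, dict[str, str]], start: str) -> list[str]:
    """Return a list of problems; an empty list means the graph looks well-formed."""
    if start not in flow:
        return [f"start state {start!r} is not declared"]

    problems = []
    # Every transition must point at a declared state.
    for state, transitions in flow.items():
        for intent, target in transitions.items():
            if target not in flow:
                problems.append(f"{state} --{intent}--> {target}: undeclared target")

    # Depth-first walk from the start state to find unreachable states.
    seen: set[str] = set()
    stack = [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        stack.extend(t for t in flow[state].values() if t in flow)

    for state in flow:
        if state not in seen:
            problems.append(f"{state}: unreachable from {start}")
    return problems
```

A check like this could run in the test suite so that a malformed flow fails before it ever handles a call.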