parlor-on-device-ai

On-device, real-time multimodal AI voice and vision assistant powered by Gemma 4 E2B and Kokoro TTS, running entirely locally via FastAPI WebSocket server.

INSTALLATION
npx skills add https://github.com/aradotso/trending-skills --skill parlor-on-device-ai
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Parlor On-Device AI

Skill by ara.so — Daily 2026 Skills collection.

Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally — no API keys, no cloud calls, no cost per request.

Architecture

Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)

Key features:

  • Silero VAD in browser — hands-free, no push-to-talk
  • Barge-in — interrupt AI mid-sentence by speaking
  • Sentence-level TTS streaming — audio starts before full response is ready
  • Platform-aware TTS — MLX backend on Apple Silicon, ONNX on Linux
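
Barge-in can be pictured as cancelling the in-flight response task whenever a new utterance arrives. The following is a minimal sketch of that idea using asyncio, not Parlor's actual implementation; `TurnManager`, `start_turn`, and `demo` are hypothetical names introduced for illustration.

```python
import asyncio


class TurnManager:
    """Tracks the in-flight response so a new utterance can cancel it."""

    def __init__(self) -> None:
        self._current: asyncio.Task | None = None

    async def start_turn(self, respond) -> asyncio.Task:
        # Barge-in: cancel whatever the assistant is still saying.
        if self._current is not None and not self._current.done():
            self._current.cancel()
            try:
                await self._current
            except asyncio.CancelledError:
                pass
        self._current = asyncio.create_task(respond())
        return self._current


async def demo() -> list[str]:
    spoken: list[str] = []

    async def long_reply():
        for sentence in ["One.", "Two.", "Three."]:
            spoken.append(sentence)
            await asyncio.sleep(0.05)  # stands in for TTS + playback time

    async def short_reply():
        spoken.append("Interrupted reply.")

    turns = TurnManager()
    await turns.start_turn(long_reply)
    await asyncio.sleep(0.01)  # the user starts speaking mid-reply
    task = await turns.start_turn(short_reply)
    await task
    return spoken
```

Running `asyncio.run(demo())` shows the first reply stopping after its first sentence once the second turn begins. A real server would also need to tell the browser to stop playing audio it has already buffered.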

Requirements

  • Python 3.12+
  • macOS with Apple Silicon or Linux with a supported GPU
  • ~3 GB free RAM
  • uv package manager

Installation

git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py

Open http://localhost:8000, grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Configuration

Set environment variables before running:

# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py

Variable     Default                          Description
MODEL_PATH   auto-download from HuggingFace   Path to local .litertlm model file
PORT         8000                             Server port

Project Structure

src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison

Key Components

server.py — FastAPI WebSocket Server

The server accepts a WebSocket connection at /ws and uses it in both directions: it receives audio and camera frames from the browser and streams synthesized audio back.

# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket
import asyncio

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)

tts.py — Platform-Aware TTS

Kokoro TTS selects backend based on platform:

# tts.py uses platform detection
import platform

def get_tts_backend():
    if platform.system() == "Darwin":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')

tts = get_tts_backend()

# Sentence-level streaming: yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio
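
The pattern above calls `split_sentences` without defining it. One plausible version is sketched here, together with a variant that emits sentences while tokens are still arriving. Both are assumptions for illustration (the regex is deliberately naive, and `stream_sentences` is a hypothetical helper), not the splitter Parlor ships.

```python
import re

# Split after ., !, or ? when followed by whitespace. Deliberately naive:
# abbreviations such as "e.g." or "Dr." will be split too.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def stream_sentences(token_stream):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()
```

Feeding `stream_sentences` from the model's token stream lets TTS begin on the first sentence while later sentences are still being generated, which is the point of sentence-level streaming.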

Gemma 4 E2B Inference via LiteRT-LM

# LiteRT-LM inference pattern
from litert_lm import LiteRTLM
import os

# Auto-downloads if MODEL_PATH is not set
model_path = os.environ.get("MODEL_PATH")

model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes | None = None):
    # The camera frame is optional; audio is always sent
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg

    # Accumulate streamed tokens into the full response text
    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response

Running Benchmarks

cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py

Performance Reference (Apple M3 Pro)

Stage                              Time
Speech + vision understanding      ~1.8–2.2s
Response generation (~25 tokens)   ~0.3s
Text-to-speech (1–3 sentences)     ~0.3–0.7s
Total end-to-end                   ~2.5–3.0s

Decode speed: ~83 tokens/sec on GPU.
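
As a quick sanity check on these figures, the generation time follows from the decode speed, and the per-stage ranges can be summed. The sketch below only re-derives numbers from the table; the variable names are illustrative.

```python
decode_tokens_per_sec = 83
response_tokens = 25

# Time to decode a ~25-token reply at the reported speed.
generation_s = response_tokens / decode_tokens_per_sec

# (low, high) estimates per stage, in seconds, from the performance table.
stages = {
    "understanding": (1.8, 2.2),
    "generation": (0.3, 0.3),
    "tts": (0.3, 0.7),
}
total_low = sum(low for low, _ in stages.values())
total_high = sum(high for _, high in stages.values())

print(f"generation: {generation_s:.2f}s")
print(f"summed stages: {total_low:.1f} to {total_high:.1f}s")
```

The decode estimate comes out at about 0.30 s, matching the table, and the summed stages span roughly 2.4 to 3.2 s, which brackets the reported ~2.5–3.0 s end-to-end figure.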

Common Patterns

Extending the System Prompt

Modify the prompt in server.py to change the AI's persona or task:

SYSTEM_PROMPT = """You are a helpful language tutor.
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""

Adding a New Language for TTS

Kokoro supports multiple language codes. Set lang_code in tts.py:

# Language codes: 'a' = American English, 'b' = British English
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish

Customizing VAD Sensitivity (index.html)

The Silero VAD threshold can be tuned in the frontend:

// In index.html: lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // speech probability needed to start a segment; lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // probability below which a frame counts as non-speech
  minSpeechFrames: 3,
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});
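
Browser VAD libraries typically hand `onSpeechEnd` a Float32Array of samples in the range -1.0 to 1.0, whereas the architecture above describes PCM audio on the wire. If float samples reach the server unchanged, they can be converted to 16-bit PCM as sketched below. How Parlor actually encodes audio is not stated here, so the little-endian float32 input and the `float32_to_pcm16` name are assumptions.

```python
import struct


def float32_to_pcm16(raw: bytes) -> bytes:
    """Convert little-endian float32 samples to little-endian int16 PCM."""
    count = len(raw) // 4  # ignore any trailing partial sample
    samples = struct.unpack(f"<{count}f", raw[: count * 4])
    # Clip to [-1.0, 1.0] so loud samples saturate instead of overflowing.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{count}h", *(int(round(s * 32767)) for s in clipped))
```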

Sending Frames Programmatically (WebSocket Client Example)

import asyncio
import websockets
import json
import base64

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes | None = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        # Binary fields are base64-encoded so they fit in a JSON message
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()
        await ws.send(json.dumps(payload))

        # Receive streamed audio response until the server closes the connection
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk
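
To save the received chunks rather than play them, they can be wrapped in a WAV container with the standard library. This sketch assumes the chunks are 16-bit mono PCM at 24 kHz; the source does not state the sample format, so adjust the parameters to match what the server actually sends. `pcm_chunks_to_wav` is a hypothetical helper.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate=24_000, channels=1, sample_width=2) -> bytes:
    """Concatenate raw PCM chunks into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample; 2 = 16-bit
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()
```

Collect each `audio_chunk` into a list inside the receive loop, then write the returned bytes to a `.wav` file once the connection closes.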

Troubleshooting

Model download fails

# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"

export MODEL_PATH=/path/shown/above
uv run server.py

Microphone/camera not working in browser

  • Open the page at http://localhost (or serve it over HTTPS). Browsers expose camera and microphone APIs only in secure contexts, so plain HTTP on a LAN IP address is blocked
  • Check browser permissions: address bar → lock icon → reset permissions
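
The browser rule can be approximated in a few lines: HTTPS origins and loopback hosts count as secure contexts, while other plain-HTTP origins do not. The function below is a rough sketch of that rule for checking a URL before sharing it; `allows_media_capture` is a hypothetical name, and the browser remains the final authority.

```python
import ipaddress
from urllib.parse import urlsplit


def allows_media_capture(url: str) -> bool:
    """Approximate whether a browser would treat this URL as a secure context."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    host = (parts.hostname or "").lower()
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        # Loopback addresses (127.0.0.0/8, ::1) are treated as trustworthy.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False
```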

TTS not loading on Linux

# Ensure ONNX runtime is installed
uv add onnxruntime

# Or for GPU:
uv add onnxruntime-gpu

High latency or slow inference

  • Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
  • Close other GPU-heavy applications
  • On Linux, confirm CUDA drivers match installed onnxruntime-gpu version

Port already in use

export PORT=8080
uv run server.py

# Or kill the existing process:
lsof -ti:8000 | xargs kill
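
Where lsof is unavailable, a port can be probed from Python before starting the server. This is a small standard-library sketch; `port_in_use` is a hypothetical helper and only reports whether something accepts TCP connections on the port.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8000)` returning True suggests choosing a different PORT value.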

uv sync fails — Python version mismatch

# Parlor requires Python 3.12+
python3 --version

# Install 3.12 via pyenv or system package manager, then:
uv python pin 3.12
uv sync

Dependencies (pyproject.toml)

Key packages installed by uv sync:

  • litert-lm — Google AI Edge inference runtime for Gemma
  • fastapi + uvicorn — async web/WebSocket server
  • kokoro — Kokoro TTS ONNX backend
  • kokoro-mlx — Kokoro TTS MLX backend (Mac only)
  • silero-vad — voice activity detection (runs browser-side and is loaded via CDN, so it is not part of the Python environment)
  • huggingface-hub — model auto-download