Disclaimer: This is a demo/sample application built with Azure AI services for educational purposes. It is not affiliated with, endorsed by, or sponsored by McDonald's Corporation.
McDonald's AI Drive-Thru is a McDonald's–themed, voice-driven ordering experience that showcases Microsoft best practices for Azure OpenAI GPT-4o Realtime, Azure AI Search, and Azure Container Apps. The experience emulates a McDonald's crew member who can search the official menu, hold multilingual conversations, and keep orders in sync across devices. The app also supports a Local Mode powered by Microsoft's Phi-4-mini-instruct model for fully offline AI drive-thru experiences without cloud connectivity.
As guests speak, real-time transcription, translation, and order management provide a transparent view of every choice, from shakes and fries to burgers and McNuggets. The UI applies McDonald's vibrant design language so stakeholders can picture how voice AI augments drive-thru, crew member, and kiosk flows.
Beyond the drive-thru experience, this sample demonstrates how Microsoft’s Responsible AI guidance plus Azure-first tooling enable inclusive, hands-free interactions for franchise teams, accessibility scenarios, and mixed fleet deployments across the McDonald's restaurant network.
This project extends the VoiceRAG Repository, adapting its Microsoft-first architecture for a McDonald's Drive-Thru scenario. Review the original pattern in this blog post. For the upstream README, see voice_rag_README.md.
Special thanks to John Carroll for the original coffee-chat-voice-assistant that inspired this sample. This fork updates to the latest OpenAI models and adds a McDonald's Drive-Thru twist to the solution.
- Azure OpenAI GPT-4o Realtime API: Voice-to-voice ordering powered by gpt-realtime-1.5 with optimized system prompt (bulleted format, ALL CAPS emphasis, variety rules to prevent robotic repetition).
- McDonald's crew member personality: Upbeat, friendly, branded — Nova voice (warm, friendly female) embodies the McDonald's crew member persona. Phrase variety rules prevent bot-like repetition ("Awesome choice!", "You got it!", "Great pick!", "Coming right up!").
- Natural turn-taking: Server VAD tuning (threshold 0.7, prefix padding 300ms, silence duration 500ms) for seamless back-and-forth conversations.
- Spoken currency: "Four dollars and nineteen cents" instead of "$4.19" — more natural, more McDonald's.
- Temperature 0.6: Optimized balance of deterministic tool calling and natural conversational variance (Azure OpenAI Realtime API minimum).
- Active listening: Conversational acknowledgments confirm each guest request ("No tartar sauce, you got it!").
- Anti-self-talk: AI NEVER speaks unless the guest has spoken first — imperative greeting prompt prevents startup meta-commentary.
- Echo suppression (defense-in-depth): Server-side audio gating in `rtmt.py` (`ai_speaking` flag + 1.5s cooldown + delayed `input_audio_buffer.clear`) plus frontend mic muting via a gain node at `response.created`. Extended 3.0s cooldown after greeting audio.
- Barge-in detection: An `AnalyserNode` on the raw mic stream monitors RMS energy in real time. When the guest interrupts, `response.cancel` stops the AI mid-sentence and `speech_started` overrides echo suppression — natural conversation flow preserved.
- Anti-feedback loop: Multi-layered approach — VAD threshold 0.7, silence duration 500ms, auto gain control disabled, recorder worklet isolation via a gain node, mic muting during AI playback.
- Tool-calling orchestration: Four tools drive the ordering flow — `search` (menu lookup), `update_order` (add/remove items), `get_order` (retrieve current order), `reset_order` (clear the ticket).
- Combo validation with SYSTEM HINT: `get_combo_requirements()` deterministically tracks missing sides and drinks, injecting `[SYSTEM HINT]` into tool results to guide the AI without relying on LLM memory.
- Combo pivot absorption: When a combo is added, standalone sides and drinks already on the ticket are automatically absorbed into the combo — no duplicate asks. Multi-quantity items are decremented rather than fully removed.
- Combo conversion & upselling: AI asks "Want to make that a combo with fries and a drink?" for solo burgers/sandwiches. Fries-first branding (McDonald's World Famous Fries always suggested as the go-to side). McDonald's signature treat suggestions when the order has no dessert.
- Item customizations: Guests can request modifications like "no lettuce", "extra ketchup", or "plain." Mods are parsed, displayed on the order ticket, and read back naturally ("with no lettuce, extra ketchup").
- Invalid mod rejection: Nonsensical modifications are caught and redirected with friendly crew member humor — mustard on a shake, cheese on a shake, or whipped cream on a burger get a warm redirect ("That's a new one! Want to try a different topping?").
- Quantity limits: Max 10 per item, 25 total with friendly crew member-style responses ("Whoa, that's a lot of fries!").
- Happy Hour dynamic pricing: Drinks and shakes are 50% off from 2:00–4:00 PM local time. Original prices preserved; discounts applied at summary level. AI gets excited about the deal in context.
- OOS machine status: Ice cream machine down → McFlurry/shake/sundae items flagged `[OOS]` in search results with alternative suggestions. Non-blocking — items are still returned, just flagged. Module-level toggle for demo use.
- Size normalization: Various shorthand size references normalize to standard McDonald's sizing in the order.
- Mandatory total re-read: After any order change, the AI re-reads the complete order total so the guest always knows where they stand.
- Grouped readback: "Two Medium Coca-Colas and one McNuggets" instead of listing every item individually — faster, more natural.
- Delta summaries: Natural voice deltas for the AI to speak, full JSON for screen display (`TO_BOTH` routing).
- Price validation: Rejects $0 items with friendly retry messages — catches model hallucination when it skips search.
- 8% sales tax: Hardcoded tax rate applied to all orders, displayed on the order ticket.
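The grouped-readback behavior can be sketched in a few lines of Python. The function name, number words, and naive pluralization below are illustrative, not the app's actual helpers:

```python
from collections import Counter

# Illustrative sketch; the real helper lives in the backend order logic
# and will differ in naming and pluralization rules.
NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

def grouped_readback(items: list[tuple[str, str]]) -> str:
    """Group (size, name) line items into a voice-friendly summary,
    e.g. 'Two Medium Coca-Colas and one McNuggets'."""
    counts = Counter(items)
    parts = []
    for (size, name), qty in counts.items():
        label = f"{size} {name}".strip()
        if qty > 1 and not label.endswith("s"):
            label += "s"  # naive pluralization, good enough for a sketch
        parts.append(f"{NUMBER_WORDS.get(qty, str(qty))} {label}")
    if not parts:
        return "Nothing on the ticket yet"
    summary = parts[0] if len(parts) == 1 else ", ".join(parts[:-1]) + " and " + parts[-1]
    return summary[0].upper() + summary[1:]
```

Grouping by (size, name) pairs, rather than listing each line item, is what keeps the spoken summary short even for repeated items.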
- Azure AI Search for menu RAG: 172 items indexed from sample McDonald's menu data (`mcdonalds-menu-items.json`) with semantic hybrid search (text-embedding-3-large, 3072 dimensions).
- TTL search cache: A 60-second, 128-entry cache for Azure AI Search results eliminates redundant queries.
- Human-readable sizes: "Small ($2.49), Medium ($3.29)" instead of raw JSON in tool results.
- Gzip compression: 60–70% reduction on HTTP responses for mobile-first experience.
- Strategic vendor chunking: Optimized frontend bundle splitting in Vite with explicit groups (react-vendor, ui-vendor, i18n, motion).
- Lazy-loaded Settings: `React.lazy()` + `Suspense` for the Settings panel — faster initial page load.
- Grounded recommendations: Azure OpenAI tool-calling plus semantic hybrid search keep menu suggestions grounded with pricing, sizes, and add-on guidance — zero hallucinated items. Explicit grounding rule: "ONLY recommend items found in search results."
- [SYSTEM HINT] pattern: Deterministic Python logic drives conversation direction, not LLM memory. Tools return both voice-friendly text for the AI and JSON metadata for the frontend.
- "Your McDonald's Order" order ticket: Live-updating order panel shows every item, customization, size, quantity, subtotal, tax, and total as the guest speaks — the real-time equivalent of a drive-thru order ticket.
- Live synchronization: Function calls update the shared cart so drive-thru screens, mobile devices, and crew member tablets stay aligned without race conditions.
- 50 menu items: 10 items per category across 5 collapsible categories (Burgers & Sandwiches, Chicken & McNuggets, Shakes & Drinks, McCafé & Ice Cream, Extras & Sides) — all expanded by default. Menu synced with the Azure AI Search demo index.
- Collapsible session token panel: Shows round-trip token history with per-turn identifiers for debugging and QA.
- Settings panel: Verbose Logging toggle, Log to File toggle (sub-option of Verbose Logging), and Show Session Tokens toggle.
- Dark mode support: Full dark/light theme switching.
- Responsive design: Optimized for desktop and mobile viewports.
- Verbose logging (`mcdonalds-verboselogger`): Dedicated diagnostic logger separate from the main application logger. Logs every message type, the full tool call lifecycle (args, result, direction, execution time), echo suppression state changes, transcriptions, and session lifecycle events. Audio data is never logged.
- File logging: Timestamped log files written to `app/backend/logs/` (e.g., `verbose-2026-03-22T01-38.log`). UTF-8, line-buffered. Per-session file handlers toggled via the UI or the `VERBOSE_LOG_FILE` env var.
- Session token tracking: Every realtime conversation emits session tokens plus per-turn identifiers so transcripts map back to telemetry, QA findings, or Azure logs.
- Multilingual ordering: Guests receive accurate transcripts in their language of choice with instant pivots between English, Spanish, Mandarin, French, and more.
- Browser audio playback: Mirrors what a guest would hear at a McDonald's drive-thru, supporting screenless or low-vision ordering.
- Phi-4-mini-instruct via ONNX Runtime: Run the complete AI drive-thru experience without internet connectivity using Microsoft's Phi-4-mini-instruct model (3.8B parameters, INT4 quantized) through ONNX Runtime GenAI — the same ordering smarts, powered locally.
- Piper TTS voices: Four curated drive-thru voices — Amy (US, friendly), Jenny (UK, upbeat), Lessac (US, warm), Kristin (US, clear) — with `length_scale=0.9` for energetic delivery. Switch voices from the settings panel.
- One-toggle switch: The settings panel provides a single "Local Mode" toggle. When enabled, the UI swaps to a local voice selector, shows an offline indicator on the mic button, and routes all AI processing through the local ONNX pipeline.
- CPU, GPU, and NPU support: Offline mode runs on CPU out of the box, though a GPU (CUDA/DirectML) or NPU is strongly recommended for real-time inference performance. Auto-detects available hardware at startup.
- Azure Local compatible: Pairs seamlessly with Azure Local (formerly Azure Stack HCI) for edge deployments — enabling uninterrupted AI drive-thru experiences in environments with limited, intermittent, or no cloud connectivity.
- Graceful degradation: If local model files aren't downloaded, the toggle is automatically disabled. Cloud mode remains fully functional — offline mode is purely additive.
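The graceful-degradation check can be as simple as probing for the model directories before exposing the toggle. The paths below are hypothetical; the actual layout is defined by the model download step:

```python
from pathlib import Path

# Hypothetical asset locations; the real paths depend on where models are downloaded.
REQUIRED_LOCAL_ASSETS = [
    Path("models/phi4-mini"),  # Phi-4-mini INT4 ONNX weights
    Path("models/piper"),      # Piper TTS voices
]

def local_mode_available(assets: list[Path] = REQUIRED_LOCAL_ASSETS) -> bool:
    """Enable the Local Mode toggle only when every local model asset exists.
    Cloud mode is unaffected either way."""
    return all(p.exists() for p in assets)
```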
Imagine a guest pulling up to a McDonald's drive-thru. They tap the mic button on their phone (or press the drive-thru intercom), and from that moment, an entire agentic pipeline fires in real-time. Here's what happens behind the scenes — every step, every decision, every millisecond matters.
1. The Guest Speaks
"I'll take a Big Mac — plain, cheese only — Medium Fries, and a Large Diet Coke. Actually, can I add a McFlurry but… put some pickles in it?"
The browser's WebAudio API captures raw audio from the microphone. An AnalyserNode monitors the RMS energy of the raw stream in real-time — this is how the system knows the guest is actually speaking versus picking up ambient drive-thru noise or echo from the AI's own response.
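In the app this runs in the browser (an `AnalyserNode` in TypeScript), but the underlying computation is just root-mean-square energy over an audio frame. A Python equivalent, with an illustrative threshold:

```python
import struct

# Illustrative threshold; the real tuning lives in the frontend.
SPEECH_RMS_THRESHOLD = 0.05

def frame_rms(pcm16: bytes) -> float:
    """RMS energy of a mono 16-bit little-endian PCM frame, normalized to 0..1."""
    if not pcm16:
        return 0.0
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    mean_square = sum(s * s for s in samples) / len(samples)
    return (mean_square ** 0.5) / 32768.0

def is_speech(pcm16: bytes) -> bool:
    """Crude speech/non-speech decision: loud frames count as speech."""
    return frame_rms(pcm16) >= SPEECH_RMS_THRESHOLD
```

Ambient road noise and residual echo sit well below a voiced utterance in RMS terms, which is why a simple energy threshold is enough to distinguish "guest is talking" from "mic is just open".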
2. Frontend → Middleware (WebSocket)
The React/TypeScript frontend encodes the captured audio to base64 and streams it over a persistent WebSocket connection to the Python backend. This isn't a request-response cycle — it's a continuous, low-latency stream. The guest's words arrive at the server as fast as they're spoken.
3. RTMiddleTier — The Agentic Logic Layer
This is where the intelligence lives. The RTMiddleTier (rtmt.py) acts as a WebSocket bridge between the browser and Azure OpenAI, but it's far more than a passthrough — it's the orchestration brain:
- Echo suppression kicks in immediately: a 1.5-second cooldown window and delayed buffer flush prevent the AI from hearing its own voice bouncing back through the guest's speakers. After the initial greeting, an extended 3.0-second cooldown ensures stability.
- Barge-in detection monitors the raw audio stream. If the guest interrupts mid-sentence ("Actually, change that to—"), the system fires `response.cancel` to stop the AI mid-word and lets the guest take the floor. Natural conversation, not robotic turn-taking.
- Session management handles the greeting trigger and registers all four tool-calling functions with the Azure OpenAI Realtime API.
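A minimal sketch of that gating logic, assuming the flag and cooldown names described above (the real `rtmt.py` is more involved):

```python
import time

class EchoGate:
    """Sketch of server-side audio gating: mic audio is dropped while the AI
    is speaking and for a cooldown window afterwards, unless the Realtime API
    reports speech_started (a barge-in), which overrides the gate."""

    def __init__(self, cooldown_s: float = 1.5):
        self.cooldown_s = cooldown_s
        self.ai_speaking = False
        self._ai_stopped_at = 0.0

    def on_ai_audio_start(self) -> None:
        self.ai_speaking = True

    def on_ai_audio_end(self) -> None:
        self.ai_speaking = False
        self._ai_stopped_at = time.monotonic()

    def should_forward_mic_audio(self, barge_in: bool = False) -> bool:
        if barge_in:        # speech_started overrides echo suppression
            return True
        if self.ai_speaking:
            return False
        # Still inside the post-playback cooldown? Keep dropping audio.
        return time.monotonic() - self._ai_stopped_at >= self.cooldown_s
```

The barge-in override is the important design choice: without it, the cooldown that protects against echo would also silence a guest who interrupts.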
4. Azure OpenAI Realtime API (GPT-4o)
The audio hits Azure OpenAI's GPT-4o Realtime API (gpt-realtime-1.5), which processes the guest's speech and decides what to do. It doesn't just transcribe — it understands intent and generates both a spoken response and structured tool calls as JSON function calls (the "Citation Payloads" shown in the diagram). This is the agentic core: the model autonomously decides which tools to invoke based on the conversation context.
5. Tool Execution — The Agentic Toolkit
When the model makes a tool call, the middleware executes it deterministically. Four tools drive the entire ordering flow:
| Tool | What It Does |
|---|---|
| `search` | Queries Azure AI Search across 172 demo menu items using semantic + vector hybrid search (text-embedding-3-large, 3072 dimensions). Returns human-readable sizes and prices — "Medium ($3.29), Large ($4.19)" — not raw JSON. Results come back with a 60-second TTL cache so repeat lookups are instant. |
| `update_order` | Adds or removes items through the Stateful Order Manager. Validates combo integrity (are the side and drink present?), applies customizations, enforces quantity limits (max 10 per item, 25 total), and normalizes sizing. |
| `get_order` | Retrieves the current order as a grouped readback optimized for voice — "Two Medium Coca-Colas and one McNuggets" instead of listing each item individually. Returns both a voice-friendly summary for the AI and full JSON for the order ticket UI. |
| `reset_order` | Clears the entire order and resets the session so the guest can start fresh. |
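The search tool's caching behavior reduces to a small TTL structure with bounded size. This is an assumed shape, not the repo's actual class:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Sketch of a 60-second, 128-entry search cache (assumed shape)."""

    def __init__(self, ttl_s: float = 60.0, max_entries: int = 128):
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._entries: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        hit = self._entries.get(key)
        if hit is None:
            return None
        stored_at, value = hit
        if time.monotonic() - stored_at > self.ttl_s:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
```

A short TTL keeps repeat lookups ("fries", "medium fries") instant within one conversation without letting stale menu data linger.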
6. Order State — The Business Logic Brain
The Stateful Order Manager (order_state.py) is where deterministic business rules live — no LLM guesswork allowed:
- Combo pivot absorption: When a guest orders a Big Mac Combo, any standalone side or drink already on the ticket gets absorbed into the combo automatically. No awkward "Did you want that as part of the combo?" back-and-forth.
- Deterministic guardrails: The `[SYSTEM HINT]` pattern injects combo requirements directly into tool results — "Missing: Drink" — so the AI knows exactly what to ask for next without relying on memory.
- Promotions engine: The system checks the clock. If it's Happy Hour (2–4 PM Eastern), drinks and shakes get 50% off automatically. The AI gets genuinely excited about the deal.
- IoT kitchen telemetry: Machine status flags are checked in real time. Shake machine down? Every shake, McFlurry, and sundae comes back flagged `[OOS]` with a friendly redirect — "Our shake machine is taking a quick nap, so I can't do pickles in a shake anyway — but would you like a refreshing drink instead?"
- Validation guardrails: Impossible customizations are caught deterministically. Pickles in a shake? That's a hard no — rejected with warmth and humor, not a stack trace.
- Tax calculation: 8% sales tax applied to all demo orders, displayed on the order ticket.
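The `[SYSTEM HINT]` mechanics reduce to a deterministic check over the combo's slots. The function name comes from the feature list above; the item shape and hint wording here are assumptions:

```python
def get_combo_requirements(combo_items: list[dict]) -> str:
    """Return a [SYSTEM HINT] string to append to tool results when a combo
    is missing its side or drink, or '' when the combo is complete.
    (Item shape is assumed: each item carries a 'slot' key.)"""
    missing = []
    for slot in ("side", "drink"):
        if not any(item.get("slot") == slot for item in combo_items):
            missing.append(slot.capitalize())
    if not missing:
        return ""
    return f"[SYSTEM HINT] Missing: {', '.join(missing)}. Ask the guest what they'd like."
```

Because Python computes the hint on every tool call, the AI never has to remember what a combo still needs; the answer rides along with each tool result.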
7. The Response Flows Back
The response takes three parallel paths back to the guest:
- Audio → streams through the WebSocket back to the frontend → plays through the guest's speakers (with echo suppression engaged to prevent feedback loops). The AI's Nova voice — warm, friendly, unmistakably McDonald's — delivers the response.
- Tool results → the frontend parses JSON payloads and updates the Order Ticket in real-time: line items, customizations, combo groupings, subtotals, tax, and the running total. The POS Ticket view shows exactly what would print at the drive-thru.
- Transcript → the guest's words and the AI's response appear in the Guest Conversation panel with real-time transcription (and translation, if the guest is speaking Spanish, Mandarin, or another supported language).
8. The Guest Hears and Sees
The guest hears the AI crew member respond naturally — "You got it! A plain Big Mac and those Fries and Coke. Our shake machine is taking a quick nap, so I can't do pickles, but would you like a refreshing drink instead?" — while simultaneously watching their order ticket update in real-time on screen. Every item, every mod, every price, every total — all in sync, all instant.
The entire round trip — guest speech → AI understanding → tool execution → business logic → voice response + UI update — happens in under two seconds. That's the power of an agentic architecture where deterministic Python guardrails and Azure OpenAI work in concert, not in conflict.
Note: This demo uses sample McDonald's menu data (172 items) for demonstration purposes. All prices, promotions, and machine statuses are simulated to showcase the agentic architecture capabilities.
The RTClient in the frontend receives the audio input, sends that to the Python backend which uses an RTMiddleTier object to interface with the Azure OpenAI Realtime API, and includes a tool for searching Azure AI Search.
The architecture implements a WebSocket middle tier that bridges the browser and Azure OpenAI in real-time, with the backend handling:
- Audio gating & echo suppression for stable, interrupt-friendly conversations
- Tool-calling orchestration: Menu search, combo validation, order management
- [SYSTEM HINT] injection: Deterministic Python logic guides conversation without relying on LLM memory
- TO_BOTH payloads: Split responses between voice-friendly text for the AI and JSON metadata for the frontend
Offline Mode Architecture: When local mode is enabled, the `ProcessorRouter` redirects the WebSocket connection from `RTMiddleTier` (Azure OpenAI) to `LocalPhi4Processor` (Phi-4 ONNX + Piper TTS). The frontend, tools, and order state remain identical — only the AI inference layer swaps.
Frontend:
- React, TypeScript, Vite, Tailwind CSS, shadcn/ui
- WebSocket client for real-time audio and order updates
- 50 demo menu items from `menuItems.json` (synced with the Azure AI Search index)
Backend:
- Python 3.11+ with aiohttp, WebSockets
- WebSocket middle tier (`rtmt.py`) — browser ↔ Azure OpenAI Realtime API
- Azure OpenAI GPT-4o Realtime API (gpt-realtime-1.5)
- Demo menu data from `mcdonalds-menu-items.json` (sample McDonald's menu export, 172 items)
AI & Search:
- Azure AI Search with semantic hybrid search (text-embedding-3-large, 3072 dimensions) for menu grounding
- Four tool-calling functions: `search`, `update_order`, `get_order`, `reset_order`
Infrastructure:
- Bicep IaC for reproducible deployments
- Azure Container Apps with auto-scaling (20 concurrent requests/replica, max 5 replicas)
- Gunicorn with 2 async workers, 120s timeout, 65s keep-alive
- Docker with layer caching for fast rebuilds
- Health probes: startup (50s), liveness (30s), readiness (10s)
- Azure Developer CLI (`azd`) for one-command provisioning
Offline AI (Local Mode):
- Microsoft Phi-4-multimodal-instruct (5.6B params, INT4 ONNX) for speech understanding and text generation
- ONNX Runtime GenAI for local model inference (CUDA, DirectML, or CPU)
- Piper TTS for local text-to-speech (4 curated voices, ~60MB each)
- Faster-Whisper (small model, 244 MB) for local customer speech transcription
- Audio pipeline: 24kHz PCM → 16kHz downsample → Phi-4 → Piper TTS → 24kHz PCM
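The 24 kHz → 16 kHz step in that pipeline is a plain sample-rate conversion. A naive linear-interpolation sketch for mono 16-bit PCM (production code would use a proper resampler such as soxr or scipy):

```python
import struct

def resample_pcm16(pcm: bytes, src_hz: int = 24000, dst_hz: int = 16000) -> bytes:
    """Naive linear-interpolation resampler for mono 16-bit little-endian PCM.
    Sketch of the 24 kHz -> 16 kHz downsample step; not the app's real code."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    if not samples:
        return b""
    out_len = int(len(samples) * dst_hz / src_hz)
    out = []
    for i in range(out_len):
        pos = i * src_hz / dst_hz          # fractional position in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return struct.pack(f"<{len(out)}h", *out)
```

Every three source samples become two output samples (24000/16000 = 3/2), which is why the output buffer is two-thirds the length of the input.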
This repository includes infrastructure as code and a Dockerfile to deploy the app to Azure Container Apps, but it can also be run locally as long as Azure AI Search and Azure OpenAI services are configured.
You have a few options for getting started with this template. The quickest is GitHub Codespaces, which sets up all the tools for you, but you can also run it locally or use a VS Code dev container.
You can run this repo virtually by using GitHub Codespaces, which opens a web-based VS Code in your browser:
- In your forked GitHub repository, select Code ➜ Codespaces ➜ Create codespace on main.
- Choose a machine type with at least 8 cores (the 32 GB option provides the smoothest dev experience).
- After the container finishes provisioning, open a new terminal and proceed to deploying the app.
You can run the project in your local VS Code Dev Container using the Dev Containers extension:
- Start Docker Desktop (install it if not already installed).
- Clone your GitHub repository locally (see Local environment).
- Open the folder in VS Code and choose Reopen in Container when prompted (or run the Dev Containers: Reopen in Container command).
- After the container finishes building, open a new terminal and proceed to deploying the app.
1. Install the required tools by running the prerequisites script:

   ```bash
   # Make the script executable
   chmod +x ./scripts/install_prerequisites.sh
   # Run the script
   ./scripts/install_prerequisites.sh
   ```

   The script installs the Azure CLI, signs you in, and verifies Docker availability for you. Alternatively, manually install the Azure Developer CLI, Node.js, Python >=3.11, Git, and Docker Desktop.
2. Clone your GitHub repository (git clone https://github.com/swigerb/mcdonalds_ai_drivethru.git)
3. Proceed to the next section to deploy the app.
If you have a JSON file containing the menu items for your drive-thru, you can use the provided Jupyter notebook to ingest the data into Azure AI Search.
- Open the `menu_ingestion_search_json.ipynb` notebook.
- Follow the instructions to configure Azure OpenAI and Azure AI Search services.
- Prepare the JSON data for ingestion.
- Upload the prepared data to Azure AI Search.
This notebook demonstrates how to configure Azure OpenAI and Azure AI Search services, prepare the JSON data for ingestion, and upload the data to Azure AI Search for hybrid semantic search capabilities.
Link to JSON Ingestion Notebook
If you have a PDF file of a drive-thru's menu that you would like to use, you can use the provided Jupyter notebook to extract text from the PDF, parse it into structured JSON format, and ingest the data into Azure AI Search.
- Open the `menu_ingestion_search_pdf.ipynb` notebook.
- Follow the instructions to extract text from the PDF using OCR.
- Parse the extracted text using GPT-4o into structured JSON format.
- Configure Azure OpenAI and Azure AI Search services.
- Prepare the parsed data for ingestion.
- Upload the prepared data to Azure AI Search.
This notebook demonstrates how to extract text from a menu PDF using OCR, parse the extracted text into structured JSON format, configure Azure OpenAI and Azure AI Search services, prepare the parsed data for ingestion, and upload the data to Azure AI Search for hybrid semantic search capabilities.
Link to PDF Ingestion Notebook
You have two options for running the app locally for development and testing:
Run this app locally using the provided start scripts:
1. Create an `app/backend/.env` file with the necessary environment variables. You can use the provided sample file as a template:

   ```bash
   cp app/backend/.env-sample app/backend/.env
   ```

   Then, fill in the required values in the `app/backend/.env` file.

2. Run this command to start the app:

   Windows:

   ```powershell
   pwsh .\scripts\start.ps1
   ```

   Linux/Mac:

   ```bash
   ./scripts/start.sh
   ```

3. The app will be available at http://localhost:8000

For GPU-accelerated local AI mode, use the `-GPU` flag when starting the app (see Setting Up Offline Mode for GPU setup instructions).
For testing in an isolated container environment:
1. Make sure you have an `.env` file in the `app/backend/` directory as described above.

2. Run the Docker build script:

   ```bash
   # Make the script executable
   chmod +x ./scripts/docker-build.sh
   # Run the build script
   ./scripts/docker-build.sh
   ```

   This script automatically handles:
   - Verifying/creating frontend environment variables
   - Building the Docker image using `app/frontend/.env` for Vite settings
   - Running the container with your backend configuration

3. Navigate to http://localhost:8000 to use the application.
Alternatively, you can manually build and run the Docker container:

```bash
# Ensure frontend Vite settings exist (edit values as needed)
# cp ./app/frontend/.env-sample ./app/frontend/.env

# Build the Docker image
docker build -t mcdonalds-drive-thru-app \
  -f ./app/Dockerfile ./app

# Run the container with your environment variables
docker run -p 8000:8000 --env-file ./app/backend/.env mcdonalds-drive-thru-app:latest
```

To deploy the demo app to Azure:
1. Make sure you have an `.env` file set up in the `app/backend/` directory. You can copy the sample file:

   ```bash
   cp app/backend/.env-sample app/backend/.env
   ```

2. Run the deployment script with minimal parameters:

   ```bash
   # Make the script executable
   chmod +x ./scripts/deploy.sh
   # Run the deployment with just the app name (uses all defaults)
   ./scripts/deploy.sh <name-of-your-app>
   ```

   The script will automatically:
   - Look for backend environment variables in `./app/backend/.env`
   - Look for or create frontend environment variables in `./app/frontend/.env`
   - Use the Dockerfile at `./app/Dockerfile`
   - Use the Docker context at `./app`

3. For more control, you can specify custom paths:

   ```bash
   ./scripts/deploy.sh \
     --env-file /path/to/custom/backend.env \
     --frontend-env-file /path/to/custom/frontend.env \
     --dockerfile /path/to/custom/Dockerfile \
     --context /path/to/custom/context \
     <name-of-your-app>
   ```

4. After deployment completes, your app will be available at the URL displayed in the console.
The McDonald's AI Drive-Thru supports a fully offline Local Mode that runs entirely on the user's GPU — no Azure or cloud dependencies required. This is ideal for demos, air-gapped environments, and edge deployments where connectivity is limited.
Local Mode delivers a complete AI drive-thru experience on consumer hardware. The app swaps Azure OpenAI services for a self-hosted inference stack: Phi-4-mini-instruct (3.8B LLM, text-only), Whisper base.en (STT on CPU), and Piper TTS (speech synthesis). When toggled on in the settings panel, guests enjoy the same natural, multilingual ordering conversation — powered entirely by your GPU.
Local Mode Architecture:
- LLM: Phi-4-mini-instruct (3.8B params, INT4 ONNX) via `onnxruntime-genai-directml` — text-only model downloaded from `microsoft/Phi-4-mini-instruct-onnx`
- STT: Whisper base.en on CPU (not tiny) for better food vocabulary accuracy — ~200ms transcription for short utterances
- TTS: Piper TTS (Amy voice, en_US) with 0.7 length_scale for faster, upbeat energy
- Pipeline: Sequential — Whisper STT → Phi-4-mini → Piper TTS with half-duplex audio (mic muted during AI response)
- Hardware: DirectML on Windows 11 — works with NVIDIA RTX, AMD, and Intel Arc GPUs
Component Comparison:
| Component | Cloud Mode | Local Mode |
|---|---|---|
| Speech Understanding | Azure OpenAI GPT-4o Realtime | Phi-4-mini-instruct (ONNX INT4) |
| Customer Transcription | Whisper-1 (via Azure OpenAI) | Whisper base.en (CPU) |
| Text Generation | GPT-4o Realtime | Phi-4-mini-instruct (ONNX INT4) |
| Voice Synthesis | Azure OpenAI voices (shimmer, coral, etc.) | Piper TTS (Amy, en_US, 0.7 length_scale) |
| Menu Search | Azure AI Search (semantic + vector) | Local in-memory search (keyword matching) |
| Order Management | Same | Same (runs locally in both modes) |
The backend uses a ProcessorRouter that delegates WebSocket connections to either RTMiddleTier (cloud) or LocalPhi4Processor (local) based on the user's toggle. Both implement the same WebSocket protocol, so the frontend works identically in both modes.
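That routing can be sketched as a thin delegator. The constructor arguments and method name here are assumptions; the idea is as described: both processors speak the same WebSocket protocol, so swapping them is invisible to the frontend.

```python
class ProcessorRouter:
    """Sketch of routing between cloud and local processors (assumed names).
    Both processors expose the same handle_connection coroutine."""

    def __init__(self, cloud_processor, local_processor):
        self.cloud = cloud_processor    # e.g. an RTMiddleTier instance
        self.local = local_processor    # e.g. a LocalPhi4Processor instance
        self.local_mode = False

    def set_local_mode(self, enabled: bool) -> None:
        # Flipped by the settings-panel toggle; no reconnect required.
        self.local_mode = enabled

    def pick(self):
        return self.local if self.local_mode else self.cloud

    async def handle_connection(self, websocket):
        # Delegate the entire WebSocket session to the selected processor.
        await self.pick().handle_connection(websocket)
```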
System Prompt (Enriched ~610 tokens):
- Full menu with numbered meals and approximate prices
- Meal/combo logic (sandwich + fries + drink)
- Upsell rules, dessert offers, and order readback instructions
- Phi-4-mini uses natural language (no tool calling) to handle ordering and conversation flow
Benchmarked on NVIDIA RTX 4060 (8GB VRAM):
| Metric | Cloud Mode | Local Mode |
|---|---|---|
| Time to first token | ~200ms | <10ms |
| Full response | ~1-2s | ~6s |
| VRAM usage | N/A | ~4.7 GB |
Local mode prioritizes responsiveness for real-time demos. The <10ms time-to-first-token ensures snappy interactions. The full response (~6s) is slower than cloud but acceptable for an interactive, voice-driven experience from a 3.8B-parameter model.
~4.7 GB total on 8GB RTX 4060:
| Component | VRAM |
|---|---|
| Phi-4-mini INT4 | ~3.0 GB |
| Whisper base.en (CPU) | 0 GB (runs on CPU) |
| Piper TTS | ~0.1 GB |
| KV Cache + OS | ~1.5 GB |
- Close non-essential apps before running local mode
- Run `nvidia-smi` to check VRAM — aim for under 3GB used before starting
- If VRAM is starved, inference can slow to 42s instead of 6s
Why Whisper on CPU? The base.en model (forced to CPU via stt_device: "cpu") frees ~0.5-1.0 GB GPU VRAM for the Phi-4 KV cache, enabling longer conversation history. CPU inference is still fast (~200ms for short utterances) and the model performs better on food-related vocabulary than smaller variants.
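A hedged sketch of that device placement using the faster-whisper API. The `choose_stt_device` policy and its thresholds are illustrative; only `WhisperModel(...)` is the library's real entry point:

```python
def choose_stt_device(free_vram_gb: float, reserve_for_llm_gb: float = 4.0) -> str:
    """Illustrative policy: only put STT on the GPU when there is at least
    1 GB of headroom beyond what the Phi-4 KV cache needs."""
    return "cuda" if free_vram_gb - reserve_for_llm_gb >= 1.0 else "cpu"

def load_stt_model(stt_device: str = "cpu"):
    """Load Whisper base.en via faster-whisper. The import is deferred so
    this sketch stays importable without the third-party package installed."""
    from faster_whisper import WhisperModel
    # int8 compute keeps CPU inference around ~200 ms for short utterances
    return WhisperModel("base.en", device=stt_device, compute_type="int8")
```

On an 8 GB card where Phi-4 needs ~4.7 GB, this policy lands on `"cpu"`, matching the `stt_device: "cpu"` default described above.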
This budget fits comfortably on consumer GPUs (RTX 4060, RTX 4070, AMD 7700XT, Intel Arc A750) when other apps are closed.
1. First time: install dependencies and download models

   ```bash
   # Install Hugging Face Hub
   pip install huggingface-hub
   # Download Phi-4-mini (~3.25 GB download)
   python -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/Phi-4-mini-instruct-onnx', allow_patterns=['gpu/*'], local_dir='models/phi4-mini', local_dir_use_symlinks=False)"
   ```

2. Start with GPU support

   ```powershell
   .\scripts\start.ps1 -GPU
   ```

   The script automatically installs `onnxruntime-genai-directml` and swaps CPU dependencies for GPU variants.

3. Toggle Local Mode in the settings panel (⚙️) to switch between cloud and local AI
The Guest Conversation panel shows both customer transcripts and AI responses, identical to cloud mode.
We evaluated multiple models for the best real-time demo experience:
- Phi-4-multimodal (5.6B): Too slow (~27s time-to-first-token on RTX 4060) — not viable for interactive ordering; also requires audio-in preprocessing
- Phi-4-mini (3.8B): <10ms TTFT, ~7.8 tok/s — perfect for real-time demos ✓ (text-only; uses Whisper for STT)
- Whisper tiny vs. base.en: Tiny model loses food vocabulary accuracy. Base.en on CPU provides better ordering accuracy without GPU overhead
- Piper Amy voice: 0.7 length_scale for faster, energetic drive-thru personality (standard voice length_scale is 1.0)
Settings Panel Local Mode Toggle:
- When toggled ON: All AI requests route to `LocalPhi4Processor`
- When toggled OFF: All requests route to `RTMiddleTier` (Azure OpenAI)
- Seamless switching — no page reload required
Guest Conversation Panel:
- Shows customer speech transcript (from Faster-Whisper)
- Shows AI response (from Phi-4-mini)
- Same multilingual support and order confirmation flow as cloud mode
- First inference warmup (~30-40s DirectML startup) on initial response — subsequent responses ~6s in ideal VRAM conditions
- Response latency (~6s full response in good VRAM) is slower than cloud (~1-2s) — inherent to 3.8B model on consumer GPU; can degrade to 42s if VRAM is starved by other apps
- Half-duplex audio — mic is muted while AI generates, preventing audio feedback loops
- No tool calling — Phi-4-mini handles ordering via natural language; no structured tool_call JSON
- STT accuracy — Generally good, but complex food items may be misheard (e.g., "quarter pounder" occasionally)
- Piper TTS voice quality — Functional but less natural than cloud voices (shimmer/coral)
- 30s inference timeout with graceful fallback to default response if model takes too long
These tradeoffs are acceptable for demos, edge deployments, and air-gapped environments where zero cloud dependency is the priority.
This project was built by an AI development team powered by Squad — a GitHub Copilot agent created by Brady Gaster that assembles AI dev teams with persistent memory, shared decision-tracking, and orchestrated workflows.
Here's the best part: Squad's casting algorithm analyzed this project's context and auto-selected the McDonald's universe for the team. That's right — an AI drive-thru for McDonald's was built by AI agents named after McDonald's characters. You can't make this stuff up. 🎤⬇️
| | Agent | Role | What They Do |
|---|---|---|---|
| 🏗️ | Ronald | Lead | Architecture, decisions, code review — sees the whole system, makes the call |
| ⚛️ | Birdie | Frontend Dev | React, TypeScript, UI components, audio client, real-time WebSocket integration |
| 🔧 | Grimace | Backend Dev | Python, Azure OpenAI Realtime API, WebSockets, AI Search, tool routing |
| 🧪 | Hamburglar | Tester | pytest, edge cases, quality gates, performance test harness |
| ⚙️ | Mayor McCheese | DevOps | Bicep, Docker, Azure Container Apps, health probes, scaling |
| 🤖 | Mac Tonight | AI/Realtime Expert | GPT-4o Realtime tuning, voice AI, system prompts, demo readiness |
| 📋 | Scribe | Session Logger | Memory, decisions, orchestration logs — the team's shared brain |
Every architectural decision, performance optimization, and bug fix in this repo was discussed, debated, and implemented by this crew — with a human (Brian) steering the ship. 🚢
Want your own AI dev team? Check out Squad and let it cast the perfect crew for your project. Who knows what universe you'll get. 🎲
This project is licensed under the MIT License. You may use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, provided that the copyright notice and permission notice from the MIT License are included in all copies or substantial portions of the software. Refer to the LICENSE file for the complete terms.
Contributions are welcome! Please review CONTRIBUTING.md for environment setup, branching guidance, and the pre-flight test checklist before opening an issue or submitting a pull request.
All trademarks and brand references belong to their respective owners.
The diagrams, images, and code samples in this repository are provided AS IS for proof-of-concept and pilot purposes only and are not intended for production use.
These materials are provided without warranty of any kind and do not constitute an offer, commitment, or support obligation on the part of Microsoft. Microsoft does not guarantee the accuracy or completeness of any information contained herein.
MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, including but not limited to warranties of merchantability, fitness for a particular purpose, or non-infringement.
Use of these materials is at your own risk.
- OpenAI Realtime API Documentation
- Azure OpenAI Documentation
- Azure AI Services Documentation
- Azure AI Search Documentation
- Azure AI Services Tutorials
- Azure AI Community Support
- Azure AI GitHub Samples
- Azure AI Services API Reference
- Azure AI Services Pricing
- Azure Developer CLI Documentation
- Azure Developer CLI GitHub Repository


