M5Stack Tab5 Video Stream

MJPEG video + synchronized PCM audio streaming over WiFi to the M5Stack Tab5 (ESP32-P4).

The server pre-extracts frames and audio from any YouTube video (via yt-dlp) or local file into a disk cache, then serves them over HTTP. The firmware fetches frames and audio chunks on demand, decodes JPEG in hardware, rotates via PPA, and plays audio through the ES8388 codec — all with A/V sync locked to wall clock.

Hardware

Component	Detail
Board	M5Stack Tab5
SoC	ESP32-P4 (dual-core RISC-V 400 MHz)
WiFi	ESP32-C6 co-processor via SDIO
Display	5" 1280×720 MIPI-DSI panel, driven as a 720×1280 portrait framebuffer
Audio	ES8388 codec, onboard speaker

Server setup

The server runs on any Linux machine with Python 3, ffmpeg, and yt-dlp. A Raspberry Pi 5 works well and is the machine this project was developed on.

Install dependencies

pip3 install flask gunicorn yt-dlp
# ffmpeg via system package manager, e.g.:
sudo apt install ffmpeg

Configure channels

Edit server/channels.json. Each key is a channel name, and each value is a YouTube URL or a path to a local video file:

{
  "my_channel": "https://www.youtube.com/watch?v=..."
}

On the first request for a channel, the server resolves the URL with yt-dlp and extracts all frames and audio into server/cache/<channel>/. Later runs serve directly from the cache.
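As a rough illustration of the channel-to-cache mapping described above, the sketch below parses a channels.json document and works out, for each channel, whether the value is a URL or a local file and where its cache directory would live. The helper names (`is_url`, `load_channels`) and the returned dictionary layout are hypothetical and are not taken from server.py.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """True when the channel value looks like an http(s) URL, not a file path."""
    return urlparse(source).scheme in ("http", "https")


def load_channels(text: str, cache_root: str = "server/cache") -> dict:
    """Parse channels.json text into {name: {source, kind, cache_dir}}.

    The layout of the returned dict is an assumption made for illustration.
    """
    channels = json.loads(text)
    return {
        name: {
            "source": source,
            "kind": "url" if is_url(source) else "file",
            "cache_dir": str(PurePosixPath(cache_root) / name),
        }
        for name, source in channels.items()
    }
```

A URL entry would be handed to yt-dlp before extraction, while a file entry could go to ffmpeg directly; in both cases the output lands under the channel's own cache directory.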

Run the server

cd server
gunicorn -w 2 -b 0.0.0.0:8080 server:app

The server will begin extraction in the background on the first request. Video playback starts as soon as the first frames are available — you don't need to wait for the full video to be extracted.

Firmware setup

Prerequisites

ESP-IDF 5.5.3 (installed at ~/esp/esp-idf-v5.5.3, or set IDF_PATH to its location)
WiFi credentials in ~/.esp_creds:

CONFIG_WIFI_SSID="YourNetwork"
CONFIG_WIFI_PASS="YourPassword"

Configure

idf.py menuconfig
# → Video Stream Config
#   SERVER_IP  — IP address of the machine running the server
#   SERVER_PORT — 8080 by default
#   CHANNEL    — must match a key in channels.json

Build and flash

idf.py build
idf.py flash

Architecture

HTTP pull model

The firmware requests data on demand rather than the server pushing a stream. This tolerates WiFi hiccups gracefully — a missed frame is simply retried on the next request.

ESP32-P4                            Server (Pi)
─────────────────────────────       ────────────────────────────
GET /frame/<channel>/<ms>     ───►  serve frame_NNNNN.jpg from disk
GET /audio/<channel>/<s>/<n>  ───►  serve raw u8 PCM slice from audio.raw
GET /info                     ───►  channel metadata (duration, fps, etc.)
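The lookups behind the two data endpoints can be sketched in a few lines. This is a guess at the mapping, not the code in server.py: it assumes that `<ms>` is a playback position in milliseconds, that frame files are numbered from 1 (ffmpeg's default for image sequences), and that `<s>` and `<n>` are a start sample and a sample count. The function names are hypothetical.

```python
def frame_filename(position_ms: int, fps: int = 20) -> str:
    """Map a playback position to a cached frame file.

    Assumes frame_00001.jpg covers the first 1000/fps milliseconds.
    """
    index = position_ms * fps // 1000      # zero-based frame index
    return f"frame_{index + 1:05d}.jpg"    # files assumed to start at 00001


def audio_slice(raw: bytes, start: int, count: int) -> bytes:
    """Return `count` samples beginning at sample `start`.

    For mono unsigned 8-bit PCM one sample is one byte, so sample indices
    are also byte offsets into audio.raw.
    """
    return raw[start:start + count]
```

Because each request names an absolute position, a failed or late request carries no state: the next request simply asks for whatever position is current by then.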

Video pipeline (ESP32-P4)

[fetch task, core 1]          [decode task, core 0]
  HTTP GET /frame               xQueueReceive(ready_q)
  → JPEG in PSRAM slot          HW JPEG decode → RGB565
  → xQueueSend(ready_q)         PPA rotate 90° CW → framebuffer
  ← xQueueReceive(free_q)       board_lcd_commit()  (double-buffer flip)
                                vTaskDelayUntil(50ms)  ← paces to 20fps

16 pipeline slots at 50 ms per frame provide ~800 ms of buffer to absorb WiFi retransmit spikes.
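The slot recycling between the two tasks can be modelled on a desktop with Python threads. The sketch below is a simulation only, not firmware code: `queue.Queue` stands in for the FreeRTOS queues, a list stands in for the PSRAM slots, and the function name `run_pipeline` is hypothetical.

```python
import queue
import threading

PIPELINE_SLOTS = 16


def run_pipeline(frames, slots=PIPELINE_SLOTS):
    """Pass `frames` through a fixed pool of slots and return them in display order."""
    free_q = queue.Queue()    # indices of slots the fetcher may fill
    ready_q = queue.Queue()   # indices of slots holding a fetched frame
    for slot in range(slots):
        free_q.put(slot)
    buffers = [None] * slots
    shown = []

    def fetch_task():
        for frame in frames:
            slot = free_q.get()        # blocks while every slot is in use
            buffers[slot] = frame      # stands in for the HTTP GET into PSRAM
            ready_q.put(slot)
        ready_q.put(None)              # sentinel: nothing more to fetch

    def decode_task():
        while True:
            slot = ready_q.get()
            if slot is None:
                break
            shown.append(buffers[slot])  # stands in for decode, rotate, commit
            free_q.put(slot)             # hand the slot back to the fetcher

    tasks = [threading.Thread(target=fetch_task),
             threading.Thread(target=decode_task)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return shown
```

The fetcher can run ahead of the display by at most `slots` frames, which is where the buffer depth comes from; when the network stalls, the decoder keeps draining `ready_q` until it is empty.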

A/V sync

Both audio and video are timed against the wall clock, starting from the moment the first frame is successfully fetched. The I2S DMA consumes audio samples at exactly 16 kHz, so if audio fetches fall behind, the gap is heard as silence rather than building up as A/V desync.
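One way to picture the shared clock: both streams derive their position from the same elapsed time, so neither can drift relative to the other. The sketch below assumes a monotonic millisecond clock and the 20 fps / 16 kHz rates given in this README; the function names are hypothetical and the firmware's actual bookkeeping may differ.

```python
import time

FPS = 20
SAMPLE_RATE = 16_000


def now_ms() -> int:
    """Monotonic clock in milliseconds (stand-in for the firmware's timer)."""
    return time.monotonic_ns() // 1_000_000


def media_position_ms(start_ms: int, current_ms: int) -> int:
    """Elapsed playback time since the first frame was fetched."""
    return current_ms - start_ms


def video_frame_ms(position_ms: int, fps: int = FPS) -> int:
    """Timestamp of the frame that should be on screen at this position."""
    period = 1000 // fps
    return position_ms // period * period


def audio_sample_index(position_ms: int, rate: int = SAMPLE_RATE) -> int:
    """Index of the audio sample that should be playing at this position."""
    return position_ms * rate // 1000
```

For example, 2345 ms after the first frame, the video side would ask for the frame at 2300 ms and the audio side would be at sample 37520.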

Display

Frames are extracted at 992×560 (landscape) and rotated 90° CW on-device via the PPA hardware accelerator, then letterboxed into the 720×1280 portrait framebuffer. Double buffering (2 hardware DPI framebuffers) eliminates tearing.
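The letterbox geometry follows from the numbers above. The sketch below assumes the rotated image is centred in the framebuffer, which the README does not state explicitly; `letterbox` is a hypothetical helper.

```python
def letterbox(src_w: int, src_h: int, fb_w: int = 720, fb_h: int = 1280):
    """Return (rot_w, rot_h, x_off, y_off) for a 90-degree rotated, centred frame."""
    rot_w, rot_h = src_h, src_w            # a 90-degree rotation swaps the axes
    if rot_w > fb_w or rot_h > fb_h:
        raise ValueError("rotated frame does not fit in the framebuffer")
    x_off = (fb_w - rot_w) // 2            # bar width on the left and right
    y_off = (fb_h - rot_h) // 2            # bar height at the top and bottom
    return rot_w, rot_h, x_off, y_off
```

With the default 992×560 source, the rotated frame is 560×992, leaving 80-pixel bars on the left and right and 144-pixel bars at the top and bottom of the 720×1280 framebuffer.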

Server pre-processing

ffmpeg extracts frames at 20 fps and audio as mono unsigned 8-bit PCM at 16 kHz. On Raspberry Pi 5, H.265 sources use hardware decode (hevc_v4l2m2m); H.264/VP9 fall back to software (the Pi 5 CPU handles this at these resolutions).
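For orientation, ffmpeg invocations matching that description could be assembled as below. This is a sketch under assumptions, not the command lines used by server.py: the function names are hypothetical, the scale filter and output naming are inferred from the frame size and `frame_NNNNN.jpg` pattern mentioned elsewhere in this README, and the optional `decoder` argument is where a hardware decoder name such as `hevc_v4l2m2m` would go.

```python
from typing import List, Optional


def frame_extract_cmd(src: str, out_dir: str, fps: int = 20,
                      width: int = 992, height: int = 560,
                      decoder: Optional[str] = None) -> List[str]:
    """Build an ffmpeg command that writes numbered JPEG frames."""
    cmd = ["ffmpeg", "-y"]
    if decoder:
        cmd += ["-c:v", decoder]           # input decoder option, must precede -i
    cmd += ["-i", src,
            "-vf", f"fps={fps},scale={width}:{height}",
            f"{out_dir}/frame_%05d.jpg"]
    return cmd


def audio_extract_cmd(src: str, out_path: str, rate: int = 16_000) -> List[str]:
    """Build an ffmpeg command that writes mono unsigned 8-bit raw PCM."""
    return ["ffmpeg", "-y", "-i", src,
            "-vn",                         # drop the video stream
            "-ac", "1",                    # mono
            "-ar", str(rate),              # resample to the target rate
            "-f", "u8",                    # raw unsigned 8-bit PCM output
            out_path]
```

Either list can be passed to `subprocess.run`. At 16 kHz mono u8 the audio file grows by 16,000 bytes per second of video, a little under 1 MB per minute.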

Notes

The channel to play is fixed at build time: it is set via CHANNEL in menuconfig (or sdkconfig.defaults) and compiled into the firmware. To switch to a different video, update the channel name, rebuild, and reflash.

TODO

Play / pause and volume controls — use the Tab5's onboard buttons or touchscreen to pause playback and adjust volume without reflashing
On-device channel selection — browse and switch channels directly from the Tab5 touchscreen, no server interaction or reflash required
Server web interface — a browser UI to add new videos (YouTube URLs or local files), monitor extraction progress, and manage the channel list

Tuning

Parameter	Location	Effect
`FPS`	`server/server.py`	Extraction frame rate (default 20)
`PIPELINE_SLOTS`	`main/main.c`	Pre-fetch buffer depth (default 16 = ~800 ms)
`AUDIO_CHUNK_SAMPLES`	`main/main.c`	Audio fetch granularity (default 1600 = 100 ms)
`JPEG_IN_MAX`	`main/main.c`	Max compressed JPEG size per frame (default 128 KB)
`SRC_W` / `SRC_H`	`main/main.c`	Frame dimensions — must be divisible by 8
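When changing these values it helps to recompute the figures that depend on them. The sketch below assumes that "128 KB" means 128 × 1024 bytes and that every pipeline slot reserves a full `JPEG_IN_MAX` buffer; both are assumptions, and the function names are hypothetical.

```python
def derived_budgets(fps: int = 20, pipeline_slots: int = 16,
                    audio_chunk_samples: int = 1600,
                    sample_rate: int = 16_000,
                    jpeg_in_max: int = 128 * 1024) -> dict:
    """Compute the timing and memory figures implied by the tuning parameters."""
    return {
        "frame_period_ms": 1000 / fps,
        "video_buffer_ms": pipeline_slots * 1000 / fps,
        "audio_chunk_ms": audio_chunk_samples * 1000 / sample_rate,
        "jpeg_slot_bytes": pipeline_slots * jpeg_in_max,
    }


def dims_ok(src_w: int, src_h: int) -> bool:
    """Check the divisible-by-8 constraint on the frame dimensions."""
    return src_w % 8 == 0 and src_h % 8 == 0
```

With the defaults this gives a 50 ms frame period, 800 ms of video buffer, 100 ms audio chunks, and 2 MiB of compressed-frame buffers. Raising `FPS` to 25 without touching `PIPELINE_SLOTS` would shrink the video buffer to 640 ms.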
