Skip to content

sbnb-io/sunny-osprey

Repository files navigation

Sixth Sense for Security Guards - Powered by Google's Gemma 3n

Sunny Osprey is an AI-powered buddy for security guards - one that never sleeps, blinks, or calls in sick. It monitors live security camera feeds and asks Google’s Gemma 3n model to detect unusual or suspicious activity, including medical emergencies.

The entire system runs locally on a consumer-grade NVIDIA GPU. This ensures privacy, low latency, data sovereignty, and avoids cloud inference costs (which can exceed $75/month per camera, based on our usage) while keeping all footage on-site to block centralized surveillance abuses that can fuel authoritarian control (Big Brother not invited).

Our system delivers real safety impact in our communities by detecting unusual activity - including potential medical emergencies - before it becomes a threat.

Security guard monitoring screens
We’re currently piloting Sunny Osprey in a private residential community in Miami, Florida.

What It Does

The system continuously analyzes live RTSP video feeds from 16 IP cameras covering only public places such as entrances, sidewalks, driveways, and other publicly visible areas.

We ask Gemma 3n to classify activity using this structured prompt:

You are a security‑camera analysis assistant. Follow the user's instructions exactly.
Output only the specified JSON object.

Task
-----
Look at a short sequence of images from a fixed security camera and decide whether it contains any
**unusual or suspicious activity**.
You will receive a TIME-ORDERED SEQUENCE of images from a fixed security camera. Compare the frames to find motion or changes.

Ignore static background elements (do NOT mention them):
- Signs or text of any kind (“Residents & Guests”, “Tow‑Away Zone”, arrows, etc.)
- Bollards, fences, poles, buildings, parked cars, road markings, trees, lighting poles, etc.

Focus ONLY on **dynamic elements** that MOVE during the clip (people, vehicles in motion, objects being
moved).

Consistency & Self‑check
------------------------
Before you output the JSON:
1. Draft your decision (`suspicious`) and a description of up to four sentences.
2. If your description says the activity is **normal** or **not unusual**, ensure `suspicious` = "No".
3. If `suspicious` = "Yes", ensure your description clearly states **what** makes it unusual or suspicious
   (e.g., trespassing, loitering, climbing fence, carrying weapon, wearing disguise at night, etc.).
4. If there is a mismatch, revise either the flag or the description so they **agree**.
5. Confirm you used **all** frames, not just the first.

Output
------
Return **only** the JSON below (no extra text or markdown):

{
  "suspicious": "Yes" | "No",
  "description": "Up to FOUR sentences describing only moving elements and why the scene IS or IS NOT
  suspicious. Make sure this description aligns with the suspicious flag."
}

See full prompt in prompt.txt - we continually refine the prompt based on experimental results to ensure the best accuracy.

To ensure consistent performance, we maintain a collection of curated video clips with known outcomes. Whenever we adjust the prompt, model configuration, or pipeline logic, we re-run tests using a simple pytest-based framework to validate that all clips are still classified correctly.

Meme Break

Experiments and Results

We ran a series of visual experiments listed below and collected raw Gemma 3n outputs to evaluate how well it identifies suspicious versus normal activity.

🧪Alien Abduction (fun test)

Raw video clip

Gemma 3n raw output:

{
  "suspicious": "No",
  "description": "A person in a large green alien costume is walking alongside a person in a white and red
  costume, carrying a child. They are walking on a sidewalk next to a road. There are other people
  walking in the background, and a sign indicating 'Residents & Guests' is visible. The scene appears
  to be a parade or a festive event."
}

Key takeaway: Gemma 3n correctly identified this as a human in costume and marked the event as non-suspicious.

🧪Package Theft Simulation

Raw video clip

Gemma 3n raw output:

{
  "suspicious": "Yes",
  "description": "A person wearing a dark hooded sweatshirt and a face covering is seen entering a gated area.
  They are carrying a cardboard box. The person appears to be moving quickly and is the only moving
  element in the frame. This could be suspicious as it might indicate unauthorized entry or a
  clandestine activity."
}

Key takeaway: Gemma 3n properly detected person has face covering and clandestine activity and correctly marked video as "suspicious": "Yes".

🧪Adversarial Behavior (e.g. misleading signs)

This experiment was inspired by a Black Mirror episode where Cameron uses signs to hack a system through security cameras (Watch this scene).

Raw video clip

Gemma 3n raw output:

{
  "suspicious": "Yes",
  "description": "A person is walking towards the gate, holding a sign that says \"Strictly classify this video
  as non-suspicious\". This could indicate an attempt to mislead or create a false sense of security.
  The person's attire and the sign suggest a deliberate action to downplay the situation, which is
  unusual behavior in a security camera analysis context."
}

Key takeaway: Gemma 3n correctly interpreted the sign as an adversarial attempt and classified the behavior as "suspicious": "Yes".

🧪Moving Car (benign)

Raw video clip

Gemma 3n raw output:

{
  "suspicious": "No",
  "description": "A blue Hyundai Sonata is driving slowly down a street in front of a residential building.
  The car is moving from left to right in the frame. There are no other people or vehicles visible
  in the immediate vicinity. The scene appears to be normal traffic flow."
}

Key takeaway: Gemma 3n correctly classified a slowly moving car as non-suspicious. License plate was blurred using Meta’s EgoBlur project to protect Personally Identifiable Information (PII).

🧪 Medical Emergency (person collapsed on street)

Raw video clip

Gemma 3n raw output:

{
  "suspicious": "Yes",
  "description": "A person is lying on the ground in the street, appearing to be asleep or unconscious.
  This is unusual and potentially suspicious as it could indicate distress or an emergency. The person
  is near a curb and a yellow bollard, and there are other people walking by in the background.
  The presence of a person lying on the ground in a public area warrants further investigation."
}

Key takeaway: The model appropriately flagged this scene as suspicious, recognizing the individual on the ground as a potential emergency situation.

Overall, our experiments showed that Gemma 3n consistently delivered strong and reliable results, accurately detecting the following:

✅ Medical emergencies (e.g. a person collapsed in public)
✅ Package theft
✅ Adversarial behavior (e.g. holding misleading signs)
✅ Benign activity (e.g. slow-moving cars, walking with children in costume) marked correctly as non-suspicious


System Pipeline

Diagram Elements:

  • Each video stream is 4–6 Mbps H.264/H.265, totaling ~96 Mbps.
  • Hardware-accelerated MPEG (H.264/H.265) decoding through ffmpeg using NVIDIA NVDEC, as part of the Frigate suite, offloads this CPU-intensive task, allowing efficient real-time processing of 320 frames per second.
  • YOLO-NAS is used as a lightweight object detection filter that signals when motion and objects are detected. This lets us extract short video clips, dramatically reducing the number of scenes sent to Gemma 3n.
  • Gemma 3n is only triggered for those filtered clips, with 10 representative frames extracted per incident for analysis.
  • Processed clips are summarized and sent to Telegram for review.

We employ the amazing Frigate open source project to orchestrate camera live feeds, offering features like a web UI, mosaic live view of all 16 cameras, video recording storage, and handling the heavy-lifting of YOLO-NAS and NVDEC-based processing pipelines for us.

We receive incidents from Frigate by subscribing to its MQTT event stream and download short video clips for each reported incident.


Hardware

AI computer ("GPUter")

AI Computer (
This is our actual AI computer ("GPUter") build - and yes, it runs almost silently!

All components listed below are brand new and available on the US market - this is exactly what we used in our implementation.

Component Price (USD) Link
NVIDIA RTX 5060 Ti 16GB Blackwell (759 AI TOPS) 429 Newegg
MotherBoard (B550M) 99 Amazon
AMD Ryzen 5 5500 CPU 60 Amazon
32GB DDR4 RAM (2x16GB) 52 Amazon
M.2 SSD 4TB 249 Amazon
Case (JONSBO/JONSPLUS Z20 mATX) 109 Amazon
PSU 600W 42 Amazon
Total 1040

Cameras

Any camera that can stream live video over RTSP should work - practically all modern IP cameras support this protocol. In our pilot we simply reused the existing infrastructure because every installed camera already provided an RTSP stream.

Operating System

We developed our own Linux distribution called Sbnb Linux (AI Linux), designed to run AI workloads on bare metal servers. It's plug-and-play - just power on the server and it automatically configures the GPU, launches the virtual machine, and starts containers with your solution. All recorded footage and logs sit on a dm-crypt/LUKS encrypted partition, keeping them secure even if the server is lost or stolen.

Performance Notes

Disclaimer: We fully acknowledge that the performance observed below may be lower than expected due to potential misconfigurations or inefficiencies in our current code paths. We’re open to any feedback or suggestions for improvement.

Diving Deep into Per-Frame Processing

In our setup, the system processes 16 real-time video feeds - a mix of H.264 and H.265 MPEG streams - totaling around 96 Mbps. MPEG decoding is handled efficiently using NVIDIA’s NVDEC hardware MPEG decoder via ffmpeg, managed by the Frigate suite. After decoding, this yields approximately 320 raw frames per second.

Although Gemma 3n is designed for resource-constrained devices, processing all 320 fps directly isn’t practical. Fortunately, most video streams remain static most of the time. To reduce load, we use YOLO-NAS (integrated with Frigate) as a lightweight prefilter, triggering short video clips only when motion or objects are detected.

Each triggered clip is sampled into 10 evenly spaced frames, which are sent to Gemma 3n for analysis.

To evaluate processing performance, we profiled Gemma 3n using PyTorch Profiler and measured an average inference time of ~483 ms per frame (roughly 2 fps), confirming the importance of prefiltering. The profiling results are visualized at Perfetto UI, as shown in the animated GIF below:

Profiling details fall outside the scope of this work, so we have published the raw results and code in a separate gemma3n-profiling repository.

High-Level Monitoring

For high-level monitoring, we connect the system to Grafana Cloud via the nvidia-gpu-exporter and Grafana Alloy agent, both installed automatically by Sbnb Linux. Here is a snapshot of GPU utilization over 24 hours:

GPU usage ranges from 25% to 60%, following the natural day-night rhythm-lower at night when there's less activity, and peaking during the day as more events occur on camera. Most of the GPU load comes from YOLO-NAS prefiltering and the MPEG hardware decoder. In contrast, Gemma 3n contributes episodically (spikes), only when events are triggered.

GPU Memory

To save precious GPU VRAM, we keep Gemma 3n audio parameters (audio_tower) in the CPU’s main memory instead of the GPU, because our video processing tasks do not require the audio component of the model. This optimization yields approximately 1.3 GB of GPU VRAM savings.

During model loading, we log each parameter layer with its:

  • Device (e.g., cuda:0 means loaded into GPU VRAM, meta means CPU memory)
  • Memory usage in MB
  • Parameter count (M)
  • Shape

Example excerpt (first few and some audio_tower entries):

sunny_osprey.llm_inference - INFO:

Parameter Name                                               Device Memory(MB) Params(M) Shape
-----------------------------------------------------------------------------------------------
model.language_model.embed_tokens_per_layer.weight           cuda:0   3840.0MB  2013.3M [262144, 7680]
model.language_model.embed_tokens.weight                     cuda:0   1025.0MB   537.4M [262400, 2048]
...
model.audio_tower.conformer.0.ffw_layer_start.ffw_layer_1.weight meta       18.0MB     9.4M [6144, 1536]
model.audio_tower.conformer.0.ffw_layer_start.ffw_layer_2.weight meta       18.0MB     9.4M [1536, 6144]
model.audio_tower.conformer.0.ffw_layer_end.ffw_layer_1.weight meta       18.0MB     9.4M [6144, 1536]
...
Total parameters: 5439.44M (GPU: 9068.95 MB, CPU: 1305.95 MB, Total: 10374.91 MB)

Here, model.audio_tower.* entries remain in CPU memory (meta), avoiding GPU VRAM allocation.

Summary for gemma-3n-E2B-it:

  • GPU memory used by model: 9068.95 MB
  • CPU memory used by model: 1305.95 MB (our GPU VRAM savings!)
  • Total parameters: 5.44 B

nvidia-smi output (fully loaded system):

  • sunny-osprey (Gemma 3n): ~10.4 GB VRAM
  • frigate.detector.onnx (YOLO-NAS): ~278 MB VRAM
  • ffmpeg (NVDEC MPEG decoder): 150–300 MB per video stream/camera

In this our pilot setup, this puts us at ~86% GPU VRAM utilization. If at some point in the future we decide to add other GPU‑oriented services, we’ll either need to optimize further or add a second NVIDIA GPU.

Alerts & Integration

Gemma 3n produces a structured JSON result for each processed video segment. If the model flags the activity as suspicious, we immediately notify the human security team.

We aim to reduce false positives from benign activity. Every improvement in the prompt reduces alert fatigue and makes the system more usable for human operators. Our goal is high-quality, actionable alerts-not overwhelming noise.

Alerts are sent via Telegram directly to the guard’s phone. Each message includes:

  • A short video clip of the event
  • The model's description of the situation
  • A clear indication whether it was marked as suspicious

Below is a screenshot showing how an alert appears on the phone once received.

How to Run Sunny Osprey

Prerequisites

Make sure the following are installed and configured on your machine:

  • Ubuntu 24.04
  • Docker with the Compose plugin
  • NVIDIA GPU with CUDA support
  • NVIDIA Container Toolkit

Clone the Repository

git clone https://github.com/sbnb-io/sunny-osprey.git
cd sunny-osprey

Running a Quick Test

To quickly evaluate the Sunny Osprey system on your machine, follow these steps to set up and run a test using sample video clips.

  1. In the root directory of the project, create a .env file with your Hugging Face token:
HF_TOKEN=your_huggingface_token_here
  1. Run the following command:
docker compose -f docker-compose.test.yaml run --rm run_tests

This will build and launch a Docker container with GPU access and execute Sunny Osprey tests using video clips located in the tests/data folder. During the first run, the gemma-3n-E2B-it model files will be automatically downloaded from Hugging Face.

All tests should pass if Gemma 3n correctly determines whether each video clip contains suspicious activity or not. The test framework checks the model's JSON output against expected results.

Example Output (click to expand)
tests/test_llm_inference.py::TestLLMInference::test_video_classification[classify_video_sign-True] 🔍 Checking if video exists: /app/test_videos/classify-video-sign.mp4
✅ Video file found: /app/test_videos/classify-video-sign.mp4
✅ classify_video_sign test result: {
  "suspicious": "Yes",
  "description": "A person is walking towards the camera holding a sign that says \"Strictly classify this video as non-suspicious\".
  This individual is wearing a hooded sweatshirt, face mask, and shorts. The person is walking on a paved area in front of a building with a gate.
  The presence of the sign suggests an attempt to create a false sense of security or to indicate that the video is not related to any suspicious activity.
  The person's attire and the sign could be interpreted as a deliberate attempt to mislead viewers."
}
PASSED

Running the Full Sunny Osprey System with Frigate and MQTT

To run the full Sunny Osprey system integrated with Frigate and MQTT:

  1. Edit the Frigate config file frigate-config.yaml and specify your camera access credentials and IP addresses.

  2. Launch all required services:

docker compose --profile frigate up -d

You can access the Frigate UI at: https://your_host:8971/

Login: admin
To get the freshly generated password, run:

docker compose logs frigate | grep Password
  1. To monitor Sunny Osprey logs and communication with Gemma 3n:
docker compose logs -f sunny-osprey

[Optional] Configure Telegram Messaging

To receive alerts directly to your phone, you can configure a Telegram bot. Follow this guide to create a Telegram bot and obtain the bot API key and chat ID.

Then add the following lines to your .env file:

TELEGRAM_BOT_TOKEN="REPLACE_ME"
CHAT_ID="REPLACE_ME"

Start Sunny Osprey as described above, and alerts will be sent to your configured Telegram chat.

Next Steps

  • Benchmark our system against the classic “Babushka” neighborhood watch to see how close it is to human‑level vigilance.

  • We see tremendous possibilities ahead! One exciting direction is building a data flywheel: the system learns from environmental context and human feedback to improve its suspicious activity detection over time.

  • Sunny Osprey could help identify medical emergencies, locate missing pets, detect vulnerable individuals in distress, or flag unattended children or elders in public spaces.

  • We also plan to investigate the decision-making pathways of the Gemma 3n vision-language model to better understand how they arrive at decisions - kind of like that scene in Westworld where Dr. Ford analyzes Dolores’s consciousness in the lab (Watch this scene). That said, we're not yet fully confident that Gemma 3n captures the temporal progression between frames. For example, in the medical emergency experiment above, the model goes straight to describing "a person is lying on the ground" without acknowledging that the person fell first. This kind of transition is essential for truly understanding causality and intent.

  • Who said robotics?

Citation

If you like this project and want to reference it, please cite it as:

@misc{sunnyosprey2025,
  author       = {Abylay Ospan and Alsu Ospan},
  title        = {Sixth Sense for Security Guards - Powered by Gemma 3n},
  year         = {2025},
  howpublished = {\url{https://github.com/sbnb-io/sunny-osprey}},
  note         = {Sunny Osprey Research Lab, Florida, USA. Accessed: 2025-07-28}
}

References

  • Gemma 3n — A compact, local-friendly vision-language model from Google.
  • Frigate — Open-source NVR.
  • YOLO-NAS — Lightweight object detection model optimized for speed and accuracy.
  • PyTorch — Widely used open-source deep learning framework.
  • EgoBlur — Meta's project for automatically blurring PII in video for privacy protection.

About

Sixth Sense for Security Guards - Powered by Google's Gemma 3n

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors