
# Room Reserve AI System

A local, distributed, multimodal AI system composed of specialized microservices, with RAG, long-term memory, routing, and more.

## Architectural schema

| Layer | Responsibility |
| --- | --- |
| Compute Layer | llama.cpp (GPU) |
| Core OS Layer | Prompt / Context / Policy / Router |
| Runtime Layer | Tools / Memory / RAG / Vision |
| Cognitive Layer | Agents / Planner / Critic / Verifier |
| Access Layer | API / Billing / Auth / Scale |

## Hardware load distribution

Designed for low-VRAM systems (8 GB GPU).

| Component | Device |
| --- | --- |
| LLM service | GPU |
| All other services | CPU |
| Memory / vectors | RAM / NVMe |
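Under Docker Compose, this GPU/CPU split can be expressed per service. A minimal sketch, assuming the compose file defines services matching the table above (the `memory` service name and the resource values here are illustrative, not the project's actual configuration):

```yaml
services:
  llm:
    image: llama-engine
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]   # pin the LLM service to the GPU
  memory:
    image: memory-service           # illustrative name; runs CPU-only
    cpus: "2.0"
```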

## Guide

### 0. Sync uv packages

```shell
uv sync --all-packages
```

### 1. Build the llama.cpp Docker image

```shell
cd engines/llama-engine
docker build -t llama-engine .
```

### 2. Place `.gguf` model files in `models/`

### 3. Configure models in `models/models.ini`

Example:

```ini
[Qwen3VL-8B-Instruct-Q4_K_M]
model = ./models/llm/Qwen3VL-8B-Instruct-Q4_K_M.gguf
c = 2048
n-gpu-layers = 40
rope-scaling = linear
```
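A section like the one above can be turned into launch arguments mechanically. A sketch assuming each `models.ini` key maps to a `llama-server` flag of the same name (single-letter keys become short flags); the authoritative flag set is defined by your llama.cpp build, so `section_to_args` here is illustrative:

```python
# Sketch: build a llama-server argument list from one models.ini section.
# Assumes keys map 1:1 to CLI flags; verify against your llama.cpp build.
from configparser import ConfigParser

def section_to_args(cfg: ConfigParser, name: str) -> list[str]:
    """Turn a [model] section into a flat list of CLI flags and values."""
    args = []
    for key, value in cfg[name].items():
        flag = f"--{key}" if len(key) > 1 else f"-{key}"
        args += [flag, value]
    return args

cfg = ConfigParser()
cfg.read_string("""
[Qwen3VL-8B-Instruct-Q4_K_M]
model = ./models/llm/Qwen3VL-8B-Instruct-Q4_K_M.gguf
c = 2048
n-gpu-layers = 40
""")
print(section_to_args(cfg, "Qwen3VL-8B-Instruct-Q4_K_M"))
```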

### 4. Configure service environments in `docker-compose.yml`

Example:

```yaml
  llm:
    environment:
      MODEL_NAME: Qwen3VL-8B-Instruct-Q4_K_M
```

### 5. Run the containers

```shell
docker compose up
```

### 6. Endpoints

...
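While the endpoint list is not filled in above, llama.cpp's server exposes an OpenAI-compatible chat endpoint, so a client call against the LLM service would look roughly like this. The host, port, and model name are placeholders, not the project's documented values:

```python
# Sketch of a chat request to the LLM service, assuming it exposes the
# OpenAI-compatible /v1/chat/completions endpoint that llama.cpp's
# server provides. URL and model name are placeholder assumptions.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Payload for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = build_chat_request("Qwen3VL-8B-Instruct-Q4_K_M", "Hello")
print(payload)

# Uncomment against a running stack (placeholder host/port):
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```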


## Development Roadmap

### 1. Foundation and infrastructure

| Status | Description |
| --- | --- |
| Done | GPU-accelerated llama.cpp server - Base inference server providing a high-performance API for GPU-accelerated inference of local models; Docker image for rapid deployment |
| Pending | Helm / Kubernetes deployment - Orchestration in a production environment: automated deployment, scaling, and management of all system components in a Kubernetes cluster via Helm charts |
| Done | Services Registry - Component version management; centralized services registry |
| Done | Model Registry - Centralized model registry (versions, quantizations, contexts, GPU requirements) |
| Pending | Artifact Registry - Storage of prompt templates, tool schemas, and policies |

### 2. System core

| Status | Description | Layer |
| --- | --- | --- |
| In Progress | Prompt Engine - Prompt creation and optimization: transforming user queries into effective instructions for the model, managing templates and dynamic context | Core |
| Pending | Context Builder - Dialog context management: collecting, maintaining, and clearing dialogue history and system instructions in time to keep responses relevant | Core |
| Pending | Semantic chunker - Splitting long documents into chunks for RAG | Core |
| Pending | Token budgets - Resource management: keeping prompt and response length within model limits, optimizing costs, and preventing context overflow | Core |
| Pending | Safety rules - Safety and ethics control: filtering incoming and outgoing data, preventing the generation of malicious or unwanted content | Core |
| Pending | Router - Service manager: receiving a task from Core and forwarding it to the appropriate specialized service or agent | Services |
| Pending | Multi-model orchestration - Coordination of multiple models: dynamic model loading/unloading, request distribution between specialized models, load balancing | Services (Router) |
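The Router item above reduces to "inspect a task, pick a service." A toy sketch of that idea; the service names and the keyword heuristic are illustrative, not the project's actual routing logic:

```python
# Minimal keyword router: map a task to a specialized service,
# falling back to the main LLM. Purely illustrative routing table.
ROUTES = {
    "image": "vision",
    "remember": "memory",
    "search": "rag",
}

def route(task: str) -> str:
    """Pick a service name for a task; default to the main LLM."""
    lowered = task.lower()
    for keyword, service in ROUTES.items():
        if keyword in lowered:
            return service
    return "llm"

print(route("Describe this image"))  # routed to the vision service
```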

### 3. Services and tools

| Status | Description | Layer |
| --- | --- | --- |
| Pending | LLM - Main text engine: generation and processing of text responses, completion of dialogues | Services |
| Pending | Tools - Execution mechanism: providing APIs for calling external functions (search, calculations, third-party service APIs) and managing their lifecycle | Services |
| Pending | Tool calling - Execution of actions: recognizing when a tool is needed, calling it with the correct parameters, and processing the results | Services |
| Pending | Memory - Long-term semantic memory: storage and retrieval of information based on semantic similarity (including dialogue history and facts) | Services |
| Pending | RAG - Knowledge access system: extracting relevant documents from the knowledge base to enrich the model context with up-to-date information | Services |
| Pending | Memory policy - Memory management rules: strategy for storing, retrieving, and forgetting information in the system's short-term and long-term memory | Core |
| Pending | Vision - Working with visual content: image analysis, description generation, image creation on demand | Services |
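The RAG and Memory items both rest on retrieval by semantic similarity. A toy sketch of that retrieval step; a real deployment would use a vector store and a learned embedding model, so the bag-of-words `embed` here is only a stand-in:

```python
# Toy RAG retrieval: rank stored chunks by cosine similarity to the
# query. embed() is a stand-in bag-of-words embedding, not a real model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the top-k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = ["reserve a room via the api", "the gpu runs the llm service"]
print(retrieve("how do i reserve a room", chunks))
```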

### 4. Advanced behavior (Agents & Logic)

| Status | Description | Layer |
| --- | --- | --- |
| Pending | Reasoning loops - In-depth analysis: organizing chains of reasoning and iterative (chain-of-thought) thinking for complex tasks | Core |
| Pending | Routing logic - Intelligent task distribution: analyzing the request and determining the optimal way to process it (selecting a model, service, or agent) | Core |
| Pending | Planner Agent - Strategist and task decomposer: breaking a complex goal into a sequence of steps and coordinating the work of other agents or services | Agents |
| Pending | Vision Agent - Visual tasks specialist: coordinating the Vision service and other components to solve complex image-related tasks (e.g., "describe, then generate something similar") | Agents |
| Pending | System Agent - Administrator and monitor: system status monitoring, logging, error handling, ensuring stable operation | Agents |
| Pending | Critic Agent - Self-reflection: inference quality control | Agents |
| Pending | DevAgent - Development agent: automating development tasks such as writing, analyzing, and refactoring code, debugging, and working with documentation | Agents |

### 5. Scaling and Access (Scale & API)

| Status | Description | Layer |
| --- | --- | --- |
| Pending | REST/WebSocket - Client interface: processing incoming HTTP/WebSocket requests, routing, providing stable endpoints | API |
| Pending | Streaming - Streaming of responses: progressive delivery of generated text (SSE, WebSocket) to improve UX and perceived speed | API |
| Pending | User session - Session management: creating and maintaining an isolated interaction context for each user, communicating with Memory | API |
| Pending | Authorization - Access control: authentication and authorization of users/services, management of access rights to functions and data | API |
| Pending | Rate limit - Overload protection: limiting the number of requests per user/client to ensure fair use and service stability | API |
| Pending | Multi-GPU scaling - Horizontal scaling: distributing inference of a single large model, or multiple queries, across several GPUs to increase throughput and reduce latency | Core |
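The Rate limit item above is commonly implemented as a token bucket: each client holds up to `capacity` tokens, refilled at `rate` per second, and a request passes only if a token is available. A sketch with illustrative parameters, not the project's actual limiter:

```python
# Token-bucket rate limiter sketch: capacity tokens, refilled at
# `rate` tokens per second; allow() spends one token per request.
import time

class TokenBucket:
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill by elapsed time, then try to spend one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=2, rate=0.5)
print([bucket.allow() for _ in range(3)])  # first two pass, third is throttled
```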
