
# Room Reserve AI System

A local, distributed, multimodal AI system composed of specialized microservices, with RAG, long-term memory, routing, and more.

## Architectural schema

| Layer | Responsibility |
| --- | --- |
| Compute Layer | llama.cpp (GPU) |
| Core OS Layer | Prompt / Context / Policy / Router |
| Runtime Layer | Tools / Memory / RAG / Vision |
| Cognitive Layer | Agents / Planner / Critic / Verifier |
| Access Layer | API / Billing / Auth / Scale |

## Hardware load distribution

Designed for low-VRAM systems (8 GB GPU).

| Component | Device |
| --- | --- |
| LLM service | GPU |
| All other services | CPU |
| Memory / vectors | RAM / NVMe |
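Under Docker Compose, this GPU/CPU split can be expressed per service. A minimal sketch, assuming the compose file defines services matching the table above (the `memory` service name and the resource values here are illustrative, not the project's actual configuration):

```yaml
services:
  llm:
    image: llama-engine
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]   # pin the LLM service to the GPU
  memory:
    image: memory-service           # illustrative name; runs CPU-only
    cpus: "2.0"
```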

## Guide

### 0. Sync uv packages

```shell
uv sync --all-packages
```

### 1. Build the llama.cpp Docker image

```shell
cd engines/llama-engine
docker build -t llama-engine .
```

### 2. Place `.gguf` model files in `models/`

### 3. Configure models in `models/models.ini`

Example:

```ini
[Qwen3VL-8B-Instruct-Q4_K_M]
model = ./models/llm/Qwen3VL-8B-Instruct-Q4_K_M.gguf
c = 2048
n-gpu-layers = 40
rope-scaling = linear
```
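A section like the one above can be turned into launch arguments mechanically. A sketch assuming each `models.ini` key maps to a `llama-server` flag of the same name (single-letter keys become short flags); the authoritative flag set is defined by your llama.cpp build, so `section_to_args` here is illustrative:

```python
# Sketch: build a llama-server argument list from one models.ini section.
# Assumes keys map 1:1 to CLI flags; verify against your llama.cpp build.
from configparser import ConfigParser

def section_to_args(cfg: ConfigParser, name: str) -> list[str]:
    """Turn a [model] section into a flat list of CLI flags and values."""
    args = []
    for key, value in cfg[name].items():
        flag = f"--{key}" if len(key) > 1 else f"-{key}"
        args += [flag, value]
    return args

cfg = ConfigParser()
cfg.read_string("""
[Qwen3VL-8B-Instruct-Q4_K_M]
model = ./models/llm/Qwen3VL-8B-Instruct-Q4_K_M.gguf
c = 2048
n-gpu-layers = 40
""")
print(section_to_args(cfg, "Qwen3VL-8B-Instruct-Q4_K_M"))
```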

### 4. Configure service environments in `docker-compose.yml`

Example:

```yaml
  llm:
    environment:
      MODEL_NAME: Qwen3VL-8B-Instruct-Q4_K_M
```

### 5. Run the containers

```shell
docker compose up
```

### 6. Endpoints

...
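While the endpoint list is not filled in above, llama.cpp's server exposes an OpenAI-compatible chat endpoint, so a client call against the LLM service would look roughly like this. The host, port, and model name are placeholders, not the project's documented values:

```python
# Sketch of a chat request to the LLM service, assuming it exposes the
# OpenAI-compatible /v1/chat/completions endpoint that llama.cpp's
# server provides. URL and model name are placeholder assumptions.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Payload for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = build_chat_request("Qwen3VL-8B-Instruct-Q4_K_M", "Hello")
print(payload)

# Uncomment against a running stack (placeholder host/port):
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```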


## Development Roadmap

### 1. Foundation and infrastructure

| Status | Description |
| --- | --- |
| Done | GPU-accelerated llama.cpp server - Base inference server providing a high-performance API for GPU-accelerated inference of local models; Docker image for rapid deployment |
| Pending | Helm / Kubernetes deployment - Orchestration in a production environment: automated deployment, scaling, and management of all system components in a Kubernetes cluster via Helm charts |
| Done | Services Registry - Component version management; centralized services registry |
| Done | Model Registry - Centralized model registry (versions, quantizations, contexts, GPU requirements) |
| Pending | Artifact Registry - Storage of prompt templates, tool schemas, and policies |

### 2. System core

| Status | Description | Layer |
| --- | --- | --- |
| In Progress | Prompt Engine - Prompt creation and optimization: transforming user queries into effective instructions for the model, managing templates and dynamic context | Core |
| Pending | Context Builder - Dialog context management: collecting, maintaining, and clearing dialogue history and system instructions in time to keep responses relevant | Core |
| Pending | Semantic chunker - Splitting long documents into chunks for RAG | Core |
| Pending | Token budgets - Resource management: keeping prompt and response length within model limits, optimizing costs, and preventing context overflow | Core |
| Pending | Safety rules - Safety and ethics control: filtering incoming and outgoing data, preventing the generation of malicious or unwanted content | Core |
| Pending | Router - Service manager: receiving a task from Core and forwarding it to the appropriate specialized service or agent | Services |
| Pending | Multi-model orchestration - Coordination of multiple models: dynamic model loading/unloading, request distribution between specialized models, load balancing | Services (Router) |
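The Router item above reduces to "inspect a task, pick a service." A toy sketch of that idea; the service names and the keyword heuristic are illustrative, not the project's actual routing logic:

```python
# Minimal keyword router: map a task to a specialized service,
# falling back to the main LLM. Purely illustrative routing table.
ROUTES = {
    "image": "vision",
    "remember": "memory",
    "search": "rag",
}

def route(task: str) -> str:
    """Pick a service name for a task; default to the main LLM."""
    lowered = task.lower()
    for keyword, service in ROUTES.items():
        if keyword in lowered:
            return service
    return "llm"

print(route("Describe this image"))  # routed to the vision service
```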

### 3. Services and tools

| Status | Description | Layer |
| --- | --- | --- |
| Pending | LLM - Main text engine: generation and processing of text responses, completion of dialogues | Services |
| Pending | Tools - Execution mechanism: providing APIs for calling external functions (search, calculations, third-party service APIs) and managing their lifecycle | Services |
| Pending | Tool calling - Execution of actions: recognizing when a tool is needed, calling it with the correct parameters, and processing the results | Services |
| Pending | Memory - Long-term semantic memory: storage and retrieval of information based on semantic similarity (including dialogue history and facts) | Services |
| Pending | RAG - Knowledge access system: extracting relevant documents from the knowledge base to enrich the model context with up-to-date information | Services |
| Pending | Memory policy - Memory management rules: strategy for storing, retrieving, and forgetting information in the system's short-term and long-term memory | Core |
| Pending | Vision - Working with visual content: image analysis, description generation, image creation on demand | Services |
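The RAG and Memory items both rest on retrieval by semantic similarity. A toy sketch of that retrieval step; a real deployment would use a vector store and a learned embedding model, so the bag-of-words `embed` here is only a stand-in:

```python
# Toy RAG retrieval: rank stored chunks by cosine similarity to the
# query. embed() is a stand-in bag-of-words embedding, not a real model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the top-k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = ["reserve a room via the api", "the gpu runs the llm service"]
print(retrieve("how do i reserve a room", chunks))
```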

### 4. Advanced behavior (Agents & Logic)

| Status | Description | Layer |
| --- | --- | --- |
| Pending | Reasoning loops - In-depth analysis: organizing chains of reasoning and iterative (chain-of-thought) thinking for complex tasks | Core |
| Pending | Routing logic - Intelligent task distribution: analyzing the request and determining the optimal way to process it (selecting a model, service, or agent) | Core |
| Pending | Planner Agent - Strategist and task decomposer: breaking a complex goal into a sequence of steps and coordinating the work of other agents or services | Agents |
| Pending | Vision Agent - Visual tasks specialist: coordinating the Vision service and other components to solve complex image-related tasks (e.g., "describe, then generate something similar") | Agents |
| Pending | System Agent - Administrator and monitor: system status monitoring, logging, error handling, ensuring stable operation | Agents |
| Pending | Critic Agent - Self-reflection: inference quality control | Agents |
| Pending | DevAgent - Development agent: automating development tasks such as writing, analyzing, and refactoring code, debugging, and working with documentation | Agents |

### 5. Scaling and Access (Scale & API)

| Status | Description | Layer |
| --- | --- | --- |
| Pending | REST/WebSocket - Client interface: processing incoming HTTP/WebSocket requests, routing, providing stable endpoints | API |
| Pending | Streaming - Streaming of responses: progressive delivery of generated text (SSE, WebSocket) to improve UX and perceived speed | API |
| Pending | User session - Session management: creating and maintaining an isolated interaction context for each user, communicating with Memory | API |
| Pending | Authorization - Access control: authentication and authorization of users/services, management of access rights to functions and data | API |
| Pending | Rate limit - Overload protection: limiting the number of requests per user/client to ensure fair use and service stability | API |
| Pending | Multi-GPU scaling - Horizontal scaling: distributing inference of a single large model, or multiple queries, across several GPUs to increase throughput and reduce latency | Core |
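The Rate limit item above is commonly implemented as a token bucket: each client holds up to `capacity` tokens, refilled at `rate` per second, and a request passes only if a token is available. A sketch with illustrative parameters, not the project's actual limiter:

```python
# Token-bucket rate limiter sketch: capacity tokens, refilled at
# `rate` tokens per second; allow() spends one token per request.
import time

class TokenBucket:
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill by elapsed time, then try to spend one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=2, rate=0.5)
print([bucket.allow() for _ in range(3)])  # first two pass, third is throttled
```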
