GPU-accelerated llama.cpp server - Base inference server. Providing a high-performance API for GPU-accelerated inference of local models. Docker image for rapid deployment;
Pending
Helm / Kubernetes deployment - Orchestration in a production environment. Automated deployment, scaling, and management of all system components in a Kubernetes cluster via Helm charts;
Done
Services Registry - Component version management. Centralized services registry;
Done
Model Registry - Centralized model registry (versions, quantizations, contexts, GPU requirements);
Pending
Artifact Registry - Storage of prompt templates, tool schemas, and policies;
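As a rough illustration of what a Model Registry record could hold, the sketch below keeps the four attributes named above (version, quantization, context, GPU requirement) in a small in-memory structure. Every class, field, and method name is hypothetical, and the in-memory dictionary is only a stand-in for whatever storage the project ends up using.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One registry record. All field names are illustrative assumptions."""
    name: str
    version: str
    quantization: str    # for example "Q4_K_M"
    context_length: int  # maximum context window, in tokens
    min_vram_gb: float   # GPU memory assumed necessary to load the model


class ModelRegistry:
    """In-memory registry keyed by (name, version, quantization)."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str, str], ModelEntry] = {}

    def register(self, entry: ModelEntry) -> None:
        key = (entry.name, entry.version, entry.quantization)
        if key in self._entries:
            raise ValueError(f"duplicate entry: {key}")
        self._entries[key] = entry

    def fitting(self, available_vram_gb: float) -> list[ModelEntry]:
        """Return entries that fit the given GPU memory, largest context first."""
        fits = [e for e in self._entries.values()
                if e.min_vram_gb <= available_vram_gb]
        return sorted(fits, key=lambda e: e.context_length, reverse=True)
```

A scheduler could call `fitting()` with the free memory of a GPU to shortlist the models it is able to load there.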
2. System core
| Status | Description | Layer |
| --- | --- | --- |
| In Progress | Prompt Engine - Prompt creation and optimization. Transforming user queries into effective instructions for the model, managing templates and dynamic context; | Core |
| Pending | Context Builder - Dialog context management. Collecting, maintaining, and timely clearing the history of dialogues and system instructions to ensure the relevance of responses; | Core |
| Pending | Semantic chunker - Document splitting. Cutting long documents into semantically coherent chunks for RAG; | Core |
| Pending | Token budgets - Resource management. Controlling prompt and response length within model limits, optimizing costs, and preventing context overflow; | Core |
| Pending | Safety rules - Safety and ethics control. Filtering incoming and outgoing data, preventing the generation of malicious or unwanted content; | Core |
| Pending | Router - Service Manager. Receiving a task from Core and forwarding it to the appropriate specialized service or agent; | Services |
| Pending | Multi-model orchestration - Coordination of multiple models. Dynamic model loading/unloading, request distribution between specialized models, load balancing; | Services (Router) |
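The Token budgets item above could be approached roughly as follows: reserve room for the response, subtract the system prompt, and keep only the newest dialogue messages that still fit. This is a minimal sketch under stated assumptions. The function names are hypothetical, and `count_tokens` is a whitespace-based stand-in; a real deployment would count with the model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def fit_to_budget(system_prompt: str, history: list[str],
                  context_limit: int, response_reserve: int) -> list[str]:
    """Return the newest history messages that fit next to the system prompt.

    context_limit    -- total tokens the model accepts
    response_reserve -- tokens held back for the generated answer
    """
    budget = context_limit - response_reserve - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the prompt budget")

    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break                      # stop at the first message that does not fit
        kept.append(message)
        budget -= cost

    kept.reverse()                     # restore chronological order
    return kept
```

Dropping the oldest messages first is only one possible policy; summarizing them instead would be a matter for the Context Builder and Memory policy items.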
3. Services and tools
| Status | Description | Layer |
| --- | --- | --- |
| Pending | LLM - Main text engine. Generation and processing of text responses, completion of dialogues; | Services |
| Pending | Tools - Executive mechanism. Providing APIs for calling external functions (search, calculations, third-party service APIs) and managing their lifecycle; | Services |
| Pending | Tool calling - Execution of actions. Recognizing the need to use tools, calling them with the correct parameters, and processing the results; | Services |
| Pending | Memory - Long-term semantic memory. Storage and retrieval of information based on semantic similarity (including dialogue history, facts); | Services |
| Pending | RAG - Knowledge access system. Extracting relevant documents from the knowledge base to enrich the model context with up-to-date information. | Services |
| Pending | Memory policy - Defining memory management rules. Strategy for storing, retrieving, and forgetting information in the system's short-term and long-term memory; | Core |
| Pending | Vision - Working with visual content. Image analysis, description generation, image creation on demand; | Services |
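To make "retrieval of information based on semantic similarity" in the Memory item more concrete, here is a small sketch that ranks stored texts by cosine similarity between embedding vectors. It assumes the embeddings are computed elsewhere (the vectors in the usage note are toy values), and `MemoryStore` with its methods is a hypothetical name, not part of the project.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Keeps (embedding, text) pairs and returns the closest texts for a query."""

    def __init__(self):
        self._items = []

    def add(self, embedding, text):
        self._items.append((list(embedding), text))

    def search(self, query_embedding, top_k=3):
        scored = [(cosine(query_embedding, emb), text)
                  for emb, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
        return [text for _, text in scored[:top_k]]
```

For example, after `store.add([1.0, 0.0], "likes tea")` and `store.add([0.0, 1.0], "lives in Oslo")`, a query vector close to the first embedding ranks "likes tea" first. A linear scan like this is only reasonable for small stores; a larger one would likely need a vector index.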
4. Advanced behavior (Agents & Logic)
| Status | Description | Layer |
| --- | --- | --- |
| Pending | Reasoning loops - Ensuring in-depth analysis. Organization of chains of reasoning and iterative thinking (chain-of-thought) for complex tasks. | Core |
| Pending | Routing logic - Intelligent task distribution. Analysis of the request and determination of the optimal way to process it (selection of a model, service, or agent) | Core |
| Pending | Planner Agent - Strategist and task decomposer. Breaking down a complex goal into a sequence of steps, coordinating the work of other agents or services; | Agents |
| Pending | Vision Agent - Visual tasks specialist. Coordination between Vision Service and other components for solving complex image-related tasks (e.g., “describe, then generate something similar”); | Agents |
| Pending | System Agent - Administrator and monitor. System status monitoring, logging, error handling, ensuring stable operation; | |
Streaming - Streaming of responses. Implementation of progressive delivery of generated text (SSE, WebSocket) to improve UX and perceived speed
API
Pending
User session - Session management. Creating and maintaining an isolated interaction context for each user, communication with Memory;
API
Pending
Authorization - Access control. Authentication and authorization of users/services, management of access rights to functions and data;
API
Pending
Rate limit - Overload protection. Limiting the number of requests from a user/client to ensure fair use and service stability;
API
Pending
Multi-GPU scaling - Horizontal scaling. Distributing the inference of a single large model or multiple queries across multiple GPUs to increase throughput and reduce latency.