After running python3 run.py, the user is presented with three options:
1) Offline AI Chat
2) System Analyzer
3) Exit
This scenario is intended for local execution of a GGUF model through llama.cpp followed by interactive communication with the model in the console.
The project reads _work-models/catalog.json and displays the list of models.
For each model, the following information is shown:
- name;
- short description;
max_tokens;- approximate host requirements;
- source;
- GGUF layout type: single-file or sharded.
After a model is selected, the project checks the _work-models/models/<model_key>/ directory.
Workflow logic:
- if the file already exists, its SHA256 is calculated and compared against the catalog;
- if the file is missing, it is downloaded from the URL specified in the catalog;
- after download, the file is verified again using SHA256;
- if the checksum does not match, execution stops with an error.
As a result, the project does not use unverified local GGUF files.
To build the image, a temporary build context is created inside .runtime/offline-ai-chat/.
Implementation details:
- the Dockerfile is taken from
apps/offline_llm_chat/docker/Dockerfile; - model files are added to the build context via hardlinks rather than by copying;
- the image is built with the following tags:
offline-ai-chat-llm:latest;offline-ai-chat-llm:<model_key>.
After the image is built, the project starts the container with llama.cpp and waits for the Unix socket to appear.
The runtime uses parameters from .env:
CTX_SIZE;MEM_LIMIT;CPU_LIMIT;PIDS_LIMIT;DEBUG_LOGS.
Once the socket appears, the chat is not opened immediately. Two checks are performed first:
- polling
/v1/modelsfor API readiness; - a smoke test with a short test request.
Only after these checks complete successfully does the project start the user chat.
In interactive mode, the following commands are supported:
/exit— exit the chat;/reset— reset the current message history.
In addition, the project outputs service statistics for each response:
- token count;
- processing duration;
- prompt/generation speed;
- cumulative token counter since the start of the session.
This scenario is intended for preliminary evaluation of Linux host resources and selection of container startup parameters.
Based on system data, the project collects information about:
- OS version;
- CPU and the number of logical cores;
- RAM and swap size;
- root filesystem;
- physical disks;
- GPU, if supported detection tools are available.
Based on the analysis results, the project generates startup profiles and suggests values for .env:
CTX_SIZE;MEM_LIMIT;CPU_LIMIT;PIDS_LIMIT.
After user confirmation, the values are written to the root .env file.
The project performs a strictly defined task and does not include additional operating modes.
In the current implementation, the project:
- does not mount a user repository into the runtime container;
- does not modify external source code;
- does not launch a web application;
- does not expose a REST API externally;
- does not maintain a separate file-based report storage;
- does not manage multiple containers simultaneously.
Baseline workflow:
Run run.py
-> System Analyzer
-> profile evaluation and .env update
-> return to the main menu
-> Offline AI Chat
-> model selection
-> GGUF verification or download
-> Docker build
-> Docker run
-> smoke test
-> interactive CLI chat