Basic cluster behavior
- Start a balancer
- Start agents connected to the balancer
- Allow to customize the number of slots in agents
- Ensure the balancer distributes the requests among agents correctly
- Ensure agents process the requests
- Allow for agent naming
- Allow to persist balancer configuration across restarts (state database as a file)
- Shut down the balancer and agents cleanly when stopped
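One plausible slot-based distribution strategy can be sketched as follows. This is only an illustration of the behavior described above; the actual balancer may select agents differently, and `Agent` / `pick_agent` are illustrative names, not Paddler's API:

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Illustrative view of an agent as the balancer might track it."""
    name: str
    slots_total: int
    slots_in_use: int

    @property
    def slots_free(self) -> int:
        return self.slots_total - self.slots_in_use


def pick_agent(agents):
    """Route a request to the agent with the most free slots.

    Returns None when every slot on every agent is busy, which is
    where request buffering (see below) would kick in.
    """
    candidates = [a for a in agents if a.slots_free > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.slots_free)
```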
Handle the load, scale a cluster
- Buffer the incoming requests when applicable (all slots in all agents busy)
- Allow to customize the request buffer (maximum number of requests)
- Allow to customize the request buffer (maximum time the requests can spend in the buffer)
- Be able to add / remove agents connected to a balancer
- Be able to scale from zero agent instances
- Allow to customize the inference timeout
- Return an error to the client when an agent disconnects mid-request
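The buffering rules above (a bounded number of requests, a bounded time spent waiting) can be sketched as a small in-memory structure. This illustrates the semantics only and is not Paddler's implementation:

```python
import time
from collections import deque


class RequestBuffer:
    """Bounded FIFO: rejects pushes when full, drops entries older than max_wait_s."""

    def __init__(self, max_size: int, max_wait_s: float):
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self._items = deque()  # (enqueued_at, request)

    def push(self, request, now=None) -> bool:
        now = time.monotonic() if now is None else now
        self.expire(now)
        if len(self._items) >= self.max_size:
            return False  # buffer full: caller returns an error to the client
        self._items.append((now, request))
        return True

    def expire(self, now=None):
        now = time.monotonic() if now is None else now
        while self._items and now - self._items[0][0] > self.max_wait_s:
            self._items.popleft()  # timed out waiting for a free slot

    def pop(self, now=None):
        now = time.monotonic() if now is None else now
        self.expire(now)
        return self._items.popleft()[1] if self._items else None
```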
Load and manage a model
- Load a model from Hugging Face
- Load a model from local file path
- Allow for specifying pooling type (for embeddings only)
- Allow for swapping models without restarting the balancer or agents
- Show model's metadata
Generate tokens
- Generate tokens from conversation history
- Generate tokens from raw prompt
- Stream generated tokens back to the client in real time
- Support the thinking mode
- Respect the `max_tokens` parameter
Generate embeddings
- Ensure embeddings are generated only when enabled
- Generate embeddings from batches of input documents
- Stream embedding results back to the client in real time
- Preserve document IDs in embedding results so the user can match them to the input
- Automatically split large batches into smaller chunks to fit agent capacity
- Allow for specifying normalization method
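Splitting a large batch while preserving document IDs (two of the items above) can be sketched as a pure function; chunk size and the `(doc_id, text)` pairing are illustrative assumptions, not Paddler's internal representation:

```python
def chunk_batch(documents, agent_capacity):
    """Split [(doc_id, text), ...] into chunks of at most agent_capacity items.

    Each document's ID stays attached to its text, so embedding results
    coming back from different agents can be matched to the input.
    """
    if agent_capacity < 1:
        raise ValueError("agent_capacity must be >= 1")
    return [documents[i:i + agent_capacity]
            for i in range(0, len(documents), agent_capacity)]
```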
Multimodal support
- Load a mmproj from Hugging Face
- Load a mmproj from local file path
- Generate tokens from conversation history that includes images
Use function calling
- Ensure the `tools` parameter is optional
- Allow for adding functions in the `tools` parameter
- Validate the tools schema
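A minimal sketch of what validating one `tools` entry could look like, assuming the widely used OpenAI-style function-tool shape; the exact checks Paddler performs are not specified here:

```python
def validate_tool(tool: dict) -> bool:
    """Check one entry of the optional `tools` parameter against the
    OpenAI-style function-tool shape (illustrative checks only)."""
    if tool.get("type") != "function":
        return False
    fn = tool.get("function")
    if not isinstance(fn, dict) or not isinstance(fn.get("name"), str):
        return False
    # "parameters" is a JSON Schema object describing the function arguments
    params = fn.get("parameters", {"type": "object"})
    return isinstance(params, dict) and params.get("type") == "object"


# Hypothetical example tool, for illustration only
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
```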
Control response quality
- Customize the inference parameters
- Apply chat template from the model if no override is provided
- Allow to override the model's chat template
Monitor cluster's metrics
- Expose metrics for slots in use, total slots, and buffered requests (Prometheus format)
- Allow for optionally enabling StatsD when starting the balancer
- Push the metrics (slots in use, total slots, and buffered requests) to a StatsD server
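The three metrics named above can be rendered as StatsD gauge lines (`<name>:<value>|g`, the standard StatsD datagram format). The `paddler.*` metric names are assumptions for illustration, not necessarily the names the balancer emits:

```python
def statsd_gauges(slots_in_use: int, slots_total: int, buffered: int) -> list[str]:
    """Render the cluster metrics as StatsD gauge lines, one datagram per line.

    Metric names are illustrative; the real balancer may use different ones.
    """
    return [
        f"paddler.slots_in_use:{slots_in_use}|g",
        f"paddler.slots_total:{slots_total}|g",
        f"paddler.buffered_requests:{buffered}|g",
    ]
```

Each line would typically be sent as a UDP datagram to the StatsD server's host and port.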
Web admin panel
- Allow for optionally enabling Web Admin Panel when starting the balancer
- Dashboard view presents the setup in real time (buffered requests, agents being added/removed, slots occupancy, model downloading)
- Model view allows for model and chat template management
- Model's metadata and chat template can be viewed
- Prompt section allows for inference testing (token generation only and the history of the conversation is not supported at the moment)
Use OpenAI-compatible API
- Allow for optionally enabling OpenAI compatibility endpoint when starting the balancer
- Ensure the `/v1/chat/completions` endpoint is supported (requests are translated to OpenAI's format and then responses are translated back to Paddler's format)
CORS configuration
- Allow to configure CORS allowed origins for inference service
- Allow to configure CORS allowed origins for management service
Monitor cluster's health
- Report agent issues to the user (e.g. model failed to load, file path does not exist, chat template missing)
- Expose a health check endpoint for inference service
- Expose a health check endpoint for management service
- Expose a health check endpoint for OpenAI compatibility service
Unhappy paths
Base model input
- invalid or non-existent model references (wrong repo, filename, revision, or local path)
- files that are not valid GGUF
Multimodal projection input
- invalid or non-existent multimodal projection references
- files that are not valid mmproj models
Image analysis
- images sent to a text-only model (no mmproj loaded)
- invalid or unsupported image data (malformed data URI, invalid base64, unconvertible format, remote URLs)
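The image-input failure cases above (remote URLs, malformed data URIs, invalid base64) can be sketched as a validation routine. The accepted media types and error messages are illustrative assumptions, not Paddler's actual checks:

```python
import base64
import binascii

# Illustrative set; the real service may accept a different list
SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/webp"}


def decode_image_data_uri(uri: str) -> bytes:
    """Reject remote URLs and malformed data URIs; return the decoded bytes."""
    if uri.startswith(("http://", "https://")):
        raise ValueError("remote image URLs are not accepted")
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, _, payload = uri.partition(",")
    if not payload or not header.endswith(";base64"):
        raise ValueError("malformed data URI")
    media_type = header[len("data:"):-len(";base64")]
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported media type: {media_type!r}")
    try:
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("invalid base64 payload") from exc
```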
Code coverage
- implement code coverage tool