[Feature] Mac native backend support with vLLM-Metal, MLX, and llama.cpp #8
- Add official MLX logo (light/dark variants)
- Fix backend dropdown logo alignment with centered icons
- Fix model dropdown tag alignment with fixed width
Pull request overview
This pull request adds comprehensive Mac native backend support for LMStack, enabling Apple Silicon GPU acceleration through vLLM-Metal, MLX, and llama.cpp backends. The implementation includes automatic dependency installation, model format conversion, and deployment progress tracking.
Changes:
- Added native Mac backend support for vLLM-Metal, MLX-LM, and llama.cpp with automatic installation
- Implemented automatic model conversion from HuggingFace to MLX/GGUF formats with caching
- Added deployment progress logging visible in the web UI for native deployments
- Enhanced Apple Silicon GPU detection and worker capability reporting
- Updated frontend UI to support new backends with format compatibility indicators
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| worker/routes/converter.py | New API routes for model format conversion (MLX/GGUF) |
| worker/routes/__init__.py | Added converter router to available routes |
| worker/native_ops/process_manager.py | Extended process manager with vLLM-Metal support, auto-install for MLX/llama.cpp, and log file management |
| worker/native_ops/converter.py | New model converter module handling HuggingFace to MLX/GGUF conversions with caching |
| worker/native_ops/__init__.py | Exported ModelConverter class |
| worker/docker_ops/gpu.py | Added Apple Silicon GPU detection via system_profiler |
| worker/agent.py | Integrated converter routes and auto-start Ollama with external access |
| frontend/src/pages/Deployments.tsx | Updated UI to show MLX/llama.cpp/vLLM-Metal backends for Mac workers with format indicators |
| frontend/src/components/logos/index.tsx | Added official MLX logo and improved llama.cpp branding |
| frontend/src/components/ModelFormatCompatibility.tsx | New component showing format compatibility and conversion warnings |
| frontend/src/components/ModelCompatibilityCheck.tsx | Extended to support new backend types |
| frontend/src/components/HuggingFaceModelPicker.tsx | Added format filtering (All/MLX Ready/GGUF Ready) |
| frontend/src/components/DeploymentAdvancedForm.tsx | Excluded MLX/llama.cpp from advanced settings |
| frontend/src/assets/mlx-logo-dark.png | Added MLX logo asset |
| frontend/src/api/index.ts | Exported ModelFormatInfo type |
| frontend/src/api/huggingface.ts | Added format info and MLX/GGUF search endpoints |
| backend/app/services/local_worker.py | Auto-start Ollama with external access on Mac before Docker worker |
| backend/app/services/deployer/service.py | Updated to use native deployment for vLLM on Mac (vLLM-Metal) |
| backend/app/services/deployer/native.py | Added conversion detection and early container_id assignment for log streaming |
| backend/app/models/worker.py | Updated to always show vLLM/MLX/llama.cpp as available on Mac (auto-installable) |
| backend/app/api/huggingface.py | Added format info, MLX search, and GGUF search endpoints |
Comments suppressed due to low confidence (1)
backend/app/services/local_worker.py:319
- This import of module os is redundant, as it was previously imported on line 7.
import os
    def _write_log(self, process_id: str, message: str) -> None:
        """Write a message to a process's log file."""
        log_file = self._log_dir / f"{process_id}.log"
        with open(log_file, "a") as f:
            from datetime import datetime

            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            f.write(f"[{timestamp}] {message}\n")
            f.flush()
The _write_log method opens the log file in append mode without any size limits or rotation. For long-running deployments with verbose output, this could lead to unbounded log file growth. Consider implementing log rotation or size limits to prevent disk space issues.
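One way to bound log growth is the standard library's `RotatingFileHandler`. The sketch below is only an illustration of that approach: the helper name `make_process_logger`, the logger namespace, and the size and backup defaults are assumptions, not part of this PR.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_process_logger(
    log_dir: Path,
    process_id: str,
    max_bytes: int = 5 * 1024 * 1024,  # hypothetical default: 5 MiB per file
    backup_count: int = 3,             # hypothetical default: keep 3 old files
) -> logging.Logger:
    """Return a logger writing to <log_dir>/<process_id>.log with size-based rotation."""
    logger = logging.getLogger(f"native_process.{process_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep process output out of the root logger
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            log_dir / f"{process_id}.log",
            maxBytes=max_bytes,
            backupCount=backup_count,
        )
        # Same "[YYYY-MM-DD HH:MM:SS] message" layout that _write_log produces.
        handler.setFormatter(
            logging.Formatter("[%(asctime)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
        )
        logger.addHandler(handler)
    return logger
```

With this in place, `_write_log` could delegate to `make_process_logger(self._log_dir, process_id).info(message)`. Disk usage per process would then be capped at roughly `max_bytes * (backup_count + 1)`.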
    stderr = await create_venv.stderr.read()
    raise RuntimeError(f"Failed to create virtual environment: {stderr.decode()}")
The stderr output is read after the subprocess completes, but if stderr is None (which is the case when the subprocess was created without stderr=asyncio.subprocess.PIPE), calling .read() will raise an AttributeError. Add a None check before reading stderr to handle this edge case gracefully.
Suggested change:

    - stderr = await create_venv.stderr.read()
    - raise RuntimeError(f"Failed to create virtual environment: {stderr.decode()}")
    + stderr_data = b""
    + if create_venv.stderr is not None:
    +     stderr_data = await create_venv.stderr.read()
    + error_msg = stderr_data.decode() if stderr_data else "Unknown error"
    + raise RuntimeError(f"Failed to create virtual environment: {error_msg}")
    def ensure_ollama_running_on_host(host: str = "0.0.0.0", port: int = OLLAMA_DEFAULT_PORT) -> bool:
        """Ensure Ollama is running on the host with external access enabled.

        This is called BEFORE starting Docker worker so that the container
        can access Ollama on the host via localhost (with --network host).

        Args:
            host: Host to bind to (default 0.0.0.0 for external access)
            port: Port to bind to (default 11434)

        Returns:
            True if Ollama is running and accessible
        """
        # Only run on macOS
        if platform.system() != "Darwin":
            return True  # Not needed on Linux (Docker can use GPU directly)

        # Check if Ollama is installed
        ollama_path = shutil.which("ollama")
        if not ollama_path:
            logger.info("Ollama is not installed on this Mac")
            return False

        # Check if Ollama is already running
        try:
            import httpx

            with httpx.Client(timeout=2.0) as client:
                response = client.get(f"http://localhost:{port}/api/tags")
                if response.status_code == 200:
                    logger.info("Ollama service is already running")
                    return True
        except Exception:
            pass

        # Ollama not running, start it with external access
        logger.info(f"Starting Ollama service on {host}:{port}")

        env = os.environ.copy()
        env["OLLAMA_HOST"] = f"{host}:{port}"

        try:
            # Start ollama serve in background
            process = subprocess.Popen(
                [ollama_path, "serve"],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                env=env,
                start_new_session=True,
            )
            logger.info(f"Started Ollama service (PID {process.pid})")

            # Wait for Ollama to be ready (up to 30 seconds)
            import httpx

            for _ in range(30):
                time.sleep(1)
                try:
                    with httpx.Client(timeout=2.0) as client:
                        response = client.get(f"http://localhost:{port}/api/tags")
                        if response.status_code == 200:
                            logger.info("Ollama service is ready")
                            return True
                except Exception:
                    pass

            logger.error("Ollama service failed to start in time")
            return False

        except Exception as e:
            logger.error(f"Failed to start Ollama service: {e}")
            return False
The ensure_ollama_running_on_host function in the backend service is nearly identical to the ensure_ollama_running method in the worker's process manager. This code duplication means any bug fixes or improvements need to be applied in two places. Consider refactoring this into a shared utility module that both the backend and worker can use.
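As a rough sketch of what the shared piece might look like, the readiness-polling loop is the easiest part to factor out. The function name `wait_until_ready` and its parameters are hypothetical. Unlike the loop in this PR, which sleeps before each check, this version probes first, so a service that is already up returns immediately.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, pausing `delay` seconds between calls.

    Returns True as soon as `probe` returns True, otherwise False.
    `sleep` is injectable so callers (and tests) can avoid real waiting.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no need to sleep after the final attempt
            sleep(delay)
    return False
```

Both the backend function and the worker's process manager could then pass their own probe (for example, a function that requests `/api/tags` and checks for a 200) and keep only the platform-specific start-up logic locally.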
    @staticmethod
    def find_mlx_variant(hf_model_id: str) -> Optional[str]:
        """Find MLX variant of a HuggingFace model.

        Searches mlx-community for a converted version of the model.

        Args:
            hf_model_id: Original HuggingFace model ID

        Returns:
            MLX model ID if found, None otherwise
        """
        # Try common naming patterns
        model_name = hf_model_id.split("/")[-1]
        patterns = [
            f"mlx-community/{model_name}",
            f"mlx-community/{model_name}-mlx",
            f"mlx-community/{model_name}-4bit",
            f"mlx-community/{model_name}-8bit",
        ]
        return patterns[0] if patterns else None
The find_mlx_variant static method returns the first pattern if the patterns list is not empty, otherwise None. Because patterns is a non-empty literal, the method always returns the first candidate and the other three are never used. The logic also never checks whether these variant models exist on HuggingFace; it only constructs potential model IDs, which contradicts the docstring's "MLX model ID if found, None otherwise". The caller should verify the existence of these models before using them. Consider documenting this behavior or adding actual existence checks.
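A minimal sketch of the existence-check variant, assuming the lookup is passed in as a callable: `repo_exists` is a hypothetical parameter here (in practice it might wrap a Hugging Face Hub lookup), and `candidate_mlx_ids` is an illustrative helper name, not code from this PR.

```python
from typing import Callable, List, Optional


def candidate_mlx_ids(hf_model_id: str) -> List[str]:
    """Build the candidate mlx-community IDs, in the same order as the PR."""
    model_name = hf_model_id.split("/")[-1]
    return [
        f"mlx-community/{model_name}",
        f"mlx-community/{model_name}-mlx",
        f"mlx-community/{model_name}-4bit",
        f"mlx-community/{model_name}-8bit",
    ]


def find_mlx_variant(
    hf_model_id: str,
    repo_exists: Callable[[str], bool],  # hypothetical: True if the repo ID exists
) -> Optional[str]:
    """Return the first candidate that actually exists, or None if none do."""
    for candidate in candidate_mlx_ids(hf_model_id):
        if repo_exists(candidate):
            return candidate
    return None
```

Injecting the lookup keeps the naming logic testable offline and makes the return value match the docstring: a model ID only when one was found.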
            f"Download failed: {stdout.decode() if stdout else 'Unknown error'}"
        )
    model_dir = self.cache_dir / "downloads" / hf_model_id.replace("/", "--")
The GGUF conversion process downloads the model using huggingface-cli but doesn't validate that the download was successful before attempting conversion. The code checks the return code of the download process (line 370), but if the model directory already exists (line 356), it skips the download and uses the existing directory. This could lead to conversion failures if the existing directory is incomplete or corrupted. Consider validating that the necessary model files exist before proceeding with conversion.
Suggested change:

    # Validate that the model directory contains the necessary files
    config_file = model_dir / "config.json"
    has_safetensors = any(model_dir.glob("*.safetensors"))
    has_bin_weights = any(model_dir.glob("*.bin"))
    if (
        not model_dir.exists()
        or not model_dir.is_dir()
        or not config_file.exists()
        or not (has_safetensors or has_bin_weights)
    ):
        raise RuntimeError(
            f"Model directory '{model_dir}' is missing required files. "
            "Please clear the Hugging Face cache or downloads and retry."
        )
        raise RuntimeError(
            "llama-server not found. " "Please install llama.cpp: brew install llama.cpp"
        )
    effective_model_path = model_id
This assignment to 'effective_model_path' is unnecessary as it is redefined before this value is used.
            if response.status_code == 200:
                logger.info("Ollama service is already running")
                return True
    except Exception:

'except' clause does nothing but pass and there is no explanatory comment.

            if response.status_code == 200:
                logger.info("Ollama service is ready")
                return True
    except Exception:

'except' clause does nothing but pass and there is no explanatory comment.

            if response.status_code == 200:
                logger.info("Ollama service is already running")
                return True
    except Exception:

'except' clause does nothing but pass and there is no explanatory comment.

            if response.status_code == 200:
                logger.info("Ollama service is ready")
                return True
    except Exception:

'except' clause does nothing but pass and there is no explanatory comment.
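One possible way to address the empty except clauses flagged above is to catch only connection-level errors, say why they are ignored, and log them at debug level. The sketch below uses `urllib` from the standard library so that it has no dependencies; the PR itself uses `httpx`, and the function name `service_is_ready` is an assumption for illustration.

```python
import logging
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


def service_is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if `url` answers with HTTP 200, False if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError) as exc:
        # Expected while the service is stopped or still starting up:
        # treat it as "not ready" instead of failing the caller.
        logger.debug("Readiness probe for %s failed: %s", url, exc)
        return False
```

Narrowing the exception types means programming errors (a malformed URL, a typo in an attribute name) still surface instead of being swallowed as "service not running".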
Summary