[Feature] Mac native backend support with vLLM-Metal, MLX, and llama.cpp #8

Merged: ricky-chaoju merged 7 commits into main from dev on Feb 2, 2026

@ricky-chaoju
Contributor

Summary

  • Add Apple Silicon GPU detection for Mac workers
  • Add vLLM-Metal, MLX, and llama.cpp native backend support
  • Auto-install dependencies (no manual pip/brew required)
  • Auto-convert HuggingFace models to MLX/GGUF formats
  • Download pre-quantized GGUF models directly from HuggingFace
  • Add deployment progress logs visible in web UI
  • Add official MLX logo and fix dropdown alignment

Copilot AI left a comment

Pull request overview

This pull request adds comprehensive Mac native backend support for LMStack, enabling Apple Silicon GPU acceleration through vLLM-Metal, MLX, and llama.cpp backends. The implementation includes automatic dependency installation, model format conversion, and deployment progress tracking.

Changes:

  • Added native Mac backend support for vLLM-Metal, MLX-LM, and llama.cpp with automatic installation
  • Implemented automatic model conversion from HuggingFace to MLX/GGUF formats with caching
  • Added deployment progress logging visible in the web UI for native deployments
  • Enhanced Apple Silicon GPU detection and worker capability reporting
  • Updated frontend UI to support new backends with format compatibility indicators

Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 20 comments.

| File | Description |
| --- | --- |
| worker/routes/converter.py | New API routes for model format conversion (MLX/GGUF) |
| worker/routes/__init__.py | Added converter router to available routes |
| worker/native_ops/process_manager.py | Extended process manager with vLLM-Metal support, auto-install for MLX/llama.cpp, and log file management |
| worker/native_ops/converter.py | New model converter module handling HuggingFace to MLX/GGUF conversions with caching |
| worker/native_ops/__init__.py | Exported ModelConverter class |
| worker/docker_ops/gpu.py | Added Apple Silicon GPU detection via system_profiler |
| worker/agent.py | Integrated converter routes and auto-start Ollama with external access |
| frontend/src/pages/Deployments.tsx | Updated UI to show MLX/llama.cpp/vLLM-Metal backends for Mac workers with format indicators |
| frontend/src/components/logos/index.tsx | Added official MLX logo and improved llama.cpp branding |
| frontend/src/components/ModelFormatCompatibility.tsx | New component showing format compatibility and conversion warnings |
| frontend/src/components/ModelCompatibilityCheck.tsx | Extended to support new backend types |
| frontend/src/components/HuggingFaceModelPicker.tsx | Added format filtering (All/MLX Ready/GGUF Ready) |
| frontend/src/components/DeploymentAdvancedForm.tsx | Excluded MLX/llama.cpp from advanced settings |
| frontend/src/assets/mlx-logo-dark.png | Added MLX logo asset |
| frontend/src/api/index.ts | Exported ModelFormatInfo type |
| frontend/src/api/huggingface.ts | Added format info and MLX/GGUF search endpoints |
| backend/app/services/local_worker.py | Auto-start Ollama with external access on Mac before Docker worker |
| backend/app/services/deployer/service.py | Updated to use native deployment for vLLM on Mac (vLLM-Metal) |
| backend/app/services/deployer/native.py | Added conversion detection and early container_id assignment for log streaming |
| backend/app/models/worker.py | Updated to always show vLLM/MLX/llama.cpp as available on Mac (auto-installable) |
| backend/app/api/huggingface.py | Added format info, MLX search, and GGUF search endpoints |

Comments suppressed due to low confidence (1)

backend/app/services/local_worker.py:319

  • This import of module os is redundant, as it was previously imported on line 7.
    import os

Comment on lines +46 to +54
def _write_log(self, process_id: str, message: str) -> None:
    """Write a message to a process's log file."""
    log_file = self._log_dir / f"{process_id}.log"
    with open(log_file, "a") as f:
        from datetime import datetime

        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        f.write(f"[{timestamp}] {message}\n")
        f.flush()

Copilot AI Feb 2, 2026

The _write_log method opens the log file in append mode without any size limits or rotation. For long-running deployments with verbose output, this could lead to unbounded log file growth. Consider implementing log rotation or size limits to prevent disk space issues.
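One way to bound log growth is the standard library's `logging.handlers.RotatingFileHandler`. The sketch below is only an illustration of that approach, not the PR's code: the helper name `make_process_logger`, the 5 MB cap, and the backup count are all assumptions chosen for the example.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_process_logger(
    log_dir: Path,
    process_id: str,
    max_bytes: int = 5 * 1024 * 1024,  # assumed per-file size cap
    backup_count: int = 3,             # assumed number of rotated files to keep
) -> logging.Logger:
    """Return a logger writing to <log_dir>/<process_id>.log with size-based rotation."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"process.{process_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep process output out of the root logger
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = RotatingFileHandler(
            log_dir / f"{process_id}.log",
            maxBytes=max_bytes,
            backupCount=backup_count,
        )
        # Same timestamp layout as the _write_log method under review.
        handler.setFormatter(
            logging.Formatter("[%(asctime)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
        )
        logger.addHandler(handler)
    return logger
```

With this shape, total disk use per process is capped at roughly `max_bytes * (backup_count + 1)`, and the newest output is still in `<process_id>.log`, so a log-streaming reader can keep tailing the same path.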

Copilot uses AI. Check for mistakes.
Comment on lines +350 to +351
stderr = await create_venv.stderr.read()
raise RuntimeError(f"Failed to create virtual environment: {stderr.decode()}")

Copilot AI Feb 2, 2026

The stderr output is read after the subprocess completes, but create_venv.stderr is None whenever the subprocess was created without stderr=PIPE, in which case calling .read() raises an AttributeError. Add a None check before reading stderr to handle this case gracefully.

Suggested change
- stderr = await create_venv.stderr.read()
- raise RuntimeError(f"Failed to create virtual environment: {stderr.decode()}")
+ stderr_data = b""
+ if create_venv.stderr is not None:
+     stderr_data = await create_venv.stderr.read()
+ error_msg = stderr_data.decode() if stderr_data else "Unknown error"
+ raise RuntimeError(f"Failed to create virtual environment: {error_msg}")
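An alternative to guarding `.read()` by hand is `Process.communicate()`, which waits for the process to exit and returns `None` for any stream that was not piped. The following is a minimal sketch under that assumption; `run_checked` is a hypothetical helper name, not a function from the PR.

```python
import asyncio
import sys


async def run_checked(*cmd: str) -> None:
    """Run a command; raise RuntimeError carrying its stderr if it exits non-zero."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,  # without this, proc.stderr would be None
    )
    # communicate() drains the pipe and waits for exit in one step.
    _, stderr_data = await proc.communicate()
    if proc.returncode != 0:
        error_msg = (
            stderr_data.decode(errors="replace").strip() if stderr_data else "Unknown error"
        )
        raise RuntimeError(f"Command failed (exit {proc.returncode}): {error_msg}")
```

A virtual environment could then be created with something like `await run_checked(sys.executable, "-m", "venv", "some-dir")`, where the target directory is whatever the caller chooses.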

Comment on lines +129 to +200
def ensure_ollama_running_on_host(host: str = "0.0.0.0", port: int = OLLAMA_DEFAULT_PORT) -> bool:
    """Ensure Ollama is running on the host with external access enabled.

    This is called BEFORE starting Docker worker so that the container
    can access Ollama on the host via localhost (with --network host).

    Args:
        host: Host to bind to (default 0.0.0.0 for external access)
        port: Port to bind to (default 11434)

    Returns:
        True if Ollama is running and accessible
    """
    # Only run on macOS
    if platform.system() != "Darwin":
        return True  # Not needed on Linux (Docker can use GPU directly)

    # Check if Ollama is installed
    ollama_path = shutil.which("ollama")
    if not ollama_path:
        logger.info("Ollama is not installed on this Mac")
        return False

    # Check if Ollama is already running
    try:
        import httpx

        with httpx.Client(timeout=2.0) as client:
            response = client.get(f"http://localhost:{port}/api/tags")
            if response.status_code == 200:
                logger.info("Ollama service is already running")
                return True
    except Exception:
        pass

    # Ollama not running, start it with external access
    logger.info(f"Starting Ollama service on {host}:{port}")

    env = os.environ.copy()
    env["OLLAMA_HOST"] = f"{host}:{port}"

    try:
        # Start ollama serve in background
        process = subprocess.Popen(
            [ollama_path, "serve"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            env=env,
            start_new_session=True,
        )
        logger.info(f"Started Ollama service (PID {process.pid})")

        # Wait for Ollama to be ready (up to 30 seconds)
        import httpx

        for _ in range(30):
            time.sleep(1)
            try:
                with httpx.Client(timeout=2.0) as client:
                    response = client.get(f"http://localhost:{port}/api/tags")
                    if response.status_code == 200:
                        logger.info("Ollama service is ready")
                        return True
            except Exception:
                pass

        logger.error("Ollama service failed to start in time")
        return False

    except Exception as e:
        logger.error(f"Failed to start Ollama service: {e}")
        return False

Copilot AI Feb 2, 2026

The ensure_ollama_running_on_host function in the backend service is nearly identical to the ensure_ollama_running method in the worker's process manager. This code duplication means any bug fixes or improvements need to be applied in two places. Consider refactoring this into a shared utility module that both the backend and worker can use.
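As a rough illustration of what a shared utility might look like, the polling loop that both copies repeat can be factored into one small function that takes the readiness check as a parameter. Everything here is hypothetical: `wait_until_ready`, its parameters, and the injectable `sleep` are illustrative names, and where such a module would live so that both backend and worker can import it is left open.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as probe() succeeds, False if every attempt fails.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no need to sleep after the final attempt
            sleep(delay)
    return False
```

Unlike the loop in the reviewed function, this sketch probes before the first sleep, so a service that is already up is detected immediately. Passing `sleep` as a parameter keeps the helper easy to unit test without real delays.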

Comment on lines +169 to +189
@staticmethod
def find_mlx_variant(hf_model_id: str) -> Optional[str]:
    """Find MLX variant of a HuggingFace model.

    Searches mlx-community for a converted version of the model.

    Args:
        hf_model_id: Original HuggingFace model ID

    Returns:
        MLX model ID if found, None otherwise
    """
    # Try common naming patterns
    model_name = hf_model_id.split("/")[-1]
    patterns = [
        f"mlx-community/{model_name}",
        f"mlx-community/{model_name}-mlx",
        f"mlx-community/{model_name}-4bit",
        f"mlx-community/{model_name}-8bit",
    ]
    return patterns[0] if patterns else None

Copilot AI Feb 2, 2026

The find_mlx_variant static method always returns the first pattern: the patterns list is a non-empty literal, so the other three candidates are never used and the None branch is unreachable. The logic also never checks whether the variant exists on HuggingFace; it only constructs a potential model ID. The caller should verify that the model exists before using it. Consider documenting this behavior or adding an actual existence check.
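A minimal sketch of the existence check, assuming the lookup is injected as a callable so the function stays testable offline. The `repo_exists` parameter and the `candidate_mlx_ids` helper are hypothetical names introduced for this example; in a real deployment `repo_exists` would wrap whatever Hub client the project already uses.

```python
from typing import Callable, Optional


def candidate_mlx_ids(hf_model_id: str) -> list[str]:
    """Build the same mlx-community naming patterns as the method under review."""
    model_name = hf_model_id.split("/")[-1]
    return [
        f"mlx-community/{model_name}",
        f"mlx-community/{model_name}-mlx",
        f"mlx-community/{model_name}-4bit",
        f"mlx-community/{model_name}-8bit",
    ]


def find_mlx_variant(
    hf_model_id: str,
    repo_exists: Callable[[str], bool],  # hypothetical: True if the repo ID exists
) -> Optional[str]:
    """Return the first candidate that actually exists, or None if none do."""
    for candidate in candidate_mlx_ids(hf_model_id):
        if repo_exists(candidate):
            return candidate
    return None
```

With this shape every pattern is tried in order, and `None` becomes a reachable result that callers can use to fall back to local conversion.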

f"Download failed: {stdout.decode() if stdout else 'Unknown error'}"
)
model_dir = self.cache_dir / "downloads" / hf_model_id.replace("/", "--")

Copilot AI Feb 2, 2026

The GGUF conversion process downloads the model using huggingface-cli but doesn't validate that the download was successful before attempting conversion. The code checks the return code of the download process (line 370), but if the model directory already exists (line 356), it skips the download and uses the existing directory. This could lead to conversion failures if the existing directory is incomplete or corrupted. Consider validating that the necessary model files exist before proceeding with conversion.

Suggested change
# Validate that the model directory contains the necessary files
config_file = model_dir / "config.json"
has_safetensors = any(model_dir.glob("*.safetensors"))
has_bin_weights = any(model_dir.glob("*.bin"))
if (
    not model_dir.exists()
    or not model_dir.is_dir()
    or not config_file.exists()
    or not (has_safetensors or has_bin_weights)
):
    raise RuntimeError(
        f"Model directory '{model_dir}' is missing required files. "
        "Please clear the Hugging Face cache or downloads and retry."
    )

raise RuntimeError(
"llama-server not found. " "Please install llama.cpp: brew install llama.cpp"
)
effective_model_path = model_id

Copilot AI Feb 2, 2026

This assignment to 'effective_model_path' is unnecessary as it is redefined before this value is used.

        if response.status_code == 200:
            logger.info("Ollama service is already running")
            return True
except Exception:

Copilot AI Feb 2, 2026

'except' clause does nothing but pass and there is no explanatory comment.
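One way to address this is to catch only the failures expected while the service is down and to log them at debug level instead of discarding them. The sketch below uses the standard library's `urllib` rather than the `httpx` client in the PR, purely so the example has no third-party dependency; `probe_http_ok` is a hypothetical name.

```python
import http.client
import logging
import urllib.request

logger = logging.getLogger(__name__)


def probe_http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException) as exc:
        # Expected while the service is down or still starting: connection
        # refused, timeouts, and HTTP error statuses (urllib.error.URLError and
        # HTTPError are OSError subclasses). Record it and report "not ready".
        logger.debug("Probe of %s failed: %s", url, exc)
        return False
```

Programming errors such as a malformed URL still propagate, which a bare `except Exception: pass` would have hidden.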

        if response.status_code == 200:
            logger.info("Ollama service is ready")
            return True
except Exception:

Copilot AI Feb 2, 2026

'except' clause does nothing but pass and there is no explanatory comment.

        if response.status_code == 200:
            logger.info("Ollama service is already running")
            return True
except Exception:

Copilot AI Feb 2, 2026

'except' clause does nothing but pass and there is no explanatory comment.

        if response.status_code == 200:
            logger.info("Ollama service is ready")
            return True
except Exception:

Copilot AI Feb 2, 2026

'except' clause does nothing but pass and there is no explanatory comment.

@ricky-chaoju merged commit 5541d42 into main on Feb 2, 2026
10 checks passed