Skip to content
View janhilgard's full-sized avatar

Block or report janhilgard

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don’t include any personal information such as legal names or email addresses. Markdown is supported. This note will only be visible to you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
janhilgard/README.md

Jan Hilgard — AI inference engineer, serial founder

I build systems for running large language models efficiently on local hardware. Currently focused on inference engine optimization for Apple Silicon — specifically M3 Ultra deployments, MLX quantization (Q6/Q8), KV cache strategies, and agentic pipelines that actually work in production.

I'm a co-founder and CTO in the AI space. Before that, I founded and ran Hosting90, a Czech hosting and cloud infrastructure company, for 18 years before exiting in 2020. That background in running infrastructure at scale informs how I think about LLM serving: reliability, latency, and resource efficiency matter more than benchmark headlines.

What I'm working on

  • vllm-mlx — Core contributor to vllm-mlx (80+ PRs; second external contributor by join order), an OpenAI-compatible inference server built on MLX for Apple Silicon. Specific contributions:
    • MTP speculative decoding for Qwen3-Next (PR #82, merged) — Multi-Token Prediction with always-advance strategy and rejection sampling; 1.43x verified / 1.76x optimistic on M3 Ultra
    • Draft-model speculative decoding (PR #45, merged) — HybridEngine sharing a single model instance across speculative and batched modes; 1.2–1.4x throughput improvement
    • Prefix caching, KV cache quantization, Anthropic Messages API integration, MoE model support
    • Assigned collaborator on MTP roadmap for Qwen3-Next / MiMo / Qwen3.5 family alongside the repo owner
  • vllm-mlx-dashboard — Real-time monitoring dashboard for local LLM inference servers (llama.cpp + vllm-mlx), built with Next.js.
  • M3 Ultra (256 GB) benchmarking — sustained throughput, quantization tradeoffs, batch size effects at the edge of consumer hardware.

Background

  • 2020–present: Co-founder & CTO, AI inference tooling
  • 2002–2020: Founder & CEO, Hosting90 (Czech hosting/cloud, exited)
  • Focus areas: inference optimization, local LLM deployment, agentic systems

Tech I work with

Python TypeScript MLX vLLM Next.js Apple Silicon llama.cpp KV cache quantization

Reach me

Pinned Loading

  1. vllm-mlx-dashboard vllm-mlx-dashboard Public

    Real-time monitoring dashboard for local LLM inference servers (vllm-mlx + llama.cpp) — built with Next.js

    TypeScript 4

  2. vllm-mlx vllm-mlx Public

    Forked from waybarrios/vllm-mlx

    OpenAI-compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ …

    Python