iyulab/lm-supply

LMSupply

Local Model Supply for .NET — on-demand AI inference

Start small. Download what you need. Run locally.

// This is all you need. No setup. No configuration. No API keys.
await using var model = await LocalEmbedder.LoadAsync("auto");  // Hardware-optimized selection
float[] embedding = await model.EmbedAsync("Hello, world!");

LMSupply is designed around three core principles:

🪶 Minimal Footprint

Your application ships with zero bundled models. The base package is tiny. Models, tokenizers, and runtime components are downloaded only when first requested and cached for reuse.

⚡ Lazy Everything

First run:  LoadAsync("default") → Downloads model → Caches → Runs inference
Next runs:  LoadAsync("default") → Uses cached model → Runs inference instantly

No pre-download scripts. No model management. Just use it.

🎯 Zero Boilerplate

Traditional approach:

// ❌ Without LMSupply: 50+ lines of setup
var tokenizer = LoadTokenizer(modelPath);
var session = new InferenceSession(modelPath, sessionOptions);
var inputIds = tokenizer.Encode(text);
var attentionMask = CreateAttentionMask(inputIds);
var inputs = new List<NamedOnnxValue> { ... };
var outputs = session.Run(inputs);
var embeddings = PostProcess(outputs);
// ... error handling, pooling, normalization, cleanup ...

// ✅ With LMSupply: 2 lines
await using var model = await LocalEmbedder.LoadAsync("default");
float[] embedding = await model.EmbedAsync("Hello, world!");

Packages

| Package | Description | Status |
|---|---|---|
| LMSupply.Embedder | Text → Vector embeddings (ONNX + GGUF) | NuGet |
| LMSupply.Reranker | Semantic reranking for search | NuGet |
| LMSupply.Generator | Text generation & chat (ONNX + GGUF) | NuGet |
| LMSupply.Captioner | Image → Text captioning | NuGet |
| LMSupply.Ocr | Document OCR | NuGet |
| LMSupply.Detector | Object detection | NuGet |
| LMSupply.Segmenter | Image segmentation | NuGet |
| LMSupply.Translator | Neural machine translation | NuGet |
| LMSupply.Transcriber | Speech → Text (Whisper) | NuGet |
| LMSupply.Synthesizer | Text → Speech (Piper) | NuGet |
| LMSupply.Llama | Shared llama-server management for GGUF | NuGet |

Quick Start

Text Embeddings

using LMSupply.Embedder;

// Use "auto" for hardware-optimized model selection
await using var model = await LocalEmbedder.LoadAsync("auto");

// Single text
float[] embedding = await model.EmbedAsync("Hello, world!");

// Batch processing
float[][] embeddings = await model.EmbedAsync(new[]
{
    "First document",
    "Second document",
    "Third document"
});

// Similarity
float similarity = LocalEmbedder.CosineSimilarity(embeddings[0], embeddings[1]);

// GGUF models (via llama-server) - Auto-detected by repo name pattern
await using var ggufModel = await LocalEmbedder.LoadAsync("nomic-ai/nomic-embed-text-v1.5-GGUF");
float[] ggufEmbedding = await ggufModel.EmbedAsync("Hello from GGUF!");
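
For readers who want to see what the similarity score measures, here is a minimal standalone sketch of cosine similarity over two embedding vectors. The helper name `Cosine` is hypothetical and this is not LMSupply's implementation of `LocalEmbedder.CosineSimilarity`; it only illustrates the standard formula dot(a, b) / (|a| · |b|).

```csharp
using System;

// Hypothetical helper: cosine similarity = dot(a, b) / (|a| * |b|).
// Returns 0 when either vector has zero length to avoid dividing by zero.
static float Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    if (normA == 0 || normB == 0) return 0f;
    return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
}

// Identical directions score close to 1, orthogonal directions score 0.
Console.WriteLine(Cosine(new[] { 1f, 0f }, new[] { 1f, 0f }));
Console.WriteLine(Cosine(new[] { 1f, 0f }, new[] { 0f, 1f }));
```

Scores near 1 indicate texts that point in the same direction in embedding space, which is why the value works as a ranking signal for search.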

Semantic Reranking

using LMSupply.Reranker;

await using var reranker = await LocalReranker.LoadAsync("default");

var results = await reranker.RerankAsync(
    query: "What is machine learning?",
    documents: new[]
    {
        "Machine learning is a subset of artificial intelligence...",
        "The weather today is sunny and warm...",
        "Deep learning uses neural networks..."
    },
    topK: 2
);

foreach (var result in results)
{
    Console.WriteLine($"[{result.Score:F4}] {result.Document}");
}

Text Generation

using LMSupply.Generator;

// GGUF models — native tool calling support via llama-server
await using var model = await LocalGenerator.LoadAsync("gguf:auto");  // Hardware-optimized (Qwen3 pool)

await foreach (var token in model.GenerateAsync("Hello, my name is"))
{
    Console.Write(token);
}

// Chat with tool calling support (--jinja enabled)
var messages = new[]
{
    ChatMessage.System("You are a helpful assistant."),
    ChatMessage.User("Explain quantum computing simply.")
};

await foreach (var token in model.GenerateChatAsync(messages))
{
    Console.Write(token);
}

// ONNX models (for DirectML/NPU environments)
var generator = await TextGeneratorBuilder.Create()
    .WithDefaultModel()  // Platform-aware: Gemma 4 GGUF on NVIDIA/CPU/Mac/Linux, Phi-4 Mini ONNX on DirectML+non-NVIDIA
    .BuildAsync();

string response = await generator.GenerateCompleteAsync("What is machine learning?");

// Fallback chain — try candidates in order, load first that succeeds
await using var robust = await LocalGenerator.LoadWithFallbackChainAsync(
    ["gguf:phi-4-mini", "gguf:qwen3-default"],
    onFailure: (id, ex) => Console.WriteLine($"Skipped {id}: {ex.Message}"));

// Quality floor — prefer a specific model in 'auto' selection, fall back if unavailable
var options = new GeneratorOptions { PreferredAutoModelId = "gguf:phi-4-mini" };
await using var preferred = await LocalGenerator.LoadAsync("auto", options);

Translation

using LMSupply.Translator;

await using var translator = await LocalTranslator.LoadAsync("ko-en");

// Translate Korean to English
string english = await translator.TranslateAsync("안녕하세요, 세계!");
Console.WriteLine(english); // "Hello, world!"

// Batch translation
string[] translations = await translator.TranslateBatchAsync(new[]
{
    "첫 번째 문장입니다.",
    "두 번째 문장입니다."
});

Speech Recognition (Transcriber)

using LMSupply.Transcriber;

await using var transcriber = await LocalTranscriber.LoadAsync("default");

// Transcribe audio file
var result = await transcriber.TranscribeAsync("audio.wav");
Console.WriteLine(result.Text);
Console.WriteLine($"Language: {result.Language}");

// Streaming transcription
await foreach (var segment in transcriber.TranscribeStreamingAsync("audio.wav"))
{
    Console.WriteLine($"[{segment.Start:F2}s] {segment.Text}");
}

Text-to-Speech (Synthesizer)

using LMSupply.Synthesizer;

await using var synthesizer = await LocalSynthesizer.LoadAsync("default");

// Synthesize and save to file
await synthesizer.SynthesizeToFileAsync("Hello, world!", "output.wav");

// Get audio samples
var result = await synthesizer.SynthesizeAsync("Hello!");
Console.WriteLine($"Duration: {result.DurationSeconds:F2}s");
Console.WriteLine($"Real-time factor: {result.RealTimeFactor:F1}x");

Available Models

Updated: 2026-03 based on MTEB leaderboard and community benchmarks

Embedder (ONNX)

| Alias | Model | Dims | Params | Context | Best For |
|---|---|---|---|---|---|
| default | bge-m3 | 1024 | 568M | 8192 | SOTA multilingual, 100+ languages (v0.34+) |
| quality | bge-m3 | 1024 | 568M | 8192 | Same as default; for pipelines that pin quality tier |
| fast | multilingual-e5-small | 384 | 118M | 512 | Lightweight multilingual, low latency |
| large | multilingual-e5-large | 1024 | 560M | 512 | Highest dense quality, 100+ languages |

Embedder (GGUF via llama-server)

GGUF models are auto-detected by -GGUF or _gguf in repo name, or .gguf file extension.

| Model Repository | Dims | Context | Best For |
|---|---|---|---|
| nomic-ai/nomic-embed-text-v1.5-GGUF | 768 | 8K | Long context, matryoshka |
| BAAI/bge-small-en-v1.5-GGUF | 384 | 512 | Compact and fast |
| BAAI/bge-base-en-v1.5-GGUF | 768 | 512 | Quality balance |
| Any HuggingFace GGUF embedding repo | varies | varies | Custom models |
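
The name-based detection rule described above (`-GGUF` or `_gguf` in the repo name, or a `.gguf` file extension) can be sketched as a simple predicate. The function name `LooksLikeGguf` is hypothetical, and case-insensitive matching is an assumption; the library's actual detection logic may differ.

```csharp
using System;

// Hypothetical predicate mirroring the documented rule.
// Assumption: matching is case-insensitive.
static bool LooksLikeGguf(string modelId) =>
    modelId.Contains("-GGUF", StringComparison.OrdinalIgnoreCase)
    || modelId.Contains("_gguf", StringComparison.OrdinalIgnoreCase)
    || modelId.EndsWith(".gguf", StringComparison.OrdinalIgnoreCase);

Console.WriteLine(LooksLikeGguf("nomic-ai/nomic-embed-text-v1.5-GGUF"));
Console.WriteLine(LooksLikeGguf("BAAI/bge-large-en-v1.5"));
```

A check like this is useful for predicting which backend (llama-server or ONNX) a given model ID is likely to route to before loading it.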

Reranker (ONNX)

| Alias | Model | Params | Context | Best For |
|---|---|---|---|---|
| default | ms-marco-MiniLM-L-6-v2 | 22M | 512 | Balanced speed/quality |
| fast | ms-marco-TinyBERT-L-2-v2 | 4.4M | 512 | Ultra-low latency |
| quality | bge-reranker-base | 278M | 512 | Higher accuracy |
| large | bge-reranker-large | 560M | 512 | Best accuracy |
| multilingual | bge-reranker-v2-m3 | 568M | 8192 | Long docs, 100+ languages |

Reranker (GGUF via llama-server)

GGUF reranker models are auto-detected by -GGUF or _gguf in repo name.

| Model Repository | Context | Best For |
|---|---|---|
| BAAI/bge-reranker-v2-m3-GGUF | 8K | Multilingual, long docs |
| jinaai/jina-reranker-v2-base-multilingual-GGUF | 8K | Multilingual |

Generator

Platform-based defaults (default and auto delegate to this matrix):

| Platform | Selected backend | Selected model |
|---|---|---|
| Windows + NVIDIA | GGUF (llama.cpp CUDA) | Gemma 4 via gguf:auto (VRAM-aware) |
| Windows + AMD/Intel GPU | ONNX (DirectML) | Phi-4 Mini (MIT, FC-capable) |
| Windows / Linux CPU-only | GGUF (llama.cpp CPU) | Gemma 4 via gguf:auto (VRAM-aware) |
| Linux + any GPU | GGUF (llama.cpp; CUDA on NVIDIA, CPU/ROCm on AMD) | Gemma 4 via gguf:auto |
| macOS (Apple Silicon) | GGUF (llama.cpp Metal) | Gemma 4 via gguf:auto |

LoadAsync("default") and LoadAsync("auto") both route through this matrix. For explicit selection, use gguf:* aliases, ONNX aliases, or a direct HuggingFace repo ID.

ONNX aliases (recommended for Windows DirectML + non-NVIDIA):

| Alias | Model | Params | Context | License | Notes |
|---|---|---|---|---|---|
| phi-4-mini | Phi-4-mini-instruct | 3.8B | 16K | MIT | Smallest FC-capable ONNX model |
| fast | Phi-4-mini-instruct | 3.8B | 16K | MIT | Same as phi-4-mini |
| quality | phi-4 | 14B | 16K | MIT | Best reasoning |
| phi-3.5-mini | Phi-3.5-mini-instruct | 3.8B | 128K | MIT | Long context (legacy) |

GGUF aliases (via llama-server):

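The registry centers on the Gemma 4 and Qwen3 series. gguf:auto automatically selects the largest model that fits in VRAM from the qwen3 auto-pool (qwen3-fast/default/balanced/quality). Use the Gemma 4 aliases when you want to specify a model explicitly or for hard-coded workloads.

Gemma 4 aliases (Apache 2.0, multimodal, native function calling; requires llama.cpp b8672+):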

| Alias | Model | Params | Quant | Size | VRAM Target |
|---|---|---|---|---|---|
| gguf:gemma4-fast | Gemma 4 E2B Instruct | 2.3B | Q4_K_M | ~3.1 GB | <4GB iGPU/mobile |
| gguf:gemma4-default | Gemma 4 E4B Instruct | 4.5B | Q4_K_M | ~5.3 GB | 4-8GB |
| gguf:gemma4-balanced | Gemma 4 E4B Instruct | 4.5B | Q8_0 | ~7.5 GB | 8-16GB (e.g. RTX 3060 12GB) |
| gguf:gemma4-quality | Gemma 4 26B A4B (MoE) | 26B (4B active) | Q4_K_M | ~16.8 GB | 16-20GB |
| gguf:gemma4-large | Gemma 4 31B Instruct | 31B | Q4_K_M | ~18.7 GB | 20-48GB |

Qwen3/3.5/3.6 aliases (Apache 2.0, ChatML, thinking mode; gguf:auto pool):

| Alias | Model | Params | Quant | Size | VRAM Target | Notes |
|---|---|---|---|---|---|---|
| gguf:auto | Hardware-optimized (qwen3 pool) | varies | varies | varies | Auto-select | |
| gguf:qwen3-fast | Qwen 3.5 2B Instruct | 2B | Q4_K_M | ~1.5 GB | <3GB | |
| gguf:qwen3-default | Qwen 3.5 4B Instruct | 4B | Q4_K_M | ~3.0 GB | 4-6GB | thinking ON by default |
| gguf:qwen3-balanced | Qwen3 8B Instruct | 8B | Q4_K_M | ~5.0 GB | 6-10GB | |
| gguf:qwen3-quality | Qwen 3.6 35B A3B Instruct (IQ4_XS, MoE) | 35B (3B active) | IQ4_XS | ~17.7 GB | 20-24GB | thinking ON by default |
| gguf:qwen3-large | Qwen 3.6 35B A3B Instruct (Q4_K_M, MoE) | 35B (3B active) | Q4_K_M | ~22.1 GB | 24GB+ | thinking ON; auto-pool excluded |

Other aliases:

| Alias | Model | Params | Quant | Size | VRAM Target |
|---|---|---|---|---|---|
| gguf:phi-4-mini | Phi-4 Mini Instruct | 3.8B | Q4_K_M | ~2.4 GB | <4GB |
| gguf:qwen2.5-7b | Qwen 2.5 7B Instruct | 7.6B | Q4_K_M | ~4.7 GB | 6-8GB |
| gguf:xlarge | Qwen 3.5 122B A10B (MoE, split) | 122B (10B active) | Q4_K_M | ~76.5 GB (3 shards) | 48GB+ server |

Translator

| Alias | Direction | Model | Best For |
|---|---|---|---|
| ko-en | Korean → English | OPUS-MT | Korean translation |
| en-ko | English → Korean | OPUS-MT | Korean translation |
| ja-en | Japanese → English | OPUS-MT | Japanese translation |
| zh-en | Chinese → English | OPUS-MT | Chinese translation |
| multilingual | Many → English | mBART/M2M100 | 100+ languages |

Transcriber (Whisper)

| Alias | Model | Params | Size | WER | Best For |
|---|---|---|---|---|---|
| fast | Whisper Tiny | 39M | ~150MB | 7.6% | Ultra-fast transcription |
| default | Whisper Base | 74M | ~290MB | 5.0% | Balanced speed/quality |
| quality | Whisper Small | 244M | ~970MB | 3.4% | Higher accuracy |
| large | Whisper Large V3 | 1.5B | ~6GB | 2.5% | Best accuracy |
| english | Whisper Base.en | 74M | ~290MB | 4.3% | English-optimized |

Synthesizer (Piper TTS)

| Alias | Voice | Language | Sample Rate | Best For |
|---|---|---|---|---|
| default | Lessac | en-US | 22050 Hz | Balanced quality |
| fast | Ryan | en-US | 16000 Hz | Ultra-fast synthesis |
| quality | Amy | en-US | 22050 Hz | High quality |
| british | Semaine | en-GB | 22050 Hz | British English |
| korean | KSS | ko-KR | 22050 Hz | Korean |
| japanese | JSUT | ja-JP | 22050 Hz | Japanese |
| chinese | Huayan | zh-CN | 22050 Hz | Mandarin Chinese |

Adaptive Model Selection ("auto" mode)

Use "auto" to let LMSupply select the optimal model based on your hardware:

// Hardware-optimized model selection
await using var embedder = await LocalEmbedder.LoadAsync("auto");
await using var generator = await LocalGenerator.LoadAsync("auto");      // Platform-based: GGUF or ONNX
await using var reranker = await LocalReranker.LoadAsync("auto");

LMSupply detects your hardware and selects models accordingly:

ONNX Models

LocalEmbedder.LoadAsync("auto") selects the largest model whose estimated size fits available VRAM. Candidates (largest first): BGE-M3 (568M), multilingual-e5-large (560M), nomic-embed-text-v1.5 (137M), multilingual-e5-small (118M). Falls back to multilingual-e5-small when nothing fits.

| Performance Tier | Hardware | Embedder (auto) | Generator | Reranker |
|---|---|---|---|---|
| Low | CPU only or GPU <4GB | multilingual-e5-small (118M) | Phi-4-mini (3.8B) | MiniLM-L6 (22M) |
| Medium | GPU 4-8GB | nomic-embed-text-v1.5 (137M) | Phi-4-mini (3.8B) | bge-reranker-base |
| High | GPU 8-16GB | multilingual-e5-large (560M) | Phi-4 (14B) | bge-reranker-large |
| Ultra | GPU 16GB+ | bge-m3 (568M) | Phi-4 (14B) | bge-reranker-large |

GGUF Models (via gguf:auto)

gguf:auto selects from the Qwen3 auto-pool (qwen3-fast, qwen3-default, qwen3-balanced, qwen3-quality) based on VRAM. Models with thinking-enabled-by-default generate <think>...</think> blocks — pass FilterReasoningTokens = true to suppress them.

| Performance Tier | Free VRAM | Selected Model | Notes |
|---|---|---|---|
| Low | CPU or <3GB | gguf:qwen3-fast (Qwen 3.5 2B) | FallbackToSmallest |
| Medium | 4-6GB | gguf:qwen3-default (Qwen 3.5 4B) | thinking ON |
| High | 6-10GB | gguf:qwen3-balanced (Qwen3 8B) | |
| Ultra | 20-24GB | gguf:qwen3-quality (Qwen 3.6 35B MoE) | thinking ON |
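
To make the tier table concrete, the following is a rough sketch of a "pick the largest tier that fits" selection over the Qwen3 auto-pool. The thresholds are taken from the lower bound of each Free VRAM range in the table; how LMSupply treats values that fall between the listed ranges (for example 3-4GB or 10-20GB) is not stated, so this sketch assumes the previous tier stays selected. The function name `PickFromPool` is hypothetical, and the real selection may also account for context length and KV cache overhead.

```csharp
using System;

// (alias, minimum free VRAM in GB): lower bounds taken from the tier table.
// Assumption: values between listed ranges stay on the previous tier.
var pool = new (string Alias, double MinFreeVramGb)[]
{
    ("gguf:qwen3-fast", 0),
    ("gguf:qwen3-default", 4),
    ("gguf:qwen3-balanced", 6),
    ("gguf:qwen3-quality", 20),
};

// Walk the pool from smallest to largest and keep the last tier that fits.
string PickFromPool(double freeVramGb)
{
    string chosen = pool[0].Alias; // fall back to the smallest model
    foreach (var (alias, min) in pool)
    {
        if (freeVramGb >= min) chosen = alias;
    }
    return chosen;
}

Console.WriteLine(PickFromPool(8));
```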

Platform-based routing (v0.28.0+): LoadAsync("default") and LoadAsync("auto") both select the optimal backend+model for the current host: GGUF via llama.cpp on CPU / NVIDIA / Apple Silicon / Linux, and ONNX via DirectML on Windows AMD/Intel. Use gguf:* aliases or ONNX aliases for explicit control.

Key benefits:

  • Zero configuration - Just use "auto", no hardware research needed
  • Optimal performance - Larger models on capable hardware
  • Graceful degradation - Smaller models on limited hardware
  • Backward compatible - Existing aliases ("default", "fast", "quality") still work

GPU Acceleration

GPU acceleration is automatic — LMSupply detects your hardware and downloads appropriate runtime binaries on first use:

Detection priority: CUDA → DirectML → CoreML → CPU

// Auto-detect (default): uses the GPU if available, falls back to CPU
var options = new EmbedderOptions { Provider = ExecutionProvider.Auto };

// Force a specific provider (pick one)
var cudaOptions = new EmbedderOptions { Provider = ExecutionProvider.Cuda };         // NVIDIA
var directMlOptions = new EmbedderOptions { Provider = ExecutionProvider.DirectML }; // Windows GPU
var coreMlOptions = new EmbedderOptions { Provider = ExecutionProvider.CoreML };     // macOS
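
The detection priority (CUDA, then DirectML, then CoreML, then CPU) amounts to a first-match chain. The sketch below illustrates that ordering with plain booleans; the function `PickProvider` and its string return values are hypothetical stand-ins and not part of the LMSupply API, which exposes this via `ExecutionProvider.Auto`.

```csharp
using System;

// Hypothetical first-match chain following the documented priority:
// CUDA -> DirectML -> CoreML -> CPU.
static string PickProvider(bool cudaAvailable, bool directMlAvailable, bool coreMlAvailable)
{
    if (cudaAvailable) return "Cuda";
    if (directMlAvailable) return "DirectML";
    if (coreMlAvailable) return "CoreML";
    return "Cpu"; // always-available fallback
}

// A Windows machine with an NVIDIA GPU exposes both CUDA and DirectML; CUDA wins.
Console.WriteLine(PickProvider(cudaAvailable: true, directMlAvailable: true, coreMlAvailable: false));
```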

Verify GPU Detection

using LMSupply.Runtime;

// Quick summary (returns formatted string)
Console.WriteLine(EnvironmentDetector.GetEnvironmentSummary());

// Or access individual properties
var gpu = EnvironmentDetector.DetectGpu();
var provider = EnvironmentDetector.GetRecommendedProvider();

Console.WriteLine($"Provider: {provider}");
Console.WriteLine($"CUDA Available: {gpu.Vendor == GpuVendor.Nvidia && gpu.CudaDriverVersionMajor >= 11}");
Console.WriteLine($"DirectML Available: {gpu.DirectMLSupported}");

Troubleshooting GPU Issues

Do NOT install ONNX Runtime packages manually. LMSupply handles runtime binary management automatically via lazy downloading.

If you have conflicting packages installed, remove them:

dotnet remove package Microsoft.ML.OnnxRuntime
dotnet remove package Microsoft.ML.OnnxRuntime.Gpu
dotnet remove package Microsoft.ML.OnnxRuntime.DirectML

For NVIDIA CUDA support, ensure you have:

  • NVIDIA GPU drivers installed
  • CUDA 11.x or 12.x runtime (LMSupply auto-selects the appropriate version)

Logging & Diagnostics

LMSupply emits operational logs (model auto-selection, GPU layer offload decisions, VRAM warnings, runtime download progress) via System.Diagnostics.Trace.TraceInformation / TraceWarning. These do not automatically surface in Microsoft.Extensions.Logging (ILogger) pipelines — Trace.* writes to Trace.Listeners, which is a separate channel.

To surface LMSupply diagnostics in an ILogger sink (Serilog, Console logging, Application Insights, etc.), attach LMSupplyTraceListener at host startup:

using LMSupply.Diagnostics;
using Microsoft.Extensions.Logging;

var logger = loggerFactory.CreateLogger("LMSupply");
LMSupplyTraceListener.Attach((message, severity) =>
    logger.Log(severity switch
    {
        TraceEventType.Warning => LogLevel.Warning,
        TraceEventType.Error   => LogLevel.Error,
        _                      => LogLevel.Information
    }, message));

After attaching, the following diagnostic events become visible in your standard logging pipeline:

  • Auto model selection ([EmbedderModelRegistry] Auto-selecting model for VRAM: ...)
  • Llama-server GPU layer decisions ([LlamaServerGeneratorModel] Auto partial offload: 18/32 layers on GPU or CPU-only fallback: 0/32 layers ...)
  • Runtime binary download progress
  • VRAM budget warnings (when LMSUPPLY_VRAM_BUDGET_MB overrides take effect)

VRAM Budget Override

LMSupply caps GPU model loading using min(total × (1 - margin), free × 0.95). To override the computed budget with an absolute value (megabytes), set the environment variable LMSUPPLY_VRAM_BUDGET_MB before process start:

# Force 8 GB budget regardless of GPU free/total
LMSUPPLY_VRAM_BUDGET_MB=8000

When set to a positive integer, the override is applied before any safety margin and feeds into all VRAM-aware decisions: model auto-selection, GGUF quantization variant selection, llama-server GPU layer count, and context length capping. If the override results in 0 GPU layers (full CPU fallback), LlamaOffloadTraceHelper emits a Trace.TraceWarning with the VRAM figures and the override hint — attach LMSupplyTraceListener per the section above to surface it.
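
As a worked example of the budget rule, the sketch below computes min(total × (1 - margin), free × 0.95) and lets a positive integer override replace the result. The function name `ComputeBudgetMb` and the 10% margin used in the example are assumptions for illustration; the document does not state the library's actual margin value.

```csharp
using System;

// Hypothetical helper. A positive integer override replaces the computed budget;
// anything else (null, empty, non-numeric, zero, negative) is ignored.
static double ComputeBudgetMb(double totalMb, double freeMb, double margin, string? overrideValue)
{
    if (int.TryParse(overrideValue, out int overrideMb) && overrideMb > 0)
        return overrideMb;

    return Math.Min(totalMb * (1 - margin), freeMb * 0.95);
}

// Example: 12 GB card with 10,000 MB free and an assumed 10% margin.
// Without an override the free-memory term is the smaller one.
string? env = Environment.GetEnvironmentVariable("LMSUPPLY_VRAM_BUDGET_MB");
Console.WriteLine(ComputeBudgetMb(totalMb: 12288, freeMb: 10000, margin: 0.10, overrideValue: env));
```

With these numbers, the total-based term is 12288 × 0.9 ≈ 11059 MB and the free-based term is 10000 × 0.95 = 9500 MB, so the budget is roughly 9500 MB unless the environment variable is set.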


Thread Safety & Batch Processing

All LMSupply models are thread-safe for concurrent inference. ONNX Runtime's InferenceSession.Run() is thread-safe by design.

// Safe: Concurrent inference on the same model instance
await using var embedder = await LocalEmbedder.LoadAsync("default");

await Parallel.ForEachAsync(documents, async (doc, ct) =>
{
    var embedding = await embedder.EmbedAsync(doc, ct);
    // Process embedding...
});

// Or with Task.WhenAll
var tasks = documents.Select(d => embedder.EmbedAsync(d));
var embeddings = await Task.WhenAll(tasks);

Performance tips:

  • GPU inference: 2-4 concurrent operations typically optimal
  • CPU inference: Match MaxDegreeOfParallelism to core count
  • Use EmbedBatchAsync() when available for better throughput
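
One way to apply the CPU tip is to cap `Parallel.ForEachAsync` with `ParallelOptions.MaxDegreeOfParallelism`. The sketch below is self-contained: `FakeEmbedAsync` is a placeholder that only simulates latency and stands in for a real inference call such as `EmbedAsync`. For GPU inference you would likely replace `Environment.ProcessorCount` with a small constant in the 2-4 range suggested above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Placeholder for an inference call; it only simulates latency.
static async Task<int> FakeEmbedAsync(string text, CancellationToken ct)
{
    await Task.Delay(10, ct);
    return text.Length;
}

var documents = Enumerable.Range(1, 20).Select(i => $"Document {i}").ToArray();
var results = new ConcurrentDictionary<string, int>();

// CPU inference: match the degree of parallelism to the core count.
var parallelOptions = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

await Parallel.ForEachAsync(documents, parallelOptions, async (doc, ct) =>
{
    // ConcurrentDictionary makes writes from multiple workers safe.
    results[doc] = await FakeEmbedAsync(doc, ct);
});

Console.WriteLine($"Processed {results.Count} documents");
```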

Loading Models

LMSupply supports three ways to specify models:

1. Aliases (Recommended for beginners)

Use predefined aliases for quick access to popular models:

await using var embedder = await LocalEmbedder.LoadAsync("default");               // bge-m3 (multilingual SOTA, v0.34+)
await using var generator = await LocalGenerator.LoadAsync("gguf:auto");           // Hardware-optimized
await using var balanced = await LocalGenerator.LoadAsync("gguf:qwen3-balanced");  // Qwen3 8B

2. HuggingFace Repository ID (Full control)

Use any HuggingFace repository directly with owner/repo-name format:

// ONNX models - auto-discovers onnx/ subfolder
await using var embedder = await LocalEmbedder.LoadAsync("BAAI/bge-large-en-v1.5");
await using var reranker = await LocalReranker.LoadAsync("BAAI/bge-reranker-v2-m3");

// GGUF models - auto-detected by repo name pattern (-GGUF, _gguf)
await using var generator = await LocalGenerator.LoadAsync("bartowski/Llama-3.2-3B-Instruct-GGUF");
await using var coder = await LocalGenerator.LoadAsync("bartowski/Qwen2.5-Coder-7B-Instruct-GGUF");

// Vision models
await using var captioner = await LocalCaptioner.LoadAsync("microsoft/Florence-2-base");
await using var detector = await LocalDetector.LoadAsync("onnx-community/yolov8s");

The system automatically:

  • Discovers ONNX files via HuggingFace API
  • Detects subfolder structure (onnx/, cpu/, cuda/)
  • Selects appropriate quantization variants (Q4_K_M for GGUF)
  • Downloads required tokenizer and config files

3. Local Path

Use locally stored models:

// ONNX model directory
await using var embedder = await LocalEmbedder.LoadAsync("/path/to/model-directory");

// GGUF file directly
await using var generator = await LocalGenerator.LoadAsync("/path/to/model.gguf");

For private HuggingFace repositories, set the HF_TOKEN environment variable.


Model Caching

Models are cached following HuggingFace Hub conventions:

  • Default: ~/.cache/huggingface/hub
  • Environment variables: HF_HUB_CACHE, HF_HOME, or XDG_CACHE_HOME
  • Manual override: new EmbedderOptions { CacheDirectory = "/path/to/cache" }
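
The lookup order implied by the HuggingFace Hub conventions can be sketched as follows: `HF_HUB_CACHE` wins, then `HF_HOME` with a `hub` subfolder, then `XDG_CACHE_HOME` with `huggingface/hub`, then the default under the home directory. The function `ResolveHubCache` is hypothetical, and whether LMSupply resolves the variables in exactly this order is an assumption based on the Hub convention it says it follows.

```csharp
using System;
using System.IO;

// Hypothetical resolver following the HuggingFace Hub precedence.
// The environment lookup is injected so the logic can be exercised without touching real variables.
static string ResolveHubCache(Func<string, string?> getEnv, string homeDir)
{
    string? hubCache = getEnv("HF_HUB_CACHE");
    if (!string.IsNullOrEmpty(hubCache)) return hubCache;

    string? hfHome = getEnv("HF_HOME");
    if (!string.IsNullOrEmpty(hfHome)) return Path.Combine(hfHome, "hub");

    string? xdg = getEnv("XDG_CACHE_HOME");
    if (!string.IsNullOrEmpty(xdg)) return Path.Combine(xdg, "huggingface", "hub");

    return Path.Combine(homeDir, ".cache", "huggingface", "hub");
}

string home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
Console.WriteLine(ResolveHubCache(Environment.GetEnvironmentVariable, home));
```

Printing the resolved path is a quick way to find where downloaded models live when disk usage needs checking.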

Requirements

Software

  • .NET 10.0+
  • Windows 10+, Linux, or macOS 11+

Hardware (Recommended)

| Use Case | RAM | GPU VRAM | Notes |
|---|---|---|---|
| Embeddings | 4GB+ | Optional | CPU works fine for small models |
| Reranking | 8GB+ | 4GB+ | GPU recommended for large models |
| Text Generation | 16GB+ | 8GB+ | VRAM strongly recommended |
| Speech (Whisper) | 8GB+ | 4GB+ | GPU significantly faster |
| Vision (Detection/Captioning) | 8GB+ | 4GB+ | GPU recommended |

Minimum for "auto" mode:

  • Any modern CPU with 8GB RAM
  • For best experience: NVIDIA GPU with 8GB+ VRAM

Documentation

Getting Started

Text & Language

Vision

Audio


License

MIT License - see LICENSE for details.
