Summary
Expose Flare's unique progressive inference capability through BrowserAI — start generating text while the model is still downloading.
How it works
Flare can run inference with a partial model:
- FlareEngine.forward_partial(token, pos, num_layers) — runs a forward pass through only the first num_layers layers
- FlareEngine.available_layers() — returns how many layers have been downloaded and loaded so far
- FlareEngine.inference_quality() — returns a 0.0 to 1.0 quality score for the current layer count
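A minimal sketch of how a decode loop might sit on top of these three calls. Note the assumptions: that forward_partial returns a logits array, and that a tokenizer with encode/decode plus greedy sampling are available; none of that is specified by the API above.

// Sketch only; see assumptions in the lead-in. Not the actual BrowserAI internals.
function generateProgressive(engine, tokenizer, prompt, maxTokens) {
  const tokens = tokenizer.encode(prompt);
  // Prefill: run the prompt through however many layers are loaded right now
  let logits;
  for (let pos = 0; pos < tokens.length; pos++) {
    logits = engine.forward_partial(tokens[pos], pos, engine.available_layers());
  }
  const output = [];
  for (let i = 0; i < maxTokens; i++) {
    // Greedy argmax over the logits (illustrative; real sampling may differ)
    let next = 0;
    for (let t = 1; t < logits.length; t++) {
      if (logits[t] > logits[next]) next = t;
    }
    output.push(next);
    // Re-query available_layers() each step, so quality rises as layers arrive
    logits = engine.forward_partial(next, tokens.length + i, engine.available_layers());
  }
  return tokenizer.decode(output);
}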
User experience
const ai = new BrowserAI({ engine: 'flare', progressive: true });

// Start loading — returns immediately, download continues in the background
ai.loadModel('llama-3.2-1b-flare', {
  onProgress: (loaded, total) => updateProgressBar(loaded, total),
  onLayersReady: (available, total) => {
    qualityMeter.value = available / total;
  }
});

// The user can start chatting before the download completes;
// Flare uses whatever layers are loaded, and quality improves as more arrive.
const response = await ai.generateText('Hello!');
// Generated with the partial model — rough but usable

// Later, once the model has fully loaded:
const response2 = await ai.generateText('Explain quantum computing');
// Full-quality response
UX elements
- Quality meter showing inference_quality (0-100%)
- "Generating with X/Y layers" indicator
- Smooth quality upgrade — no jarring transition
- Show estimated improvement: "4 more layers loading..."
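A sketch of how these indicators might be wired to the onLayersReady callback from the example above. The element ids and the <progress> markup are hypothetical, purely for illustration:

// Hypothetical DOM wiring; pass this callback to ai.loadModel as onLayersReady.
const qualityMeter = document.querySelector('#quality-meter'); // <progress max="1">
const layerLabel = document.querySelector('#layer-label');

function onLayersReady(available, total) {
  // Drive the quality meter (0-100%) and the layer indicator from one
  // callback, so the upgrade stays smooth and the two never disagree
  qualityMeter.value = available / total;
  layerLabel.textContent = available < total
    ? `Generating with ${available}/${total} layers (${total - available} more loading...)`
    : 'Full quality';
}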
Why this is unique
No other browser LLM engine supports this: WebLLM and Transformers.js both require the full model to finish downloading before any inference can run.
Depends on
Related