Progressive model loading with early inference via Flare #300

@sauravpanda

Description

Summary

Expose Flare's unique progressive inference capability through BrowserAI — start generating text while the model is still downloading.

How it works

Flare can run inference with a partial model:

  • FlareEngine.forward_partial(token, pos, num_layers) — runs a forward pass through the first num_layers loaded layers
  • FlareEngine.available_layers() — returns how many layers are loaded so far
  • FlareEngine.inference_quality() — returns a quality score from 0.0 to 1.0 for the current partial model
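
A minimal sketch of how a token loop could sit on top of this API. Only forward_partial, available_layers, and inference_quality come from the Flare surface above; the tokenizer and the sampleToken helper are illustrative assumptions.

function generateWithPartialModel(engine, tokenizer, prompt, maxNewTokens) {
    const promptTokens = tokenizer.encode(prompt);
    let pos = 0;
    let logits;

    // Prefill: feed the prompt through whatever layers have arrived so far.
    for (const token of promptTokens) {
        logits = engine.forward_partial(token, pos++, engine.available_layers());
    }

    // Decode: re-check the layer count each step, so output quality can
    // improve mid-generation as more layers finish downloading.
    const generated = [];
    for (let i = 0; i < maxNewTokens; i++) {
        const next = sampleToken(logits); // assumed sampling helper (greedy or temperature)
        generated.push(next);
        logits = engine.forward_partial(next, pos++, engine.available_layers());
    }

    return { text: tokenizer.decode(generated), quality: engine.inference_quality() };
}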

User experience

const ai = new BrowserAI({ engine: 'flare', progressive: true });

// Start loading — returns immediately, downloads in background
ai.loadModel('llama-3.2-1b-flare', {
    onProgress: (loaded, total) => updateProgressBar(loaded, total),
    onLayersReady: (available, total) => {
        qualityMeter.value = available / total;
    }
});

// The user can start chatting before the download completes.
// Flare runs on whatever layers are available; quality improves as more arrive.
const response = await ai.generateText('Hello!');
// Response generated with partial model — rough but usable

// Later, model fully loaded — full quality
const response2 = await ai.generateText('Explain quantum computing');
// Full quality response
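
If an app wants to upgrade an early answer once the full model lands, one plausible pattern is sketched below. It assumes loadModel returns a promise that resolves when the last layer arrives; only the callbacks above are confirmed, and render is a stand-in for the app's own UI update.

// Assumption: loadModel's return value resolves once all layers are downloaded.
const fullyLoaded = ai.loadModel('llama-3.2-1b-flare', { /* callbacks as above */ });

const prompt = 'Explain quantum computing';
let answer = await ai.generateText(prompt); // rough partial-model draft
render(answer);                             // stand-in UI helper

fullyLoaded.then(async () => {
    answer = await ai.generateText(prompt); // regenerate at full quality
    render(answer);                         // replace the draft in place
});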

UX elements

  • Quality meter showing inference_quality, scaled to 0-100%
  • "Generating with X/Y layers" indicator
  • Smooth quality upgrade — no jarring transition
  • Show estimated improvement: "4 more layers loading..."
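
A rough wiring of these elements against the progressive callbacks. The DOM ids and label copy are illustrative; only onLayersReady and the layer counts come from the API above.

const qualityMeter = document.querySelector('#quality-meter'); // e.g. <progress max="1">
const layersLabel = document.querySelector('#layers-label');

ai.loadModel('llama-3.2-1b-flare', {
    onLayersReady: (available, total) => {
        // Drive the quality meter and the "Generating with X/Y layers" indicator.
        qualityMeter.value = available / total;
        layersLabel.textContent = available < total
            ? `Generating with ${available}/${total} layers (${total - available} more loading...)`
            : 'Full model loaded';
    },
});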

Why this is unique

No other in-browser LLM engine offers this today: WebLLM and Transformers.js both require the full model to finish downloading before any inference can run.
