Image Classification

Image Classification in Nexa: From Single-Stage VLM to a Grounded, Multi-Provider Pipeline

Stanford CS 194, Spring 2026 (Team 24)

Nexa is a civic-issue reporting app that asks its users to take one photo, write a short description, and trust the system to pick the right category (ROAD_DAMAGE, ILLEGAL_DUMPING, STREETLIGHT_OUTAGE, VEHICLE_EMISSIONS, OTHER) and the right municipal agency. The accuracy of that single classification call determines whether a report reaches the correct desk or vanishes into a queue. This page describes how we did image classification before, how we rebuilt it using techniques from multimodal machine learning, the new 76-image evaluation dataset we curated to measure both, and the early performance numbers from the harness.

Before: a single-stage VLM call with majority voting

The original classifier, still live as the baseline mode in the eval harness, is a single LLM call per provider:

Send the raw user image (Base64-encoded JPEG/PNG, exactly as uploaded from the phone) plus the user's free-text description to each of three vision-language models (VLMs):
- OpenAI gpt-4o-mini
- Anthropic claude-haiku-4-5
- Google gemini-2.5-flash
Each model is given the same prompt (CLASSIFICATION_PROMPT) asking it to return strict JSON: {issueType, aiDescription, severity, confidence}.
A consensus function combines the three answers: unanimous → majority → highest-confidence fallback.
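
The voting order above can be sketched as a small pure function. This is a minimal illustration, not the project's `consensus.ts`: the `ProviderVote` shape, the `pickConsensus` name, and the handling of fewer than three responses are assumptions.

```typescript
// Hypothetical shapes: field names follow the JSON contract described above.
type IssueType =
  | "ROAD_DAMAGE"
  | "ILLEGAL_DUMPING"
  | "STREETLIGHT_OUTAGE"
  | "VEHICLE_EMISSIONS"
  | "OTHER";

interface ProviderVote {
  provider: string;
  issueType: IssueType;
  confidence: number; // model-reported, 0..1
}

interface ConsensusOutcome {
  winner: ProviderVote;
  method: "unanimous" | "majority" | "highest-confidence";
}

// Highest-confidence vote in a non-empty group (the first one wins ties).
function mostConfident(group: ProviderVote[]): ProviderVote {
  return group.reduce((best, v) => (v.confidence > best.confidence ? v : best));
}

function pickConsensus(votes: ProviderVote[]): ConsensusOutcome {
  if (votes.length === 0) {
    throw new Error("no provider returned a usable answer");
  }

  // Group the votes by predicted label.
  const byLabel = new Map<IssueType, ProviderVote[]>();
  for (const vote of votes) {
    const group = byLabel.get(vote.issueType) ?? [];
    group.push(vote);
    byLabel.set(vote.issueType, group);
  }
  const groups = Array.from(byLabel.values());
  const largest = groups.reduce((a, b) => (b.length > a.length ? b : a));

  // Assumption: "unanimous" needs at least two responding providers.
  if (groups.length === 1 && votes.length >= 2) {
    return { winner: mostConfident(largest), method: "unanimous" };
  }
  // Strict majority of the providers that responded.
  if (largest.length * 2 > votes.length && votes.length >= 2) {
    return { winner: mostConfident(largest), method: "majority" };
  }
  // No agreement: trust whichever single answer reports the most confidence.
  return { winner: mostConfident(votes), method: "highest-confidence" };
}
```

With two `ROAD_DAMAGE` votes and one very confident `OTHER` vote, this sketch still returns `ROAD_DAMAGE` via `majority`; confidence only decides the winner when no two providers agree.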

This worked, but had three concrete weaknesses that surfaced in casual testing and midpoint user testing:

Photo quality variance was punished hard. Phone photos arrive at ~3 to 5 MB, rotated according to EXIF orientation tags that not every VLM honours, and sometimes with HEIC quirks. The VLMs disagreed more often on rotated or oversized inputs.
No grounding before deciding. Each model had to do object recognition, hazard reasoning, and category assignment in one shot. When the photo was ambiguous (e.g. a puddle vs. a pothole-with-water), confidence dropped and consensus broke.
Location was discarded. The phone reports lat/lng and a reverse-geocoded address, and the photo often carries EXIF GPS. None of it was reaching the classifier, so the model had no jurisdictional prior ("this is a Caltrans-maintained on-ramp" vs. "this is a Palo Alto residential street").

Now: a two-stage, grounded, multi-provider pipeline

The new pipeline (PR #73, branch feat/two-stage-classification-eval) replaces the single call with four explicit steps, each individually inspectable.

Stage 0: Server-side image preprocessing (`preprocess.ts`)

We treat the raw upload as untrusted, large, and potentially mis-oriented. Using sharp (libvips-backed) and exifr:

Decode the (possibly data-URL-wrapped) Base64.
Pull GPS from the EXIF block before touching the bytes: sharp's auto-rotate removes the EXIF orientation tag, and unless metadata is explicitly preserved the re-encoded output carries no EXIF at all, GPS included.
Auto-rotate based on EXIF orientation.
Downscale so neither dimension exceeds 1024 px, preserving aspect ratio (fit: "inside", withoutEnlargement: true).
Re-encode as JPEG at quality 80 with mozjpeg.

A typical 3 to 5 MB phone photo becomes a 100 to 300 KB upright JPEG, roughly 10 to 50 times smaller. Every downstream VLM call uploads less data and, because the longest side is capped at 1024 px, typically spends fewer image tokens. The EXIF GPS also becomes available as a location hint even when the client did not send coordinates.
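
To make the first and fourth steps concrete, here is a dependency-free sketch of the data-URL handling and of the size rule that `fit: "inside"` with `withoutEnlargement: true` implies. Both helper names are hypothetical, and sharp's own rounding may differ from this arithmetic by a pixel.

```typescript
interface Dimensions {
  width: number;
  height: number;
}

// Strip an optional "data:<mime>;base64," prefix, leaving the raw Base64 payload.
function stripDataUrlPrefix(input: string): string {
  const match = /^data:[^;,]+;base64,/.exec(input);
  return match ? input.slice(match[0].length) : input;
}

// Scale so that neither side exceeds maxSide, keeping the aspect ratio.
// Capping the scale at 1 means small images are never enlarged.
function fitInside(original: Dimensions, maxSide = 1024): Dimensions {
  const scale = Math.min(maxSide / original.width, maxSide / original.height, 1);
  return {
    width: Math.max(1, Math.round(original.width * scale)),
    height: Math.max(1, Math.round(original.height * scale)),
  };
}

console.log(fitInside({ width: 4032, height: 3024 })); // { width: 1024, height: 768 }
console.log(fitInside({ width: 800, height: 600 }));   // unchanged: { width: 800, height: 600 }
```

In sharp itself this rule corresponds to `resize({ width: 1024, height: 1024, fit: "inside", withoutEnlargement: true })`, so the helper is only useful for reasoning about output sizes, not as a replacement for the library call.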

Stage 1: Grounded visual observation (`observe.ts`)

A single cheap call to gpt-4o-mini (detail: "low", max_tokens: 250, temperature: 0.1) extracts a structured observation of the scene:

{
  "objects":    ["pothole", "sedan", "trash bag"],
  "conditions": ["cracked asphalt", "standing water"],
  "hazards":    ["trip hazard", "obstructing lane"],
  "scene":      "Damaged residential street with standing water near curb."
}

The Stage 1 model is explicitly forbidden from classifying. It only describes what is visible. This is the same idea behind "chain-of-thought" and "observe then decide" patterns in current VLM research: separating the perceptual step from the categorical step makes both steps shorter, cheaper, and easier to debug. If Stage 1 fails for any reason, Stage 2 still runs on the image alone, without the observation block, as a graceful fallback.
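
One way to get that graceful fallback is to parse the Stage 1 reply defensively and return `null` instead of throwing, so the caller can simply skip grounding. The sketch below assumes the four-field JSON shape shown above; `parseObservation` and its leniency rules (non-string array items are dropped, a missing `scene` invalidates the reply) are illustrative choices, not the project's code.

```typescript
interface Observation {
  objects: string[];
  conditions: string[];
  hazards: string[];
  scene: string;
}

// Keep only string entries; anything that is not an array becomes an empty list.
function toStringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((item): item is string => typeof item === "string")
    : [];
}

// Returns null rather than throwing so the caller can continue without grounding.
function parseObservation(raw: string): Observation | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return null;
    const record = parsed as Record<string, unknown>;
    // Assumption: a reply without a scene sentence is treated as unusable.
    if (typeof record.scene !== "string") return null;
    return {
      objects: toStringArray(record.objects),
      conditions: toStringArray(record.conditions),
      hazards: toStringArray(record.hazards),
      scene: record.scene,
    };
  } catch {
    return null; // not valid JSON at all
  }
}
```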

Stage 2: Multi-provider consensus with grounding

The Stage 1 observation is rendered into a compact text block and appended to the Stage 2 prompt for all three providers. So is the location context (caller-supplied lat/lng, address, jurisdiction, plus EXIF GPS as a fallback if the caller lacked coordinates):

Stage-1 visual observations:
  Scene: Damaged residential street with standing water near curb.
  Objects: pothole, sedan, trash bag
  Conditions: cracked asphalt, standing water
  Hazards: trip hazard, obstructing lane

Report location context:
  Address: 450 Serra Mall, Stanford, CA 94305
  Coordinates: 37.42830, -122.16860
  Jurisdiction: Santa Clara County
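
The two text blocks above could be produced by renderers along these lines. This is a sketch that reproduces the layout shown, assuming two-space indentation and five decimal places for coordinates; `renderObservationBlock` and `renderLocationBlock` are hypothetical names rather than the project's actual helpers.

```typescript
interface Observation {
  objects: string[];
  conditions: string[];
  hazards: string[];
  scene: string;
}

interface LocationContext {
  latitude?: number;
  longitude?: number;
  address?: string;
  jurisdiction?: string;
}

// Flatten the Stage 1 JSON into indented "Key: value" lines, skipping empty lists.
function renderObservationBlock(obs: Observation): string {
  const lines = ["Stage-1 visual observations:", `  Scene: ${obs.scene}`];
  if (obs.objects.length > 0) lines.push(`  Objects: ${obs.objects.join(", ")}`);
  if (obs.conditions.length > 0) lines.push(`  Conditions: ${obs.conditions.join(", ")}`);
  if (obs.hazards.length > 0) lines.push(`  Hazards: ${obs.hazards.join(", ")}`);
  return lines.join("\n");
}

// Emit only the location fields that are actually present.
function renderLocationBlock(loc: LocationContext): string {
  const lines = ["Report location context:"];
  if (loc.address) lines.push(`  Address: ${loc.address}`);
  if (loc.latitude !== undefined && loc.longitude !== undefined) {
    lines.push(`  Coordinates: ${loc.latitude.toFixed(5)}, ${loc.longitude.toFixed(5)}`);
  }
  if (loc.jurisdiction) lines.push(`  Jurisdiction: ${loc.jurisdiction}`);
  return lines.join("\n");
}
```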

All three VLMs (OpenAI, Anthropic, Google) then return their own {issueType, aiDescription, severity, confidence} JSON, and the consensus function picks the winner: unanimous, then majority-of-three, then highest-confidence fallback.

Why this is at the forefront

Three ideas from current multimodal-LLM practice show up here, used together:

Decoupled perception and decision. Stage 1 / Stage 2 mirrors the "visual chain-of-thought" line of work (set-of-mark prompting, visual program induction, observe-then-reason agents): one model grounds, another decides. Each is small, fast, and interpretable in isolation.
Multi-model consensus / ensembling at inference time. Instead of fine-tuning one model on civic-issue data we don't have, we pool three frontier VLMs with diverse training distributions and let agreement do the heavy lifting. This is the same principle behind self-consistency and mixture-of-agents.
Cross-modal context injection. Location, EXIF GPS, and jurisdiction are folded into the text channel of a vision call, giving the VLM a prior over plausible categories without retraining. The classifier "knows" it's looking at a Caltrans on-ramp, not just an unlabelled photo of asphalt.

A new, robust evaluation dataset

To measure whether any of this actually helps, we built a 76-image evaluation set that didn't exist before. The dataset and harness live in nexa/eval/.

Sourcing

Images are pulled programmatically from Wikimedia Commons via the MediaWiki API (eval/dataset/fetch.ts). We query a curated set of categories that map cleanly to Nexa's IssueType enum:

Wikimedia category	Expected `IssueType`
Potholes	`ROAD_DAMAGE`
Road damage	`ROAD_DAMAGE`
Damaged street lights	`STREETLIGHT_OUTAGE`
Illegal dumping	`ILLEGAL_DUMPING`
Litter	`ILLEGAL_DUMPING`
Vehicles emitting smoke	`VEHICLE_EMISSIONS`
Exhaust smoke	`VEHICLE_EMISSIONS`

The fetcher throttles requests at 1.1 s/call and retries HTTP 429 with exponential backoff, per Wikimedia's robot policy. Each kept image is filtered for size (20 KB to 4 MB) and type (JPEG/PNG). We persist its URL, MIME, dimensions, byte count, license, attribution, EXIF GPS, and caption alongside its expected label.
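
The filtering and retry policy described above might look like the following sketch. The helper names, the 1024-byte kilobyte, the one-second base delay, the 30-second cap, and the retry count are all assumptions made for illustration; the real `fetch.ts` may use different values.

```typescript
interface Candidate {
  mime: string;
  bytes: number;
}

const MIN_BYTES = 20 * 1024;        // 20 KB, assuming 1 KB = 1024 bytes
const MAX_BYTES = 4 * 1024 * 1024;  // 4 MB
const ALLOWED_MIME = ["image/jpeg", "image/png"];

// Keep only JPEG/PNG files inside the size window.
function keepCandidate(c: Candidate): boolean {
  return ALLOWED_MIME.indexOf(c.mime) !== -1 && c.bytes >= MIN_BYTES && c.bytes <= MAX_BYTES;
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

interface HttpResult<T> {
  status: number;
  body: T;
}

// Retry while the server answers 429, sleeping longer each time.
// `call` and `sleep` are injected so the policy can be exercised without a network.
async function withBackoff<T>(
  call: () => Promise<HttpResult<T>>,
  sleep: (ms: number) => Promise<void>,
  maxRetries = 4,
): Promise<HttpResult<T>> {
  for (let attempt = 0; ; attempt += 1) {
    const result = await call();
    if (result.status !== 429 || attempt >= maxRetries) return result;
    await sleep(backoffDelayMs(attempt));
  }
}
```

Injecting `sleep` keeps the retry loop deterministic in tests; in a real run it would be a thin wrapper around `setTimeout`.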

The team's three existing internal nexa/test-photos/*.jpg files are appended as ground-truth anchor cases.

Composition

Class	Cases	Share
`ROAD_DAMAGE`	25	32.9%
`ILLEGAL_DUMPING`	25	32.9%
`VEHICLE_EMISSIONS`	14	18.4%
`STREETLIGHT_OUTAGE`	12	15.8%
Total	76	100%

By source: 73 cases from Wikimedia Commons (multiple CC and public-domain licenses, with CC BY-SA 4.0 dominating at 24 images and public domain second at 18) and 3 internal test photos. The full license breakdown is preserved in cases.json so the dataset can be redistributed legally.

Methodology and metrics

For each case in each mode (baseline and two-stage) we record per-prediction: whether the predicted issueType matches the expected one (ok), the consensus method used (unanimous / majority / highest-confidence / fallback), the model-reported confidence, and the wall-clock latency for the full pipeline.

These are aggregated into:

overall accuracy
per-class accuracy (with support counts)
a class-by-class confusion matrix
mean and p90 latency
consensus-method breakdown (how often did the three providers agree?)
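
As a rough illustration of how these aggregates fall out of the per-case rows, the sketch below computes accuracy, per-class accuracy with support, a confusion matrix, and mean and p90 latency. The `PredictionRow` and `Summary` shapes are hypothetical, and the nearest-rank percentile is an assumed definition; the harness may compute p90 differently.

```typescript
interface PredictionRow {
  expected: string;
  predicted: string;
  latencyMs: number;
}

interface Summary {
  accuracy: number;
  perClass: Record<string, { support: number; accuracy: number }>;
  confusion: Record<string, Record<string, number>>; // confusion[expected][predicted]
  meanLatencyMs: number;
  p90LatencyMs: number;
}

function summarize(rows: PredictionRow[]): Summary {
  if (rows.length === 0) throw new Error("no predictions to aggregate");

  const confusion: Record<string, Record<string, number>> = {};
  let correct = 0;
  let latencySum = 0;
  for (const row of rows) {
    if (!confusion[row.expected]) confusion[row.expected] = {};
    const cells = confusion[row.expected];
    cells[row.predicted] = (cells[row.predicted] ?? 0) + 1;
    if (row.predicted === row.expected) correct += 1;
    latencySum += row.latencyMs;
  }

  // Support is the row total; per-class accuracy is the diagonal cell over support.
  const perClass: Summary["perClass"] = {};
  for (const label of Object.keys(confusion)) {
    const cells = confusion[label];
    const support = Object.keys(cells).reduce((sum, k) => sum + cells[k], 0);
    perClass[label] = { support, accuracy: (cells[label] ?? 0) / support };
  }

  // Nearest-rank p90: smallest latency with at least 90% of samples at or below it.
  const sorted = rows.map((r) => r.latencyMs).sort((a, b) => a - b);
  const p90LatencyMs = sorted[Math.ceil(0.9 * sorted.length) - 1];

  return {
    accuracy: correct / rows.length,
    perClass,
    confusion,
    meanLatencyMs: latencySum / rows.length,
    p90LatencyMs,
  };
}
```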

Conditions compared

Run	Image preprocess	Stage-1 observation	Location in prompt
`baseline`	raw image direct to VLMs	none	none
`two-stage`	sharp resize + EXIF GPS	gpt-4o-mini observation pass	EXIF + caller GPS

Both runs hit the same three VLMs and use the same consensus voting. The only differences are the inputs the VLMs receive.

Files added and changed

A high-level map of what each file in this PR contributes. Library code lives under nexa/src/lib/classify/, API routing under nexa/src/app/api/, and the offline evaluation harness under nexa/eval/.

Pipeline library

nexa/src/lib/classify/preprocess.ts (new). Exports preprocessImage(input) and the PreprocessedImage interface. Decodes a Base64 input (tolerating an optional data-URL prefix), pulls EXIF GPS via exifr before any pixel manipulation, auto-rotates by EXIF orientation with sharp.rotate(), downscales to a 1024 px bounding box preserving aspect ratio, and re-encodes as JPEG quality 80 with mozjpeg. Returns the normalized data URL, the raw Base64, byte length, output dimensions, original dimensions, and extracted GPS.

nexa/src/lib/classify/observe.ts (new). Exports observeImage(dataUrl, description?), renderObservation(obs), and the Observation interface. observeImage runs a single low-cost gpt-4o-mini call (detail: "low", temperature: 0.1, max_tokens: 250) with a prompt that forbids classification and asks for strict-JSON {objects, conditions, hazards, scene}. renderObservation flattens that JSON into a compact text block that downstream providers can paste into their prompts.

nexa/src/lib/classify/types.ts (modified). Adds the LocationContext interface ({latitude, longitude, address, jurisdiction}) and the buildClassificationPrompt({observationBlock, location}) helper that composes the existing CLASSIFICATION_PROMPT with an optional stage-1 observation block and a rendered location section. The original CLASSIFICATION_PROMPT string is kept exported unchanged so the baseline path is byte-for-byte identical to what it was before.

nexa/src/lib/classify/consensus.ts (heavily modified). Adds ConsensusOptions ({twoStage, location}) and ExtendedComparisonResult. The exported classifyWithConsensus now optionally runs Stage 0 (preprocess) and Stage 1 (observe) before fanning out to the three providers in parallel with a grounded Stage 2 prompt. Consensus voting (unanimous → majority → highest-confidence → fallback) is unchanged. EXIF GPS extracted in Stage 0 is folded into the location block when the caller did not supply coordinates.

nexa/src/lib/classify/openai-provider.ts, anthropic-provider.ts, google-provider.ts (modified). Each classifyWith{OpenAI,Anthropic,Google} function now accepts an options.prompt override. When omitted, the providers fall back to CLASSIFICATION_PROMPT and behave exactly as before. When provided, the grounded two-stage prompt is used instead. No other provider behaviour changed.
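
The prompt composition and the per-provider override described in the last two entries can be sketched together. This is a simplification under stated assumptions: `BASE_PROMPT` is a placeholder rather than the real `CLASSIFICATION_PROMPT` text, `composePrompt` takes an already-rendered location string (the real helper is described as taking a location object), and the blank-line separator is a guess.

```typescript
// Placeholder only: the real CLASSIFICATION_PROMPT text is not reproduced here.
const BASE_PROMPT = "Classify the civic issue and return strict JSON.";

interface PromptParts {
  observationBlock?: string;
  locationBlock?: string;
}

// Append whichever optional blocks are present, separated by blank lines.
// With no blocks, the base prompt comes back unchanged.
function composePrompt(base: string, parts: PromptParts = {}): string {
  const sections = [base];
  if (parts.observationBlock) sections.push(parts.observationBlock);
  if (parts.locationBlock) sections.push(parts.locationBlock);
  return sections.join("\n\n");
}

interface ProviderOptions {
  prompt?: string;
}

// What a provider would send: the override if given, otherwise the untouched base prompt.
function resolvePrompt(options?: ProviderOptions): string {
  return options?.prompt ?? BASE_PROMPT;
}

// Baseline path: no options, so the base prompt is used as-is.
console.log(resolvePrompt() === BASE_PROMPT); // true
```

Keeping the override optional is what lets the baseline eval mode and any pre-existing callers run without passing anything new.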

API integration

nexa/src/app/api/reports/classify/route.ts (modified). The POST handler now accepts optional latitude, longitude, address, and jurisdiction fields in the JSON body. It assembles a LocationContext from whichever are present and invokes classifyWithConsensus with {twoStage: true, location}. The response shape is unchanged for existing callers.
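
Assembling the location context from "whichever fields are present" could be done roughly as follows. The `locationFromBody` helper and its validation rules (finite numbers, non-blank strings) are assumptions for illustration; only the four field names come from the description above.

```typescript
interface LocationContext {
  latitude?: number;
  longitude?: number;
  address?: string;
  jurisdiction?: string;
}

// Copy over only well-formed fields; return undefined when none are usable.
function locationFromBody(body: Record<string, unknown>): LocationContext | undefined {
  const { latitude, longitude, address, jurisdiction } = body;
  const ctx: LocationContext = {};
  if (typeof latitude === "number" && Number.isFinite(latitude)) ctx.latitude = latitude;
  if (typeof longitude === "number" && Number.isFinite(longitude)) ctx.longitude = longitude;
  if (typeof address === "string" && address.trim() !== "") ctx.address = address.trim();
  if (typeof jurisdiction === "string" && jurisdiction.trim() !== "") {
    ctx.jurisdiction = jurisdiction.trim();
  }
  return Object.keys(ctx).length > 0 ? ctx : undefined;
}
```

Returning `undefined` for an empty context means a request that carries only `description` and `imageBase64` takes the same path it did before the location fields existed.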

Evaluation harness

nexa/eval/dataset/fetch.ts (new). Builds cases.json by querying Wikimedia Commons via the MediaWiki API. Exports the DatasetCase interface used by the runner. Iterates a category-to-IssueType mapping, requests up to 3× the per-category cap of candidates, filters by MIME (JPEG/PNG) and size (20 KB to 4 MB), and persists URL, MIME, dimensions, byte count, license, attribution, EXIF GPS, and a stripped caption. Throttles requests at 1.5 s/call and retries HTTP 429 / 5xx with exponential backoff. Three internal team test photos are appended as ground-truth anchors. With --download it also caches image bytes to eval/dataset/_cache/.

nexa/eval/dataset/cases.json (new). The serialized 76-case manifest produced by fetch.ts: 73 Wikimedia Commons cases plus 3 team test photos. Each row carries the URL, expected label, source category, MIME, dimensions, bytes, license, attribution, EXIF GPS, and caption.

nexa/eval/metrics.ts (new). Exports the CasePrediction and AggregateMetrics interfaces plus aggregate(predictions), renderReport(metrics), and diffReport(baseline, twoStage). aggregate computes overall accuracy, per-class accuracy with support counts, the class-by-class confusion matrix, mean and p90 latency, mean reported confidence, and the consensus-method breakdown. The renderers produce the pretty-printed stdout report and the baseline-vs-two-stage delta block that lands in SUMMARY.md.

nexa/eval/run.ts (new). The CLI entry point. Parses --mode={baseline|two-stage|both}, --limit=N, and --no-download flags. Loads cases.json, ensures each image is locally cached (downloading on demand unless told otherwise), reads it as Base64, and invokes classifyWithConsensus with the right twoStage flag for each mode. Writes per-case predictions plus aggregate metrics to eval/results/baseline.json and eval/results/two-stage.json, and prints the human-readable delta to stdout.
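
The three flags could be parsed with a small hand-rolled function like the one below. This is a sketch, not the actual `run.ts`: the `parseEvalArgs` name, the `EvalArgs` shape, and the strict rejection of unknown flags are assumptions. The defaults (`both`, downloading enabled) follow the behaviour described here and in the run recipe.

```typescript
type EvalMode = "baseline" | "two-stage" | "both";

interface EvalArgs {
  mode: EvalMode;
  limit?: number;
  download: boolean;
}

function parseEvalArgs(argv: string[]): EvalArgs {
  // Defaults: run both modes and download missing images on demand.
  const args: EvalArgs = { mode: "both", download: true };
  for (const arg of argv) {
    if (arg === "--no-download") {
      args.download = false;
    } else if (arg.startsWith("--mode=")) {
      const value = arg.slice("--mode=".length);
      if (value !== "baseline" && value !== "two-stage" && value !== "both") {
        throw new Error(`unknown mode: ${value}`);
      }
      args.mode = value;
    } else if (arg.startsWith("--limit=")) {
      const n = Number(arg.slice("--limit=".length));
      if (!Number.isInteger(n) || n <= 0) throw new Error(`invalid limit: ${arg}`);
      args.limit = n;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return args;
}

// Typical use inside a CLI entry point: parseEvalArgs(process.argv.slice(2))
```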

nexa/eval/results/SUMMARY.md (new). Human-readable digest of the most recent run: headline accuracy / latency / confidence table, per-class accuracy, consensus-method breakdown, both confusion matrices, the per-case crosstab (both right / both wrong / rescued / regressed), and the robustness-wins table. Regenerated by hand from the JSON results when a fresh eval is run.

nexa/eval/README.md (new). Documentation for the harness: what it answers, the directory layout, the methodology, the conditions compared, the run recipe, the expected cost, and the limitations.

Dependencies and tooling

nexa/package.json, package-lock.json (modified). Adds three dependencies: sharp (image preprocessing), exifr (EXIF / GPS extraction), and tsx (TypeScript runner used by the eval scripts). Adds the eval npm scripts: eval:fetch, eval:fetch:download, eval, eval:baseline, eval:two-stage.

nexa/.gitignore (modified). Ignores eval/dataset/_cache/ (downloaded image bytes are not redistributed in-repo) and eval/results/{baseline,two-stage}.json (raw per-case predictions are large and machine-regenerable, so the human-readable SUMMARY.md is kept in version control instead).

How the pieces fit together

This section ties the per-file changes back into a single end-to-end picture: what gets called in what order when a user files a report, where the new information is generated, and which downstream consumers learn about it.

End-to-end request path

[Phone browser]
  user takes photo + types description + (optional) hits "Detect" for GPS
        |
        v
POST /api/reports/classify
  { description, imageBase64, latitude?, longitude?, address?, jurisdiction? }
        |
        v
[route.ts]  assemble LocationContext if any of lat/lng/address/jurisdiction set
        |
        v
classifyWithConsensus(description, imageBase64, { twoStage: true, location })
        |
        v
[consensus.ts orchestrates the pipeline]
        |
        +--> Stage 0: preprocess.ts
        |       sharp + exifr -> { dataUrl, base64, byteLength, exifGps,
        |                          width, height, originalWidth, originalHeight }
        |       (if Stage 0 fails, fall back to raw image and skip Stage 1)
        |
        +--> Stage 1: observe.ts (gpt-4o-mini, low detail)
        |       -> { objects, conditions, hazards, scene, latencyMs }
        |       renderObservation(obs) -> compact text block
        |       (if Stage 1 errors, skip and continue without grounding)
        |
        +--> buildClassificationPrompt({ observationBlock, location })
        |       composes CLASSIFICATION_PROMPT + obs block + location block
        |
        +--> fan out in parallel to:
        |       openai-provider.ts    (gpt-4o-mini)
        |       anthropic-provider.ts (claude-haiku-4-5)
        |       google-provider.ts    (gemini-2.5-flash)
        |       each gets the preprocessed image + the augmented prompt
        |
        +--> consensus: unanimous -> majority -> highest-confidence -> fallback
        |
        v
ExtendedComparisonResult
  { winner, providers[3], method,
    observation, preprocess, locationUsed }
        |
        v
NextResponse.json(result)  -> phone shows category + agency routing

The eval harness (nexa/eval/run.ts) calls the same classifyWithConsensus entry point as the API route. The only difference is that the harness reads its image bytes from cases.json plus the local cache instead of receiving them in a request body, and it iterates the call once per mode (baseline: twoStage: false, two-stage: twoStage: true) so the two pipelines can be compared on identical inputs.

Where new information enters the system

Four pieces of information are produced or surfaced by the new code that didn't exist in the system before. Each enters at a specific file and propagates to a specific set of downstream consumers.

Preprocessed image bytes. Produced by preprocess.ts (Stage 0). Replaces the raw user upload everywhere downstream: stage 1 sees the preprocessed data URL, all three provider calls in stage 2 see the preprocessed Base64. The original bytes are never sent to any VLM in the two-stage path.

Stage-1 observation block. Produced by observe.ts via renderObservation(obs). Threaded into the stage-2 prompt for all three providers via buildClassificationPrompt in types.ts, then carried through the new options.prompt parameter on each provider's classifyWith* function. Also surfaced on the response as observation on ExtendedComparisonResult so callers (or the eval harness) can inspect it.

EXIF GPS. Extracted by preprocess.ts from the raw image before any other manipulation. Consumed by consensus.ts, which folds it into the LocationContext as a fallback when the caller did not supply lat/lng. From there it flows into the stage-2 prompt via the location block, and also surfaces on the response as part of locationUsed.

Location context. Assembled by route.ts from the caller's lat/lng, address, and jurisdiction. Carried by consensus.ts into buildClassificationPrompt, which renders it as a small bullet list at the end of the stage-2 prompt. All three providers see it.
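
The interaction between the last two items, caller-supplied location first and EXIF GPS only as a fallback, can be sketched as a single merge step. `resolveLocation`, `GpsPoint`, and the `gpsSource` tag are hypothetical; the precedence rule is the one described above.

```typescript
interface GpsPoint {
  latitude: number;
  longitude: number;
}

interface LocationContext {
  latitude?: number;
  longitude?: number;
  address?: string;
  jurisdiction?: string;
}

interface ResolvedLocation {
  location?: LocationContext;
  gpsSource: "caller" | "exif" | "none";
}

function resolveLocation(
  caller: LocationContext | undefined,
  exifGps: GpsPoint | undefined,
): ResolvedLocation {
  // Caller-supplied coordinates always win.
  if (caller && caller.latitude !== undefined && caller.longitude !== undefined) {
    return { location: caller, gpsSource: "caller" };
  }
  // Otherwise fill coordinates from EXIF, keeping any address or jurisdiction the caller sent.
  if (exifGps) {
    return {
      location: { ...caller, latitude: exifGps.latitude, longitude: exifGps.longitude },
      gpsSource: "exif",
    };
  }
  return { location: caller, gpsSource: "none" };
}
```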

What is intentionally unchanged

The PR is layered, not invasive. Four things are deliberately identical to the pre-PR behaviour so the baseline is reproducible and the surface area of the change is small.

The exported CLASSIFICATION_PROMPT string. Same bytes as before, still exported. buildClassificationPrompt composes it with the new blocks instead of replacing it.
The consensus voting rule. Unanimous, then majority-of-three, then highest-confidence, then fallback. Implemented in the same code path. Any accuracy delta between modes is therefore attributable to inputs, not to voting logic.
The provider call signatures. Each classifyWith*(description, imageBase64, options?) adds one optional options parameter. Callers that pass no options get exactly the old behaviour. The baseline eval mode relies on this.
The classify-route response shape. Existing front-end callers continue to read {winner, providers, method} as before. The new observation, preprocess, and locationUsed fields on ExtendedComparisonResult are additive.

Where the eval harness plugs in

The eval harness is a peer of the API route, not a layer beneath it. Both call the same classifyWithConsensus entry point with the same argument shape, so a behavioural change observed in the harness is by construction a change visible to real users in production. eval/run.ts drives the loop, eval/metrics.ts aggregates per-case predictions, eval/dataset/fetch.ts builds the manifest, and eval/results/SUMMARY.md is the human-readable artifact. The Performance section that follows is generated entirely from one pair of harness runs over the 76-case dataset.

Performance

The full eval was run across the 76-case dataset, hitting OpenAI gpt-4o-mini, Anthropic claude-haiku-4-5, and Google gemini-2.5-flash for each case, in both modes. Raw per-case results live in eval/results/baseline.json and eval/results/two-stage.json, with a human-readable digest in eval/results/SUMMARY.md.

Overall accuracy

Mode	Accuracy	Mean confidence	Mean latency	p90 latency
`baseline`	92.1% (70/76)	0.938	5,471 ms	8,818 ms
`two-stage`	89.5% (68/76)	0.950	7,302 ms	9,383 ms
Δ	−2.6 pp	+0.012	+1,832 ms	+565 ms

On this dataset the two-stage pipeline cost 2.6 percentage points of accuracy and added roughly 1.8 s of mean latency, while producing slightly higher mean reported confidence. The accuracy regression is the headline result, and the next subsections unpack where it came from.

Per-class accuracy

Class	Support	Baseline	Two-stage	Δ
`ROAD_DAMAGE`	25	100.0%	92.0%	−8.0 pp
`ILLEGAL_DUMPING`	25	96.0%	96.0%	±0
`VEHICLE_EMISSIONS`	14	92.9%	92.9%	±0
`STREETLIGHT_OUTAGE`	12	66.7%	66.7%	±0

The entire two-stage regression comes from ROAD_DAMAGE: two cases the baseline got right (a flood-damaged road and a Mexican-street pothole) were re-classified as OTHER by the two-stage pipeline. Every other class is flat. STREETLIGHT_OUTAGE is stuck at 66.7% in both modes, which traces to a dataset-quality issue: four of the 12 streetlight cases are off-topic photos, such as the Warsaw-ghetto ruins and a tornado-damaged church.

Consensus method breakdown

How often did the three providers actually agree?

Method	Baseline	Two-stage
unanimous	63	67
majority	8	9
highest-confidence	5	0

This is the strongest qualitative signal in the run. With grounded Stage-1 observations and location context in the prompt, all three providers agreed unanimously on 67 of 76 cases (up from 63), and the highest-confidence tiebreaker, which fires only when no two providers can agree at all, never had to be used (down from 5 cases). The two-stage pipeline is producing tighter inter-provider agreement even where the final label happens to be wrong.

Confusion matrices

Rows are expected, columns are predicted. The only cells that move between modes are in the ROAD_DAMAGE row: the OTHER column goes from 0 to 2, and the ROAD_DAMAGE column from 25 to 23.

Baseline:

expected	`ILLEGAL_DUMPING`	`OTHER`	`ROAD_DAMAGE`	`STREETLIGHT_OUTAGE`	`VEHICLE_EMISSIONS`
`ILLEGAL_DUMPING`	24	1	0	0	0
`ROAD_DAMAGE`	0	0	25	0	0
`STREETLIGHT_OUTAGE`	0	4	0	8	0
`VEHICLE_EMISSIONS`	0	1	0	0	13

Two-stage:

expected	`ILLEGAL_DUMPING`	`OTHER`	`ROAD_DAMAGE`	`STREETLIGHT_OUTAGE`	`VEHICLE_EMISSIONS`
`ILLEGAL_DUMPING`	24	1	0	0	0
`ROAD_DAMAGE`	0	2	23	0	0
`STREETLIGHT_OUTAGE`	0	4	0	8	0
`VEHICLE_EMISSIONS`	0	1	0	0	13

Per-case crosstab

Outcome	Cases
Both modes correct	67
Both modes wrong	5
Baseline wrong, two-stage right	1
Baseline right, two-stage wrong	3

The single case the two-stage pipeline rescued was Broken_4182613125_.jpg (a streetlight photo whose caption was about EXIF metadata rather than the subject), which baseline labelled OTHER and two-stage corrected to STREETLIGHT_OUTAGE. The three cases it lost were all ROAD_DAMAGE photos re-classified as OTHER: Flood_damage_in_American_Fork_Canyon_June_2023.jpg, Vddj-1.jpg, and M_Infraestrutura.jpg. Net effect: 1 − 3 = −2 cases, which is the −2.6 pp seen at the top.
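
The crosstab and its link to the headline delta can be reproduced with a few lines. The `PairedOutcome` shape and both function names are hypothetical; the `reported` counts are the ones from the table above.

```typescript
interface PairedOutcome {
  caseId: string;
  baselineOk: boolean;
  twoStageOk: boolean;
}

interface Crosstab {
  bothRight: number;
  bothWrong: number;
  rescued: number;   // baseline wrong, two-stage right
  regressed: number; // baseline right, two-stage wrong
}

function crosstab(rows: PairedOutcome[]): Crosstab {
  const tab: Crosstab = { bothRight: 0, bothWrong: 0, rescued: 0, regressed: 0 };
  for (const row of rows) {
    if (row.baselineOk && row.twoStageOk) tab.bothRight += 1;
    else if (!row.baselineOk && !row.twoStageOk) tab.bothWrong += 1;
    else if (row.twoStageOk) tab.rescued += 1;
    else tab.regressed += 1;
  }
  return tab;
}

// Accuracy change (two-stage minus baseline) in percentage points.
function netDeltaPp(tab: Crosstab): number {
  const total = tab.bothRight + tab.bothWrong + tab.rescued + tab.regressed;
  return ((tab.rescued - tab.regressed) / total) * 100;
}

// Counts reported in the table above: (1 - 3) / 76 cases.
const reported: Crosstab = { bothRight: 67, bothWrong: 5, rescued: 1, regressed: 3 };
console.log(netDeltaPp(reported).toFixed(1)); // "-2.6"
```

Only the two off-diagonal buckets affect the accuracy delta; the both-right and both-wrong counts matter solely through the total.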

Robustness wins independent of accuracy

The headline accuracy regression understates the operational case for the two-stage pipeline. Three things only the two-stage run got right:

Failure mode	Baseline	Two-stage
Anthropic 5 MB image rejections (raw phone-sized photos)	~5 cases	0
Highest-confidence tiebreaker fired (no provider agreement)	5 cases	0
EXIF GPS extracted and used as fallback location	0	24 / 76

Anthropic rejected several baseline calls outright because the raw phone-sized JPEGs exceeded its 5 MB inline-image limit, and those cases only landed at all because the consensus rule fell back to whichever of the remaining two providers had higher confidence. The two-stage pipeline preprocesses every image down to 100 to 300 KB before any provider sees it, so Anthropic accepts every case and the highest-confidence tiebreaker never has to fire. EXIF GPS, which the baseline ignores entirely, was successfully extracted as a location hint on 24 of the 76 cases.

Reading the result honestly

On this dataset, two-stage classification did not beat baseline classification on raw accuracy. It cost 2.6 pp and 1.8 s of mean latency to:

eliminate provider-side image-size rejections,
drive up the unanimous-agreement rate from 63 to 67 of 76 cases,
drive the no-agreement tiebreaker rate from 5 cases to 0,
recover EXIF GPS on roughly a third of cases.

The two-stage regression is also concentrated entirely in ROAD_DAMAGE, where the grounded Stage-1 observations seem to be too cautious: faced with a flood-damaged road or a debris-strewn pothole, the augmented prompt is nudging models toward OTHER. That is a fixable prompt issue, not a fundamental property of the pipeline. The robustness and consensus-quality results, by contrast, are properties of the architecture and would persist across prompt tweaks.

For a civic-reporting product where a wrong-but-confident label routes a report to the wrong agency, higher inter-provider agreement and zero provider rejections are arguably worth a small accuracy haircut on a Commons-biased eval. We will revisit the trade-off on real user submissions when we have enough of them to label.

Limitations

Wikimedia Commons photos are biased toward "good" examples of each category, since they are curated and well-lit. Real user submissions to Nexa will be lower-quality. This eval is therefore an optimistic ceiling on what we can expect from the models in production, not a tight estimate of production accuracy.
EXIF GPS extracted from Commons photos points at the photographer's location, which usually correlates with the depicted issue but isn't guaranteed. We treat the EXIF GPS as a hint, not as ground truth.
The OTHER class is under-sampled because Commons categories are topic-specific. Confusion-matrix off-diagonals involving OTHER should be interpreted with that in mind.
All three providers are closed-source frontier models. Their behaviour can shift between releases, so the eval is point-in-time.

Reproducing the eval

cd nexa
# Requires OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY in .env.local
npx tsx eval/dataset/fetch.ts              # rebuild cases.json
npx tsx eval/dataset/fetch.ts --download   # cache image bytes locally
npx tsx eval/run.ts                        # run both baseline and two-stage

Results land in eval/results/baseline.json and eval/results/two-stage.json, plus a pretty-printed delta on stdout.
