Add RELLIS-3D evaluation harness + reproducibility note by Parv-Maheshwari · Pull Request #2 · SimonSchwaiger/otas

Parv-Maheshwari · 2026-05-21T23:12:21Z

The released OTAS code does not include the multi-class evaluation script that produced the RELLIS-3D Table V mIoU numbers. Only the single-(pos, neg) -> binary-mask single_inference API is shipped, and there is no RELLIS-3D dataset loader. This PR adds the missing pieces so the result can be reproduced from a single OTAS checkout:

- `own_eval/own_RELLIS.py`: RELLIS-3D driver, with `--paper_config` to match the hyperparameters described in supplementary §VII.A.
- `own_eval/rellis_dataset.py`: PyTorch Dataset for RELLIS-3D's 20-class `ontology.yaml` ontology (also exposes the HRNet-19 `label_mapping` variant for completeness, default off).
- `own_eval/eval_common.py`: per-frame inference + confusion-matrix-pooled IoU, forwards arbitrary `config_overrides` to `OTASEncoder`.
- `own_eval/otas_segmentor.py`: N-way adapter that wraps `language_map.embed_image` -> per-class `clip_similarity` -> argmax. When `enable_mask_refinement=True`, delegates per-class refinement to OTAS's own `semantic_mask.binary_mask_refined(..., ret_dict=True)` so the SAM-on path reuses upstream code verbatim.
- `docs/RELLIS_REPLICATION.md`: PR description with the paper claim, our matching config, the result, the ruled-out hypotheses, and the remaining open questions for upstream.
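The N-way adapter described above can be pictured with a minimal sketch. It assumes each class already has a per-pixel similarity map of the same shape; the function name `argmax_segmentation` and the toy class IDs are hypothetical and are not taken from `otas_segmentor.py` or the OTAS API.

```python
import numpy as np

def argmax_segmentation(similarity_maps, class_ids):
    """Fuse per-class similarity maps of shape (H, W) into one label image.

    similarity_maps: dict mapping class ID -> (H, W) float array (assumed input).
    class_ids: ordered list of the class IDs to compete per pixel.
    """
    stacked = np.stack([similarity_maps[c] for c in class_ids], axis=0)  # (N, H, W)
    winner = np.argmax(stacked, axis=0)  # per-pixel index into class_ids
    return np.asarray(class_ids)[winner]  # map the index back to the class ID

# Toy 2x2 example with two hypothetical class IDs.
maps = {
    3: np.array([[0.9, 0.1], [0.2, 0.4]]),
    4: np.array([[0.3, 0.8], [0.1, 0.5]]),
}
labels = argmax_segmentation(maps, [3, 4])
print(labels)
```

Because every pixel receives exactly one label, this protocol cannot leave a pixel unassigned, which is one way it differs from a per-class thresholded protocol.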

Replication result: 15.43 mIoU under the §VII.A-matching config, vs the paper Table V claim of 48.48 mIoU (DINOv2 ViT-S/14, no mask refinement). The ~33-point gap remains unexplained.

Also bundles a one-line MaskCLIP import fix (import packaging.version) required to run on Python 3.12, and a .gitignore entry for the cached eval outputs under result/.

See docs/RELLIS_REPLICATION.md for the full reproduction recipe and the list of hypotheses we tested + the ones we cannot test without the upstream eval script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SimonSchwaiger · 2026-05-24T19:34:55Z

Hi @Parv-Maheshwari, thanks for contributing to OTAS!

This repo is intended for inference and using OTAS in your own robotics projects. To keep it minimal and make updating OTAS easier, we have not included evaluation code in this repo.

We are in the process of cleaning up the evaluation codebase and are considering adding evaluation code in a separate repository.

In the meantime, I have looked through your evaluation code and the mIoU results seem really low. Since it's only an ablation and our goal was offroad terrain segmentation, we run Rellis-3D only on a subset of classes relevant to terrain segmentation. The exact classes are specified in the OTAS paper: https://arxiv.org/pdf/2507.08851 . At first glance, the attached code appears to evaluate a different set of classes.

For matching the eval setup completely, I'd also recommend switching to the first commit of this repository (that's the exact code we ran the evaluation on). Subsequent commits improved the implementation in our real-world trials and added quality of life improvements.

Best,
Simon

Parv-Maheshwari · 2026-05-24T19:46:29Z

Hello Simon,

Thank you for getting back to me. I couldn't find the exact classes you used for RELLIS-3D. The paper states the following, but nothing explicit about the number of classes or which ones:

"Foundation Model Choice. Foundation model dependence is evaluated on RELLIS-3D [22], a challenging off-road dataset with highly textured classes and semantically overlapping categories (e.g., “dirt,” “mud,” “puddle”)."

"All prompts are class names in RELLIS-3D’s included ontology file."

It would be great if you could share the list of class names you ran the code on, so that I can try to replicate the result with the first commit.

Warm regards,
Parv Maheshwari

SimonSchwaiger · 2026-05-24T20:21:27Z

Thanks for pointing out that the classes are missing! The exact list of classes was supposed to be included in the supplementary material, but the arXiv version doesn't seem to include it yet. I will update the preprint to include it after ICRA.

Here is the exact prompt setup used in our tests. The numbers are class IDs:

class_prompts = {
    1: "dirt",
    6: "water",
    10: "asphalt",
    19: "bush",
    33: "mud",
    34: "rubble"
}
neg_prompts = ["thing"]
threshold_value = 0.8
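
As a rough illustration of how the class IDs above could be matched against a RELLIS-3D label image, here is a minimal sketch that builds one binary ground-truth mask per prompt class. The helper `ground_truth_masks` and the toy label image are hypothetical; only the `class_prompts` dictionary is copied from the setup above, and pixels with any other ID simply belong to none of the six masks.

```python
import numpy as np

class_prompts = {1: "dirt", 6: "water", 10: "asphalt", 19: "bush", 33: "mud", 34: "rubble"}

def ground_truth_masks(label_image, class_ids):
    """Return one boolean (H, W) mask per class ID; other IDs are left out."""
    return {cid: label_image == cid for cid in class_ids}

# Toy 2x3 label image; ID 0 stands in for a pixel outside the six-class subset.
label_image = np.array([[19, 19, 33],
                        [0, 6, 19]])
gt = ground_truth_masks(label_image, class_prompts)
present = [class_prompts[cid] for cid, mask in gt.items() if mask.any()]
print(present)
```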

Best,
Simon

Parv-Maheshwari · 2026-05-24T20:44:02Z

Hello Simon,

Thank you for sending the exact prompt. Will post an update here regarding any further issues.

Warm regards,
Parv Maheshwari

… t=0.8)

Add own_RELLIS_paper.py implementing the literal protocol Simon described in the PR SimonSchwaiger#2 review comment (6-class terrain subset, neg=["thing"], threshold=0.8 via semantic_mask.similarity()) and run it against three independent code/config combinations: current main + paper §VII.A, first commit (6aec2d4) + first-commit defaults, and first commit + §VII.A overrides. All three land within 0.07 mIoU of each other (6.63-6.70), vs the Table V claim of 48.48.

The structural reason is visible in the per-class breakdown: semantic_mask.similarity min-max-normalises per image, so threshold@0.8 fires on the noisiest 20% of pixels in every frame where a class is absent. Across the ~1500 absent-class frames per sparse class (dirt: 13 of 1672 frames, water: 19, rubble: 145), this accumulates into hundreds of millions of FPs that drown every sparse class. bush (1658 of 1672 frames) carries the entire mean at ~35.

docs/RELLIS_REPLICATION.md rewritten as a PR-ready note with the full 6-run table, per-class numbers for all three Simon-protocol runs, reproduction commands including the worktree dance for the first-commit reproduction, a ruled-out hypothesis table, the one remaining open hypothesis (mIoU averaging convention: pooled vs per-frame-over-frames-with-class-present), and asks for the upstream maintainers.

Other changes are consistency cleanups paired with own_RELLIS_paper.py:
- own_RELLIS.py: §VII.A hyperparameters baked into otas_segmentor.py _DEFAULT_CONFIG (so the 20-class argmax run also lands at the documented 15.66 instead of 16.70). Native 1200x1920 default (no more 480x640 resize that was the OTAS_small.json shape).
- rellis_dataset.py: drop the unused 4th th_vis tuple element and the unused mask_modality flag. RELLIS has no thermal channel.
- eval_common.py: tuple shape aligned with rellis_dataset.py. Optional per-dataset palette + redraw_overlays_only flag for visualization iteration (no behaviour change unless used).
- otas_segmentor.py: §VII.A defaults (d=64, Cr=24, dinov2_input_size=224) in _DEFAULT_CONFIG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Parv-Maheshwari · 2026-05-26T17:10:47Z

Thanks @SimonSchwaiger — your clarification (6-class terrain subset, neg=["thing"], threshold=0.8, plus the suggestion to run on the first commit) was the missing piece. Pushed an update (commit 6b54286) that reproduces the literal protocol you described against three independent code/config combinations. Headline:

| # | Code | Config | Protocol | mIoU |
|---|------|--------|----------|------|
| 3 | current `main`, `own_RELLIS.py` | §VII.A native | 20-cls argmax (our original interpretation) | 15.66 |
| 4 | current `main`, `own_RELLIS_paper.py` (new) | §VII.A native | 6cls + `neg=["thing"]` + `t=0.8` | 6.70 |
| 5 | first commit (`6aec2d4`) | first-commit defaults (d=32, Cr=48, dinov2 input 518) | Same as row 4 | 6.64 |
| 6 | first commit (`6aec2d4`) | §VII.A overrides | Same as row 4 | 6.63 |
| - | - | - | Paper Table V claim | 48.48 |

Rows 4–6 land within 0.07 mIoU of each other across two code versions and two config presets — the gap is insensitive to commit version, shared_feat_resolution, n_components, dinov2_input_size, or input resize. We diffed src/model.py between 6aec2d4 and current main: semantic_mask.similarity math is byte-identical (cosine sim → min-max normalise → threshold).
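
The three-step pipeline named above (cosine similarity, per-image min-max normalisation, threshold) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code in src/model.py: the function name, the `>=` comparison, and the feature layout are hypothetical, and the `min_range=0.05` floor mirrors the `clamp(min=0.05)` mentioned elsewhere in this thread.

```python
import numpy as np

def thresholded_similarity(pixel_feats, text_feat, threshold=0.8, min_range=0.05):
    """pixel_feats: (H, W, D) features, text_feat: (D,) prompt embedding.

    Assumes no pixel feature is the zero vector (the norm is used as a divisor).
    """
    feats = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)
    sim = feats @ text                              # (H, W) cosine similarity
    lo = sim.min()
    span = max(float(sim.max() - lo), min_range)    # floor on the per-image range
    normalised = (sim - lo) / span                  # per-image min-max normalisation
    return normalised >= threshold                  # binary mask

# Toy 2x2 image with 2-D features and a prompt pointing along the first axis.
pixel_feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 1.0], [1.0, 0.1]]])
mask = thresholded_similarity(pixel_feats, np.array([1.0, 0.0]))
print(mask)
```

Note that the normalisation is relative to each image: the best-matching pixel in a frame always lands at or near 1.0, regardless of how weak its absolute similarity is.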

Best guess at where the gap lives

Per-class breakdown for row 4 (the pattern is essentially identical for rows 5–6):

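| class | raw id | n frames present (of 1672) | total GT px | OTAS IoU |
|-------|--------|----------------------------|-------------|----------|
| dirt | 1 | 13 | 9,690 | 0.00 |
| water | 6 | 19 | 959,662 | 0.35 |
| asphalt | 10 | 503 | 3,850,438 | 0.66 |
| bush | 19 | 1658 | 662,926,185 | 35.14 |
| mud | 33 | 574 | 29,818,401 | 3.59 |
| rubble | 34 | 145 | 1,907,260 | 0.47 |
| mIoU(6cls) | | | | 6.70 |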

semantic_mask.similarity min-max-normalises per image. On a frame where the class of interest is absent (e.g. dirt on the 1659 dirt-free frames of 1672), the cosine-similarity range is small but non-zero noise; the clamp(min=0.05) lower-bounds the range but stretches the noise distribution to fill [0, 1], and the threshold@0.8 then fires on roughly the noisiest 20% of pixels of that frame. Across ~1500 absent-class frames × 1920×1200 px × ~20%, that's hundreds of millions of false positives per sparse class — see the FP counts in the per-class table in RELLIS_REPLICATION.md. bush is fine because it's present in 99.2% of frames, so the per-image normalisation is normalising real signal not noise.
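
A back-of-the-envelope check of this argument, written as a sketch. Two things here are assumptions rather than measurements: the synthetic similarity map is evenly spread over a narrow range (which is what makes the fired fraction come out near 20%; a more peaked noise distribution would give a different fraction), and the false-positive estimate simply multiplies frames, pixels, and that fraction. The measured FP counts are the ones in RELLIS_REPLICATION.md.

```python
import numpy as np

def fraction_above(sim, threshold=0.8, min_range=0.05):
    """Fraction of pixels exceeding the threshold after min-max normalisation."""
    span = max(float(sim.max() - sim.min()), min_range)
    normalised = (sim - sim.min()) / span
    return float(np.mean(normalised > threshold))

# Narrow-range "noise": raw similarities between 0.10 and 0.16, evenly spread.
flat_noise = np.linspace(0.10, 0.16, 1001)
frac = fraction_above(flat_noise)
print(f"fraction fired on an absent-class frame: {frac:.3f}")

# Estimate for dirt, using the figures from the per-class table.
absent_frames = 1672 - 13                 # frames without dirt
pixels_per_frame = 1920 * 1200
estimated_fp = absent_frames * pixels_per_frame // 5   # assume ~20% of pixels fire
dirt_gt_px = 9_690
# IoU = TP / (TP + FP + FN) <= GT / (GT + FP), since TP cannot exceed GT pixels.
iou_upper_bound = dirt_gt_px / (dirt_gt_px + estimated_fp)
print(estimated_fp, iou_upper_bound)
```

Under these assumptions the estimate is about 764 million false positives against 9,690 ground-truth pixels, which caps pooled IoU for dirt at roughly 0.001%, in line with the 0.00 reported in the table.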

The one open question

Given the sparse-class FP problem above, the structural unknown that would close the gap is the mIoU averaging convention. We pool one (TP, FP, FN) triple per class across all 1672 frames and compute IoU once over the totals. If Table V instead averages IoU per-frame only over frames where the class is present in GT (per class, then mean across the 6 classes), the absent-class FPs drop out of the average entirely and the headline jumps. Is that what Table V uses?
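
The difference between the two averaging conventions can be made concrete with a small sketch. Both function names are hypothetical, and nothing in this thread confirms which convention Table V uses; the toy data only shows how far the two numbers can diverge when a class is absent from most frames.

```python
import numpy as np

def pooled_iou(preds, gts):
    """Accumulate TP/FP/FN over all frames, then compute IoU once."""
    tp = fp = fn = 0
    for p, g in zip(preds, gts):
        tp += int(np.sum(p & g))
        fp += int(np.sum(p & ~g))
        fn += int(np.sum(~p & g))
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

def present_frame_iou(preds, gts):
    """Average per-frame IoU over frames where the class appears in GT."""
    ious = [np.sum(p & g) / np.sum(p | g) for p, g in zip(preds, gts) if g.any()]
    return float(np.mean(ious)) if ious else float("nan")

# One frame where the class is present and predicted perfectly,
# followed by three frames where it is absent but one pixel fires.
present_gt = np.array([True, True, False, False])
absent_gt = np.zeros(4, dtype=bool)
noisy_pred = np.array([True, False, False, False])

gts = [present_gt, absent_gt, absent_gt, absent_gt]
preds = [present_gt.copy(), noisy_pred, noisy_pred, noisy_pred]

print(pooled_iou(preds, gts))         # 2 TP, 3 FP, 0 FN
print(present_frame_iou(preds, gts))  # only the first frame counts
```

In this toy case the pooled value is 0.4 while the present-frames-only value is 1.0, because the false positives on absent-class frames never enter the second average.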

Two other much smaller asks:

- Which OTAS_*.json config produced Table V? The shipped OTAS_small.json has enable_mask_refinement: true and shared_feat_resolution: 16, which contradicts §VII.A.
- Is the test split the upstream test.lst from the official split, or a different release / curated subset? Our scan shows dirt in only 13 of 1672 frames, water in 19.

What's in the new commit

- own_eval/own_RELLIS_paper.py (new): implements your protocol literally. CLI: --config_preset {paper_vii_a, first_commit_defaults}, --threshold (default 0.8), --max_frames for smoke runs.
- docs/RELLIS_REPLICATION.md: rewritten as a self-contained reproduction guide with all six runs, per-class numbers, root-cause analysis, the worktree dance for first-commit reproduction, ruled-out hypothesis table, and these asks.
- own_eval/{own_RELLIS,rellis_dataset,eval_common,otas_segmentor}.py: paired consistency cleanups (§VII.A baked into defaults, drop the unused th_vis tuple element from RELLIS dataset since RELLIS has no thermal sensor).

Quick repro for the headline number on current main:

env -u LD_LIBRARY_PATH .venv/bin/python own_eval/own_RELLIS_paper.py \
    --data_dir /path/to/Rellis-3D --config_preset paper_vii_a
# → mIoU_6cls: ~6.70 (kmeans noise ±0.05)

First-commit reproduction worktree dance documented in docs/RELLIS_REPLICATION.md#run-on-first-commit-6aec2d4-with-first-commit-defaults-row-5.

Happy to run the per-frame-averaging variant as soon as you confirm whether that's the convention — it's a one-knob change.

kevintsq · 2026-06-21T20:38:18Z

Hi Simon,

Thanks for your great work!

It would be really helpful if you could release the evaluation code. That would make it easier for others to reproduce the results and avoid having to implement their own evaluation pipeline or troubleshoot discrepancies.

Since OTAS is already very strong, any further improvements may be difficult to distinguish visually. Quantitative metrics would therefore be especially useful for evaluating the gains.

Thanks again for your efforts!

SimonSchwaiger · 2026-06-25T09:33:46Z

Hello @kevintsq, thanks for your interest in OTAS!

Unfortunately, releasing the raw evaluation code is currently not possible since it's not compatible with the publicly released OTAS model/method code. I altered the model structure and interfaces for the public release to make it easier to locally run OTAS and make it usable in downstream projects such as the ROS 2 node in this repo. I didn't anticipate that there would be such a strong interest in the eval code, hence I made the tradeoff of compatibility vs. ease of setup.

However, I am currently working on a separate repository that includes OVSS benchmarks (e.g. ADE20K) and more backbones than the ones used in the OTAS paper. I'll do my best to also port the OTAS evals forward into that repository (main interest seems to be ORFD and Rellis-3D, so I'll start with those). Since I am coding/maintaining this mostly alone alongside my day job, it unfortunately will take some more time to get the repo properly cleaned up and documented for release. I will update the issues here once that is published.

In the meantime, I am happy to answer any questions!

Best,
Simon

SimonSchwaiger self-assigned this Jun 9, 2026

Parv-Maheshwari wants to merge 2 commits into SimonSchwaiger:main from Parv-Maheshwari:rellis-3d-replication-attempt
