Add RELLIS-3D evaluation harness + reproducibility note #2
Parv-Maheshwari wants to merge 2 commits into
The released OTAS code does not include the multi-class evaluation script that produced the RELLIS-3D Table V mIoU numbers. Only the single-(pos, neg) -> binary-mask `single_inference` API is shipped, and there is no RELLIS-3D dataset loader. This PR adds the missing pieces so the result can be reproduced from a single OTAS checkout:

- own_eval/own_RELLIS.py: RELLIS-3D driver, with `--paper_config` to match the hyperparameters described in supplementary §VII.A.
- own_eval/rellis_dataset.py: PyTorch Dataset for RELLIS-3D's 20-class `ontology.yaml` ontology (also exposes the HRNet-19 `label_mapping` variant for completeness, default off).
- own_eval/eval_common.py: per-frame inference + confusion-matrix-pooled IoU, forwards arbitrary config_overrides to OTASEncoder.
- own_eval/otas_segmentor.py: N-way adapter that wraps `language_map.embed_image` -> per-class `clip_similarity` -> argmax. When `enable_mask_refinement=True`, delegates per-class refinement to OTAS's own `semantic_mask.binary_mask_refined(..., ret_dict=True)` so the SAM-on path reuses upstream code verbatim.
- docs/RELLIS_REPLICATION.md: PR description with the paper claim, our matching config, the result, the ruled-out hypotheses, and the remaining open questions for upstream.

Replication result: 15.43 mIoU under the §VII.A-matching config, vs the paper Table V claim of 48.48 mIoU (DINOv2 ViT-S/14, no mask refinement). The ~32-point gap remains unexplained.

Also bundles a one-line MaskCLIP import fix (`import packaging.version`) required to run on Python 3.12, and a `.gitignore` entry for the cached eval outputs under `result/`.

See docs/RELLIS_REPLICATION.md for the full reproduction recipe and the list of hypotheses we tested + the ones we cannot test without the upstream eval script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
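The argmax step and the confusion-matrix-pooled IoU named in the description can be sketched roughly as below. This is a minimal NumPy illustration under stated assumptions, not the code in own_eval/: the function names, the `(num_classes, H, W)` similarity layout, and the ignore index of 255 are all hypothetical.

```python
import numpy as np


def argmax_labels(similarity: np.ndarray) -> np.ndarray:
    """Turn per-class similarity maps (num_classes, H, W) into an (H, W) label map."""
    return similarity.argmax(axis=0)


def update_confusion(conf: np.ndarray, pred: np.ndarray, gt: np.ndarray,
                     num_classes: int, ignore_index: int = 255) -> np.ndarray:
    """Add one frame to a running confusion matrix (rows: GT, columns: prediction).

    Pixels whose GT equals ignore_index (an assumed convention) are skipped.
    """
    valid = gt != ignore_index
    flat = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    conf += np.bincount(flat, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf


def pooled_iou(conf: np.ndarray) -> np.ndarray:
    """Per-class IoU computed once over counts pooled across all frames."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp  # predicted as the class, GT says otherwise
    fn = conf.sum(axis=1) - tp  # GT is the class, predicted otherwise
    denom = tp + fp + fn
    iou = np.full(conf.shape[0], np.nan)  # NaN for classes that never occur
    np.divide(tp, denom, out=iou, where=denom > 0)
    return iou
```

Under this convention mIoU would be `np.nanmean(pooled_iou(conf))`, so every false positive from every frame counts against its class, including frames where that class does not appear in the ground truth.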
|
Hi @Parv-Maheshwari, thanks for contributing to OTAS!

This repo is intended for inference and for using OTAS in your own robotics projects. To keep it minimal and make updating OTAS easier, we have not included evaluation code in this repo. We are in the process of cleaning up the evaluation codebase and are considering adding evaluation code in another repository.

In the meantime, I have looked through your evaluation code and the mIoU results seem really low. Since it's only an ablation and our goal was offroad terrain segmentation, we run Rellis-3D only on a subset of classes relevant to terrain segmentation. The exact classes are specified in the OTAS paper: https://arxiv.org/pdf/2507.08851 . At first glance it seems to me that the attached code evaluates a different set of classes.

To match the eval setup completely, I'd also recommend switching to the first commit of this repository (that's the exact code we ran the evaluation on). Subsequent commits improved the implementation in our real-world trials and added quality-of-life improvements.

Best,
|
Hello Simon,

Thank you for getting back. I couldn't find the exact classes you used for RELLIS-3D. The paper states this, but nothing explicit about the number of classes or which classes:

"Foundation Model Choice. Foundation model depen-
"All

It would be great if you could tell me the list of class names you ran the code on, and I can try to replicate that with the first commit.

Warm regards,
|
Thanks for pointing out that the classes are missing! The exact list of classes was supposed to be included in the supplementary material, but the arXiv version doesn't seem to include it yet. I will update the preprint to include the classes after ICRA.

Here is the exact prompt setup used in our tests. The numbers are class IDs:

class_prompts = {
1: "dirt",
6: "water",
10: "asphalt",
19: "bush",
33: "mud",
34: "rubble"
}
neg_prompts = ["thing"]
threshold_value = 0.8

Best,
|
Hello Simon, Thank you for sending the exact prompt. Will post an update here regarding any further issues. Warm regards, |
… t=0.8)

Add own_RELLIS_paper.py implementing the literal protocol Simon described in the PR SimonSchwaiger#2 review comment (6-class terrain subset, neg=["thing"], threshold=0.8 via semantic_mask.similarity()) and run it against three independent code/config combinations: current main + paper §VII.A, first commit (6aec2d4) + first-commit defaults, and first commit + §VII.A overrides. All three land within 0.07 mIoU of each other (6.63-6.70), vs the Table V claim of 48.48.

The structural reason is visible in the per-class breakdown: semantic_mask.similarity min-max-normalises per image, so threshold@0.8 fires on the top 20% of each frame's similarity range, which is pure noise in every frame where a class is absent. The sparse classes are present in very few frames (dirt: 13 of 1672 frames, water: 19, rubble: 145), so across the ~1500 absent-class frames per sparse class this accumulates into hundreds of millions of FPs that drown every sparse class. bush (present in 1658 of 1672 frames) carries the entire mean at ~35.

docs/RELLIS_REPLICATION.md rewritten as a PR-ready note with the full 6-run table, per-class numbers for all three Simon-protocol runs, reproduction commands including the worktree dance for the first-commit reproduction, a ruled-out hypothesis table, the one remaining open hypothesis (mIoU averaging convention: pooled vs per-frame-over-frames-with-class-present), and asks for the upstream maintainers.

Other changes are consistency cleanups paired with own_RELLIS_paper.py:

- own_RELLIS.py: §VII.A hyperparameters baked into otas_segmentor.py _DEFAULT_CONFIG (so the 20-class argmax run also lands at the documented 15.66 instead of 16.70). Native 1200x1920 default (no more 480x640 resize that was the OTAS_small.json shape).
- rellis_dataset.py: drop the unused 4th th_vis tuple element and the unused mask_modality flag. RELLIS has no thermal channel.
- eval_common.py: tuple shape aligned with rellis_dataset.py. Optional per-dataset palette + redraw_overlays_only flag for visualization iteration (no behaviour change unless used).
- otas_segmentor.py: §VII.A defaults (d=64, Cr=24, dinov2_input_size=224) in _DEFAULT_CONFIG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
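The min-max effect described in this commit message can be illustrated with a small synthetic example. This is only a sketch: `minmax_normalise` is a hypothetical stand-in for per-image normalisation, not OTAS's `semantic_mask.similarity`, and the similarity values are random numbers rather than model outputs.

```python
import numpy as np


def minmax_normalise(sim: np.ndarray) -> np.ndarray:
    """Rescale one image's similarity map to [0, 1] (hypothetical stand-in)."""
    lo, hi = sim.min(), sim.max()
    if hi == lo:
        return np.zeros_like(sim)
    return (sim - lo) / (hi - lo)


rng = np.random.default_rng(0)

# Synthetic frame in which the queried class is absent: similarity is low
# everywhere and varies only through noise (values are invented).
absent_frame = rng.normal(loc=0.10, scale=0.01, size=(64, 64))

# Thresholding the raw similarity at 0.8 predicts nothing, as desired.
raw_mask = absent_frame >= 0.8

# After per-image min-max normalisation the largest value is exactly 1.0,
# so the same threshold always selects at least one pixel.
normalised_mask = minmax_normalise(absent_frame) >= 0.8

print("raw positives:", int(raw_mask.sum()))
print("normalised positives:", int(normalised_mask.sum()))
```

Because the normalised maximum is 1.0 by construction, a fixed threshold cannot return an empty mask for a non-constant frame, and every pixel it selects in an absent-class frame is a false positive under pooled counting.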
|
Thanks @SimonSchwaiger — your clarification (6-class terrain subset,
Rows 4–6 land within 0.07 mIoU of each other across two code versions and two config presets; the gap is insensitive to commit version,

Best guess at where the gap lives

Per-class breakdown for row 4 (the pattern is essentially identical for rows 5–6):
The one open question

Given the sparse-class FP problem above, the structural unknown that would close the gap is the mIoU averaging convention. We pool one (TP, FP, FN) triple per class across all 1672 frames and compute IoU once over the totals. If Table V instead averages IoU per-frame only over frames where the class is present in GT (per class, then mean across the 6 classes), the absent-class FPs drop out of the average entirely and the headline jumps. Is that what Table V uses?

Two other much smaller asks:
What's in the new commit
Quick repro for the headline number on current

env -u LD_LIBRARY_PATH .venv/bin/python own_eval/own_RELLIS_paper.py \
--data_dir /path/to/Rellis-3D --config_preset paper_vii_a
# → mIoU_6cls: ~6.70 (kmeans noise ±0.05)

First-commit reproduction worktree dance documented in

Happy to run the per-frame-averaging variant as soon as you confirm whether that's the convention; it's a one-knob change.
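The difference between the two averaging conventions discussed in this comment can be made concrete with toy counts. This is a hedged sketch: the function names are hypothetical, the per-frame convention is the open hypothesis rather than a confirmed description of Table V, and the (TP, FP, FN) numbers are invented for illustration, not RELLIS-3D measurements.

```python
import numpy as np


def pooled_iou(frames):
    """IoU for one class from (TP, FP, FN) counts summed over all frames."""
    tp, fp, fn = np.sum(frames, axis=0)
    denom = tp + fp + fn
    return float(tp / denom) if denom > 0 else float("nan")


def present_only_iou(frames):
    """Mean of per-frame IoU over frames where the class occurs in GT (TP + FN > 0)."""
    ious = [tp / (tp + fp + fn) for tp, fp, fn in frames if tp + fn > 0]
    return float(np.mean(ious)) if ious else float("nan")


# Toy sparse class: present in 1 of 10 frames. In the other 9 frames the
# class is absent, so every predicted pixel there is a false positive.
frames = [(80, 20, 0)] + [(0, 100, 0)] * 9

print(pooled_iou(frames))        # 0.08: absent-frame FPs dominate the denominator
print(present_only_iou(frames))  # 0.8: absent frames are excluded from the mean
```

With these invented counts the same predictions score 0.08 under pooling and 0.8 under present-only averaging, which shows how much the choice of convention alone can move the number for a sparse class.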
|
Hi Simon, Thanks for your great work! It would be really helpful if you could release the evaluation code. That would make it easier for others to reproduce the results and avoid having to implement their own evaluation pipeline or troubleshoot discrepancies. Since OTAS is already very strong, any further improvements may be difficult to distinguish visually. Quantitative metrics would therefore be especially useful for evaluating the gains. Thanks again for your efforts! |
|
Hello @kevintsq, thanks for your interest in OTAS!

Unfortunately, releasing the raw evaluation code is currently not possible since it's not compatible with the publicly released OTAS model/method code. I altered the model structure and interfaces for the public release to make it easier to run OTAS locally and to make it usable in downstream projects such as the ROS 2 node in this repo. I didn't anticipate such strong interest in the eval code, hence I made the tradeoff of compatibility vs. ease of setup.

However, I am currently working on a separate repository that includes OVSS benchmarks (e.g. ADE20K) and more backbones than the ones used in the OTAS paper. I'll do my best to also port the OTAS evals forward into that repository (main interest seems to be ORFD and Rellis-3D, so I'll start with those). Since I am coding/maintaining this mostly alone alongside my day job, it unfortunately will take some more time to get the repo properly cleaned up and documented for release. I will update the issues here once that is published.

In the meantime, I am happy to answer any questions!

Best,