LLM stage crashes on OpenAI-compatible/local endpoints (Ollama): models return confidence 0-100 but output schema validates 0-1 #89

@TroyOMTV

Summary

In LLM mode against an OpenAI-compatible endpoint (e.g. Ollama via OPENAI_BASE_URL), the semantic pass crashes with a pydantic ValidationError: local instruct models return confidence on a 0–100 scale, but the LLM-output schemas validate it as a 0.0–1.0 float (Field(ge=0.0, le=1.0)). Combined with the abort-on-first-error behavior (#10), one out-of-range value takes down the entire LLM stage; only --no-llm static analysis survives.

Environment

  • SkillSpector 2.2.3 (installed from main, git+https://github.com/NVIDIA/skillspector.git)
  • Python 3.12 (isolated venv), macOS / Apple Silicon
  • SKILLSPECTOR_PROVIDER=openai, OPENAI_BASE_URL=http://localhost:11434/v1, Ollama 0.30.8
  • Reproduced with two different models: qwen2.5:14b and gemma4:12b

What happens

With LLM mode on (no --no-llm):

pydantic_core.ValidationError: N validation errors for MetaAnalyzerResult
  findings.0.confidence  Input should be less than or equal to 1  [input=100]
  ...

Both models emit confidence: 100. The constraint isn't on a single schema: after relaxing the bound on MetaAnalyzerFinding, the identical crash reappears on LLMAnalysisResult (the per-analyzer schema) — i.e. it's systemic across the LLMAnalyzerBase output models, not a one-off. (Raw model speed is fine here; this is purely the value-range mismatch, separate from any timeout.)

Root cause

The LLM-output models constrain confidence to 0–1:

  • src/skillspector/llm_analyzer_base.py:67: confidence: float = Field(ge=0.0, le=1.0, ...)
  • src/skillspector/nodes/meta_analyzer.py:66: confidence: float = Field(ge=0.0, le=1.0, ...)

Instruct models commonly express confidence as a percentage (0–100). Frontier models on strict function-calling providers tend to stay in range, but models served over Ollama's OpenAI-compatible endpoint don't honor the numeric bound (constrained decoding enforces type/structure, not magnitude), so the value comes back as 100 and client-side pydantic validation rejects it.
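The mismatch can be shown in isolation with a few lines of pydantic v2. This is only a sketch: DemoFinding and accepts are hypothetical names invented for illustration, not SkillSpector classes, and the only thing borrowed from the code base is the Field(ge=0.0, le=1.0) bound quoted above.

```python
from pydantic import BaseModel, Field, ValidationError


class DemoFinding(BaseModel):
    """Hypothetical stand-in for a confidence-bearing LLM-output model."""

    # Same bound as quoted in the issue: a probability-style float in [0, 1].
    confidence: float = Field(ge=0.0, le=1.0)


def accepts(payload: dict) -> bool:
    """Return True if the payload validates, False if pydantic rejects it."""
    try:
        DemoFinding.model_validate(payload)
    except ValidationError:
        return False
    return True


print(accepts({"confidence": 0.9}))  # probability-style value, within the bound
print(accepts({"confidence": 100}))  # percentage-style value, outside the bound
```

The second payload is well-typed and structurally valid, which is why structure-only constrained decoding lets it through and the failure only shows up at client-side validation.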

How this differs from existing issues

Possible direction (untested)

Normalizing/clamping confidence before validation would resolve it — the existing @field_validator("overall_assessment", mode="before") in meta_analyzer.py is a natural precedent. It would need to cover every confidence-bearing LLM-output model (relaxing the bound on one just surfaced the same crash on the next). Rough shape, but you'll know the right form: float(v); if v > 1: v /= 100; then clamp to [0, 1].
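A minimal sketch of that rough shape, assuming pydantic v2. The names normalize_confidence, ConfidenceMixin, and DemoFinding are hypothetical and do not exist in SkillSpector; a shared base class is just one way to cover every confidence-bearing model at once, and the maintainers may prefer a different form.

```python
from typing import Any

from pydantic import BaseModel, Field, field_validator


def normalize_confidence(value: Any) -> Any:
    """Map a 0-100 style confidence onto [0, 1]; leave non-numeric input alone."""
    try:
        v = float(value)
    except (TypeError, ValueError):
        # Not numeric: hand it back so pydantic reports the usual type error.
        return value
    if v > 1.0:
        # Assumption: anything above 1 was meant as a percentage.
        v /= 100.0
    # Clamp whatever remains (e.g. negatives, or 250 -> 2.5) into [0, 1].
    return min(max(v, 0.0), 1.0)


class ConfidenceMixin(BaseModel):
    """Hypothetical shared base so each output model inherits the same fix."""

    confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("confidence", mode="before")
    @classmethod
    def _coerce_confidence(cls, value: Any) -> Any:
        # Runs before the ge/le bounds are checked.
        return normalize_confidence(value)


class DemoFinding(ConfidenceMixin):
    """Hypothetical finding model used only to exercise the validator."""

    title: str


print(DemoFinding(title="demo", confidence=100).confidence)
print(DemoFinding(title="demo", confidence=85).confidence)
print(DemoFinding(title="demo", confidence=0.4).confidence)
```

One ambiguity remains with this heuristic: a model that means "1%" and emits 1 is read as 1.0 (full confidence), since values at or below 1 are assumed to already be on the 0-1 scale.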

Repro

ollama pull qwen2.5:14b
export SKILLSPECTOR_PROVIDER=openai OPENAI_BASE_URL=http://localhost:11434/v1 \
       OPENAI_API_KEY=ollama SKILLSPECTOR_MODEL=qwen2.5:14b
skillspector scan ./tests/fixtures/malicious_skill     # no --no-llm
# -> ValidationError: findings.0.confidence Input should be less than or equal to 1 [input=100]

Happy to open a PR if that's useful.
