AgentBoundary v0.1 conformance evaluation of LangSmith — pre-publication review #2919

@sunilp

Hi William FH (@hinthornw) and LangSmith team —

I'm Sunil Prakash from JamJet Labs. I've been authoring an open spec for AI-action receipts called AgentBoundary (jamjet-labs/agentboundary, v0.1 stable + v0.2-alpha draft). It defines a portable, tamper-evident JSON receipt format that a third party can verify without trusting the runtime.

I built a 40-scenario conformance suite and graded four prominent agent-governance products against it, including LangSmith. I'm opening this issue to give the team a 7-day right-to-respond window before the comparative report is published.

Headline up front: LangSmith is the most full-featured observability platform in the comparison. The Run object captures everything needed to debug a multi-step agent call — inputs, outputs, full trace tree, feedback_stats, token costs, eval-dataset linkage. What it does not have, and does not claim to have, is a normative artifact format for portable verification. Policy decisions, approver identity, and execution outcomes live in team-defined tag/feedback conventions. The comparison reveals that LangSmith's design choice (capture-everything-flexibly) and AgentBoundary's (schema-versioned-receipts) target different audiences — engineers debugging vs third-party verifiers.

What I did:

Headline:

PASS         15
PARTIAL      14
DOCS-ONLY     1
NOT COVERED   8
N/A           2
──────────────
TOTAL        40

15 PASS is the second-highest count of any vendor (after Microsoft AGT), driven mostly by the Level 3 hashing scenarios: Run.inputs stores raw JSON that the adapter can canonicalise and hash directly. The 14 PARTIAL results are the signature pattern: the data is somewhere in the Run, but its schema location varies by team convention. With a strict team convention (e.g., decision:allow, policy:foo, env:prod tags), many PARTIAL rows upgrade to PASS; without one, they fall to NOT COVERED.
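To make the Level 3 point concrete, here is a minimal sketch of "canonicalise and hash" applied to a Run.inputs-style payload. It is an illustration under assumptions, not the adapter's actual code: the canonical form (sorted keys, compact separators, UTF-8) is a simple stand-in for whatever canonicalisation AgentBoundary specifies, and the helper names and the `sha256:` prefix are hypothetical.

```python
import hashlib
import json


def canonical_json(value) -> bytes:
    """Serialise to a deterministic byte string.

    Assumption: sorted keys + compact separators + UTF-8. This is a
    simplified stand-in, not a full JSON canonicalisation scheme.
    """
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def inputs_digest(run_inputs: dict) -> str:
    """Hash the canonical form (hypothetical helper and prefix)."""
    return "sha256:" + hashlib.sha256(canonical_json(run_inputs)).hexdigest()


# Two payloads with the same content but different key order.
a = {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "hi"}}
b = {"args": {"subject": "hi", "to": "ops@example.com"}, "tool": "send_email"}

# Key order does not affect the digest, so a third party can recompute it.
assert inputs_digest(a) == inputs_digest(b)
```

The property that matters for a verifier is that the digest depends only on the content of the inputs, not on how the runtime happened to serialise them.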

Specific notes (design choices worth acknowledging, not bugs):

  • LangSmith doesn't normalize a policy.decision field; teams use Run.tags conventions. A cross-team auditor can't reliably parse what they find without per-team schema understanding.
  • parent_run_id builds a tree per trace; it does NOT chain across traces, so the AgentBoundary v0.2-alpha L4 chain-integrity check doesn't apply at trace boundaries.
  • The Gateway adds spend caps + PII redaction at the LLM-request layer; these are control-plane primitives, orthogonal to per-action receipt structure.
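To illustrate the tag-convention point, here is a rough sketch of how an auditor might lift `key:value` tags into named fields and see which ones are missing. It assumes tags arrive as a plain list of strings; the required keys simply mirror the decision/policy/env example from this issue, and the function names are hypothetical rather than part of LangSmith or AgentBoundary.

```python
from typing import Iterable

# Hypothetical team convention, taken from the example tags in this issue.
REQUIRED_KEYS = ("decision", "policy", "env")


def parse_tags(tags: Iterable[str]) -> dict[str, str]:
    """Turn 'key:value' tags into a dict; free-form tags are ignored."""
    fields: dict[str, str] = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key and value:
            # Keep the first occurrence if a key is repeated.
            fields.setdefault(key, value)
    return fields


def missing_fields(tags: Iterable[str]) -> list[str]:
    """List the convention keys that the tags do not supply."""
    fields = parse_tags(tags)
    return [key for key in REQUIRED_KEYS if key not in fields]


strict = ["decision:allow", "policy:foo", "env:prod", "nightly"]
loose = ["env:prod", "nightly"]

assert parse_tags(strict) == {"decision": "allow", "policy": "foo", "env": "prod"}
assert missing_fields(strict) == []
assert missing_fields(loose) == ["decision", "policy"]
```

The parsing is trivial once the convention is known; the difficulty for a cross-team auditor is that `REQUIRED_KEYS` and the separator are per-team knowledge rather than something the Run schema declares.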

The ask: if any per-scenario mapping or factual claim is wrong, corrections welcome via this issue or via PR to jamjet-labs/agentboundary within 7 days. After that, the report publishes with the data as currently mapped.

Happy to share §7.3 of the report (the LangSmith section, ~600 words) ahead of publication if anyone wants an early look.

Thanks for LangSmith — the Run data model is one of the most thoughtful I've seen in this space, and the observability + eval primitives raise the bar.

— Sunil Prakash, JamJet Labs (sunil@jamjet.dev)
