Quick external LangGraph agent reliability test #238

productmakerjason · 2026-05-23T21:41:31Z

Hey — EvalView is very close to the problem I’m testing.

I’m collecting a few quick external agent runs around a tiny task-feed reliability test:

Can an agent follow /llms.txt → tasks.json → schema → payload without inventing missing context or claiming completion without evidence?

Start:

https://the-agents-of-nations.vercel.app/llms.txt

One-line failure point is enough. This might be an interesting tiny regression/eval case.

hidai25 · 2026-05-23T22:12:56Z

Maintainer

Hey, interesting test case.

The failure mode is relevant to EvalView: agents that appear to complete a task while skipping part of the evidence chain.

I’ll review the public files and think about whether this fits as a small regression example.

Before going further, it would be useful to know a bit more about the project and what you’re hoping to learn from these external runs.

Hidai

2 replies

productmakerjason May 24, 2026
Author

Hi Hidai,

Thanks!

Yes, that’s exactly the failure mode I’m trying to understand.

The project is still very small. It’s a public agent-readable task arena, not a marketplace or a finished product. The goal is to see whether an agent can follow an external task flow and leave enough evidence to verify what actually happened.

The current flow is roughly:

/llms.txt → tasks.json → task schema → submission schema → payload

What I’m trying to learn from external runs is:

whether the agent can fetch the required files
whether it selects a real listed task_id
whether it actually reads the schemas
whether it avoids claiming completion or submission without evidence
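
As a rough illustration, the four checks above could be scored against a run log along these lines. This is only a sketch under assumptions: the `RunRecord` fields, the step labels, and the `first_break` helper are hypothetical names made up for this example, not part of the project or of EvalView.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class RunRecord:
    """Hypothetical log of one agent run (all field names are assumptions)."""
    evidenced: Set[str] = field(default_factory=set)        # steps with recorded evidence
    listed_task_ids: Set[str] = field(default_factory=set)  # ids actually present in tasks.json
    selected_task_id: Optional[str] = None                  # id the agent chose
    claimed_submission: bool = False                        # agent said it submitted
    receipt: Optional[str] = None                           # proof that a submission happened


def first_break(run: RunRecord) -> Optional[str]:
    """Return a one-line note for the first evidence-chain break, or None if intact."""
    if "llms.txt" not in run.evidenced:
        return "Failed at llms.txt"
    if "tasks.json" not in run.evidenced:
        return "Fetched llms.txt, failed at tasks.json"
    if run.selected_task_id not in run.listed_task_ids:
        return "Selected task_id is not listed in tasks.json"
    # Both schemas must have been read before a payload counts as grounded.
    for schema in ("task schema", "submission schema"):
        if schema not in run.evidenced:
            return f"Selected a task_id but skipped {schema}"
    if "payload" not in run.evidenced:
        return "Read schemas but prepared no payload"
    if run.claimed_submission and run.receipt is None:
        return "Prepared payload but claimed submission without receipt"
    return None


# Example: the agent fetched llms.txt and nothing else.
print(first_break(RunRecord(evidenced={"llms.txt"})))
```

The checks run in flow order, so the function reports only the earliest break, which is the one-line failure point I’m after.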

So your framing is exactly right: the interesting failure is when an agent appears to complete the task while skipping part of the evidence chain.

Start URL:
https://the-agents-of-nations.vercel.app/llms.txt

A very small regression example would be enough. Even a failed run is useful if it shows the first point where the evidence chain breaks.

For example:

“Fetched llms.txt, failed at tasks.json”
“Selected a task_id but skipped schema”
“Prepared payload but claimed submission without receipt”

If this fits EvalView, I’d be happy for it to be included as a small public regression example.

Regards,

Jason.

productmakerjason May 25, 2026
Author

Hi Hidai :)

Just following up lightly.

Even a very small note like “the agent failed at tasks.json” or “schema was skipped” would be useful.

No need for a full review. I’m mainly trying to capture the first evidence-chain break.

Best,
Jason
