fix dynamic resource finding (#50) #101
Open
JudyZhu45 wants to merge 4 commits into ChicagoHAI:main from
Conversation
- Exponential backoff with ±50% jitter on 429 and 5xx (sketched below)
- httpx.RequestError also retries
- Other 4xx return immediately
- Max 5 attempts, base 1s, max 60s

Fixes Task 1 of ChicagoHAI#50.
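A minimal sketch of that retry policy, assuming an `httpx.Client` caller; `get_with_retry` and the constant names are illustrative, not the PR's actual code:

```python
import random
import time

import httpx

MAX_ATTEMPTS = 5
BASE_DELAY = 1.0   # seconds
MAX_DELAY = 60.0   # cap on any single backoff sleep

def get_with_retry(client: httpx.Client, url: str) -> httpx.Response:
    """Retry 429/5xx and network errors with jittered exponential backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = client.get(url)
            # Non-retryable outcome: success or a permanent 4xx (e.g. 401/404)
            if resp.status_code != 429 and resp.status_code < 500:
                return resp
            if attempt == MAX_ATTEMPTS:
                return resp  # attempts exhausted, surface the last response
        except httpx.RequestError:
            if attempt == MAX_ATTEMPTS:
                raise  # attempts exhausted, surface the network error
        # Exponential backoff with ±50% jitter, capped at MAX_DELAY
        delay = min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
        time.sleep(delay * random.uniform(0.5, 1.5))
```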
These fields are returned by the paper-finder service (defined in ai2i.dcollection.Document) but were dropped in the wrapper's response parsing. Needed for source quality scoring in the next commit. Output dict uses 'venue' and 'influential_citations' as keys. Part of Task 2 of ChicagoHAI#50.
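Illustratively, the pass-through could look like the following; the two output keys match the commit message, while `parse_paper` and the upstream field access are assumptions about the wrapper's shape:

```python
def parse_paper(doc: dict) -> dict:
    """Hypothetical wrapper parsing; output keys follow the commit message."""
    return {
        "title": doc.get("title"),
        "relevance": doc.get("relevance"),
        # Previously dropped ai2i.dcollection.Document fields, now passed through:
        "venue": doc.get("venue"),  # free-form string, may be empty
        "influential_citations": doc.get("influential_citation_count"),
    }
```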
Summary
This PR fixes Dynamic resource finding #50 by making the resource-gathering pipeline more reliable.
Core insight: agents can't tell good sources from bad until the metadata they see and the errors they hit look the same across sources.
Features implemented:

- `venue` and `influential_citation_count` from paper-finder responses (the wrapper used to drop them)
- Source quality score: `relevance + log1p(influential_citations, cap 2) + 0.5 × exp(-age/5)`, sketched below
- Pre-flight verifier returning `{id, exists, url, error}` for HF / Kaggle / generic URLs
- `resource_finder.txt` calls the verifier before any download
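A sketch of `score_paper()` under that formula; I read "cap 2" as capping the citation term at 2 and the age unit as years — both assumptions, and the weights are the PR's admitted guesses:

```python
import math

def score_paper(relevance: float, influential_citations: int, age_years: float) -> float:
    """relevance + capped log1p(citations) + recency bonus."""
    citation_term = min(math.log1p(influential_citations), 2.0)  # "cap 2" read as a cap on the term
    recency_term = 0.5 * math.exp(-age_years / 5.0)
    return relevance + citation_term + recency_term
```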
Limitations

1. Pre-flight saves failed-download time, but doesn't shortcut alternative search
The verifier catches dead sources (gated 401, missing 404, network errors) before `load_dataset()` retries burn minutes — that part is solved. But `{exists: false}` doesn't stop the agent from looking for mirrors or alternative sources elsewhere.

End-to-end LoCoMo run: HF returned 401, the agent skipped the bad path, then spent time finding the `Aman279/Locomo` mirror and the GitHub repo. Faster pivot, but not a free skip — the agent keeps searching until something works. In this case that's the right behavior.
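A minimal sketch of such a pre-flight check, assuming the `{id, exists, url, error}` result shape from the feature list; `check_url` and its signature are illustrative, not the PR's code:

```python
import httpx

def check_url(resource_id: str, url: str) -> dict:
    """Pre-flight existence check before any download is attempted."""
    try:
        resp = httpx.head(url, follow_redirects=True, timeout=10.0)
        if resp.status_code == 200:
            return {"id": resource_id, "exists": True, "url": url, "error": None}
        # 401 = gated (the LoCoMo case above), 404 = missing
        return {"id": resource_id, "exists": False, "url": url,
                "error": f"HTTP {resp.status_code}"}
    except httpx.RequestError as exc:
        # Network failure: report it instead of letting a download retry loop burn time
        return {"id": resource_id, "exists": False, "url": url, "error": str(exc)}
```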
2. Scoring weights are guesses, and venue is hard to score from a script

The weights in `score_paper()` are guesses — I don't have validation data to pick the right recency-vs-citation tradeoff. They look fine on the runs I tested but aren't tuned.

Venue is harder. The service returns `venue` as a free-form string ("NeurIPS 2024", "Conference on Computer Vision...", sometimes empty). I extracted the field so it's in the result dict, but didn't add it to the formula.
Next steps
1. Experiment runner fabricates data without orchestrator visibility (top priority)
Discovered during this PR's e2e run on the memory drift problem: the agent's first Write of `experiment.py` already contained `_generate_synthetic_multiwoz` and `inject_drift_events` — synthetic fallback was designed in before any real data was attempted. When MultiWOZ parsing failed (real data on disk, parser assumed the wrong schema), the agent silently used the synthesis path. `metrics.json` then reported 99% accuracy on a benchmark the agent fabricated. The verifier guards "did we get the data" but not "did we use it" — synthesis happens inside experiment-runner code, invisible to the orchestrator.

Two-piece fix planned next:
- `session_instructions.txt`: experiment-runner code must raise on dataset-load failure, not fall back to a synthesized replacement.
- `metrics.json` declares `labels_provenance` (`dataset_shipped` / `derived` / `agent_generated`). The pipeline halts before the paper writer if a real-world claim is backed by `agent_generated` labels. A sketch of this gate follows the list.
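A sketch of the planned gate, assuming the `labels_provenance` values above; `gate_before_paper_writer` and the metrics-file handling are my illustration, not committed code:

```python
import json

# Provenance values a real-world claim may rest on (per the fix description)
TRUSTED_PROVENANCE = {"dataset_shipped", "derived"}

def gate_before_paper_writer(metrics_path: str) -> None:
    """Halt the pipeline if metrics were computed on agent-generated labels."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    provenance = metrics.get("labels_provenance")
    if provenance not in TRUSTED_PROVENANCE:
        raise RuntimeError(
            f"labels_provenance={provenance!r}: refusing to hand "
            "agent-generated results to the paper writer"
        )
```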
2. Venue ranking in scoring

Citations and recency are in `score_paper()`; `venue` is in the result dict but not in the formula. Tier lists like "NeurIPS=1.0, AAAI=0.5" need field-specific agreement we don't have for a multi-domain agent. The field is exposed for future scoring work.

3. Credentialed and hallucinated datasets
The verifier sees a 200 on PhysioNet / CheXpert pages but can't tell that downloads need a CITI/DUA workflow. It also can't tell a hallucinated dataset name from a renamed one. Both are agent-level policy (read page keywords, search for similar slugs), not verifier problems. Worth a follow-up.
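As a starting point for that follow-up, an illustrative agent-level heuristic (the keyword list and helper are assumptions, not anything in this PR): a 200 page can still hide a gated download, so scan the page text for access-control language.

```python
import httpx

# Hypothetical keyword list; would need tuning against real gated pages
# (PhysioNet, CheXpert) before relying on it
GATED_KEYWORDS = (
    "data use agreement",
    "credentialed access",
    "citi training",
    "request access",
)

def looks_gated(url: str) -> bool:
    """Heuristic: does a 200 page still demand a DUA/credential workflow?"""
    text = httpx.get(url, follow_redirects=True, timeout=10.0).text.lower()
    return any(keyword in text for keyword in GATED_KEYWORDS)
```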
4. User-specified scope (Task 5)
Solvable by adding a flag to the schema, but I think it overlaps with future work. Deferred.
5. Paper-finder as a research planner
From my perspective, paper-finder today is more like `query → relevant papers`, but NeuriCo's actual need is `idea → research evidence map`. Papers play different roles for an idea (problem-defining, dataset/benchmark, evaluation protocol, related work, negative result), and "is this source good?" is really "does this paper fill a role the experiment needs?".

Additionally, paper-finder should be persistent across the run — callable when the experiment runner hits specific gaps.
Research Questions
1. How to assess source reliability automatically?
Reliability isn't a property of the paper alone — it's a property of the (paper, idea) pair. A high-citation paper can still be the wrong tool for a specific experiment. The structural alternative is to define typed roles a paper can fill (problem / baseline / dataset / eval protocol / implementation / negative result) and ask whether each role is filled, instead of computing a single quality score. This PR's scoring is comparable across results but doesn't yet check role fitness.
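A sketch of what that typed-role check could look like; the role names follow the list above, while the data shape and function are assumptions:

```python
# Roles a paper can fill for an idea (from the list above)
ROLES = ("problem", "baseline", "dataset", "eval_protocol",
         "implementation", "negative_result")

def role_coverage(assignments: dict[str, list[str]]) -> dict[str, bool]:
    """assignments maps role -> ids of papers filling it; returns fill status per role."""
    return {role: bool(assignments.get(role)) for role in ROLES}

# Example: dataset and negative-result roles are unfilled, however high the papers score
coverage = role_coverage({"problem": ["paper_a"], "baseline": ["paper_b"],
                          "eval_protocol": ["paper_b"], "implementation": ["paper_c"]})
```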
2. When should the agent use training knowledge vs. searching?
The structural alternative: search is triggered by visible gaps in the coverage matrix or by specific failures (baseline didn't run, dataset returned an unexpected schema, results contradict the literature). This turns an introspective decision into a structural one. Persistent paper-finder access during the experiment runner phase makes this concrete.
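Continuing the coverage sketch above (all names assumed): a search fires only when the coverage map shows a gap or a concrete failure signal arrives, making the decision mechanical rather than introspective.

```python
def search_triggers(coverage: dict[str, bool], failures: list[str]) -> list[str]:
    """Turn coverage gaps and runtime failures into explicit search tasks."""
    tasks = [f"search papers for unfilled role: {role}"
             for role, filled in coverage.items() if not filled]
    tasks += [f"search for context on failure: {failure}" for failure in failures]
    return tasks

# e.g. a schema mismatch during the experiment phase becomes a paper-finder call
print(search_triggers({"dataset": False, "baseline": True},
                      ["dataset returned an unexpected schema"]))
```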
3. How to prevent the synthetic data problem?
Same structural framing as Q1/Q2. The problem has two sides: fabricated inputs (no real data, or the wrong kind — LLM-generated benchmarks, tiny samples used as the full eval) and fabricated outputs (numbers the agent didn't actually compute). Both need provenance the orchestrator can verify, not the agent's say-so.