fix dynamic resource finding (#50) #101
Open
JudyZhu45 wants to merge 4 commits into ChicagoHAI:main from
Conversation
- Exponential backoff with ±50% jitter on 429 and 5xx (sketched below)
- httpx.RequestError also retries
- Other 4xx return immediately
- Max 5 attempts, base 1s, max 60s

Fixes Task 1 of ChicagoHAI#50.
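A minimal sketch of that retry policy, assuming an `httpx.Client` caller; `get_with_retry` and the constant names are illustrative, not the PR's actual code:

```python
import random
import time

import httpx

MAX_ATTEMPTS = 5
BASE_DELAY = 1.0   # seconds
MAX_DELAY = 60.0   # cap on any single backoff sleep

def get_with_retry(client: httpx.Client, url: str) -> httpx.Response:
    """Retry 429/5xx and network errors with jittered exponential backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = client.get(url)
            # Non-retryable outcome: success or a permanent 4xx (e.g. 401/404)
            if resp.status_code != 429 and resp.status_code < 500:
                return resp
            if attempt == MAX_ATTEMPTS:
                return resp  # attempts exhausted, surface the last response
        except httpx.RequestError:
            if attempt == MAX_ATTEMPTS:
                raise  # attempts exhausted, surface the network error
        # Exponential backoff with ±50% jitter, capped at MAX_DELAY
        delay = min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
        time.sleep(delay * random.uniform(0.5, 1.5))
```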
These fields are returned by the paper-finder service (defined in ai2i.dcollection.Document) but were dropped in the wrapper's response parsing. Needed for source quality scoring in the next commit. Output dict uses 'venue' and 'influential_citations' as keys. Part of Task 2 of ChicagoHAI#50.
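Illustratively, the pass-through could look like the following; the two output keys match the commit message, while `parse_paper` and the upstream field access are assumptions about the wrapper's shape:

```python
def parse_paper(doc: dict) -> dict:
    """Hypothetical wrapper parsing; output keys follow the commit message."""
    return {
        "title": doc.get("title"),
        "relevance": doc.get("relevance"),
        # Previously dropped ai2i.dcollection.Document fields, now passed through:
        "venue": doc.get("venue"),  # free-form string, may be empty
        "influential_citations": doc.get("influential_citation_count"),
    }
```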
Summary
This PR fixes Dynamic resource finding #50 by making the resource-gathering pipeline more reliable.
Core insight: agents can't tell good sources from bad until the metadata they see and the errors they hit look the same across sources.
Features implemented:

- `venue` and `influential_citation_count` from paper-finder responses (the wrapper used to drop them)
- Source quality score: `relevance + log1p(influential_citations, cap 2) + 0.5 × exp(-age/5)`, sketched below
- Pre-flight verifier returning `{id, exists, url, error}` for HF / Kaggle / generic URLs
- `resource_finder.txt` calls the verifier before any download
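A sketch of `score_paper()` under that formula; I read "cap 2" as capping the citation term at 2 and the age unit as years — both assumptions, and the weights are the PR's admitted guesses:

```python
import math

def score_paper(relevance: float, influential_citations: int, age_years: float) -> float:
    """relevance + capped log1p(citations) + recency bonus."""
    citation_term = min(math.log1p(influential_citations), 2.0)  # "cap 2" read as a cap on the term
    recency_term = 0.5 * math.exp(-age_years / 5.0)
    return relevance + citation_term + recency_term
```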
Limitations

1. Pre-flight saves failed-download time, but doesn't shortcut alternative search
The verifier catches dead sources (gated 401, missing 404, network errors) before `load_dataset()` retries burn minutes — that part is solved. But `{exists: false}` doesn't stop the agent from looking for mirrors or alternative sources elsewhere.

End-to-end LoCoMo run: HF returned 401, the agent skipped the bad path, then spent time finding the `Aman279/Locomo` mirror and the GitHub repo. Faster pivot, but not a free skip — the agent keeps searching until something works. In this case that's the right behavior.
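A minimal sketch of such a pre-flight check, assuming the `{id, exists, url, error}` result shape from the feature list; `check_url` and its signature are illustrative, not the PR's code:

```python
import httpx

def check_url(resource_id: str, url: str) -> dict:
    """Pre-flight existence check before any download is attempted."""
    try:
        resp = httpx.head(url, follow_redirects=True, timeout=10.0)
        if resp.status_code == 200:
            return {"id": resource_id, "exists": True, "url": url, "error": None}
        # 401 = gated (the LoCoMo case above), 404 = missing
        return {"id": resource_id, "exists": False, "url": url,
                "error": f"HTTP {resp.status_code}"}
    except httpx.RequestError as exc:
        # Network failure: report it instead of letting a download retry loop burn time
        return {"id": resource_id, "exists": False, "url": url, "error": str(exc)}
```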
2. Scoring weights are guesses, and venue is hard to score from a script

The weights in `score_paper()` are guesses — I don't have validation data to pick the right recency-vs-citation tradeoff. They look fine on the runs I tested but aren't tuned.

Venue is harder. The service returns `venue` as a free-form string ("NeurIPS 2024", "Conference on Computer Vision...", sometimes empty). I extracted the field so it's in the result dict, but didn't add it to the formula.
Next steps
1. Experiment runner fabricates data without orchestrator visibility (top priority)
Discovered during this PR's e2e run on the memory drift problem: the agent's first Write of `experiment.py` already contained `_generate_synthetic_multiwoz` and `inject_drift_events` — synthetic fallback was designed in before any real data was attempted. When MultiWOZ parsing failed (real data on disk, parser assumed the wrong schema), the agent silently used the synthesis path. `metrics.json` then reported 99% accuracy on a benchmark the agent fabricated. The verifier guards "did we get the data" but not "did we use it" — synthesis happens inside experiment-runner code, invisible to the orchestrator.

Two-piece fix planned next:
- `session_instructions.txt`: experiment-runner code must raise on dataset-load failure, not fall back to a synthesized replacement.
- `metrics.json` declares `labels_provenance` (`dataset_shipped` / `derived` / `agent_generated`). The pipeline halts before the paper writer if a real-world claim is backed by `agent_generated` labels. A sketch of this gate follows the list.
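A sketch of the planned gate, assuming the `labels_provenance` values above; `gate_before_paper_writer` and the metrics-file handling are my illustration, not committed code:

```python
import json

# Provenance values a real-world claim may rest on (per the fix description)
TRUSTED_PROVENANCE = {"dataset_shipped", "derived"}

def gate_before_paper_writer(metrics_path: str) -> None:
    """Halt the pipeline if metrics were computed on agent-generated labels."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    provenance = metrics.get("labels_provenance")
    if provenance not in TRUSTED_PROVENANCE:
        raise RuntimeError(
            f"labels_provenance={provenance!r}: refusing to hand "
            "agent-generated results to the paper writer"
        )
```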
2. Venue ranking in scoring

Citations and recency are in `score_paper()`; `venue` is in the result dict but not in the formula. Tier lists like "NeurIPS=1.0, AAAI=0.5" need field-specific agreement we don't have for a multi-domain agent. The field is exposed for future scoring work.

3. Credentialed and hallucinated datasets
The verifier sees a 200 on PhysioNet / CheXpert pages but can't tell that downloads need a CITI/DUA workflow. It also can't tell a hallucinated dataset name from a renamed one. Both are agent-level policy (read page keywords, search for similar slugs), not verifier problems. Worth a follow-up.
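As a starting point for that follow-up, an illustrative agent-level heuristic (the keyword list and helper are assumptions, not anything in this PR): a 200 page can still hide a gated download, so scan the page text for access-control language.

```python
import httpx

# Hypothetical keyword list; would need tuning against real gated pages
# (PhysioNet, CheXpert) before relying on it
GATED_KEYWORDS = (
    "data use agreement",
    "credentialed access",
    "citi training",
    "request access",
)

def looks_gated(url: str) -> bool:
    """Heuristic: does a 200 page still demand a DUA/credential workflow?"""
    text = httpx.get(url, follow_redirects=True, timeout=10.0).text.lower()
    return any(keyword in text for keyword in GATED_KEYWORDS)
```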
4. User-specified scope (Task 5)
Solvable by adding a flag to the schema, but I think it overlaps with future work. Deferred.
5. Paper-finder as a research planner
From my perspective, paper-finder today is more like `query → relevant papers`, but NeuriCo's actual need is `idea → research evidence map`. Papers play different roles for an idea (problem-defining, dataset/benchmark, evaluation protocol, related work, negative result), and "is this source good?" is really "does this paper fill a role the experiment needs?".

Additionally, paper-finder should be persistent across the run — callable when the experiment runner hits specific gaps.
Research Questions
1. How to assess source reliability automatically?
Reliability isn't a property of the paper alone — it's a property of the (paper, idea) pair. A high-citation paper can still be the wrong tool for a specific experiment. The structural alternative is to define typed roles a paper can fill (problem / baseline / dataset / eval protocol / implementation / negative result) and ask whether each role is filled, instead of computing a single quality score. This PR's scoring is comparable across results but doesn't yet check role fitness.
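A sketch of what that typed-role check could look like; the role names follow the list above, while the data shape and function are assumptions:

```python
# Roles a paper can fill for an idea (from the list above)
ROLES = ("problem", "baseline", "dataset", "eval_protocol",
         "implementation", "negative_result")

def role_coverage(assignments: dict[str, list[str]]) -> dict[str, bool]:
    """assignments maps role -> ids of papers filling it; returns fill status per role."""
    return {role: bool(assignments.get(role)) for role in ROLES}

# Example: dataset and negative-result roles are unfilled, however high the papers score
coverage = role_coverage({"problem": ["paper_a"], "baseline": ["paper_b"],
                          "eval_protocol": ["paper_b"], "implementation": ["paper_c"]})
```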
2. When should the agent use training knowledge vs. searching?
The structural alternative: search is triggered by visible gaps in the coverage matrix or by specific failures (baseline didn't run, dataset returned an unexpected schema, results contradict the literature). This turns an introspective decision into a structural one. Persistent paper-finder access during the experiment runner phase makes this concrete.
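Continuing the coverage sketch above (all names assumed): a search fires only when the coverage map shows a gap or a concrete failure signal arrives, making the decision mechanical rather than introspective.

```python
def search_triggers(coverage: dict[str, bool], failures: list[str]) -> list[str]:
    """Turn coverage gaps and runtime failures into explicit search tasks."""
    tasks = [f"search papers for unfilled role: {role}"
             for role, filled in coverage.items() if not filled]
    tasks += [f"search for context on failure: {failure}" for failure in failures]
    return tasks

# e.g. a schema mismatch during the experiment phase becomes a paper-finder call
print(search_triggers({"dataset": False, "baseline": True},
                      ["dataset returned an unexpected schema"]))
```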
3. How to prevent the synthetic data problem?
Same structural framing as Q1/Q2. The problem has two sides: fabricated inputs (no real data, or the wrong kind — LLM-generated benchmarks, tiny samples used as the full eval) and fabricated outputs (numbers the agent didn't actually compute). Both need provenance the orchestrator can verify, not the agent's say-so.