diff --git a/CHANGELOG.md b/CHANGELOG.md index 50547de..ab47b85 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,12 +6,31 @@ All notable changes to RESYNTH are documented here. The format follows ## [Unreleased] -### Changed -- Default model for the Claude Code operator is now Fable 5 - (`claude-fable-5`), Anthropic's newest and fastest top-tier model; - override per workspace in `operator.yaml` as before. +## [0.2.0] - 2026-06-12 ### Added +- Source resolution (`resynth resolve`): follows links and file references + found inside ingested sources and registers what it fetches as new first + class sources with provenance. Five target kinds: html articles, pdf + links, local files, YouTube videos and Vimeo videos. Public video + captions become timestamped transcripts. +- Transcript pending stubs: a video without public captions still becomes + a real source, and a later successful fetch upgrades the stub in place, + keeping its source id. +- The resolution manifest at `index/resolution.jsonl`: records every + target and its outcome so re-runs are idempotent. Fetched and duplicate + targets are never retried, failed and pending ones are. +- Schema v2 source frontmatter: `source_type`, `url`, `resolved_from` and, + for video sources, `transcript_status`. +- Optional `source_locator` on claims, a deep link into the source built + from a url, page, timestamp or anchor, validated by `extract-verify`. +- `resynth migrate`: explicit upgrade of a project's sources to schema v2. + Bodies and content hashes are untouched, re-sealing stays a separate + operator step. +- `resynth --version`. +- MASTER.json format `resynth-master/2` with a sources array, plus a + `load_master` reader that accepts both `/1` and `/2`. +- The guided wizard offers source resolution straight after intake. 
- Completion ping: when a delegated AI step runs longer than 90 seconds, RESYNTH plays a sound and shows a desktop notification (Windows toast / macOS notification) when it finishes, and again when the master document @@ -21,6 +40,13 @@ All notable changes to RESYNTH are documented here. The format follows the assistant's output streamed as it arrives, instead of a silent prompt until completion. +### Changed +- Default model for the Claude Code operator is now Fable 5 + (`claude-fable-5`), Anthropic's newest and fastest top-tier model; + override per workspace in `operator.yaml` as before. +- The MASTER.md source register gains Type and Link columns, so every + source's kind and origin url are visible in the sealed master. + ### Fixed - Sealing failed with "paths are ignored by one of your .gitignore files" in workspaces where `projects/*` is gitignored (any workspace cloned from diff --git a/DECISIONS.md b/DECISIONS.md index b3f985f..ea92335 100644 --- a/DECISIONS.md +++ b/DECISIONS.md @@ -27,3 +27,9 @@ Every architectural decision with a one line rationale. - The wired assistant defaults to Claude Code with claude-opus-4-8 at high reasoning effort, adjustable with resynth operator. - Delegated operator steps verify against the stage gate and retry up to three times with the gate reasons fed back, then fall back to manual mode. - The brief step asks whether reports already exist and skips prompt generation when they do, consolidation only is a first class flow. +- resolve is a stage 01 verb that re-evaluates the intake gate, not a sixth gate, so already sealed projects never grow a phantom PENDING gate. +- Fetchers use only the Python standard library (urllib, html.parser, xml.etree), so the four dependency decision holds. +- Video transcripts are best effort, a pending stub is a real source that upgrades in place and keeps its source id, so claim ids built on it stay stable. 
+- Source schema v2 is adopted only through an explicit resynth migrate, never silently, and re-sealing afterwards stays an operator act. +- MASTER.json format resynth-master/2 ships together with a load_master reader that accepts both /1 and /2, so downstream consumers never break on old exports. +- Resolution is depth one by design, fetched sources are not scanned for further links unless the operator forces a re-scan with --source. diff --git a/README.md b/README.md index 7857c77..ce70980 100644 --- a/README.md +++ b/README.md @@ -78,11 +78,47 @@ agent then drives the pipeline below, and the result is one BEST master document, readable by humans, verifiable by machine, and exportable as JSON for a downstream AI agent to action. +After intake there is one optional extra step. RESYNTH offers to fetch the +links and file references found inside your reports and register them as +extra sources, so the things your reports cite become evidence too. + ``` chat -> brief -> per platform prompts -> research reports -> intake -> -extract -> reconcile -> synthesise -> audit -> seal -> MASTER.md + MASTER.json +resolve (optional) -> extract -> reconcile -> synthesise -> audit -> seal -> +MASTER.md + MASTER.json +``` + +## Fetching linked sources + +Research reports cite things. `resynth resolve ` follows those +citations and turns them into first class sources of their own. It scans +every ingested report for links and file references, fetches each one, and +registers the result with provenance back to the report that mentioned it. +It handles html articles, pdf links, local files, and YouTube and Vimeo +videos. Public video captions become timestamped transcripts. + +When a video has no public captions, resolve still creates the source as a +pending transcript stub. You can paste a transcript into the stub yourself, +or re-run resolve later to retry. A later successful fetch upgrades the stub +in place and keeps the same source id. + +Resolution is idempotent. 
Every outcome is recorded in +`index/resolution.jsonl`, fetched and duplicate targets are never fetched +twice, and failed or pending targets are retried on the next run. Fetching +is polite: robots.txt is honoured, requests to the same host are spaced one +second apart, and responses are capped at 10 MiB. Resolution goes one level +deep only. Fetched sources are not scanned for further links unless you +force a re-scan with `--source`. + +``` +resynth resolve myproject +resynth resolve myproject --only youtube # just targets matching a substring +resynth resolve myproject --source S03 # re-scan one source, even a fetched one ``` +The full reference, including the manifest format, the source schema and +the migration guide, lives in [docs/SOURCE-RESOLUTION.md](docs/SOURCE-RESOLUTION.md). + ## Install ``` @@ -135,6 +171,7 @@ scripts/run_demo.py and in the end to end test. resynth init create project skeleton plus default merge-rules.yaml resynth brief --topic capture the research question, generate the prompt workspace resynth intake --source ... stage 1, repeatable per file +resynth resolve fetch links and file references inside sources as new first class sources resynth extract stage 2 workspace generation resynth extract-verify stage 2 gate resynth reconcile stage 3, also evaluates the gate @@ -144,6 +181,7 @@ resynth audit stage 5 coverage, drift, traceability resynth seal hash everything, commit SEAL.yaml, tag the repo resynth export machine readable output/MASTER.json for agents resynth status gate dashboard +resynth migrate upgrade a project's sources to the current schema (v2) resynth operator show or set the wired AI assistant, model and effort resynth doctor environment probe ``` @@ -173,6 +211,11 @@ Then run: resynth extract-verify --json and fix every violation until the gate reports PASS. ``` +Each claim may also carry an optional `source_locator`, a deep link into the +source built from a url, a page number, a timestamp or an anchor. 
The full +claim and source schemas live in +[docs/SOURCE-RESOLUTION.md](docs/SOURCE-RESOLUTION.md). + ### Agent prompt for stage 3, reconciliation ``` diff --git a/docs/SOURCE-RESOLUTION.md b/docs/SOURCE-RESOLUTION.md new file mode 100644 index 0000000..ce4b362 --- /dev/null +++ b/docs/SOURCE-RESOLUTION.md @@ -0,0 +1,280 @@ +# RESYNTH Source Resolution + +> [!abstract] Purpose +> The full reference for `resynth resolve` and everything it touches: the +> resolution flow, the manifest, transcript handling, source frontmatter +> schema v2, the claim `source_locator`, MASTER.json formats and the +> migration guide for pre 0.2.0 projects. + +## What resolve does + +Research reports cite things, and `resynth resolve ` turns those +citations into evidence. It scans every ingested source for links and file +references, fetches each one over the network or from disk, and registers +the result as a new first class source with provenance back to the source +that mentioned it. Fetched sources carry the same frontmatter, content hash +and gate checks as any hand ingested report, so everything downstream of +intake treats them identically. Resolve is a stage 1 verb. It re-evaluates +gate 01-intake when it finishes and adds no gate of its own. + +## The resolution flow + +1. **Discover.** Every source without a `resolved_from` parent is scanned. + Targets come from markdown link destinations, bare urls and backtick + spans. A local path is only accepted when it has a supported suffix + (.md .txt .docx .pdf) and the file actually exists, either as an + absolute path or relative to the folder the parent source came from. + Nothing else is guessed. +2. **Classify.** Each target becomes one of four kinds: `youtube` and + `vimeo` by hostname, `local` for an existing file on disk, and `url` + for every other http or https link. +3. **Fetch.** The matching fetcher retrieves the content. Web fetching + respects the etiquette rules below. Failures are recorded, never fatal. +4. 
**Register.** The fetched content goes through the same registration as + intake: it is hashed, deduplicated against every existing source, given + the next free source id, and written to `sources/` with schema v2 + frontmatter including `resolved_from` set to the parent source id. +5. **Manifest.** Every outcome is written to `index/resolution.jsonl` so + the next run knows what to skip and what to retry. Gate 01-intake is + then re-evaluated. + +Resolution is depth one by design. Fetched sources are never scanned for +further links on a normal run. To go deeper deliberately, name the fetched +source explicitly: `resynth resolve --source S04`. + +## Supported targets + +| Kind | What is fetched | Resulting source_type | +| --- | --- | --- | +| html page | readable article text, reduced from the main or article region, navigation and boilerplate dropped | html-article | +| pdf link | the pdf body converted with pdftotext (detected by content type or a .pdf path) | pdf | +| local file | the file converted exactly as intake would convert it | pdf for .pdf, notes otherwise | +| YouTube video | the public caption track as a timestamped transcript, English preferred | video-transcript | +| Vimeo video | the public text track (WebVTT) as a timestamped transcript, English preferred | video-transcript | + +An html page that yields fewer than 200 characters of text fails with +`page yielded no extractable text (login wall or script rendered)`. Any +other content type fails with `unsupported content type`. 
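The classify step of the resolution flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the code RESYNTH ships: the function name `classify_target` and both hostname sets are hypothetical, and the real fetchers may recognise more or fewer hosts.

```python
from __future__ import annotations

from pathlib import Path
from urllib.parse import urlparse

# Assumed hostname sets, for illustration only.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}
VIMEO_HOSTS = {"vimeo.com", "www.vimeo.com", "player.vimeo.com"}
# Suffixes the document lists as supported for local paths.
SUPPORTED_SUFFIXES = {".md", ".txt", ".docx", ".pdf"}


def classify_target(target: str, parent_dir: Path) -> str | None:
    """Return 'youtube', 'vimeo', 'url' or 'local', or None when nothing matches."""
    parsed = urlparse(target)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        if host in YOUTUBE_HOSTS:
            return "youtube"
        if host in VIMEO_HOSTS:
            return "vimeo"
        return "url"
    # A local path needs a supported suffix and must exist on disk, either
    # as an absolute path or relative to the parent source's folder.
    path = Path(target)
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        return None
    candidate = path if path.is_absolute() else parent_dir / path
    return "local" if candidate.is_file() else None
```

Returning `None` rather than guessing mirrors the rule that a target which is neither a web link nor an existing supported file is simply not a target.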
+ +## Network etiquette + +| Rule | Value | +| --- | --- | +| User agent | `resynth/ (+https://github.com/Markus-Doc/resynth) research consolidation tool` | +| robots.txt | honoured per host, a disallowed url fails with `disallowed by robots.txt` | +| Rate limit | at most one request per second to the same host | +| Timeout | 30 seconds per request | +| Size cap | 10 MiB per response, larger responses fail with `response exceeds 10 MiB` | + +Fetching uses only the Python standard library. There are no extra +dependencies and no API keys. + +## The resolution manifest + +`index/resolution.jsonl` holds one JSON object per discovered target. +Lines starting with `#` are comments. The target string is the key, so a +target keeps a single record across runs. + +| Field | Meaning | +| --- | --- | +| target | the discovered url or absolute local path | +| kind | url, local, youtube or vimeo | +| status | fetched, duplicate, transcript_pending or failed | +| source_id | the source the target became, null when no source exists yet | +| resolved_from | the source id the target was discovered in | +| sha256 | content hash of the registered body, null on failure | +| fetched_at | ISO date the record last changed | +| note | the short failure reason, null otherwise | + +Example line: + +``` +{"target": "https://example.com/articles/pipeline-reliability", "kind": "url", "status": "fetched", "source_id": "S04", "resolved_from": "S01", "sha256": "3f8c0d2ab1...", "fetched_at": "2026-06-12", "note": null} +``` + +Retry semantics are simple. `fetched` and `duplicate` are terminal, those +targets are reported as cached and never fetched again. `failed` and +`transcript_pending` are retried on every run. A re-run that changes +nothing rewrites each record byte for byte, including its original +`fetched_at` date, so unchanged projects stay diff clean. + +## Transcript handling + +For videos, resolve tries the platform's public caption sources. 
On +YouTube that is the timedtext caption listing, preferring a track whose +language code starts with `en`, otherwise the first track. On Vimeo it is +the player text tracks fetched as WebVTT, with the same English preference. +Cues become a `## Transcript` section of timestamped lines, for example +`[00:14:32] ...`, with a paragraph break wherever the audio gaps for more +than eight seconds. + +When no public captions exist, the video still becomes a real source: a +pending stub with `transcript_status: pending` and this body. + +``` +# {title} + +> [!info] Video transcript pending +> RESYNTH could not retrieve a public caption track for this video. +> Link: {url} +> Re-run `resynth resolve ` to retry, or paste the transcript +> below this callout. The next resolve run can also upgrade this stub. +``` + +Re-running resolve retries pending stubs. When captions have appeared, the +stub is upgraded in place: the same file gains the fetched transcript, a +fresh `sha256` and `transcript_status: fetched`, while `source_id`, +`date_ingested`, `recency_rank` and `resolved_from` are preserved. Because +the source id never changes, any claim ids already extracted against the +stub stay stable. + +To paste a transcript yourself, open the stub under `sources/`, paste the +transcript below the callout, set `transcript_status: fetched` and update +the `sha256` field to the SHA-256 hex digest of the new body (everything +after the closing `---` line). Gate 01 reports `sha256 does not match body +content` until the hash is correct, so the gate tells you when it is done. +A wired AI assistant can make these edits for you. Note that a later +resolve run will replace a pasted body if the platform fetch succeeds, so +remove the manifest line for that target if you want your paste to stand. 
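As a minimal sketch of the manual hash update described above, the helper below recomputes the digest of a stub's body. It assumes the body is hashed as UTF-8 bytes exactly as stored, with no newline or whitespace normalisation, which this document does not confirm. `body_sha256` is a hypothetical name and is not part of RESYNTH.

```python
import hashlib


def body_sha256(source_text: str) -> str:
    """SHA-256 hex digest of everything after the closing '---' line."""
    if not source_text.startswith("---\n"):
        raise ValueError("source file does not start with a frontmatter block")
    # Find the closing delimiter, searching from the end of the opening '---'.
    end = source_text.find("\n---\n", 3)
    if end == -1:
        raise ValueError("frontmatter block is not closed")
    body = source_text[end + len("\n---\n"):]
    # Assumption: the hash covers the raw UTF-8 bytes of the body.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Read the stub with `Path(...).read_text(encoding="utf-8")`, pass the text to the helper and copy the result into the `sha256` field. If gate 01 still reports a mismatch, the assumption about how the body is hashed does not hold for your version.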
+ +To force a refresh of a target already recorded as `fetched` or +`duplicate`, delete its line from `index/resolution.jsonl` and remove the +fetched source file from `sources/`, then re-run resolve. The target is +discovered again and fetched fresh under a new source id. Deleting only +the manifest line is not enough, the re-fetch would deduplicate against +the existing file and record a `duplicate`. + +## Source frontmatter schema v2 + +Every source written by RESYNTH 0.2.0 carries these fields. + +| Field | Since | Type | Meaning | +| --- | --- | --- | --- | +| source_id | v1 | string SNN | stable id, S01, S02 and so on | +| title | v1 | string | first heading of the body, or the file or page title | +| origin | v1 | string | the path or url the content came from | +| author_or_tool | v1 | string | author, channel or generating tool, unknown when unstated | +| date_authored | v1 | string | ISO date when known, otherwise unknown | +| date_ingested | v1 | string | ISO date the source entered the project | +| authority_tier | v1 | enum | primary, secondary, tertiary or unknown | +| recency_rank | v1 | integer | intake order, used as a tie breaker | +| sha256 | v1 | string | SHA-256 of the body, verified by gate 01 and the audit | +| schema_version | v2 | integer | always 2 | +| source_type | v2 | enum | one of the source types below | +| url | v2 | string or null | the canonical url for fetched web content | +| resolved_from | v2 | string or null | the source id this source was resolved out of | +| transcript_status | v2 | enum | fetched or pending, present only on video-transcript sources | + +`source_type` is one of: `report`, `html-article`, `pdf`, +`video-transcript`, `webinar`, `study-notes`, `dataset`, `notes`, `other`. +A `resolved_from` value must name a source that exists in the project. 
+ +A full example, a Vimeo transcript resolved out of report S01: + +``` +--- +source_id: S03 +title: Designing Reliable Pipelines +origin: https://vimeo.com/76979871 +author_or_tool: Conference Channel +date_authored: '2024-03-18' +date_ingested: '2026-06-12' +authority_tier: unknown +recency_rank: 3 +sha256: 3f8c0d2ab1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0 +schema_version: 2 +source_type: video-transcript +url: https://vimeo.com/76979871 +resolved_from: S01 +transcript_status: fetched +--- +# Designing Reliable Pipelines + +## Transcript + +[00:00:00] Welcome to the talk. +``` + +## Claim source_locator + +Claims may carry one optional field beyond the required schema: a +`source_locator` object that deep links the claim into its source. + +| Key | Type | Meaning | +| --- | --- | --- | +| url | non empty string | the url the claim is anchored to | +| page | positive integer | page number, for pdf sources | +| timestamp | string H:MM or HH:MM:SS | position in a video transcript | +| anchor | non empty string | a section slug or fragment identifier | + +Validation rules, enforced by `resynth extract-verify`: + +- `source_locator` must be an object with at least one of the four keys. +- No other keys are allowed. +- Each present key must match the type rules above. +- A claim against a video-transcript source without a timestamp draws a + warning, not a failure. +- A locator url that differs from the source's own `url` draws a warning. + +Example claim line: + +``` +{"claim_id": "S03-C002", "source_id": "S03", "claim_text": "Retry queues should cap at three attempts before alerting.", "claim_type": "recommendation", "topic_tags": ["reliability"], "supporting_quote_location": "Transcript at 14:32", "confidence_as_stated": "high", "depends_on": [], "source_locator": {"url": "https://vimeo.com/76979871", "timestamp": "00:14:32"}} +``` + +## MASTER.json formats + +`resynth export` writes format `resynth-master/2`. 
The only difference +from `resynth-master/1` is a top level `sources` array carrying every +source's frontmatter in a uniform v2 shape, sorted by source id. Sources +that were never migrated appear with `schema_version` 1 and defaults of +`source_type` report, `url` null and `resolved_from` null. + +Downstream consumers should read the file through `load_master`, which +accepts both formats: + +```python +from pathlib import Path +from resynth.export import load_master + +master = load_master(Path("projects/myproject/output/MASTER.json")) +master["format_version"] # 1 or 2 +master["sources"] # always present, empty for a /1 file +``` + +Any other format tag raises an error rather than guessing. + +## Migration guide + +Projects sealed before 0.2.0 hold schema v1 sources. They keep working as +they are. Gate 01 reports a warning, not a failure, and suggests the +migration. Upgrading is always an explicit act: + +``` +resynth migrate +``` + +What migrate changes: each v1 source's frontmatter gains +`schema_version: 2`, a `source_type` (pdf when the origin ends in .pdf, +otherwise report), `url: null` and `resolved_from: null`. + +What it never touches: source bodies, the stored `sha256` (it hashes the +body only, so it stays valid), claims, the index, the output, the seal +file and the git tags. Migration is idempotent, sources already on v2 are +reported as `already schema v2` and left alone. + +One consequence needs care. The seal hashes whole files, frontmatter +included, so after migration the existing `SEAL.yaml` no longer matches +the source files. The sealed git tag still pins the old state exactly. +Re-sealing is deliberately left to the operator, because a seal is a +statement that a human or agent verified the project at that point. 
The +worked sequence for a sealed project is: + +``` +resynth migrate myproject --dry-run # see what would change +resynth migrate myproject +resynth audit myproject +resynth seal myproject # produces the next version tag +``` + +The new tag pins the migrated state and the old tag remains as history. diff --git a/pyproject.toml b/pyproject.toml index 4c3558e..427562d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "resynth" -version = "0.1.0" +version = "0.2.0" description = "CLI research consolidation platform built on systematic review gates" readme = "README.md" requires-python = ">=3.11" diff --git a/src/resynth/__init__.py b/src/resynth/__init__.py index b04e0f8..cb0f565 100644 --- a/src/resynth/__init__.py +++ b/src/resynth/__init__.py @@ -4,4 +4,4 @@ AI agent, supplies judgement. RESYNTH has zero runtime AI dependency. """ -__version__ = "0.1.0" +__version__ = "0.2.0" diff --git a/src/resynth/cli.py b/src/resynth/cli.py index b3b225e..e6e7266 100644 --- a/src/resynth/cli.py +++ b/src/resynth/cli.py @@ -10,13 +10,16 @@ from rich.console import Console from rich.table import Table +from . import __version__ from . import audit as audit_mod from . import config from . import doctor as doctor_mod from . import export as export_mod from . import extract as extract_mod +from . import migrate as migrate_mod from . import project as project_mod from . import reconcile as reconcile_mod +from . import resolve as resolve_mod from . import synthesise as synth_mod from .errors import ResynthError from .gates import all_gates @@ -105,6 +108,7 @@ def main(self, *args, standalone_mode=True, **kwargs): @click.group(cls=_GuardedGroup, invoke_without_command=True) +@click.version_option(version=__version__, prog_name="resynth") @click.pass_context def main(ctx): """RESYNTH, research consolidation with systematic review gates. 
@@ -151,6 +155,34 @@ def intake(project, sources, as_json, dry_run): _run("intake", project, as_json, dry_run, run_intake, project, list(sources), dry_run=dry_run) +@main.command() +@click.argument("project") +@click.option("--source", "source_ids", multiple=True, help="Scan only these source ids (allows re-scanning resolved sources).") +@click.option("--only", default=None, help="Only targets containing this substring.") +@common +def resolve(project, source_ids, only, as_json, dry_run): + """Fetch links and file references inside sources as new first class sources.""" + _run( + "resolve", + project, + as_json, + dry_run, + resolve_mod.run_resolve, + project, + only=only, + source_ids=list(source_ids) or None, + dry_run=dry_run, + ) + + +@main.command() +@click.argument("project") +@common +def migrate(project, as_json, dry_run): + """Upgrade a project's sources to the current schema (v2). Re-seal is a separate step.""" + _run("migrate", project, as_json, dry_run, migrate_mod.run_migrate, project, dry_run=dry_run) + + @main.command() @click.argument("project") @common diff --git a/src/resynth/export.py b/src/resynth/export.py index 83d3b9f..e649d72 100644 --- a/src/resynth/export.py +++ b/src/resynth/export.py @@ -1,14 +1,37 @@ -"""Machine readable export of the sealed master for downstream AI agents.""" +"""Machine readable export of the sealed master for downstream AI agents. + +Downstream consumers read the file back with :func:`load_master`, which +accepts both the resynth-master/1 and resynth-master/2 payload formats. +""" from __future__ import annotations import json +from pathlib import Path from . 
import config +from .errors import ResynthError from .fsutil import safe_write from .gates import require_gate +from .intake import load_sources from .synthesise import _plan, _split_sections +FORMAT_V1 = "resynth-master/1" +FORMAT_V2 = "resynth-master/2" + + +def _export_sources(pdir: Path) -> list[dict]: + """Source frontmatter dicts in a uniform v2 shape, sorted by source_id.""" + out = [] + for fm in load_sources(pdir): + src = {k: v for k, v in fm.items() if k not in {"_file", "_body"}} + src.setdefault("schema_version", 1) + src.setdefault("source_type", "report") + src.setdefault("url", None) + src.setdefault("resolved_from", None) + out.append(src) + return sorted(out, key=lambda s: s["source_id"]) + def run_export(project: str, dry_run: bool = False) -> dict: pdir = config.project_dir(project) @@ -20,8 +43,9 @@ def run_export(project: str, dry_run: bool = False) -> dict: ] payload = { "project": project, - "format": "resynth-master/1", + "format": FORMAT_V2, "sections": sections, + "sources": _export_sources(pdir), "claims": sorted(plan["claims"].values(), key=lambda c: c["claim_id"]), "decisions": plan["decisions"], "winning_claims": plan["winners"], @@ -36,3 +60,17 @@ def run_export(project: str, dry_run: bool = False) -> dict: "events": [{"file": "MASTER.json", "action": outcome}], "messages": [f"output/MASTER.json: {outcome}"], } + + +def load_master(path: Path) -> dict: + """Read a MASTER.json of format resynth-master/1 or /2.""" + data = json.loads(Path(path).read_text(encoding="utf-8")) + tag = data.get("format") if isinstance(data, dict) else None + if tag == FORMAT_V1: + data.setdefault("sources", []) + data["format_version"] = 1 + elif tag == FORMAT_V2: + data["format_version"] = 2 + else: + raise ResynthError(f"unsupported master format {tag}") + return data diff --git a/src/resynth/extract.py b/src/resynth/extract.py index 555dffb..6341c8f 100644 --- a/src/resynth/extract.py +++ b/src/resynth/extract.py @@ -28,6 +28,9 @@ "confidence_as_stated", 
"depends_on", } +OPTIONAL_FIELDS = {"source_locator"} +LOCATOR_KEYS = {"url", "page", "timestamp", "anchor"} +TIMESTAMP_RE = re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$") COVERAGE_MIN_BYTES = 2048 COVERAGE_MIN_CLAIMS = 3 @@ -61,6 +64,8 @@ def _workspace_header(sid: str) -> str: f"# One JSON object per line. Lines starting with # are ignored.\n" f"# Schema template, copy the line below, remove the leading #, fill it in:\n" f"# {example}\n" + f'# optional: "source_locator": {{"url": "https://...", "page": 12, ' + f'"timestamp": "00:14:32", "anchor": "section-slug"}}\n' ) @@ -94,10 +99,32 @@ def run_extract(project: str, dry_run: bool = False) -> dict: } +def _validate_locator(loc) -> list[str]: + if not isinstance(loc, dict): + return ["source_locator must be an object"] + errors = [] + if not loc: + errors.append("source_locator must have at least one of url, page, timestamp, anchor") + errors.extend(f"unknown source_locator key {k}" for k in sorted(loc.keys() - LOCATOR_KEYS)) + if "url" in loc and (not isinstance(loc["url"], str) or not loc["url"].strip()): + errors.append("source_locator.url must be a non-empty string") + if "page" in loc and ( + not isinstance(loc["page"], int) or isinstance(loc["page"], bool) or loc["page"] < 1 + ): + errors.append("source_locator.page must be a positive integer") + if "timestamp" in loc and ( + not isinstance(loc["timestamp"], str) or not TIMESTAMP_RE.match(loc["timestamp"]) + ): + errors.append("source_locator.timestamp must look like H:MM or HH:MM:SS") + if "anchor" in loc and (not isinstance(loc["anchor"], str) or not loc["anchor"].strip()): + errors.append("source_locator.anchor must be a non-empty string") + return errors + + def validate_claim(obj: dict, sid: str) -> list[str]: errors = [] missing = REQUIRED_FIELDS - obj.keys() - extra = obj.keys() - REQUIRED_FIELDS + extra = obj.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS errors.extend(f"missing field {f}" for f in sorted(missing)) errors.extend(f"unknown field {f}" for f in 
sorted(extra)) if missing or extra: @@ -134,6 +161,8 @@ def validate_claim(obj: dict, sid: str) -> list[str]: isinstance(d, str) and CLAIM_ID_RE.match(d) for d in deps ): errors.append("depends_on must be a list of claim ids in SNN-CNNN format") + if "source_locator" in obj: + errors.extend(_validate_locator(obj["source_locator"])) return errors @@ -168,6 +197,8 @@ def run_extract_verify(project: str, dry_run: bool = False) -> dict: reasons.append(f"{sid}: claims file missing, run resynth extract") continue count = 0 + src_type = fm.get("source_type") + src_url = fm.get("url") for lineno, _raw, obj, err in iter_jsonl(path): where = f"{path.name}:{lineno}" if err: @@ -181,6 +212,12 @@ def run_extract_verify(project: str, dry_run: bool = False) -> dict: reasons.append(f"{where}: duplicate claim_id {cid}, first seen {seen_ids[cid]}") else: seen_ids[cid] = where + loc = obj.get("source_locator") + loc = loc if isinstance(loc, dict) else {} + if src_type == "video-transcript" and not loc.get("timestamp"): + warnings.append(f"{cid}: video source claim without a timestamp locator") + if src_url and loc.get("url") and loc["url"] != src_url: + warnings.append(f"{cid}: locator url does not match the source url") count += 1 all_claims.append(obj) claims_by_source[sid] = count diff --git a/src/resynth/intake.py b/src/resynth/intake.py index 83be453..aa3c1a7 100644 --- a/src/resynth/intake.py +++ b/src/resynth/intake.py @@ -29,6 +29,22 @@ AUTHORITY_TIERS = {"primary", "secondary", "tertiary", "unknown"} SUPPORTED = {".md", ".txt", ".docx", ".pdf"} +SCHEMA_VERSION = 2 +SOURCE_TYPES = { + "report", + "html-article", + "pdf", + "video-transcript", + "webinar", + "study-notes", + "dataset", + "notes", + "other", +} +TRANSCRIPT_STATUSES = {"fetched", "pending"} +RESOLVED_FROM_RE = re.compile(r"^S\d{2}$") +V2_FIELDS = ["schema_version", "source_type", "url", "resolved_from"] + def slugify(name: str) -> str: slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-") @@ -76,12 +92,71 
@@ def load_sources(pdir: Path) -> list[dict]: return out -def _frontmatter_block(fm: dict) -> str: +def frontmatter_block(fm: dict) -> str: + """Render the YAML frontmatter body with keys in canonical order.""" import yaml - ordered = {k: fm[k] for k in FRONTMATTER_FIELDS} - block = yaml.safe_dump(ordered, sort_keys=False, allow_unicode=True, default_flow_style=False) - return f"---\n{block}---\n" + keys = [*FRONTMATTER_FIELDS, *V2_FIELDS] + if "transcript_status" in fm: + keys.append("transcript_status") + ordered = {k: fm[k] for k in keys if k in fm} + return yaml.safe_dump(ordered, sort_keys=False, allow_unicode=True, default_flow_style=False) + + +def register_source( + pdir: Path, + body: str, + *, + title: str, + origin: str, + source_type: str = "report", + url: str | None = None, + resolved_from: str | None = None, + author_or_tool: str = "unknown", + date_authored: str = "unknown", + transcript_status: str | None = None, + dry_run: bool = False, +) -> dict: + """Number, dedup and write a source file with schema-v2 frontmatter.""" + digest = sha256_text(body) + existing = load_sources(pdir) + for prior in existing: + if prior.get("sha256") == digest: + return { + "action": "duplicate", + "source_id": prior["source_id"], + "file": prior["_file"], + "sha256": digest, + } + numbers = [ + int(m.group(1)) + for f in (pdir / "sources").glob("S*.md") + if (m := re.match(r"^S(\d+)", f.name)) + ] + n = max(numbers, default=0) + 1 + sid = f"S{n:02d}" + fm = { + "source_id": sid, + "title": title, + "origin": origin, + "author_or_tool": author_or_tool, + "date_authored": date_authored, + "date_ingested": date.today().isoformat(), + "authority_tier": "unknown", + "recency_rank": n, + "sha256": digest, + "schema_version": SCHEMA_VERSION, + "source_type": source_type, + "url": url, + "resolved_from": resolved_from, + } + if transcript_status is not None: + fm["transcript_status"] = transcript_status + dest = pdir / "sources" / f"{sid}-{slugify(title)}.md" + if dry_run: 
+ return {"action": "dry-run", "source_id": sid, "file": dest.name, "sha256": digest} + safe_write(dest, f"---\n{frontmatter_block(fm)}---\n" + body, pdir) + return {"action": "created", "source_id": sid, "file": dest.name, "sha256": digest} def _title_of(body: str, fallback: str) -> str: @@ -91,35 +166,72 @@ def _title_of(body: str, fallback: str) -> str: return fallback +def _check_v2(fm: dict, known_ids: set) -> list[str]: + problems = [] + stype = fm.get("source_type") + if stype and stype not in SOURCE_TYPES: + problems.append(f"invalid source_type '{stype}'") + for key in ("url", "resolved_from"): + if key not in fm: + problems.append(f"missing frontmatter field {key}") + ref = fm.get("resolved_from") + if ref is not None: + if not isinstance(ref, str) or not RESOLVED_FROM_RE.match(ref): + problems.append(f"invalid resolved_from '{ref}'") + elif ref not in known_ids: + problems.append(f"resolved_from references unknown source {ref}") + if "transcript_status" in fm: + if fm["transcript_status"] not in TRANSCRIPT_STATUSES: + problems.append(f"invalid transcript_status '{fm['transcript_status']}'") + if stype != "video-transcript": + problems.append("transcript_status only allowed for video-transcript sources") + return problems + + def check_intake_gate(pdir: Path, dry_run: bool = False) -> dict: reasons: list[str] = [] + warnings: list[str] = [] checks: dict = {"sources": {}} sources = load_sources(pdir) + known_ids = {fm.get("source_id") for fm in sources} if not sources: reasons.append("no sources ingested") + legacy = 0 for fm in sources: sid = fm.get("source_id", fm["_file"]) problems = [] - for field in FRONTMATTER_FIELDS: + version = fm.get("schema_version") + if "schema_version" not in fm: + legacy += 1 + elif version != SCHEMA_VERSION: + problems.append(f"unsupported schema_version {version}") + required = list(FRONTMATTER_FIELDS) + if version == SCHEMA_VERSION: + required.append("source_type") + for field in required: if field not in fm or fm[field] 
in (None, ""): problems.append(f"missing frontmatter field {field}") tier = fm.get("authority_tier") if tier and tier not in AUTHORITY_TIERS: problems.append(f"invalid authority_tier '{tier}'") + if version == SCHEMA_VERSION: + problems.extend(_check_v2(fm, known_ids)) actual = sha256_text(fm["_body"]) if fm.get("sha256") != actual: problems.append("sha256 does not match body content") checks["sources"][sid] = "ok" if not problems else problems reasons.extend(f"{sid}: {p}" for p in problems) + if legacy: + warnings.append( + f"{legacy} source(s) use the pre 0.2.0 schema, run: resynth migrate {pdir.name}" + ) checks["source_count"] = len(sources) - return write_gate(pdir, "01-intake", reasons, checks, dry_run=dry_run) + return write_gate(pdir, "01-intake", reasons, checks, warnings=warnings, dry_run=dry_run) def run_intake(project: str, source_paths: list[str], dry_run: bool = False) -> dict: pdir = config.project_dir(project) - existing = load_sources(pdir) - by_hash = {fm["sha256"]: fm["source_id"] for fm in existing} - next_n = len(existing) + 1 + by_hash = {fm["sha256"]: fm["source_id"] for fm in load_sources(pdir)} events = [] for raw in source_paths: src = Path(raw) @@ -141,23 +253,18 @@ def run_intake(project: str, source_paths: list[str], dry_run: bool = False) -> } ) continue - sid = f"S{next_n:02d}" - fm = { - "source_id": sid, - "title": _title_of(body, src.stem), - "origin": str(src), - "author_or_tool": "unknown", - "date_authored": "unknown", - "date_ingested": date.today().isoformat(), - "authority_tier": "unknown", - "recency_rank": next_n, - "sha256": digest, - } - dest = pdir / "sources" / f"{sid}-{slugify(src.stem)}.md" - outcome = safe_write(dest, _frontmatter_block(fm) + body, pdir, dry_run=dry_run) - events.append({"source": src.name, "action": outcome, "source_id": sid}) - by_hash[digest] = sid - next_n += 1 + result = register_source( + pdir, + body, + title=_title_of(body, src.stem), + origin=str(src), + source_type="pdf" if 
src.suffix.lower() == ".pdf" else "report", + dry_run=dry_run, + ) + events.append( + {"source": src.name, "action": result["action"], "source_id": result["source_id"]} + ) + by_hash[digest] = result["source_id"] gate = check_intake_gate(pdir, dry_run=dry_run) return { "ok": gate["status"] == "PASS", diff --git a/src/resynth/migrate.py b/src/resynth/migrate.py new file mode 100644 index 0000000..ef5adc9 --- /dev/null +++ b/src/resynth/migrate.py @@ -0,0 +1,67 @@ +"""Project upgrader for RESYNTH v0.2.0. Rewrites schema v1 source +frontmatter to schema v2 in place, preserving each source body byte for +byte so the stored content hashes remain valid.""" + +from __future__ import annotations + +from . import config, intake +from .errors import ResynthError +from .fsutil import parse_frontmatter, safe_write + + +def run_migrate(project: str, dry_run: bool = False) -> dict: + """Upgrade every schema v1 source in a project to schema v2. + + Adds schema_version, source_type, url and resolved_from to the + frontmatter and leaves the body untouched, so the stored content hash + stays valid. Sources already on v2 are left unchanged, which makes + the command idempotent. The seal is never touched here, re-sealing + is a separate operator act. Returns ok, gate, events and messages. 
+ """ + pdir = config.project_dir(project) + files = sorted((pdir / "sources").glob("S*.md")) + if not files: + raise ResynthError("no sources to migrate, run resynth intake first") + events: list[dict] = [] + messages: list[str] = [] + upgraded = 0 + for f in files: + fm, body = parse_frontmatter(f.read_text(encoding="utf-8"), f.name) + sid = fm.get("source_id", f.name) + if "schema_version" in fm: + events.append({"source": sid, "action": "unchanged"}) + messages.append(f"{sid}: already schema v2") + continue + origin = str(fm.get("origin", "")) + stype = "pdf" if origin.lower().endswith(".pdf") else "report" + fm["schema_version"] = intake.SCHEMA_VERSION + fm["source_type"] = stype + fm["url"] = None + fm["resolved_from"] = None + content = f"---\n{intake.frontmatter_block(fm)}---\n" + body + outcome = safe_write(f, content, pdir, dry_run=dry_run) + events.append({"source": sid, "action": outcome, "source_type": stype}) + if outcome == "dry-run": + messages.append(f"{sid}: would upgrade to schema v2 (source_type {stype})") + else: + upgraded += 1 + messages.append(f"{sid}: upgraded to schema v2 (source_type {stype})") + if upgraded: + messages.extend( + [ + "Frontmatter has changed, so the existing seal no longer matches these files.", + "The sealed git tag still pins the old state.", + f"When you are ready, re-seal with: resynth audit {project} " + f"then resynth seal {project}", + ] + ) + gate = None + if not dry_run: + gate = intake.check_intake_gate(pdir) + messages.append(f"gate 01-intake: {gate['status']}") + return { + "ok": True if dry_run else gate["status"] == "PASS", + "gate": gate, + "events": events, + "messages": messages, + } diff --git a/src/resynth/reconcile.py b/src/resynth/reconcile.py index 70e3a4d..4cbec11 100644 --- a/src/resynth/reconcile.py +++ b/src/resynth/reconcile.py @@ -58,6 +58,20 @@ def _candidates(claims: list[dict]) -> list[dict]: return out +def _locator_hint(claim: dict) -> str: + """Short deep link hint for the claims index, 
empty when absent.""" + loc = claim.get("source_locator") + if not isinstance(loc, dict): + return "" + if loc.get("timestamp"): + return f" @ {loc['timestamp']}" + if loc.get("page"): + return f" p. {loc['page']}" + if loc.get("anchor"): + return f" #{loc['anchor']}" + return "" + + def _claims_index_md(project: str, claims: list[dict]) -> str: by_tag: dict[str, list[dict]] = {} for c in claims: @@ -77,7 +91,7 @@ def _claims_index_md(project: str, claims: list[dict]) -> str: for c in sorted(by_tag[tag], key=lambda c: c["claim_id"]): lines.append( f"- {c['claim_id']} ({c['source_id']}, {c['claim_type']}, " - f"confidence {c['confidence_as_stated']}) {c['claim_text']}" + f"confidence {c['confidence_as_stated']}){_locator_hint(c)} {c['claim_text']}" ) lines.append("") return "\n".join(lines) diff --git a/src/resynth/resolve/__init__.py b/src/resynth/resolve/__init__.py new file mode 100644 index 0000000..e410b25 --- /dev/null +++ b/src/resynth/resolve/__init__.py @@ -0,0 +1,278 @@ +"""Source resolution, a stage 1 verb. Follow links inside ingested +sources, fetch the linked material and register it as new sources with +provenance. Resolution re-evaluates gate 01-intake rather than adding a +gate of its own. + +Outcomes are tracked in index/resolution.jsonl so re-runs are cheap and +byte identical when nothing changed. +""" + +from __future__ import annotations + +import json +from datetime import date +from pathlib import Path + +from .. import config, intake +from ..errors import ResynthError +from ..fsutil import iter_jsonl, parse_frontmatter, safe_write, sha256_text +from .discover import discover_targets +from .fetchers import classify_target, fetch_local, fetch_url, fetch_vimeo, fetch_youtube +from .net import FetchError + +MANIFEST = "resolution.jsonl" + +_HEADER = ( + "# RESYNTH source resolution manifest. One JSON object per line records a\n" + "# discovered link target and its outcome: fetched, duplicate,\n" + "# transcript_pending or failed. 
Maintained by `resynth resolve`." +) +_KEYS = ["target", "kind", "status", "source_id", "resolved_from", "sha256", "fetched_at", "note"] +_FETCHERS = {"local": fetch_local, "youtube": fetch_youtube, "vimeo": fetch_vimeo, "url": fetch_url} + + +def manifest_path(pdir: Path) -> Path: + """Path to the resolution manifest inside a project directory.""" + return pdir / "index" / MANIFEST + + +def _load_manifest(pdir: Path) -> dict[str, dict]: + path = manifest_path(pdir) + out: dict[str, dict] = {} + if path.is_file(): + for _lineno, _raw, obj, err in iter_jsonl(path): + if obj is not None and obj.get("target"): + out[obj["target"]] = obj + return out + + +def _write_manifest(pdir: Path, records: list[dict]) -> None: + lines = [_HEADER] + lines += [json.dumps({k: rec.get(k) for k in _KEYS}, ensure_ascii=False) for rec in records] + safe_write(manifest_path(pdir), "\n".join(lines) + "\n", pdir) + + +def _record( + target: str, + kind: str, + status: str, + *, + source_id: str | None = None, + resolved_from: str | None = None, + sha256: str | None = None, + note: str | None = None, + prior: dict | None = None, +) -> dict: + rec = { + "target": target, + "kind": kind, + "status": status, + "source_id": source_id, + "resolved_from": resolved_from, + "sha256": sha256, + "fetched_at": None, + "note": note, + } + if ( + prior is not None + and prior.get("fetched_at") + and all(prior.get(k) == rec[k] for k in _KEYS if k != "fetched_at") + ): + return prior + rec["fetched_at"] = date.today().isoformat() + return rec + + +def _scan_targets(scan: list[dict]) -> list[dict]: + out: list[dict] = [] + seen: set[str] = set() + for fm in scan: + origin = fm.get("origin") + origin = origin if isinstance(origin, str) else "" + for t in discover_targets(fm.get("_body", ""), origin): + if t["raw"] in seen: + continue + seen.add(t["raw"]) + out.append( + {"raw": t["raw"], "kind": classify_target(t["raw"]), "parent": fm.get("source_id")} + ) + return out + + +def preview_targets(project: str) 
-> list[dict]: + """Discovery only, no network: targets with parent sid and kind.""" + pdir = config.project_dir(project) + scan = [fm for fm in intake.load_sources(pdir) if not fm.get("resolved_from")] + return _scan_targets(scan) + + +def _upgrade_source(pdir: Path, sid: str, doc: dict) -> str: + """Rewrite an existing pending stub in place, keeping its identity.""" + matches = sorted((pdir / "sources").glob(f"{sid}-*.md")) + if not matches: + raise ResynthError(f"source file for {sid} not found, cannot upgrade transcript") + path = matches[0] + fm, _old = parse_frontmatter(path.read_text(encoding="utf-8"), path.name) + body = doc["body_markdown"] + fm["title"] = doc["title"] + fm["sha256"] = sha256_text(body) + fm["transcript_status"] = doc["transcript_status"] + safe_write(path, f"---\n{intake.frontmatter_block(fm)}---\n{body}", pdir) + return fm["sha256"] + + +def run_resolve( + project: str, + only: str | None = None, + source_ids: list[str] | None = None, + dry_run: bool = False, +) -> dict: + """Discover, fetch and register link targets found inside sources. + + By default every source without a resolved_from parent is scanned, so + fetched sources are never scanned in turn. Passing source_ids scans + exactly those sources instead, including already resolved ones. The + only filter keeps just the targets containing that substring. Targets + recorded in the manifest as fetched or duplicate are skipped, failed + and transcript_pending targets are retried. Re-evaluates gate + 01-intake and returns ok, gate, counts, events and messages. 
+ """ + pdir = config.project_dir(project) + sources = intake.load_sources(pdir) + if not sources: + raise ResynthError( + f"project '{project}' has no sources, run: resynth intake {project} " + ) + if source_ids: + by_id = {fm.get("source_id"): fm for fm in sources} + missing = [sid for sid in source_ids if sid not in by_id] + if missing: + raise ResynthError(f"unknown source id(s): {', '.join(missing)}") + scan = [by_id[sid] for sid in source_ids] + else: + scan = [fm for fm in sources if not fm.get("resolved_from")] + targets = _scan_targets(scan) + prior_manifest = _load_manifest(pdir) + manifest = dict(prior_manifest) + counts = {"fetched": 0, "cached": 0, "duplicate": 0, "transcript_pending": 0, "failed": 0} + messages: list[str] = [] + events: list[dict] = [] + + for t in targets: + raw, kind, parent = t["raw"], t["kind"], t["parent"] + if only and only.lower() not in raw.lower(): + continue + prior = prior_manifest.get(raw) + if prior and prior.get("status") in ("fetched", "duplicate"): + counts["cached"] += 1 + messages.append(f"{raw}: cached ({prior.get('source_id')})") + events.append({"target": raw, "action": "cached", "source_id": prior.get("source_id")}) + continue + if dry_run: + counts["fetched"] += 1 + messages.append(f"{raw}: would fetch ({kind})") + events.append({"target": raw, "action": "would-fetch", "kind": kind}) + continue + try: + doc = _FETCHERS[kind](raw) + except FetchError as err: + note = str(err) + manifest[raw] = _record( + raw, + kind, + "failed", + source_id=prior.get("source_id") if prior else None, + resolved_from=parent, + note=note, + prior=prior, + ) + counts["failed"] += 1 + messages.append(f"{raw}: failed ({note})") + events.append({"target": raw, "action": "failed", "note": note}) + continue + pending = doc["transcript_status"] == "pending" + prior_sid = prior.get("source_id") if prior else None + if prior_sid and prior.get("status") in ("transcript_pending", "failed"): + resolved_from = prior.get("resolved_from") or 
parent + if pending: + manifest[raw] = _record( + raw, + kind, + "transcript_pending", + source_id=prior_sid, + resolved_from=resolved_from, + sha256=prior.get("sha256"), + prior=prior, + ) + counts["transcript_pending"] += 1 + messages.append(f"{raw}: transcript still pending ({prior_sid})") + events.append( + {"target": raw, "action": "transcript-pending", "source_id": prior_sid} + ) + continue + digest = _upgrade_source(pdir, prior_sid, doc) + manifest[raw] = _record( + raw, + kind, + "fetched", + source_id=prior_sid, + resolved_from=resolved_from, + sha256=digest, + prior=prior, + ) + counts["fetched"] += 1 + messages.append(f"{raw}: fetched as {prior_sid} ({doc['source_type']})") + events.append({"target": raw, "action": "upgraded", "source_id": prior_sid}) + continue + result = intake.register_source( + pdir, + doc["body_markdown"], + title=doc["title"], + origin=doc["origin"], + source_type=doc["source_type"], + url=doc["url"], + resolved_from=parent, + author_or_tool=doc["author_or_tool"], + date_authored=doc["date_authored"], + transcript_status=doc["transcript_status"], + ) + sid = result["source_id"] + if result["action"] == "duplicate": + status = "duplicate" + messages.append(f"{raw}: duplicate of {sid}") + elif pending: + status = "transcript_pending" + messages.append(f"{raw}: transcript pending, stub created as {sid}") + else: + status = "fetched" + messages.append(f"{raw}: fetched as {sid} ({doc['source_type']})") + counts[status] += 1 + manifest[raw] = _record( + raw, kind, status, source_id=sid, resolved_from=parent, sha256=result["sha256"], + prior=prior, + ) + events.append({"target": raw, "action": status, "source_id": sid}) + + if not dry_run: + ordered: list[dict] = [] + emitted: set[str] = set() + for t in targets: + rec = manifest.get(t["raw"]) + if rec is not None and t["raw"] not in emitted: + ordered.append(rec) + emitted.add(t["raw"]) + for raw, rec in prior_manifest.items(): + if raw not in emitted: + ordered.append(rec) + 
emitted.add(raw) + if ordered: + _write_manifest(pdir, ordered) + gate = intake.check_intake_gate(pdir, dry_run=dry_run) + messages.append(f"gate 01-intake: {gate['status']}") + return { + "ok": gate["status"] == "PASS", + "gate": gate, + "counts": counts, + "events": events, + "messages": messages, + } diff --git a/src/resynth/resolve/discover.py b/src/resynth/resolve/discover.py new file mode 100644 index 0000000..5e26a07 --- /dev/null +++ b/src/resynth/resolve/discover.py @@ -0,0 +1,81 @@ +"""Discover fetchable targets referenced inside a source body. + +URLs are taken from bare links and markdown link destinations. Local paths +are only accepted from markdown destinations or backtick spans, with a +supported suffix, and only when the file actually exists. Nothing else is +guessed. +""" + +from __future__ import annotations + +import re +from pathlib import Path + +from ..intake import SUPPORTED + +_TARGET_RE = re.compile( + r"\]\(((?:[^()\s]|\([^()]*\))+)\)" # markdown link destination + r"|(https?://[^\s<>\"'`\]]+)" # bare url + r"|`([^`\n]+)`" # backtick span +) +_TRAILING = ")>,.;:]\"'" +_ABS_RE = re.compile(r"^(?:[A-Za-z]:[\\/]|/)") +_SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE) + + +def _strip_url(url: str) -> str: + while url: + ch = url[-1] + if ch not in _TRAILING: + break + if ch == ")" and url.count("(") >= url.count(")"): + break + url = url[:-1] + return url + + +def _resolve_local(raw: str, origin: str) -> str | None: + cand = raw.strip() + if not cand or _SCHEME_RE.match(cand): + return None + if Path(cand).suffix.lower() not in SUPPORTED: + return None + if _ABS_RE.match(cand): + path = Path(cand) + if path.is_file(): + return str(path.resolve()) + if origin and not _SCHEME_RE.match(origin): + path = Path(origin).parent / cand + if path.is_file(): + return str(path.resolve()) + return None + + +def discover_targets(body: str, origin: str) -> list[dict]: + """Return ordered, deduped targets: {"raw": str, "kind": 
"url"|"local"}.""" + out: list[dict] = [] + seen: set[str] = set() + + def add(raw: str, kind: str) -> None: + if raw and raw not in seen: + seen.add(raw) + out.append({"raw": raw, "kind": kind}) + + for match in _TARGET_RE.finditer(body): + dest, bare, span = match.groups() + if dest is not None: + if dest.lower().startswith(("http://", "https://")): + add(dest, "url") + elif not _SCHEME_RE.match(dest): + local = _resolve_local(dest, origin) + if local: + add(local, "local") + elif bare is not None: + url = _strip_url(bare) + if url: + add(url, "url") + else: + local = _resolve_local(span, origin) + if local: + add(local, "local") + return out diff --git a/src/resynth/resolve/fetchers.py b/src/resynth/resolve/fetchers.py new file mode 100644 index 0000000..a5682f4 --- /dev/null +++ b/src/resynth/resolve/fetchers.py @@ -0,0 +1,277 @@ +"""Fetchers turn a resolved target into a FetchedDoc dict ready for intake.""" + +from __future__ import annotations + +import html as htmllib +import json +import os +import re +import tempfile +import xml.etree.ElementTree as ET +from pathlib import Path +from urllib.parse import parse_qs, quote, urljoin, urlsplit + +from .. import intake +from ..errors import ResynthError +from . import net +from .net import FetchError +from .reduce_html import reduce_html + +PENDING_STUB = ( + "# {title}\n" + "\n" + "> [!info] Video transcript pending\n" + "> RESYNTH could not retrieve a public caption track for this video.\n" + "> Link: {url}\n" + "> Re-run `resynth resolve ` to retry, or paste the transcript\n" + "> below this callout. 
The next resolve run can also upgrade this stub.\n" +) + +_YT_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"} +_VIMEO_HOSTS = {"vimeo.com", "www.vimeo.com", "player.vimeo.com"} + +_VTT_TIME = re.compile( + r"(?:(\d+):)?(\d{1,2}):(\d{2})[.,](\d{3})\s+-->\s+(?:(\d+):)?(\d{1,2}):(\d{2})[.,](\d{3})" +) + + +def classify_target(raw: str) -> str: + """Classify a raw target as youtube, vimeo, url or local.""" + if re.match(r"^https?://", raw, re.IGNORECASE): + host = (urlsplit(raw).hostname or "").lower() + if host in _YT_HOSTS: + return "youtube" + if host in _VIMEO_HOSTS: + return "vimeo" + return "url" + if Path(raw).exists(): + return "local" + return "url" + + +def _doc( + body: str, + title: str, + source_type: str, + url: str | None, + origin: str, + author_or_tool: str = "unknown", + date_authored: str = "unknown", + transcript_status: str | None = None, +) -> dict: + return { + "body_markdown": body, + "title": title, + "source_type": source_type, + "url": url, + "origin": origin, + "author_or_tool": author_or_tool, + "date_authored": date_authored, + "transcript_status": transcript_status, + } + + +def _heading_title(body: str) -> str | None: + for line in body.splitlines(): + if line.startswith("# "): + return line[2:].strip() + return None + + +def fetch_local(path: str) -> dict: + """Convert a local file to a doc. PDFs become source_type pdf, + everything else becomes notes.""" + src = Path(path) + try: + body = intake._convert(src) + except ResynthError as err: + raise FetchError(str(err)) from err + source_type = "pdf" if src.suffix.lower() == ".pdf" else "notes" + return _doc(body, _heading_title(body) or src.stem, source_type, None, str(src)) + + +def fetch_url(url: str) -> dict: + """Fetch a web url. PDF responses are converted with pdftotext and + HTML pages are reduced to clean text. 
Anything else is an error.""" + payload, content_type, final_url = net.http_get(url) + ct = (content_type or "").split(";")[0].strip().lower() + path = urlsplit(final_url or url).path.lower() + if ct == "application/pdf" or path.endswith(".pdf"): + handle, name = tempfile.mkstemp(suffix=".pdf") + tmp = Path(name) + try: + with os.fdopen(handle, "wb") as fh: + fh.write(payload) + try: + body = intake._convert(tmp) + except ResynthError as err: + raise FetchError(str(err)) from err + finally: + tmp.unlink(missing_ok=True) + title = _heading_title(body) or Path(path).stem or url + return _doc(body, title, "pdf", url, url) + if "html" in ct: + text = net.decode(payload, content_type) + body, title = reduce_html(text, final_url or url) + if len(body.strip()) < 200: + raise FetchError( + "page yielded no extractable text (login wall or script rendered)" + ) + return _doc(body, title or url, "html-article", url, url) + raise FetchError(f"unsupported content type {ct or 'unknown'}") + + +def _hms(seconds: float) -> str: + s = int(seconds) + return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}" + + +def _render_transcript(cues: list[tuple[float, float, str]]) -> str: + paras: list[list[str]] = [[]] + prev_end: float | None = None + for start, end, text in cues: + text = re.sub(r"\s+", " ", text).strip() + if not text: + continue + if prev_end is not None and start - prev_end > 8: + paras.append([]) + paras[-1].append(f"[{_hms(start)}] {text}") + prev_end = end + body = "\n\n".join("\n".join(p) for p in paras if p) + return f"## Transcript\n\n{body}\n" + + +def _oembed(endpoint: str) -> dict: + try: + body, _ct, _final = net.http_get(endpoint) + data = json.loads(body.decode("utf-8", errors="replace")) + return data if isinstance(data, dict) else {} + except (FetchError, ValueError): + return {} + + +def _video_doc( + url: str, + title: str, + author: str, + date_authored: str, + cues: list[tuple[float, float, str]], +) -> dict: + if cues: + body = f"# 
{title}\n\n{_render_transcript(cues)}" + status = "fetched" + else: + body = PENDING_STUB.format(title=title, url=url) + status = "pending" + return _doc(body, title, "video-transcript", url, url, author, date_authored, status) + + +def _youtube_id(url: str) -> str | None: + parts = urlsplit(url) + host = (parts.hostname or "").lower() + if host == "youtu.be": + seg = parts.path.strip("/").split("/")[0] + return seg or None + qs = parse_qs(parts.query) + if qs.get("v"): + return qs["v"][0] + match = re.match(r"^/(?:shorts|embed|live)/([^/?#]+)", parts.path) + return match.group(1) if match else None + + +def fetch_youtube(url: str) -> dict: + """Fetch a YouTube video as a timestamped transcript source. Falls + back to a pending stub when no public caption track is available.""" + vid = _youtube_id(url) + if not vid: + raise FetchError("could not determine youtube video id") + meta = _oembed(f"https://www.youtube.com/oembed?url={quote(url, safe='')}&format=json") + title = meta.get("title") or url + author = meta.get("author_name") or "unknown" + cues: list[tuple[float, float, str]] = [] + try: + listing, _ct, _final = net.http_get( + f"https://www.youtube.com/api/timedtext?type=list&v={vid}" + ) + codes = [t.get("lang_code") or "" for t in ET.fromstring(listing).findall(".//track")] + codes = [c for c in codes if c] + code = next((c for c in codes if c.lower().startswith("en")), codes[0] if codes else None) + if code: + track, _ct, _final = net.http_get( + f"https://www.youtube.com/api/timedtext?lang={quote(code)}&v={vid}" + ) + for el in ET.fromstring(track).findall(".//text"): + start = float(el.get("start") or 0) + dur = float(el.get("dur") or 0) + cues.append((start, start + dur, htmllib.unescape("".join(el.itertext())))) + except (FetchError, ET.ParseError, ValueError): + cues = [] + return _video_doc(url, title, author, "unknown", cues) + + +def _vimeo_id(url: str) -> str | None: + for seg in urlsplit(url).path.split("/"): + if seg.isdigit(): + return seg + 
return None + + +def _vtt_seconds(h: str | None, m: str, s: str, ms: str) -> float: + return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000 + + +def _parse_vtt(text: str) -> list[tuple[float, float, str]]: + cues: list[tuple[float, float, str]] = [] + lines = text.splitlines() + i = 0 + while i < len(lines): + line = lines[i].strip() + if line.startswith(("NOTE", "STYLE", "REGION")): + i += 1 + while i < len(lines) and lines[i].strip(): + i += 1 + continue + if not line or line.startswith("WEBVTT"): + i += 1 + continue + match = _VTT_TIME.search(line) + if not match: + i += 1 + continue + start = _vtt_seconds(*match.groups()[:4]) + end = _vtt_seconds(*match.groups()[4:]) + i += 1 + texts = [] + while i < len(lines) and lines[i].strip(): + texts.append(lines[i].strip()) + i += 1 + cue_text = htmllib.unescape(re.sub(r"<[^>]+>", "", " ".join(texts))) + cues.append((start, end, cue_text)) + return cues + + +def fetch_vimeo(url: str) -> dict: + """Fetch a Vimeo video as a timestamped transcript source. 
Falls + back to a pending stub when no public text track is available.""" + vid = _vimeo_id(url) + if not vid: + raise FetchError("could not determine vimeo video id") + meta = _oembed(f"https://vimeo.com/api/oembed.json?url={quote(url, safe='')}") + title = meta.get("title") or url + author = meta.get("author_name") or "unknown" + date_authored = str(meta.get("upload_date") or "unknown")[:10] or "unknown" + cues: list[tuple[float, float, str]] = [] + try: + body, _ct, _final = net.http_get(f"https://player.vimeo.com/video/{vid}/config") + cfg = json.loads(body.decode("utf-8", errors="replace")) + tracks = (cfg.get("request") or {}).get("text_tracks") or [] + track = next( + (t for t in tracks if str(t.get("lang", "")).lower().startswith("en")), + tracks[0] if tracks else None, + ) + if track and track.get("url"): + vtt, _ct, _final = net.http_get(urljoin("https://player.vimeo.com", track["url"])) + cues = _parse_vtt(vtt.decode("utf-8", errors="replace")) + except (FetchError, ValueError, AttributeError): + cues = [] + return _video_doc(url, title, author, date_authored, cues) diff --git a/src/resynth/resolve/net.py b/src/resynth/resolve/net.py new file mode 100644 index 0000000..687d7e5 --- /dev/null +++ b/src/resynth/resolve/net.py @@ -0,0 +1,107 @@ +"""HTTP access for source resolution. + +Every request goes through this module's `urlopen` attribute so tests can +patch a single seam. Robots.txt is honoured and requests to one host are +rate limited. +""" + +from __future__ import annotations + +import re +import time +import urllib.error +import urllib.request +from urllib.parse import urlsplit +from urllib.robotparser import RobotFileParser + +from .. 
import __version__ + +urlopen = urllib.request.urlopen +monotonic = time.monotonic +sleep = time.sleep + +USER_AGENT = ( + f"resynth/{__version__} (+https://github.com/Markus-Doc/resynth) " + "research consolidation tool" +) +TIMEOUT = 30 +MAX_BYTES = 10 * 1024 * 1024 +HOST_DELAY = 1.0 + +_robots: dict[str, RobotFileParser | None] = {} +_last_hit: dict[str, float] = {} + +_CHARSET_RE = re.compile(r"charset=[\"']?([\w.:-]+)", re.IGNORECASE) + + +class FetchError(Exception): + """A target could not be fetched. str(err) is the short reason.""" + + +def _request(url: str) -> urllib.request.Request: + return urllib.request.Request(url, headers={"User-Agent": USER_AGENT}) + + +def _robot_parser(base: str) -> RobotFileParser | None: + if base in _robots: + return _robots[base] + parser: RobotFileParser | None + try: + with urlopen(_request(base + "/robots.txt"), timeout=TIMEOUT) as resp: + raw = resp.read(MAX_BYTES) + parser = RobotFileParser() + parser.parse(raw.decode("utf-8", errors="replace").splitlines()) + except (urllib.error.URLError, TimeoutError, OSError, ValueError): + parser = None + _robots[base] = parser + return parser + + +def _check_robots(url: str) -> None: + parts = urlsplit(url) + parser = _robot_parser(f"{parts.scheme}://{parts.netloc}") + if parser is not None and not parser.can_fetch(USER_AGENT, url): + raise FetchError("disallowed by robots.txt") + + +def _throttle(host: str) -> None: + last = _last_hit.get(host) + if last is not None: + wait = HOST_DELAY - (monotonic() - last) + if wait > 0: + sleep(wait) + _last_hit[host] = monotonic() + + +def http_get(url: str) -> tuple[bytes, str, str]: + """GET a url, returning (body, content-type header, final url).""" + _check_robots(url) + _throttle(urlsplit(url).netloc.lower()) + try: + with urlopen(_request(url), timeout=TIMEOUT) as resp: + body = resp.read(MAX_BYTES + 1) + content_type = resp.headers.get("Content-Type") or "" + final_url = getattr(resp, "url", None) or resp.geturl() + except 
urllib.error.HTTPError as err: + raise FetchError(f"http {err.code} {err.reason}") from err + except urllib.error.URLError as err: + raise FetchError(f"unreachable: {err.reason}") from err + except (TimeoutError, OSError) as err: + raise FetchError(str(err) or err.__class__.__name__) from err + if len(body) > MAX_BYTES: + raise FetchError("response exceeds 10 MiB") + return body, content_type, final_url + + +def decode(body: bytes, content_type: str) -> str: + """Decode a response body using the declared or sniffed charset, + falling back to utf-8 with replacement.""" + match = _CHARSET_RE.search(content_type or "") + if not match: + match = _CHARSET_RE.search(body[:2048].decode("ascii", errors="ignore")) + if match: + try: + return body.decode(match.group(1), errors="replace") + except LookupError: + pass + return body.decode("utf-8", errors="replace") diff --git a/src/resynth/resolve/reduce_html.py b/src/resynth/resolve/reduce_html.py new file mode 100644 index 0000000..c1a50b2 --- /dev/null +++ b/src/resynth/resolve/reduce_html.py @@ -0,0 +1,199 @@ +"""Reduce noisy HTML to markdown-ish clean text using the stdlib parser.""" + +from __future__ import annotations + +import re +from html.parser import HTMLParser +from urllib.parse import urljoin + +DROP = { + "script", + "style", + "noscript", + "template", + "svg", + "form", + "nav", + "header", + "footer", + "aside", + "iframe", +} +REGION = {"main", "article"} +HEADINGS = {f"h{n}": n for n in range(1, 7)} +BLOCK_PREFIX = {"p": "", "blockquote": "> ", "li": "- "} + + +def _collapse(text: str) -> str: + return re.sub(r"\s+", " ", text).strip() + + +class _Reducer(HTMLParser): + def __init__(self, base_url: str): + super().__init__(convert_charrefs=True) + self.base_url = base_url + self.blocks: list[tuple[bool, str]] = [] + self.cur: list | None = None # [prefix, parts, is_pre] + self.drop = 0 + self.region_seen = False + self.in_region = False + self.region_depth = 0 + self.links: list[str | None] = [] + 
self.cells: list[str] | None = None + + def _open(self, prefix: str, pre: bool = False) -> None: + self._flush() + self.cur = [prefix, [], pre] + + def _flush(self) -> None: + if self.cur is None: + return + prefix, parts, pre = self.cur + self.cur = None + raw = "".join(parts) + if pre: + text = raw.strip("\n") + if text.strip(): + self.blocks.append((self.in_region, f"```\n{text}\n```")) + return + text = _collapse(raw) + if not text: + return + if self.cells is not None: + self.cells.append(text) + return + self.blocks.append((self.in_region, prefix + text)) + + def handle_starttag(self, tag, attrs): + if tag in DROP: + self.drop += 1 + return + if self.drop: + return + if tag in REGION: + if not self.region_seen: + self.region_seen = True + self.in_region = True + self.region_depth = 1 + elif self.in_region: + self.region_depth += 1 + return + if tag in HEADINGS: + self._open("#" * HEADINGS[tag] + " ") + elif tag == "pre": + self._open("", pre=True) + elif tag in BLOCK_PREFIX: + if tag == "p" and self.cur is not None and self.cur[0] in ("> ", "- "): + self.cur[1].append(" ") + else: + self._open(BLOCK_PREFIX[tag]) + elif tag == "tr": + self._flush() + self.cells = [] + elif tag in ("td", "th"): + if self.cells is not None: + self._open("") + elif tag == "br": + if self.cur is not None: + self.cur[1].append(" ") + elif tag == "a": + href = dict(attrs).get("href") + self.links.append(urljoin(self.base_url, href) if href else None) + + def handle_endtag(self, tag): + if tag in DROP: + if self.drop: + self.drop -= 1 + return + if self.drop: + return + if tag in REGION: + if self.in_region: + self.region_depth -= 1 + if self.region_depth <= 0: + self.in_region = False + return + if tag in HEADINGS or tag == "pre": + self._flush() + elif tag == "p": + if self.cur is not None and self.cur[0] == "": + self._flush() + elif tag == "blockquote": + if self.cur is not None and self.cur[0] == "> ": + self._flush() + elif tag == "li": + if self.cur is not None and self.cur[0] 
== "- ": + self._flush() + elif tag in ("td", "th"): + self._flush() + elif tag == "tr": + self._flush() + if self.cells is not None: + row = " | ".join(cell for cell in self.cells if cell) + if row: + self.blocks.append((self.in_region, row)) + self.cells = None + elif tag == "table": + self.cells = None + elif tag == "a": + if self.links: + href = self.links.pop() + if href and self.cur is not None: + self.cur[1].append(f" ({href})") + + def handle_data(self, data): + if self.drop or self.cur is None: + return + self.cur[1].append(data) + + +class _TitleParser(HTMLParser): + def __init__(self): + super().__init__(convert_charrefs=True) + self.og: str | None = None + self.title: str | None = None + self.h1: str | None = None + self._stack: list[list] = [] + + def handle_starttag(self, tag, attrs): + a = dict(attrs) + if tag == "meta" and self.og is None: + prop = a.get("property") or a.get("name") + if prop == "og:title" and a.get("content"): + self.og = _collapse(a["content"]) + elif tag in ("title", "h1"): + self._stack.append([tag, []]) + + def handle_data(self, data): + if self._stack: + self._stack[-1][1].append(data) + + def handle_endtag(self, tag): + if self._stack and self._stack[-1][0] == tag: + name, parts = self._stack.pop() + text = _collapse("".join(parts)) + if name == "title" and self.title is None and text: + self.title = text + if name == "h1" and self.h1 is None and text: + self.h1 = text + + +def extract_title(html: str) -> str | None: + """Page title, preferring og:title, then <title>, then the first h1.""" + parser = _TitleParser() + parser.feed(html) + parser.close() + return parser.og or parser.title or parser.h1 + + +def reduce_html(html: str, base_url: str) -> tuple[str, str | None]: + """Return (markdown_body, title) for an HTML page.""" + reducer = _Reducer(base_url) + reducer.feed(html) + reducer.close() + reducer._flush() + blocks = [ + text for in_region, text in reducer.blocks if in_region or not reducer.region_seen + ] + body =
"\n\n".join(blocks) + return (body + "\n" if body else ""), extract_title(html) diff --git a/src/resynth/synthesise.py b/src/resynth/synthesise.py index 98ecddf..a04bd9b 100644 --- a/src/resynth/synthesise.py +++ b/src/resynth/synthesise.py @@ -87,6 +87,8 @@ def run_synthesise(project: str, dry_run: bool = False, force: bool = False) -> { "source_id": fm["source_id"], "title": fm["title"], + "source_type": fm.get("source_type") or "report", + "url": fm.get("url"), "authority_tier": fm["authority_tier"], "date_authored": fm["date_authored"], "sha256_short": str(fm["sha256"])[:12], diff --git a/src/resynth/wizard.py b/src/resynth/wizard.py index ee99b95..0c0c86f 100644 --- a/src/resynth/wizard.py +++ b/src/resynth/wizard.py @@ -30,6 +30,7 @@ from .intake import SUPPORTED, run_intake from .project import run_brief, run_init from .reconcile import run_reconcile +from .resolve import preview_targets, run_resolve from .synthesise import run_synth_verify, run_synthesise console = Console() @@ -463,9 +464,42 @@ def _step_intake(project: str) -> bool: for event in result["events"]: console.print(f" {event['source']}: {event['action']}") _show_reasons(result) + if result.get("ok"): + _offer_resolve(project) return True +def _offer_resolve(project: str) -> None: + """Offer to fetch links and file references found inside the new sources.""" + try: + targets = preview_targets(project) + except ResynthError as exc: + console.print(f"[red]{exc}[/red]") + return + if not targets: + return + n = len(targets) + if not Confirm.ask( + f"I found {n} links and file references inside your reports. " + "Fetch them as extra sources now?", + default=True, + ): + return + try: + resolved = run_resolve(project) + except ResynthError as exc: + console.print(f"[red]{exc}[/red]") + return + for line in resolved["messages"]: + console.print(f" {line}") + _show_reasons(resolved) + if resolved["counts"]["transcript_pending"] > 0: + console.print( + "Some videos have no public captions yet. 
You can paste a transcript\n" + "into the stub file, or re-run resolve later to retry." + ) + + def _step_operator( project: str, pdir: Path, diff --git a/templates/extraction-instructions.md.j2 b/templates/extraction-instructions.md.j2 index 02d0d90..b62d62c 100644 --- a/templates/extraction-instructions.md.j2 +++ b/templates/extraction-instructions.md.j2 @@ -27,6 +27,13 @@ Every claim line must contain exactly these fields. - confidence_as_stated: one of high, medium, low, unstated - depends_on: a list of claim ids this claim depends on, empty list if none +One optional field may be added. + +- source_locator: a structured pointer to the exact origin, an object with any + of url, page (a PDF page number), timestamp (a video HH:MM:SS time), anchor + (an HTML heading slug). Add a timestamp for video-transcript sources and a + page number for PDF sources. + ## Rules 1. One claim per line. Split compound statements into separate claims. diff --git a/templates/master.md.j2 b/templates/master.md.j2 index c20863a..a099a40 100644 --- a/templates/master.md.j2 +++ b/templates/master.md.j2 @@ -35,8 +35,8 @@ No conflicts were recorded during reconciliation. 
## Appendix: Source Register -| Source | Title | Authority | Authored | Content hash | -| --- | --- | --- | --- | --- | +| Source | Title | Type | Authority | Authored | Link | Content hash | +| --- | --- | --- | --- | --- | --- | --- | {% for s in sources %} -| {{ s.source_id }} | {{ s.title }} | {{ s.authority_tier }} | {{ s.date_authored }} | {{ s.sha256_short }} | +| {{ s.source_id }} | {{ s.title }} | {{ s.source_type }} | {{ s.authority_tier }} | {{ s.date_authored }} | {{ s.url or "-" }} | {{ s.sha256_short }} | {% endfor %} diff --git a/tests/fixtures/resolve/article.html b/tests/fixtures/resolve/article.html new file mode 100644 index 0000000..39fcfdf --- /dev/null +++ b/tests/fixtures/resolve/article.html @@ -0,0 +1,37 @@ +<!DOCTYPE html> +<html> +<head> +<meta charset="utf-8"> +<meta property="og:title" content="Field Guide to Widgets"> +<title>Field Guide to Widgets - Example Articles + + + + +
Site header boilerplate
+ +

Outside main, must be ignored.

+
+
+

Field Guide to Widgets

+

Widgets are small reusable components that appear in nearly every +modern interface. This guide walks through selection, installation and +maintenance of widgets in production systems, with enough detail to +satisfy the extractable-text threshold used by the resolver.

+

Setup

+

Start by reading the spec and the upstream +manual before touching +anything.

+
    +
  • First step: inventory existing widgets
  • +
  • Second step: remove broken ones
  • +
+
Quoted wisdom about widgets from the maintainers.
+
widgetctl install --all
+
NameStatus
Alphastable
+
+
+
Footer copyright notice
+ + + diff --git a/tests/fixtures/resolve/notes-with-links.md b/tests/fixtures/resolve/notes-with-links.md new file mode 100644 index 0000000..973d678 --- /dev/null +++ b/tests/fixtures/resolve/notes-with-links.md @@ -0,0 +1,15 @@ +# Study notes with links + +Collected references for the widget research project. + +Reading list: + +- Main article: https://example-articles.test/guide +- Mirror copy: https://example-articles.test/guide-copy +- Conference talk: https://vimeo.com/123456 +- Deep dive video: https://www.youtube.com/watch?v=abc123XYZ +- Paywalled piece: https://blocked.test/secret +- [Extra local notes](extra-notes.md) + +The article at https://example-articles.test/guide is the primary +reference and appears twice on purpose. diff --git a/tests/fixtures/resolve/robots_disallow.txt b/tests/fixtures/resolve/robots_disallow.txt new file mode 100644 index 0000000..1f53798 --- /dev/null +++ b/tests/fixtures/resolve/robots_disallow.txt @@ -0,0 +1,2 @@ +User-agent: * +Disallow: / diff --git a/tests/fixtures/resolve/vimeo_config.json b/tests/fixtures/resolve/vimeo_config.json new file mode 100644 index 0000000..19b91ee --- /dev/null +++ b/tests/fixtures/resolve/vimeo_config.json @@ -0,0 +1 @@ +{"request": {"text_tracks": [{"lang": "en", "label": "English", "url": "/texttrack/123.vtt?token=x"}]}} diff --git a/tests/fixtures/resolve/vimeo_oembed.json b/tests/fixtures/resolve/vimeo_oembed.json new file mode 100644 index 0000000..3b658c0 --- /dev/null +++ b/tests/fixtures/resolve/vimeo_oembed.json @@ -0,0 +1 @@ +{"title": "Vimeo Talk", "author_name": "Speaker Person", "upload_date": "2024-05-01 10:00:00"} diff --git a/tests/fixtures/resolve/vimeo_track.vtt b/tests/fixtures/resolve/vimeo_track.vtt new file mode 100644 index 0000000..964951d --- /dev/null +++ b/tests/fixtures/resolve/vimeo_track.vtt @@ -0,0 +1,15 @@ +WEBVTT + +NOTE auto generated + +1 +00:00:00.000 --> 00:00:04.000 +Welcome to the talk. 
+ +2 +00:00:04.500 --> 00:00:08.000 +We discuss widgets at length. + +3 +00:00:20.000 --> 00:00:24.000 +Closing remarks after a pause. diff --git a/tests/fixtures/resolve/youtube_oembed.json b/tests/fixtures/resolve/youtube_oembed.json new file mode 100644 index 0000000..d31aed0 --- /dev/null +++ b/tests/fixtures/resolve/youtube_oembed.json @@ -0,0 +1 @@ +{"title": "Deep Dive Video", "author_name": "Chan Academy", "type": "video"} diff --git a/tests/fixtures/resolve/youtube_timedtext_en.xml b/tests/fixtures/resolve/youtube_timedtext_en.xml new file mode 100644 index 0000000..09bab96 --- /dev/null +++ b/tests/fixtures/resolve/youtube_timedtext_en.xml @@ -0,0 +1,5 @@ + +Welcome to the deep dive. +Today we cover widgets &amp; gadgets. +After a long pause we resume. + diff --git a/tests/fixtures/resolve/youtube_timedtext_list.xml b/tests/fixtures/resolve/youtube_timedtext_list.xml new file mode 100644 index 0000000..cbfda17 --- /dev/null +++ b/tests/fixtures/resolve/youtube_timedtext_list.xml @@ -0,0 +1,4 @@ + + + + diff --git a/tests/test_e2e.py b/tests/test_e2e.py index cd8229d..763a3f3 100644 --- a/tests/test_e2e.py +++ b/tests/test_e2e.py @@ -2,7 +2,9 @@ simulated operator inputs, finishing sealed with every gate PASS.""" import json +from pathlib import Path +import pytest import yaml from click.testing import CliRunner @@ -12,8 +14,9 @@ from resynth import config, demo_operator from resynth.audit import run_audit, run_seal from resynth.cli import main as cli_main -from resynth.export import run_export -from resynth.extract import run_extract, run_extract_verify +from resynth.errors import ResynthError +from resynth.export import load_master, run_export +from resynth.extract import load_all_claims, run_extract, run_extract_verify from resynth.gates import all_gates from resynth.intake import run_intake from resynth.project import run_brief, run_init @@ -49,8 +52,22 @@ def test_full_pipeline(ws): assert (pdir / "output" / "SEAL.yaml").is_file() assert 
run_export("demo")["ok"] exported = json.loads((pdir / "output" / "MASTER.json").read_text(encoding="utf-8")) - assert exported["format"] == "resynth-master/1" + assert exported["format"] == "resynth-master/2" assert len(exported["claims"]) == 11 + assert [s["source_id"] for s in exported["sources"]] == ["S01", "S02", "S03"] + for src in exported["sources"]: + assert src["schema_version"] == 2 + assert src["source_type"] == "report" + assert "url" in src and "resolved_from" in src + assert "_file" not in src and "_body" not in src + # claims are dumped whole, so optional fields like source_locator ride along + assert exported["claims"] == sorted( + load_all_claims(pdir), key=lambda c: c["claim_id"] + ) + + loaded = load_master(pdir / "output" / "MASTER.json") + assert loaded["format_version"] == 2 + assert loaded["sources"] == exported["sources"] runner = CliRunner() result = runner.invoke(cli_main, ["status", "demo", "--json"]) @@ -79,6 +96,24 @@ def test_dry_run_writes_nothing(ws): assert (config.runs_dir()).is_dir(), "dry runs still produce a run log" +V1_FIXTURE = Path(__file__).resolve().parents[1] / "projects" / "demo" / "output" / "MASTER.json" + + +def test_load_master_v1_fixture(): + loaded = load_master(V1_FIXTURE) + assert loaded["format"] == "resynth-master/1" + assert loaded["format_version"] == 1 + assert loaded["sources"] == [] + assert loaded["claims"], "v1 payload content passes through untouched" + + +def test_load_master_unknown_format(tmp_path): + bad = tmp_path / "MASTER.json" + bad.write_text(json.dumps({"format": "resynth-master/9"}), encoding="utf-8") + with pytest.raises(ResynthError, match="unsupported master format resynth-master/9"): + load_master(bad) + + def test_doctor_json(ws): result = CliRunner().invoke(cli_main, ["doctor", "--json"]) payload = json.loads(result.output) diff --git a/tests/test_extract.py b/tests/test_extract.py index 5f1e29c..be06b71 100644 --- a/tests/test_extract.py +++ b/tests/test_extract.py @@ -57,6 +57,51 @@ 
def test_missing_and_unknown_fields_fail(): assert any("unknown field extra" in e for e in errors) +@pytest.mark.parametrize( + "locator", + [ + {"url": "https://example.com/talk"}, + {"page": 12}, + {"timestamp": "00:14:32"}, + {"timestamp": "4:05"}, + {"anchor": "section-slug"}, + {"url": "https://example.com/talk", "page": 3, "timestamp": "1:02:03", "anchor": "intro"}, + ], +) +def test_valid_source_locator_accepted(locator): + claim = dict(VALID) + claim["source_locator"] = locator + assert validate_claim(claim, "S01") == [] + + +def test_claim_without_locator_still_valid(): + claim = dict(VALID) + assert "source_locator" not in claim + assert validate_claim(claim, "S01") == [] + + +@pytest.mark.parametrize( + "locator,fragment", + [ + ({"chapter": 3}, "unknown source_locator key chapter"), + ({}, "at least one of url, page, timestamp, anchor"), + ("page 12", "source_locator must be an object"), + ({"timestamp": "12m30s"}, "H:MM or HH:MM:SS"), + ({"timestamp": "1:2:03"}, "H:MM or HH:MM:SS"), + ({"page": 0}, "positive integer"), + ({"page": -4}, "positive integer"), + ({"page": "12"}, "positive integer"), + ({"url": ""}, "source_locator.url"), + ], +) +def test_bad_source_locator_rejected(locator, fragment): + claim = dict(VALID) + claim["source_locator"] = locator + errors = validate_claim(claim, "S01") + assert errors, f"expected violation for source_locator={locator!r}" + assert any(fragment in e for e in errors) + + def test_workspace_generation(ws): pdir = make_project() run_extract("demo") @@ -112,3 +157,81 @@ def test_coverage_heuristic_warns(ws, tmp_path): result = run_extract_verify("cov") assert result["ok"] assert any("coverage" in w for w in result["gate"]["warnings"]) + + +VIDEO_URL = "https://example.com/talks/argon2" + + +def _video_project(project="vid"): + """A project with one handcrafted schema v2 video-transcript source.""" + from resynth import config + from resynth.fsutil import sha256_text + from resynth.intake import check_intake_gate + 
from resynth.project import run_init + + run_init(project) + pdir = config.project_dir(project) + body = "# Argon2 conference talk\n\nThe speaker recommends Argon2id throughout.\n" + frontmatter = ( + "---\n" + "source_id: S01\n" + "title: Argon2 conference talk\n" + "origin: test\n" + "author_or_tool: unknown\n" + "date_authored: unknown\n" + "date_ingested: '2026-06-12'\n" + "authority_tier: unknown\n" + "recency_rank: 1\n" + f"sha256: {sha256_text(body)}\n" + "schema_version: 2\n" + "source_type: video-transcript\n" + f"url: {VIDEO_URL}\n" + "resolved_from: null\n" + "transcript_status: fetched\n" + "---\n" + ) + (pdir / "sources" / "S01-argon2-conference-talk.md").write_text( + frontmatter + body, encoding="utf-8" + ) + check_intake_gate(pdir) + run_extract(project) + return pdir + + +def test_verify_warns_video_claim_without_timestamp(ws): + pdir = _video_project() + (pdir / "claims" / "S01-claims.jsonl").write_text( + json.dumps(VALID) + "\n", encoding="utf-8" + ) + result = run_extract_verify("vid") + assert result["ok"] + assert any( + "S01-C001: video source claim without a timestamp locator" in w + for w in result["gate"]["warnings"] + ) + + +def test_verify_warns_locator_url_mismatch(ws): + pdir = _video_project() + claim = dict(VALID) + claim["source_locator"] = {"timestamp": "00:14:32", "url": "https://elsewhere.example.com"} + (pdir / "claims" / "S01-claims.jsonl").write_text( + json.dumps(claim) + "\n", encoding="utf-8" + ) + result = run_extract_verify("vid") + assert result["ok"] + warnings = result["gate"]["warnings"] + assert any("S01-C001: locator url does not match the source url" in w for w in warnings) + assert not any("without a timestamp" in w for w in warnings) + + +def test_verify_no_url_warning_when_locator_url_matches(ws): + pdir = _video_project() + claim = dict(VALID) + claim["source_locator"] = {"timestamp": "00:14:32", "url": VIDEO_URL} + (pdir / "claims" / "S01-claims.jsonl").write_text( + json.dumps(claim) + "\n", encoding="utf-8" 
+ ) + result = run_extract_verify("vid") + assert result["ok"] + assert not any("locator url" in w for w in result["gate"]["warnings"]) diff --git a/tests/test_migrate.py b/tests/test_migrate.py new file mode 100644 index 0000000..639d0bd --- /dev/null +++ b/tests/test_migrate.py @@ -0,0 +1,155 @@ +"""Tests for the schema v1 to v2 project upgrader.""" + +import pytest + +from helpers import snapshot +from resynth import config +from resynth.errors import ResynthError +from resynth.fsutil import parse_frontmatter, sha256_text +from resynth.intake import FRONTMATTER_FIELDS, SCHEMA_VERSION, V2_FIELDS, register_source +from resynth.migrate import run_migrate +from resynth.project import run_init + +BODY_A = "# Alpha Report\n\nArgon2id is preferred for password hashing.\n" +BODY_B = "# Beta Paper\n\nA bcrypt work factor of at least 12 is required.\n" + + +def write_v1(pdir, sid, origin, body, rank): + """Handcraft a pre 0.2.0 source file with the nine v1 fields only.""" + lines = [ + f"source_id: {sid}", + f"title: Source {sid}", + f"origin: {origin}", + "author_or_tool: unknown", + "date_authored: unknown", + "date_ingested: '2026-01-01'", + "authority_tier: unknown", + f"recency_rank: {rank}", + f"sha256: {sha256_text(body)}", + ] + path = pdir / "sources" / f"{sid}-source.md" + path.write_text( + "---\n" + "\n".join(lines) + "\n---\n" + body, encoding="utf-8", newline="\n" + ) + return path + + +def v1_project(ws, project="demo"): + run_init(project) + pdir = config.project_dir(project) + write_v1(pdir, "S01", "notes/alpha-report.md", BODY_A, 1) + write_v1(pdir, "S02", "papers/beta-paper.PDF", BODY_B, 2) + return pdir + + +def test_migrate_adds_v2_fields_in_canonical_order(ws): + pdir = v1_project(ws) + res = run_migrate("demo") + assert res["ok"] is True + assert [e["action"] for e in res["events"]] == ["replaced", "replaced"] + for fname, stype in (("S01-source.md", "report"), ("S02-source.md", "pdf")): + fm, _body = parse_frontmatter( + (pdir / "sources" / 
fname).read_text(encoding="utf-8"), fname + ) + assert list(fm) == [*FRONTMATTER_FIELDS, *V2_FIELDS] + assert fm["schema_version"] == SCHEMA_VERSION + assert fm["source_type"] == stype + assert fm["url"] is None + assert fm["resolved_from"] is None + assert "transcript_status" not in fm + assert res["events"][0]["source_type"] == "report" + assert res["events"][1]["source_type"] == "pdf" + + +def test_v1_values_preserved(ws): + pdir = v1_project(ws) + before, _ = parse_frontmatter( + (pdir / "sources" / "S01-source.md").read_text(encoding="utf-8"), "S01" + ) + run_migrate("demo") + after, _ = parse_frontmatter( + (pdir / "sources" / "S01-source.md").read_text(encoding="utf-8"), "S01" + ) + for field in FRONTMATTER_FIELDS: + assert after[field] == before[field] + + +def test_body_untouched_and_hash_still_valid(ws): + pdir = v1_project(ws) + res = run_migrate("demo") + raw = (pdir / "sources" / "S01-source.md").read_bytes() + assert raw.endswith(BODY_A.encode("utf-8")) + _fm, body = parse_frontmatter(raw.decode("utf-8"), "S01") + assert body == BODY_A + gate = res["gate"] + assert gate["status"] == "PASS" + assert gate["warnings"] == [] + assert (pdir / "gates" / "01-intake.yaml").is_file() + + +def test_messages_verbatim(ws): + v1_project(ws) + res = run_migrate("demo") + assert res["messages"] == [ + "S01: upgraded to schema v2 (source_type report)", + "S02: upgraded to schema v2 (source_type pdf)", + "Frontmatter has changed, so the existing seal no longer matches these files.", + "The sealed git tag still pins the old state.", + "When you are ready, re-seal with: resynth audit demo then resynth seal demo", + "gate 01-intake: PASS", + ] + + +def test_idempotent_second_run(ws): + pdir = v1_project(ws) + run_migrate("demo") + sources = sorted((pdir / "sources").glob("S*.md")) + mtimes = {f.name: f.stat().st_mtime_ns for f in sources} + before = snapshot(pdir / "sources", pdir / "gates") + res = run_migrate("demo") + assert res["ok"] is True + assert [e["action"] for 
e in res["events"]] == ["unchanged", "unchanged"] + assert res["messages"] == [ + "S01: already schema v2", + "S02: already schema v2", + "gate 01-intake: PASS", + ] + assert snapshot(pdir / "sources", pdir / "gates") == before + for f in sources: + assert f.stat().st_mtime_ns == mtimes[f.name] + + +def test_dry_run_writes_nothing(ws): + pdir = v1_project(ws) + before = snapshot(pdir) + res = run_migrate("demo", dry_run=True) + assert res["ok"] is True + assert res["gate"] is None + assert [e["action"] for e in res["events"]] == ["dry-run", "dry-run"] + assert snapshot(pdir) == before + + +def test_mixed_project_migrates_only_v1(ws): + run_init("demo") + pdir = config.project_dir("demo") + write_v1(pdir, "S01", "notes/alpha-report.md", BODY_A, 1) + register_source(pdir, BODY_B, title="Beta Paper", origin="papers/beta-paper.md") + v2_file = next(f for f in (pdir / "sources").glob("S02*.md")) + v2_before = v2_file.read_bytes() + res = run_migrate("demo") + events = {e["source"]: e["action"] for e in res["events"]} + assert events["S01"] == "replaced" + assert events["S02"] == "unchanged" + assert v2_file.read_bytes() == v2_before + fm, _ = parse_frontmatter( + (pdir / "sources" / "S01-source.md").read_text(encoding="utf-8"), "S01" + ) + assert fm["schema_version"] == SCHEMA_VERSION + assert res["gate"]["status"] == "PASS" + assert "S02: already schema v2" in res["messages"] + + +def test_empty_project_raises(ws): + run_init("demo") + with pytest.raises(ResynthError, match="no sources to migrate"): + run_migrate("demo") diff --git a/tests/test_reconcile.py b/tests/test_reconcile.py index 5d8d7c3..3dbc18e 100644 --- a/tests/test_reconcile.py +++ b/tests/test_reconcile.py @@ -23,6 +23,30 @@ def test_workspace_and_candidates(ws): assert {"S01-C001", "S02-C001"} in pairs +def test_claims_index_carries_locator_hint(ws): + pdir = to_extracted() + path = pdir / "claims" / "S03-claims.jsonl" + claim = { + "claim_id": "S03-C099", + "source_id": "S03", + "claim_text": "Deep 
linked claim", + "claim_type": "fact", + "topic_tags": ["locator-test"], + "supporting_quote_location": "Somewhere", + "confidence_as_stated": "unstated", + "depends_on": [], + "source_locator": {"timestamp": "00:14:32", "page": 12}, + } + path.write_text( + path.read_text(encoding="utf-8") + json.dumps(claim) + "\n", encoding="utf-8" + ) + run_reconcile("demo") + index = (pdir / "index" / "claims-index.md").read_text(encoding="utf-8") + line = next(l for l in index.splitlines() if "S03-C099" in l) + assert " @ 00:14:32" in line + assert line.index("@ 00:14:32") < line.index("Deep linked claim") + + def test_gate_fails_until_decisions_written(ws): to_extracted() result = run_reconcile("demo") diff --git a/tests/test_resolve.py b/tests/test_resolve.py new file mode 100644 index 0000000..32ff40c --- /dev/null +++ b/tests/test_resolve.py @@ -0,0 +1,565 @@ +import io +import shutil +import urllib.error +from pathlib import Path +from urllib.parse import quote + +import pytest + +from helpers import snapshot + +from resynth import config +from resynth.errors import ResynthError +from resynth.fsutil import iter_jsonl, parse_frontmatter, sha256_text +from resynth.intake import run_intake +from resynth.project import run_init +from resynth.resolve import manifest_path, preview_targets, run_resolve +from resynth.resolve import net +from resynth.resolve.discover import discover_targets +from resynth.resolve.fetchers import ( + classify_target, + fetch_local, + fetch_url, + fetch_vimeo, + fetch_youtube, +) +from resynth.resolve.net import FetchError +from resynth.resolve.reduce_html import extract_title, reduce_html + +FIX = Path(__file__).parent / "fixtures" / "resolve" + +ARTICLE_URL = "https://example-articles.test/guide" +COPY_URL = "https://example-articles.test/guide-copy" +VIMEO_URL = "https://vimeo.com/123456" +YT_URL = "https://www.youtube.com/watch?v=abc123XYZ" +BLOCKED_URL = "https://blocked.test/secret" + +YT_OEMBED = 
f"https://www.youtube.com/oembed?url={quote(YT_URL, safe='')}&format=json" +YT_LIST = "https://www.youtube.com/api/timedtext?type=list&v=abc123XYZ" +YT_TRACK = "https://www.youtube.com/api/timedtext?lang=en&v=abc123XYZ" +VIMEO_OEMBED = f"https://vimeo.com/api/oembed.json?url={quote(VIMEO_URL, safe='')}" +VIMEO_CONFIG = "https://player.vimeo.com/video/123456/config" +VIMEO_VTT = "https://player.vimeo.com/texttrack/123.vtt?token=x" + + +class _Headers: + def __init__(self, ctype): + self.ctype = ctype + + def get(self, name, default=None): + return self.ctype if name.lower() == "content-type" else default + + +class _Resp: + def __init__(self, url, status, ctype, body): + self.url = url + self.status = status + self.headers = _Headers(ctype) + self._body = body + + def read(self, n=-1): + return self._body if n is None or n < 0 else self._body[:n] + + def geturl(self): + return self.url + + def __enter__(self): + return self + + def __exit__(self, *exc): + return False + + +class FakeNet: + """Maps url -> (status, content_type, bytes); unmapped urls assert.""" + + def __init__(self, mapping): + self.mapping = dict(mapping) + self.calls = [] + + def __call__(self, req, timeout=None, **kwargs): + url = getattr(req, "full_url", req) + self.calls.append(url) + assert url in self.mapping, f"unmapped url: {url}" + status, ctype, body = self.mapping[url] + if status >= 400: + raise urllib.error.HTTPError(url, status, "error", None, io.BytesIO(b"")) + return _Resp(url, status, ctype, body) + + +def robots_404(host): + return (f"https://{host}/robots.txt", (404, "text/plain", b"")) + + +def base_mapping(): + return dict( + [ + robots_404("example-articles.test"), + robots_404("www.youtube.com"), + robots_404("vimeo.com"), + robots_404("player.vimeo.com"), + ( + "https://blocked.test/robots.txt", + (200, "text/plain", (FIX / "robots_disallow.txt").read_bytes()), + ), + (ARTICLE_URL, (200, "text/html; charset=utf-8", (FIX / "article.html").read_bytes())), + (COPY_URL, (200, 
"text/html; charset=utf-8", (FIX / "article.html").read_bytes())), + (YT_OEMBED, (200, "application/json", (FIX / "youtube_oembed.json").read_bytes())), + (YT_LIST, (200, "text/xml", b"")), + (VIMEO_OEMBED, (200, "application/json", (FIX / "vimeo_oembed.json").read_bytes())), + (VIMEO_CONFIG, (404, "application/json", b"")), + ] + ) + + +def use_net(monkeypatch, mapping): + fake = FakeNet(mapping) + monkeypatch.setattr(net, "urlopen", fake) + return fake + + +@pytest.fixture(autouse=True) +def _clean_net(monkeypatch): + monkeypatch.setattr(net, "_robots", {}) + monkeypatch.setattr(net, "_last_hit", {}) + monkeypatch.setattr(net, "sleep", lambda _s: None) + + +def make_links_project(ws): + srcdir = ws / "incoming" + srcdir.mkdir() + notes = srcdir / "notes-with-links.md" + shutil.copy(FIX / "notes-with-links.md", notes) + (srcdir / "extra-notes.md").write_text( + "# Extra notes\n\nLocal supporting notes for the resolve test suite.\n", + encoding="utf-8", + ) + run_init("links") + run_intake("links", [str(notes)]) + return config.project_dir("links") + + +def local_target(ws): + return str((ws / "incoming" / "extra-notes.md").resolve()) + + +def load_manifest(pdir): + return { + rec["target"]: rec for _n, _raw, rec, _err in iter_jsonl(manifest_path(pdir)) if rec + } + + +# --- classify --------------------------------------------------------------- + + +@pytest.mark.parametrize( + ("raw", "kind"), + [ + ("https://www.youtube.com/watch?v=abc", "youtube"), + ("https://youtu.be/abc", "youtube"), + ("https://m.youtube.com/shorts/abc", "youtube"), + ("https://vimeo.com/123456", "vimeo"), + ("https://player.vimeo.com/video/123456", "vimeo"), + ("https://example.com/page", "url"), + ("http://example.com/page.pdf", "url"), + ], +) +def test_classify_target_urls(raw, kind): + assert classify_target(raw) == kind + + +def test_classify_target_local(tmp_path): + f = tmp_path / "doc.md" + f.write_text("x", encoding="utf-8") + assert classify_target(str(f)) == "local" + assert 
classify_target(str(tmp_path / "missing.md")) == "url" + + +# --- reducer ---------------------------------------------------------------- + + +def test_reduce_html_main_only_and_block_forms(): + html = (FIX / "article.html").read_text(encoding="utf-8") + body, title = reduce_html(html, ARTICLE_URL) + assert title == "Field Guide to Widgets" + assert "# Field Guide to Widgets" in body + assert "## Setup" in body + assert "- First step: inventory existing widgets" in body + assert "> Quoted wisdom about widgets from the maintainers." in body + assert "```\nwidgetctl install --all\n```" in body + assert "Name | Status" in body + assert "Alpha | stable" in body + assert "the spec (https://example-articles.test/spec.html)" in body + assert "manual (https://example-articles.test/manual)" in body + for noise in ( + "Site navigation", + "Site header", + "Footer copyright", + "Related links", + "scriptVar", + "Outside main", + ): + assert noise not in body + + +def test_extract_title_precedence(): + og = ( + '' + "Doc Title

H1 Title

" + ) + assert extract_title(og) == "OG Title" + titled = "Doc Title

H1 Title

" + assert extract_title(titled) == "Doc Title" + assert extract_title("

H1 Title

") == "H1 Title" + assert extract_title("

nothing

") is None + + +# --- net -------------------------------------------------------------------- + + +def test_robots_disallow_blocks_fetch(monkeypatch): + use_net( + monkeypatch, + dict([("https://blocked.test/robots.txt", (200, "text/plain", (FIX / "robots_disallow.txt").read_bytes()))]), + ) + with pytest.raises(FetchError, match="robots"): + net.http_get(BLOCKED_URL) + + +def test_size_cap(monkeypatch): + big = b"x" * (net.MAX_BYTES + 1) + use_net( + monkeypatch, + dict([robots_404("big.test"), ("https://big.test/file", (200, "text/html", big))]), + ) + with pytest.raises(FetchError, match="10 MiB"): + net.http_get("https://big.test/file") + + +def test_rate_limit_sleeps_between_same_host_requests(monkeypatch): + use_net( + monkeypatch, + dict( + [ + robots_404("rate.test"), + robots_404("other.test"), + ("https://rate.test/a", (200, "text/plain", b"a")), + ("https://rate.test/b", (200, "text/plain", b"b")), + ("https://other.test/c", (200, "text/plain", b"c")), + ] + ), + ) + naps = [] + monkeypatch.setattr(net, "sleep", naps.append) + net.http_get("https://rate.test/a") + net.http_get("https://rate.test/b") + assert len(naps) == 1 + assert 0 < naps[0] <= net.HOST_DELAY + net.http_get("https://other.test/c") + assert len(naps) == 1 + + +# --- fetchers --------------------------------------------------------------- + + +def test_fetch_local(tmp_path): + f = tmp_path / "extra.md" + f.write_text("# Local Title\n\nBody text.\n", encoding="utf-8") + doc = fetch_local(str(f)) + assert doc["title"] == "Local Title" + assert doc["source_type"] == "notes" + assert doc["url"] is None + assert doc["origin"] == str(f) + assert doc["transcript_status"] is None + assert "Body text." 
in doc["body_markdown"] + plain = tmp_path / "plain.txt" + plain.write_text("no heading here\n", encoding="utf-8") + assert fetch_local(str(plain))["title"] == "plain" + + +def test_fetch_url_article(monkeypatch): + use_net(monkeypatch, base_mapping()) + doc = fetch_url(ARTICLE_URL) + assert doc["source_type"] == "html-article" + assert doc["title"] == "Field Guide to Widgets" + assert doc["url"] == ARTICLE_URL + assert doc["origin"] == ARTICLE_URL + assert "# Field Guide to Widgets" in doc["body_markdown"] + + +def test_fetch_url_no_extractable_text(monkeypatch): + tiny = b"<html><body><p>too short</p></body></html>" + use_net( + monkeypatch, + dict([robots_404("tiny.test"), ("https://tiny.test/p", (200, "text/html", tiny))]), + ) + with pytest.raises(FetchError, match="no extractable text"): + fetch_url("https://tiny.test/p") + + +def test_fetch_url_unsupported_content_type(monkeypatch): + use_net( + monkeypatch, + dict([robots_404("img.test"), ("https://img.test/x", (200, "image/png", b"\x89PNG"))]), + ) + with pytest.raises(FetchError, match="unsupported content type"): + fetch_url("https://img.test/x") + + +def test_fetch_youtube_happy_path(monkeypatch): + mapping = base_mapping() + mapping[YT_LIST] = (200, "text/xml", (FIX / "youtube_timedtext_list.xml").read_bytes()) + mapping[YT_TRACK] = (200, "text/xml", (FIX / "youtube_timedtext_en.xml").read_bytes()) + use_net(monkeypatch, mapping) + doc = fetch_youtube(YT_URL) + assert doc["transcript_status"] == "fetched" + assert doc["source_type"] == "video-transcript" + assert doc["title"] == "Deep Dive Video" + assert doc["author_or_tool"] == "Chan Academy" + body = doc["body_markdown"] + assert body.startswith("# Deep Dive Video\n\n## Transcript\n") + assert "[00:00:00] Welcome to the deep dive." in body + assert "[00:00:04] Today we cover widgets & gadgets." in body + # gap over 8 seconds starts a new paragraph + assert "gadgets.\n\n[00:00:20] After a long pause we resume."
in body + + +def test_fetch_youtube_no_captions_yields_pending_stub(monkeypatch): + use_net(monkeypatch, base_mapping()) + doc = fetch_youtube(YT_URL) + assert doc["transcript_status"] == "pending" + assert doc["title"] == "Deep Dive Video" + body = doc["body_markdown"] + assert body.startswith("# Deep Dive Video\n") + assert "> [!info] Video transcript pending" in body + assert f"> Link: {YT_URL}" in body + + +def test_fetch_vimeo_happy_path(monkeypatch): + mapping = base_mapping() + mapping[VIMEO_CONFIG] = (200, "application/json", (FIX / "vimeo_config.json").read_bytes()) + mapping[VIMEO_VTT] = (200, "text/vtt", (FIX / "vimeo_track.vtt").read_bytes()) + use_net(monkeypatch, mapping) + doc = fetch_vimeo(VIMEO_URL) + assert doc["transcript_status"] == "fetched" + assert doc["title"] == "Vimeo Talk" + assert doc["author_or_tool"] == "Speaker Person" + assert doc["date_authored"] == "2024-05-01" + body = doc["body_markdown"] + assert "[00:00:00] Welcome to the talk." in body + assert "[00:00:04] We discuss widgets at length." in body + assert "length.\n\n[00:00:20] Closing remarks after a pause." 
in body + + +def test_fetch_vimeo_no_captions_yields_pending_stub(monkeypatch): + use_net(monkeypatch, base_mapping()) + doc = fetch_vimeo(VIMEO_URL) + assert doc["transcript_status"] == "pending" + assert "> [!info] Video transcript pending" in doc["body_markdown"] + assert f"> Link: {VIMEO_URL}" in doc["body_markdown"] + + +# --- discovery -------------------------------------------------------------- + + +def test_discover_targets_urls_and_punctuation(): + body = ( + "see https://example.com/x, then <https://example.com/y> and\n" + "[wiki](https://example.com/page_(1)) plus https://example.com/a#frag.\n" + "repeat https://example.com/x once more\n" + ) + raws = [t["raw"] for t in discover_targets(body, "")] + assert raws == [ + "https://example.com/x", + "https://example.com/y", + "https://example.com/page_(1)", + "https://example.com/a#frag", + ] + + +def test_discover_targets_local_paths(tmp_path): + extra = tmp_path / "extra.md" + extra.write_text("x", encoding="utf-8") + origin = str(tmp_path / "notes.md") + body = f"see [extra](extra.md) and `missing.md` and `{extra}`\nplain extra.md mention\n" + targets = discover_targets(body, origin) + assert targets == [{"raw": str(extra.resolve()), "kind": "local"}] + + +def test_preview_targets(ws, monkeypatch): + pdir = make_links_project(ws) + targets = preview_targets("links") + assert [t["raw"] for t in targets] == [ + ARTICLE_URL, + COPY_URL, + VIMEO_URL, + YT_URL, + BLOCKED_URL, + local_target(ws), + ] + assert [t["kind"] for t in targets] == ["url", "url", "vimeo", "youtube", "url", "local"] + assert all(t["parent"] == "S01" for t in targets) + + +# --- run_resolve ------------------------------------------------------------ + + +def test_run_resolve_requires_project_and_sources(ws): + with pytest.raises(ResynthError, match="not found"): + run_resolve("nope") + run_init("empty") + with pytest.raises(ResynthError, match="no sources"): + run_resolve("empty") + + +def test_run_resolve_unknown_source_id(ws): + make_links_project(ws) +
with pytest.raises(ResynthError, match="unknown source"): + run_resolve("links", source_ids=["S99"]) + + +def test_run_resolve_integration(ws, monkeypatch): + pdir = make_links_project(ws) + use_net(monkeypatch, base_mapping()) + result = run_resolve("links") + assert result["ok"] is True + assert result["gate"]["status"] == "PASS" + assert result["counts"] == { + "fetched": 2, + "cached": 0, + "duplicate": 1, + "transcript_pending": 2, + "failed": 1, + } + files = sorted(f.name for f in (pdir / "sources").glob("S*.md")) + assert len(files) == 5 + + article = next((pdir / "sources").glob("S02-*.md")) + fm, body = parse_frontmatter(article.read_text(encoding="utf-8"), article.name) + assert fm["schema_version"] == 2 + assert fm["source_type"] == "html-article" + assert fm["url"] == ARTICLE_URL + assert fm["resolved_from"] == "S01" + assert fm["sha256"] == sha256_text(body) + assert "transcript_status" not in fm + + vimeo = next((pdir / "sources").glob("S03-*.md")) + fm_v, body_v = parse_frontmatter(vimeo.read_text(encoding="utf-8"), vimeo.name) + assert fm_v["source_type"] == "video-transcript" + assert fm_v["transcript_status"] == "pending" + assert "> [!info] Video transcript pending" in body_v + + local = next((pdir / "sources").glob("S05-*.md")) + fm_l, _body_l = parse_frontmatter(local.read_text(encoding="utf-8"), local.name) + assert fm_l["source_type"] == "notes" + assert fm_l["url"] is None + assert fm_l["resolved_from"] == "S01" + + recs = load_manifest(pdir) + assert manifest_path(pdir).read_text(encoding="utf-8").startswith("#") + assert len(recs) == 6 + assert recs[ARTICLE_URL]["status"] == "fetched" + assert recs[ARTICLE_URL]["source_id"] == "S02" + assert recs[COPY_URL] == {**recs[COPY_URL], "status": "duplicate", "source_id": "S02"} + assert recs[VIMEO_URL]["status"] == "transcript_pending" + assert recs[YT_URL]["status"] == "transcript_pending" + assert recs[BLOCKED_URL]["status"] == "failed" + assert recs[BLOCKED_URL]["note"] == "disallowed by 
robots.txt" + assert recs[BLOCKED_URL]["source_id"] is None + assert recs[local_target(ws)]["status"] == "fetched" + assert all(r["resolved_from"] == "S01" for r in recs.values()) + + msgs = result["messages"] + assert f"{ARTICLE_URL}: fetched as S02 (html-article)" in msgs + assert f"{COPY_URL}: duplicate of S02" in msgs + assert f"{VIMEO_URL}: transcript pending, stub created as S03" in msgs + assert f"{YT_URL}: transcript pending, stub created as S04" in msgs + assert f"{BLOCKED_URL}: failed (disallowed by robots.txt)" in msgs + assert f"{local_target(ws)}: fetched as S05 (notes)" in msgs + assert msgs[-1] == "gate 01-intake: PASS" + + +def test_run_resolve_second_run_is_idempotent(ws, monkeypatch): + make_links_project(ws) + use_net(monkeypatch, base_mapping()) + run_resolve("links") + before = snapshot(ws) + result = run_resolve("links") + assert snapshot(ws) == before + # fetched and duplicate targets are cached; pending and failed retry + assert result["counts"] == { + "fetched": 0, + "cached": 3, + "duplicate": 0, + "transcript_pending": 2, + "failed": 1, + } + msgs = result["messages"] + assert f"{ARTICLE_URL}: cached (S02)" in msgs + assert f"{COPY_URL}: cached (S02)" in msgs + assert f"{local_target(ws)}: cached (S05)" in msgs + + +def test_transcript_upgrade_in_place(ws, monkeypatch): + pdir = make_links_project(ws) + fake = use_net(monkeypatch, base_mapping()) + run_resolve("links") + stub = next((pdir / "sources").glob("S03-*.md")) + fm_before, _ = parse_frontmatter(stub.read_text(encoding="utf-8"), stub.name) + fake.mapping[VIMEO_CONFIG] = ( + 200, + "application/json", + (FIX / "vimeo_config.json").read_bytes(), + ) + fake.mapping[VIMEO_VTT] = (200, "text/vtt", (FIX / "vimeo_track.vtt").read_bytes()) + result = run_resolve("links") + assert result["counts"]["fetched"] == 1 + assert result["counts"]["cached"] == 3 + assert result["counts"]["transcript_pending"] == 1 + assert f"{VIMEO_URL}: fetched as S03 (video-transcript)" in result["messages"] + + 
upgraded = next((pdir / "sources").glob("S03-*.md")) + assert upgraded.name == stub.name + fm, body = parse_frontmatter(upgraded.read_text(encoding="utf-8"), upgraded.name) + assert fm["source_id"] == "S03" + assert fm["transcript_status"] == "fetched" + assert fm["sha256"] == sha256_text(body) + assert fm["recency_rank"] == fm_before["recency_rank"] + assert fm["date_ingested"] == fm_before["date_ingested"] + assert fm["resolved_from"] == "S01" + assert "## Transcript" in body + assert "[00:00:00] Welcome to the talk." in body + recs = load_manifest(pdir) + assert recs[VIMEO_URL]["status"] == "fetched" + assert recs[VIMEO_URL]["source_id"] == "S03" + assert len(list((pdir / "sources").glob("S*.md"))) == 5 + + +def test_run_resolve_dry_run_writes_nothing(ws, monkeypatch): + pdir = make_links_project(ws) + fake = use_net(monkeypatch, {}) + before = snapshot(ws) + result = run_resolve("links", dry_run=True) + assert snapshot(ws) == before + assert fake.calls == [] + assert not manifest_path(pdir).exists() + would = [m for m in result["messages"] if "would fetch" in m] + assert len(would) == 6 + assert f"{VIMEO_URL}: would fetch (vimeo)" in result["messages"] + assert f"{YT_URL}: would fetch (youtube)" in result["messages"] + assert f"{local_target(ws)}: would fetch (local)" in result["messages"] + + +def test_run_resolve_only_filter(ws, monkeypatch): + pdir = make_links_project(ws) + use_net(monkeypatch, base_mapping()) + result = run_resolve("links", only="vimeo") + assert result["counts"] == { + "fetched": 0, + "cached": 0, + "duplicate": 0, + "transcript_pending": 1, + "failed": 0, + } + recs = load_manifest(pdir) + assert list(recs) == [VIMEO_URL] diff --git a/tests/test_synthesis.py b/tests/test_synthesis.py index 9730b8d..26057aa 100644 --- a/tests/test_synthesis.py +++ b/tests/test_synthesis.py @@ -18,6 +18,11 @@ def test_scaffold_generation(ws): assert "## Appendix: Source Register" in text assert "[!todo]" in text assert "[S01-C001, S02-C001]" in text + 
assert "| Source | Title | Type | Authority | Authored | Link | Content hash |" in text + row = next(line for line in text.splitlines() if line.startswith("| S01 |")) + cells = [c.strip() for c in row.strip("|").split("|")] + assert cells[2] == "report", "Type cell carries source_type" + assert cells[5] == "-", "Link cell renders a dash when url is absent" def test_full_synthesis_passes(ws): diff --git a/tests/test_wizard.py b/tests/test_wizard.py index 6feb4a3..999f53d 100644 --- a/tests/test_wizard.py +++ b/tests/test_wizard.py @@ -49,3 +49,19 @@ def test_state_done_after_seal(ws): pdir = run_full() run_brief("demo", "topic") assert project_state(pdir) == "done" + + +def test_cli_version_and_new_commands(ws): + from click.testing import CliRunner + + from resynth import __version__ + from resynth.cli import main + + runner = CliRunner() + res = runner.invoke(main, ["--version"]) + assert res.exit_code == 0 + assert f"resynth, version {__version__}" in res.output + assert "bye" not in res.output + for cmd in ("resolve", "migrate"): + res = runner.invoke(main, [cmd, "--help"]) + assert res.exit_code == 0, res.output