Skip to content

Latest commit

 

History

History
780 lines (634 loc) · 52.6 KB

File metadata and controls

780 lines (634 loc) · 52.6 KB

MSRBot.io — Consolidated Technical Chronicle

This document consolidates the MSRBot.io worklog into a single, category‑organized technical chronicle. Dates are de‑emphasized in favor of system architecture and implementation detail. All filenames, scripts, fields, and JSON keys are shown in monospace.

Consolidation Date: 2025-11-26
Applies to: pre-release → v1.0.0
Superseded by: CHANGELOG.md entries for v1.1.0+

How to Use This Document

This is a technical chronicle, not a release changelog.

  • Use CHANGELOG.md for user-facing release summaries.
  • Use this file when you need:
    • Architectural context
    • Implementation rationale
    • Recovery from data or build corruption
    • Deep validation of registry behavior

Sections are organized by system area, not by date. Dates are intentionally de-emphasized.

Executive Index

  • Extraction & Parsing: §1.1–§1.4
  • Status, MSI, Lineage: §1.2, §1.5
  • References & MRI: §1.6, §3
  • $meta & Provenance: §2
  • URL Validation & Normalization: §3
  • CI / Workflows: §4
  • Frontend & UX Systems: §6
  • Groups / Projects / RefTree: §6.6–§6.7

Table of Contents

1 Extraction & Automation Pipeline

This section covers the end-to-end extraction pipeline and its supporting automation: HTML-first parsing with PDF fallback, deterministic status wiring, resilient reference handling, and publisher/folder normalization. It documents the “source-of-truth” behaviors that keep ingestion stable across document families.

1.1 HTML + PDF Fallback Extraction (SMPTE milestone)

  • End‑to‑end ingestion for SMPTE documents with HTML primary parsing and safe PDF‑only fallback.
  • index.html missing → treated as likely PDF‑only; inferMetadataFromPath() derives docId, releaseTag, publicationDate, doi, href, docType, docNumber, docPart, publisher. Inferred fields are merged without overwriting existing data.
  • Amendment suffix handling corrected end‑to‑end: docId, doi, and href are derived from the final ID including amendment suffixes (e.g., .2011Am1.2013).
  • Added revisionOf extraction from HTML via <meta itemprop="pubRevisionOf">; value stored as array.

1.2 Status Wiring & Normalization

  • Only one document per lineage can have status.latestVersion: true; that document is also status.active: true and status.superseded: false. All others: latestVersion: false, active: false, superseded: true.
  • Deterministic mapping for ambiguous cases: unknown → superseded: false.
  • Base releases without amendments receive explicit defaults: status.amended = false, status.amendedBy = [].
  • status.supersededBy wiring: each base points to the next base in sequence; amendments inherit the base’s pointer. status.supersededDate injected from the next base’s releaseTag. $meta injected for both fields on create/update.
  • Publisher status derivation: status.active and status.superseded automatically computed from status.latestVersion. Guarantees lineage consistency and prevents conflicting “active” flags.
  • Amendment promotion logic: when an amendment is the latest, the amendment is active: true, latestVersion: true; the base remains active: false, superseded: true.
    Prevents incorrect base flips when an amendment becomes latest.

1.3 Reference Parsing & Resilience

  • Reference arrays are always present and normalized: defaults for references.normative and references.bibliographic.
  • $meta injected consistently for new docs and updates; avoids emission for undefined or empty arrays.
  • Latest‑version determination aligned with wrapper releaseTag ordering.

1.4 Folder & Publisher Parsing

  • Version‑folder regex upgraded to handle amendments and publication stages; accepts *-dp.
  • Publisher parsed from <span itemprop="publisher"> (defaults to SMPTE only if missing). Guards prevent -undefined in docId/docLabel/doi.
  • docLabel formatting for amendments standardized (space before Am, e.g., SMPTE ST 429-2:2011 Am1:2013).
  • Publisher taxonomy refinement: standardized publisher metadata and document group classification across SMPTE, ISO/IEC, ITU, AES, and related families.
    Prevents mismatched publisher aliases during extraction and validation.
  • Normalized OM/AG handling: standardized organizational markers
    (e.g., AG10b → AG10B; removed “SMPTE ” prefix in OM titles).
  • Automated publisher detection: extraction automatically maps publisher metadata using url.rules.js expectations, selecting the correct organization context during parsing.

1.5 Master Suite Index (MSI) & Lineage

  • buildMasterSuiteIndex.js produces a lineage view with publisher, suite, number, part, history, and latest flags.
  • Diagnostics and flags: MISSING_BASE_FOR_AMENDMENT, MULTIPLE_LATEST_FLAGS, draft filtering (status.draft: true), versionless handling via inferVersionless() and statusVersionless.
  • Output is stably sorted with counts, latest IDs, and consolidated publisher normalization across SMPTE, ISO/IEC, NIST, W3C, IETF, DCI, ATSC, ITU, AMWA, AES, AMPAS, AIM, ARIB, NFPA, etc.
  • Documents now explicitly annotated with lineage keys:
    msiLatestBase, msiLatestAny, latestDoc, docBase, and docBaseLabel for stable linkage between MSI and MSR datasets.
    These fields are injected with $meta provenance during index build.
  • Added SMPTE-only sanity flag SMPTE_MISSING_RELEASE_TAG:<docId>; surfaces missing releaseTag on SMPTE documents during MSI build.

1.6 Reference Mapping, MSI Integration, and MRI Foundations

  • src/main/lib/keying.js centralizes keying logic; MSI loaded once to build latestByLineage and baseIndex maps used across extraction and site build.
  • Reference upgrader behavior: dated citations left as‑is; undated citations upgraded via lineage (trailing‑dot probe algorithm). Confirmed upgrades include IEC.61966-2-1, ISO.10646, ISO.15444-1, ISO.15948.
  • Missed hits upgraded cleanly; template shows undated labels while links resolve to latest. Optional hover tip supported.
  • Structural refactor initiated: move reference parsing/building into a single referencing.js library for both extraction and build.
  • Master Reference Index (MRI): new artifact planned under src/main/reports/, logging all seen refs, parsed IDs, source doc, raw forms, and titles; serves as the first point of truth for orphans and future PDF parsing.
  • Added detailed reference-upgrader diagnostics — logs now trace probe → key → HIT → upgrade sequence for transparency when resolving undated citations.
  • Confirmed upgrade coverage explicitly documented for ISO 15444-1, ISO 10646, and IEC 61966-2-1 families.
  • Manual namespace backfill: temporary process remains in use until full automated extraction of targetNamespace and import structures is implemented.

2 Metadata & Provenance System

This section defines how MSRBot.io tracks provenance and prevents noisy diffs: $meta injection rules, inferred vs parsed confidence, namespace metadata foundations, and field-level change locks. The intent is to preserve curated data while keeping automated updates deterministic and explainable.

  • $meta injection logic overhauled to write only when a field value actually changes; eliminates false‑positive diffs and redundant metadata writes.
  • Inferred vs. parsed provenance tracked. Default confidence: "medium" applied to inferred fields; source annotated per field path.
  • Namespace metadata extended: xmlNamespace objects include deprecated: boolean (foundation for structured namespace tracking with uri, targetNamespace, imported, sourceDocId, schemaLocation).
  • $meta.note definition mapping: centralized through metaConfig for provenance consistency; future extensions may enrich note templates with field-specific context.
  • Introduced $meta.excludeChanges: true as a field-level lock (applies to any field, including nested like status.active, status.latestVersion, status.superseded). Extraction respects locks via setFieldIfAllowed(doc, fieldPath, newValue) and isFieldExcluded().
  • Nested awareness: isFieldExcluded() handles one-level nested paths (e.g., status.active) with extension headroom for deeper hierarchies.
  • Behavior: locked fields are skipped cleanly during extraction/inference (console log only, no PR entry); $meta.overridden and PR diffs update only when a change is allowed and actually occurs.

3 Validation & URL Resolution

This section documents the URL reliability layer: validation audits, normalization/backfill behavior, publisher-specific rules, and document-level checks. It’s focused on producing consistent reports, minimizing false positives, and enabling safe CI gating.

3.1 URL Validation (url.validate.js)

  • Added “good URL” count alongside unreachable and redirect totals.
  • Redirect issues split into two buckets: undefined (missing resolved target) and mismatch (existing redirect differs from expectation).
  • Unified audit written to src/main/reports/url_validate_audit.json with a clear header summary.

3.2 URL Normalization (url.normalize.js)

  • Replaces url.enrich.js. Performs targeted backfill for resolvedHref with $meta tracking.
  • Defaults to validation‑only; writes only in apply mode.
  • Emits src/main/reports/url_validate_normalize.json summarizing proposed/applied normalizations for CI gating.

3.3 URL Rules (url.rules.js)

  • Publisher‑specific expectation map (SMPTE, W3C, IETF, etc.). Informational for now; reports mismatches without auto‑fix.
  • Establishes foundation for enforcing expected href patterns and redirect targets.

3.4 Documents Validation

  • documents.validate.js checks duplicate docId, registry sort order, and performs soft URL reachability checks.
  • Modular resolveUrlAndInject() shared across extraction and validation; injects resolvedHref when missing or changed.
  • All URL‑related reports written under src/main/reports/ with consistent JSON headers.

3.5 Overrides Audit

  • Added src/main/scripts/audit.overrides.js to scan for $meta.overridden === true.
  • Outputs src/main/reports/overrides_audit.json (JSON-only; CSV export dropped). No PR creation and no MSI dependency.
  • Skips trivial/null/empty originalValue entries; groups results alphabetically by field with per-field totals.

4 Workflow & CI/CD

This section describes the GitHub Actions “factory line”: chained MSI → MRI → site builds, concurrency controls, preview deployments, branch hygiene, and operational hardening. It also captures identity consolidation under the PrZ3 Unit GitHub App and the expanded URL-issue taxonomy.

4.1 Chain Orchestration (MSI → MRI → MSR)

  • Workflows run in strict sequence using workflow_run triggers. Any upstream change triggers the full chain rebuild.
  • Concurrency protections:
    • MSI: mastersuite-index
    • MRI: masterreference-index
    • MSR Site (MSRBot.io): msr-site-${{ github.ref_name }} with cancel‑in‑progress enabled
  • Permissions set per workflow: contents: write, pull-requests: write, issues: write.

4.2 MRI Workflow (build-master-reference-index.yml)

  • Metadata‑only paths (generatedAt updates) commit directly to main (no empty PRs).
  • Commits both masterReferenceIndex.json and mri_presence_audit.json directly to main when in metadata‑only mode; no hard reset to avoid file loss.
  • Issue creation rebuilt: proper Markdown newlines, readable bullets for cite, title, href, rawRef. Missing‑ref issues auto‑close when resolved. onlyMeta=true suppresses PR creation.
  • PR base parameter fix: all MRI workflow PRs now set base: ${{ github.event.repository.default_branch }} explicitly to ensure correct merge targeting.
    Prevents orphaned branches from detached workflows.

4.3 Weekly MSI Workflow Hardening

  • UNKEYED issues: one per docKey, idempotent, closed only from default‑branch runs.
  • PR policy: lineage/inventory deltas → PR; flags/UNKEYED/metadata → auto‑commit to main.
  • Diff classifier (inventoryChanged) routes outputs appropriately. PR bodies include flags and UNKEYED counts.
  • Triggers: weekly cron (04:15 UTC), push to main, and manual dispatch.

4.4 PR Preview & Build Chain (pr-build-preview.yml)

  • Automatic PR previews deployed to gh-pages/pr/<PR#>/ with a comment posting the live link.
  • Works for both direct PRs and workflow_run triggers from extraction; preview check appears in the PR Checks tab and links to the deployed preview.
  • Reliability improvements: fixed reused‑PR gaps, retry logic, stable destination_dir, keep_files: true to preserve previews, and CNAME‑safe redirects.

4.5 Branch Hygiene (Branch Sweeper)

  • .github/workflows/branch-sweeper.yml cleans stale branches with dry‑run support.
  • Protections: default branch, main, master, gh-pages, and branches with open PRs.
  • Options: exclude chore/ prefixes by default; manual runs can include them. Logs show would‑delete/deleted/skipped categories with reasons.
  • Fixes include injected core globals, robust YAML boolean coercion, full input sanity logging, commit‑date fallbacks, and pagination for repos with >100 branches.

4.6 Repo/Workflow Ops

  • Node cache for faster CI startup. Conditional normalization + PR creation gated on real change signals (redirectUndefinedCount > 0, applied > 0).
  • Post‑audit sync‑to‑main prevents base/head conflicts. Normalization PRs use a rolling branch chore/url-normalize and auto‑delete on merge; guards prevent self‑trigger.
  • README documentation expanded with an “Automated Workflow Chain (with Samples)” section: triggers, datasets, expected outputs, and sample links to runs, reports, PRs, and issues.

4.7 PrZ3 Unit Integration — Phase 1 (GitHub App Identity)

  • Adopted PrZ3 Unit (GitHub App) as the unified automation identity across all workflows; replaces github-actions[bot].
  • App connection established to the PrZ3r organization and MSRBot.io repository.

Core Setup

  • Secrets added: APP_ID, APP_PRIVATE_KEY.

Behavior & Identity

  • Commits authored as: PrZ3 Unit <prz3-unit[bot]@users.noreply.github.com>.
  • PRs and comments show actor: prz3-unit[bot].
  • Issue automation (resolve/close) confirmed (“Resolved: no outstanding 404s…”, etc.).
  • Preview builds post comments and deploy commits as the bot.
  • Validation and extract PRs open with bot authorship; auto-closing logic intact.

Operational Notes

  • Actions audit trail displays consistent bot identity across runs.

Verification

  • App token issuance and usage validated end-to-end.
  • Commit history, PRs, and Pages deploy/cleanup all operate correctly under PrZ3 Unit.

Result

  • All automation now runs under a single, auditable identity with minimal token sprawl and App-based security hardening.

4.8 URL Validation Workflow Enhancements (Publisher & Item-Level Issue Expansion)

  • Scope: .github/workflows/validate:url.yml

Additions

  • New error categories with full upsert + autoclose behavior:
    • 403 — per publisher (labels: automated, url, 403)
    • HPE_INVALID_CONSTANT — per publisher (labels: automated, url, hpe-invalid-constant)
    • ERR_BAD_REQUEST — per item (labels: automated, url, bad-request)
    • UNABLE_TO_VERIFY_LEAF_SIGNATURE — per item (labels: automated, url, leaf-signature)
  • Integrated into audit parsing, issue creation, and autoclose passes.
  • Issue bodies use a deterministic summary (Count, docIds, URLs, run tag), matching existing ENOTFOUND/404/400 conventions.
  • Added ENOTFOUND handling (treated equivalently to 404) with deterministic issue body and canonical labels.
  • Capped per-issue body list to first 25 entries with a total count noted.
  • Mirrored behavior applied to 404 for consistency.

Refinements

  • Label scheme cleaned:
    • Removed legacy single-labels (e.g., url-404, url-enotfound).
    • All issues now use canonical multi-label sets: automated, url, <category>.
  • Autoclose logic validated across new and existing issue types.
  • Body formatting aligned to established deterministic Markdown style.
  • Removed the postChunks() helper and all call sites across grouped/publisher sections.
  • Applied a unified issue body structure to all publisher-level categories:
    • 404, 403, ENOTFOUND, HPE_INVALID_CONSTANT, ERR_FR_TOO_MANY_REDIRECTS, SMPTE.resolvedHref.standards-prefix
  • Left per-item categories unchanged (bodies link directly to the audit JSON):
    • 400, ERR_BAD_REQUEST, UNABLE_TO_VERIFY_LEAF_SIGNATURE
  • Verified autoclose and label handling remain unaffected.

Outcome

  • Workflow now covers 9 error/mismatch categories with consistent publisher vs. item separation, self-updating issue bodies, and full autoclose hygiene.
  • Old label compatibility removed; unified label taxonomy is authoritative going forward.
  • Issue bodies are clean, consistent, and readable at a glance.
  • Full datasets remain accessible via a single stable link: src/main/reports/url_validate_audit.json.
  • Large audits avoid comment spam; runs stay fast and quiet.

5 Registry Architecture & Data Model Evolution

This section summarizes structural evolutions to the registry data model: metaConfig governance, withdrawn/stabilized handling, repo link validation, and related extraction-to-build refinements. It’s the bridge between raw ingestion and the normalized artifacts the site depends on.

  • Consolidated metaConfig governs notes for status.stabilized, status.withdrawn, and status.withdrawnNotice.
  • Withdrawn notice handling:
    • Reachability check performed once per URL.
    • Non‑enumerable __withdrawnNoticeSuffix tracks "verified reachable" or "link unreachable at extraction".
    • On create: $meta.note combines base note + suffix (deduplicated). On update: $meta.note updated only if URL changes. Regex normalizer strips duplicate suffixes.
  • Repo URL validation: HEAD checks before writing repo prevent invalid links.
  • Discovery output cleanup: suite/child formatting improved; merge/update phase uses logSmart.
  • Withdrawn and stabilized flag extraction: extraction recognizes and populates status.withdrawn and status.stabilized fields directly from document metadata for registry completeness.

6 Frontend & Site Publishing

Section Summary

This section documents the transformation of MSRBot.io from a raw registry viewer into a polished, reader‑first product. It covers the full frontend stack: card‑based browsing, deterministic search and faceting, per‑document rendering, citations, branding, RefTree exploration, normalized Groups/Projects views, and theming. The emphasis throughout is clarity, stability, and discoverability for end readers, with strong guardrails to prevent UI regressions from backend changes.

Section 6 Contents

This section covers the site UX and publishing layer: the card-based registry, deterministic search/faceting, per-doc rendering hardening, MSI surfacing, citations tooling, branding/SEO assets, RefTree, and normalized Groups/Projects views. It’s where the registry becomes a usable product.

  • PR previews deployed for each open PR with a durable URL. Checks include write permission and attach to the PR’s head SHA.
  • Links are stable under both github.io and the CNAME (msrbot.io).
  • Plan: staging subdomain (e.g., test.msrbot.io) for broader pre‑prod validation.

6.1 Card-Based Registry View

  • Refactored registry interface to a responsive card-based view using Handlebars templates with synchronized filters and facets.
  • Implemented src/site/js/cards.js to load search-index.json and facets.json, render cards via Handlebars, and manage live filtering.
  • Added helper functions: join, len, gt, statusBadge, coalesce, and hasAny for inline template logic.
  • Introduced fallback diagnostics for missing templates (#card-tpl / #card-tpl-src) to prevent silent render failures.
  • Corrected publiccd mapping within statusBadge to ensure proper badge display across document types.
  • Added syncFacetCheckboxes() to maintain state parity between filter chips and facet checkboxes, including the “Clear All” action.
  • Updated msrbot.css for badge right-alignment and text wrapping on small screens.
  • Updated cards.hbs to integrate existing Bootstrap header/footer, sticky topbar, and unified card render block.
  • Verified automatic facet expansion for status and docType; confirmed live synchronization between chip removal and facet checkboxes.
  • Result: responsive, visually consistent registry cards with fully synchronized filters and improved mobile usability.

6.2 Branding, SEO & 404 Overhaul

  • Project rebrand and repository migration: moved to PrZ3r/MSRBot.io; redirects active for previous repo and domain.
  • Domain alignment: msrbot.io established as canonical host; mediastandardsregistry.org configured to redirect via DNS.
  • Single-source configuration: introduced src/main/config/site.json as canonical metadata store; removed duplicate inline defaults from build.js.
  • Environment variable overrides: supports SITE_CANONICAL_BASE, SITE_NAME, and SITE_DESCRIPTION for staging.
  • Meta & SEO injection: build now applies siteName, siteDescription, canonicalUrl, ogTitle, ogDescription, and ogImage across all pages.
  • Header partial overhaul (header.hbs): canonical URL, Open Graph, and JSON-LD schema (WebSite + searchAction); robots meta (index/noindex) applied dynamically.
  • Added comprehensive favicon and language metadata; optimized preconnects and HTTPS-only links.
  • Generated build assets: robots.txt, sitemap.xml, opensearch.xml, and 404.html all rendered with header/footer and OG context.
  • Dynamic 404 page (“Disappointed Penguin Edition”): Handlebars-driven layout with randomized, config-based message list; robots meta = noindex,follow.
  • Asset and link cleanup: introduced assetPrefix for consistent partial reuse; navbar links now root-absolute; redundant defaults removed.
  • OG image generation: 1200×630 PNG featuring new penguin + logo branding on teal/circuit background.
  • Identity & icon: finalized MSRBot.io logo and SVG favicon (clean circular mark, document-forward, non-derivative); scalable and consistent across all resolutions.
  • Result: complete brand, SEO, and metadata integration; clean 404 UX; site now fully single-config, portable, and identity-consistent.

6.3 Registry Cards & Search System

  • Extended card-based registry UI with full search, filtering, and navigation logic.

Rendering & Layout

  • Handlebars runtime loaded on demand for client-side rendering.
  • Status badge alignment/wrapping corrected; consistent badge classes and typo cleanup.
  • Introduced sticky bottom pager (later removed sticky top) with proper persistence.
  • Added results summary line: “Showing X–Y of N (filtered from M)”.
  • Page-size selector improved for clarity; auto-selects current value.

Filters & Chips

  • Two-way synchronization between facet checkboxes and active chips.
  • “Clear All” now clears filters, chips, and search box.
  • Removing a chip unchecks the corresponding facet.
  • Replaced hasCurrentWork boolean with currentWork list facet; updated labels and search mapping.

Sorting

  • Newest/Oldest now sort by pubTs with fallbacks.
  • Added Title and Label A↔Z sorts with article stripping (“the”, “a”, “an”).
  • Introduced reversible sort toggling for all modes.

Pagination & Navigation

  • Numbered page jumpers with prev/next and keyboard navigation.
  • URL-synced pagination and page size for deep-linking and browser history.
  • Bottom pager hidden when top is visible to reduce clutter.
  • Year facet implemented as a compact dropdown (“All years” default) linked with chips and clear-all.

Deep Links & Anchors

  • #docId anchors locate items across pages/filters; if filtered out, filters reset and jump performed.
  • Scroll offset accounts for sticky navbar and topbar with highlight flash.
  • On filter/sort/page/search change, hash stripped to prevent accidental jumps.
  • Offset tuned for consistent landing beneath both sticky elements.

URL State

  • Query string now mirrors page, size, sort, q, and f (filters).
  • On load, state rehydrates search box, sort, page size, and filters from URL.
  • Year dropdown and page-size controls initialize from URL state.

Search Engine

Index Build (Node)

  • build.search-index.js generates:
    • build/cards/search-index.json (flat rows from documents.json + group/project joins).
    • build/cards/facets.json with counts and group-label map.
  • Optional synonyms.json copied into build.
  • MiniSearch UMD resolved from node_modules or CDN (redirect-follow supported).
  • Fixed undefined.false builder error; tightened status bucketing.

MiniSearch (Client)

  • Loader prefers UMD; falls back to simple includes() if unavailable.
  • Field boosts: title:6, id:5, label:4, keywords:3, keywordsSearch:2, currentWork:1.
  • Token rules: prefix ≥ 3 chars; fuzzy 0.1 ≥ 4 chars (to curb over-broad hits).
  • Bi-directional synonyms and intra-group linking
    ("isdcf" ↔ "inter-society digital cinema forum" etc.).
  • Quoted phrases perform literal matches via precomputed haystack intersection.
  • Search syntax supports field scopes (publisher:isdcf, title:"accessibility"), exclusions (-draft), and facet combination.

Behavior Fixes & Polish

  • Keywords now included in both faceting and weighted search.
  • Search bar updates URL state; “Clear All” resets it.
  • Guarded MiniSearch failure path logs error, falls back gracefully.
  • Fixed chip/checkbox desync edge cases.
  • Case-insensitive, article-stripped sorting for label/title.
  • Resolved “0 results” initialization bug from state-init order.
  • Page truncation at 40 fixed; page-size control honored via URL.
  • Year-facet chip removal resets dropdown correctly.
  • Deterministic facet-section order; docType and status expand by default.

Result

  • Registry now offers full-fidelity search, deep-linkable state, and robust filter synchronization within a responsive, mobile-friendly card interface.

6.4 Search Index & UX Enhancements

Search Index & Schema

  • Enriched search index rows with publisher, doi, group, and publicationDate; verified propagation to cards and facets.
  • docType now stores the full name (e.g., “Standard”); abbreviations (e.g., “ST”) moved to docTypeAbr.
  • Removed Unknown fallback — docType is mandatory.

Search Behavior

  • MiniSearch tuning:
    • Improved exact-phrase detection for quoted queries.
    • Adjusted fuzzy and prefix thresholds to reduce over-broad matches.
    • Enforced true AND semantics across multiple terms (eliminates OR overloads).
    • Document-number normalization (e.g., 429-2:2020) returns exact matches reliably.
  • Introduced mode toggle scaffolding for simple vs smart search (future-ready).

Facet & Status Logic

  • Removed statusPrimary; facets and badges now derive from any statusFlags marked true.
  • Added facets.statusLabels for user-friendly names (e.g., “Active”, “Amended”); consistent badge order on cards.
  • Multi-flag AND logic for status filtering (e.g., “Active + Amended” requires both).

File Organization

  • Build outputs reorganized: documents.json now emitted under build/cards/_data/ for clearer separation.
  • Card views render exclusively from the search index; no dependency on the root documents.json.

UI Enhancements

  • Added a Bootstrap Search Tips popover:
    • Covers phrase matching, AND logic, field filters, doc-number normalization, fuzzy rules, and synonyms.
    • Keyboard shortcut ? opens tips from the search bar.
    • Placement refined to sit just before the sort dropdown (discoverable and unobtrusive).

Resilience & Integration

  • Tip-button installer guarded against duplicates (safe on popstate reloads).
  • Works with Bootstrap fallback (alert() version) if JS is unavailable.
  • popstate re-invokes installSearchTips() to maintain functionality on back/forward navigation.

Result

  • Deterministic search/filtering with consistent docType, status, and metadata rendering.
  • Clear, discoverable search UX; build and data pipelines write to the correct directories.

6.5 Per-Doc Rendering, MSI Surfacing & Citations

Per-Doc Emit & Resilience

  • Per-document emit stabilized and debuggable; added defensive prepareDocForRender() normalizers to guard against missing/partial fields.
  • Upgraded per-doc failure logs to include docId, publisher, docType, and refs: yes/no for faster triage.
  • Sub-registries load correctly: registries[].subRegistry now pulls groups, projects, and documents where declared; templates receive dataDocuments, dataGroups, dataProjects.
  • Hardened detail pages against missing subregistry data; isolated doc failures now log cleanly instead of crashing templates.

MSI Wiring on Pages

  • Reads reports/masterSuiteIndex.json and annotates each document with: msiLatestAny, msiLatestBase, isLatestAny, isLatestBase, docBase, docBaseLabel.
  • Surfaces a per-base suite (docSuite) sorted by date for clear lineage context on detail views.
  • Suite logic is stable and sorted; title/label preference rules implemented:
    • docLabel used for listings.
    • docTitle preferred in detail view when docType is in titleLabelDocTypes.
    • perDocTitle and listTitle reflect the chosen hierarchy.

Reference Plumbing (Non-Mutating)

  • Computes referencesResolved in parallel (does not mutate source arrays).
  • Builds referencedBy and a bounded referenceTree (DFS depth 3) for contextual back-links on pages.

Status Strings, Badges & DocId Layout

  • Consolidated currentStatus label; getStatusButton returns green check for Active, red slash for Superseded/Withdrawn.
  • Card and detail templates use consistent badge ordering and classes.
  • DocId header layout rebuilt:
    • Logo → Label → Badges on a tight flex row; label wraps fully without breaking layout.
    • Status badges moved out of the H1 into a dedicated flex region.
    • Small screens: header intelligently stacks; badges move to top-left; label wrap and word-break behavior fixed.
    • H1 title isolated from layout shifts; badge overflow issues resolved.
  • CSS stabilized for badges:
    • align-items: flex-end, per-badge white-space: nowrap, controlled flex-basis.
    • Long titles no longer push logos/badges into odd rows; explicit min-width: 0 on the title block.
    • Status badges right-aligned on wide screens, stack correctly on small screens.
    • Media-query overrides enforce predictable wrap behavior; layout is now stable and deterministic.

Search & Data Outputs / Build Pipeline

  • Documents build emits build/docs/_data/documents.json (strips $meta) for the site layer.
  • build.search-index.js produces _data/search-index.json and related artifacts; cards render from the search index (no root documents.json dependency).
  • Per-doc generator ensures each detail page has access to groups, projects, and referencesResolved; suites use the effective docs dataset.
  • CSV export and effective docs datasets now strip all $meta fields, significantly reducing output size and keeping UI-facing data clean.

Publisher Branding

  • Server writes _data/publisher-logos.json and _data/publisher-urls.json from site.json (honors alias map).
  • Handlebars helpers {{publisherLogo}} and {{publisherLink}} added:
    • Cards display publisher logos next to labels with controlled max-width to prevent runaway logos.
    • DocId pages place the logo on the title line (next to label) for strong visual identity.
    • Logos are linkable on detail pages via publisher URLs defined in site.json.
  • Fixed 404s by ensuring _data/ exists under build/ and using correct asset prefixes.

Citations System

  • Centralized helpers: citeText, citeHtmlGeneric, citeHtmlSmpte, plus code-safe variants for copy blocks.
  • Config-driven SMPTE previews vs snippets (site.json → citations.smpte.preview/snippet).
  • joinAuthors is CSL-JSON aware with Oxford comma support and configurable separators; [object Object] issues resolved.
  • New helpers introduced:
    • citeIfEq, citeIfNotEq
    • getUndatedLabel, getUndatedTitle
  • citeIfEq supports config lists (e.g., nonLineageDocTypes) or comma-separated lists, allowing templates to branch without hardcoding.
  • Snippet engine:
    • Copy buttons always pull from snippet text, never the preview.
    • Uses innerText / textContent for reliable extraction.
    • Per-snippet IDs and DOM-safe getElementById lookups ensure correct binding.
  • Undated variant infrastructure scaffolded:
    • Preview + snippet wrappers wired.
    • Buttons appear only when allowed.
    • Logic paused mid-implementation pending a cleaner approach after complexity surfaced.

Template & DX Helpers

  • Utility helpers added and refined: ifeq, ifnoteq, or, and, len, asArray, formatLineageKey, getLabel, getUndatedLabel, ID-safe slugging in citations, and group/project lookup helpers.
  • Fixed edge cases in len, asArray, and helper sequencing that previously caused template crashes.
  • Error messages for isolated document failures are clearer and more localized.

Site Chrome

  • Build emits robots.txt, sitemap.xml, opensearch.xml, and header/footer-styled 404.html with penguin quips (consistent with branding and SEO overhauls).

Result

  • Fewer brittle template crashes and clearer failure signals.
  • Single source of truth (site.json) drives visuals (logos/links) and logic (doc-type lists and publisher metadata).
  • MSI data is first-class on pages; every document exposes suite context and “latestness.”
  • Detail pages (DocId views) present a stable, responsive header with clean logo/badge behavior; CSS is predictable across breakpoints.
  • Editors get copy-ready citations with accurate previews, robust author formatting, and safer snippet behavior.

6.6 Reference Tree System — Graph-Based Cross-Reference Explorer

Overview

  • Implemented a full Reference Tree system that exposes cross-document relationships as a first-class feature.
  • Entry points:
    • /reftree/ index page with search and curated starting points.
    • Per-document RefTree pages generated from registry data and MSI/refs plumbing.
  • RefTree now behaves as a stable, user-facing feature rather than a prototype.

RefTree Index Page

  • Built a dedicated landing page at /reftree/ with two primary panels:
    • Start Exploring — search box with smart matching and suggestions.
    • Interesting Starting Points — auto-ranked “most-connected” documents to onboard new users.
  • Added suggestion UI and preview lists to avoid 404 traps when searching by label/title/number.

Search Engine & Matching

  • Normalized queries for robustness (case-insensitive, tolerant of dots/dashes and number formatting).
  • Implemented match scoring:
    • Exact docId matches score highest.
    • Exact label/title matches score just below.
    • Compact and substring matches use lower tiers.
  • Auto-redirects only on high-confidence matches; otherwise presents a result list with label + title.
  • Added “Show all N matches” behavior for broad queries (e.g., publisher or series searches).

Most-Connected Documents

  • Derived “most-connected” ranking from referencesResolved + referencedBy data.
  • Surface top documents by combined reference degree, showing label, title, and docId for clarity.
  • Used as curated tour starting points (e.g., 428/429 families, IMF, ST 2110, NIST 800-series).

Per-Document RefTree Pages

  • Introduced per-doc RefTree views with two modes:
    • Simple — depth-based neighbors only.
    • Expanded — full graph traversal via resolved references.
  • Features:
    • Depth selector (1–5 plus max, with support for values above 5).
    • Node expand/collapse controls to manage large graphs.
    • Clear upstream/downstream labeling and parent-level detection.
    • Graph-line and layout adjustments for readability at multiple depths.

Root Card & Layout

  • Root document presented as a sticky top card:
    • Uses build pipeline helpers for accurate label/title/status logic.
    • Consistent ordering of label, title, status, and publisher icons.
    • Styling aligned with the main docId detail pages.
  • Layout improvements:
    • Cleaner top-of-page copy oriented toward non-expert users.
    • Controls (depth selector, expand/collapse) kept on a single stable toolbar row across breakpoints.
    • No-underline button styles for interactive controls; text remains readable and scan-friendly.

Templates, Helpers & Shared JS

  • Split logic between:
    • refTree.js — per-doc tree behavior.
    • refTreeIndex.js — index/search and “most-connected” logic.
  • Removed inline script fragments from templates; moved behavior into shared JS modules.
  • Integrated build-time helpers (e.g., getLabel, status helpers, publisher helpers, “any” checks) for consistent display.
  • Fixed missing publisher variables and mismatched helper usage that previously caused inconsistencies.

Build & Data Plumbing

  • Updated build.js to:
    • Emit per-doc RefTree JSON cleanly for each document.
    • Inject a full documents snapshot into _data for the RefTree index and search.
    • Resolve initialization ordering issues (e.g., docProjs before use).
  • Ensured RefTree draws exclusively from the registry’s authoritative documents and refs (including referencesResolved and referencedBy).

Deferred Enhancements (Future Work)

  • Left space for future iterations:
    • Fully interactive GraphViz/SVG-style visualizations.
    • Path highlighting for step-by-step traversal.
    • Dedicated /search/ endpoint for full-registry search.
    • Specialized typography or indicators for cycles and suite/family clustering.

Result

  • RefTree now provides:
    • A stable UX foundation for exploring document relationships.
    • Onboarding-friendly entry points for new users.
    • High-signal tooling for editors working with complex reference graphs.
    • Clear extension paths for future visualization and search enhancements.

6.7 Groups, Projects, Theming & Navigation Overhaul

Normalized Groups Registry

  • Rebuilt groups system as a first-class normalized registry; removed legacy recursive overwrite loop.
  • build/groups/_data/groups.json emitted as the authoritative normalized dataset; raw export provided as groups.raw.json.
  • Stable normalized fields: groupId, groupOrg, groupName, groupDesc, groupType, groupStatus.active, isActive, statusText, hasParent, tcId, groupLabel, searchText.
  • Standardized groupLabel rules:
    • TCs: ORG <tcName> <tcDesc>
    • Child of TC: ORG <tcName> <childName> <childDesc>
    • Deep children: ORG <tcName> <nonTC ancestors> <selfName> <selfDesc>
    • TC description not inherited by children.

Groups UI

  • Groups page rebuilt to operate on normalized data only.
  • Cards display groupLabel, groupOrg, groupType, groupSummary, statusText, isActive.
  • Parent/TC links working; assigned groups, linked projects, and documents resolve cleanly.
  • Facets, filters, chips, and URL-hash behavior aligned with documents and cards UI.
  • Sorting stable on groupId, groupOrg, groupType, groupLabel; paging applied after sorting.

Projects Registry & UI

  • Introduced a new Projects registry and UI built parallel to Groups.
  • Cards display core project metadata: projectId, work type, status, primary/affected docs, repo link.
  • Cross-link helpers (getLabel, getTitle, publisherLink, logos) applied consistently across doc/project/group lookups.
  • Facets and filters fully wired; chip + hash behavior matches Groups.
  • Sorting aligned with dropdown text; no mismatched internal sort keys.

Documents, RefTree, and Detail View Integration

  • RefTree pages built in the documents pass; canonical URLs at /reftree/{docId}/.
  • Per-doc pages include normalized MSI, suite, referencesResolved, referencedBy, referenceTree, docSuite, currentWork, and related group/project context.
  • RefTree index renewed with intro content, curated starting points, and consistent site styling.

Build Pipeline Improvements

  • writeFileSafe fixed to prevent accidental clobbering of groups data.
  • Removed unused /documents/_data emission paths.
  • RefTree no longer emits its own documents payload; uses effective snapshot.
  • Project registry integrated via registryProject and docProjsMap, preferring non-Complete projects.

Theming System

  • Theme popover added with Auto/Light/Dark options; choices persist and update without reload.
  • Global dark/light modes corrected for cards, facets, and accordions.
  • Publisher logos now theme-aware: helpers emit data-logo-light and data-logo-dark and JS switches them dynamically.

Navigation & Dev Tools

  • Navbar cleaned; removed legacy DataTables script remnants.
  • Core nav centered; tools and preferences kept right-aligned without shifting layout.
  • Added Dev Tools popover exposing CSV and JSON exports under build/_data/ with short descriptions.
  • Site logo linked simply to /.

Search, OpenSearch & Sitemap

  • opensearch.xml generated declaratively; uses /docs/?q={searchTerms}.
  • New sitemap.xml emitted at build time, containing:
    • /, /groups/, /projects/, /docs/, /reftree/
    • All document detail pages (/docs/{docId}/)
    • Absolute URLs from canonicalBase; includes lastmod, changefreq, and priority.
  • Removed static sitemap.xml to avoid clobbering.

Misc Helpers & Polish

  • Added getUndatedLabel, getUndatedDoiCite, getUndatedHrefCite, getUndatedTitle to support undated citation display.
  • Status pipeline unified via currentStatus; icons rendered through getStatus and getstatusButton.
  • PublisherLink helper now resolves via alias map shared with logos.
  • Hash offsets again respect sticky headers site-wide.
  • Numerous CSS adjustments to card headers, facets, and dark-mode cases.

Deferred / Future Work

  • Dev-facing API page with schema versions and filtered slices.
  • Advanced theming (reduced-motion, density modes).
  • Namespace/schema validation for XML families (planned core infra).

7 Logging, Diffing, and PR Output

This section defines operational observability: log budgeting, heartbeats/tripwires, artifact retention, and PR diff formatting conventions. The goal is to keep CI readable while preserving deep traceability when something goes wrong.

  • logSmart.js centralizes logging with a console budget (~3.5 MiB). Excess console chatter is tripwired while full logs are persisted to file.
  • Heartbeats and tripwires: periodic progress messages ([HB pid:####] ... still processing — X/Y (Z%)) with a start‑of‑run settings banner.
  • Full extract log artifacts (extract-full.log) uploaded for every run, including early exits or skipped PRs.
  • PR log formatting:
    • One‑line diffs for status children, revisionOf, and reference updates.
    • Duplicate‑skip reporting simplified: PR shows only a count; detailed list in workflow logs.
    • Diff linking: PR body includes a __PR_DETAILS_DIFF_LINK__ token replaced with a link to the PR Files tab anchored to the details file blob SHA.
  • PR creation skip: legacy skip-pr-flag.log removed; PR body text check governs skipping.
  • Heartbeat + Tripwire logging: logSmart now emits periodic heartbeats during long extraction runs and tripwire alerts when log volume approaches budget.
    Ensures visibility in CI logs without exceeding console limits.

8 URL Validation & Normalization Suite — Summary (Operational)

This is an at-a-glance operational recap of the URL toolchain: validator/audit outputs, normalization apply/validate modes, rules baselines, and scheduling/hygiene behaviors. Treat it as the quick reference for what runs, when, and what artifacts to expect.

  • URL Validator reports: good URL totals, unreachable and redirect mismatches split by cause. Audit logged to src/main/reports/url_validate_audit.json.
  • URL Normalizer operates in validate‑only by default; writes only in apply mode. Summary emitted to src/main/reports/url_validate_normalize.json.
  • URL Rules provide publisher‑specific checks and expected patterns (informational baseline for future enforcement).
  • Repository hygiene and scheduling configured for weekly runs, manual dispatch, and PR‑merge triggers; runs auto‑cancel when superseded.

9 Net Results / System Readiness

This section is the “so what” snapshot: end-to-end chain reliability, preview determinism, report consistency, metadata-only commit behavior, and verified provenance corrections. It’s the readiness checklist and confidence statement for the overall system.

  • End‑to‑end weekly chain hardened. MSI → MRI → MSR runs reliably and in order.
  • PR previews deploy deterministically and self‑report via PR checks.
  • URL validator, normalizer, and branch sweeper operate on schedules with clean, uniform reports.
  • All major CI workflows are concurrency‑protected and idempotent.
  • Project emits core JSON reports under src/main/reports/ with uniform headers.
  • Clean metadata‑only commit flow (no empty PRs); auto‑closing issues for missing references.
  • Cite‑first resolution logic with refMap overrides; undated references upgraded via lineage when appropriate.
  • Extractor honors $meta locks; normal flows unaffected for unlocked fields.
  • MSI checks extended with SMPTE releaseTag audit.
  • Provenance corrections verified in practice (examples: docLabel normalization 2086→2085, publicationDate normalized to HTML pubDateTime, amendment promotion behaves as specified).

Appendix A: Implementation Notes (selected specifics retained)

This appendix keeps a curated set of low-level implementation details that are useful during debugging or refactors but too granular for the main narrative. It’s intentionally selective and may contain one-off notes that supported larger changes above.

  • MSI lineage logic refined across publishers; draft and versionless handling normalized; ICC errata regex fixed; console logs simplified (Found vs Added vs Skipped) with reduced UNKNOWN noise via early publisher normalization.
  • Safety guard on references: skip MSI probing if docId already exists in documents.json to reduce unnecessary lookups and false gaps.
  • README expanded: automated chain diagram, sample outputs, triggers, and dataset descriptions.

Gold‑Copy Guidance This file is the current gold‑copy consolidation. If corruption or lock‑up occurs in downstream artifacts, restore from this document and only re‑apply changes made after the consolidation date above.

Maintained by Steve LLamb — MSRBot.io

Appendix B: Daily Done List Protocol

This appendix defines the ongoing update format for future work: a structured “Done List” that can be normalized, deduped, and merged into this chronicle without bloating it. Use it to keep the log consistent as the project evolves beyond v1.0.0.

To simplify ongoing updates, each day’s accomplishments can be logged in a structured “Done List” that is parsed and merged into this changelog. Use the format below for clarity and automation compatibility.

Template

# Done List — YYYY‑MM‑DD

## 1 Extraction & Automation Pipeline
- [summary of task or fix; include filename or script if relevant]

## 2 Metadata & Provenance System
- [$meta or status logic updates]

## 3 Validation & URL Resolution
- [url.validate.js or url.normalize.js changes]

## 4 Workflow & CI/CD
- [workflow name or YAML file update]

## 5 Registry Architecture & Data Model
- [structural schema or field changes]

## 6 Frontend & Site Publishing
- [public site or preview deployment details]

## 7 Logging, Diffing, PR Output
- [logging or PR formatting changes]

## 8 Misc / Notes
- [anything cross‑cutting or prep for next session]

Usage Rules

- Keep bullets short, declarative, and in past tense (one per atomic change).
- Include filenames or JSON keys for traceability.
- No dates or emojis inside bullets; they are stripped on import.
- Omit empty sections — they collapse automatically.
- Duplicate items are flagged as Merged/Implicit instead of duplicated.

Processing Workflow

1. Drop the formatted Done List into chat.
2. The assistant parses, normalizes, and compares entries against the current gold‑copy changelog.
3. A Markdown patch block is generated for commit.
4. Each item is classified as ✅ New addition, ⚙️ Merged, or ✏️ Reworded/clarified.
5. Paste the block into the appropriate sections and commit the update.

Optional Short Form

Done List — YYYY‑MM‑DD
- Extraction: added DOI fix for PDF fallback.
- Metadata: tuned $meta.note for status.superseded.
- Workflow: fixed YAML boolean coercion.

Short form lines are auto‑routed to their matching sections.