phase 1: add Gemfile pinned to github-pages for reproducible local builds #32

Merged: jmxpearson merged 32 commits into master from
claude/chat-access-clarification-Z9e9G on May 11, 2026

@jmxpearson

Member

Adds the standard github-pages meta-gem so contributors can run
bundle install && bundle exec jekyll serve locally and get the same
Jekyll version GitHub Pages uses for the live build. Removes the
Gemfile/Gemfile.lock entries from .gitignore that were blocking this.

claude added 26 commits May 10, 2026 02:36
…ilds

Adds the standard github-pages meta-gem so contributors can run
bundle install && bundle exec jekyll serve locally and get the same
Jekyll version GitHub Pages uses for the live build. Removes the
Gemfile/Gemfile.lock entries from .gitignore that were blocking this.

Adds .pre-commit-config.yaml wiring three local hooks:
- jpegoptim with --max=85 for lossy JPEG compression
- oxipng -o 4 for lossless PNG optimization
- svgo for SVG minification

All run as language: system so contributors install the binaries once
(via apt/brew/cargo/npm) rather than pre-commit rebuilding them per repo.
README.md will document the install commands separately.

Adds .github/workflows/pre-commit.yml that re-runs all configured hooks
on every PR (and on push to master). If a contributor skipped the local
hook install, the workflow fails with a diff showing what should have
been compressed/normalized. Contributors then run pre-commit run --all-files
locally and commit the fixes.

Installs jpegoptim (apt), oxipng (cargo), and svgo (npm) so the hooks
have the same tooling locally and in CI.

Commits Gemfile.lock so contributors and CI get identical gem versions.

Adds a Contributing section covering:
- bundle install + jekyll serve for local preview
- pre-commit setup commands per platform (apt/brew + cargo + npm)
- explanation of what the image hooks do and how to recover from
  hook-aborted commits

scripts/check_image_size.py blocks images over 1 MB and warns on images
between 500 KB and 1 MB. Paths in .image-size-overrides are exempt from
blocking (still warned).

Stdlib-only (no pip deps). Reads file paths from argv so pre-commit can
pass staged files directly.
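
To make the policy concrete, here is a minimal Python sketch of the check described above. It is not the actual scripts/check_image_size.py: the function names (classify, load_overrides, main), the binary reading of KB/MB, and the one-path-per-line override format with # comments are all assumptions made for illustration.

```python
import sys
from pathlib import Path

WARN_BYTES = 500 * 1024    # assumption: "500 KB" read as 500 * 1024 bytes
BLOCK_BYTES = 1024 * 1024  # assumption: "1 MB" read as 1024 * 1024 bytes


def load_overrides(path=".image-size-overrides"):
    """Return the set of exempt paths; blank lines and '#' comments are skipped.

    The file format is assumed, not taken from the real repository.
    """
    p = Path(path)
    if not p.exists():
        return set()
    lines = (line.strip() for line in p.read_text().splitlines())
    return {line for line in lines if line and not line.startswith("#")}


def classify(size, exempt=False):
    """Map a byte count to 'ok', 'warn', or 'block'."""
    if size > BLOCK_BYTES:
        # Exempt paths are never blocked, but they still produce a warning.
        return "warn" if exempt else "block"
    if size >= WARN_BYTES:
        return "warn"
    return "ok"


def main(argv):
    """Check every path in argv; return 1 if any file is blocked, else 0."""
    overrides = load_overrides()
    blocked = False
    for name in argv:
        size = Path(name).stat().st_size
        verdict = classify(size, exempt=name in overrides)
        if verdict != "ok":
            print(f"{verdict.upper()}: {name} ({size / 1024:.0f} KB)", file=sys.stderr)
        blocked = blocked or verdict == "block"
    return 1 if blocked else 0
```

A script built this way would end with sys.exit(main(sys.argv[1:])), so that pre-commit sees a non-zero exit code when any staged image is blocked.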
- Adds an image-size-cap hook that runs after the compression hooks, so
  it sees post-compression sizes.
- Adds .image-size-overrides as an empty allow-list with header comment
  documenting the format. Maintainers add paths here when a large image
  is justified (e.g., a high-detail research figure).

Adds an "Image size policy" subsection covering the 500 KB warn / 1 MB
block thresholds and how to use .image-size-overrides for justified
exceptions.

Adds three new pre-commit repos:
- pre-commit/pre-commit-hooks: trailing-whitespace, end-of-file-fixer,
  check-yaml/json, check-merge-conflict, mixed-line-ending,
  check-added-large-files (1 MB generic cap), detect-private-key
- adrienverge/yamllint: YAML style checks scoped to _data/, _config.yml.
  Configured via .yamllint extending the relaxed profile and disabling
  document-start, line-length (data files have long values), and
  loosening indentation to allow the existing block-sequence style.
- igorshubovych/markdownlint-cli with --fix: auto-fixes trailing-space,
  blanks-around-headings, multiple-blanks, list/HR formatting, etc.
  Configured via .markdownlint.json disabling MD013 (line length),
  MD033 (inline HTML — Jekyll posts use it), and a few rules that
  conflict with existing legacy content (MD001 heading increment,
  MD025 single H1, MD034 bare URLs, MD036 emphasis-as-heading,
  MD041 first-line-h1, MD045 no-alt-text, MD059 descriptive-link-text).
  The disabled rules represent a "loose baseline" that can be tightened
  in a follow-up content-cleanup pass.

The next commit will be the one-shot first-run normalization.

Two tweaks discovered during the first-run normalization:
- yamllint: drop --strict so warnings don't fail CI
- yamllint indentation: indent-sequences=whatever (was: consistent)
  so existing GitHub Actions yaml passes without requiring a sequence
  indent style we don't care about

Mechanical normalization from running pre-commit's baseline hooks
(trailing-whitespace, end-of-file-fixer, mixed-line-ending) and
markdownlint --fix across the existing repo. No semantic changes.

Includes:
- trailing whitespace stripped from posts, layouts, scripts
- final newlines added where missing
- people.html line endings normalized to LF (was mixed)
- markdownlint auto-fixes for blanks-around-headings, list spacing,
  HR style, etc. on legacy posts
- one manual MD035 fix in 2016-9-9-time-allocation-in-neuro.md
  (changed "-----" to "------" to match the file's other HR style)

Image backfill is intentionally deferred to a future phase; image
hooks were skipped via SKIP env var when running across all files.

Extends the Pre-commit hooks section with the new behaviors added by
the baseline + yamllint + markdownlint hooks: trailing whitespace
trimming, EOL normalization, YAML/JSON validation, merge-conflict
detection, private-key detection, and Markdown/YAML linting.

- lychee.toml: link-checker config with 30-day cache, 20s timeout,
  excludes for dibs-web01 (deferred to Phase 5 migration) and the
  bootstrap CDN (transient 5xx not worth failing on). Accepts 999
  for LinkedIn's bot-detection response.
- .html5validator.yaml: minimal config pointing at _site/ with an
  empty ignore list to start; rules can be added as noise surfaces.

New workflow runs on PRs touching site content, on push to master,
and weekly on Mondays (matching update-publications cadence). Steps:

1. bundle install + jekyll build
2. lycheeverse/lychee-action against _site/ using lychee.toml
3. Cyb3r-Jak3/html5validator-action against _site/ using
   .html5validator.yaml

Both check steps fail the workflow on errors. The weekly schedule
catches link rot proactively rather than waiting for a PR.

Adds a Site health subsection covering the lychee and html5validator
checks and the weekly schedule.

Both plugins ship with the github-pages gem and are on the allowlist,
so no Gemfile change is needed.

Adds the SEO defaults the plugins consume: site url, description,
author, and a default OG image (the lab hex icon). Per-page front
matter can override these.

After this commit:
- /sitemap.xml is generated at build time
- {% seo %} (added in the next commit) emits OpenGraph and Twitter
  card tags on every page using these defaults

- _includes/head.html: replaces the empty description/author meta tags
  with {% seo %}, which jekyll-seo-tag uses to emit OpenGraph, Twitter
  card, JSON-LD, and the canonical title/description metas.
- _config.yml: adds a Jekyll `defaults` block that applies a fallback
  image (the lab hex icon) to every page so social previews render an
  image even on pages without a per-page `image:` front matter key.

After this, social shares of any page on the site render rich
previews with the title, description, and lab logo.

Adds a Per-page SEO and social previews subsection covering the
front-matter overrides (title, description, image) that contributors
can use to customize OG/Twitter card output for individual pages.

The repo doesn't publish a floating v7 major-version tag, only point
versions. Pinning to v7.2.0 (the latest v7 release).

There's a v8.0.0 available; deferring the upgrade until someone
verifies the v8 input schema matches what we pass.

The previous run used --all-files which compresses the entire legacy
26 MB of images on every PR — that's the deferred backfill, not the
contributor's diff. Switch to running hooks against the PR diff range
(base..head for pull_request, before..sha for push).

Also update the README's "run hooks manually" example to use the
same diff-range invocation rather than --all-files, with a note about
why --all-files would fail today.
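
As an illustration of the diff-range selection, the sketch below builds the pre-commit command from an event payload. The real workflow is YAML and is not shown here; the function name precommit_args is hypothetical, and the payload keys follow GitHub's documented pull_request and push event shapes, with the push payload's "after" standing in for the pushed SHA.

```python
def precommit_args(event_name, event):
    """Build a pre-commit invocation limited to the event's diff range.

    event_name is the triggering event ("pull_request" or "push");
    event is the decoded JSON payload for that event.
    """
    if event_name == "pull_request":
        # base..head: only what the PR changes relative to its target branch.
        from_ref = event["pull_request"]["base"]["sha"]
        to_ref = event["pull_request"]["head"]["sha"]
    elif event_name == "push":
        # before..after: only the commits introduced by this push.
        from_ref = event["before"]
        to_ref = event["after"]
    else:
        raise ValueError(f"unsupported event: {event_name}")
    return ["pre-commit", "run", "--from-ref", from_ref, "--to-ref", to_ref]
```

With a range like this, hooks only see files touched by the contributor's change, so the legacy images are left alone until the backfill phase.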
Two issues caused both checks to fail on the latest run:

1. Gemfile.lock was generated by bundler 4.0.9 (sandbox default) with
   sha256 checksum annotations that bundler 2.x can't parse. ruby/setup-ruby
   uses bundler 2.5.x by default, so it crashed with exit 16 trying to
   read the lockfile. Regenerated Gemfile.lock with bundler 2.5.22 to
   match. (No site/code change — same pinned gem versions.)

2. The pre-commit job used --from-ref/--to-ref against PR base/before
   SHAs, but actions/checkout@v4 fetches only depth=1 by default — those
   SHAs aren't in the local history, so git can't compute the diff and
   pre-commit exits with code 3. Set fetch-depth: 0.
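
For the first issue, a quick way to spot this kind of mismatch is to read the lockfile's trailer. The Python sketch below is only an illustration: lockfile_info is a hypothetical helper, and SAMPLE is a made-up lockfile fragment (the gem name and checksum are placeholders), not this repository's Gemfile.lock.

```python
# Made-up lockfile fragment for illustration only.
SAMPLE = """\
GEM
  remote: https://rubygems.org/
  specs:
    example-gem (1.0.0)

CHECKSUMS
  example-gem (1.0.0) sha256=0000

BUNDLED WITH
   4.0.9
"""


def lockfile_info(text):
    """Return (bundler_version, has_checksums) for a Gemfile.lock's text.

    bundler_version is the line following "BUNDLED WITH", or None if absent.
    has_checksums is True when a "CHECKSUMS" section header is present.
    """
    lines = text.splitlines()
    bundled_with = None
    for i, line in enumerate(lines):
        if line.strip() == "BUNDLED WITH" and i + 1 < len(lines):
            bundled_with = lines[i + 1].strip()
    has_checksums = any(line.strip() == "CHECKSUMS" for line in lines)
    return bundled_with, has_checksums


print(lockfile_info(SAMPLE))
```

A lockfile that reports a bundler major version newer than the one CI installs, or that carries a CHECKSUMS section, is a candidate for regeneration with the CI's bundler version.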
Three changes to make lychee actually pass against the built _site/:

1. Don't HTTP-check pearsonlab.github.io URLs. The sitemap and robots.txt
   contain self-referential absolute URLs by spec; checking them only
   tells us about the *currently deployed* site, not our build.

2. Pass --root-dir via the workflow's args (must be an absolute path,
   resolved at runtime via $GITHUB_WORKSPACE). This lets lychee resolve
   root-relative href="/about.html" links against _site/.

3. Excludes:
   - localhost: README.md gets copied into _site/ by jekyll-readme-index
     and contains "open http://localhost:4000" from setup docs
   - 403 added to accept codes: arxiv, biorxiv, nature, elifesciences,
     etc. all return 403 to bot user-agents but work in real browsers;
     accepting 403 avoids false positives without hiding real link rot
     (404s still fail)

- method = GET (HEAD is unreliable for arxiv, nature, etc.)
- browser-like user-agent (default UA triggers bot protection)
- accept 401/405/429 in addition to 403 (more bot-protection codes)
- timeout 30s, retries 3 with 5s backoff (academic sites are slow)
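
The effect of the widened accept list can be pictured with a small Python sketch. This is not how lychee is implemented; link_ok and ACCEPTED_EXTRA are hypothetical names, and treating every 2xx response as success is an assumption about the baseline behaviour.

```python
# Codes accepted on top of ordinary success responses, per the commits above:
# 403 plus 401/405/429 for bot protection, and 999 for LinkedIn.
ACCEPTED_EXTRA = {401, 403, 405, 429, 999}


def link_ok(status, extra=ACCEPTED_EXTRA):
    """Return True if a response status should not fail the link check."""
    return 200 <= status < 300 or status in extra


for code in (200, 403, 404, 999):
    print(code, "ok" if link_ok(code) else "error")
```

A 404 is deliberately absent from the extra set, so real link rot still fails the check.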
Academic aggregators (arxiv, biorxiv, nature, eLife, NCBI, doi.org)
have aggressive bot detection that returns inconsistent non-403 codes
which can't be cleanly enumerated in the accept list. Excluding them
trades some link-rot detection for a stable CI signal — false
positives on these domains are far more likely than real rot.

Also excluding LinkedIn, Twitter/X, Squarespace's static CDN, and
ML conference proceedings hosts for the same reason.

If lychee still fails after this, the remaining errors are real
broken links worth fixing.

Excluding arxiv/biorxiv/nature/etc. defeats the point of link checking
for a lab site — papers are the main links worth checking. Revert that.

Add a step that, on PR failure, posts the lychee report (with verbose
output) as a sticky PR comment. Lets us see the actual failing URLs
when the job log isn't readily accessible.

@github-actions

github-actions Bot commented May 10, 2026

Contributor

Summary

| Status | Count |
|--------|-------|
| 🔍 Total | 370 |
| ✅ Successful | 62 |
| ⏳ Timeouts | 0 |
| 🔀 Redirected | 27 |
| 👻 Excluded | 275 |
| ❓ Unknown | 0 |
| 🚫 Errors | 6 |
| ⛔ Unsupported | 0 |

Errors per input

Errors in _site/people.html

claude added 3 commits May 10, 2026 11:36
Lychee's first successful run surfaced ~42 real broken links. Fixing
the ones on actively-used pages and excluding the rest:

Active page edits:
- research.md: pearsonlab/improv -> project-improv/improv (org rename)
- learning.md: remove pirated Bishop ML PDF link (404, also illegal copy)
- join_us.md: remove dead link to Duke neurobio grad training program
  page (text preserved as plain text); fix stale Duke undergrad
  research opportunities URL (drop dead /opportunities subpath)
- people.html: remove dead anchor wrappers around former-member names
  whose external profiles 404'd (Liz Johnson Wharton page, Athelia Paulli
  and Sara Liszeski LinkedIn profiles); names preserved as plain text

Surgical URL excludes (lychee.toml):
- thomasli.me: former undergrad's site, unreachable from Actions runners
- stat.washington.edu/people/pdhoff: SSL handshake fails in lychee but
  loads in browsers
- socialsciences.nature.com: TLS handshake failure (server config)

Deferred via path exclude (lychee.toml):
- _site/blog/: ~30 dead links accumulated across 2015-2018 posts. A
  separate content-cleanup phase will tackle these archaeologically;
  in the meantime exclude the legacy blog so the link checker can
  focus on catching new rot.

Workflow keeps the lychee-report PR comment (sticky) so future failures
are easy to triage from the PR view.

Local lychee: 0 errors, 100 OK, 319 excluded.

…ath)

lychee's --exclude-path treats its argument as a regex matched against
file paths, not a literal directory. The previous absolute-path
arguments matched nothing, so blog posts were still being scanned.

Use the relative regex form (_site/blog, _site/2015) which matches
the paths lychee uses for its inputs. Also moved the path-exclude
configuration from lychee.toml back into the workflow's args block
so it's adjacent to --root-dir, which has the same arg-passing
constraint.

Local: 0 errors, 101 OK, 413 excluded.
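
The mismatch is easy to reproduce with an unanchored regex search. In the Python sketch below, Python's re module stands in for lychee's regex engine, excluded is a hypothetical helper, and the absolute runner path is invented for illustration.

```python
import re


def excluded(path, patterns):
    """Return True if any pattern matches somewhere in the input path."""
    return any(re.search(p, path) for p in patterns)


# Input paths in the relative form the commit says lychee uses.
inputs = ["_site/blog/2015/post.html", "_site/people.html"]

# Hypothetical absolute path: it never occurs inside a relative input path.
absolute = ["/home/runner/work/site/site/_site/blog"]
# Relative regex form: it matches anything under _site/blog.
relative = ["_site/blog"]

for path in inputs:
    print(path, excluded(path, absolute), excluded(path, relative))
```

With the absolute pattern nothing is excluded; with the relative pattern only the blog path is, which matches the behaviour described above.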
These showed up in CI but not locally because LinkedIn's bot
responses to GitHub Actions runners differ from sandbox/browser
responses. They were previously following 301 redirects to a
trailing-slash-stripped URL that 404s.

Affected former members (names preserved as plain text):
- Chintan Oza
- Na Young Jun
- Pranjal Gupta
- Christopher Zhou

claude added 3 commits May 10, 2026 11:41
LinkedIn's bot-detection returns inconsistent responses to GitHub
Actions runners — the same URL might return 200, 301-to-trimmed,
404, or 999 across consecutive runs. Each run surfaced a different
set of LinkedIn 404s on people.html, which is whack-a-mole.

Excluding the domain. We're not losing real signal: link-rot detection
on LinkedIn was never going to work reliably given their bot
protection.

Lychee is now passing in CI, but html5validator is the new failure
point. The site has Bootstrap-3-era HTML that predates strict HTML5
spec compliance, and building a comprehensive ignore list or cleaning
up the markup needs its own pass. Mark the step continue-on-error so
it surfaces issues in the log but doesn't gate PRs on them.

_site/2015 doesn't exist (legacy posts live at _site/blog/2015/, already
covered by the _site/blog exclude). Leftover from an earlier debugging
iteration.