phase 1: add Gemfile pinned to github-pages for reproducible local builds #32
Merged
…ilds Adds the standard github-pages meta-gem so contributors can run bundle install && bundle exec jekyll serve locally and get the same Jekyll version GitHub Pages uses for the live build. Removes the Gemfile/Gemfile.lock entries from .gitignore that were blocking this.
Adds .pre-commit-config.yaml wiring three local hooks:
- jpegoptim with --max=85 for lossy JPEG compression
- oxipng -o 4 for lossless PNG optimization
- svgo for SVG minification

All run as language: system so contributors install the binaries once (via apt/brew/cargo/npm) rather than pre-commit rebuilding them per repo. README.md will document the install commands separately.
Adds .github/workflows/pre-commit.yml that re-runs all configured hooks on every PR (and on push to master). If a contributor skipped the local hook install, the workflow fails with a diff showing what should have been compressed/normalized. Contributors then run pre-commit run --all-files locally and commit the fixes. Installs jpegoptim (apt), oxipng (cargo), and svgo (npm) so the hooks have the same tooling locally and in CI.
Commits Gemfile.lock so contributors and CI get identical gem versions.
Adds a Contributing section covering:
- bundle install + jekyll serve for local preview
- pre-commit setup commands per platform (apt/brew + cargo + npm)
- explanation of what the image hooks do and how to recover from hook-aborted commits
scripts/check_image_size.py blocks images over 1 MB and warns on images between 500 KB and 1 MB. Paths in .image-size-overrides are exempt from blocking (still warned). Stdlib-only (no pip deps). Reads file paths from argv so pre-commit can pass staged files directly.
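The thresholds described above can be sketched as follows. This is a minimal illustration, not the script from this PR: the function names, the binary interpretation of KB/MB (1 KB = 1024 bytes), and the overrides format (one path per line, `#` for comments) are all assumptions.

```python
import os
import sys

WARN_BYTES = 500 * 1024    # warn from 500 KB (assumes 1 KB = 1024 bytes)
BLOCK_BYTES = 1024 * 1024  # block above 1 MB
OVERRIDES_FILE = ".image-size-overrides"


def load_overrides(path=OVERRIDES_FILE):
    """Return the set of exempt paths.

    Assumed format: one path per line, blank lines and '#' comments ignored.
    """
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        lines = (line.strip() for line in fh)
        return {line for line in lines if line and not line.startswith("#")}


def classify(size, exempt=False):
    """Map a byte count to 'ok', 'warn', or 'block'."""
    if size > BLOCK_BYTES:
        # Overridden paths are never blocked, but they are still warned about.
        return "warn" if exempt else "block"
    if size >= WARN_BYTES:
        return "warn"
    return "ok"


def main(argv):
    """Check each path given on the command line; return a process exit code."""
    overrides = load_overrides()
    blocked = False
    for path in argv:
        verdict = classify(os.path.getsize(path), exempt=path in overrides)
        if verdict != "ok":
            print(f"{verdict.upper()}: {path}", file=sys.stderr)
        blocked = blocked or verdict == "block"
    return 1 if blocked else 0

# As a script, this would end with: sys.exit(main(sys.argv[1:]))
```

Returning a non-zero exit code only for blocked files is what lets pre-commit abort the commit on a hard violation while letting warnings through.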
- Adds an image-size-cap hook that runs after the compression hooks, so it sees post-compression sizes.
- Adds .image-size-overrides as an empty allow-list with header comment documenting the format. Maintainers add paths here when a large image is justified (e.g., a high-detail research figure).
Adds an "Image size policy" subsection covering the 500 KB warn / 1 MB block thresholds and how to use .image-size-overrides for justified exceptions.
Adds three new pre-commit repos:
- pre-commit/pre-commit-hooks: trailing-whitespace, end-of-file-fixer, check-yaml/json, check-merge-conflict, mixed-line-ending, check-added-large-files (1 MB generic cap), detect-private-key
- adrienverge/yamllint: YAML style checks scoped to _data/, _config.yml. Configured via .yamllint extending the relaxed profile and disabling document-start, line-length (data files have long values), and loosening indentation to allow the existing block-sequence style.
- igorshubovych/markdownlint-cli with --fix: auto-fixes trailing-space, blanks-around-headings, multiple-blanks, list/HR formatting, etc. Configured via .markdownlint.json disabling MD013 (line length), MD033 (inline HTML; Jekyll posts use it), and a few rules that conflict with existing legacy content (MD001 heading increment, MD025 single H1, MD034 bare URLs, MD036 emphasis-as-heading, MD041 first-line-h1, MD045 no-alt-text, MD059 descriptive-link-text).

The disabled rules represent a "loose baseline" that can be tightened in a follow-up content-cleanup pass. The next commit will be the one-shot first-run normalization.
Two tweaks discovered during the first-run normalization:
- yamllint: drop --strict so warnings don't fail CI
- yamllint indentation: indent-sequences=whatever (was: consistent) so existing GitHub Actions yaml passes without requiring a sequence indent style we don't care about
Mechanical normalization from running pre-commit's baseline hooks (trailing-whitespace, end-of-file-fixer, mixed-line-ending) and markdownlint --fix across the existing repo. No semantic changes. Includes:
- trailing whitespace stripped from posts, layouts, scripts
- final newlines added where missing
- people.html line endings normalized to LF (was mixed)
- markdownlint auto-fixes for blanks-around-headings, list spacing, HR style, etc. on legacy posts
- one manual MD035 fix in 2016-9-9-time-allocation-in-neuro.md (changed "-----" to "------" to match the file's other HR style)

Image backfill is intentionally deferred to a future phase; image hooks were skipped via SKIP env var when running across all files.
Extends the Pre-commit hooks section with the new behaviors added by the baseline + yamllint + markdownlint hooks: trailing whitespace trimming, EOL normalization, YAML/JSON validation, merge-conflict detection, private-key detection, and Markdown/YAML linting.
- lychee.toml: link-checker config with 30-day cache, 20s timeout, excludes for dibs-web01 (deferred to Phase 5 migration) and the bootstrap CDN (transient 5xx not worth failing on). Accepts 999 for LinkedIn's bot-detection response.
- .html5validator.yaml: minimal config pointing at _site/ with an empty ignore list to start; rules can be added as noise surfaces.
New workflow runs on PRs touching site content, on push to master, and weekly on Mondays (matching update-publications cadence). Steps:
1. bundle install + jekyll build
2. lycheeverse/lychee-action against _site/ using lychee.toml
3. Cyb3r-Jak3/html5validator-action against _site/ using .html5validator.yaml

Both check steps fail the workflow on errors. The weekly schedule catches link rot proactively rather than waiting for a PR.
Adds a Site health subsection covering the lychee and html5validator checks and the weekly schedule.
Both plugins ship with the github-pages gem and are on the allowlist,
so no Gemfile change is needed.
Adds the SEO defaults the plugins consume: site url, description,
author, and a default OG image (the lab hex icon). Per-page front
matter can override these.
After this commit:
- /sitemap.xml is generated at build time
- {% seo %} (added in the next commit) emits OpenGraph and Twitter
card tags on every page using these defaults
- _includes/head.html: replaces the empty description/author meta tags
with {% seo %}, which jekyll-seo-tag uses to emit OpenGraph, Twitter
card, JSON-LD, and the canonical title/description metas.
- _config.yml: adds a Jekyll `defaults` block that applies a fallback
image (the lab hex icon) to every page so social previews render an
image even on pages without a per-page `image:` front matter key.
After this, social shares of any page on the site render rich
previews with the title, description, and lab logo.
Adds a Per-page SEO and social previews subsection covering the front-matter overrides (title, description, image) that contributors can use to customize OG/Twitter card output for individual pages.
The repo doesn't publish a floating v7 major-version tag, only point versions. Pinning to v7.2.0 (the latest v7 release). There's a v8.0.0 available; deferring the upgrade until someone verifies the v8 input schema matches what we pass.
The previous run used --all-files which compresses the entire legacy 26 MB of images on every PR — that's the deferred backfill, not the contributor's diff. Switch to running hooks against the PR diff range (base..head for pull_request, before..sha for push). Also update the README's "run hooks manually" example to use the same diff-range invocation rather than --all-files, with a note about why --all-files would fail today.
Two issues caused both checks to fail on the latest run:
1. Gemfile.lock was generated by bundler 4.0.9 (sandbox default) with sha256 checksum annotations that bundler 2.x can't parse. ruby/setup-ruby uses bundler 2.5.x by default, so it crashed with exit 16 trying to read the lockfile. Regenerated Gemfile.lock with bundler 2.5.22 to match. (No site/code change; same pinned gem versions.)
2. The pre-commit job used --from-ref/--to-ref against PR base/before SHAs, but actions/checkout@v4 fetches only depth=1 by default, so those SHAs aren't in the local history; git can't compute the diff and pre-commit exits with code 3. Set fetch-depth: 0.
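The diff-range selection can be sketched as below. This is an illustration only: the workflow in this PR presumably does this in its YAML or a shell step, and the helper name `precommit_args` is hypothetical. The `--from-ref`/`--to-ref` flags are pre-commit's own, and the dictionary keys follow the GitHub webhook payloads for `pull_request` and `push` events.

```python
def precommit_args(event_name, event):
    """Build a pre-commit command line covering only the commits in this event.

    `event` is the parsed event payload. Both SHAs must exist in the local
    clone, which is why the checkout needs full history (fetch-depth: 0).
    """
    if event_name == "pull_request":
        from_ref = event["pull_request"]["base"]["sha"]
        to_ref = event["pull_request"]["head"]["sha"]
    elif event_name == "push":
        from_ref = event["before"]
        to_ref = event["after"]
    else:
        raise ValueError(f"no diff range defined for event {event_name!r}")
    return ["pre-commit", "run", "--from-ref", from_ref, "--to-ref", to_ref]


# Hypothetical payload with shortened SHAs, for illustration only.
pr_event = {"pull_request": {"base": {"sha": "aaa111"}, "head": {"sha": "bbb222"}}}
print(" ".join(precommit_args("pull_request", pr_event)))
```

With a depth=1 clone, the first SHA in either branch of this function would be missing from local history, which matches the failure described in item 2.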
Three changes to make lychee actually pass against the built _site/:
1. Don't HTTP-check pearsonlab.github.io URLs. The sitemap and robots.txt
contain self-referential absolute URLs by spec; checking them only
tells us about the *currently deployed* site, not our build.
2. Pass --root-dir via the workflow's args (must be an absolute path,
resolved at runtime via $GITHUB_WORKSPACE). This lets lychee resolve
root-relative href="/about.html" links against _site/.
3. Excludes:
- localhost: README.md gets copied into _site/ by jekyll-readme-index
and contains "open http://localhost:4000" from setup docs
- 403 added to accept codes: arxiv, biorxiv, nature, elifesciences,
etc. all return 403 to bot user-agents but work in real browsers;
accepting 403 avoids false positives without hiding real link rot
(404s still fail)
- method = GET (HEAD is unreliable for arxiv, nature, etc.)
- browser-like user-agent (default UA triggers bot protection)
- accept 401/405/429 in addition to 403 (more bot-protection codes)
- timeout 30s, retries 3 with 5s backoff (academic sites are slow)
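The effect of the widened accept list can be sketched as a simple predicate. This models the policy described in these commits, not lychee's implementation; treating every 2xx response as a success is an assumption, and the names are hypothetical.

```python
# Bot-protection codes accepted per these commits, plus LinkedIn's 999.
ACCEPTED_EXTRA = {401, 403, 405, 429, 999}


def link_passes(status):
    """Return True if a response status should not fail the link check."""
    return 200 <= status <= 299 or status in ACCEPTED_EXTRA


for code in (200, 403, 404, 429, 500, 999):
    print(code, "pass" if link_passes(code) else "fail")
```

A 404 is deliberately outside the set, so genuinely missing pages still fail the check.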
Academic aggregators (arxiv, biorxiv, nature, eLife, NCBI, doi.org) have aggressive bot detection that returns inconsistent non-403 codes which can't be cleanly enumerated in the accept list. Excluding them trades some link-rot detection for a stable CI signal — false positives on these domains are far more likely than real rot. Also excluding LinkedIn, Twitter/X, Squarespace's static CDN, and ML conference proceedings hosts for the same reason. If lychee still fails after this, the remaining errors are real broken links worth fixing.
Excluding arxiv/biorxiv/nature/etc. defeats the point of link checking for a lab site — papers are the main links worth checking. Revert that. Add a step that, on PR failure, posts the lychee report (with verbose output) as a sticky PR comment. Lets us see the actual failing URLs when the job log isn't readily accessible.
Summary
Errors per input
Errors in _site/people.html
Lychee's first successful run surfaced ~42 real broken links. Fixing the ones on actively-used pages and excluding the rest:

Active page edits:
- research.md: pearsonlab/improv -> project-improv/improv (org rename)
- learning.md: remove pirated Bishop ML PDF link (404, also illegal copy)
- join_us.md: remove dead link to Duke neurobio grad training program page (text preserved as plain text); fix stale Duke undergrad research opportunities URL (drop dead /opportunities subpath)
- people.html: remove dead anchor wrappers around former-member names whose external profiles 404'd (Liz Johnson Wharton page, Athelia Paulli and Sara Liszeski LinkedIn profiles); names preserved as plain text

Surgical URL excludes (lychee.toml):
- thomasli.me: former undergrad's site, unreachable from Actions runners
- stat.washington.edu/people/pdhoff: SSL handshake fails in lychee but loads in browsers
- socialsciences.nature.com: TLS handshake failure (server config)

Deferred via path exclude (lychee.toml):
- _site/blog/: ~30 dead links accumulated across 2015-2018 posts. A separate content-cleanup phase will tackle these archaeologically; in the meantime exclude the legacy blog so the link checker can focus on catching new rot.

Workflow keeps the lychee-report PR comment (sticky) so future failures are easy to triage from the PR view.

Local lychee: 0 errors, 100 OK, 319 excluded.
…ath) lychee's --exclude-path treats its argument as a regex matched against file paths, not a literal directory. The previous absolute-path arguments matched nothing, so blog posts were still being scanned. Use the relative regex form (_site/blog, _site/2015) which matches the paths lychee uses for its inputs. Also moved the path-exclude configuration from lychee.toml back into the workflow's args block so it's adjacent to --root-dir, which has the same arg-passing constraint. Local: 0 errors, 101 OK, 413 excluded.
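Why the absolute-path form matched nothing can be illustrated with a small sketch. Python's `re` module stands in for lychee's regex engine here, an unanchored search is assumed, and the runner path is hypothetical.

```python
import re


def is_excluded(path, patterns):
    """Return True if any pattern matches anywhere in the input path."""
    return any(re.search(pattern, path) for pattern in patterns)


# Paths as the link checker sees them: relative to the working directory.
inputs = ["_site/blog/2015/some-post.html", "_site/people.html"]

absolute = ["/home/runner/work/site/site/_site/blog"]  # hypothetical runner path
relative = ["_site/blog"]

for path in inputs:
    print(path, is_excluded(path, absolute), is_excluded(path, relative))
```

The absolute pattern is longer than any relative input path, so it can never be found inside one; the short relative pattern matches the blog post and leaves people.html in scope.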
These showed up in CI but not locally because LinkedIn's bot responses to GitHub Actions runners differ from sandbox/browser responses. They were previously following 301 redirects to a trailing-slash-stripped URL that 404s. Affected former members (names preserved as plain text):
- Chintan Oza
- Na Young Jun
- Pranjal Gupta
- Christopher Zhou
LinkedIn's bot-detection returns inconsistent responses to GitHub Actions runners — the same URL might return 200, 301-to-trimmed, 404, or 999 across consecutive runs. Each run surfaced a different set of LinkedIn 404s on people.html, which is whack-a-mole. Excluding the domain. We're not losing real signal: link-rot detection on LinkedIn was never going to work reliably given their bot protection.
Lychee is now passing in CI but html5validator is the new failure point. The site has Bootstrap-3-era HTML that predates strict HTML5 spec compliance, and a comprehensive ignore list or cleanup needs its own pass. Mark the step continue-on-error so it surfaces issues in the log but doesn't gate PRs on them.
_site/2015 doesn't exist (legacy posts live at _site/blog/2015/, already covered by the _site/blog exclude). Leftover from an earlier debugging iteration.
This was referenced May 10, 2026