This is the latest version of my homepage's source code. Feel free to use and share.
For more details, please refer to this repository: https://github.com/yaoyao-liu/minimal-light.
You need to install Ruby and Jekyll first.
Install and run:
bundle install
bundle exec jekyll serve --livereload
View the live page using localhost: http://localhost:4000. You can get the HTML files in the _site folder.
The instructions for the Google Scholar crawler can be found in this repository.
Before using that, you need to change the Google Scholar ID in the following file:
https://github.com/yaoyao-liu/yaoyao-liu.github.io/blob/7d16d828a229580815428782fb74d937710eb50e/google_scholar_crawler/main.py#L7
This project includes an automated updater for the Embedded AI page.
If you only want to run the updater manually in the current terminal session, you can use the following steps.
First, go to the updater folder and install the required dependencies:
cd scripts/embedded-ai
npm install
Then configure the API key in the current terminal session only. For example, in PowerShell:
$env:GEMINI_API_KEY="your_gemini_key"
$env:OPENROUTER_API_KEY="your_openrouter_key"
$env:TENCENT_TOKENHUB_API_KEY="your_tencent_tokenhub_key"
$env:GEMINI_MODEL="gemini-2.5-flash"
$env:OPENROUTER_MODEL="openrouter/free"
$env:TENCENT_TOKENHUB_MODEL="hunyuan-2.0-instruct-20251111"
Long-Term Usage:
setx GEMINI_API_KEY "your_gemini_key"
setx OPENROUTER_API_KEY "your_openrouter_key"
setx TENCENT_TOKENHUB_API_KEY "your_tencent_tokenhub_key"
setx GEMINI_MODEL "gemini-2.5-flash"
setx OPENROUTER_MODEL "openrouter/free"
setx TENCENT_TOKENHUB_MODEL "hunyuan-2.0-instruct-20251111"
The daily GitHub Action uses arXiv as the search source and enables --skip-arxiv-in-a and --refilter-all by default.
This means:
- arXiv is still used for discovery.
- Papers that match raw group A cannot pass filtering by arXiv fallback alone.
- A-related papers must match a real TH_CPL / THP_CPL venue.
- B/C papers may still use arXiv fallback when no real venue is available, unless future rules change.
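The rules above can be read as a small decision function. This is a minimal sketch, not the updater's actual code: the `group` field and the `passesVenueFilter` name are hypothetical, the `matched_th_cpl_level` field is borrowed from the jq queries in this README, and the sketch assumes that a non-empty `matched_th_cpl_level` means a real TH_CPL / THP_CPL venue was matched and that group A may fall back to arXiv when the flag is off.

```javascript
// Hypothetical sketch of the venue rule behind --skip-arxiv-in-a.
// `paper.group` ("A" | "B" | "C") is an assumed field name for illustration.
function passesVenueFilter(paper, { skipArxivInA = true } = {}) {
  // Assumption: a non-empty matched_th_cpl_level marks a real venue match.
  const hasRealVenue = (paper.matched_th_cpl_level ?? "") !== "";
  if (hasRealVenue) return true;

  // Group A cannot pass on arXiv fallback alone while the flag is enabled.
  if (paper.group === "A" && skipArxivInA) return false;

  // B/C papers (and, by assumption, A with the flag off) may use arXiv fallback.
  return true;
}
```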
Without PDF download:
node scripts/embedded-ai/update-papers.mjs --source arxiv --force-full-search --year-low 2025 --skip-arxiv-in-a --refilter-all --skip-download --clear-cache 2>&1 | tee run_2025_full.log
With PDF download:
node scripts/embedded-ai/update-papers.mjs --source arxiv --force-full-search --year-low 2025 --skip-arxiv-in-a --refilter-all --no-skip-download --clear-cache 2>&1 | tee run_2025_full.log

node scripts/embedded-ai/update-papers.mjs --source arxiv --skip-arxiv-in-a --refilter-all --skip-download

- _data/embedded_ai_papers.json
- artifacts/*.bib
- google_scholar_crawler/cache/normalized_papers.json
- google_scholar_crawler/state/last_search_state.json
- google_scholar_crawler/state/classification_checkpoint.json
- google_scholar_crawler/state/download_state.json (if download step is used)
- google_scholar_crawler/state/download_quota.json (if download step is used)

- google_scholar_crawler/cache/arxiv/*.json
- google_scholar_crawler/state/source_stats.json
- google_scholar_crawler/state/*.tmp
- run_*.log
- *.log
Do not ignore or delete the entire google_scholar_crawler/cache or google_scholar_crawler/state directories.
Some files inside them are required for daily incremental updates:
- normalized_papers.json keeps the canonical historical paper set.
- last_search_state.json stores arXiv incremental watermarks.
- classification_checkpoint.json caches classification results.
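Before cleaning up the cache or state directories, it may help to confirm that these three files are still present. The following is a minimal sketch under stated assumptions: the paths are the ones listed in this README, and `missingRequiredFiles` is a hypothetical helper that takes the list of existing paths as input instead of touching the file system.

```javascript
// Files this README names as required for daily incremental updates.
const REQUIRED_STATE_FILES = [
  "google_scholar_crawler/cache/normalized_papers.json",
  "google_scholar_crawler/state/last_search_state.json",
  "google_scholar_crawler/state/classification_checkpoint.json",
];

// Hypothetical helper: returns the required files absent from `existingPaths`.
function missingRequiredFiles(existingPaths) {
  const present = new Set(existingPaths);
  return REQUIRED_STATE_FILES.filter((path) => !present.has(path));
}
```

An empty result means all three files are in place. In a real script, the input list could come from a directory listing.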
Filter bucket distribution:
jq '
[
.categories[].papers[].filter_bucket
]
| group_by(.)
| map({bucket: .[0], count: length})
' _data/embedded_ai_papers.json
TH_CPL matched count:
jq '
[
.categories[].papers[]
| select((.matched_th_cpl_level // "") != "")
]
| length
' _data/embedded_ai_papers.json
Matched venue distribution:
jq '
[
.categories[].papers[]
| select((.matched_venue // "") != "")
| .matched_venue
]
| group_by(.)
| map({venue: .[0], count: length})
| sort_by(-.count)
' _data/embedded_ai_papers.json
Check source_stats total consistency with final output:
jq '.stats.after_filter, .stats.classification.total, .stats.source_stats.total_summary.total' _data/embedded_ai_papers.json
When using --force-full-search, the pipeline enforces all-or-nothing semantics:
- Complete Success: All three groups (A, B, C) fetch their complete result sets from arXiv. The watermark is updated, and output files are written.
- Partial Failure: If any group fails mid-pagination (e.g., arXiv API returns 503 errors after fetching 300 of 6469 papers), the pipeline aborts entirely:
  - No watermark is updated
  - No final JSON output is written
  - No BibTeX artifacts are generated
  - Exit code is non-zero
  - Error details are logged including the failed group, page where failure occurred, and error message
This prevents data corruption from partial results being silently accepted as complete.
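The all-or-nothing rule can be illustrated with a small decision function. This is a sketch of the idea only, not the pipeline's implementation: `planFullSearchCommit` and the `{ group, complete, papers, error }` result shape are hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: decide whether a full search may be committed.
// Each entry is assumed to look like { group, complete, papers, error }.
function planFullSearchCommit(groupResults) {
  const failed = groupResults.filter((result) => !result.complete);

  if (failed.length > 0) {
    // Any incomplete group aborts the run: no papers are handed on,
    // so no watermark, JSON output, or BibTeX can be derived from them.
    return {
      commit: false,
      exitCode: 1,
      failedGroups: failed.map((result) => result.group),
    };
  }

  // Only when every group completed are the results merged for writing.
  return {
    commit: true,
    exitCode: 0,
    papers: groupResults.flatMap((result) => result.papers),
  };
}
```

Because the failure branch never returns a `papers` field, a caller cannot write partial results by accident.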
Troubleshooting Partial Search Failures:
Detect if the pipeline failed due to incomplete search:
# Check last run log for CRITICAL arXiv messages
grep "CRITICAL: Incomplete full search" run_2025_full.log
# Check per-group completion status in logs
grep "\[arxiv-bridge\].*failed" run_2025_full.log
# Verify watermark was not updated after failed run
cat google_scholar_crawler/state/last_search_state.json
If a full search fails, you have two options:
- Retry the full search later when the arXiv API is stable
- Split by year range to reduce pages per request:
node scripts/embedded-ai/update-papers.mjs --source arxiv --force-full-search --year-low 2025 --year-high 2025 --skip-download --clear-cache
node scripts/embedded-ai/update-papers.mjs --source arxiv --force-full-search --year-low 2024 --year-high 2024 --skip-download
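For longer year ranges, the per-year commands could be generated instead of typed by hand. The sketch below is a hypothetical helper (`buildYearlyRuns` is not part of the project) that uses only flags shown in this README and mirrors the two commands above, where only the first run passes --clear-cache.

```javascript
// Hypothetical helper: one argument list per year, newest year first.
function buildYearlyRuns(yearLow, yearHigh) {
  const runs = [];
  for (let year = yearHigh; year >= yearLow; year--) {
    const args = [
      "scripts/embedded-ai/update-papers.mjs",
      "--source", "arxiv",
      "--force-full-search",
      "--year-low", String(year),
      "--year-high", String(year),
      "--skip-download",
    ];
    // As in the commands above, clear the cache only on the first run.
    if (runs.length === 0) args.push("--clear-cache");
    runs.push(args);
  }
  return runs;
}
```

Each returned array can be passed as the argument list of a `node` child process, one year at a time.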
- Workflow file: .github/workflows/update-papers.yml
- Runs daily on schedule and supports manual dispatch inputs:
  - year_low
  - year_high
  - total_limit
  - skip_download (default true)
This project uses the source code from the following repositories: