Stop scraping RLS website by Copilot · Pull Request #73 · yanirs/rls-data

Copilot · 2026-03-06T22:24:43Z

Original prompt:

Update the data pipeline to stop depending on scraped species-page output for API JSON generation.

Scope:

Processor changes

In rls/processor.py, update create_api_jsons so it relies on species.json instead of crawl output.

Replace the crawl_json_path parameter with species_json_path.

Remove logic that reads crawler output and validates minimum crawl item counts.

Update _create_species_file (and any helper logic) so species metadata is sourced from parsed species.json data.

Keep output schema for api-species.json unchanged ([species_name, common_name, url, data_type_code, image_urls]).

Image behavior: drop dependence on scraped local image artifacts/symlinks. Use image URLs from species.json where available, otherwise empty list.

Preserve fallback behavior for missing species metadata (empty common name, null/None URL, empty image list).
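A minimal sketch of the row-building and fallback behavior described above. The PR text does not show the layout of species.json, so this sketch assumes each record is a dict with hypothetical `common_name`, `url`, and `images` keys, where each image is a dict that may carry a `large_url` (the field named in a later commit on this PR). The helper name `build_species_row` is also hypothetical and is not taken from rls/processor.py.

```python
from typing import Any


def build_species_row(
    species_name: str,
    data_type_code: int,
    species_by_name: dict[str, dict[str, Any]],
) -> list[Any]:
    """Build one api-species.json row, applying fallbacks for missing metadata.

    Row layout follows the schema quoted in the prompt:
    [species_name, common_name, url, data_type_code, image_urls].
    """
    # Missing species fall back to an empty record (assumed lookup keyed by name).
    meta = species_by_name.get(species_name, {})
    # Keep only images that actually provide a large_url (assumed field name).
    image_urls = [
        image["large_url"]
        for image in meta.get("images", [])
        if image.get("large_url")
    ]
    return [
        species_name,
        meta.get("common_name", ""),  # fallback: empty common name
        meta.get("url"),  # fallback: None, serialized as JSON null
        data_type_code,
        image_urls,  # fallback: empty list
    ]
```

With this shape, a species absent from the lookup still yields a complete row, so the output schema stays stable whether or not metadata is available.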

GitHub Actions workflow changes

In .github/workflows/update-data.yml, remove the "Crawl RLS species data" step.

Keep the existing Download species.json step.

Update the Create text output files command so rls-data create-api-jsons passes output/species.json as the first argument instead of data/rls-site-crawl.json.

Tests/validation expectations

Do not modify scraper functionality unless required by failing tests; scraper output will simply no longer be used by the processing pipeline.

Add or update processor-focused tests only as needed for changed signatures/behavior.

Keep changes minimal and focused.
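A processor-focused test for the new input could look roughly like the sketch below. It is not the test added in tests/test_processor.py. The helper `load_species_lookup` and the `scientific_name` key are assumptions made for illustration, since the PR text does not show the structure of species.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_species_lookup(species_json_path: Path) -> dict[str, dict[str, Any]]:
    """Index parsed species.json records by scientific name (assumed key)."""
    with species_json_path.open(encoding="utf-8") as fp:
        records = json.load(fp)
    return {record["scientific_name"]: record for record in records}


def test_load_species_lookup() -> None:
    # Placeholder record, not real RLS data.
    records = [{"scientific_name": "Genus species", "common_name": "Example fish"}]
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "species.json"
        path.write_text(json.dumps(records), encoding="utf-8")
        lookup = load_species_lookup(path)
    assert lookup["Genus species"]["common_name"] == "Example fish"
    # Species without metadata are simply absent, so callers apply fallbacks.
    assert "Missing species" not in lookup
```

Writing the fixture to a temporary file keeps the test independent of crawl output and of the downloaded species.json.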

Acceptance criteria:

Workflow no longer runs Scrapy crawl step.

create_api_jsons works with species.json input and produces expected JSON outputs.

api-species.json still includes common name, url, and image_urls per species, derived from species.json/fallbacks.

Existing scraper module remains in repo but is not required by the workflow for API JSON generation.

This pull request was created from Copilot chat.


Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>

rls/processor.py

tests/test_processor.py

yanirs · 2026-03-06T22:45:09Z

@copilot Please fix my comments on this PR.

Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>

rls/processor.py

Initial plan

1a646f7

Copilot AI assigned Copilot and yanirs Mar 6, 2026

Copilot started work on behalf of yanirs March 6, 2026 22:24

Update data pipeline to use species.json instead of crawl output

33675dd

Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Update the data pipeline for API JSON generation~~ Remove Scrapy crawl dependency from API JSON generation pipeline Mar 6, 2026

Copilot finished work on behalf of yanirs March 6, 2026 22:31

yanirs reviewed Mar 6, 2026

rls/processor.py (outdated)

yanirs reviewed Mar 6, 2026

tests/test_processor.py (outdated)

yanirs reviewed Mar 6, 2026

tests/test_processor.py (outdated)

Copilot started work on behalf of yanirs March 6, 2026 22:45

Use large_url for image URLs; simplify test assertions

2a82763

Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>

Copilot finished work on behalf of yanirs March 6, 2026 22:47

Copilot started work on behalf of yanirs March 6, 2026 22:49

Fix static checker issues: ruff format, ruff PT001, E501, mypy types

471f015

Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>

Copilot AI changed the title ~~Remove Scrapy crawl dependency from API JSON generation pipeline~~ Fix static checker issues in processor and test files Mar 6, 2026

Copilot finished work on behalf of yanirs March 6, 2026 23:00

Copilot started work on behalf of yanirs March 6, 2026 23:03

Copilot AI and others added 2 commits March 6, 2026 23:08

Revert unrelated formatting changes from previous commit

5f262f0

Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>

Fix static checker issues with pinned ruff 0.1.9 and mypy 1.4.1

2e53327

Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>

Copilot AI changed the title ~~Fix static checker issues in processor and test files~~ Fix static checker failures with pinned ruff 0.1.9 and mypy 1.4.1 Mar 6, 2026

Copilot finished work on behalf of yanirs March 6, 2026 23:11

yanirs changed the title ~~Fix static checker failures with pinned ruff 0.1.9 and mypy 1.4.1~~ Stop scraping RLS website Mar 6, 2026

yanirs reviewed Mar 6, 2026

rls/processor.py (outdated)

GitHub Actions Bot and others added 3 commits March 6, 2026 23:19

Update API files

d780189

Merge branch 'master' into copilot/update-data-pipeline-dependencies

1823eea

Update rls/processor.py

f587f6d

yanirs marked this pull request as ready for review March 6, 2026 23:40

yanirs merged commit d6f6ea5 into master Mar 6, 2026
2 checks passed

yanirs deleted the copilot/update-data-pipeline-dependencies branch March 6, 2026 23:41

yanirs merged 9 commits into master from copilot/update-data-pipeline-dependencies
