
Stop scraping RLS website #73

Merged
yanirs merged 9 commits into master from copilot/update-data-pipeline-dependencies
Mar 6, 2026

Conversation

Contributor

Copilot AI commented Mar 6, 2026

Original prompt:

Update the data pipeline to stop depending on scraped species-page output for API JSON generation.

Scope:

  1. Processor changes
  • In rls/processor.py, update create_api_jsons so it relies on species.json instead of crawl output.
  • Replace the crawl_json_path parameter with species_json_path.
  • Remove logic that reads crawler output and validates minimum crawl item counts.
  • Update _create_species_file (and any helper logic) so species metadata is sourced from parsed species.json data.
  • Keep output schema for api-species.json unchanged ([species_name, common_name, url, data_type_code, image_urls]).
  • Image behavior: drop dependence on scraped local image artifacts/symlinks. Use image URLs from species.json where available, otherwise empty list.
  • Preserve fallback behavior for missing species metadata (empty common name, null/None URL, empty image list).
  2. GitHub Actions workflow changes
  • In .github/workflows/update-data.yml, remove the "Crawl RLS species data" step.
  • Keep the existing Download species.json step.
  • Update the Create text output files command so rls-data create-api-jsons passes output/species.json as the first argument instead of data/rls-site-crawl.json.
  3. Tests/validation expectations
  • Do not modify scraper functionality unless required by failing tests; scraper output will simply no longer be used by the processing pipeline.
  • Add or update processor-focused tests only as needed for changed signatures/behavior.
  • Keep changes minimal and focused.
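
The row schema and fallback behavior described under "Processor changes" can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the helper name build_species_row and the metadata keys "common_name", "url", and "image_urls" are assumptions, since the actual layout of species.json is not shown in this conversation.

```python
from __future__ import annotations

from typing import Any


def build_species_row(
    species_name: str,
    data_type_code: Any,
    species_metadata: dict[str, dict[str, Any]],
) -> list[Any]:
    """Build one row: [species_name, common_name, url, data_type_code, image_urls].

    species_metadata is assumed (hypothetically) to map a species name to a
    dict parsed from species.json. Missing species or missing keys fall back
    to an empty common name, a None URL, and an empty image list.
    """
    meta = species_metadata.get(species_name, {})
    return [
        species_name,
        meta.get("common_name", ""),  # fallback: empty common name
        meta.get("url"),  # fallback: None (null in JSON)
        data_type_code,
        list(meta.get("image_urls", [])),  # fallback: empty image list
    ]
```

With this shape, a species that is absent from the parsed metadata still yields a complete five-element row, so the api-species.json schema stays unchanged.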

Acceptance criteria:

  • Workflow no longer runs Scrapy crawl step.
  • create_api_jsons works with species.json input and produces expected JSON outputs.
  • api-species.json still includes common name, url, and image_urls per species, derived from species.json/fallbacks.
  • Existing scraper module remains in repo but is not required by the workflow for API JSON generation.
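
Two of the acceptance criteria above lend themselves to simple automated checks. The sketch below is illustrative only: both function names are hypothetical, the workflow check is a plain substring test rather than a YAML parse, and the type of data_type_code is left unchecked because this conversation does not state it.

```python
from typing import Any, List

# Both strings are quoted from the prompt above.
CRAWL_STEP_NAME = "Crawl RLS species data"
CRAWL_OUTPUT_PATH = "data/rls-site-crawl.json"


def workflow_is_crawl_free(workflow_text: str) -> bool:
    """Return True if the workflow text mentions neither the crawl step nor its output."""
    return (
        CRAWL_STEP_NAME not in workflow_text
        and CRAWL_OUTPUT_PATH not in workflow_text
    )


def is_valid_species_row(row: List[Any]) -> bool:
    """Check the [species_name, common_name, url, data_type_code, image_urls] shape."""
    if len(row) != 5:
        return False
    species_name, common_name, url, _data_type_code, image_urls = row
    return (
        isinstance(species_name, str)
        and isinstance(common_name, str)
        and (url is None or isinstance(url, str))
        and isinstance(image_urls, list)
        and all(isinstance(item, str) for item in image_urls)
    )
```

For example, is_valid_species_row(["Some species", "", None, 0, []]) accepts the pure-fallback row, while a four-element row is rejected.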

This pull request was created from Copilot chat.



Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update the data pipeline for API JSON generation" to "Remove Scrapy crawl dependency from API JSON generation pipeline" Mar 6, 2026
Owner

yanirs commented Mar 6, 2026

@copilot Please fix my comments on this PR.

Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>
Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>
Copilot AI changed the title from "Remove Scrapy crawl dependency from API JSON generation pipeline" to "Fix static checker issues in processor and test files" Mar 6, 2026
Copilot AI and others added 2 commits March 6, 2026 23:08
Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>
Co-authored-by: yanirs <3952615+yanirs@users.noreply.github.com>
Copilot AI changed the title from "Fix static checker issues in processor and test files" to "Fix static checker failures with pinned ruff 0.1.9 and mypy 1.4.1" Mar 6, 2026
yanirs changed the title from "Fix static checker failures with pinned ruff 0.1.9 and mypy 1.4.1" to "Stop scraping RLS website" Mar 6, 2026
yanirs marked this pull request as ready for review March 6, 2026 23:40
yanirs merged commit d6f6ea5 into master Mar 6, 2026
2 checks passed
yanirs deleted the copilot/update-data-pipeline-dependencies branch March 6, 2026 23:41
2 participants