Add phased testing and CI/CD implementation roadmap by Copilot · Pull Request #19 · UIUCLibrary/arcflow

Copilot · 2026-02-23T19:55:51Z

Documents a 6-phase plan for automated testing and CI/CD infrastructure. Phases ordered by ROI to enable incremental adoption: foundation → core logic → integration → E2E → advanced tooling → production polish.

Documentation Added

TESTING_PLAN.md - Comprehensive implementation roadmap with:

Phase 1 (1-2d, High ROI): pytest + GitHub Actions + linting + coverage reporting → CI on every PR
Phase 2 (2-3d, High ROI): Unit tests for agent filtering, XML manipulation, config handling → 80%+ coverage of critical logic
Phase 3 (3-4d, Med ROI): Mock Solr, workflow integration tests, subprocess mocking → multi-component validation
Phase 4 (4-5d, Med ROI): Docker Compose (ArchivesSpace + Solr) + E2E tests → agent self-testing environment
Phase 5 (2-3d, Low ROI): pre-commit hooks, type checking, security scanning, Dependabot → proactive quality gates
Phase 6 (3-4d, Low ROI): mutation testing, performance regression, load testing → production hardening

Each phase includes specific tasks, deliverables, and success metrics.
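
As a rough illustration of the Phase 2 unit tests for agent filtering, the pytest-style sketch below checks a small predicate. The function name `is_creator_agent` and the record fields it inspects (`is_user`, `system_generated`, `linked_agent_roles`) are assumptions made for this example and may not match what arcflow/main.py actually checks.

```python
def is_creator_agent(agent: dict) -> bool:
    """Return True if the agent should be indexed as a creator (hypothetical rule)."""
    # Assumed fields: system users and system-generated agents are skipped.
    if agent.get("is_user") or agent.get("system_generated"):
        return False
    roles = set(agent.get("linked_agent_roles", []))
    # Assumed rule: an agent linked only as a donor ("source") is not a creator.
    return "creator" in roles


def test_system_user_is_excluded():
    agent = {"is_user": "/users/1", "linked_agent_roles": ["creator"]}
    assert not is_creator_agent(agent)


def test_donor_only_agent_is_excluded():
    assert not is_creator_agent({"linked_agent_roles": ["source"]})


def test_creator_is_included():
    assert is_creator_agent({"linked_agent_roles": ["creator", "subject"]})
```

Tests of this shape need no ArchivesSpace or Solr instance, which is why they fit the early, high-ROI phases.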

Recommended Path

Start with Phases 1-2 for immediate CI value, jump to Phase 4.1 (Docker Compose) to enable agent testing, then backfill Phase 3 for integration coverage.

Context

This addresses the request to enable Copilot agents to test PRs against infrastructure without manual intervention on live systems. PR #19 is the referenced conversation.


Implement data ETL for standalone creator records from ArchivesSpace agents and automatically index them to Solr for discovery in ArcLight.

PROBLEM:
ArchivesSpace agents (people, organizations, families) were not discoverable as standalone entities in ArcLight. Users could not browse or search for agents independently of their collections.

SOLUTION:
1. Extract ALL agents from ArchivesSpace via API
2. Generate EAC-CPF XML documents for each agent
3. Define Solr schema fields for agent metadata
4. Configure traject to index agent XML to Solr
5. Implement automatic indexing after XML generation

FEATURES:
- Processes all agent types (people, corporate entities, families)
- Generates standards-compliant EAC-CPF XML
- Links agents to their collections via persistent_id
- Automatic discovery of traject config (bundle show arcuit)
- Batch processing (100 files per traject call)
- Robust error handling with detailed logging
- Multiple processing modes (normal, agents-only, collections-only)

COMPONENTS:

1. Python Processing (arcflow/main.py - 1428 lines):
   - get_all_agents() - Fetch ALL agents from ArchivesSpace API
   - task_agent() - Generate EAC-CPF XML via archival_contexts endpoint
   - process_creators() - Batch process all agents in parallel (10 workers)
   - find_traject_config() - Auto-discover traject configuration
   - index_creators() - Batch index to Solr via traject

2. Solr Schema (solr/conf/arcuit_creator_fields.xml - 11 fields):
   - is_creator (boolean) - Identifies agent records
   - creator_persistent_id (string) - Unique identifier
   - agent_type (string) - Type: corporate/person/family
   - agent_id (string) - ArchivesSpace agent ID
   - agent_uri (string) - ArchivesSpace agent URI
   - entity_type (string) - EAC-CPF entity type
   - related_agents_ssim (multiValued) - Related agent names
   - related_agent_uris_ssim (multiValued) - Related agent URIs
   - relationship_types_ssim (multiValued) - Relationship types
   - document_type (string) - Document type (eac-cpf)
   - record_type (string) - Record type (creator/agent)

3. Traject Configuration (traject_config_eac_cpf.rb - 271 lines):
   - Maps EAC-CPF XML elements to Solr fields
   - Extracts agent identity information
   - Processes biographical/historical notes
   - Captures related agents and relationships
   - Handles collection linkages

USAGE:
    python -m arcflow.main \
        --arclight-dir /path/to/arclight \
        --aspace-dir /path/to/archivesspace \
        --solr-url http://localhost:8983/solr/blacklight-core
    python -m arcflow.main ... --agents-only
    python -m arcflow.main ... --collections-only
    python -m arcflow.main ... --skip-creator-indexing
    python -m arcflow.main ... --arcuit-dir /path/to/arcuit

COMMAND LINE OPTIONS:
    --arclight-dir PATH       Path to ArcLight application (required)
    --aspace-dir PATH         Path to ArchivesSpace data (required)
    --solr-url URL            Solr instance URL (required)
    --arcuit-dir PATH         Path to arcuit gem (optional, auto-detected)
    --agents-only             Process only agents, skip collections
    --collections-only        Process only collections, skip agents
    --skip-creator-indexing   Generate XML but don't index to Solr
    --force-update            Process all records regardless of timestamps

ARCHITECTURE:
    ArchivesSpace API
        ↓ (archivessnake library)
    arcflow Python code
        ↓ (fetches via /repositories/1/archival_contexts)
    EAC-CPF XML files (public/xml/agents/*.xml)
        ↓ (indexed via traject)
    Solr (blacklight-core)
        ↓ (discovered via ArcLight)
    ArcLight

DATA FLOW:
1. arcflow calls get_all_agents() - fetches ALL agents from ArchivesSpace API
2. For each agent, task_agent() retrieves EAC-CPF from archival_contexts endpoint
3. Saves EAC-CPF XML to public/xml/agents/ directory
4. find_traject_config() discovers config via 'bundle show arcuit' or --arcuit-dir
5. index_creators() batches XML files (100 per call) and invokes traject
6. traject indexes XML to Solr with is_creator:true flag
7. Agent records now searchable in ArcLight

BENEFITS:
- Users can discover all agents independently of collections
- Direct navigation to agent pages
- Browse all agents of a specific type
- View all collections linked to a specific agent
- Standards-based EAC-CPF format for interoperability
- Automatic indexing reduces manual steps
- Flexible processing modes for different workflows

TECHNICAL DETAILS:
- EAC-CPF format: urn:isbn:1-931666-33-4 namespace
- ID extraction: Filename-based (handles empty control element in EAC-CPF)
- Batch size: 100 files per traject call
- Parallel processing: 10 worker processes for agent generation
- Timeout: 5 minutes per batch
- Error handling: Log errors, continue processing
- Linking: Via Solr persistent_id field (not direct XML updates)

FILES CHANGED:
    arcflow-phase1-revised/
    ├── arcflow/
    │   ├── __init__.py (updated imports)
    │   ├── main.py (1428 lines, core logic)
    │   └── utils/
    │       ├── __init__.py
    │       ├── bulk_import.py
    │       └── stage_classifications.py
    ├── traject_config_eac_cpf.rb (271 lines)
    ├── requirements.txt
    ├── .archivessnake.yml.example
    ├── README.md (updated)
    ├── HOW_TO_USE.md (updated)
    └── TESTING.md (updated)

    solr/
    ├── README.md (installation instructions)
    └── conf/
        └── arcuit_creator_fields.xml (11 field definitions)

    Documentation:
    ├── CREATOR_INDEXING_GUIDE.md (comprehensive guide)
    ├── AUTOMATED_INDEXING_IMPLEMENTATION.md (technical details)
    └── README.md (updated with creator section)

TESTING:
Manual verification:
1. Run arcflow with --agents-only flag
2. Verify XML files generated in public/xml/agents/
3. Check Solr for indexed agent records
4. Verify is_creator:true in Solr documents
5. Test agent-collection linking via persistent_id

Automated testing:
- Python syntax validation
- Ruby syntax validation (traject config)
- Solr schema validation

DEPLOYMENT:
1. Add Solr schema fields to schema.xml
2. Reload Solr core
3. Run arcflow to generate and index agents
4. Verify agents appear in Solr
5. Test in ArcLight interface

BACKWARD COMPATIBILITY:
- No breaking changes to existing functionality
- Collections continue to work as before
- Agent indexing is additive
- Can be disabled with --skip-creator-indexing
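
The batching behaviour described in this commit message (100 files per traject call, a five-minute timeout per batch, log errors and continue) could be sketched roughly as follows. This is not the code in arcflow/main.py: the function names `batched` and `index_in_batches`, the injectable `run` parameter, and the exact traject command line are assumptions for illustration.

```python
import logging
import subprocess
from typing import Callable, Iterator, List, Sequence

log = logging.getLogger("arcflow.sketch")

BATCH_SIZE = 100        # files per traject call, per the commit message
TIMEOUT_SECONDS = 300   # five minutes per batch


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def index_in_batches(
    xml_files: Sequence[str],
    traject_config: str,
    solr_url: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Index files batch by batch and return the number of failed batches."""
    failures = 0
    for batch in batched(xml_files):
        # Assumed command line; the real invocation may differ.
        cmd = ["bundle", "exec", "traject", "-u", solr_url, "-c", traject_config, *batch]
        try:
            result = run(cmd, capture_output=True, text=True, timeout=TIMEOUT_SECONDS)
        except subprocess.TimeoutExpired:
            log.error("traject timed out on a batch of %d files", len(batch))
            failures += 1
            continue
        # Manual returncode check: log the failure and move on to the next batch.
        if result.returncode != 0:
            log.error("traject failed (%d): %s", result.returncode, result.stderr)
            failures += 1
    return failures
```

Passing `run` as a parameter lets a test substitute a fake runner, which is the kind of subprocess mocking Phase 3 of the testing plan calls for.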


as a separate command. Detailed explanation: self.traject_extra_config is constructed as a single string containing a space (e.g., "-c /path/to/file.rb"), but subprocess.run(cmd) passes arguments verbatim and Traject won’t parse that as two flags/values. Store the extra config as a path (or as an already-split argv list) and append it as ['-c', traject_extra_config] (or extend with a list) when building cmd. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
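
A minimal sketch of the fix described above, assuming a hypothetical helper named `build_traject_cmd`: the extra config is kept as a bare path, and the flag and its value are added to the argv list as two separate elements.

```python
import shlex
from typing import List, Optional


def build_traject_cmd(config: str, xml_file: str, extra_config: Optional[str] = None) -> List[str]:
    """Build an argv list in which each flag and each value is its own element."""
    cmd = ["traject", "-c", config]
    if extra_config:
        # Wrong: cmd.append(f"-c {extra_config}") passes ONE argument containing a space.
        cmd.extend(["-c", extra_config])
    cmd.append(xml_file)
    return cmd


cmd = build_traject_cmd("/path/to/base.rb", "agent 1.xml", "/path/to/extra.rb")
print(cmd)
print(shlex.join(cmd))  # shell-quoted rendering, handy for log messages
```

Because the list is passed to `subprocess.run` without a shell, a file name containing whitespace stays a single argument and needs no quoting.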


…#14)

* Improve traject config discovery and logging

- Add fallback search in arcflow package directory for development
- Add clear logging showing which traject config is being used
- Add warning when using arcflow package version (development mode)
- Improve error messages when traject config not found
- Document that traject config belongs in arcuit gem, not arcflow
- Update README with traject config location guidance

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Address code review feedback

- Change log level from error to warning for missing traject config
- Update example path to clarify arcuit gem location
- Show actual searched paths in error message for better troubleshooting

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Reorder traject config search to follow collection records pattern

- Change search order: arcuit_dir (1st) → bundle show (2nd) → example file (3rd)
- Rename traject_config_eac_cpf.rb to example_traject_config_eac_cpf.rb
- Prioritize arcuit_dir parameter as most up-to-date user control
- Fall back to example file for module usage without arcuit
- Update README with new search order and example file guidance

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Address code review feedback on example file

- Update usage comment to reference correct filename
- Improve log message formatting for consistency
- Add note about copying to arcuit for production use

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Update traject config search paths to follow ArcLight pattern

- Remove arcuit_dir/arcflow path (development artifact)
- Add arcuit_dir/lib/arcuit/traject path (matches EAD traject location)
- Apply same paths to both arcuit_dir and bundle show arcuit searches
- Update debug message to reflect new subdirectory checked

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Simplify example traject config search to single known location

- Remove candidate paths loop for example file
- Directly check the one known location at repo root
- Add comment explaining we know the exact location

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
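
The search order described in these commits (the `--arcuit-dir` value first, then `bundle show arcuit`, then the example file at the repo root) might look roughly like the sketch below. It is an illustration under assumptions, not the project's implementation: the config file name inside arcuit, the helper names, and the injectable `gem_lookup` parameter are all hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional

CONFIG_NAME = "traject_config_eac_cpf.rb"            # assumed file name inside arcuit
EXAMPLE_NAME = "example_traject_config_eac_cpf.rb"   # example file at the repo root


def bundle_show_arcuit() -> Optional[Path]:
    """Ask Bundler where the arcuit gem lives; return None if that fails."""
    try:
        out = subprocess.run(["bundle", "show", "arcuit"],
                             capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    path = out.stdout.strip()
    return Path(path) if out.returncode == 0 and path else None


def _search(root: Path) -> Optional[Path]:
    # Assumed candidates: the gem root and lib/arcuit/traject beneath it.
    for candidate in (root / CONFIG_NAME, root / "lib" / "arcuit" / "traject" / CONFIG_NAME):
        if candidate.is_file():
            return candidate
    return None


def find_traject_config(
    arcuit_dir: Optional[Path],
    repo_root: Path,
    gem_lookup: Callable[[], Optional[Path]] = bundle_show_arcuit,
) -> Optional[Path]:
    """Return the first config found: arcuit_dir, then bundle show, then the example."""
    if arcuit_dir is not None:
        found = _search(Path(arcuit_dir))
        if found:
            return found
    gem_dir = gem_lookup()
    if gem_dir is not None:
        found = _search(gem_dir)
        if found:
            return found
    example = repo_root / EXAMPLE_NAME
    return example if example.is_file() else None
```

Checking the user-supplied directory before calling Bundler means an explicit `--arcuit-dir` always wins, which matches the stated intent of treating it as the most up-to-date user control.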


alexdryden and others added 30 commits February 11, 2026 14:45

fix: remove method to test a single creator record

8a96e8d

refactor: move variable declarations into method body

604f68c

There is no longer any need for these to be defined outside the method

remove unwanted documentation

38c3612

refactor: revert to PIPE for stderr for consistency

4912ddc

fix: spacing

6014dd1

fix: duplicate line

5dbe81e

fix: remove unused method

1c63cae

Update arcflow/main.py

fef9307

allow for whitespace in filenames and for quoted arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: add time in case it isn't loaded when traject runs

849098e

Update arcflow/main.py

ec3c961

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'index_creators' of github.com:UIUCLibrary/arcflow into index_creators

29d1c7f

fix: remove hardcoded directory from local implementation

a043c9d

fix: use updated example

a407d72

Update README.md

8ec0445

fix typo

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update traject_config_eac_cpf.rb

202a532

In Nokogiri XPath, every namespaced element must be prefixed (e.g., //eac:control/eac:recordId) with a namespace mapping. exists. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Initial plan

deb325e

Use consistent eac: namespace prefix in all XPath queries

05239d1

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Refactor dates XPath for improved readability

5d2588e

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Merge pull request #9 from UIUCLibrary/copilot/sub-pr-8

0e25143

Fix EAC-CPF namespace handling in XPath queries
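
The commits above concern Nokogiri (Ruby), but the same pitfall can be shown in Python with the standard library's ElementTree, used here only as an analogy. The sample document and the record ID in it are made up; the namespace URI is the EAC-CPF namespace named in the commit message.

```python
import xml.etree.ElementTree as ET

EAC_NS = {"eac": "urn:isbn:1-931666-33-4"}

# Made-up sample: every element inherits the default EAC-CPF namespace.
SAMPLE = """<eac-cpf xmlns="urn:isbn:1-931666-33-4">
  <control><recordId>agent-42</recordId></control>
</eac-cpf>"""

root = ET.fromstring(SAMPLE)

# Unprefixed path: finds nothing, because the elements live in the namespace.
unprefixed = root.find("control/recordId")

# Every step prefixed, plus a prefix-to-URI mapping: finds the element.
prefixed = root.find("eac:control/eac:recordId", EAC_NS)

print(unprefixed)
print(prefixed.text)
```

Prefixing only some steps of the path fails in the same way as prefixing none, which is why a consistent `eac:` prefix on every step matters.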

Update README.md

e092b10

punctuation/spacing

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: use dynamic mappings for ArcLight Solr fields

32d923e

add extra traject config with extend

ae77c98

to prevent commands from running together

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Preserve html markup

cd1f94d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Initial plan

107e43d

Add check=True to subprocess.run and handle exception properly

b542367

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Move success log to else block for clarity

092d1be

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Revert to original error handling without check=True

924ea47

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Merge pull request #10 from UIUCLibrary/copilot/sub-pr-8-again

db949e2

Revert to manual returncode checking for subprocess error handling

Copilot AI and others added 10 commits February 13, 2026 17:34

Initial plan

4923eb2

Add agent filtering to exclude system users and donors

093d936

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Remove redundant software agent filter - already excluded by endpoint selection

f264373

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Merge pull request #12 from UIUCLibrary/copilot/sub-pr-8

b0bcf33

Filter non-creator agents from indexing (exclude system users, donors)

fix: convert to string and use shell for wildcard expansion

5569acc

Merge branch 'main' into index_creators

a3166e3

fix: ignore last update param with --force-update

3b3d1c7

and update documentation to reflect creator records are also deleted upon --force-update.

Add delete logic for creators and refactor

989d0f4

This adds logic to delete creator records, similar to resources, and refactors the directory structure of the XML files so it is clear where resource and agent documents live for create/read/delete.

Initial plan

b828fe7

Copilot AI assigned Copilot and alexdryden Feb 23, 2026

Copilot started work on behalf of alexdryden February 23, 2026 19:56

Add comprehensive testing and CI/CD phased implementation plan document

a65dea5

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add automated testing and CI/CD infrastructure~~ Add phased testing and CI/CD implementation roadmap Feb 23, 2026

Copilot AI requested a review from alexdryden February 23, 2026 20:00

Copilot finished work on behalf of alexdryden February 23, 2026 20:00

alexdryden force-pushed the index_creators branch 3 times, most recently from f23fe83 to 89057a9 March 4, 2026 23:52

Copilot wants to merge 41 commits into index_creators from copilot/add-automated-testing-ci-cd
