[pull] dev from ArchiveBox:dev #1
Open
pull[bot] wants to merge 4064 commits into mrbenns:dev from
    'youtube_dl',
], capture_output=True, text=True, cwd=out_dir).stdout.split('Location: ')[-1].split('\n', 1)[0]
NEW_YOUTUBEDL_BINARY = Path(pkg_path) / 'youtube_dl' / '__main__.py'
os.chmod(NEW_YOUTUBEDL_BINARY, 0o777)
Check failure
Code scanning / CodeQL
Overly permissive file permissions
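One way to address this finding is to grant execute permission without making the file world-writable. The sketch below is only an illustration: `make_executable` is a hypothetical helper, not something in the ArchiveBox codebase, and it assumes that mode `0o755` (owner read/write/execute, everyone else read/execute) is enough for the downloaded script.

```python
import os
import stat
import tempfile
from pathlib import Path


def make_executable(path: Path) -> None:
    """Hypothetical helper: mark a file executable without group/other write access."""
    # 0o755 = rwxr-xr-x, whereas 0o777 would also let any local user rewrite the file.
    os.chmod(path, 0o755)


# Usage on a throwaway file (the file name mirrors the snippet above).
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / '__main__.py'
    script.write_text('print("hello")\n')
    make_executable(script)
    print(oct(stat.S_IMODE(script.stat().st_mode)))
```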
if PUBLIC_INDEX:
    return redirect('/public')

return redirect(f'/admin/login/?next={request.path}')
Check warning
Code scanning / CodeQL
URL redirection from remote source
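A possible hardening for the login redirects flagged here is to validate and percent-encode the path before it goes into the `next` parameter. The sketch below uses only the standard library and is an assumption about what a fix could look like, not the project's actual change. `login_redirect_url` is a hypothetical name. In a Django view, `django.utils.http.url_has_allowed_host_and_scheme` may be a better fit for the validation step.

```python
from urllib.parse import quote


def login_redirect_url(request_path: str, login_url: str = '/admin/login/') -> str:
    """Hypothetical helper: build a login URL with a sanitized ?next= target."""
    # Accept only same-site absolute paths. Reject scheme-relative URLs ('//host')
    # and backslashes, which some browsers treat like forward slashes.
    is_local = (
        request_path.startswith('/')
        and not request_path.startswith('//')
        and '\\' not in request_path
    )
    next_path = request_path if is_local else '/'
    # Percent-encode everything except '/', so '?', '&' and '=' in the path
    # cannot inject extra query parameters into the login URL.
    return f"{login_url}?next={quote(next_path, safe='/')}"


print(login_redirect_url('/archive/123/'))
print(login_redirect_url('//evil.example/phish'))
```

Falling back to `/` for anything suspicious keeps the redirect on the same site at the cost of losing the original destination.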
def get(self, request, path):
    if not request.user.is_authenticated and not PUBLIC_SNAPSHOTS:
        return redirect(f'/admin/login/?next={request.path}')
Check warning
Code scanning / CodeQL
URL redirection from remote source
# missing trailing slash -> redirect to index
if '/' not in path:
    return redirect(f'{path}/index.html')
Check warning
Code scanning / CodeQL
URL redirection from remote source
    response = super().get(*args, **kwargs)
    return response
else:
    return redirect(f'/admin/login/?next={self.request.path}')
Check warning
Code scanning / CodeQL
URL redirection from remote source
def add_view(self, request):
    if not request.user.is_authenticated:
        return redirect(f'/admin/login/?next={request.path}')
Check warning
Code scanning / CodeQL
URL redirection from remote source
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. -->
# Summary
<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->
# Related issues
<!-- e.g. #123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap -->
# Changes these areas
- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

---
## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout, stderr, pid, and cmd filenames. This fixes mixed logs and ensures correct cleanup and status checks when multiple hooks run in the same plugin directory.

- **Bug Fixes**
  - hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude them from new_files; derive cmd.sh from pid for safe kill.
  - core/models.py: read hook-specific logs; exclude hook output files when computing outputs; cleanup and background detection use *.pid.
  - Plugins: stop writing redundant hook.pid files; minor chrome utils cleanup.

<sup>Written for commit 754b096. Summary will update on new commits.</sup>
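The per-hook naming scheme described in this commit could be sketched roughly as follows. This is a guess at the shape of the logic, not the code in hooks.py: `hook_output_paths` is a hypothetical function, and it assumes the prefix is the hook script's filename with its final extension removed (a `.js` extension is assumed for the example).

```python
from pathlib import Path


def hook_output_paths(hook_script: Path, output_dir: Path) -> dict:
    """Hypothetical helper: derive per-hook log/pid/cmd paths from the hook's filename."""
    # 'on_Snapshot__20_chrome_tab.bg.js' -> 'on_Snapshot__20_chrome_tab.bg'
    prefix = hook_script.stem
    return {
        'stdout': output_dir / f'{prefix}.stdout.log',
        'stderr': output_dir / f'{prefix}.stderr.log',
        'pid': output_dir / f'{prefix}.pid',
        'cmd': output_dir / f'{prefix}.sh',
    }


paths = hook_output_paths(Path('on_Snapshot__20_chrome_tab.bg.js'), Path('chrome'))
for kind, path in paths.items():
    print(kind, path.name)
```

Because every filename carries the hook's own prefix, two hooks that share an output directory no longer write to the same `stdout.log` or `hook.pid`.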
Simplifies the comma-separated parsing logic to:
- If value contains '[', parse as JSON array
- Otherwise, parse as comma-separated values

This prevents incorrect splitting of arguments containing internal commas when there's only one argument. For arguments with commas, users should use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]'

Also exports getEnvArray in module.exports for consistency.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
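The two-branch parsing rule from this commit can be illustrated in Python. The real `getEnvArray` lives in JavaScript and is not shown on this page, so `parse_env_array` below is a hypothetical rendering of the described behavior. How the original handles malformed JSON or empty items is not stated. This sketch lets JSON errors propagate and drops empty items.

```python
import json


def parse_env_array(value: str) -> list:
    """Hypothetical Python rendering of the rule: JSON array if '[' is present, else CSV."""
    value = value.strip()
    if not value:
        return []
    if '[' in value:
        # JSON form keeps commas inside individual arguments intact.
        parsed = json.loads(value)  # raises ValueError on malformed JSON
        if not isinstance(parsed, list):
            raise ValueError('expected a JSON array')
        return [str(item) for item in parsed]
    # Plain form: split on commas and drop empty pieces.
    return [part.strip() for part in value.split(',') if part.strip()]


print(parse_env_array('["--arg1,val", "--arg2"]'))
print(parse_env_array('--headless, --no-sandbox'))
```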
…ling logic on model methods (#1734)

## Summary by cubic
Added an implementation plan to centralize subprocess handling on the machine.Process model. It covers process hierarchy, Process.current(), safe lifecycle methods (launch/kill/wait), PID reuse protection, and phased changes across hooks, workers, CLI, migrations, and admin.

<sup>Written for commit 3ae9410. Summary will update on new commits.</sup>
…#1735)

Comprehensive plan for implementing JSONL-based CLI piping:
- Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix)
- Phase 2: Extract shared apply_filters() to cli_utils.py
- Phase 3: Implement pass-through behavior for all create commands
- Phase 4-6: Test infrastructure with pytest-django, unit/integration tests

Key changes from original plan:
- ArchiveResult.from_json() identified as missing prerequisite
- Pass-through documented as new feature to implement
- archivebox run updated to create-or-update pattern
- conftest.py redesigned to use pytest-django with isolated tmp_path
- Standardized on tags_str field name across all models
- Reordered phases: implement before test
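As a rough illustration of what "pass-through" might mean in a JSONL pipeline, the sketch below handles records of one type and re-emits every other record unchanged so downstream commands still see it. The record layout (a `type` key) and the `pass_through` function are assumptions made for this example. The plan itself does not spell out the record schema on this page.

```python
import json


def pass_through(lines, handled_type, handle):
    """Hypothetical filter: transform records of one type, forward the rest untouched."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get('type') == handled_type:
            yield json.dumps(handle(record))
        else:
            yield json.dumps(record)


incoming = [
    '{"type": "Snapshot", "url": "https://example.com"}',
    '{"type": "Tag", "name": "docs"}',
]
for out_line in pass_through(incoming, 'Snapshot', lambda r: {**r, 'seen': True}):
    print(out_line)
```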
This change consolidates duplicated logic between chrome_utils.js and extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path calculation
- Add installExtensionWithCache() to handle extension install + cache workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock, istilldontcareaboutcookies, twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining same functionality.
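The context-manager approach to automatic cleanup mentioned above generally follows the pattern below. This is a generic sketch, not the contents of chrome_test_helpers.py: `managed_session` is a hypothetical name, and the `setup` and `cleanup` callables stand in for `setup_chrome_session()` and `cleanup_chrome()`, whose real signatures are not shown here.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(setup, cleanup):
    """Hypothetical wrapper: run setup, hand the session to the test, always clean up."""
    session = setup()
    try:
        yield session
    finally:
        # Runs even if the test body raises, so no browser process is left behind.
        cleanup(session)


events = []
with managed_session(lambda: 'session-1', lambda s: events.append(f'cleaned {s}')) as session:
    events.append(f'using {session}')
print(events)
```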
- Update Crawl.output_dir_parent to use username instead of user_id
for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging:
users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
- Add setup_test_env, launch_chromium_session, kill_chromium_session to chrome_test_helpers.py for extension tests
- Add chromium_session context manager for cleaner test code
- Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use shared helpers (~450 lines removed)
- Refactor screenshot, dom, pdf tests to use shared get_test_env and get_lib_dir (~60 lines removed)
- Net reduction: 228 lines of duplicate code
- Add get_machine_type() to chrome_test_helpers.py
- Update get_test_env() to include MACHINE_TYPE
- Refactor test_chrome.py to import from shared helpers
- Removes ~50 lines of duplicate code
- Import shared Chrome test helpers
- Add test_singlefile_with_chrome_session() to verify CDP connection
- Add test_singlefile_disabled_skips() for config testing
- Update existing test to use get_test_env()
New helpers in chrome_test_helpers.py:
- get_plugin_dir(__file__) - get plugin dir from test file path
- get_hook_script(dir, pattern) - find hook script by glob pattern
- run_hook() - run hook script and return (returncode, stdout, stderr)
- parse_jsonl_output() - parse JSONL from hook output
- run_hook_and_parse() - convenience combo of above two
- LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants
- _LazyPath class for deferred path resolution

Updated test files to use simpler patterns:
- screenshot/tests/test_screenshot.py
- dom/tests/test_dom.py
- pdf/tests/test_pdf.py
- singlefile/tests/test_singlefile.py

Before: PLUGIN_DIR = Path(__file__).parent.parent
After: PLUGIN_DIR = get_plugin_dir(__file__)

Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
After: from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
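A helper like `parse_jsonl_output()` typically needs to cope with hook output that mixes JSON records with ordinary log lines. The sketch below shows one way that could work. It is not the helper from chrome_test_helpers.py, whose exact behavior is not shown on this page. It assumes that records are JSON objects, one per line, and that anything else should be ignored.

```python
import json


def parse_jsonl_output(stdout: str) -> list:
    """Illustrative parser: collect JSON objects from mixed hook output, one per line."""
    records = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # plain log line, not a record
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; skip rather than fail the test
    return records


sample = 'starting hook\n{"type": "ArchiveResult", "status": "succeeded"}\n{broken\n'
print(parse_jsonl_output(sample))
```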
Changed Snapshot.cleanup() to gracefully terminate background hooks:
1. Send SIGTERM to all background hook processes first
2. Wait up to each hook's plugin-specific timeout
3. Send SIGKILL only to hooks still running after their timeout

Added graceful_terminate_background_hooks() function in hooks.py that:
- Collects all .pid files from output directory
- Validates process identity using mtime
- Sends SIGTERM to all valid processes in phase 1
- Polls each process for up to its plugin-specific timeout
- Sends SIGKILL as last resort if timeout expires
- Returns status for each hook (sigterm/sigkill/already_dead/invalid)
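The two-phase shutdown described above (SIGTERM to everyone first, then SIGKILL only for stragglers) can be sketched with `subprocess.Popen` handles. This simplification is an assumption: the commit works from `.pid` files and validates process identity, which is omitted here, and `graceful_terminate` is a hypothetical name. The status strings mirror the ones listed in the commit message.

```python
import subprocess
import sys


def graceful_terminate(procs, timeouts, default_timeout=5.0):
    """Hypothetical sketch: SIGTERM all processes, wait per-process, SIGKILL the rest."""
    results = {}
    # Phase 1: signal every live process up front so they all shut down in parallel.
    for name, proc in procs.items():
        if proc.poll() is not None:
            results[name] = 'already_dead'
        else:
            proc.terminate()  # SIGTERM on POSIX
    # Phase 2: give each process its own timeout before escalating.
    for name, proc in procs.items():
        if name in results:
            continue
        try:
            proc.wait(timeout=timeouts.get(name, default_timeout))
            results[name] = 'sigterm'
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL as the last resort
            proc.wait()
            results[name] = 'sigkill'
    return results


sleeper = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
print(graceful_terminate({'sleeper': sleeper}, {'sleeper': 10.0}))
```

Sending SIGTERM to all processes before waiting on any of them means the total wait is bounded by the slowest hook, not by the sum of all timeouts.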
- Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js. These are now the single source of truth for path calculations
- Update chrome_test_helpers.py with call_chrome_utils() dispatcher
- Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers
- Update cleanup_chrome and kill_chromium_session to use JS killChrome
- Remove unused Chrome binary search lists from singlefile hook (~25 lines)
- Update readability, mercury, favicon, title tests to use shared helpers
Added 10 practical examples demonstrating the JSONL piping architecture:
1. Basic archive with auto-cascade
2. Retry failed extractions (by status, plugin, domain)
3. Pinboard bookmark import with jq
4. GitHub repo filtering with jq regex
5. Selective extraction (screenshots only)
6. Bulk tag management
7. Deep documentation crawling
8. RSS feed monitoring
9. Archive audit with jq aggregation
10. Incremental backup with diff

Also added auto-cascade principle: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots, so intermediate commands are only needed for customization.
Extended graceful_terminate_background_hooks() to:
- Reap processes with os.waitpid() to get exit codes
- Write returncode to .returncode file for update_from_output()
- Return detailed result dict with status, returncode, and pid

Updated update_from_output() to:
- Read .returncode and .stderr.log files
- Determine status from returncode if no ArchiveResult JSONL record
- Include stderr in output_str for failed hooks
- Handle signal termination (negative returncodes like -9 for SIGKILL)
- Clean up .returncode files along with other hook output files
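Deriving a status from a return code, including the negative values that indicate termination by a signal, might look like the sketch below. The function name `describe_returncode` and the `'succeeded'`/`'failed'` status strings are assumptions for illustration. The actual logic in update_from_output() is not shown on this page.

```python
import signal


def describe_returncode(returncode: int, stderr: str = '') -> tuple:
    """Hypothetical helper: map a hook's return code to (status, detail)."""
    if returncode == 0:
        return ('succeeded', '')
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N, e.g. -9 for SIGKILL.
        try:
            sig_name = signal.Signals(-returncode).name
        except ValueError:
            sig_name = f'signal {-returncode}'
        return ('failed', f'terminated by {sig_name}')
    detail = f'exited with code {returncode}'
    if stderr.strip():
        detail += f': {stderr.strip()}'  # surface stderr for failed hooks
    return ('failed', detail)


print(describe_returncode(0))
print(describe_returncode(-int(signal.SIGTERM)))
print(describe_returncode(2, 'chrome not found\n'))
```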
- get_machine_type() matches JS getMachineType()
- get_lib_dir() matches JS getLibDir()
- get_node_modules_dir() matches JS getNodeModulesDir()
- get_extensions_dir() matches JS getExtensionsDir()
- find_chromium() matches JS findChromium()
- kill_chrome() matches JS killChrome()
- get_test_env() matches JS getTestEnv()

All functions now try JS first (single source of truth) with Python fallback. Added backward compatibility aliases for old names.
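The "try JS first, fall back to Python" pattern could be sketched as a small wrapper around `subprocess.run`. Everything specific here is an assumption: `js_first` is a hypothetical name, and the way chrome_utils.js is actually invoked (command name, arguments, output format) is not shown on this page, so the example uses a stand-in child process instead of a real `node` call.

```python
import subprocess
import sys


def js_first(cmd, fallback, timeout=10):
    """Hypothetical wrapper: prefer the external command's answer, else call fallback()."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        output = result.stdout.strip()
        if result.returncode == 0 and output:
            return output
    except (OSError, subprocess.TimeoutExpired):
        pass  # binary missing or hung: fall through to the local implementation
    return fallback()


# Stand-in for the JS helper: a child process that prints a value.
print(js_first([sys.executable, '-c', 'print("from-child")'], lambda: 'from-fallback'))
# A command that does not exist triggers the fallback.
print(js_first(['definitely-not-a-real-binary-xyz'], lambda: 'from-fallback'))
```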
Tags now support full unicode with no restrictions. URL-encode the tag name wherever it previously used the slug (export filenames, lookups).
- Remove `slug` field, `_generate_unique_slug`, and slug handling in save()
- Add migration 0034 to drop the slug column
- `get_tag_by_ref` now resolves by URL-decoded exact name match
- Tag search/autocomplete/export filenames use the name directly
- Drop slug from admin search_fields/readonly_fields/fieldsets
- Remove slug display from similar-tag cards and client download filename
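Resolving a tag by its URL-decoded name can be illustrated without a database. The sketch below is not `get_tag_by_ref` itself: `find_tag_by_ref` is a hypothetical stand-in that searches a plain list of names, and the case-insensitive comparison is an assumption made for this example.

```python
from urllib.parse import unquote


def find_tag_by_ref(ref: str, tag_names: list):
    """Hypothetical lookup: match a URL-encoded reference against known tag names."""
    wanted = unquote(ref).casefold()  # '%20' -> ' ', '%C3%A9' -> 'é'
    for name in tag_names:
        if name.casefold() == wanted:
            return name
    return None


print(find_tag_by_ref('caf%C3%A9%20notes', ['Caf\u00e9 Notes', 'research']))
```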
Applies pirate's review suggestion on PR #1789: mark the Content-Disposition filename encoding as a known-rough approach that could be hardened further (strip punctuation, convert to ASCII equivalents) in a follow-up.
Addresses review feedback from cubic and devin: quote()'s percent- encoding isn't decoded by browsers in Content-Disposition's filename parameter (Safari saves literal %20). Switch to Django's slugify() which does NFKD normalization, ASCII transliteration, and replaces punctuation with hyphens — producing clean names like "tag-alpha-research-urls.txt". - Add tag_filename_safe(name) helper wrapping slugify - Use it in both tag export endpoints - Drop the now-unneeded JS fallback name (server always sets Content-Disposition)
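To show the kind of transformation involved, here is a standard-library approximation of an ASCII slug suitable for a download filename. It is not Django's `slugify()` and not the `tag_filename_safe()` helper from this commit. `ascii_slug` and `export_filename` are hypothetical names, and both the `tag-...-urls.txt` layout and the `'tag'` fallback for names with no ASCII equivalent are assumptions.

```python
import re
import unicodedata


def ascii_slug(name: str) -> str:
    """Approximate an ASCII slug: NFKD-normalize, drop non-ASCII, hyphenate whitespace."""
    # NFKD splits 'é' into 'e' plus a combining accent, which the ASCII encode then drops.
    value = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')


def export_filename(tag_name: str, suffix: str = 'urls.txt') -> str:
    """Hypothetical filename builder for a tag export."""
    slug = ascii_slug(tag_name) or 'tag'  # fall back if nothing ASCII survives
    return f'tag-{slug}-{suffix}'


print(export_filename('Alpha Research'))
print(export_filename('Caf\u00e9 Notes'))
```

Unlike percent-encoding, the result contains only characters that browsers write to disk verbatim, at the cost of mapping distinct tag names to the same filename in some cases.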
Replaces the tag_filename_safe() helper with a Tag.slug property that returns the slugified form via django.utils.text.slugify. Call sites now just use tag.slug directly.
Address pirate's review: restore the slug in the client-side download fallback filename. Expose tag.slug as data-slug on the card element and in the search card schema so the JS can read it directly without slugifying client-side.
## Summary
This PR removes the `slug` field from the Tag model and all related slug generation logic. Tags are now identified and referenced by their name instead of a generated slug, simplifying the data model and reducing complexity.

## Related issues
N/A

## Changes these areas
- [x] Internal architecture
- [x] Snapshot data layout on disk

## Details
### What changed
1. **Model changes**: Removed the `slug` field from the Tag model, including the `_generate_unique_slug()` method and slug generation logic in the `save()` method
2. **Database migration**: Added migration `0034_remove_tag_slug` to drop the slug column
3. **API updates**: Removed `slug` from all API schemas (TagSchema, TagSearchCardSchema, TagUpdateResponseSchema) and responses
4. **Tag lookup**: Updated `get_tag_by_ref()` to use URL-decoded tag names instead of slugs for lookups
5. **Tag filtering**: Simplified `get_matching_tags()` to only filter by name instead of both name and slug
6. **Export filenames**: Changed tag export filenames to use `quote(tag.name)` instead of `tag.slug`
7. **Admin interface**: Removed slug from TagAdmin search fields, readonly fields, and fieldsets
8. **Templates**: Removed slug display from tag cards and similar tags UI
9. **Tests**: Updated test expectations and removed slug assertions; updated export filename checks to use `quote(tag.name)`

### Why
This simplifies the Tag model by removing the derived slug field. Tags can be uniquely identified by their name, and URL encoding handles special characters in filenames and URLs. This reduces database complexity and eliminates the need for slug generation and uniqueness logic.

## Test Plan
Existing tests have been updated to verify the new behavior:
- `test_tag_rename_api_updates_name` verifies tag renaming works without slug
- `test_tag_snapshots_export_returns_jsonl` and `test_tag_urls_export_returns_plain_text_urls` verify export filenames use encoded tag names
- `test_tag_table_has_required_columns` verifies the database schema no longer includes slug

All related tests pass with the updated assertions.

https://claude.ai/code/session_014KmEXoA64Ayp2t8BW2xfVP

---
## Summary by cubic
Removed the stored `slug` from Tag and moved to name-based tags. Added a derived `Tag.slug` via `django.utils.text.slugify` for clean export filenames and an admin download fallback; public APIs no longer include slugs and lookups resolve by URL-decoded exact name.

- **Refactors**
  - Replaced stored slug with a derived `Tag.slug` property; removed slug generation/save logic.
  - Public API schemas and autocomplete drop `slug`; matching/filtering uses `name` only.
  - `get_tag_by_ref` resolves by URL-decoded `name` (case-insensitive exact match).
  - Export endpoints set filenames using `tag.slug`; admin tag cards expose `data-slug`, and the client uses it as a fallback filename. Removed slug from admin search fields/fieldsets and UI displays.
- **Migration**
  - Run database migrations.
  - Update any consumers expecting `slug` in Tag API/admin; use the tag `name` for references (URL-encode names in links). Rely on server-provided filenames, with the built-in client fallback using `tag.slug` where needed.

<sup>Written for commit 7c3a3e0. Summary will update on new commits.</sup>
Signed-off-by: Nick Sweeting <git@sweeting.me>
## Summary
- add a vanilla HTML/CSS landing page under repo-root `publicsite/`
- keep the existing ArchiveBox logo and custom domain CNAME in the Pages
artifact
- use the light-mode ArchiveBox design tokens with no dark-mode CSS
- update the GitHub Pages workflow to deploy `./publicsite` directly
without Jekyll
- remove the old top-level `website/` tree and duplicate Jekyll Pages
workflow
## Validation
- `ruby -e "require 'yaml';
YAML.load_file('.github/workflows/gh-pages.yml')"`
- parsed `publicsite/index.html` with Python `HTMLParser`
- served `publicsite` locally and verified `/`, `styles.css`,
`icon.png`, and `CNAME` return 200
Signed-off-by: Nick Sweeting <git@sweeting.me>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )