[pull] dev from ArchiveBox:dev by pull[bot] · Pull Request #1 · mrbenns/ArchiveBox

pull · 2022-05-21T17:39:36Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

+                'youtube_dl',
+            ], capture_output=True, text=True, cwd=out_dir).stdout.split('Location: ')[-1].split('\n', 1)[0]
+            NEW_YOUTUBEDL_BINARY = Path(pkg_path) / 'youtube_dl' / '__main__.py'
+            os.chmod(NEW_YOUTUBEDL_BINARY, 0o777)


+        if PUBLIC_INDEX:
+            return redirect('/public')
+
+        return redirect(f'/admin/login/?next={request.path}')


+
+    def get(self, request, path):
+        if not request.user.is_authenticated and not PUBLIC_SNAPSHOTS:
+            return redirect(f'/admin/login/?next={request.path}')


+
+            # missing trailing slash -> redirect to index
+            if '/' not in path:
+                return redirect(f'{path}/index.html')


+            response = super().get(*args, **kwargs)
+            return response
+        else:
+            return redirect(f'/admin/login/?next={self.request.path}')


+
+    def add_view(self, request):
+        if not request.user.is_authenticated:
+            return redirect(f'/admin/login/?next={request.path}')


Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

Multiple hooks in the same plugin directory were overwriting each other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files

---
## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout, stderr, pid, and cmd filenames. This fixes mixed logs and ensures correct cleanup and status checks when multiple hooks run in the same plugin directory.

- **Bug Fixes**
  - hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude them from new_files; derive cmd.sh from pid for safe kill.
  - core/models.py: read hook-specific logs; exclude hook output files when computing outputs; cleanup and background detection use *.pid.
  - Plugins: stop writing redundant hook.pid files; minor chrome utils cleanup.

Written for commit 754b096. Summary will update on new commits.
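The naming scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in hooks.py: the helper name `hook_output_paths` and the rule "strip only the final extension of the hook script" are assumptions inferred from the example filenames in the commit message.

```python
from pathlib import Path


def hook_output_paths(output_dir: Path, hook_script: Path) -> dict:
    """Return per-hook log/pid/cmd paths, each prefixed with the hook's name.

    Hypothetical helper: the real run_hook() may derive names differently.
    """
    # Path.stem drops only the last suffix:
    # 'on_Snapshot__20_chrome_tab.bg.js' -> 'on_Snapshot__20_chrome_tab.bg'
    hook_name = hook_script.stem
    return {
        'stdout': output_dir / f'{hook_name}.stdout.log',
        'stderr': output_dir / f'{hook_name}.stderr.log',
        'pid': output_dir / f'{hook_name}.pid',
        'cmd': output_dir / f'{hook_name}.sh',
    }


paths = hook_output_paths(Path('/tmp/out'), Path('on_Snapshot__20_chrome_tab.bg.js'))
print(paths['stdout'].name)
```

Because every path carries the hook name, two hooks sharing one plugin directory no longer write to the same stdout.log or hook.pid.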

Simplifies the comma-separated parsing logic to:
- If value contains '[', parse as JSON array
- Otherwise, parse as comma-separated values

This prevents incorrect splitting of arguments containing internal commas when there's only one argument. For arguments with commas, users should use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]'

Also exports getEnvArray in module.exports for consistency.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
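The two parsing rules can be illustrated with a short Python port. The actual helper is the JavaScript getEnvArray; the function name `parse_env_array`, the whitespace trimming, and the dropping of empty items are assumptions made for this sketch.

```python
import json


def parse_env_array(value: str) -> list:
    """Parse an env var into a list of strings (hypothetical Python port)."""
    value = value.strip()
    if not value:
        return []
    if '[' in value:
        # Rule 1: anything containing '[' is treated as a JSON array.
        # json.loads raises ValueError if the value is not valid JSON.
        return [str(item) for item in json.loads(value)]
    # Rule 2: otherwise split on commas (trimming is an assumption of this sketch).
    return [part.strip() for part in value.split(',') if part.strip()]


print(parse_env_array('["--arg1,val", "--arg2"]'))
print(parse_env_array('--arg1, --arg2'))
```

The JSON form is the only one that keeps `--arg1,val` intact as a single argument, which is why the commit recommends it for values with internal commas.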

…ling logic on model methods (#1734)

---
## Summary by cubic
Added an implementation plan to centralize subprocess handling on the machine.Process model. It covers process hierarchy, Process.current(), safe lifecycle methods (launch/kill/wait), PID reuse protection, and phased changes across hooks, workers, CLI, migrations, and admin.

Written for commit 3ae9410. Summary will update on new commits.

…#1735)

Comprehensive plan for implementing JSONL-based CLI piping:
- Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix)
- Phase 2: Extract shared apply_filters() to cli_utils.py
- Phase 3: Implement pass-through behavior for all create commands
- Phase 4-6: Test infrastructure with pytest-django, unit/integration tests

Key changes from original plan:
- ArchiveResult.from_json() identified as missing prerequisite
- Pass-through documented as new feature to implement
- archivebox run updated to create-or-update pattern
- conftest.py redesigned to use pytest-django with isolated tmp_path
- Standardized on tags_str field name across all models
- Reordered phases: implement before test

This change consolidates duplicated logic between chrome_utils.js and extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path calculation
- Add installExtensionWithCache() to handle extension install + cache workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock, istilldontcareaboutcookies, twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining same functionality.

- Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
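As a rough illustration of the new Crawl path layout, the sketch below builds `users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}` from its parts. The function name `crawl_output_dir`, its parameters, and the `'unknown'` fallback for URLs without a hostname are assumptions, not the actual Crawl.output_dir_parent implementation.

```python
from datetime import datetime
from pathlib import PurePosixPath
from urllib.parse import urlparse


def crawl_output_dir(username: str, first_url: str, crawl_id: str,
                     created_at: datetime) -> PurePosixPath:
    """Build a crawl directory path in the layout described above (hypothetical)."""
    # The domain comes from the first URL of the crawl; the fallback is an assumption.
    domain = urlparse(first_url).hostname or 'unknown'
    return (PurePosixPath('users') / username / 'crawls'
            / created_at.strftime('%Y%m%d') / domain / crawl_id)


print(crawl_output_dir('alice', 'https://example.com/docs/page', 'abc123',
                       datetime(2025, 12, 31)))
```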

- Add setup_test_env, launch_chromium_session, kill_chromium_session to chrome_test_helpers.py for extension tests
- Add chromium_session context manager for cleaner test code
- Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use shared helpers (~450 lines removed)
- Refactor screenshot, dom, pdf tests to use shared get_test_env and get_lib_dir (~60 lines removed)
- Net reduction: 228 lines of duplicate code

- Add get_machine_type() to chrome_test_helpers.py
- Update get_test_env() to include MACHINE_TYPE
- Refactor test_chrome.py to import from shared helpers
- Removes ~50 lines of duplicate code

- Import shared Chrome test helpers
- Add test_singlefile_with_chrome_session() to verify CDP connection
- Add test_singlefile_disabled_skips() for config testing
- Update existing test to use get_test_env()

New helpers in chrome_test_helpers.py:
- get_plugin_dir(__file__) - get plugin dir from test file path
- get_hook_script(dir, pattern) - find hook script by glob pattern
- run_hook() - run hook script and return (returncode, stdout, stderr)
- parse_jsonl_output() - parse JSONL from hook output
- run_hook_and_parse() - convenience combo of above two
- LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants
- _LazyPath class for deferred path resolution

Updated test files to use simpler patterns:
- screenshot/tests/test_screenshot.py
- dom/tests/test_dom.py
- pdf/tests/test_pdf.py
- singlefile/tests/test_singlefile.py

Before: PLUGIN_DIR = Path(__file__).parent.parent
After: PLUGIN_DIR = get_plugin_dir(__file__)

Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
After: from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
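To make the JSONL-parsing helper concrete, here is a minimal sketch of how hook stdout might be filtered for JSON records. It is not the parse_jsonl_output() from chrome_test_helpers.py: the name `parse_jsonl_sketch`, the rule of skipping non-JSON log lines, and the record fields in the example are all assumptions.

```python
import json


def parse_jsonl_sketch(stdout: str) -> list:
    """Collect JSON objects from mixed hook output, one object per line."""
    records = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # plain log line, not a JSON record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was malformed; ignore it
        if isinstance(record, dict):
            records.append(record)
    return records


# Hypothetical hook output: one log line followed by one JSONL record.
output = 'starting hook\n{"type": "ArchiveResult", "status": "succeeded"}\n'
print(parse_jsonl_sketch(output))
```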

Changed Snapshot.cleanup() to gracefully terminate background hooks:
1. Send SIGTERM to all background hook processes first
2. Wait up to each hook's plugin-specific timeout
3. Send SIGKILL only to hooks still running after their timeout

Added graceful_terminate_background_hooks() function in hooks.py that:
- Collects all .pid files from output directory
- Validates process identity using mtime
- Sends SIGTERM to all valid processes in phase 1
- Polls each process for up to its plugin-specific timeout
- Sends SIGKILL as last resort if timeout expires
- Returns status for each hook (sigterm/sigkill/already_dead/invalid)
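The two-phase shutdown can be sketched as below for POSIX systems. This is a simplified illustration under stated assumptions, not the code in hooks.py: the names `has_exited` and `graceful_terminate` are hypothetical, PIDs and timeouts are passed in directly instead of being read from .pid files, and the mtime-based identity check is omitted.

```python
import os
import signal
import time


def has_exited(pid: int) -> bool:
    """Return True once `pid` is gone, reaping it if it is our own child."""
    try:
        reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
        return reaped_pid == pid
    except ChildProcessError:
        # Not a child of this process: fall back to a signal-0 existence probe.
        try:
            os.kill(pid, 0)
        except ProcessLookupError:
            return True
        except PermissionError:
            return False  # process exists but belongs to another user
        return False


def graceful_terminate(timeouts: dict, poll_interval: float = 0.1) -> dict:
    """SIGTERM every pid first, then SIGKILL only those that outlive their timeout."""
    results, deadlines = {}, {}
    # Phase 1: ask every process to stop before waiting on any of them.
    for pid, timeout in timeouts.items():
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            results[pid] = 'already_dead'
        else:
            deadlines[pid] = time.monotonic() + timeout
    # Phase 2: poll each survivor until it exits or its own deadline passes.
    while deadlines:
        for pid, deadline in list(deadlines.items()):
            if has_exited(pid):
                results[pid] = 'sigterm'
            elif time.monotonic() >= deadline:
                try:
                    os.kill(pid, signal.SIGKILL)
                    results[pid] = 'sigkill'
                except ProcessLookupError:
                    results[pid] = 'sigterm'  # exited just before the kill
            else:
                continue
            del deadlines[pid]
        if deadlines:
            time.sleep(poll_interval)
    return results
```

Sending SIGTERM to all processes before polling means the timeouts run concurrently, so total cleanup time is bounded by the longest timeout and not by the sum of all of them.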

- Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js. These are now the single source of truth for path calculations
- Update chrome_test_helpers.py with call_chrome_utils() dispatcher
- Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers
- Update cleanup_chrome and kill_chromium_session to use JS killChrome
- Remove unused Chrome binary search lists from singlefile hook (~25 lines)
- Update readability, mercury, favicon, title tests to use shared helpers

Added 10 practical examples demonstrating the JSONL piping architecture:
1. Basic archive with auto-cascade
2. Retry failed extractions (by status, plugin, domain)
3. Pinboard bookmark import with jq
4. GitHub repo filtering with jq regex
5. Selective extraction (screenshots only)
6. Bulk tag management
7. Deep documentation crawling
8. RSS feed monitoring
9. Archive audit with jq aggregation
10. Incremental backup with diff

Also added auto-cascade principle: `archivebox run` automatically creates Snapshots from Crawls and ArchiveResults from Snapshots, so intermediate commands are only needed for customization.

Extended graceful_terminate_background_hooks() to:
- Reap processes with os.waitpid() to get exit codes
- Write returncode to .returncode file for update_from_output()
- Return detailed result dict with status, returncode, and pid

Updated update_from_output() to:
- Read .returncode and .stderr.log files
- Determine status from returncode if no ArchiveResult JSONL record
- Include stderr in output_str for failed hooks
- Handle signal termination (negative returncodes like -9 for SIGKILL)
- Clean up .returncode files along with other hook output files
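How a returncode might be turned into a status can be sketched as follows. The helper name `describe_returncode` and the exact status strings are assumptions; only the convention that a negative returncode means "terminated by that signal number" comes from the commit message (and matches Python's subprocess convention).

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Map a process returncode to a human-readable status (hypothetical wording)."""
    if returncode == 0:
        return 'succeeded'
    if returncode < 0:
        # Negative values encode the terminating signal, e.g. -9 for SIGKILL.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f'signal {-returncode}'
        return f'failed (terminated by {name})'
    return f'failed (exit code {returncode})'


for code in (0, 1, -9, -15):
    print(code, describe_returncode(code))
```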

- get_machine_type() matches JS getMachineType()
- get_lib_dir() matches JS getLibDir()
- get_node_modules_dir() matches JS getNodeModulesDir()
- get_extensions_dir() matches JS getExtensionsDir()
- find_chromium() matches JS findChromium()
- kill_chrome() matches JS killChrome()
- get_test_env() matches JS getTestEnv()

All functions now try JS first (single source of truth) with Python fallback. Added backward compatibility aliases for old names.

Tags now support full unicode with no restrictions. URL-encode the tag name wherever it previously used the slug (export filenames, lookups).
- Remove `slug` field, `_generate_unique_slug`, and slug handling in save()
- Add migration 0034 to drop the slug column
- `get_tag_by_ref` now resolves by URL-decoded exact name match
- Tag search/autocomplete/export filenames use the name directly
- Drop slug from admin search_fields/readonly_fields/fieldsets
- Remove slug display from similar-tag cards and client download filename

Applies pirate's review suggestion on PR #1789: mark the Content-Disposition filename encoding as a known-rough approach that could be hardened further (strip punctuation, convert to ASCII equivalents) in a follow-up.

Addresses review feedback from cubic and devin: quote()'s percent-encoding isn't decoded by browsers in Content-Disposition's filename parameter (Safari saves literal %20). Switch to Django's slugify(), which applies NFKD normalization, drops characters that have no ASCII equivalent, strips punctuation, and collapses whitespace into hyphens, producing clean names like "tag-alpha-research-urls.txt".
- Add tag_filename_safe(name) helper wrapping slugify
- Use it in both tag export endpoints
- Drop the now-unneeded JS fallback name (server always sets Content-Disposition)
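The difference between the two filename strategies can be shown with the standard library alone. The `ascii_slug` function below is a stand-in that approximates `django.utils.text.slugify` with its default ASCII-only behavior; it is an illustrative assumption, not the project's helper, and the tag names are made up.

```python
import re
import unicodedata
from urllib.parse import quote


def ascii_slug(name: str) -> str:
    """Approximate Django's slugify(): NFKD-normalize, keep ASCII, hyphenate."""
    normalized = unicodedata.normalize('NFKD', name)
    ascii_only = normalized.encode('ascii', 'ignore').decode('ascii')
    cleaned = re.sub(r'[^\w\s-]', '', ascii_only.lower())  # strip punctuation
    return re.sub(r'[-\s]+', '-', cleaned).strip('-_')     # whitespace -> hyphen


tag_name = 'Alpha Research'
# Percent-encoding leaves '%20' in the name, which some browsers save literally.
print(f'tag-{quote(tag_name)}-urls.txt')
# Slugifying yields a plain ASCII name that needs no decoding.
print(f'tag-{ascii_slug(tag_name)}-urls.txt')
```

One trade-off of slugifying: a name made up entirely of non-ASCII characters reduces to an empty string, so distinct tags can end up with the same filename.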

Replaces the tag_filename_safe() helper with a Tag.slug property that returns the slugified form via django.utils.text.slugify. Call sites now just use tag.slug directly.

Address pirate's review: restore the slug in the client-side download fallback filename. Expose tag.slug as data-slug on the card element and in the search card schema so the JS can read it directly without slugifying client-side.

## Summary
This PR removes the `slug` field from the Tag model and all related slug generation logic. Tags are now identified and referenced by their name instead of a generated slug, simplifying the data model and reducing complexity.

## Related issues
N/A

## Changes these areas
- [x] Internal architecture
- [x] Snapshot data layout on disk

## Details
### What changed
1. **Model changes**: Removed the `slug` field from the Tag model, including the `_generate_unique_slug()` method and slug generation logic in the `save()` method
2. **Database migration**: Added migration `0034_remove_tag_slug` to drop the slug column
3. **API updates**: Removed `slug` from all API schemas (TagSchema, TagSearchCardSchema, TagUpdateResponseSchema) and responses
4. **Tag lookup**: Updated `get_tag_by_ref()` to use URL-decoded tag names instead of slugs for lookups
5. **Tag filtering**: Simplified `get_matching_tags()` to only filter by name instead of both name and slug
6. **Export filenames**: Changed tag export filenames to use `quote(tag.name)` instead of `tag.slug`
7. **Admin interface**: Removed slug from TagAdmin search fields, readonly fields, and fieldsets
8. **Templates**: Removed slug display from tag cards and similar tags UI
9. **Tests**: Updated test expectations and removed slug assertions; updated export filename checks to use `quote(tag.name)`

### Why
This simplifies the Tag model by removing the derived slug field. Tags can be uniquely identified by their name, and URL encoding handles special characters in filenames and URLs. This reduces database complexity and eliminates the need for slug generation and uniqueness logic.

## Test Plan
Existing tests have been updated to verify the new behavior:
- `test_tag_rename_api_updates_name` verifies tag renaming works without slug
- `test_tag_snapshots_export_returns_jsonl` and `test_tag_urls_export_returns_plain_text_urls` verify export filenames use encoded tag names
- `test_tag_table_has_required_columns` verifies the database schema no longer includes slug

All related tests pass with the updated assertions.

https://claude.ai/code/session_014KmEXoA64Ayp2t8BW2xfVP

---
## Summary by cubic
Removed the stored `slug` from Tag and moved to name-based tags. Added a derived `Tag.slug` via `django.utils.text.slugify` for clean export filenames and an admin download fallback; public APIs no longer include slugs and lookups resolve by URL-decoded exact name.

- **Refactors**
  - Replaced stored slug with a derived `Tag.slug` property; removed slug generation/save logic.
  - Public API schemas and autocomplete drop `slug`; matching/filtering uses `name` only.
  - `get_tag_by_ref` resolves by URL-decoded `name` (case-insensitive exact match).
  - Export endpoints set filenames using `tag.slug`; admin tag cards expose `data-slug`, and the client uses it as a fallback filename. Removed slug from admin search fields/fieldsets and UI displays.
- **Migration**
  - Run database migrations.
  - Update any consumers expecting `slug` in Tag API/admin; use the tag `name` for references (URL-encode names in links). Rely on server-provided filenames, with the built-in client fallback using `tag.slug` where needed.

Written for commit 7c3a3e0. Summary will update on new commits.
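The name-based lookup can be sketched without the ORM. This is an illustration of "URL-decode the reference, then match the name exactly but case-insensitively", not the real `get_tag_by_ref()`: the function name `find_tag_by_ref`, the in-memory list of names, and the use of `casefold()` for case-insensitivity are assumptions.

```python
from typing import List, Optional
from urllib.parse import quote, unquote


def find_tag_by_ref(ref: str, tag_names: List[str]) -> Optional[str]:
    """Resolve a URL-encoded tag reference to a stored tag name (hypothetical)."""
    wanted = unquote(ref).casefold()
    for name in tag_names:
        if name.casefold() == wanted:  # exact match, ignoring case only
            return name
    return None


names = ['Research', '研究 notes']
link_ref = quote('研究 notes')  # how a name would appear inside a URL
print(link_ref, '->', find_tag_by_ref(link_ref, names))
```

Since names can now contain any unicode, links must carry the percent-encoded form and the server must decode it before matching, as the round trip above shows.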

Signed-off-by: Nick Sweeting <git@sweeting.me>

## Summary
- add a vanilla HTML/CSS landing page under repo-root `publicsite/`
- keep the existing ArchiveBox logo and custom domain CNAME in the Pages artifact
- use the light-mode ArchiveBox design tokens with no dark-mode CSS
- update the GitHub Pages workflow to deploy `./publicsite` directly without Jekyll
- remove the old top-level `website/` tree and duplicate Jekyll Pages workflow

## Validation
- `ruby -e "require 'yaml'; YAML.load_file('.github/workflows/gh-pages.yml')"`
- parsed `publicsite/index.html` with Python `HTMLParser`
- served `publicsite` locally and verified `/`, `styles.css`, `icon.png`, and `CNAME` return 200

Signed-off-by: Nick Sweeting <git@sweeting.me>

pull Bot added the "⤵️ pull" and "merge-conflict" (Resolve conflicts manually) labels May 21, 2022

github-advanced-security AI found potential problems Nov 2, 2022

pirate force-pushed the dev branch from 19e9c1c to 1773146 Compare January 19, 2024 11:47

pirate force-pushed the dev branch from b6107ec to af669d2 Compare April 25, 2024 12:55

pirate force-pushed the dev branch from 603f87e to bf073b0 Compare October 5, 2024 10:19

pirate and others added 24 commits December 31, 2025 02:31

Apply suggestion from @cubic-dev-ai[bot]

f7b186d

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

Update TODO_process_tracking.md

3ae9410

Merge branch 'dev' into claude/refactor-process-management-WcQyZ

dfe6841

fix cubic comments

84a4fb0

tweak comment

65b93d5

tweak comment

29eb628

Refactor test_chrome.py to use shared helpers

ef92a99

- Add get_machine_type() to chrome_test_helpers.py
- Update get_test_env() to include MACHINE_TYPE
- Refactor test_chrome.py to import from shared helpers
- Removes ~50 lines of duplicate code

Add Chrome CDP integration tests for singlefile

7d74dd9

- Import shared Chrome test helpers
- Add test_singlefile_with_chrome_session() to verify CDP connection
- Add test_singlefile_disabled_skips() for config testing
- Update existing test to use get_test_env()

pirate and others added 30 commits April 4, 2026 23:11

ignore outfiles

1c6b782

small fixes

4d66996

symlink lock_pkgs to setup monorepo script

f126c6e

split tag editor issue

0b9b3b7

update dev instructions

2cc5a11

rename abxpkg

f128751

rename abx-pkg to abxpkg

ee56853

rename abx-pkg to abxpkg

7d8c468

fix monorepo script

b68ff3e

Add TODO on tag export filename encoding

b83e2de

Applies pirate's review suggestion on PR #1789: mark the Content-Disposition filename encoding as a known-rough approach that could be hardened further (strip punctuation, convert to ASCII equivalents) in a follow-up.

Move tag slug logic onto Tag.slug @property

2ea66d0

Replaces the tag_filename_safe() helper with a Tag.slug property that returns the slugified form via django.utils.text.slugify. Call sites now just use tag.slug directly.

Put tag slug back in JS download filename

7c3a3e0

Address pirate's review: restore the slug in the client-side download fallback filename. Expose tag.slug as data-slug on the card element and in the search card schema so the JS can read it directly without slugifying client-side.

Add static ArchiveBox landing page

2c1700a

Signed-off-by: Nick Sweeting <git@sweeting.me>

public site tweaks

e013817

Signed-off-by: Nick Sweeting <git@sweeting.me>

Rename publicsite Pages workflow

163e9bd

Signed-off-by: Nick Sweeting <git@sweeting.me>

Update publicsite configuration header

ca7eeb7

Signed-off-by: Nick Sweeting <git@sweeting.me>

Update publicsite intro header

abc987c

Signed-off-by: Nick Sweeting <git@sweeting.me>

Tighten publicsite hero header

35d630b

Signed-off-by: Nick Sweeting <git@sweeting.me>

Fix publicsite hero typo

fc3682a

Signed-off-by: Nick Sweeting <git@sweeting.me>

Update publicsite source header

4804ad3

Signed-off-by: Nick Sweeting <git@sweeting.me>

Refine publicsite hero heading

4fef401

Signed-off-by: Nick Sweeting <git@sweeting.me>

Align publicsite hero and nav with design system

166a161

Signed-off-by: Nick Sweeting <git@sweeting.me>

Add README shields to publicsite hero

9c71acc

Signed-off-by: Nick Sweeting <git@sweeting.me>

tweaks

166dcd5

Signed-off-by: Nick Sweeting <git@sweeting.me>

more tweaks

9b8f00f

Signed-off-by: Nick Sweeting <git@sweeting.me>

Link publicsite capability chips

caba6e4

Signed-off-by: Nick Sweeting <git@sweeting.me>

pull[bot] wants to merge 4064 commits into mrbenns:dev from ArchiveBox:dev

5 participants
