[pull] dev from ArchiveBox:dev#1

Open
pull[bot] wants to merge 4064 commits into mrbenns:dev from ArchiveBox:dev

Conversation

pull[bot] commented on May 21, 2022

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

pull[bot] added the "⤵️ pull" and "merge-conflict" (Resolve conflicts manually) labels on May 21, 2022
Comment thread archivebox/main.py Outdated
'youtube_dl',
], capture_output=True, text=True, cwd=out_dir).stdout.split('Location: ')[-1].split('\n', 1)[0]
NEW_YOUTUBEDL_BINARY = Path(pkg_path) / 'youtube_dl' / '__main__.py'
os.chmod(NEW_YOUTUBEDL_BINARY, 0o777)

Check failure

Code scanning / CodeQL

Overly permissive file permissions

Overly permissive mask in chmod sets file to world writable.
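
The alert fires because `0o777` grants write access to every user on the system. A minimal sketch of a narrower alternative, assuming the only goal is to make the script executable: keep the existing permission bits, add the execute bits, and strip group/world write. The helper name `make_executable` is hypothetical and not part of ArchiveBox.

```python
import os
import stat
from pathlib import Path


def make_executable(path: Path) -> int:
    """Add execute bits to `path` without leaving it group- or world-writable.

    Hypothetical helper for illustration; returns the mode that was applied.
    """
    current = stat.S_IMODE(path.stat().st_mode)
    # Add u+x, g+x, o+x on top of whatever bits are already set ...
    with_exec = current | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    # ... then remove g+w and o+w so the result is never world-writable.
    new_mode = with_exec & ~(stat.S_IWGRP | stat.S_IWOTH)
    os.chmod(path, new_mode)
    return new_mode
```

With this approach a file that started at `0o644` or `0o666` ends up at `0o755`, which is usually enough for a script that only needs to be run.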
Comment thread archivebox/core/views.py Outdated
if PUBLIC_INDEX:
    return redirect('/public')

return redirect(f'/admin/login/?next={request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
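
One way to address this class of warning is to accept only same-site paths for the `next` parameter and to percent-encode the value before putting it in the query string. The sketch below uses only the standard library; `is_local_path` and `login_redirect_url` are hypothetical names, and falling back to `/` is an assumption rather than ArchiveBox's behavior. Django also ships `django.utils.http.url_has_allowed_host_and_scheme` for the validation step.

```python
from urllib.parse import quote, urlsplit


def is_local_path(target: str) -> bool:
    """True only for same-site absolute paths such as '/archive/123/'."""
    # Reject relative paths, protocol-relative URLs ('//host') and backslash tricks.
    if not target.startswith("/") or target.startswith("//") or "\\" in target:
        return False
    parts = urlsplit(target)
    return not parts.scheme and not parts.netloc


def login_redirect_url(requested_path: str, login_url: str = "/admin/login/") -> str:
    """Build the login URL; non-local targets fall back to '/' (an assumption)."""
    next_path = requested_path if is_local_path(requested_path) else "/"
    # Percent-encode so '?', '&' and '=' in the path cannot inject extra parameters.
    return f"{login_url}?next={quote(next_path, safe='/')}"
```

For example, `login_redirect_url("/archive/1/")` yields `/admin/login/?next=/archive/1/`, while a protocol-relative target such as `//evil.example/x` is replaced by `/`.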
Comment thread archivebox/core/views.py Outdated

def get(self, request, path):
    if not request.user.is_authenticated and not PUBLIC_SNAPSHOTS:
        return redirect(f'/admin/login/?next={request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
Comment thread archivebox/core/views.py Outdated

# missing trailing slash -> redirect to index
if '/' not in path:
    return redirect(f'{path}/index.html')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
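
Here the redirect target is built directly from the URL path, so a value like `javascript:alert(1)` contains no `/` and would pass the check. A hedged sketch of one mitigation: only redirect when the path is a single plain segment. The allowed character set is an assumption about what snapshot identifiers look like, and `index_redirect_target` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: snapshot identifiers consist only of these characters.
_SAFE_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


def index_redirect_target(path: str) -> Optional[str]:
    """Return '<path>/index.html' for a plain path segment, else None."""
    if path in {".", ".."} or not _SAFE_SEGMENT.fullmatch(path):
        return None  # caller should respond with 404 instead of redirecting
    return f"{path}/index.html"
```

Returning `None` rather than raising leaves the choice of error response to the view.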
Comment thread archivebox/core/views.py Outdated
    response = super().get(*args, **kwargs)
    return response
else:
    return redirect(f'/admin/login/?next={self.request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
Comment thread archivebox/core/admin.py Outdated

def add_view(self, request):
    if not request.user.is_authenticated:
        return redirect(f'/admin/login/?next={request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
pirate and others added 24 commits December 31, 2025 02:31
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now each
hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files
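
The naming scheme above can be sketched roughly as follows. This assumes the prefix is the hook script's filename without its final extension; `hook_output_paths` is a hypothetical helper, not the actual code in `hooks.py`.

```python
from __future__ import annotations

from pathlib import Path


def hook_output_paths(output_dir: Path, hook_script: Path) -> dict[str, Path]:
    """Derive per-hook log/pid/cmd paths so hooks sharing a directory don't collide."""
    # 'on_Snapshot__20_chrome_tab.bg.js' -> 'on_Snapshot__20_chrome_tab.bg'
    hook_name = hook_script.stem
    return {
        "stdout": output_dir / f"{hook_name}.stdout.log",
        "stderr": output_dir / f"{hook_name}.stderr.log",
        "pid": output_dir / f"{hook_name}.pid",
        "cmd": output_dir / f"{hook_name}.sh",
    }
```

Because every path shares the hook name as a prefix, cleanup code can find all files belonging to one hook with a single glob on that prefix.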

<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout,
stderr, pid, and cmd filenames. This fixes mixed logs and ensures
correct cleanup and status checks when multiple hooks run in the same
plugin directory.

- **Bug Fixes**
- hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude
them from new_files; derive cmd.sh from pid for safe kill.
- core/models.py: read hook-specific logs; exclude hook output files
when computing outputs; cleanup and background detection use *.pid.
- Plugins: stop writing redundant hook.pid files; minor chrome utils
cleanup.

<sup>Written for commit 754b096.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
Simplifies the comma-separated parsing logic to:
- If value contains '[', parse as JSON array
- Otherwise, parse as comma-separated values

This prevents incorrect splitting of arguments containing internal commas
when there's only one argument. For arguments with commas, users should
use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]'

Also exports getEnvArray in module.exports for consistency.
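
The rule described above, rendered as a small Python sketch for illustration (the real `getEnvArray` lives in JavaScript). The name `parse_env_array`, the whitespace trimming, and raising on malformed JSON are assumptions.

```python
from __future__ import annotations

import json


def parse_env_array(value: str | None) -> list[str]:
    """Parse an env var as a JSON array if it contains '[', else as comma-separated."""
    value = (value or "").strip()
    if not value:
        return []
    if "[" in value:
        parsed = json.loads(value)  # raises ValueError on malformed JSON
        if not isinstance(parsed, list):
            raise ValueError("expected a JSON array")
        return [str(item) for item in parsed]
    # Plain form: split on commas and drop empty entries.
    return [part.strip() for part in value.split(",") if part.strip()]
```

With this rule, `'["--arg1,val", "--arg2"]'` keeps the comma inside the first argument, while `'--a, --b'` splits into two arguments.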

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
…ling logic on model methods (#1734)


<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Added an implementation plan to centralize subprocess handling on the
machine.Process model. It covers process hierarchy, Process.current(),
safe lifecycle methods (launch/kill/wait), PID reuse protection, and
phased changes across hooks, workers, CLI, migrations, and admin.

<sup>Written for commit 3ae9410.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
…#1735)

Comprehensive plan for implementing JSONL-based CLI piping:
- Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix)
- Phase 2: Extract shared apply_filters() to cli_utils.py
- Phase 3: Implement pass-through behavior for all create commands
- Phase 4-6: Test infrastructure with pytest-django, unit/integration
tests

Key changes from original plan:
- ArchiveResult.from_json() identified as missing prerequisite
- Pass-through documented as new feature to implement
- archivebox run updated to create-or-update pattern
- conftest.py redesigned to use pytest-django with isolated tmp_path
- Standardized on tags_str field name across all models
- Reordered phases: implement before test

This change consolidates duplicated logic between chrome_utils.js and
extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path calculation
- Add installExtensionWithCache() to handle extension install + cache workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock, istilldontcareaboutcookies,
  twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining same functionality.
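
The signature of `chrome_session()` is not shown in this commit message, so the following is only a generic sketch of the context-manager pattern it describes: run a setup step, hand the session to the test, and always run cleanup. `managed_session` and its `setup`/`cleanup` parameters are hypothetical placeholders.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def managed_session(setup: Callable[[], T], cleanup: Callable[[T], None]) -> Iterator[T]:
    """Yield whatever `setup` returns and guarantee `cleanup` runs afterwards."""
    session = setup()
    try:
        yield session
    finally:
        # Runs even if the test body raises, so no browser process is leaked.
        cleanup(session)
```

A test would then wrap its body in `with managed_session(start, stop) as session:` instead of pairing setup and teardown calls by hand.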
- Update Crawl.output_dir_parent to use username instead of user_id
  for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging:
  users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
  can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
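
A rough sketch of how the `users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/` layout could be assembled. The function name, its parameters, and the `"unknown"` fallback for URLs without a hostname are assumptions, not the actual `Crawl.output_dir_parent` implementation.

```python
from datetime import date
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def crawl_output_dir(username: str, created: date, first_url: str, crawl_id: str) -> PurePosixPath:
    """Build users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}."""
    # Use the hostname of the first URL; fall back to a placeholder (assumption).
    domain = urlsplit(first_url).hostname or "unknown"
    return (
        PurePosixPath("users")
        / username
        / "crawls"
        / created.strftime("%Y%m%d")
        / domain
        / crawl_id
    )
```

For example, a crawl by `alice` on 2025-12-31 starting at `https://example.com/docs` maps to `users/alice/crawls/20251231/example.com/<crawl_id>`.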
- Add setup_test_env, launch_chromium_session, kill_chromium_session
  to chrome_test_helpers.py for extension tests
- Add chromium_session context manager for cleaner test code
- Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use
  shared helpers (~450 lines removed)
- Refactor screenshot, dom, pdf tests to use shared get_test_env
  and get_lib_dir (~60 lines removed)
- Net reduction: 228 lines of duplicate code
- Add get_machine_type() to chrome_test_helpers.py
- Update get_test_env() to include MACHINE_TYPE
- Refactor test_chrome.py to import from shared helpers
- Removes ~50 lines of duplicate code
- Import shared Chrome test helpers
- Add test_singlefile_with_chrome_session() to verify CDP connection
- Add test_singlefile_disabled_skips() for config testing
- Update existing test to use get_test_env()
New helpers in chrome_test_helpers.py:
- get_plugin_dir(__file__) - get plugin dir from test file path
- get_hook_script(dir, pattern) - find hook script by glob pattern
- run_hook() - run hook script and return (returncode, stdout, stderr)
- parse_jsonl_output() - parse JSONL from hook output
- run_hook_and_parse() - convenience combo of above two
- LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants
- _LazyPath class for deferred path resolution

Updated test files to use simpler patterns:
- screenshot/tests/test_screenshot.py
- dom/tests/test_dom.py
- pdf/tests/test_pdf.py
- singlefile/tests/test_singlefile.py

Before: PLUGIN_DIR = Path(__file__).parent.parent
After:  PLUGIN_DIR = get_plugin_dir(__file__)

Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
After:  from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
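
To make the JSONL-parsing helper concrete, here is a minimal sketch of what such a function might do, assuming hooks print one JSON object per line mixed with ordinary log output. `parse_jsonl_records` is a hypothetical name; the real `parse_jsonl_output()` may filter or return records differently.

```python
from __future__ import annotations

import json


def parse_jsonl_records(stdout: str) -> list[dict]:
    """Collect every line of `stdout` that parses as a JSON object."""
    records: list[dict] = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain log line, not a JSONL record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not valid
        if isinstance(record, dict):
            records.append(record)
    return records
```

Skipping unparseable lines keeps tests robust when a hook writes progress messages to stdout alongside its records.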
Changed Snapshot.cleanup() to gracefully terminate background hooks:
1. Send SIGTERM to all background hook processes first
2. Wait up to each hook's plugin-specific timeout
3. Send SIGKILL only to hooks still running after their timeout

Added graceful_terminate_background_hooks() function in hooks.py that:
- Collects all .pid files from output directory
- Validates process identity using mtime
- Sends SIGTERM to all valid processes in phase 1
- Polls each process for up to its plugin-specific timeout
- Sends SIGKILL as last resort if timeout expires
- Returns status for each hook (sigterm/sigkill/already_dead/invalid)
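
The two-phase shutdown described above can be sketched as follows (POSIX only). This is not the code in `hooks.py`: it takes PIDs and timeouts directly, omits the `.pid` file collection and mtime validation, and the names `graceful_terminate` and `_is_alive` are hypothetical.

```python
from __future__ import annotations

import os
import signal
import time


def _is_alive(pid: int) -> bool:
    """Check whether `pid` still exists, reaping it first if it is our child."""
    try:
        reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
        return reaped_pid != pid  # (0, 0) means the child is still running
    except ChildProcessError:
        pass  # not our child (or already reaped): probe with signal 0 instead
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists but belongs to another user
    return True


def graceful_terminate(timeouts: dict[int, float], poll_interval: float = 0.1) -> dict[int, str]:
    """SIGTERM every pid first, then SIGKILL only those that outlive their timeout."""
    results: dict[int, str] = {}
    # Phase 1: ask all processes to exit before waiting on any of them.
    for pid in timeouts:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            results[pid] = "already_dead"
    # Phase 2: poll each process up to its own timeout, then escalate.
    for pid, timeout in timeouts.items():
        if pid in results:
            continue
        deadline = time.monotonic() + timeout
        alive = _is_alive(pid)
        while alive and time.monotonic() < deadline:
            time.sleep(poll_interval)
            alive = _is_alive(pid)
        if not alive:
            results[pid] = "sigterm"
            continue
        try:
            os.kill(pid, signal.SIGKILL)
            results[pid] = "sigkill"
        except ProcessLookupError:
            results[pid] = "sigterm"  # exited between the last poll and the kill
    return results
```

Sending SIGTERM to every process before waiting means the hooks shut down in parallel, so total cleanup time is bounded by the slowest hook rather than the sum of all timeouts.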
- Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js
  These are now the single source of truth for path calculations
- Update chrome_test_helpers.py with call_chrome_utils() dispatcher
- Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers
- Update cleanup_chrome and kill_chromium_session to use JS killChrome
- Remove unused Chrome binary search lists from singlefile hook (~25 lines)
- Update readability, mercury, favicon, title tests to use shared helpers
Added 10 practical examples demonstrating the JSONL piping architecture:
1. Basic archive with auto-cascade
2. Retry failed extractions (by status, plugin, domain)
3. Pinboard bookmark import with jq
4. GitHub repo filtering with jq regex
5. Selective extraction (screenshots only)
6. Bulk tag management
7. Deep documentation crawling
8. RSS feed monitoring
9. Archive audit with jq aggregation
10. Incremental backup with diff

Also added auto-cascade principle: `archivebox run` automatically
creates Snapshots from Crawls and ArchiveResults from Snapshots,
so intermediate commands are only needed for customization.
Extended graceful_terminate_background_hooks() to:
- Reap processes with os.waitpid() to get exit codes
- Write returncode to .returncode file for update_from_output()
- Return detailed result dict with status, returncode, and pid

Updated update_from_output() to:
- Read .returncode and .stderr.log files
- Determine status from returncode if no ArchiveResult JSONL record
- Include stderr in output_str for failed hooks
- Handle signal termination (negative returncodes like -9 for SIGKILL)
- Clean up .returncode files along with other hook output files
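
As an illustration of the returncode handling described above, here is a small sketch that maps a returncode to a status and a human-readable reason, treating negative values as signal terminations (the convention Python's `subprocess` uses on POSIX). The status strings and the name `describe_returncode` are assumptions, not the values used by `update_from_output()`.

```python
from __future__ import annotations

import signal


def describe_returncode(returncode: int) -> tuple[str, str]:
    """Map a process returncode to (status, reason)."""
    if returncode == 0:
        return "succeeded", ""
    if returncode < 0:
        # Negative values encode the signal number, e.g. -9 for SIGKILL.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return "failed", f"terminated by {name}"
    return "failed", f"exited with code {returncode}"
```

The reason string could then be combined with the captured stderr when building the failure message for a hook that produced no JSONL record.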
- get_machine_type() matches JS getMachineType()
- get_lib_dir() matches JS getLibDir()
- get_node_modules_dir() matches JS getNodeModulesDir()
- get_extensions_dir() matches JS getExtensionsDir()
- find_chromium() matches JS findChromium()
- kill_chrome() matches JS killChrome()
- get_test_env() matches JS getTestEnv()

All functions now try JS first (single source of truth) with Python fallback.
Added backward compatibility aliases for old names.
pirate and others added 30 commits April 4, 2026 23:11
Tags now support full unicode with no restrictions. URL-encode the tag
name wherever it previously used the slug (export filenames, lookups).

- Remove `slug` field, `_generate_unique_slug`, and slug handling in save()
- Add migration 0034 to drop the slug column
- `get_tag_by_ref` now resolves by URL-decoded exact name match
- Tag search/autocomplete/export filenames use the name directly
- Drop slug from admin search_fields/readonly_fields/fieldsets
- Remove slug display from similar-tag cards and client download filename
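
A minimal sketch of name-based lookup with URL decoding, using a plain list of names in place of the database. The function name `resolve_tag_name` is hypothetical, and the case-insensitive comparison follows the generated summary later in this PR rather than anything stated in this commit message.

```python
from typing import Iterable, Optional
from urllib.parse import unquote


def resolve_tag_name(ref: str, existing_names: Iterable[str]) -> Optional[str]:
    """Return the stored tag name matching a URL-encoded reference, or None."""
    # '%20' -> ' ', '%C3%A9' -> 'é', and so on.
    wanted = unquote(ref).casefold()
    for name in existing_names:
        if name.casefold() == wanted:
            return name
    return None
```

Because the whole decoded name must match, a reference such as `caf%C3%A9%20notes` resolves to the tag named "Café Notes" but not to "Café Notes 2".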
Applies pirate's review suggestion on PR #1789: mark the
Content-Disposition filename encoding as a known-rough approach
that could be hardened further (strip punctuation, convert to
ASCII equivalents) in a follow-up.
Addresses review feedback from cubic and devin: quote()'s percent-
encoding isn't decoded by browsers in Content-Disposition's filename
parameter (Safari saves literal %20). Switch to Django's slugify()
which does NFKD normalization, ASCII transliteration, and replaces
punctuation with hyphens — producing clean names like
"tag-alpha-research-urls.txt".

- Add tag_filename_safe(name) helper wrapping slugify
- Use it in both tag export endpoints
- Drop the now-unneeded JS fallback name (server always sets
  Content-Disposition)
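
For readers without Django at hand, the normalization can be approximated with the standard library. This sketch follows the same general steps (NFKD normalization, dropping non-ASCII, collapsing whitespace and hyphens) but is not Django's `slugify()` itself. The `tag-<slug>-urls.txt` pattern is inferred from the example filename above, and the `untitled` fallback is an assumption.

```python
import re
import unicodedata


def ascii_slug(name: str) -> str:
    """Approximate an ASCII slug: 'Café & Crème!' -> 'cafe-creme'."""
    # Decompose accented characters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, keeping word characters, whitespace and hyphens.
    text = re.sub(r"[^\w\s-]", "", text.lower())
    # Collapse runs of whitespace/hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", text).strip("-_")


def export_filename(tag_name: str, kind: str = "urls", ext: str = "txt") -> str:
    """Build a Content-Disposition-safe filename for a tag export."""
    slug = ascii_slug(tag_name) or "untitled"  # fallback is an assumption
    return f"tag-{slug}-{kind}.{ext}"
```

One caveat this makes visible: a tag name written entirely in a non-Latin script reduces to an empty slug, so some fallback is needed to avoid filenames like `tag--urls.txt`.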
Replaces the tag_filename_safe() helper with a Tag.slug property
that returns the slugified form via django.utils.text.slugify.
Call sites now just use tag.slug directly.
Address pirate's review: restore the slug in the client-side
download fallback filename. Expose tag.slug as data-slug on the
card element and in the search card schema so the JS can read it
directly without slugifying client-side.
## Summary

This PR removes the `slug` field from the Tag model and all related slug
generation logic. Tags are now identified and referenced by their name
instead of a generated slug, simplifying the data model and reducing
complexity.

## Related issues

N/A

## Changes these areas

- [x] Internal architecture
- [x] Snapshot data layout on disk

## Details

### What changed

1. **Model changes**: Removed the `slug` field from the Tag model,
including the `_generate_unique_slug()` method and slug generation logic
in the `save()` method
2. **Database migration**: Added migration `0034_remove_tag_slug` to
drop the slug column
3. **API updates**: Removed `slug` from all API schemas (TagSchema,
TagSearchCardSchema, TagUpdateResponseSchema) and responses
4. **Tag lookup**: Updated `get_tag_by_ref()` to use URL-decoded tag
names instead of slugs for lookups
5. **Tag filtering**: Simplified `get_matching_tags()` to only filter by
name instead of both name and slug
6. **Export filenames**: Changed tag export filenames to use
`quote(tag.name)` instead of `tag.slug`
7. **Admin interface**: Removed slug from TagAdmin search fields,
readonly fields, and fieldsets
8. **Templates**: Removed slug display from tag cards and similar tags
UI
9. **Tests**: Updated test expectations and removed slug assertions;
updated export filename checks to use `quote(tag.name)`

### Why

This simplifies the Tag model by removing the derived slug field. Tags
can be uniquely identified by their name, and URL encoding handles
special characters in filenames and URLs. This reduces database
complexity and eliminates the need for slug generation and uniqueness
logic.

## Test Plan

Existing tests have been updated to verify the new behavior:
- `test_tag_rename_api_updates_name` verifies tag renaming works without
slug
- `test_tag_snapshots_export_returns_jsonl` and
`test_tag_urls_export_returns_plain_text_urls` verify export filenames
use encoded tag names
- `test_tag_table_has_required_columns` verifies the database schema no
longer includes slug

All related tests pass with the updated assertions.

https://claude.ai/code/session_014KmEXoA64Ayp2t8BW2xfVP

<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Removed the stored `slug` from Tag and moved to name-based tags. Added a
derived `Tag.slug` via `django.utils.text.slugify` for clean export
filenames and an admin download fallback; public APIs no longer include
slugs and lookups resolve by URL-decoded exact name.

- **Refactors**
- Replaced stored slug with a derived `Tag.slug` property; removed slug
generation/save logic.
- Public API schemas and autocomplete drop `slug`; matching/filtering
uses `name` only.
- `get_tag_by_ref` resolves by URL-decoded `name` (case-insensitive
exact match).
- Export endpoints set filenames using `tag.slug`; admin tag cards
expose `data-slug`, and the client uses it as a fallback filename.
Removed slug from admin search fields/fieldsets and UI displays.

- **Migration**
  - Run database migrations.
- Update any consumers expecting `slug` in Tag API/admin; use the tag
`name` for references (URL-encode names in links). Rely on
server-provided filenames, with the built-in client fallback using
`tag.slug` where needed.

<sup>Written for commit 7c3a3e0.
Summary will update on new commits.</sup>

<!-- End of auto-generated description by cubic. -->
Signed-off-by: Nick Sweeting <git@sweeting.me>
## Summary
- add a vanilla HTML/CSS landing page under repo-root `publicsite/`
- keep the existing ArchiveBox logo and custom domain CNAME in the Pages
artifact
- use the light-mode ArchiveBox design tokens with no dark-mode CSS
- update the GitHub Pages workflow to deploy `./publicsite` directly
without Jekyll
- remove the old top-level `website/` tree and duplicate Jekyll Pages
workflow

## Validation
- `ruby -e "require 'yaml';
YAML.load_file('.github/workflows/gh-pages.yml')"`
- parsed `publicsite/index.html` with Python `HTMLParser`
- served `publicsite` locally and verified `/`, `styles.css`,
`icon.png`, and `CNAME` return 200
Signed-off-by: Nick Sweeting <git@sweeting.me>
5 participants