Dropbox Rclone Corpus Sync Feature #80
Merged
Detailed plan for an out-of-band, rclone-based Dropbox -> data/corpus sync as a standalone sync/ package (python -m sync.cli), adapted from the PsyClaw sync v1.0 prior art and re-validated against CyClaw main. Covers: transport decision (rclone vs Maestral/SDK/dbxcli), pull-vs-bisync safety, hardened filter denylist, audit schema reusing utils.logger.audit_log, SyncError hierarchy under RAGError, additive config.yaml sync: block, reindex-on-change exit-code contract, cross-platform out-of-band scheduling, security/CI compliance (Ruff S/Bandit, CodeQL, DevSkim, OSV, coverage), fully-mocked test plan, phased PR delivery, risk register, and a demonstration that all eight security invariants are preserved. Planning only - no functional code in this change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
…sync plan
Adds §16 (three-role split: A=foundation, B=runner/CLI, C=scheduler/docs; A runs solo first, B+C parallelize against a frozen contract), Appendix C (frozen public surfaces for errors/config/filters/scheduler/runner so the parallel roles cannot clash), Appendix D (test isolation via --noconftest since the repo conftest pulls chromadb), and an orchestrator-owned Python 3.12 verification protocol.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
Implements the planning guide's sync/ package: a thin rclone binary
wrapper that mirrors a Dropbox folder into data/corpus/ entirely
out-of-band. Invoked only via `python -m sync.cli` — never imported by
gate.py, graph.py, or mcp_hybrid_server.py, so every CyClaw security
invariant holds by construction.
Built by three coordinated roles against a frozen interface contract:
- Foundation: utils/errors.py SyncError hierarchy, sync/config.py
(RcloneConfig + validating loader), sync/filters.py (hardened
denylist incl. data/personality/** excluded by default), config.yaml
additive `sync:` block.
- Runner/CLI: sync/runner.py (rclone>=1.68.2 gate for CVE-2024-52522,
argv-list subprocess no shell, SHA-256 per-file audit, exit-code 10
reindex signal), sync/cli.py, sync/selftest.py.
- Scheduler/Docs: sync/scheduler.py (cron + Task Scheduler, idempotent),
docs/SYNC_README.md, README roadmap update.
Security/perf: zero new Python deps (stdlib + existing pyyaml/utils);
no FastAPI endpoint, no graph node/edge, no network listener; refresh
token lives only in user-owned rclone.conf, never in repo/config/logs;
audit events carry metadata only (per-file "file" key, never "query");
one-way pull default (never deletes) with --max-delete/--max-transfer/
--check-first/--checksum fuses; reindex only when data/corpus/** changed.
Verified on Python 3.12.3: 78 mocked tests pass (no network, rclone not
required), ruff (incl. Bandit S) clean, mypy-consistent, CLI degrades
cleanly to exit 3 when rclone is absent. No Dropbox auth/keys/tokens
added. Sync test files wired into CI; coverage gate preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
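The rclone>=1.68.2 gate mentioned above could look roughly like the following. This is a minimal sketch, not the code in sync/runner.py: the names `MIN_RCLONE`, `parse_rclone_version`, and `is_supported` are hypothetical, and it assumes the first line of `rclone version` output has the form `rclone v1.68.2`.

```python
import re

# Floor named in the commit message (rclone>=1.68.2, CVE-2024-52522).
MIN_RCLONE = (1, 68, 2)


def parse_rclone_version(output: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from the first line of `rclone version`.

    Hypothetical helper: assumes the first line looks like "rclone v1.68.2".
    """
    first_line = output.splitlines()[0] if output else ""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognised rclone version output: {first_line!r}")
    return (int(match[1]), int(match[2]), int(match[3]))


def is_supported(output: str) -> bool:
    """True when the reported version meets the minimum; tuples compare element-wise."""
    return parse_rclone_version(output) >= MIN_RCLONE
```

Comparing integer tuples avoids the string-comparison trap where "1.9.0" would sort above "1.68.2".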
rclone copy never deletes destination files, so --max-delete=N is a no-op when passed to `rclone copy`. Confirmed by official rclone docs: the flag only has effect for `rclone sync` and `rclone bisync`. Move it from _common_args() into build_bisync_argv() where it can actually trigger as a safety fuse. Pull mode (the default) is already non-destructive by design via `rclone copy`; --max-transfer remains the operative guard for pull. Update test_pull_argv_is_list_no_shell to assert absence (not presence) of --max-delete, and assert presence in bisync argv test. Verified by deep-research against official rclone.org/commands/rclone_copy/ and rclone.org/commands/rclone_sync/ documentation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
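The flag move described in this commit can be illustrated as follows. This is a sketch under assumptions: `_common_args` and `build_bisync_argv` are named in the commit, but their signatures here, the `build_pull_argv` name, and the default values are hypothetical. How rclone interprets the `--max-delete` value for a given subcommand should be checked against the rclone documentation.

```python
def _common_args(max_transfer: str) -> list[str]:
    """Flags shared by every mode; --max-delete is deliberately not here."""
    return ["--check-first", "--checksum", f"--max-transfer={max_transfer}"]


def build_pull_argv(
    rclone: str, remote: str, local: str, max_transfer: str = "1G"
) -> list[str]:
    """One-way pull via `rclone copy`, which never deletes at the destination."""
    return [rclone, "copy", remote, local, *_common_args(max_transfer)]


def build_bisync_argv(
    rclone: str,
    remote: str,
    local: str,
    max_transfer: str = "1G",
    max_delete: int = 10,  # hypothetical default, for illustration only
) -> list[str]:
    """Two-way mode: the only place where a deletion fuse can actually trip."""
    return [
        rclone,
        "bisync",
        remote,
        local,
        *_common_args(max_transfer),
        f"--max-delete={max_delete}",
    ]
```

Both builders return an argv list meant for `subprocess.run(argv)` without a shell, matching the "argv-list subprocess no shell" rule stated for the runner.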
Three failures on windows-latest CI runner:
1. FAKE_RCLONE="/usr/bin/rclone": Path.is_absolute() returns False on Windows (requires drive letter). Use sys.platform to select a platform-appropriate fake absolute path (C:\Windows\rclone.exe on win32).
2. Same FAKE_RCLONE issue in test_run_sync_corpus_changed_and_exit_10's dispatch assertion; resolved by fix #1.
3. YAML ScannerError in test_scheduler_from_loaded_config: Windows paths like C:\Users\... embedded in YAML double-quoted strings cause parse errors (\U, \r etc. are YAML escape sequences). Fix: normalize backslashes to forward slashes before embedding in YAML; Python's Path.resolve() accepts forward slashes on Windows.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
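Both Windows issues can be reproduced on any OS with pathlib's pure path classes. The sketch below is illustrative only; the helper names `is_absolute_on`, `fake_rclone_path`, and `yaml_safe` are hypothetical and not taken from the test suite.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_absolute_on(platform: str, path: str) -> bool:
    """Evaluate absoluteness under the path rules of the given sys.platform value."""
    flavour = PureWindowsPath if platform == "win32" else PurePosixPath
    return flavour(path).is_absolute()


def fake_rclone_path(platform: str) -> str:
    """Pick a fake absolute path that is absolute on the target platform."""
    return r"C:\Windows\rclone.exe" if platform == "win32" else "/usr/bin/rclone"


def yaml_safe(path: str) -> str:
    """Swap backslashes for forward slashes before embedding in double-quoted YAML."""
    return path.replace("\\", "/")
```

A POSIX-style path such as `/usr/bin/rclone` has a root but no drive, so Windows path rules do not treat it as absolute, which is why a single hard-coded fake path cannot serve both platforms.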
…cheduling
Address five issues found in code review of the Dropbox sync implementation:
Bug 1 (functional, merge-blocker): corpus_changed never fired. rclone logs file paths relative to the transfer root (data/corpus), e.g. "notes.md", not "data/corpus/notes.md", so the data/corpus/ prefix check was always False and exit code 10 (reindex signal) could never trigger. Since local_path is validated to resolve under data/corpus, every parsed event is a corpus change by construction; we now treat it as such while defensively skipping rclone's own scratch/state artifacts. Tests updated to use realistic transfer-root-relative log paths.
Bug 2: sync.enabled: false was a no-op. The flag was discarded with a comment claiming it "gates sync at the CLI level," but nothing read it. load_sync_config now reads it onto the config and cmd_sync no-ops cleanly (exit 0) when false, so a scheduled run respects the toggle instead of running anyway.
Bug 3: documented that parse_log captures rclone bisync execution-phase verbs but intentionally skips its pre-sync structured-diff lines, so bisync per-file counts are a lower bound (bisync remains opt-in/discouraged).
Gap 1: added a single-instance guard. run_sync holds an atomically created lock directory for the duration of a run so a manual run and the scheduled run cannot drive rclone concurrently; a lock left by a crashed run is reclaimed after 3h. Portable (os.mkdir), no new dependency.
Gap 2: Windows scheduling now registers a generated cyclaw_sync.bat launcher via schtasks /TR instead of an inline cmd /c string, avoiding quote fragility when the repo path contains spaces.
Docs (SYNC_README, config.yaml) updated to match. Sync suite: 84 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
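The single-instance guard from Gap 1 relies on `os.mkdir` being atomic: it either creates the directory or raises `FileExistsError`. A minimal sketch of that idea follows, assuming staleness is judged by the lock directory's modification time. The names `single_instance`, `LockHeldError`, and `STALE_AFTER_SECONDS` are hypothetical and do not come from run_sync.

```python
import os
import shutil
import time
from contextlib import contextmanager

STALE_AFTER_SECONDS = 3 * 60 * 60  # the 3h reclaim window from the commit message


class LockHeldError(RuntimeError):
    """Raised when another run still holds a fresh lock (hypothetical name)."""


@contextmanager
def single_instance(lock_dir: str, stale_after: float = STALE_AFTER_SECONDS):
    """Hold lock_dir for the duration of the with-block."""
    try:
        os.mkdir(lock_dir)  # atomic: succeeds for exactly one caller
    except FileExistsError:
        try:
            age = time.time() - os.stat(lock_dir).st_mtime
        except FileNotFoundError:
            age = stale_after  # the holder released it in the meantime
        if age < stale_after:
            raise LockHeldError(f"sync already running (lock: {lock_dir})")
        # Lock left by a crashed run: reclaim it. A competing reclaimer can
        # still win the race, in which case this mkdir raises FileExistsError.
        shutil.rmtree(lock_dir, ignore_errors=True)
        os.mkdir(lock_dir)
    try:
        yield
    finally:
        shutil.rmtree(lock_dir, ignore_errors=True)
```

A caller would wrap the rclone invocation in `with single_instance(path):` and map `LockHeldError` to whatever exit code the CLI reserves for a skipped run.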
Practical end-user setup and usage guide for the Dropbox corpus sync feature covering both platforms in detail. Includes: rclone install, OAuth setup (including headless Linux), config.yaml walkthrough, manual sync, scheduled sync (cron, systemd timer, Windows Task Scheduler), exit codes, post-sync reindexing, cron-friendly automation script, and troubleshooting for common failure modes. Companion to docs/SYNC_README.md (internals) and the implementation plan.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
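The exit-code contract (10 means "corpus changed, reindex") lends itself to a small wrapper of the kind a scheduler could call. The sketch below is an assumption-laden illustration, not the script from the guide: `should_reindex` and `run_sync_then_reindex` are hypothetical names, and any subcommand or flags that `python -m sync.cli` actually requires are omitted.

```python
import subprocess
import sys

EXIT_CORPUS_CHANGED = 10  # reindex signal described in the PR


def should_reindex(sync_exit_code: int) -> bool:
    """Only the dedicated exit code triggers a reindex; 0 and errors do not."""
    return sync_exit_code == EXIT_CORPUS_CHANGED


def run_sync_then_reindex() -> int:
    """Run the sync CLI, then the indexer only if the corpus changed."""
    sync = subprocess.run([sys.executable, "-m", "sync.cli"], check=False)
    if should_reindex(sync.returncode):
        index = subprocess.run(
            [sys.executable, "-m", "retrieval.indexer"], check=False
        )
        return index.returncode
    return sync.returncode
```

Returning the indexer's exit code after a successful change keeps a scheduled job from reporting failure merely because the sync exited with 10.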
Summary
Adds `docs/DROPBOX_SYNC_IMPLEMENTATION_PLAN.md`, a detailed planning guide for adding an out-of-band Dropbox → `data/corpus/` sync to CyClaw, adapted from the PsyClaw Sync v1.0 prior art and re-validated against the current `main` branch plus 2025–2026 Dropbox/rclone best practices.
This PR is planning only, with no functional code. It is the blueprint for the subsequent implementation PRs.
Design in one paragraph
A standalone `sync/` package that is a thin `rclone` binary wrapper (subprocess, argv list, never `shell=True`), invoked only via `python -m sync.cli` from cron / systemd timer / launchd / Task Scheduler, and never imported by `gate.py`, `graph.py`, or `mcp_hybrid_server.py`. It adds zero new Python deps, no FastAPI endpoint, no LangGraph node/edge; writes only to the local FS + `logs/audit.jsonl` via the existing `utils.logger.audit_log()`; defaults to one-way pull (`rclone copy`, never deletes); keeps the Dropbox refresh token entirely in `rclone.conf`; and signals "corpus changed → reindex" via a dedicated exit code 10 so a wrapper can conditionally run `python -m retrieval.indexer`.
How it was produced
- …config/filters/runner/scheduler/cli + SYNC_README.md): the PsyClaw→CyClaw mapping is ~1:1 (identical `audit_log()`, `RAGError`, `data/corpus`, soul-governance gate), making the port low-risk.
- …`main` for performance + security compliance (reading `gate.py`, `graph.py`, `utils/logger.py`, `utils/errors.py`, `config.yaml`, `retrieval/indexer.py`, the test suite, and the CI/security tooling).
What the guide covers
- …`dropbox` SDK / dbxcli) with rationale tied to CyClaw's stated values
- …`--max-delete` / `--max-transfer` / `--check-first` / `--checksum` fuses, reindex-on-change
- …`sync/` … `utils.logger.audit_log` (no `query`-named field; no secret fields)
- …`SyncError` hierarchy under `RAGError`; additive `config.yaml` `sync:` block (no secrets)
- …`S`/Bandit (S602/S603/S607), mypy strict, CodeQL/DevSkim/Fortify subprocess hygiene, OSV/pip-audit unaffected (zero deps), coverage `source` + 80% gate
- …`tests/test_sync.py`), phased PR delivery, risk register, definition of done
- …`--user` oneshot timer preferred on Linux, `*.db*` excluded (soul DB), `"sync"` added to coverage source
Review focus
This is a doc-only change. Please review the design decisions (especially the §2 invariant contract, §4 sync semantics, and §10 security/CI compliance) before the implementation PRs are opened per the §13 phased plan.