Dropbox Rclone Corpus Sync Feature #80

Merged: CGFixIT merged 7 commits into main from claude/cyclaw-dropbox-sync-plan-q1xjr4 on Jun 20, 2026

@CGFixIT (Owner) commented Jun 20, 2026

Summary

Adds docs/DROPBOX_SYNC_IMPLEMENTATION_PLAN.md — a detailed planning guide for adding an out-of-band Dropbox → data/corpus/ sync to CyClaw, adapted from the PsyClaw Sync v1.0 prior art and re-validated against the current main branch plus 2025–2026 Dropbox/rclone best practices.

This PR is planning only — no functional code. It is the blueprint for the subsequent implementation PRs.

Design in one paragraph

A standalone sync/ package that is a thin rclone binary wrapper (subprocess, argv list, never shell=True), invoked only via python -m sync.cli from cron / systemd timer / launchd / Task Scheduler — never imported by gate.py, graph.py, or mcp_hybrid_server.py. It adds zero new Python deps, no FastAPI endpoint, no LangGraph node/edge; writes only to the local FS + logs/audit.jsonl via the existing utils.logger.audit_log(); defaults to one-way pull (rclone copy, never deletes); keeps the Dropbox refresh token entirely in rclone.conf; and signals "corpus changed → reindex" via a dedicated exit code 10 so a wrapper can conditionally run python -m retrieval.indexer.
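The exit-code handshake described in this paragraph could be consumed by a small wrapper along the following lines. This is a minimal sketch, not code from this PR: `run_sync_then_maybe_reindex` and its injectable `runner` parameter are hypothetical names, and it assumes `python -m sync.cli` needs no further arguments. Only the two module names and exit code 10 come from the description.

```python
import subprocess
import sys
from typing import Callable, Sequence

REINDEX_EXIT_CODE = 10  # "corpus changed -> reindex", per the PR description


def _run(argv: Sequence[str]) -> int:
    # argv list and no shell, matching the subprocess hygiene the design calls for.
    return subprocess.run(list(argv), check=False).returncode


def run_sync_then_maybe_reindex(
    runner: Callable[[Sequence[str]], int] = _run,
) -> int:
    """Run the sync CLI; reindex only when it reports a corpus change."""
    sync_rc = runner([sys.executable, "-m", "sync.cli"])
    if sync_rc != REINDEX_EXIT_CODE:
        # Any other code (success with no changes, or a failure) is passed through.
        return sync_rc
    return runner([sys.executable, "-m", "retrieval.indexer"])
```

Injecting `runner` keeps the wrapper testable without rclone or the indexer installed, in the same spirit as the fully mocked test plan.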

How it was produced

  • Analyzed the unzipped PsyClaw Sync v1.0 build (config/filters/runner/scheduler/cli + SYNC_README.md) — the PsyClaw→CyClaw mapping is ~1:1 (identical audit_log(), RAGError, data/corpus, soul-governance gate), making the port low-risk.
  • Ran two research passes: (1) Dropbox sync best practices vs CyClaw's minimal-deps/offline-first constraints; (2) clean integration into main for performance + security compliance (reading gate.py, graph.py, utils/logger.py, utils/errors.py, config.yaml, retrieval/indexer.py, the test suite, and the CI/security tooling).

What the guide covers

  • Transport decision (rclone vs Maestral / dropbox SDK / dbxcli) with rationale tied to CyClaw's stated values
  • Invariant contract — all 8 security invariants shown preserved by construction
  • Sync semantics — pull-vs-bisync safety, hardened filter denylist, --max-delete/--max-transfer/--check-first/--checksum fuses, reindex-on-change
  • File layout & per-module responsibilities for sync/
  • Audit schema reusing utils.logger.audit_log (no query-named field; no secret fields)
  • SyncError hierarchy under RAGError; additive config.yaml sync: block (no secrets)
  • Exit-code contract (0 / 10 / 1 / 2 / 3)
  • Security & CI compliance — Ruff S/Bandit (S602/S603/S607), mypy strict, CodeQL/DevSkim/Fortify subprocess hygiene, OSV/pip-audit unaffected (zero deps), coverage source + 80% gate
  • Fully-mocked test plan (tests/test_sync.py), phased PR delivery, risk register, definition of done
  • Net-new CyClaw decisions vs PsyClaw: rclone floor ≥ 1.68.2 (CVE-2024-52522), systemd --user oneshot timer preferred on Linux, *.db* excluded (soul DB), "sync" added to coverage source
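The rclone version floor from the last bullet could be enforced roughly as follows. This is a hedged sketch under stated assumptions: `parse_rclone_version` and `meets_floor` are hypothetical helper names, and the parser assumes the first line of `rclone version` output looks like `rclone v1.68.2`. How the real package performs this check is not shown in this PR description.

```python
import re
from typing import Tuple

MIN_RCLONE = (1, 68, 2)  # floor named in the PR (CVE-2024-52522)


def parse_rclone_version(version_output: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from `rclone version` output."""
    match = re.search(r"rclone v(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("could not find an rclone version string")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def meets_floor(
    version_output: str, floor: Tuple[int, int, int] = MIN_RCLONE
) -> bool:
    # Tuple comparison is numeric per component, so 1.9.x sorts below 1.68.x,
    # which a plain string comparison would get wrong.
    return parse_rclone_version(version_output) >= floor
```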

Review focus

This is a doc-only change. Please review the design decisions (especially the §2 invariant contract, §4 sync semantics, and §10 security/CI compliance) before the implementation PRs are opened per the §13 phased plan.

claude added 7 commits June 20, 2026 08:19

Detailed plan for an out-of-band, rclone-based Dropbox -> data/corpus
sync as a standalone sync/ package (python -m sync.cli), adapted from
the PsyClaw sync v1.0 prior art and re-validated against CyClaw main.

Covers: transport decision (rclone vs Maestral/SDK/dbxcli), pull-vs-bisync
safety, hardened filter denylist, audit schema reusing utils.logger.audit_log,
SyncError hierarchy under RAGError, additive config.yaml sync: block,
reindex-on-change exit-code contract, cross-platform out-of-band scheduling,
security/CI compliance (Ruff S/Bandit, CodeQL, DevSkim, OSV, coverage),
fully-mocked test plan, phased PR delivery, risk register, and a
demonstration that all eight security invariants are preserved.

Planning only - no functional code in this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP

…sync plan

Adds §16 (three-role split: A=foundation, B=runner/CLI, C=scheduler/docs;
A runs solo first, B+C parallelize against a frozen contract), Appendix C
(frozen public surfaces for errors/config/filters/scheduler/runner so the
parallel roles cannot clash), Appendix D (test isolation via --noconftest
since the repo conftest pulls chromadb), and an orchestrator-owned Python
3.12 verification protocol.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP

Implements the planning guide's sync/ package: a thin rclone binary
wrapper that mirrors a Dropbox folder into data/corpus/ entirely
out-of-band. Invoked only via `python -m sync.cli` — never imported by
gate.py, graph.py, or mcp_hybrid_server.py, so every CyClaw security
invariant holds by construction.

Built by three coordinated roles against a frozen interface contract:
  - Foundation: utils/errors.py SyncError hierarchy, sync/config.py
    (RcloneConfig + validating loader), sync/filters.py (hardened
    denylist incl. data/personality/** excluded by default), config.yaml
    additive `sync:` block.
  - Runner/CLI: sync/runner.py (rclone>=1.68.2 gate for CVE-2024-52522,
    argv-list subprocess no shell, SHA-256 per-file audit, exit-code 10
    reindex signal), sync/cli.py, sync/selftest.py.
  - Scheduler/Docs: sync/scheduler.py (cron + Task Scheduler, idempotent),
    docs/SYNC_README.md, README roadmap update.

Security/perf: zero new Python deps (stdlib + existing pyyaml/utils);
no FastAPI endpoint, no graph node/edge, no network listener; refresh
token lives only in user-owned rclone.conf, never in repo/config/logs;
audit events carry metadata only (per-file "file" key, never "query");
one-way pull default (never deletes) with --max-delete/--max-transfer/
--check-first/--checksum fuses; reindex only when data/corpus/** changed.
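A metadata-only, per-file audit record with a SHA-256 digest might be assembled like this. It is a sketch, not the runner's actual code: `sha256_of`, `build_file_event`, the `"sync_file"` event name, and the `"bytes"` field are all hypothetical, and the real `utils.logger.audit_log()` signature is not shown in this PR, so the sketch only builds the dictionary. The one property taken from the commit message is that the record carries a `"file"` key and never a `"query"` key.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large corpus files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_file_event(root: Path, relative: str) -> Dict[str, object]:
    """Build a metadata-only record: path, digest, and size; no file contents."""
    target = root / relative
    return {
        "event": "sync_file",  # hypothetical event name
        "file": relative,      # per-file key named in the commit message
        "sha256": sha256_of(target),
        "bytes": target.stat().st_size,
    }
```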

Verified on Python 3.12.3: 78 mocked tests pass (no network, rclone not
required), ruff (incl. Bandit S) clean, mypy-consistent, CLI degrades
cleanly to exit 3 when rclone is absent. No Dropbox auth/keys/tokens
added. Sync test files wired into CI; coverage gate preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP

rclone copy never deletes destination files, so --max-delete=N is a
no-op when passed to `rclone copy`. Confirmed by official rclone docs:
the flag only has effect for `rclone sync` and `rclone bisync`.

Move it from _common_args() into build_bisync_argv() where it can
actually trigger as a safety fuse. Pull mode (the default) is already
non-destructive by design via `rclone copy`; --max-transfer remains
the operative guard for pull.
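The resulting split between shared and bisync-only flags can be sketched as below. `_common_args` and `build_bisync_argv` are named in this commit message; `build_pull_argv`, the parameter names, and the default values (`"1G"`, `25`) are assumptions made for illustration, so treat the limits as placeholders and check the rclone documentation for how each mode interprets them.

```python
from typing import List


def _common_args(max_transfer: str = "1G") -> List[str]:
    # Fuses that make sense for both modes. The "1G" default is illustrative.
    return ["--check-first", "--checksum", f"--max-transfer={max_transfer}"]


def build_pull_argv(rclone: str, remote: str, local: str) -> List[str]:
    # `rclone copy` never deletes at the destination, so --max-delete is omitted.
    return [rclone, "copy", remote, local, *_common_args()]


def build_bisync_argv(
    rclone: str, remote: str, local: str, max_delete: int = 25
) -> List[str]:
    # bisync can delete, so the delete fuse lives here. The value is illustrative.
    return [
        rclone, "bisync", remote, local,
        *_common_args(),
        f"--max-delete={max_delete}",
    ]
```

Both builders return plain argv lists, so they can be handed to `subprocess.run` without a shell.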

Update test_pull_argv_is_list_no_shell to assert absence (not presence)
of --max-delete, and assert presence in bisync argv test.

Verified by deep-research against official rclone.org/commands/rclone_copy/
and rclone.org/commands/rclone_sync/ documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP

Three failures on windows-latest CI runner:

1. FAKE_RCLONE="/usr/bin/rclone": Path.is_absolute() returns False on
   Windows (requires drive letter). Use sys.platform to select a
   platform-appropriate fake absolute path (C:\Windows\rclone.exe on
   win32).

2. Same FAKE_RCLONE issue in test_run_sync_corpus_changed_and_exit_10's
   dispatch assertion — resolved by fix #1.

3. YAML ScannerError in test_scheduler_from_loaded_config: Windows
   paths like C:\Users\... embedded in YAML double-quoted strings cause
   parse errors (\U, \r etc. are YAML escape sequences). Fix: normalize
   backslashes to forward slashes before embedding in YAML; Python's
   Path.resolve() accepts forward slashes on Windows.
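Both fixes can be illustrated in a few lines. This is a sketch under assumptions: `fake_rclone_path` and `yaml_safe_path` are hypothetical helper names, and the actual test code may structure the same idea differently. The pure-path classes are used so the behaviour can be checked on any operating system.

```python
import sys
from pathlib import PurePosixPath, PureWindowsPath


def fake_rclone_path(platform: str = sys.platform) -> str:
    """Pick a fake rclone path that is absolute on the platform under test."""
    if platform == "win32":
        return r"C:\Windows\rclone.exe"
    return "/usr/bin/rclone"


def yaml_safe_path(path: str) -> str:
    """Swap backslashes for forward slashes before embedding in a
    double-quoted YAML string, where a backslash starts an escape sequence."""
    return path.replace("\\", "/")


# A POSIX-style path has no drive letter, so Windows does not treat it as absolute.
posix_path_on_windows = PureWindowsPath("/usr/bin/rclone").is_absolute()
```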

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP

…cheduling

Address five issues found in code review of the Dropbox sync implementation:

Bug 1 (functional, merge-blocker): corpus_changed never fired. rclone logs
file paths relative to the transfer root (data/corpus), e.g. "notes.md", not
"data/corpus/notes.md", so the data/corpus/ prefix check was always False and
exit code 10 (reindex signal) could never trigger. Since local_path is
validated to resolve under data/corpus, every parsed event is a corpus change
by construction; we now treat it as such while defensively skipping rclone's
own scratch/state artifacts. Tests updated to use realistic transfer-root-
relative log paths.
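The before-and-after logic for Bug 1 might look like the following sketch. The function names and the `SCRATCH_SUFFIXES` tuple are assumptions: the commit message does not list which scratch or state artifacts are skipped, so the suffixes here are placeholders only.

```python
from typing import Iterable

# Assumed scratch/state suffixes; the PR does not list the exact artifacts it skips.
SCRATCH_SUFFIXES = (".partial", ".lck")


def old_prefix_check(relative_path: str) -> bool:
    # The buggy check: rclone logs "notes.md", never "data/corpus/notes.md".
    return relative_path.startswith("data/corpus/")


def is_scratch(relative_path: str) -> bool:
    return relative_path.endswith(SCRATCH_SUFFIXES)


def corpus_changed(transferred: Iterable[str]) -> bool:
    """Every transferred path lies under the corpus root by construction,
    so any non-scratch entry counts as a corpus change."""
    return any(not is_scratch(path) for path in transferred)
```

With transfer-root-relative paths, `old_prefix_check` can never return true, which is why exit code 10 never fired before the fix.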

Bug 2: sync.enabled: false was a no-op. The flag was discarded with a comment
claiming it "gates sync at the CLI level," but nothing read it. load_sync_config
now reads it onto the config and cmd_sync no-ops cleanly (exit 0) when false, so
a scheduled run respects the toggle instead of running anyway.
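A minimal sketch of the toggle is shown below. `SyncSettings` and `load_sync_settings` are hypothetical stand-ins for the real config object and `load_sync_config`, the `run` callback is an assumption, and defaulting a missing `enabled` key to false is a guess rather than documented behaviour. Only the outcome, a clean exit 0 when the flag is false, comes from the commit message.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class SyncSettings:
    """Hypothetical stand-in for the real sync config object."""
    enabled: bool


def load_sync_settings(raw: Mapping[str, Any]) -> SyncSettings:
    section = raw.get("sync") or {}
    # Assumption: a missing key is treated as disabled.
    return SyncSettings(enabled=bool(section.get("enabled", False)))


def cmd_sync(settings: SyncSettings, run: Callable[[], int]) -> int:
    if not settings.enabled:
        return 0  # clean no-op, so a scheduled run respects the toggle
    return run()
```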

Bug 3: documented that parse_log captures rclone bisync execution-phase verbs
but intentionally skips its pre-sync structured-diff lines, so bisync per-file
counts are a lower bound (bisync remains opt-in/discouraged).

Gap 1: added a single-instance guard. run_sync holds an atomically created lock
directory for the duration of a run so a manual run and the scheduled run cannot
drive rclone concurrently; a lock left by a crashed run is reclaimed after 3h.
Portable (os.mkdir), no new dependency.
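The single-instance guard could be sketched as follows. `acquire_lock`, `release_lock`, and the `now` parameter are hypothetical; the commit message only states that the lock is an atomically created directory (`os.mkdir`) and that a stale lock is reclaimed after 3h. The sketch uses the directory's modification time as the lock age, which is an assumption.

```python
import os
import time
from pathlib import Path
from typing import Optional

STALE_AFTER_SECONDS = 3 * 60 * 60  # 3h reclaim window from the commit message


def acquire_lock(lock_dir: Path, now: Optional[float] = None) -> bool:
    """Try to take the lock; return False if another live run holds it."""
    now = time.time() if now is None else now
    try:
        os.mkdir(lock_dir)  # atomic: exactly one caller can create the directory
        return True
    except FileExistsError:
        pass
    try:
        age = now - lock_dir.stat().st_mtime
    except FileNotFoundError:
        age = None  # the holder released it between our mkdir and stat
    if age is not None and age < STALE_AFTER_SECONDS:
        return False
    # Stale (or just released): reclaim, tolerating a competing reclaimer.
    try:
        os.rmdir(lock_dir)
    except FileNotFoundError:
        pass
    try:
        os.mkdir(lock_dir)
        return True
    except FileExistsError:
        return False


def release_lock(lock_dir: Path) -> None:
    try:
        os.rmdir(lock_dir)
    except FileNotFoundError:
        pass
```

A caller would wrap the rclone invocation in `try`/`finally` so `release_lock` runs even when the sync fails.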

Gap 2: Windows scheduling now registers a generated cyclaw_sync.bat launcher via
schtasks /TR instead of an inline cmd /c string, avoiding quote fragility when
the repo path contains spaces.

Docs (SYNC_README, config.yaml) updated to match. Sync suite: 84 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP

Practical end-user setup and usage guide for the Dropbox corpus sync
feature covering both platforms in detail. Includes: rclone install,
OAuth setup (including headless Linux), config.yaml walkthrough, manual
sync, scheduled sync (cron, systemd timer, Windows Task Scheduler),
exit codes, post-sync reindexing, cron-friendly automation script, and
troubleshooting for common failure modes.

Companion to docs/SYNC_README.md (internals) and the implementation plan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkUGHTn7au2zpUdPJEidvP
@CGFixIT CGFixIT marked this pull request as ready for review June 20, 2026 09:56
@CGFixIT CGFixIT merged commit 5ac5f8c into main Jun 20, 2026
13 checks passed
@CGFixIT CGFixIT changed the title from "docs: Dropbox corpus sync implementation planning guide" to "Dropbox Rclone Corpus Sync Feature" Jun 20, 2026