From f90072d14bbe18838d955776d5db0761e9398313 Mon Sep 17 00:00:00 2001
From: Hermes Agent
Date: Mon, 8 Jun 2026 12:07:02 +0000
Subject: [PATCH 1/3] docs: clarify privacy and onboarding boundaries

---
 README.md                              | 20 +++++++++------
 docs/RUNBOOK.md                        |  2 +-
 docs/getting-started/configuration.mdx |  2 +-
 docs/getting-started/introduction.mdx  |  4 ++-
 docs/getting-started/quick-start.mdx   | 30 ++++++++++++----------
 docs/guides/cloud-run-deployment.mdx   |  4 +--
 docs/guides/security-hardening.mdx     |  2 +-
 docs/index.mdx                         |  8 +++---
 infra/cloud-tasks/README.md            | 21 ++++++++--------
 infra/gcs/README.md                    | 35 +++++++++++++++-----------
 10 files changed, 72 insertions(+), 56 deletions(-)

diff --git a/README.md b/README.md
index d65f529..6e5f30a 100644
--- a/README.md
+++ b/README.md
@@ -46,9 +46,9 @@ Textrawl is a personal knowledge server with persistent memory, searchable docum

 **Beyond keyword search.** Most search tools only match exact words. Textrawl combines semantic understanding (finds "automobile" when you search "car") with traditional keyword matching — so you get relevant results without missing exact phrases.

-**Your data, your choice.** Use OpenAI's embeddings for best accuracy, Google AI for multimodal support, or run completely locally with Ollama — no API costs, no data leaving your machine.
+**Your data, your choice.** Use OpenAI's embeddings for best accuracy, Google AI for multimodal support, or run locally with Ollama and local Postgres to keep document text and embeddings on your machine.

-**Import everything.** Emails from Gmail exports, PDFs from your research, saved web pages, images, audio files, Google Takeout archives — Textrawl converts them all into searchable knowledge.
+**Import everything.** Emails from Gmail exports, PDFs from your research, saved web pages, images, audio files, Google Takeout archives — Textrawl converts them into searchable knowledge where the relevant converter/provider is configured.

 ## Features

@@ -69,7 +69,11 @@ Textrawl is a personal knowledge server with persistent memory, searchable docum
 | **Flexible Embeddings** | OpenAI, Google AI, or Ollama (free, local) |
 | **Smart Chunking** | Paragraph-aware splitting with overlap for context |
 | **CLI Tools** | Batch processing for large archives |
-| **Cloud Ready** | Deploy to Docker, Cloud Run, or any container platform |
+| **Cloud Ready** | Deploy to Docker, Cloud Run, or any container platform; large uploads require GCS/Cloud Tasks configuration |
+
+## Privacy Model
+
+Textrawl is self-hosted, but data leaves your machine when you configure cloud services. Document text, chunks, embeddings, extracted memories, conversation summaries, images, or audio may be sent to providers such as OpenAI/Google embeddings, Anthropic/OpenAI/Google extraction, Neon/Supabase/RDS, Cloud Run, or GCS. For sensitive data, prefer Ollama/local Postgres and disable cloud LLM extraction/insights.

 ## Quick Start

@@ -124,7 +128,7 @@ If you've set `API_BEARER_TOKEN` in `.env`, add the auth header:

 ```json
 "--header",
-"Authorization: Bearer YOUR_TOKEN_HERE"
+"Authorization: Bearer "
 ```

 Restart Claude Desktop - you'll now see Textrawl's tools available.
@@ -135,12 +139,14 @@ ChatGPT Desktop supports MCP servers natively (Pro/Plus required):

 1. Open **Settings → Connectors → Advanced → Developer mode**
 2. Add a new connector with your server URL: `http://localhost:3000/mcp`
-3. If using auth, add the `Authorization: Bearer YOUR_TOKEN` header
+3. If using auth, add the `Authorization: Bearer ` header

 See [OpenAI MCP documentation](https://platform.openai.com/docs/mcp) for details.

 ### 4. Add Your Documents

+Imported documents, extracted memories, and conversation summaries are stored in your configured database/storage until deleted. Treat a Textrawl server as single-tenant unless you have added your own user isolation. Set `API_BEARER_TOKEN`, restrict CORS with `ALLOWED_ORIGINS`, and avoid importing third-party or private data without consent. Use `forget_entity` and `delete_conversation` to remove memory/conversation data, and `list_documents`/`update_document` to audit imported documents.
+
 **Option A: Desktop App** (easiest)

 ```bash
@@ -160,7 +166,7 @@ pnpm upload -- ./converted/

 | Guide | Description |
 |-------|-------------|
-| [Database Sizing](docs/guides/supabase-requirements.mdx) | Vector dimensions, index counts, and storage estimates by embedding provider |
+| [Database Sizing](docs/guides/database-requirements.mdx) | Vector dimensions, index counts, and storage estimates by embedding provider |
 | [CLI Tools](docs/cli/) | Batch conversion and upload from command line |
 | [Security](docs/guides/security-hardening.mdx) | Row Level Security and access controls |

@@ -296,7 +302,7 @@ Enabled when `DATABASE_URL` is configured. Connects directly to Postgres.

 ```bash
 curl -X POST http://localhost:3000/api/upload \
-  -H "Authorization: Bearer YOUR_TOKEN" \
+  -H "Authorization: Bearer " \
   -F "file=@document.pdf" \
   -F "title=Optional Title" \
   -F "tags=tag1,tag2"
diff --git a/docs/RUNBOOK.md b/docs/RUNBOOK.md
index 71a1827..49a4102 100644
--- a/docs/RUNBOOK.md
+++ b/docs/RUNBOOK.md
@@ -102,7 +102,7 @@ Add auth header when `API_BEARER_TOKEN` is set.

 ### ChatGPT Desktop

-Use Settings → Connectors (Developer mode), point to `http://localhost:3000/mcp`, then add `Authorization: Bearer ` header if enabled.
+Use Settings → Connectors (Developer mode), point to `http://localhost:3000/mcp`, then add the `Authorization: Bearer ` header if enabled.

 ## 6) MCP Inspector

diff --git a/docs/getting-started/configuration.mdx b/docs/getting-started/configuration.mdx
index 10f09ce..9e5101a 100644
--- a/docs/getting-started/configuration.mdx
+++ b/docs/getting-started/configuration.mdx
@@ -94,7 +94,7 @@ ALLOWED_ORIGINS=http://localhost:3000,https://myapp.com
 API_BEARER_TOKEN=your-very-secure-token-with-at-least-32-characters
 ```

-When set, all API endpoints require the `Authorization: Bearer ` header.
+When set, all API endpoints require an `Authorization: Bearer ` header.

 **Unprotected endpoints** (for health checks):

diff --git a/docs/getting-started/introduction.mdx b/docs/getting-started/introduction.mdx
index adffb93..116a187 100644
--- a/docs/getting-started/introduction.mdx
+++ b/docs/getting-started/introduction.mdx
@@ -11,10 +11,12 @@ The **Model Context Protocol (MCP)** is an open standard for connecting AI assis

 - **Tool Use**: Claude can call functions to search, retrieve, and create content
 - **Context Sharing**: Your documents become part of Claude's working knowledge
-- **Privacy**: Data stays on your infrastructure, not uploaded to the cloud
+- **Privacy**: Data stays in the infrastructure/providers you configure

 ## Key Features

+Textrawl stores imported documents, extracted memories, and conversation summaries in your configured database/storage until deleted. Before importing emails, conversations, Takeout archives, images, or audio, configure authentication/CORS and avoid importing third-party/private data without consent.
+
 ### Hybrid Search

 textrawl combines two search strategies:
diff --git a/docs/getting-started/quick-start.mdx b/docs/getting-started/quick-start.mdx
index 6c45f16..12df5f0 100644
--- a/docs/getting-started/quick-start.mdx
+++ b/docs/getting-started/quick-start.mdx
@@ -8,8 +8,8 @@ Get textrawl running in 5 minutes.
 ## Prerequisites

 - **Node.js 22+** ([Download](https://nodejs.org/))
-- **Supabase account** ([Sign up free](https://supabase.com/))
-- **OpenAI API key** ([Get one](https://platform.openai.com/api-keys))
+- **PostgreSQL with pgvector** (Neon, Supabase, RDS, or self-hosted)
+- **Embedding provider**: OpenAI, Google AI, or Ollama for local embeddings

 ## Step 1: Clone and Install

@@ -29,24 +29,26 @@ pnpm run setup

 You'll be prompted for:

-- Supabase URL
-- Supabase Service Key
-- OpenAI API Key
+- `DATABASE_URL` (use a pooled Postgres connection string for the server)
+- Embedding provider credentials, or Ollama settings for local embeddings
+- `API_BEARER_TOKEN` for authenticated clients

 ## Step 3: Initialize Database

-In the Supabase SQL Editor, run:
+Run the schema against your configured database:

-```sql
--- Paste contents of scripts/setup-db.sql
+```bash
+psql $DATABASE_URL -f scripts/setup-db.sql
 ```

-Then run the security script:
+Use the provider-specific schema if needed (`setup-db-ollama.sql`, `setup-db-ollama-v2.sql`, or `setup-db-google.sql`). Then run the security script:

-```sql
--- Paste contents of scripts/security-rls.sql
+```bash
+psql $DATABASE_URL -f scripts/security-rls.sql
 ```

+Supabase is supported as one Postgres option, but never expose Supabase service-role credentials to browser, desktop, or other client code. Prefer a pooled `DATABASE_URL` used only by the server.
+
 ## Step 4: Start the Server

 ```bash
@@ -84,7 +86,7 @@ Restart Claude Desktop to connect.
 Query textrawl directly over HTTP (the setup script generates `API_BEARER_TOKEN` in your `.env`):

 ```bash
-curl -H "Authorization: Bearer $API_BEARER_TOKEN" http://localhost:3000/api/documents
+curl -H "Authorization: Bearer " http://localhost:3000/api/documents
 ```

 ## Verify It Works
@@ -93,6 +95,8 @@ curl -H "Authorization: Bearer $API_BEARER_TOKEN" http://localhost:3000/api/docu

 Open `http://localhost:3000`. Try uploading a document or creating a note from the dashboard.

+Imported documents, extracted memories, and conversation summaries remain in your configured database/storage until deleted. Set `API_BEARER_TOKEN`, restrict CORS for production, and avoid importing third-party/private data without consent.
+
 ### Using the REST API

 ```bash
@@ -100,7 +104,7 @@ Open `http://localhost:3000`. Try uploading a document or creating a note from t
 curl http://localhost:3000/health

 # Search for documents
-curl -H "Authorization: Bearer $API_BEARER_TOKEN" \
+curl -H "Authorization: Bearer " \
   "http://localhost:3000/api/search?q=test&limit=5"
 ```

diff --git a/docs/guides/cloud-run-deployment.mdx b/docs/guides/cloud-run-deployment.mdx
index 82fd440..afe03ff 100644
--- a/docs/guides/cloud-run-deployment.mdx
+++ b/docs/guides/cloud-run-deployment.mdx
@@ -126,8 +126,8 @@ in-memory fake (local dev only).

 ```bash
 # Point the service at the provisioned bucket
-gcloud run services update textrawl --region us-east4 \
-  --update-env-vars GCS_UPLOAD_BUCKET=textrawl-uploads
+gcloud run services update --region \
+  --update-env-vars GCS_UPLOAD_BUCKET=
 # GCS_PROJECT_ID is optional — auto-detected from the runtime service account.
 ```

diff --git a/docs/guides/security-hardening.mdx b/docs/guides/security-hardening.mdx
index 72860d9..79cb956 100644
--- a/docs/guides/security-hardening.mdx
+++ b/docs/guides/security-hardening.mdx
@@ -13,7 +13,7 @@ Always set `API_BEARER_TOKEN` in production:
 API_BEARER_TOKEN=$(openssl rand -base64 32)
 ```

-All API endpoints require `Authorization: Bearer ` header.
+All API endpoints require the `Authorization: Bearer ` header.

 ## Row Level Security

diff --git a/docs/index.mdx b/docs/index.mdx
index e7fdfa1..ec9265b 100644
--- a/docs/index.mdx
+++ b/docs/index.mdx
@@ -33,7 +33,7 @@ Access your knowledge via the web dashboard, MCP for AI assistants, REST API, CL

 ### Privacy First

-Self-hosted on your infrastructure. Your documents never leave your control.
+Self-hosted on infrastructure and providers you configure. Use Ollama/local Postgres for fully local document text and embeddings; cloud embeddings, extraction, databases, storage, and deployment providers receive the data needed for those features.

 ## What is MCP?

@@ -41,7 +41,7 @@ The **Model Context Protocol (MCP)** is an open standard for connecting AI assis

 - **Tool Use**: Claude can call functions to search, retrieve, and create content
 - **Context Sharing**: Your documents become part of Claude's working knowledge
-- **Privacy**: Data stays on your infrastructure, not uploaded to the cloud
+- **Privacy**: Data stays in the infrastructure/providers you configure

 ## Quick Start

@@ -59,7 +59,7 @@ pnpm dev

 ## Tools

-textrawl exposes 25 tools, available via MCP and REST API:
+textrawl exposes 26 tools, available via MCP and REST API:

 ### Document Tools

@@ -144,7 +144,7 @@ textrawl exposes 25 tools, available via MCP and REST API:
 │ └──────────────┴───────────┴─────────────┘ │
 │ ▼ │
 │ ┌─────────────────────────────────────────────────────┐│
-│ │ Supabase PostgreSQL + pgvector ││
+│ │ PostgreSQL + pgvector (Neon/Supabase/self-hosted) ││
 │ │ • documents + chunks (document search) ││
 │ │ • memory_entities + observations (memory) ││
 │ │ • conversation_sessions + turns (conversations) ││
diff --git a/infra/cloud-tasks/README.md b/infra/cloud-tasks/README.md
index 7664ee1..1e59ea9 100644
--- a/infra/cloud-tasks/README.md
+++ b/infra/cloud-tasks/README.md
@@ -10,7 +10,7 @@ token itself (the service is public for MCP/API, so this is the access control).

 | Resource | Value |
 |---|---|
-| Cloud Tasks queue | `textrawl-upload-processing` (location `us-east4`, colocated with Cloud Run + the GCS bucket) |
+| Cloud Tasks queue | `` (colocated with Cloud Run + the GCS bucket) |
 | OIDC invoker SA | `textrawl-tasks@.iam.gserviceaccount.com` (identity in the task's OIDC token) |
 | Enqueuer | the Cloud Run runtime SA (default compute SA) — granted `cloudtasks.enqueuer` (queue-scoped) |
 | actAs | runtime SA granted `iam.serviceAccountUser` on the invoker SA |
@@ -22,15 +22,15 @@ Tasks' first 1M operations/month are free and an idle queue has no standing cost

 ## Run it

-The script targets `--project=textrawl` explicitly and ignores your active
-gcloud project (which is currently a different project). Review, then:
+Set the target project/region explicitly instead of relying on your active
+gcloud project. Review, then:

 ```sh
 bash infra/cloud-tasks/setup.sh
 # overridable: GCP_PROJECT_ID, GCP_REGION, CLOUD_RUN_SERVICE, CLOUD_TASKS_QUEUE, TASKS_SA_NAME
 ```

-The runner needs admin on the `textrawl` project (service-usage, Cloud Tasks,
+The runner needs admin on the target project (service-usage, Cloud Tasks,
 service-account, project-IAM admin — or Editor/Owner). If your account lacks
 these, run as the project owner.

@@ -44,8 +44,8 @@ re-run.
 they are inert until that code ships, so set them at the same deploy:

 ```sh
-gcloud run services update textrawl --project=textrawl --region=us-east4 \
-  --update-env-vars="CLOUD_TASKS_QUEUE=textrawl-upload-processing,CLOUD_TASKS_LOCATION=us-east4,CLOUD_TASKS_SERVICE_ACCOUNT=textrawl-tasks@textrawl.iam.gserviceaccount.com,UPLOAD_PROCESS_URL=/api/upload/process,GCS_UPLOAD_BUCKET=textrawl-uploads" \
+gcloud run services update --project= --region= \
+  --update-env-vars="CLOUD_TASKS_QUEUE=,CLOUD_TASKS_LOCATION=,CLOUD_TASKS_SERVICE_ACCOUNT=@.iam.gserviceaccount.com,UPLOAD_PROCESS_URL=/api/upload/process,GCS_UPLOAD_BUCKET=" \
   --timeout=600
 ```

@@ -53,15 +53,14 @@ Notes:

 - `UPLOAD_PROCESS_URL` is the live service URL + `/api/upload/process`; it is
   both the task target base and the OIDC audience the app verifies.
-- `GCS_UPLOAD_BUCKET` is **not currently set** on the service, so Phase 3's GCS
-  storage falls back to the in-memory fake in prod — set it here to activate the
-  real bucket.
+- `GCS_UPLOAD_BUCKET` must be set for production large uploads. If it is unset,
+  storage falls back to an in-memory fake intended only for local/dev use.
 - `--timeout=600` raises the request budget for streaming/extraction (currently
   300s).

 ## Teardown

 ```sh
-gcloud tasks queues delete textrawl-upload-processing --project=textrawl --location=us-east4
-gcloud iam service-accounts delete textrawl-tasks@textrawl.iam.gserviceaccount.com --project=textrawl
+gcloud tasks queues delete --project= --location=
+gcloud iam service-accounts delete @.iam.gserviceaccount.com --project=
 ```
diff --git a/infra/gcs/README.md b/infra/gcs/README.md
index f825e19..a592330 100644
--- a/infra/gcs/README.md
+++ b/infra/gcs/README.md
@@ -7,14 +7,14 @@ Infra-as-config for the resumable large-upload workflow

 | Setting | Value |
 |---|---|
-| Bucket | `gs://textrawl-uploads` |
-| Project | `textrawl` (607480003712) |
-| Location | `us-east4` (colocated with the Cloud Run `textrawl` service) |
+| Bucket | `gs://` |
+| Project | `` (``) |
+| Location | `` (colocated with the Cloud Run service) |
 | Uniform bucket-level access | enabled |
 | Public access prevention | enforced |
 | Soft-delete | disabled (transient bytes; avoids paying to retain deleted upload objects) |
 | Lifecycle | delete objects ≥ 1 day old (abandoned-upload cleanup; `UPLOAD_CLEANUP_TTL_HOURS=24`) |
-| IAM | `607480003712-compute@developer.gserviceaccount.com` → `roles/storage.objectAdmin` (bucket-scoped) |
+| IAM | `@.iam.gserviceaccount.com` → `roles/storage.objectAdmin` (bucket-scoped) |

 The Cloud Run runtime SA (the default compute SA) reads/writes via ADC — no
 service-account keys.
@@ -22,8 +22,8 @@ service-account keys.
 ## Re-apply config

 ```sh
-PROJECT=textrawl
-BUCKET=gs://textrawl-uploads
+PROJECT=
+BUCKET=gs://

 # CORS (dashboard origin + localhost for browser-direct resumable PUTs)
 gcloud storage buckets update $BUCKET --project=$PROJECT --cors-file=infra/gcs/cors.json
@@ -32,26 +32,31 @@ gcloud storage buckets update $BUCKET --project=$PROJECT --cors-file=infra/gcs/c
 gcloud storage buckets update $BUCKET --project=$PROJECT --lifecycle-file=infra/gcs/lifecycle.json
 ```

-## One-time creation (already done)
+## One-time creation

 ```sh
-gcloud storage buckets create gs://textrawl-uploads \
-  --project=textrawl --location=us-east4 \
+PROJECT=
+REGION=
+BUCKET=gs://
+RUNTIME_SA=@.iam.gserviceaccount.com
+
+gcloud storage buckets create $BUCKET \
+  --project=$PROJECT --location=$REGION \
   --uniform-bucket-level-access --public-access-prevention

-gcloud storage buckets add-iam-policy-binding gs://textrawl-uploads \
-  --project=textrawl \
-  --member="serviceAccount:607480003712-compute@developer.gserviceaccount.com" \
+gcloud storage buckets add-iam-policy-binding $BUCKET \
+  --project=$PROJECT \
+  --member="serviceAccount:$RUNTIME_SA" \
   --role="roles/storage.objectAdmin"

-gcloud storage buckets update gs://textrawl-uploads --project=textrawl --clear-soft-delete
+gcloud storage buckets update $BUCKET --project=$PROJECT --clear-soft-delete
 ```

 ## Server env (set on Cloud Run when the GCS StorageService lands, T3.2)

 ```sh
-GCS_UPLOAD_BUCKET=textrawl-uploads
-GCS_PROJECT_ID=textrawl # optional; auto-detected from ADC otherwise
+GCS_UPLOAD_BUCKET=
+GCS_PROJECT_ID= # optional; auto-detected from ADC otherwise
 ```

 ## CORS origins

From 2d1837eb53bcf0523a98e250dbfad2455d19214f Mon Sep 17 00:00:00 2001
From: Hermes Agent
Date: Mon, 8 Jun 2026 17:16:57 +0000
Subject: [PATCH 2/3] docs: address CodeRabbit onboarding feedback

Quote DATABASE_URL examples and clarify auth requirements for health endpoints.
---
 docs/getting-started/quick-start.mdx | 9 +++++----
 docs/guides/security-hardening.mdx   | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/getting-started/quick-start.mdx b/docs/getting-started/quick-start.mdx
index 12df5f0..62bc817 100644
--- a/docs/getting-started/quick-start.mdx
+++ b/docs/getting-started/quick-start.mdx
@@ -38,13 +38,13 @@ You'll be prompted for:
 Run the schema against your configured database:

 ```bash
-psql $DATABASE_URL -f scripts/setup-db.sql
+psql "$DATABASE_URL" -f scripts/setup-db.sql
 ```

 Use the provider-specific schema if needed (`setup-db-ollama.sql`, `setup-db-ollama-v2.sql`, or `setup-db-google.sql`). Then run the security script:

 ```bash
-psql $DATABASE_URL -f scripts/security-rls.sql
+psql "$DATABASE_URL" -f scripts/security-rls.sql
 ```

 Supabase is supported as one Postgres option, but never expose Supabase service-role credentials to browser, desktop, or other client code. Prefer a pooled `DATABASE_URL` used only by the server.
@@ -86,7 +86,8 @@ Restart Claude Desktop to connect.
 Query textrawl directly over HTTP (the setup script generates `API_BEARER_TOKEN` in your `.env`):

 ```bash
-curl -H "Authorization: Bearer " http://localhost:3000/api/documents
+curl -H "Authorization: Bearer $API_BEARER_TOKEN" \
+  "http://localhost:3000/api/documents"
 ```

 ## Verify It Works
@@ -104,7 +105,7 @@ Imported documents, extracted memories, and conversation summaries remain in you
 curl http://localhost:3000/health

 # Search for documents
-curl -H "Authorization: Bearer " \
+curl -H "Authorization: Bearer $API_BEARER_TOKEN" \
   "http://localhost:3000/api/search?q=test&limit=5"
 ```

diff --git a/docs/guides/security-hardening.mdx b/docs/guides/security-hardening.mdx
index 79cb956..001ea25 100644
--- a/docs/guides/security-hardening.mdx
+++ b/docs/guides/security-hardening.mdx
@@ -13,7 +13,7 @@ Always set `API_BEARER_TOKEN` in production:
 API_BEARER_TOKEN=$(openssl rand -base64 32)
 ```

-All API endpoints require the `Authorization: Bearer ` header.
+Data and write API endpoints require the `Authorization: Bearer ***` header. Health endpoints such as `/health`, `/health/live`, and `/health/ready` remain unauthenticated so uptime checks can probe them.

 ## Row Level Security

From 24a2fef82a46fe6ca3856a5e437cc4fc1874e88c Mon Sep 17 00:00:00 2001
From: Hermes Agent
Date: Mon, 8 Jun 2026 17:31:53 +0000
Subject: [PATCH 3/3] docs: clarify quick start production notes

Split persistence, security, and consent guidance after CodeRabbit follow-up.
---
 docs/getting-started/quick-start.mdx | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/docs/getting-started/quick-start.mdx b/docs/getting-started/quick-start.mdx
index 62bc817..6e0e85c 100644
--- a/docs/getting-started/quick-start.mdx
+++ b/docs/getting-started/quick-start.mdx
@@ -96,7 +96,13 @@ curl -H "Authorization: Bearer $API_BEARER_TOKEN" \

 Open `http://localhost:3000`. Try uploading a document or creating a note from the dashboard.

-Imported documents, extracted memories, and conversation summaries remain in your configured database/storage until deleted. Set `API_BEARER_TOKEN`, restrict CORS for production, and avoid importing third-party/private data without consent.
+Imported documents, extracted memories, and conversation summaries remain in your configured database/storage until deleted.
+
+For production:
+
+- Set `API_BEARER_TOKEN`.
+- Restrict CORS.
+- Avoid importing third-party/private data without consent.

 ### Using the REST API
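
A minimal shell sketch of the auth boundary this series documents: `/health` paths are probed without credentials, while data endpoints send `Authorization: Bearer $API_BEARER_TOKEN`. The helper name `textrawl_curl_args` and the `TEXTRAWL_BASE_URL` variable are hypothetical and not part of the repository; the base URL is assumed to default to the `http://localhost:3000` used in the docs.

```shell
# Hypothetical helper (not part of textrawl): prints the curl arguments for a
# request path, one argument per line.
#   $1: request path, e.g. /api/documents or /health/ready
textrawl_curl_args() {
  # TEXTRAWL_BASE_URL is an assumed variable name for this sketch only.
  _base="${TEXTRAWL_BASE_URL:-http://localhost:3000}"
  case "$1" in
    /health|/health/*)
      # Health endpoints stay unauthenticated so uptime checks can probe them.
      printf '%s\n' "${_base}$1"
      ;;
    *)
      # Every other endpoint needs the bearer token; fail early if it is unset.
      if [ -z "${API_BEARER_TOKEN:-}" ]; then
        echo "API_BEARER_TOKEN is not set" >&2
        return 1
      fi
      printf '%s\n' "-H" "Authorization: Bearer ${API_BEARER_TOKEN}" "${_base}$1"
      ;;
  esac
}
```

Each printed line is one curl argument, so the output can be read into an argument list before calling `curl`. Failing early on a missing token avoids sending an empty `Bearer` header to a protected endpoint.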