20 changes: 13 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,9 +46,9 @@ Textrawl is a personal knowledge server with persistent memory, searchable docum

**Beyond keyword search.** Most search tools only match exact words. Textrawl combines semantic understanding (finds "automobile" when you search "car") with traditional keyword matching — so you get relevant results without missing exact phrases.

**Your data, your choice.** Use OpenAI's embeddings for best accuracy, Google AI for multimodal support, or run completely locally with Ollama — no API costs, no data leaving your machine.
**Your data, your choice.** Use OpenAI's embeddings for best accuracy, Google AI for multimodal support, or run locally with Ollama and local Postgres to keep document text and embeddings on your machine.

**Import everything.** Emails from Gmail exports, PDFs from your research, saved web pages, images, audio files, Google Takeout archives — Textrawl converts them all into searchable knowledge.
**Import everything.** Emails from Gmail exports, PDFs from your research, saved web pages, images, audio files, Google Takeout archives — Textrawl converts them into searchable knowledge where the relevant converter/provider is configured.

## Features

Expand All @@ -69,7 +69,11 @@ Textrawl is a personal knowledge server with persistent memory, searchable docum
| **Flexible Embeddings** | OpenAI, Google AI, or Ollama (free, local) |
| **Smart Chunking** | Paragraph-aware splitting with overlap for context |
| **CLI Tools** | Batch processing for large archives |
| **Cloud Ready** | Deploy to Docker, Cloud Run, or any container platform |
| **Cloud Ready** | Deploy to Docker, Cloud Run, or any container platform; large uploads require GCS/Cloud Tasks configuration |

## Privacy Model

Textrawl is self-hosted, but data leaves your machine when you configure cloud services. Document text, chunks, embeddings, extracted memories, conversation summaries, images, or audio may be sent to the providers you configure: OpenAI or Google for embeddings; Anthropic, OpenAI, or Google for extraction; Neon, Supabase, or RDS for the database; and Cloud Run or GCS for hosting and upload storage. For sensitive data, prefer Ollama with local Postgres and disable cloud LLM extraction and insights.

## Quick Start

Expand Down Expand Up @@ -124,7 +128,7 @@ If you've set `API_BEARER_TOKEN` in `.env`, add the auth header:

```json
"--header",
"Authorization: Bearer YOUR_TOKEN_HERE"
"Authorization: Bearer <your-token>"
```

Restart Claude Desktop, and Textrawl's tools will be available.
Expand All @@ -135,12 +139,14 @@ ChatGPT Desktop supports MCP servers natively (Pro/Plus required):

1. Open **Settings → Connectors → Advanced → Developer mode**
2. Add a new connector with your server URL: `http://localhost:3000/mcp`
3. If using auth, add the `Authorization: Bearer YOUR_TOKEN` header
3. If using auth, add the `Authorization: Bearer <your-token>` header

See [OpenAI MCP documentation](https://platform.openai.com/docs/mcp) for details.

### 4. Add Your Documents

Imported documents, extracted memories, and conversation summaries are stored in your configured database/storage until deleted. Treat a Textrawl server as single-tenant unless you have added your own user isolation. Set `API_BEARER_TOKEN`, restrict CORS with `ALLOWED_ORIGINS`, and avoid importing third-party or private data without consent. Use `forget_entity` and `delete_conversation` to remove memory/conversation data, and `list_documents`/`update_document` to audit imported documents.

**Option A: Desktop App** (easiest)

```bash
Expand All @@ -160,7 +166,7 @@ pnpm upload -- ./converted/

| Guide | Description |
|-------|-------------|
| [Database Sizing](docs/guides/supabase-requirements.mdx) | Vector dimensions, index counts, and storage estimates by embedding provider |
| [Database Sizing](docs/guides/database-requirements.mdx) | Vector dimensions, index counts, and storage estimates by embedding provider |
| [CLI Tools](docs/cli/) | Batch conversion and upload from command line |
| [Security](docs/guides/security-hardening.mdx) | Row Level Security and access controls |

Expand Down Expand Up @@ -296,7 +302,7 @@ Enabled when `DATABASE_URL` is configured. Connects directly to Postgres.

```bash
curl -X POST http://localhost:3000/api/upload \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Authorization: Bearer <your-token>" \
-F "file=@document.pdf" \
-F "title=Optional Title" \
-F "tags=tag1,tag2"
Expand Down
2 changes: 1 addition & 1 deletion docs/RUNBOOK.md
Expand Up @@ -102,7 +102,7 @@ Add auth header when `API_BEARER_TOKEN` is set.

### ChatGPT Desktop

Use Settings → Connectors (Developer mode), point to `http://localhost:3000/mcp`, then add `Authorization: Bearer <token>` header if enabled.
Use Settings → Connectors (Developer mode), point to `http://localhost:3000/mcp`, then add the `Authorization: Bearer <your-token>` header if enabled.

## 6) MCP Inspector

Expand Down
2 changes: 1 addition & 1 deletion docs/getting-started/configuration.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ ALLOWED_ORIGINS=http://localhost:3000,https://myapp.com
API_BEARER_TOKEN=your-very-secure-token-with-at-least-32-characters
```

When set, all API endpoints require the `Authorization: Bearer <token>` header.
When set, all API endpoints require an `Authorization: Bearer <your-token>` header.
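The placeholder value above hints at a token of at least 32 characters. A minimal sketch of a pre-start check follows. It assumes the 32-character figure is a recommendation (nothing here confirms the server enforces it), and `token_is_long_enough` is a hypothetical helper, not part of textrawl.

```bash
# Hypothetical helper: succeeds (exit 0) when the token has at least 32 characters.
# The 32-character threshold is taken from the placeholder value, not from server code.
token_is_long_enough() {
  local token="$1"
  [ "${#token}" -ge 32 ]
}

# Report on the token in the current environment without printing it.
if token_is_long_enough "${API_BEARER_TOKEN:-}"; then
  echo "API_BEARER_TOKEN length looks fine"
else
  echo "API_BEARER_TOKEN is missing or shorter than 32 characters" >&2
fi
```

The check only measures length; it says nothing about how random the token is.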

**Unprotected endpoints** (for health checks):

Expand Down
4 changes: 3 additions & 1 deletion docs/getting-started/introduction.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,12 @@ The **Model Context Protocol (MCP)** is an open standard for connecting AI assis

- **Tool Use**: Claude can call functions to search, retrieve, and create content
- **Context Sharing**: Your documents become part of Claude's working knowledge
- **Privacy**: Data stays on your infrastructure, not uploaded to the cloud
- **Privacy**: Data stays in the infrastructure/providers you configure

## Key Features

Textrawl stores imported documents, extracted memories, and conversation summaries in your configured database/storage until deleted. Before importing emails, conversations, Takeout archives, images, or audio, configure authentication/CORS and avoid importing third-party/private data without consent.

### Hybrid Search

textrawl combines two search strategies:
Expand Down
35 changes: 23 additions & 12 deletions docs/getting-started/quick-start.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@ Get textrawl running in 5 minutes.
## Prerequisites

- **Node.js 22+** ([Download](https://nodejs.org/))
- **Supabase account** ([Sign up free](https://supabase.com/))
- **OpenAI API key** ([Get one](https://platform.openai.com/api-keys))
- **PostgreSQL with pgvector** (Neon, Supabase, RDS, or self-hosted)
- **Embedding provider**: OpenAI, Google AI, or Ollama for local embeddings

## Step 1: Clone and Install

Expand All @@ -29,24 +29,26 @@ pnpm run setup

You'll be prompted for:

- Supabase URL
- Supabase Service Key
- OpenAI API Key
- `DATABASE_URL` (use a pooled Postgres connection string for the server)
- Embedding provider credentials, or Ollama settings for local embeddings
- `API_BEARER_TOKEN` for authenticated clients

## Step 3: Initialize Database

In the Supabase SQL Editor, run:
Run the schema against your configured database:

```sql
-- Paste contents of scripts/setup-db.sql
```bash
psql "$DATABASE_URL" -f scripts/setup-db.sql
```

Then run the security script:
Use the provider-specific schema if needed (`setup-db-ollama.sql`, `setup-db-ollama-v2.sql`, or `setup-db-google.sql`). Then run the security script:

```sql
-- Paste contents of scripts/security-rls.sql
```bash
psql "$DATABASE_URL" -f scripts/security-rls.sql
```
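One way to pick the provider-specific schema mentioned above is a small lookup, sketched here under assumptions. The script filenames come from the text. The provider names, the `EMBEDDING_PROVIDER` variable, the `scripts/` location of the provider-specific files, and the `schema_for_provider` helper are all hypothetical and used only for illustration.

```bash
# Hypothetical helper: map an embedding provider name to a schema script.
# Provider names and the scripts/ prefix for the variants are assumptions.
schema_for_provider() {
  case "$1" in
    ollama)    echo "scripts/setup-db-ollama.sql" ;;
    ollama-v2) echo "scripts/setup-db-ollama-v2.sql" ;;
    google)    echo "scripts/setup-db-google.sql" ;;
    *)         echo "scripts/setup-db.sql" ;;
  esac
}

# Print the commands instead of running them so they can be reviewed first.
provider="${EMBEDDING_PROVIDER:-openai}"
echo "psql \"\$DATABASE_URL\" -f $(schema_for_provider "$provider")"
echo "psql \"\$DATABASE_URL\" -f scripts/security-rls.sql"
```

Printing rather than executing keeps the sketch safe to run against any environment; pipe the output to `sh` only after checking it.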

Supabase is supported as one Postgres option, but never expose Supabase service-role credentials to browser, desktop, or other client code. Prefer a pooled `DATABASE_URL` used only by the server.

## Step 4: Start the Server

```bash
Expand Down Expand Up @@ -84,7 +86,8 @@ Restart Claude Desktop to connect.
Query textrawl directly over HTTP (the setup script generates `API_BEARER_TOKEN` in your `.env`):

```bash
curl -H "Authorization: Bearer $API_BEARER_TOKEN" http://localhost:3000/api/documents
curl -H "Authorization: Bearer $API_BEARER_TOKEN" \
"http://localhost:3000/api/documents"
```

## Verify It Works
Expand All @@ -93,6 +96,14 @@ curl -H "Authorization: Bearer $API_BEARER_TOKEN" http://localhost:3000/api/docu

Open `http://localhost:3000`. Try uploading a document or creating a note from the dashboard.

Imported documents, extracted memories, and conversation summaries remain in your configured database/storage until deleted.

For production:

- Set `API_BEARER_TOKEN`.
- Restrict CORS.
- Avoid importing third-party/private data without consent.

### Using the REST API

```bash
Expand Down
4 changes: 2 additions & 2 deletions docs/guides/cloud-run-deployment.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -126,8 +126,8 @@ in-memory fake (local dev only).

```bash
# Point the service at the provisioned bucket
gcloud run services update textrawl --region us-east4 \
--update-env-vars GCS_UPLOAD_BUCKET=textrawl-uploads
gcloud run services update <service-name> --region <your-region> \
--update-env-vars GCS_UPLOAD_BUCKET=<your-upload-bucket>
# GCS_PROJECT_ID is optional — auto-detected from the runtime service account.
```

Expand Down
2 changes: 1 addition & 1 deletion docs/guides/security-hardening.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Always set `API_BEARER_TOKEN` in production:
API_BEARER_TOKEN=$(openssl rand -base64 32)
```

All API endpoints require `Authorization: Bearer <token>` header.
Data and write API endpoints require the `Authorization: Bearer <your-token>` header. Health endpoints such as `/health`, `/health/live`, and `/health/ready` remain unauthenticated so uptime checks can probe them.
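The split between protected and unauthenticated endpoints can be spot-checked from a client. The following is a sketch only: the three health paths come from the sentence above, while treating every other path as protected, the `needs_auth` helper, and the `BASE_URL` default are assumptions.

```bash
# Hypothetical helper: succeeds (exit 0) for paths expected to need a token.
# Only the three health paths are listed as open; everything else is assumed protected.
needs_auth() {
  case "$1" in
    /health|/health/live|/health/ready) return 1 ;;
    *) return 0 ;;
  esac
}

# Print one curl command per path; BASE_URL is an illustrative default.
BASE_URL="${BASE_URL:-http://localhost:3000}"
for path in /health /health/ready /api/documents; do
  if needs_auth "$path"; then
    echo "curl -H \"Authorization: Bearer \$API_BEARER_TOKEN\" $BASE_URL$path"
  else
    echo "curl $BASE_URL$path"
  fi
done
```

Running the printed commands against a live server, then repeating the protected one without the header, shows whether the server rejects unauthenticated data requests as described.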

## Row Level Security

Expand Down
8 changes: 4 additions & 4 deletions docs/index.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -33,15 +33,15 @@ Access your knowledge via the web dashboard, MCP for AI assistants, REST API, CL

### Privacy First

Self-hosted on your infrastructure. Your documents never leave your control.
Self-hosted on infrastructure and providers you configure. Use Ollama/local Postgres for fully local document text and embeddings; cloud embeddings, extraction, databases, storage, and deployment providers receive the data needed for those features.

## What is MCP?

The **Model Context Protocol (MCP)** is an open standard for connecting AI assistants to external data sources and tools. Adopted by Anthropic for Claude and donated to the Linux Foundation's Agentic AI Foundation, MCP enables:

- **Tool Use**: Claude can call functions to search, retrieve, and create content
- **Context Sharing**: Your documents become part of Claude's working knowledge
- **Privacy**: Data stays on your infrastructure, not uploaded to the cloud
- **Privacy**: Data stays in the infrastructure/providers you configure

## Quick Start

Expand All @@ -59,7 +59,7 @@ pnpm dev

## Tools

textrawl exposes 25 tools, available via MCP and REST API:
textrawl exposes 26 tools, available via MCP and REST API:

### Document Tools

Expand Down Expand Up @@ -144,7 +144,7 @@ textrawl exposes 25 tools, available via MCP and REST API:
│ └──────────────┴───────────┴─────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ Supabase PostgreSQL + pgvector ││
│ │ PostgreSQL + pgvector (Neon/Supabase/self-hosted) ││
│ │ • documents + chunks (document search) ││
│ │ • memory_entities + observations (memory) ││
│ │ • conversation_sessions + turns (conversations) ││
Expand Down
21 changes: 10 additions & 11 deletions infra/cloud-tasks/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ token itself (the service is public for MCP/API, so this is the access control).

| Resource | Value |
|---|---|
| Cloud Tasks queue | `textrawl-upload-processing` (location `us-east4`, colocated with Cloud Run + the GCS bucket) |
| Cloud Tasks queue | `<your-upload-processing-queue>` (colocated with Cloud Run + the GCS bucket) |
| OIDC invoker SA | `textrawl-tasks@<project>.iam.gserviceaccount.com` (identity in the task's OIDC token) |
| Enqueuer | the Cloud Run runtime SA (default compute SA) — granted `cloudtasks.enqueuer` (queue-scoped) |
| actAs | runtime SA granted `iam.serviceAccountUser` on the invoker SA |
Expand All @@ -22,15 +22,15 @@ Tasks' first 1M operations/month are free and an idle queue has no standing cost

## Run it

The script targets `--project=textrawl` explicitly and ignores your active
gcloud project (which is currently a different project). Review, then:
Set the target project/region explicitly instead of relying on your active
gcloud project. Review, then:

```sh
bash infra/cloud-tasks/setup.sh
# overridable: GCP_PROJECT_ID, GCP_REGION, CLOUD_RUN_SERVICE, CLOUD_TASKS_QUEUE, TASKS_SA_NAME
```

The runner needs admin on the `textrawl` project (service-usage, Cloud Tasks,
The runner needs admin on the target project (service-usage, Cloud Tasks,
service-account, project-IAM admin — or Editor/Owner). If your account lacks
these, run as the project owner.

Expand All @@ -44,24 +44,23 @@ re-run.
they are inert until that code ships, so set them at the same deploy:

```sh
gcloud run services update textrawl --project=textrawl --region=us-east4 \
--update-env-vars="CLOUD_TASKS_QUEUE=textrawl-upload-processing,CLOUD_TASKS_LOCATION=us-east4,CLOUD_TASKS_SERVICE_ACCOUNT=textrawl-tasks@textrawl.iam.gserviceaccount.com,UPLOAD_PROCESS_URL=<service-url>/api/upload/process,GCS_UPLOAD_BUCKET=textrawl-uploads" \
gcloud run services update <service-name> --project=<your-project> --region=<your-region> \
--update-env-vars="CLOUD_TASKS_QUEUE=<your-upload-processing-queue>,CLOUD_TASKS_LOCATION=<your-region>,CLOUD_TASKS_SERVICE_ACCOUNT=<tasks-sa>@<project>.iam.gserviceaccount.com,UPLOAD_PROCESS_URL=<service-url>/api/upload/process,GCS_UPLOAD_BUCKET=<your-upload-bucket>" \
--timeout=600
```

Notes:

- `UPLOAD_PROCESS_URL` is the live service URL + `/api/upload/process`; it is
both the task target base and the OIDC audience the app verifies.
- `GCS_UPLOAD_BUCKET` is **not currently set** on the service, so Phase 3's GCS
storage falls back to the in-memory fake in prod — set it here to activate the
real bucket.
- `GCS_UPLOAD_BUCKET` must be set for production large uploads. If it is unset,
storage falls back to an in-memory fake intended only for local/dev use.
- `--timeout=600` raises the request budget for streaming/extraction (currently
300s).
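The notes above can be turned into a small helper that derives `UPLOAD_PROCESS_URL` and assembles the comma-separated value for `--update-env-vars`. This is a sketch under assumptions: `upload_process_url` and `join_env_vars` are hypothetical helpers, and the service URL, queue, and bucket names are placeholders rather than real resources.

```bash
# Hypothetical helper: derive UPLOAD_PROCESS_URL from the service URL,
# dropping one trailing slash so the path is not doubled.
upload_process_url() {
  local service_url="${1%/}"
  echo "${service_url}/api/upload/process"
}

# Hypothetical helper: join KEY=VALUE pairs with commas for --update-env-vars.
join_env_vars() {
  local IFS=,
  echo "$*"
}

# Example values are placeholders, not real resources.
SERVICE_URL="https://example-service.a.run.app/"
join_env_vars \
  "CLOUD_TASKS_QUEUE=my-queue" \
  "UPLOAD_PROCESS_URL=$(upload_process_url "$SERVICE_URL")" \
  "GCS_UPLOAD_BUCKET=my-bucket"
```

Building the string in one place makes it harder to ship a deploy where `GCS_UPLOAD_BUCKET` is left out. Values that themselves contain commas would need gcloud's alternate-delimiter syntax, which this sketch does not handle.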

## Teardown

```sh
gcloud tasks queues delete textrawl-upload-processing --project=textrawl --location=us-east4
gcloud iam service-accounts delete textrawl-tasks@textrawl.iam.gserviceaccount.com --project=textrawl
gcloud tasks queues delete <your-upload-processing-queue> --project=<your-project> --location=<your-region>
gcloud iam service-accounts delete <tasks-sa>@<project>.iam.gserviceaccount.com --project=<your-project>
```
35 changes: 20 additions & 15 deletions infra/gcs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,23 +7,23 @@ Infra-as-config for the resumable large-upload workflow

| Setting | Value |
|---|---|
| Bucket | `gs://textrawl-uploads` |
| Project | `textrawl` (607480003712) |
| Location | `us-east4` (colocated with the Cloud Run `textrawl` service) |
| Bucket | `gs://<your-upload-bucket>` |
| Project | `<your-project>` (`<your-project-number>`) |
| Location | `<your-region>` (colocated with the Cloud Run service) |
| Uniform bucket-level access | enabled |
| Public access prevention | enforced |
| Soft-delete | disabled (transient bytes; avoids paying to retain deleted upload objects) |
| Lifecycle | delete objects ≥ 1 day old (abandoned-upload cleanup; `UPLOAD_CLEANUP_TTL_HOURS=24`) |
| IAM | `607480003712-compute@developer.gserviceaccount.com` → `roles/storage.objectAdmin` (bucket-scoped) |
| IAM | `<runtime-sa>@<project>.iam.gserviceaccount.com` → `roles/storage.objectAdmin` (bucket-scoped) |

The Cloud Run runtime SA (the default compute SA) reads/writes via ADC — no
service-account keys.

## Re-apply config

```sh
PROJECT=textrawl
BUCKET=gs://textrawl-uploads
PROJECT=<your-project>
BUCKET=gs://<your-upload-bucket>

# CORS (dashboard origin + localhost for browser-direct resumable PUTs)
gcloud storage buckets update $BUCKET --project=$PROJECT --cors-file=infra/gcs/cors.json
Expand All @@ -32,26 +32,31 @@ gcloud storage buckets update $BUCKET --project=$PROJECT --cors-file=infra/gcs/c
gcloud storage buckets update $BUCKET --project=$PROJECT --lifecycle-file=infra/gcs/lifecycle.json
```

## One-time creation (already done)
## One-time creation

```sh
gcloud storage buckets create gs://textrawl-uploads \
--project=textrawl --location=us-east4 \
PROJECT=<your-project>
REGION=<your-region>
BUCKET=gs://<your-upload-bucket>
RUNTIME_SA=<runtime-sa>@<project>.iam.gserviceaccount.com

gcloud storage buckets create $BUCKET \
--project=$PROJECT --location=$REGION \
--uniform-bucket-level-access --public-access-prevention

gcloud storage buckets add-iam-policy-binding gs://textrawl-uploads \
--project=textrawl \
--member="serviceAccount:607480003712-compute@developer.gserviceaccount.com" \
gcloud storage buckets add-iam-policy-binding $BUCKET \
--project=$PROJECT \
--member="serviceAccount:$RUNTIME_SA" \
--role="roles/storage.objectAdmin"

gcloud storage buckets update gs://textrawl-uploads --project=textrawl --clear-soft-delete
gcloud storage buckets update $BUCKET --project=$PROJECT --clear-soft-delete
```

## Server env (set on Cloud Run when the GCS StorageService lands, T3.2)

```sh
GCS_UPLOAD_BUCKET=textrawl-uploads
GCS_PROJECT_ID=textrawl # optional; auto-detected from ADC otherwise
GCS_UPLOAD_BUCKET=<your-upload-bucket>
GCS_PROJECT_ID=<your-project> # optional; auto-detected from ADC otherwise
```

## CORS origins
Expand Down