"I think, therefore I am" — but what, exactly, are you thinking?
Cogito is a methodology for building a cognitive self-model from your personal knowledge base and git history.
The premise: your Obsidian vault (or any git-versioned knowledge base) contains two kinds of signal — what you changed (git diffs) and what you wrote (note content). Together they are a timestamped record of your thinking across time. This pipeline extracts both, embeds them, and surfaces patterns you were not aware of.
Your data stays local. What you open-source is the method.
Cogito approaches self-modeling as three separable questions. You can run any combination.
When did you work on what? How did your focus shift over time?
extract_diffs.py → data/diffs.jsonl (timestamped diff records)
embed_and_cluster.py → data/clusters.jsonl (embedded + KMeans clustered)
visualise.py → data/self_model_3d.html (interactive timeline)
Each dot is one commit. Position in space = semantic content. X axis = time. Color = topic cluster. Dot size = output intensity (net lines changed).
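As a rough illustration of what a timestamped diff record could look like, here is a minimal sketch that parses `git log --numstat` text into one record per commit. It is not the actual extract_diffs.py: the function name `parse_numstat_log`, the `@@` commit marker, the `sha`, `timestamp`, and `files` fields, and the definition of net lines as added minus deleted are all assumptions made for this example. Only the `hour`, `weekday`, and `net_lines` field names come from this README.

```python
import json
from datetime import datetime

# Assumed input: the text produced by
#   git log --numstat --date=iso-strict --pretty=format:"@@%H|%ad"
# The "@@" prefix marking each commit is a hypothetical convention.

def parse_numstat_log(text):
    """Turn `git log --numstat` text into one record per commit."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("@@"):
            sha, iso_date = line[2:].split("|", 1)
            when = datetime.fromisoformat(iso_date)
            current = {
                "sha": sha,
                "timestamp": iso_date,
                "hour": when.hour,
                "weekday": when.weekday(),  # Monday = 0
                "net_lines": 0,
                "files": [],
            }
            records.append(current)
        elif line.strip() and current is not None:
            added, deleted, path = line.split("\t", 2)
            # Binary files are reported as "-" and carry no line counts.
            if added != "-" and deleted != "-":
                # Assumption: net lines = lines added minus lines deleted.
                current["net_lines"] += int(added) - int(deleted)
            current["files"].append(path)
    return records

sample = (
    "@@abc123|2024-03-04T22:15:00+01:00\n"
    "10\t2\tnotes/essay.md\n"
    "-\t-\tassets/diagram.png\n"
    "\n"
    "@@def456|2024-03-05T09:30:00+01:00\n"
    "1\t7\tnotes/essay.md\n"
)
records = parse_numstat_log(sample)
for record in records:
    print(json.dumps(record))
```

Writing each record as one JSON line keeps the file appendable and easy to stream, which suits a `.jsonl` output.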
When are you most active? What does your work rhythm look like?
direction_b/temporal_patterns.py → data/temporal_report.html (hour/weekday/intensity heatmaps)
Uses the hour, weekday, and net_lines fields already in diffs.jsonl. No extra API calls. Four charts: commit frequency heatmap, output intensity heatmap, monthly volume, and monthly volume broken down by domain.
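The two heatmaps can be thought of as 7 × 24 grids filled from those three fields. The sketch below is a hedged illustration, not the contents of temporal_patterns.py: `build_heatmaps` is a hypothetical name, and treating intensity as the summed absolute net line change is an assumption.

```python
def build_heatmaps(records):
    """Return (frequency, intensity) grids indexed as grid[weekday][hour]."""
    frequency = [[0] * 24 for _ in range(7)]
    intensity = [[0] * 24 for _ in range(7)]
    for record in records:
        day, hour = record["weekday"], record["hour"]
        frequency[day][hour] += 1
        # Assumption: intensity is the summed absolute net line change.
        intensity[day][hour] += abs(record["net_lines"])
    return frequency, intensity

# Toy records using the field names from diffs.jsonl.
records = [
    {"weekday": 0, "hour": 22, "net_lines": 8},
    {"weekday": 0, "hour": 22, "net_lines": -6},
    {"weekday": 5, "hour": 9, "net_lines": 40},
]
frequency, intensity = build_heatmaps(records)

# The (weekday, hour) cell with the most commits.
busiest = max(
    ((day, hour) for day in range(7) for hour in range(24)),
    key=lambda cell: frequency[cell[0]][cell[1]],
)
print(busiest)
```

Either grid can be handed directly to a plotting library's heatmap function, since it is already a plain list of rows.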
What concepts appear across all your domains without you being aware of it?
direction_c/embed_notes.py → data/notes_chunks.jsonl (note content, chunked + embedded)
direction_c/cross_domain_analysis.py → data/cross_domain_map.html (combined behavioral + content map)
→ data/cross_domain_clusters.json (machine-readable, for AI querying)
This is the hard direction. It combines two data layers:
- Behavioral layer (diffs.jsonl): what you did
- Cognitive layer (notes_chunks.jsonl): what you thought and wrote
Clusters that span multiple domains (creative writing, technical work, legal documents, job search) are candidates for your cognitive identity anchors — the patterns you return to everywhere without choosing to.
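One simple way to find such candidates, sketched under assumptions: give every clustered item a domain label and keep the clusters whose members come from at least two domains. The function `cross_domain_clusters`, the `min_domains` threshold, and the idea that a domain label is already attached to each item are hypothetical. The real cross_domain_analysis.py may score clusters differently.

```python
from collections import Counter, defaultdict

def cross_domain_clusters(items, min_domains=2):
    """items: iterable of (cluster_id, domain) pairs, one per chunk or commit."""
    by_cluster = defaultdict(Counter)
    for cluster_id, domain in items:
        by_cluster[cluster_id][domain] += 1
    spanning = {}
    for cluster_id, counts in by_cluster.items():
        # Keep only clusters whose members come from several domains.
        if len(counts) >= min_domains:
            spanning[cluster_id] = dict(counts)
    return spanning

# Toy data: cluster 1 lives in a single domain, clusters 0 and 2 do not.
items = [
    (0, "creative writing"), (0, "technical work"), (0, "legal documents"),
    (1, "technical work"), (1, "technical work"),
    (2, "job search"), (2, "creative writing"),
]
anchors = cross_domain_clusters(items)
print(sorted(anchors))
```

Keeping the per-domain counts, not just the cluster ids, makes it possible to tell a cluster that is evenly spread from one that merely has a single stray member in a second domain.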
The output is not a dashboard. It's a conversation starter.
A 2D scatter plot of 5000+ points tells you nothing. The real output is cross_domain_clusters.json — a structured summary of which domains cluster together, with representative samples. Feed it to an AI that knows your context and ask: what does this mean about me?
The system surfaces the pattern. The AI explains what it means. You decide if it's true.
This is the part no existing self-tracking tool does: not visualization, but interpretation grounded in your actual behavioral record.
Systems don't acknowledge humans as humans.
Seat time ≠ competence. Keywords ≠ capability. A formal record ≠ what actually happened. Every existing classification system optimizes for what it can measure, not for what is real.
Cogito is one answer to this: if you can reconstruct a person's cognitive structure from their behavioral traces, you have evidence that doesn't depend on institutional categories. The person who falls through every existing classification still left a record — in the diffs, in the notes, in the patterns that span everything they ever worked on.
This is also why git diffs matter more than just reading notes:
A note is a final state. A diff is a decision.
When you delete a sentence and rewrite it, that's a data point about how your thinking changed. Standard RAG over your notes throws this away. Cogito keeps it.
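To make the difference concrete, here is a small sketch using Python's standard `difflib` on two versions of a note. The note text and file names are invented for the example. In a real vault the two versions would come from git history, not from in-memory lists.

```python
import difflib

before = ["I write to remember things.\n", "Notes are an archive.\n"]
after = ["I write to find out what I think.\n", "Notes are an archive.\n"]

diff = list(
    difflib.unified_diff(before, after, fromfile="a/note.md", tofile="b/note.md")
)

# Skip the "---" / "+++" file headers and keep only changed lines.
removed = [l[1:].strip() for l in diff if l.startswith("-") and not l.startswith("---")]
added = [l[1:].strip() for l in diff if l.startswith("+") and not l.startswith("+++")]

print("removed:", removed)
print("added:  ", added)
```

Reading only the final note shows the second sentence pair; the diff is what preserves the sentence that was abandoned.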
pip install gitpython openai umap-learn scikit-learn plotly numpy python-dotenv matplotlib networkx jieba

Create a .env file in your vault root:
OPENAI_API_KEY=sk-...
Edit VAULT_DIR at the top of each script to point to your vault (must be a git repo).
Cogito works on any folder of text files — Obsidian is just the most natural starting point because it already has a git plugin. The ingest.py script normalises other sources into the same format:
| Source | Command | Directions supported |
|---|---|---|
| Any .md / .txt folder | python ingest.py --source folder --path /path/to/notes | C only |
| Notion export (zip or folder) | python ingest.py --source notion --path /path/to/export.zip | C only |
| Google Docs via Google Takeout | python ingest.py --source gdocs --path /path/to/takeout.zip | C only |
| Any git repo | point VAULT_DIR at it in the scripts | A + B + C |
Output lands in data/ingested/<source>/. Then run direction_c/embed_notes.py pointing at that folder.
Why non-git sources only support Direction C: Directions A and B require commit history (timestamps, diffs, net lines changed). A Google Doc or Notion export is a snapshot, not a history. You get the content layer but not the behavioral layer.
If you want the full pipeline on Google Docs or Notion, export to a git-tracked folder and commit regularly. Even one commit per writing session is enough signal for Direction A.
| Step | Records | Est. cost |
|---|---|---|
| Direction A (diffs) | ~4000 commits | ~$0.50–1.00 USD |
| Direction C (notes) | ~3000 chunks | ~$0.30–0.60 USD |
Uses text-embedding-3-small. Resume support built in — safe to interrupt and restart.
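Resume support of this kind is commonly implemented by checking which records are already in the output file before embedding anything. The following is a minimal sketch of that pattern, not the scripts' actual logic: `embed_with_resume`, the `id` and `text` field names, and the `fake_embed` stand-in (used here in place of a real embedding API call) are all assumptions.

```python
import json
import os
import tempfile

def embed_with_resume(records, out_path, embed_fn):
    """Append one JSON line per record, skipping ids already written."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    done.add(json.loads(line)["id"])
    written = 0
    with open(out_path, "a", encoding="utf-8") as handle:
        for record in records:
            if record["id"] in done:
                continue  # already embedded in an earlier run
            row = {"id": record["id"], "embedding": embed_fn(record["text"])}
            handle.write(json.dumps(row) + "\n")
            handle.flush()  # push each row out so an interrupt loses little
            written += 1
    return written

def fake_embed(text):
    # Stand-in for a real embedding call; returns a toy 2-d vector.
    return [float(len(text)), float(text.count(" "))]

records = [{"id": "a1", "text": "first note"}, {"id": "b2", "text": "second note"}]
out_path = os.path.join(tempfile.mkdtemp(), "embedded.jsonl")

first_run = embed_with_resume(records[:1], out_path, fake_embed)  # "interrupted" run
second_run = embed_with_resume(records, out_path, fake_embed)     # restart
print(first_run, second_run)
```

Because the second run skips everything already on disk, restarting costs no extra embedding calls for records that were finished before the interruption.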
- Plot Ark — the institutional-scale version of this problem: xAPI behavioral analytics for learning systems
- career-ops — OSS job search pipeline; SQLite architecture RFC (#919) contributed here
- Not a therapy tool
- Not a productivity tracker
- Not a replacement for introspection
It's a mirror with a longer memory than you have.
MIT — use it, fork it, run it on your own data.
Built by someone who couldn't figure out what was driving them, so they built a system to find out.