Merge branch 'main' of https://github.com/jwhco/scripts

hittjwX · hittjwX · commit 884bf764b563 · 2026-03-19T00:19:07.000-04:00
diff --git a/.vscode/extensions.json b/.vscode/extensions.json
@@ -2,17 +2,17 @@
 	"recommendations": [
 		"github.copilot-chat",
 		"github.vscode-pull-request-github",
-		"mark-wiemer.vscode-autohotkey-plus-plus",
-		"janisdd.vscode-edit-csv",
-		"ltex-plus.vscode-ltex-plus",
-		"ms-vscode.makefile-tools",
-		"ms-vscode-remote.remote-wsl",
-		"ms-toolsai.jupyter",
-		"ms-toolsai.jupyter-keymap",
-		"ms-toolsai.jupyter-renderers",
 		"ms-python.python",
 		"ms-python.vscode-pylance",
 		"ms-python.vscode-python-envs",
-		"yzhang.markdown-all-in-one"
+		"ms-toolsai.jupyter",
+		"ms-toolsai.jupyter-keymap",
+		"ms-toolsai.jupyter-renderers",
+		"yzhang.markdown-all-in-one",
+		"ms-vscode.makefile-tools",
+		"ms-vscode-remote.remote-wsl",
+		"mark-wiemer.vscode-autohotkey-plus-plus",
+		"janisdd.vscode-edit-csv",
+		"ltex-plus.vscode-ltex-plus"
 	]
 }
diff --git a/.vscode/k8s.code-workspace b/.vscode/k8s.code-workspace
@@ -1,11 +1,11 @@
 {
 	"folders": [
 		{
-			"path": "/workspace/scripts"
+			"path": "/workspaces/scripts",
 		},
 		{
-			"path": "/workspace/obsidian"
-		}
+			"path": "/workspaces/obsidian",
+		},
 	],
-	"settings": {}
-}
+	"settings": {},
+}
diff --git a/AGENTS.md b/AGENTS.md
@@ -2,7 +2,7 @@
 
 ## Project Overview
 
-- A library of scripts to support text conversion, editirial productivity, and quality assurance of markdown.
+- A library of scripts to support text conversion, editorial productivity, and quality assurance of markdown.
 
 ## Conciseness
 
diff --git a/MarkdownTools/docs/MarkdownTools-Installation.md b/MarkdownTools/docs/MarkdownTools-Installation.md
@@ -0,0 +1,26 @@
+# Installation Guide for Markdown Tools
+
+## Environment
+
+1. Set up the Python3 virtual environment,
+2. Install required Python3 modules,
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+python3 -m pip install -r MarkdownTools/requirements.txt
+```
+
+3. Open a Jupyter notebook in VS Code to run,
+   1. Correct any errors in execution,
+      1. Change the kernel to the Python virtual environment,
+   2. Make corrections in `requirements.txt`
+   3. Check that it is running in the right environment,
+4. FINISH
+
+## Execution
+
+- Make sure you are pointing at the right markdown corpus,
+
+
+/EOF/
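
The "Check running in right environment" step in the installation guide above could be verified from a notebook cell or script. This is a minimal sketch and not part of the commit; the function name `in_virtualenv` is hypothetical, and it assumes the environment was created with `python3 -m venv` as shown in the guide.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


# Show which interpreter is executing and whether a virtual environment is active.
print("interpreter:", sys.executable)
print("virtual environment active:", in_virtualenv())
```

If the second line reports `False` inside VS Code, the notebook kernel is probably still bound to the system interpreter rather than `.venv`.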
diff --git a/MarkdownTools/docs/extract-ngram-phrases-README.md b/MarkdownTools/docs/extract-ngram-phrases-README.md
@@ -2,11 +2,14 @@
 
 ## Use Case
 
-- Extract from a file all of the ngrams (trigram by default), then print on the screen.
+- Extract from a file all of the ngrams (trigram by default), then print on the screen. Use to better understand a single file or a corpus of files.
+- The n-grams can be piped into a script for further analysis, or handled by another tool for clustering. 
 
-## Configuration
+## Configure Run-Time Environment
 
-- Install NLTK data sets, https://www.nltk.org/data.html
+1. Work from a `.venv` Python Virtual Environment,
+2. Prepare packages, `pip install -r requirements.txt`,
+3. Install NLTK data sets, https://www.nltk.org/data.html
 
 ```python
 import nltk
diff --git a/MarkdownTools/docs/visualize-content-clusters-README.md b/MarkdownTools/docs/visualize-content-clusters-README.md
@@ -2,7 +2,7 @@
 
 ## Use Case
 
-- For a directory of markdown notes, determine what are the top five topical clusters. 
+- For a directory of markdown notes, determine what are the top five topical clusters.
 - Because hashtags and front matter tags are normalized, related terms will group on tags.
 - Works with markdown note-taking applications like Obsidian, Zettlr, LogSeq, and FOAM.
 
@@ -12,21 +12,58 @@
 
 ## Requirements
 
-- Break out YAML front matter tags and Camel case hash tags as plain words. 
-  - Example, `key-word` becomes `key word` for analysis. 
-  - Example, `KeyWord` becomes `key word` for analysis.
-  - Conversion happens before n-gram analysis of body text.
+- Break out YAML front matter tags and Camel case hash tags as plain words.
+    - Example, `key-word` becomes `key word` for analysis.
+    - Example, `KeyWord` becomes `key word` for analysis.
+    - Conversion happens before n-gram analysis of body text.
 - Ignore short common headers. The best way is to only tokenize headers three words or longer.
 - The ability to have custom stop words to clean up cluster results. Use this for brands, fractional words, and other words that show up in clusters but aren't useful.
+- Use Jupyter for concepts; for implementation, use a command-line script that can focus on specific directories.
+
+
+
+## Interpretation
+
+### Scatter Plot: Content Semantic Map
+
+Each dot represents one markdown note from your corpus `ZETTEL_ROOT`, a markdown repo.
+
+Here's how to interpret the scatter plot it produces:
+
+
+
+- **Color/cluster membership** indicates semantic similarity—notes of the same color share similar concepts and vocabulary
+- **Physical proximity** means notes are highly semantically related; dots clustered together contain overlapping ideas
+- **Distance between clusters** shows conceptual separation—far clusters represent distinct topics
+- **Cluster density** reflects thematic cohesion—tight clusters have focused meaning; loose clusters contain diverse but related concepts
+- **Isolated outliers** (dots far from clusters) represent unique notes that don't align well with major themes
+- **Top terms printed for each cluster** (C0, C1, etc.) reveal the dominant concepts defining that cluster
+- **Dimensionality reduction caveat**: the 2D plot compresses high-dimensional semantic space, so visual distance is approximate
+
+The key insight: **examine cluster labels and look for outliers**, then review the notes associated with them to validate whether the semantic grouping makes sense for your content.
+
+
 
 ## User Story
 
+### "Is my writing on topic?"
+
+- User has a markdown note-taking application with files stored as plain text. They want to get an idea of what they have been writing about.
+    - After running the script, they can see the top eight clusters of note-taking topics.
+    - After careful consideration, the user focuses on a specific cluster to create a report.
+- For the desired cluster, the tool reports observed context. User sees tight mapping of dots.
+
+### "Where to prune research set? Tighten work up?"
 
-- User have a markdown note-taking application with files stored as plain text. They want to get an idea of what they have been writing about.
-  - After running the script, they can see the top eight clusters of note-taking topics. 
-  - After careful consideration, the user focuses on a specific cluster to create a report.
+- User is examining a body of research, looking for a concentration to write a paper, but also wants awareness when it comes to distractions.
+  - All relevant research, the proposal, and the paper outline are put in the same directory.
+  - This may include draft materials, relevant commentary, and research notes.
+- User runs the script against the directory to see if there are any outliers to validate, then decides what to do about tangents.
+- An outlier is found: a cluster of n-grams with out-of-place words. User searches the corpus to move those notes out of the project.
+  - There is a level of curation in determining whether the note is on purpose for the project.
+  - In some cases, the outlier indicates a relevant topic that needs more research or expanding of context.
 
 
 > Copyright 2026 [JWH Consolidated LLC](https://www.jwhco.com/?utm_source=repository&utm_medium=github.com&utm_content=visualize-content-clusters) All rights reserved.
 
-/EOF/
+/EOF/
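
The tag requirement in the README above (`key-word` and `KeyWord` both become `key word` before n-gram analysis) can be sketched as follows. The helper name `normalize_tag` and the stripping of a leading `#` are assumptions for illustration; the notebook's actual implementation is not shown in this diff.

```python
import re


def normalize_tag(tag: str) -> str:
    """Turn a hashtag or front matter tag into plain lowercase words."""
    # Assumption: hashtags may arrive with their leading '#'.
    text = tag.lstrip("#")
    # Insert a space at each lower-to-upper boundary: "KeyWord" -> "Key Word".
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Treat hyphens as word separators: "key-word" -> "key word".
    text = text.replace("-", " ")
    # Lowercase and collapse any repeated whitespace.
    return " ".join(text.lower().split())


for example in ("key-word", "KeyWord", "#KeyWord"):
    print(example, "->", normalize_tag(example))
```

Running the normalization before tokenizing body text means a hyphenated front matter tag and a CamelCase hashtag contribute the same words to a cluster.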
diff --git a/MarkdownTools/extract-hashtag-terms.py b/MarkdownTools/extract-hashtag-terms.py
@@ -9,7 +9,7 @@
 
 # Configuration
 # ZETTEL_ROOT = "/home/hittjw/Documents/GitHub/obsidian/Zettelkasten" # Ubuntu
-ZETTEL_ROOT = "/workspace/obsidian/Zettelkasten" # K8S
+ZETTEL_ROOT = "/workspaces/obsidian/Zettelkasten" # K8S
 
 WHITELIST = {
     "vscode", "latex", "zettlr", "github", "obsidian", "python", "jupyter", 
diff --git a/MarkdownTools/visualize-content-clusters.ipynb b/MarkdownTools/visualize-content-clusters.ipynb
diff --git a/Text2Markdown/Is-PDF-Machine-Readable.ipynb b/Text2Markdown/Is-PDF-Machine-Readable.ipynb
@@ -53,18 +53,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "id": "4",
    "metadata": {},
    "outputs": [
     {
-     "ename": "",
-     "evalue": "",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[1;31mRunning cells with 'venv (3.10.12) (Python 3.10.12)' requires the ipykernel package.\n",
-      "\u001b[1;31mInstall 'ipykernel' into the Python environment. \n",
-      "\u001b[1;31mCommand: '/workspaces/scripts/venv/bin/python -m pip install ipykernel -U --force-reinstall'"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Hello World\n"
      ]
     }
    ],
@@ -84,7 +81,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "venv (3.10.12)",
+   "display_name": ".venv (3.12.3)",
    "language": "python",
    "name": "python3"
   },
@@ -98,7 +95,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,
diff --git a/TidyObsidian/docs/markdown-tasks-quality-README.md b/TidyObsidian/docs/markdown-tasks-quality-README.md
@@ -1,6 +1,6 @@
 # Markdown Task Quality Checker
 
-## Purpose 
+## Purpose
 
 - Find non-standard markdown tasks, fix them or highlight for user, report all tasks.
 - The script itself doesn't change the markdown files, it reports a higher quality version of the tasks.
@@ -12,8 +12,8 @@
 
 ```markdown
 - [ ] Task description. (Est: 8h)
-	- [ ] Sub-Task Description (2h)
-	- [ ] Sub-Task Two Description.
+    - [ ] Sub-Task Description (2h)
+    - [ ] Sub-Task Two Description.
 ```
 
 - Because Obsidian is my primary note-taking application, favor syntax compatible with Task plugin.
@@ -27,11 +27,10 @@
 ## Requirements
 
 - When cleaning up a task, don't change layout. Don't change indentation, tab spacing in front of bullet list. The task could have sub-tasks for details in a list.
-- Find all the markdown tasks like `grep -r -E '^[\t ]*[-*]\s*\[.?\].*' /workspace/obsidian --include=*.md` which works well. It finds things the script missed.
+- Find all the markdown tasks like `grep -r -E '^[\t ]*[-*]\s*\[.?\].*' /workspaces/obsidian --include=*.md` which works well. It finds things the script missed.
 - Script needs to know if a task is in a `---` or code block as an example. Wholesale updating format may be okay, except in documentation showing poor syntax.
 - Understand tasks that are hierarchical, attributing the indented sub-tasks as inherent dependencies of the higher-level task. An outline of tasks implies the highest-level tasks are completed after the sub-tasks, or sub-sub-tasks, are completed.
 
-
 ## Workflow Pseudocode
 
 1. Isolate leading structure (indentation + marker + checkbox). Find the task via basic formatting. Only looking for `- [ ]` task in various forms.
@@ -45,17 +44,14 @@
 9. Report best quality markdown task. Make sure that every task is hashed in a way to match back with original when updates are available.
 10. END
 
-
-
-
 ## Notes
 
-- Python library `markdown-checklist` can create task lists with checkboxes in Markdown format. 
+- Python library `markdown-checklist` can create task lists with checkboxes in Markdown format.
 - Python library `markdown-analysis` can parse markdown, extracting headers, paragraphs, and links. https://pypi.org/project/markdown-analysis/
 
 ## Reference
 
 - Matthew Rathbone. (2025, August 19) Markdown Task Lists and Checkboxes: Complete Guide for Project Management. https://blog.markdowntools.com/posts/markdown-task-lists-and-checkboxes-complete-guide
-  - Highlights good and bad syntax for basic task list. As well as some platform specific.
+    - Highlights good and bad syntax for basic task list. As well as some platform specific.
 
-/EOF/
+/EOF/
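
The `grep` pattern in the requirements above can be mirrored in Python as a starting point for the checker. This is a sketch under assumptions, not the script from the repository: the names `TASK_RE`, `find_tasks_in_text`, and `find_tasks` are hypothetical, and skipping fenced code blocks is only one possible reading of the "task is in a code block as an example" requirement.

```python
import re
from pathlib import Path

# Same pattern as the grep command: optional indentation, a '-' or '*'
# bullet, optional whitespace, then a checkbox with at most one character.
TASK_RE = re.compile(r"^[\t ]*[-*]\s*\[.?\].*")
FENCE = "`" * 3  # Markdown code fence marker


def find_tasks_in_text(text, skip_fenced=True):
    """Return (line_number, line) pairs for task-like lines in one document."""
    hits = []
    in_fence = False
    for number, line in enumerate(text.splitlines(), 1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every opening or closing fence
            continue
        if skip_fenced and in_fence:
            continue  # documentation examples are not real tasks
        if TASK_RE.match(line):
            hits.append((number, line))
    return hits


def find_tasks(root, skip_fenced=True):
    """Walk root recursively, like `grep -r --include=*.md`."""
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in find_tasks_in_text(text, skip_fenced):
            yield path, number, line
```

Lines are returned unchanged, so indentation and tab spacing survive for the later steps that must not alter layout.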
