diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
new file mode 100644
index 0000000..7ec4ee6
--- /dev/null
+++ b/.github/workflows/test.yml
@@ -0,0 +1,32 @@
+name: Node.js CI Pipeline
+
+on:
+ push:
+ branches: [ main ]
+ pull_request:
+ branches: [ main ]
+
+permissions:
+ contents: read
+
+jobs:
+ test:
+ runs-on: ubuntu-latest
+ strategy:
+ matrix:
+ node-version: [20.x, 22.x]
+
+ steps:
+ - uses: actions/checkout@v3
+
+ - name: Use Node.js ${{ matrix.node-version }}
+ uses: actions/setup-node@v3
+ with:
+ node-version: ${{ matrix.node-version }}
+ cache: 'npm'
+
+ - name: Clean Install and Test
+ run: |
+ npm ci
+ npm test
+ npm run docs:check
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 84545d8..8f86baa 100644
--- a/.gitignore
+++ b/.gitignore
@@ -32,6 +32,7 @@ Thumbs.db
.env.local
copilot-chat-history.json
*.traineddata
+.analytics_cache.json
# =========================
# Bot Specific: Data & Media
diff --git a/.husky/pre-commit b/.husky/pre-commit
new file mode 100755
index 0000000..86f640a
--- /dev/null
+++ b/.husky/pre-commit
@@ -0,0 +1,6 @@
+#!/usr/bin/env sh
+
+npm run docs:generate || exit 1
+git add docs/ || exit 1
+npm test || exit 1
+npm run docs:check || exit 1
diff --git a/README.md b/README.md
index b4a1862..e4124dc 100644
--- a/README.md
+++ b/README.md
@@ -80,6 +80,9 @@ The current Node ingestion pipeline only analyzes text-oriented files.
| `.csv` | Ingested by the active Node pipeline |
| `.log` | Ingested by the active Node pipeline |
| `.pdf` | Ingested by the active Node pipeline |
+| `.png` | Ingested by the active Node pipeline |
+| `.jpg` | Ingested by the active Node pipeline |
+| `.jpeg` | Ingested by the active Node pipeline |
## Repository Layout
@@ -87,7 +90,7 @@ The current Node ingestion pipeline only analyzes text-oriented files.
- `src/index.js` — Node CLI entry point.
- `src/pipeline.js` — Pipeline coordinator that assembles all analytics tiers.
-- `src/ingestion/file-ingestion.js` — Read-only recursive file ingestion for supported text files.
+- `src/ingestion/file-ingestion.js` — Read-only recursive file ingestion for supported files.
- `src/analytics/` — Descriptive, diagnostic, predictive, and prescriptive analytics modules.
- `test/pipeline.test.js` — Node test coverage for core pipeline behavior.
- `docs/architecture.md` — Hand-authored architecture overview for current and planned system design.
@@ -125,6 +128,99 @@ The bot must never modify, move, or delete ingested source files. Ingestion is r
- When adding analytics, classify behavior under one of the four analytics tiers.
- Update [docs/architecture.md](docs/architecture.md) when implementation changes affect current-vs-planned system boundaries.
+
+
+
+
+
+## ⚙️ Installation & Setup
+
+**Prerequisites:** Ensure you have [Node.js](https://nodejs.org/) installed (version 18, 20, or 22+ recommended).
+
+1. **Clone the repository:**
+```bash
+git clone https://github.com/aj1126/uap_analyticsbot.git
+cd uap_analyticsbot
+
+```
+
+
+2. **Install dependencies:**
+This project installs as a standard Node.js CLI package, so there are no extra native build steps required for the current worker-thread ingestion flow. Simply run:
+```bash
+npm install
+
+```
+
+
+3. **Verify the installation:**
+Run the local test suite to ensure the multithreaded worker pool and caching engine are functioning correctly on your machine:
+```bash
+npm test
+
+```
+
+
+*(If all tests pass, you are ready to start analyzing documents.)*
+
+
+---
+
+
+
+
+
+
+## Usage
+
+
+To run AnalyticsBot, pass the target directory containing your files as the first argument (it defaults to the current working directory when omitted):
+
+```bash
+node src/index.js ./my_folder/
+
+```
+
+By default, this will parse the documents and output a formatted JSON report directly to your console.
+
+### 👀 Watch Mode
+
+Keep the pipeline running. It re-runs the analysis whenever you add, edit, or delete a file in the target directory (changes are debounced by 500 ms):
+
+```bash
+node src/index.js ./my_folder/ --watch
+
+```
+
+### 🖨️ Report Generation
+
+Instead of printing JSON to the console, you can generate formatted report files, which are saved to a `data_exports/` directory under the current working directory:
+
+```bash
+node src/index.js ./my_folder/ --format=md
+
+```
+
+*(Supports `md` for Markdown or `csv` for spreadsheet datasets).*
+
+
+---
+
+
+### 🚀 Advanced Usage
+
+The v1.2.0 AnalyticsBot engine supports multithreading and memoization caching. You can control these via CLI arguments:
+
+* `node src/index.js ./my_folder --workers=4` : Manually set the number of Node.js worker threads (defaults to one fewer than the number of CPU cores, with a minimum of 1, and never more than the number of files to process).
+* `node src/index.js ./my_folder --clear-cache` : Bypasses the `.analytics_cache.json` file and forces a fresh read of all documents.
+* `node src/index.js ./my_folder --format=csv` : Exports the final report as a spreadsheet-compatible `.csv` file.
+
+
+
+
+
+
+
## 🚀 Planned Technical Optimizations
### 1. Performance & Infrastructure
diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md
index d8f5898..4f88bf4 100644
--- a/docs/USER_GUIDE.md
+++ b/docs/USER_GUIDE.md
@@ -56,14 +56,16 @@ npm start -- "C:\Path\To\Folder" > analytics_report.json
## Supported File Types
-Currently, the ingestion engine natively parses the following text-based extensions:
+Currently, the ingestion engine natively parses the following extensions:
* `.txt`
* `.md`
* `.json`
* `.csv`
* `.log`
-
-*(Note: Binary and multimedia extraction, such as PDF parsing and Image OCR, are tracked for a future development stage).*
+* `.pdf`
+* `.png`
+* `.jpg`
+* `.jpeg`
## Testing & Validation
diff --git a/docs/architecture.md b/docs/architecture.md
index c79a4eb..2346525 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -4,26 +4,31 @@
The repository currently ships a Node.js CLI-centered analytics flow:
-1. **CLI Orchestrator (`src/index.js`)** resolves the source directory and writes the final report to stdout.
-2. **Read-Only Ingestion (`src/ingestion/file-ingestion.js`)** recursively scans supported text files, streams file content, and extracts words, dates, locations, and filesystem metadata.
+1. **CLI Orchestrator (`src/index.js`)** resolves the source directory, supports watch mode, and routes report output to stdout or export files.
+2. **Read-Only Ingestion (`src/ingestion/file-ingestion.js`)** recursively scans supported text files, dispatches parsing work to Node.js worker threads, memoizes compatible results in `.analytics_cache.json`, and extracts words, dates, locations, and filesystem metadata.
3. **Analytics Pipeline (`src/pipeline.js`)** builds the descriptive, diagnostic, predictive, and prescriptive tiers from the ingested file set.
-4. **Output Layer** returns a single structured JSON report for the requested directory.
+4. **Output Layer** returns structured JSON or saves Markdown / CSV exports for the requested directory.
+
+### v1.2.0 Pipeline Architecture
+* **Ingestion (Multithreaded):** Uses Node.js `worker_threads` and file-stat fingerprinting (size plus modification time, stored in `.analytics_cache.json`) to skip re-parsing files that have not changed.
+* **Semantic Analytics:** Uses TF-IDF weighting to down-weight words that appear in most documents, and cosine similarity between the resulting vectors to link each UAP document to up to three of its most closely related documents.
## Current Runtime Boundaries
Implemented today:
- recursive read-only ingestion for `.txt`, `.md`, `.json`, `.csv`, and `.log`
+- multithreaded parsing with fingerprint-based cache reuse for compatible ingestions
- tokenization plus lightweight date/location extraction
- descriptive, diagnostic, predictive, and prescriptive analytics modules
-- JSON report delivery through the Node CLI
+- JSON, Markdown, and CSV report delivery through the Node CLI
+- directory watch mode that re-runs the pipeline after file changes
Not yet implemented in the active system:
- binary or multimedia extraction
- Named Entity Recognition (NER)
-- dashboard or alternate export formats
-- background scheduling or directory watching
+- dashboard or background scheduling
## Planned Expansion
diff --git a/docs/docs-source.json b/docs/docs-source.json
index 568d598..2f8acb3 100644
--- a/docs/docs-source.json
+++ b/docs/docs-source.json
@@ -26,7 +26,7 @@
"description": "Auto-generate CHANGELOG.md, bump the semantic version, and create a Git release tag based on conventional commit history."
}
],
- "supportedFileTypes": [".txt", ".md", ".json", ".csv", ".log", ".pdf"],
+ "supportedFileTypes": [".txt", ".md", ".json", ".csv", ".log", ".pdf", ".png", ".jpg", ".jpeg"],
"repoLayout": [
{
"path": "src/index.js",
@@ -38,7 +38,7 @@
},
{
"path": "src/ingestion/file-ingestion.js",
- "description": "Read-only recursive file ingestion for supported text files."
+ "description": "Read-only recursive file ingestion for supported files."
},
{
"path": "src/analytics/",
diff --git a/package-lock.json b/package-lock.json
index d1e5ba5..49baa7c 100644
--- a/package-lock.json
+++ b/package-lock.json
@@ -16,7 +16,8 @@
"tesseract.js": "^7.0.0"
},
"devDependencies": {
- "commit-and-tag-version": "^12.7.3"
+ "commit-and-tag-version": "^12.7.3",
+ "husky": "^9.1.7"
}
},
"node_modules/@babel/code-frame": {
@@ -1299,6 +1300,22 @@
"node": ">=10"
}
},
+ "node_modules/husky": {
+ "version": "9.1.7",
+ "resolved": "https://registry.npmjs.org/husky/-/husky-9.1.7.tgz",
+ "integrity": "sha512-5gs5ytaNjBrh5Ow3zrvdUUY+0VxIuWVL4i9irt6friV+BqdCfmV11CQTWMiBYWHbXhco+J1kHfTOUkePhCDvMA==",
+ "dev": true,
+ "license": "MIT",
+ "bin": {
+ "husky": "bin.js"
+ },
+ "engines": {
+ "node": ">=18"
+ },
+ "funding": {
+ "url": "https://github.com/sponsors/typicode"
+ }
+ },
"node_modules/idb-keyval": {
"version": "6.2.5",
"resolved": "https://registry.npmjs.org/idb-keyval/-/idb-keyval-6.2.5.tgz",
diff --git a/package.json b/package.json
index 7cc9abd..b8c1b2c 100644
--- a/package.json
+++ b/package.json
@@ -6,9 +6,10 @@
"main": "src/index.js",
"scripts": {
"start": "node src/index.js",
- "test": "node --test",
+ "test": "node --test --experimental-test-coverage",
"docs:generate": "node scripts/generate-docs.js",
"docs:check": "node scripts/generate-docs.js --check && node scripts/validate-docs.js",
+ "prepare": "husky",
"release": "commit-and-tag-version",
"postrelease": "git push --follow-tags && gh release create v%npm_package_version% --notes-file CHANGELOG.md --title \"Release v%npm_package_version%\""
},
@@ -29,6 +30,7 @@
"tesseract.js": "^7.0.0"
},
"devDependencies": {
- "commit-and-tag-version": "^12.7.3"
+ "commit-and-tag-version": "^12.7.3",
+ "husky": "^9.1.7"
}
}
diff --git a/src/analytics/descriptive.js b/src/analytics/descriptive.js
index 111739e..60ed4ff 100644
--- a/src/analytics/descriptive.js
+++ b/src/analytics/descriptive.js
@@ -1,34 +1,43 @@
-function countBy(items) {
- return items.reduce((counts, item) => {
- counts[item] = (counts[item] ?? 0) + 1;
- return counts;
- }, {});
-}
-
function sortEntriesDescending(record) {
return Object.entries(record).sort((left, right) => right[1] - left[1] || left[0].localeCompare(right[0]));
}
function buildDescriptiveAnalytics(files) {
- const allWords = files.flatMap((file) => file.words);
- const allDates = files.flatMap((file) => file.dates);
- const allLocations = files.flatMap((file) => file.locations);
+ const allDates = files.flatMap((file) => file.dates || []);
+ const allLocations = files.flatMap((file) => file.locations || []);
+
+ const globalWordFrequency = {};
+ const glossarySet = new Set();
- const wordFrequency = countBy(allWords);
+ // Iterate through files using the new memory-efficient object format
+ files.forEach((file) => {
+ if (file.wordFrequency) {
+ for (const [word, count] of Object.entries(file.wordFrequency)) {
+ globalWordFrequency[word] = (globalWordFrequency[word] || 0) + count;
+ glossarySet.add(word);
+ }
+ } else if (file.words) {
+ // Backwards compatibility layer
+ for (const word of file.words) {
+ globalWordFrequency[word] = (globalWordFrequency[word] || 0) + 1;
+ glossarySet.add(word);
+ }
+ }
+ });
return {
fileCount: files.length,
- glossary: [...new Set(allWords)].sort(),
- wordFrequency,
- topWords: sortEntriesDescending(wordFrequency).slice(0, 10).map(([word, count]) => ({ word, count })),
+ glossary: [...glossarySet].sort(),
+ wordFrequency: globalWordFrequency,
+ topWords: sortEntriesDescending(globalWordFrequency).slice(0, 10).map(([word, count]) => ({ word, count })),
dates: [...new Set(allDates)].sort(),
locations: [...new Set(allLocations)].sort(),
files: files.map((file) => ({
path: file.relativePath,
- extension: file.extension, // <-- FIX: Added extension propagation
+ extension: file.extension,
size: file.size,
modifiedAt: file.modifiedAt,
- wordCount: file.words.length,
+ wordCount: file.totalWords || (file.words ? file.words.length : 0),
dates: file.dates,
locations: file.locations,
metadata: file.metadata || {}
diff --git a/src/analytics/diagnostic.js b/src/analytics/diagnostic.js
index acbbaf6..1d63749 100644
--- a/src/analytics/diagnostic.js
+++ b/src/analytics/diagnostic.js
@@ -1,28 +1,22 @@
function incrementNestedCount(target, firstKey, secondKey, amount = 1) {
- if (!target[firstKey]) {
- target[firstKey] = {};
- }
-
+ if (!target[firstKey]) target[firstKey] = {};
target[firstKey][secondKey] = (target[firstKey][secondKey] ?? 0) + amount;
}
function buildUsageRates(files, groupSelector) {
const groupedCounts = {};
-
for (const file of files) {
const groups = groupSelector(file);
- if (groups.length === 0 || file.words.length === 0) {
- continue;
- }
+ const uniqueWords = file.uniqueWords || (file.words ? [...new Set(file.words)] : []);
+
+ if (!groups || groups.length === 0 || uniqueWords.length === 0) continue;
- const uniqueWords = new Set(file.words);
for (const group of groups) {
for (const word of uniqueWords) {
incrementNestedCount(groupedCounts, group, word);
}
}
}
-
return Object.fromEntries(
Object.entries(groupedCounts).map(([group, counts]) => {
const total = Object.values(counts).reduce((sum, count) => sum + count, 0) || 1;
@@ -30,19 +24,102 @@ function buildUsageRates(files, groupSelector) {
.map(([word, count]) => ({ word, usageRate: Number((count / total).toFixed(4)) }))
.sort((left, right) => right.usageRate - left.usageRate || left.word.localeCompare(right.word))
.slice(0, 5);
-
return [group, topWords];
})
);
}
+function calculateCosineSimilarity(vecA, vecB) {
+ let dotProduct = 0;
+ let normA = 0;
+ let normB = 0;
+
+ for (const word in vecA) {
+ dotProduct += (vecA[word] || 0) * (vecB[word] || 0);
+ normA += Math.pow(vecA[word], 2);
+ }
+ for (const word in vecB) {
+ normB += Math.pow(vecB[word], 2);
+ }
+
+ if (normA === 0 || normB === 0) return 0;
+ return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
+}
+
+function calculateTFIDF(files) {
+ const fileCount = files.length;
+ const documentFrequencies = {};
+
+ files.forEach(file => {
+ const unique = file.uniqueWords || (file.words ? [...new Set(file.words)] : []);
+ unique.forEach(word => { documentFrequencies[word] = (documentFrequencies[word] || 0) + 1; });
+ });
+
+ // Pass 1: Build Multidimensional Vectors
+ const vectorizedFiles = files.map(file => {
+ let tf = file.wordFrequency || {};
+ let totalWords = file.totalWords || 1;
+
+ // Backwards compatibility layer for un-cleared caches
+ if (!file.wordFrequency && file.words) {
+ file.words.forEach(word => { tf[word] = (tf[word] || 0) + 1; });
+ totalWords = file.words.length || 1;
+ }
+
+ const vector = {};
+ const tfidf = Object.keys(tf).map(word => {
+ const termFrequency = tf[word] / totalWords;
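+ const inverseDocumentFrequency = Math.log((1 + fileCount) / (1 + documentFrequencies[word])) + 1; // smoothed IDF, always positive (the unsmoothed form goes negative for words present in every file)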
+ const weight = termFrequency * inverseDocumentFrequency;
+ vector[word] = weight;
+ return { word, weight };
+ }).sort((a, b) => b.weight - a.weight);
+
+ return { ...file, topKeywords: tfidf.slice(0, 5).map(t => t.word), vector };
+ });
+
+ // ✨ Pass 2: Semantic Cross-Linking Loop (capped at MAX_CROSS_REF files to bound the O(n^2) pairwise comparison)
+ const MAX_CROSS_REF = 500;
+ const targetFiles = vectorizedFiles.slice(0, MAX_CROSS_REF);
+ const relatedByIndex = Array.from({ length: vectorizedFiles.length }, () => []);
+
+ const getFileLabel = (file) => file.fileName || file.relativePath || 'unknown';
+
+ for (let indexA = 0; indexA < targetFiles.length; indexA += 1) {
+ for (let indexB = indexA + 1; indexB < targetFiles.length; indexB += 1) {
+ const fileA = targetFiles[indexA];
+ const fileB = targetFiles[indexB];
+ const score = calculateCosineSimilarity(fileA.vector, fileB.vector);
+
+ if (score > 0.05) {
+ const correlationScore = Number(score.toFixed(4));
+ relatedByIndex[indexA].push({ match: getFileLabel(fileB), correlationScore });
+ relatedByIndex[indexB].push({ match: getFileLabel(fileA), correlationScore });
+ }
+ }
+ }
+
+ return vectorizedFiles.map((file, index) => {
+ const related = relatedByIndex[index]
+ ? relatedByIndex[index].sort((left, right) => right.correlationScore - left.correlationScore).slice(0, 3)
+ : [];
+
+ return {
+ fileName: getFileLabel(file),
+ topKeywords: file.topKeywords,
+ relatedDocuments: related
+ };
+ });
+}
+
function buildDiagnosticAnalytics(files) {
+ const tfIdfAnalysis = calculateTFIDF(files);
+
return {
- wordUsageByDate: buildUsageRates(files, (file) => file.dates),
- wordUsageByLocation: buildUsageRates(files, (file) => file.locations)
+ wordUsageByDate: buildUsageRates(files, (file) => file.dates || []),
+ wordUsageByLocation: buildUsageRates(files, (file) => file.locations || []),
+ semanticAnalysis: tfIdfAnalysis
};
}
-module.exports = {
- buildDiagnosticAnalytics
-};
+module.exports = { buildDiagnosticAnalytics };
\ No newline at end of file
diff --git a/src/analytics/predictive.js b/src/analytics/predictive.js
index 2634874..492e76b 100644
--- a/src/analytics/predictive.js
+++ b/src/analytics/predictive.js
@@ -1,30 +1,25 @@
-function monthKey(dateString) {
- return dateString.slice(0, 7);
-}
-
-function average(values) {
- if (values.length === 0) {
- return 0;
- }
-
- return values.reduce((sum, value) => sum + value, 0) / values.length;
-}
+function monthKey(dateString) { return dateString.slice(0, 7); }
+// ✨ Forecasting tweak: linearly weighted moving average of month-over-month deltas
function forecastNextValue(series) {
- if (series.length === 0) {
- return 0;
- }
-
- if (series.length === 1) {
- return series[0].count;
- }
+ if (series.length === 0) return 0;
+ if (series.length === 1) return series[0].count;
const deltas = [];
for (let index = 1; index < series.length; index += 1) {
deltas.push(series[index].count - series[index - 1].count);
}
- return Math.max(0, Math.round(series[series.length - 1].count + average(deltas)));
+ let weightedSum = 0;
+ let weightTotal = 0;
+ for (let i = 0; i < deltas.length; i++) {
+ const weight = i + 1; // More recent intervals gain higher weight
+ weightedSum += deltas[i] * weight;
+ weightTotal += weight;
+ }
+
+ const wma = weightTotal === 0 ? 0 : weightedSum / weightTotal;
+ return Math.max(0, Math.round(series[series.length - 1].count + wma));
}
function addMonth(month) {
@@ -33,33 +28,50 @@ function addMonth(month) {
return nextDate.toISOString().slice(0, 7);
}
+// ✨ Support empty intervals
+function fillEmptyIntervals(orderedMonths, timeline) {
+ if (orderedMonths.length === 0) return [];
+ const filledSeries = [];
+ let currentMonth = orderedMonths[0];
+ const lastMonth = orderedMonths[orderedMonths.length - 1];
+
+ while (currentMonth <= lastMonth) {
+ filledSeries.push({
+ month: currentMonth,
+ count: timeline[currentMonth] ? timeline[currentMonth].totalWords : 0
+ });
+ currentMonth = addMonth(currentMonth);
+ }
+ return filledSeries;
+}
+
function buildKeywordSeries(files) {
const timeline = {};
-
for (const file of files) {
- const key = monthKey(file.modifiedAt);
- if (!timeline[key]) {
- timeline[key] = { totalWords: 0, locations: {} };
- }
+ const documentDate = (file.dates || []).find((value) => /^[0-9]{4}-[0-9]{2}(?:-[0-9]{2})?$/.test(value)) || file.modifiedAt;
+ if (!documentDate) continue;
+
+ const key = monthKey(documentDate);
+ if (!timeline[key]) timeline[key] = { totalWords: 0, locations: {} };
- timeline[key].totalWords += file.words.length;
- for (const location of file.locations) {
+ timeline[key].totalWords += file.totalWords || (file.words || []).length;
+ for (const location of (file.locations || [])) {
timeline[key].locations[location] = (timeline[key].locations[location] ?? 0) + 1;
}
}
-
return timeline;
}
function buildPredictiveAnalytics(files) {
const timeline = buildKeywordSeries(files);
const orderedMonths = Object.keys(timeline).sort();
- const keywordSeries = orderedMonths.map((month) => ({ month, count: timeline[month].totalWords }));
+
+ const keywordSeries = fillEmptyIntervals(orderedMonths, timeline);
const nextMonth = orderedMonths.length > 0 ? addMonth(orderedMonths[orderedMonths.length - 1]) : new Date().toISOString().slice(0, 7);
const locationTotals = {};
for (const month of orderedMonths) {
- for (const [location, count] of Object.entries(timeline[month].locations)) {
+ for (const [location, count] of Object.entries(timeline[month]?.locations || {})) {
locationTotals[location] = (locationTotals[location] ?? 0) + count;
}
}
@@ -68,18 +80,9 @@ function buildPredictiveAnalytics(files) {
.sort((left, right) => right[1] - left[1] || left[0].localeCompare(right[0]))[0]?.[0] ?? null;
return {
- keywordFrequencyForecast: {
- basis: keywordSeries,
- forecastMonth: nextMonth,
- forecastWordCount: forecastNextValue(keywordSeries)
- },
- locationClusterForecast: {
- basis: locationTotals,
- likelyNextHotspot: topLocation
- }
+ keywordFrequencyForecast: { basis: keywordSeries, forecastMonth: nextMonth, forecastWordCount: forecastNextValue(keywordSeries) },
+ locationClusterForecast: { basis: locationTotals, likelyNextHotspot: topLocation }
};
}
-module.exports = {
- buildPredictiveAnalytics
-};
+module.exports = { buildPredictiveAnalytics };
\ No newline at end of file
diff --git a/src/analytics/prescriptive.js b/src/analytics/prescriptive.js
index d9ca602..c1fcde2 100644
--- a/src/analytics/prescriptive.js
+++ b/src/analytics/prescriptive.js
@@ -1,7 +1,8 @@
function buildPrescriptiveAnalytics(files, descriptiveAnalytics) {
const missingMetadataFiles = files
- .filter((file) => file.dates.length === 0 || file.locations.length === 0)
- .map((file) => file.relativePath);
+ .filter((file) => !file.dates?.length || !file.locations?.length)
+ // FIX: The worker pool returns `fileName`, so we map that instead (with a fallback)
+ .map((file) => file.fileName || file.relativePath);
const recommendations = [];
@@ -13,7 +14,7 @@ function buildPrescriptiveAnalytics(files, descriptiveAnalytics) {
});
}
- if (descriptiveAnalytics.locations.length > 1) {
+ if (descriptiveAnalytics.locations && descriptiveAnalytics.locations.length > 1) {
recommendations.push({
type: 'folder-restructure',
message: 'Consider grouping files into location-based subfolders to improve topic clustering and navigation.',
@@ -35,4 +36,4 @@ function buildPrescriptiveAnalytics(files, descriptiveAnalytics) {
module.exports = {
buildPrescriptiveAnalytics
-};
+};
\ No newline at end of file
diff --git a/src/delivery/csv-generator.js b/src/delivery/csv-generator.js
new file mode 100644
index 0000000..45e328c
--- /dev/null
+++ b/src/delivery/csv-generator.js
@@ -0,0 +1,36 @@
+const fs = require('node:fs/promises');
+const path = require('node:path');
+
+function escapeCsvCell(value) {
+ const stringValue = String(value ?? '');
+ const sanitizedValue = /^\s*[=+\-@]/.test(stringValue) ? `'${stringValue}` : stringValue;
+ return `"${sanitizedValue.replace(/"/g, '""')}"`;
+}
+
+function buildCsvRow(...cells) {
+ return `${cells.map(escapeCsvCell).join(',')}\n`;
+}
+
+async function generateCsvReport(report, exportsDir) {
+ await fs.mkdir(exportsDir, { recursive: true });
+ const csvPath = path.join(exportsDir, `report-${Date.now()}.csv`);
+
+ let csvContent = buildCsvRow('Category', 'Metric', 'Value');
+ csvContent += buildCsvRow('Descriptive', 'FileCount', report.descriptive.fileCount);
+
+ const locations = report.descriptive.locations || report.locations || [];
+ csvContent += buildCsvRow('Descriptive', 'UniqueLocations', locations.join(', '));
+
+ if (report.predictive?.locationClusterForecast) {
+ csvContent += buildCsvRow('Predictive', 'LikelyNextHotspot', report.predictive.locationClusterForecast.likelyNextHotspot);
+ }
+ if (report.predictive?.keywordFrequencyForecast) {
+ csvContent += buildCsvRow('Predictive', 'ForecastMonth', report.predictive.keywordFrequencyForecast.forecastMonth);
+ csvContent += buildCsvRow('Predictive', 'ForecastWordCount', report.predictive.keywordFrequencyForecast.forecastWordCount);
+ }
+
+ await fs.writeFile(csvPath, csvContent, 'utf-8');
+ return csvPath;
+}
+
+module.exports = { generateCsvReport };
\ No newline at end of file
diff --git a/src/index.js b/src/index.js
index 97b52f8..031fc69 100644
--- a/src/index.js
+++ b/src/index.js
@@ -3,14 +3,19 @@ const path = require('node:path');
const chokidar = require('chokidar');
const { generateAnalyticsReport } = require('./pipeline');
const { generateMarkdownReport } = require('./delivery/markdown-generator');
+const { generateCsvReport } = require('./delivery/csv-generator');
-async function runPipeline(sourceDirectory, format) {
+async function runPipeline(sourceDirectory, format, options) {
try {
- const report = await generateAnalyticsReport(sourceDirectory);
+ const report = await generateAnalyticsReport(sourceDirectory, options);
+ const exportsDir = path.join(process.cwd(), 'data_exports');
+
if (format === 'md' || format === 'markdown') {
- const exportsDir = path.join(process.cwd(), 'data_exports');
const savedPath = await generateMarkdownReport(report, exportsDir);
process.stdout.write(`✅ Markdown report successfully generated at:\n${savedPath}\n`);
+ } else if (format === 'csv') {
+ const savedPath = await generateCsvReport(report, exportsDir);
+ process.stdout.write(`✅ CSV report successfully generated at:\n${savedPath}\n`);
} else {
process.stdout.write(`${JSON.stringify(report, null, 2)}\n`);
}
@@ -22,42 +27,48 @@ async function runPipeline(sourceDirectory, format) {
async function main() {
const args = process.argv.slice(2);
- // Parse flags
const formatFlag = args.find(arg => arg.startsWith('--format='));
const format = formatFlag ? formatFlag.split('=')[1].toLowerCase() : 'json';
const isWatchMode = args.includes('--watch');
+ const clearCache = args.includes('--clear-cache');
+
+ const workersFlag = args.find(arg => arg.startsWith('--workers='));
+ let workers;
+ if (workersFlag) {
+ const parsed = parseInt(workersFlag.split('=')[1], 10);
+ if (Number.isNaN(parsed) || parsed < 1) {
+ process.stderr.write(`⚠️ Invalid --workers value. Must be a positive integer. Defaulting to CPU count minus one.\n`);
+ } else {
+ workers = parsed;
+ }
+ }
- // Parse target directory
const sourceArg = args.find(arg => !arg.startsWith('--'));
const sourceDirectory = sourceArg ? path.resolve(sourceArg) : process.cwd();
+ const options = { clearCache, workers };
+
if (isWatchMode) {
process.stdout.write(`👀 Watching directory for changes: ${sourceDirectory}\n`);
-
- // Initialize OS Event Listener
const watcher = chokidar.watch(sourceDirectory, {
- ignored: [/(^|[\/\\])\../, /node_modules/, /data_exports/],
- persistent: true,
- ignoreInitial: false
+ ignored: [/(^|[\/\\])\../, /node_modules/, /[\/\\]data_exports([\/\\]|$)/],
+ persistent: true, ignoreInitial: false
});
- // Debounce logic to prevent CPU spikes on bulk file operations
let timeout;
+ let pipelineQueue = Promise.resolve();
const triggerPipeline = () => {
clearTimeout(timeout);
timeout = setTimeout(() => {
process.stdout.write(`\n🔄 File system event detected. Recalculating analytics...\n`);
- runPipeline(sourceDirectory, format);
- }, 500); // 500ms buffer
+ pipelineQueue = pipelineQueue
+ .then(() => runPipeline(sourceDirectory, format, options))
+ .catch(() => {});
+ }, 500);
};
-
- // Bind events
- watcher
- .on('add', triggerPipeline)
- .on('change', triggerPipeline)
- .on('unlink', triggerPipeline);
+ watcher.on('add', triggerPipeline).on('change', triggerPipeline).on('unlink', triggerPipeline);
} else {
- await runPipeline(sourceDirectory, format);
+ await runPipeline(sourceDirectory, format, options);
}
}
diff --git a/src/ingestion/file-ingestion.js b/src/ingestion/file-ingestion.js
index afa1617..3d1ecb9 100644
--- a/src/ingestion/file-ingestion.js
+++ b/src/ingestion/file-ingestion.js
@@ -3,12 +3,33 @@ const os = require("node:os");
const { promises: fsp } = require("node:fs");
const { Worker } = require("node:worker_threads");
+const CACHE_SCHEMA_VERSION = 1;
+
+function parseCacheEntries(cacheData) {
+ const parsedCache = JSON.parse(cacheData);
+
+ if (
+ parsedCache &&
+ typeof parsedCache === 'object' &&
+ parsedCache.version === CACHE_SCHEMA_VERSION &&
+ parsedCache.entries &&
+ typeof parsedCache.entries === 'object' &&
+ !Array.isArray(parsedCache.entries)
+ ) {
+ return parsedCache.entries;
+ }
+
+ return {};
+}
+
async function* walkFiles(rootDirectory) {
const directoryEntries = await fsp.readdir(rootDirectory, { withFileTypes: true });
for (const entry of directoryEntries) {
const absolutePath = path.join(rootDirectory, entry.name);
- if (entry.isDirectory()) {
+ if (entry.isSymbolicLink()) {
+ continue; // Skip symlinks to prevent traversal outside the source directory
+ } else if (entry.isDirectory()) {
yield* walkFiles(absolutePath);
} else if (entry.isFile()) {
yield absolutePath;
@@ -16,70 +37,98 @@ async function* walkFiles(rootDirectory) {
}
}
-async function ingestDirectory(rootDirectory) {
+async function ingestDirectory(rootDirectory, options = {}) {
const sourceDirectory = path.resolve(rootDirectory);
const files = [];
const pathsToProcess = [];
+ // State Caching (Memoization)
+ const cachePath = path.join(process.cwd(), '.analytics_cache.json');
+ let cache = {};
+ if (!options.clearCache) {
+ try {
+ const cacheData = await fsp.readFile(cachePath, 'utf-8');
+ cache = parseCacheEntries(cacheData);
+ } catch (err) {
+ cache = {};
+ }
+ }
+
+ const visitedPaths = new Set();
for await (const filePath of walkFiles(sourceDirectory)) {
- pathsToProcess.push(filePath);
+ visitedPaths.add(filePath);
+ const stats = await fsp.stat(filePath);
+ const fingerprint = `${stats.size}-${stats.mtimeMs}`; // Size + Modified Time
+
+ if (cache[filePath] && cache[filePath].fingerprint === fingerprint) {
+ files.push(cache[filePath].data); // Short-circuit bypass
+ } else {
+ pathsToProcess.push({ filePath, fingerprint });
+ }
+ }
+
+ // Evict stale cache keys scoped to this sourceDirectory
+ for (const key of Object.keys(cache)) {
+ if (
+ (key === sourceDirectory || key.startsWith(sourceDirectory + path.sep)) &&
+ !visitedPaths.has(key)
+ ) {
+ delete cache[key];
+ }
}
- // FIX 1: Cap workers to the number of files.
- // Prevents spawning 15 massive threads to process 1 tiny test file.
- const maxCores = Math.max(1, os.cpus().length - 1);
+ const maxCores = options.workers || Math.max(1, os.cpus().length - 1);
const numWorkers = Math.min(pathsToProcess.length, maxCores);
- if (numWorkers === 0) {
- return { sourceDirectory, files };
- }
+ if (numWorkers > 0) {
+ process.stdout.write(`\n🚀 Initializing WebAssembly Worker Pool (${numWorkers} threads)...\n`);
- process.stdout.write(`\n🚀 Initializing WebAssembly Worker Pool (${numWorkers} threads)...\n`);
+ let currentIndex = 0;
- let currentIndex = 0;
+ await Promise.all(
+ Array.from({ length: numWorkers }).map(() => {
+ return new Promise((resolve) => {
+ const worker = new Worker(path.join(__dirname, "worker.js"));
- await Promise.all(
- Array.from({ length: numWorkers }).map(() => {
- return new Promise((resolve) => {
- const worker = new Worker(path.join(__dirname, "worker.js"));
+ worker.on("message", (msg) => {
+ if (msg.success && msg.result) {
+ files.push(msg.result);
+ cache[msg.filePath] = { fingerprint: msg.fingerprint, data: msg.result };
+ } else if (!msg.success) {
+ process.stderr.write(`\n⚠️ File failed (${msg.filePath}): ${msg.error}\n`);
+ }
+ assignNextTask();
+ });
- worker.on("message", (msg) => {
- if (msg.success && msg.result) {
- files.push(msg.result);
- } else if (!msg.success) {
- process.stderr.write(`\n⚠️ File failed (${msg.filePath}): ${msg.error}\n`);
+ worker.on("error", (err) => {
+ process.stderr.write(`\n⚠️ Fatal Worker Crash: ${err.message}\n`);
+ worker.terminate().then(resolve);
+ });
+
+ function assignNextTask() {
+ if (currentIndex >= pathsToProcess.length) {
+ worker.terminate().then(resolve);
+ return;
+ }
+ const task = pathsToProcess[currentIndex++];
+ worker.postMessage({ filePath: task.filePath, fingerprint: task.fingerprint, rootDirectory: sourceDirectory });
}
- assignNextTask();
- });
- worker.on("error", (err) => {
- process.stderr.write(`\n⚠️ Fatal Worker Crash: ${err.message}\n`);
- // FIX 2: Await thread termination so it doesn't leave dangling memory leaks
- worker.terminate().then(resolve);
+ assignNextTask();
});
+ })
+ );
+ }
- function assignNextTask() {
- if (currentIndex >= pathsToProcess.length) {
- // FIX 2: Await thread termination to clear the Node.js event loop
- worker.terminate().then(resolve);
- return;
- }
-
- const filePath = pathsToProcess[currentIndex++];
- worker.postMessage({ filePath, rootDirectory: sourceDirectory });
- }
-
- assignNextTask();
- });
- })
+ // Persist the cache atomically: write a temp file, then rename it over .analytics_cache.json
+ const tempCachePath = `${cachePath}.${process.pid}.${Date.now()}.tmp`;
+ await fsp.writeFile(
+ tempCachePath,
+ JSON.stringify({ version: CACHE_SCHEMA_VERSION, entries: cache }, null, 2)
);
+ await fsp.rename(tempCachePath, cachePath);
- return {
- sourceDirectory,
- files,
- };
+ return { sourceDirectory, files };
}
-module.exports = {
- ingestDirectory,
-};
\ No newline at end of file
+module.exports = { ingestDirectory };
\ No newline at end of file
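Review note: the caching logic in `ingestDirectory` combines a size-plus-mtime fingerprint with a temp-file-then-rename write. The following is a minimal sketch of those two ideas in isolation, using synchronous `fs` calls for brevity (the diff itself uses `fs/promises`). The helper names `fingerprintOf` and `writeJsonAtomically`, and all `demo*` values, are hypothetical and do not appear in the change.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Cheap change detector: file size plus modification time, as in the diff.
function fingerprintOf(filePath) {
  const stats = fs.statSync(filePath);
  return `${stats.size}-${stats.mtimeMs}`;
}

// Hypothetical helper: write JSON to a temp file, then rename it into place,
// so a reader never observes a half-written cache file.
function writeJsonAtomically(targetPath, value) {
  const tempPath = `${targetPath}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tempPath, JSON.stringify(value, null, 2));
  fs.renameSync(tempPath, targetPath);
}

// Usage: fingerprint one file and store it in a throwaway cache.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cache-sketch-'));
const demoFile = path.join(demoDir, 'note.txt');
const demoCachePath = path.join(demoDir, '.analytics_cache.json');
fs.writeFileSync(demoFile, 'hello');

const demoFingerprint = fingerprintOf(demoFile);
writeJsonAtomically(demoCachePath, {
  version: 1,
  entries: { [demoFile]: { fingerprint: demoFingerprint, data: {} } },
});
const demoCache = JSON.parse(fs.readFileSync(demoCachePath, 'utf-8'));
```

A size-plus-mtime fingerprint is fast but can miss an edit that preserves both values; a content hash would close that gap at the cost of reading every file.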
diff --git a/src/ingestion/worker.js b/src/ingestion/worker.js
index 88f4c49..d333a2c 100644
--- a/src/ingestion/worker.js
+++ b/src/ingestion/worker.js
@@ -1,124 +1,133 @@
-const { parentPort } = require("node:worker_threads");
-const fs = require("node:fs");
-const path = require("node:path");
-const readline = require("node:readline");
-const { promises: fsp } = require("node:fs");
-const nlp = require("compromise");
+const path = require('node:path');
+const { parentPort } = require('node:worker_threads');
+const fs = require('node:fs');
+const fsp = require('node:fs/promises');
+const readline = require('node:readline');
+const nlp = require('compromise');
-// Protect the background V8 isolate from abrupt asynchronous library crashes
-process.on("unhandledRejection", (reason) => {
- parentPort.postMessage({ success: false, error: reason?.message || String(reason) });
-});
-
-const TEXT_EXTENSIONS = new Set([".txt", ".md", ".json", ".csv", ".log"]);
-const IMAGE_EXTENSIONS = new Set([".png", ".jpg", ".jpeg"]);
-const SUPPORTED_EXTENSIONS = new Set([...TEXT_EXTENSIONS, ...IMAGE_EXTENSIONS, ".pdf"]);
+const TEXT_EXTENSIONS = new Set(['.txt', '.md', '.json', '.csv', '.log']);
+const IMAGE_EXTENSIONS = new Set(['.png', '.jpg', '.jpeg']);
+const SUPPORTED_EXTENSIONS = new Set([...TEXT_EXTENSIONS, ...IMAGE_EXTENSIONS, '.pdf']);
+// Stop words excluded from word-frequency counts
const STOP_WORDS = new Set([
- "the", "of", "to", "and", "in", "a", "for", "on", "that", "is", "it",
- "with", "as", "was", "at", "by", "be", "this", "an", "are", "from",
- "or", "which", "will", "not", "have", "has", "but", "they", "their",
- "we", "you", "i", "he", "she", "my", "his", "her", "its", "our", "your",
- "there", "can", "if", "would", "about", "who", "what", "where", "when", "how"
+ 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',
+ 'how', 'i', 'in', 'is', 'it', 'of', 'on', 'or', 'that', 'the', 'this',
+ 'to', 'was', 'what', 'when', 'where', 'who', 'will', 'with'
]);
-function normalizeWords(text) {
- const rawWords = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
- return rawWords.filter(word => !STOP_WORDS.has(word) && isNaN(word) && word.length > 1);
-}
+parentPort.on('message', async (task) => {
+ try {
+ const extension = path.extname(task.filePath).toLowerCase();
+ if (!SUPPORTED_EXTENSIONS.has(extension)) {
+ parentPort.postMessage({
+ success: true,
+ filePath: task.filePath,
+ fingerprint: task.fingerprint
+ });
+ return;
+ }
-function extractDates(text) {
- const doc = nlp(text);
- return [...new Set(doc.match("#Date").out("array"))];
-}
+ const stats = await fsp.stat(task.filePath);
+
+ const dates = new Set();
+ const locations = new Set();
+ const wordFrequency = Object.create(null); // No prototype, so words like "constructor" cannot collide with inherited keys
+ let totalWords = 0;
-function extractLocations(text) {
- const doc = nlp(text);
- const knownPlaces = doc.match("#Place").out("array");
- const contextualPlaces = doc.match("(in|at|near|location) #ProperNoun").not("(in|at|near|location)").out("array");
- return [...new Set([...knownPlaces, ...contextualPlaces])];
-}
+ const processTextChunk = (text) => {
+ if (!text) return;
-async function processTextData(text, words, dates, locations) {
- if (!text) return;
- words.push(...normalizeWords(text));
- extractDates(text).forEach(date => dates.add(date));
- extractLocations(text).forEach(loc => locations.add(loc));
-}
+ const rawWords = text
+ .replace(/[^\w\s]/g, '')
+ .toLowerCase()
+ .split(/\s+/)
+ .filter(word => word.length > 1 && !STOP_WORDS.has(word) && !/^\d+$/.test(word));
-async function readFileData(filePath, rootDirectory) {
- const extension = path.extname(filePath).toLowerCase();
- if (!SUPPORTED_EXTENSIONS.has(extension)) return null;
+ // Build the frequency map inside the worker so that only aggregated counts, not every word, are copied to the main thread by postMessage
+ for (const word of rawWords) {
+ wordFrequency[word] = (wordFrequency[word] || 0) + 1;
+ }
+ totalWords += rawWords.length;
- const stats = await fsp.stat(filePath);
- const words = [];
- const dates = new Set();
- const locations = new Set();
- let metadata = {};
+ const doc = nlp(text);
+ for (const value of doc.match('#Date').out('array')) {
+ dates.add(value);
+ }
+ for (const value of doc.match('#Place').out('array')) {
+ locations.add(value);
+ }
- if (TEXT_EXTENSIONS.has(extension)) {
- const stream = fs.createReadStream(filePath, { encoding: "utf8" });
- const lineReader = readline.createInterface({ input: stream, crlfDelay: Infinity });
- for await (const line of lineReader) await processTextData(line, words, dates, locations);
- stream.destroy();
- } else if (extension === ".pdf") {
- const dataBuffer = await fsp.readFile(filePath);
- let extractedText = "";
+ for (const match of text.matchAll(/Date:\s*([0-9]{4}-[0-9]{2}-[0-9]{2})/gi)) {
+ dates.add(match[1]);
+ }
+ for (const match of text.matchAll(/Location:\s*([A-Za-z][A-Za-z\s'-]*)/gi)) {
+ locations.add(match[1].trim());
+ }
+ };
- try {
- const pdfParse = require("pdf-parse");
- const parseFn = typeof pdfParse === "function" ? pdfParse : pdfParse.default;
- const pdfData = await parseFn(dataBuffer);
- extractedText = pdfData.text || "";
- metadata = pdfData.info || {};
- } catch (err) { /* OCR Fallback fallback loop logic flags */ }
+ const processTextFile = async () => {
+ const fileStream = fs.createReadStream(task.filePath, { encoding: 'utf-8' });
+ const lines = readline.createInterface({ input: fileStream, crlfDelay: Infinity });
- if (extractedText.trim().length < 50) {
- const tail = dataBuffer.toString("utf8", Math.max(0, dataBuffer.length - 1024));
-
- if (tail.includes("%%EOF") || tail.includes("startxref")) {
- try {
- const mupdf = await import("mupdf");
- const tesseract = require("tesseract.js");
-
- const doc = mupdf.Document.openDocument(dataBuffer, "application/pdf");
- let ocrText = "";
- for (let i = 0; i < doc.countPages(); i++) {
- const page = doc.loadPage(i);
- const pixmap = page.toPixmap(mupdf.Matrix.scale(2, 2), mupdf.ColorSpace.DeviceRGB, false);
- const { data: { text } } = await tesseract.recognize(Buffer.from(pixmap.asPNG()), "eng", { logger: () => {} });
- ocrText += text + " ";
- }
- if (ocrText.trim().length > 0) extractedText = ocrText;
- } catch (ocrError) { /* Fail safely over to parsed text metadata arrays */ }
+ try {
+ for await (const line of lines) {
+ processTextChunk(line);
+ }
+ } finally {
+ lines.close();
+ fileStream.destroy();
}
- }
- await processTextData(extractedText, words, dates, locations);
- } else if (IMAGE_EXTENSIONS.has(extension)) {
- const tesseract = require("tesseract.js");
- const { data: { text } } = await tesseract.recognize(filePath, "eng", { logger: () => {} });
- await processTextData(text, words, dates, locations);
- }
+ };
- return {
- path: filePath,
- relativePath: path.relative(rootDirectory, filePath),
- extension,
- size: stats.size,
- createdAt: stats.birthtime.toISOString(),
- modifiedAt: stats.mtime.toISOString(),
- words,
- dates: [...dates],
- locations: [...locations],
- metadata,
- };
-}
+ const processPdfFile = async () => {
+ try {
+ const pdfParse = require('pdf-parse');
+ const parseFn = typeof pdfParse === 'function' ? pdfParse : pdfParse.default;
+ const dataBuffer = await fsp.readFile(task.filePath);
+ const pdfResult = await parseFn(dataBuffer);
+ processTextChunk(pdfResult?.text || '');
+ } catch (error) {
+ process.stderr.write(`\n⚠️ PDF extraction skipped (${task.filePath}): ${error.message}\n`);
+ }
+ };
-parentPort.on("message", async ({ filePath, rootDirectory }) => {
- try {
- const result = await readFileData(filePath, rootDirectory);
- parentPort.postMessage({ success: true, result });
+ const processImageFile = async () => {
+ try {
+ const tesseract = require('tesseract.js');
+ const result = await tesseract.recognize(task.filePath, 'eng', { logger: () => {} });
+ processTextChunk(result?.data?.text || '');
+ } catch (error) {
+ process.stderr.write(`\n⚠️ Image OCR skipped (${task.filePath}): ${error.message}\n`);
+ }
+ };
+
+ if (TEXT_EXTENSIONS.has(extension)) {
+ await processTextFile();
+ } else if (extension === '.pdf') {
+ await processPdfFile();
+ } else if (IMAGE_EXTENSIONS.has(extension)) {
+ await processImageFile();
+ }
+
+ parentPort.postMessage({
+ success: true,
+ filePath: task.filePath,
+ fingerprint: task.fingerprint,
+ result: {
+ fileName: task.filePath.split(/[/\\]/).pop(),
+ relativePath: task.rootDirectory ? path.relative(task.rootDirectory, task.filePath) : task.filePath,
+ extension,
+ size: stats.size,
+ modifiedAt: stats.mtime.toISOString(),
+ wordFrequency,
+ totalWords,
+ uniqueWords: Object.keys(wordFrequency),
+ dates: [...dates],
+ locations: [...locations]
+ }
+ });
} catch (error) {
- parentPort.postMessage({ success: false, error: error.message, filePath });
+ parentPort.postMessage({ success: false, filePath: task.filePath, error: error.message });
}
});
\ No newline at end of file
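Review note: the tokenising and counting step inside `processTextChunk` can be exercised on its own. The following is a minimal sketch of that step as a pure function, assuming the same regexes as the diff. The name `countWords`, the shortened stop-word list, and the use of a prototype-free object are assumptions made for this illustration.

```javascript
// Shortened stop-word list for the sketch only.
const SKETCH_STOP_WORDS = new Set(['a', 'an', 'and', 'in', 'of', 'on', 'the', 'to']);

// Hypothetical helper: strip punctuation, lowercase, split on whitespace,
// then drop one-character tokens, stop words, and purely numeric tokens.
function countWords(text, stopWords = SKETCH_STOP_WORDS) {
  const words = text
    .replace(/[^\w\s]/g, '')
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 1 && !stopWords.has(word) && !/^\d+$/.test(word));

  // Prototype-free object: a word such as "constructor" is counted like any
  // other instead of colliding with an inherited property.
  const wordFrequency = Object.create(null);
  for (const word of words) {
    wordFrequency[word] = (wordFrequency[word] || 0) + 1;
  }
  return { wordFrequency, totalWords: words.length };
}

const sample = countWords('The Roswell sighting, the 1947 sighting!');
// sample.wordFrequency has roswell: 1 and sighting: 2; sample.totalWords is 3.
```

Returning counts rather than the raw word list keeps the message posted back to the main thread proportional to the vocabulary size, not the document length.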
diff --git a/src/pipeline.js b/src/pipeline.js
index 36b06db..837481c 100644
--- a/src/pipeline.js
+++ b/src/pipeline.js
@@ -4,8 +4,8 @@ const { buildDiagnosticAnalytics } = require('./analytics/diagnostic');
const { buildPredictiveAnalytics } = require('./analytics/predictive');
const { buildPrescriptiveAnalytics } = require('./analytics/prescriptive');
-async function generateAnalyticsReport(sourceDirectory) {
- const ingestionResult = await ingestDirectory(sourceDirectory);
+async function generateAnalyticsReport(sourceDirectory, options = {}) {
+ const ingestionResult = await ingestDirectory(sourceDirectory, options);
const descriptive = buildDescriptiveAnalytics(ingestionResult.files);
return {
@@ -17,6 +17,4 @@ async function generateAnalyticsReport(sourceDirectory) {
};
}
-module.exports = {
- generateAnalyticsReport
-};
+module.exports = { generateAnalyticsReport };
\ No newline at end of file
diff --git a/test/ingestion-regressions.test.js b/test/ingestion-regressions.test.js
new file mode 100644
index 0000000..3496e18
--- /dev/null
+++ b/test/ingestion-regressions.test.js
@@ -0,0 +1,73 @@
+const test = require('node:test');
+const assert = require('node:assert/strict');
+const fs = require('node:fs/promises');
+const os = require('node:os');
+const path = require('node:path');
+
+const { ingestDirectory } = require('../src/ingestion/file-ingestion');
+const { generateAnalyticsReport } = require('../src/pipeline');
+
+test('watch mode ignores data_exports directory and descendants', async () => {
+ const indexSource = await fs.readFile(path.join(__dirname, '..', 'src', 'index.js'), 'utf-8');
+
+ assert.ok(indexSource.includes('/[\\/\\\\]data_exports([\\/\\\\]|$)/'));
+ assert.ok(!indexSource.includes('/data_exports[\\/\\\\]?$/'));
+});
+
+test('cache eviction does not remove sibling directory entries', async () => {
+ const cwdBefore = process.cwd();
+ const workspace = await fs.mkdtemp(path.join(os.tmpdir(), 'uap-cache-'));
+ const sourceDirectory = path.join(workspace, 'UAP_Data');
+ const siblingDirectory = path.join(workspace, 'UAP_Data_Archive');
+ const liveFile = path.join(sourceDirectory, 'live.txt');
+ const staleSourceFile = path.join(sourceDirectory, 'stale.txt');
+ const staleSiblingFile = path.join(siblingDirectory, 'stale.txt');
+
+ try {
+ await fs.mkdir(sourceDirectory, { recursive: true });
+ await fs.mkdir(siblingDirectory, { recursive: true });
+ await fs.writeFile(liveFile, 'Roswell event on 2024-01-01');
+
+ await fs.writeFile(
+ path.join(workspace, '.analytics_cache.json'),
+ JSON.stringify(
+ {
+ version: 1,
+ entries: {
+ [staleSourceFile]: { fingerprint: 'old', data: { fileName: 'stale.txt' } },
+ [staleSiblingFile]: { fingerprint: 'old', data: { fileName: 'stale.txt' } },
+ },
+ },
+ null,
+ 2
+ )
+ );
+
+ process.chdir(workspace);
+ await ingestDirectory(sourceDirectory, { workers: 1 });
+
+ const cache = JSON.parse(await fs.readFile(path.join(workspace, '.analytics_cache.json'), 'utf-8'));
+ assert.equal(cache.entries[staleSourceFile], undefined);
+ assert.ok(cache.entries[staleSiblingFile]);
+ } finally {
+ process.chdir(cwdBefore);
+ await fs.rm(workspace, { recursive: true, force: true });
+ }
+});
+
+test('worker NLP extraction captures natural-language dates and places', async () => {
+ const fixtureRoot = await fs.mkdtemp(path.join(os.tmpdir(), 'uap-nlp-'));
+
+ try {
+ await fs.writeFile(
+ path.join(fixtureRoot, 'observation.txt'),
+ 'Witnesses reported unusual movement on 2024-03-05 near Phoenix in Arizona.'
+ );
+
+ const report = await generateAnalyticsReport(fixtureRoot, { workers: 1, clearCache: true });
+ assert.ok(report.descriptive.dates.length > 0);
+ assert.ok(report.descriptive.locations.includes('Phoenix'));
+ } finally {
+ await fs.rm(fixtureRoot, { recursive: true, force: true });
+ }
+});
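Review note: the sibling-directory regression test relies on the eviction check comparing against `sourceDirectory + path.sep` rather than the bare directory string. The following is a minimal sketch of why the separator matters. The helpers `isInsideDirectory` and `evictStaleEntries` and the example paths are hypothetical, and the separator is passed explicitly so the example behaves the same on any platform.

```javascript
const path = require('node:path');

// True only for the directory itself or paths beneath it. Appending the
// separator stops "/data/UAP_Data_Archive" from matching "/data/UAP_Data".
function isInsideDirectory(candidate, directory, sep = path.sep) {
  return candidate === directory || candidate.startsWith(directory + sep);
}

// Remove cache keys under `directory` that were not seen during the walk.
function evictStaleEntries(cache, directory, visitedPaths, sep = path.sep) {
  for (const key of Object.keys(cache)) {
    if (isInsideDirectory(key, directory, sep) && !visitedPaths.has(key)) {
      delete cache[key];
    }
  }
  return cache;
}

const demoEntries = evictStaleEntries(
  {
    '/data/UAP_Data/live.txt': {},
    '/data/UAP_Data/stale.txt': {},
    '/data/UAP_Data_Archive/stale.txt': {},
  },
  '/data/UAP_Data',
  new Set(['/data/UAP_Data/live.txt']),
  '/'
);
// The stale entry under UAP_Data is removed; the UAP_Data_Archive entry survives.
```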
diff --git a/test/pipeline.test.js b/test/pipeline.test.js
index fb5e54f..6c06805 100644
--- a/test/pipeline.test.js
+++ b/test/pipeline.test.js
@@ -5,6 +5,8 @@ const os = require('node:os');
const path = require('node:path');
const { generateAnalyticsReport } = require('../src/pipeline');
+const { buildDiagnosticAnalytics } = require('../src/analytics/diagnostic');
+const { generateCsvReport } = require('../src/delivery/csv-generator');
async function createFixtureDirectory() {
const fixtureRoot = await fs.mkdtemp(path.join(os.tmpdir(), 'uap-analytics-'));
@@ -65,24 +67,107 @@ test('generateAnalyticsReport flags files with missing metadata for prescriptive
}
});
-test('generateAnalyticsReport builds all analytics tiers from text files', async () => {
- const fixtureRoot = await createFixtureDirectory();
+test('generateAnalyticsReport passes ingestion options through to the pipeline', async () => {
+ const ingestionModulePath = require.resolve('../src/ingestion/file-ingestion');
+ const pipelineModulePath = require.resolve('../src/pipeline');
+ const originalIngestionModule = require.cache[ingestionModulePath];
+ const originalPipelineModule = require.cache[pipelineModulePath];
+ let receivedOptions;
+
+ delete require.cache[pipelineModulePath];
+ require.cache[ingestionModulePath] = {
+ id: ingestionModulePath,
+ filename: ingestionModulePath,
+ loaded: true,
+ exports: {
+ ingestDirectory: async (_sourceDirectory, options) => {
+ receivedOptions = options;
+ return {
+ sourceDirectory: '/tmp/mock-source',
+ files: [
+ {
+ fileName: 'fixture.txt',
+ locations: ['Roswell'],
+ dates: ['2024-01-01'],
+ wordFrequency: { sighting: 1 },
+ totalWords: 1,
+ uniqueWords: ['sighting']
+ }
+ ],
+ };
+ }
+ }
+ };
try {
- const report = await generateAnalyticsReport(fixtureRoot);
+ const { generateAnalyticsReport: generateMockedAnalyticsReport } = require('../src/pipeline');
+ const report = await generateMockedAnalyticsReport('/tmp/mock-source', { workers: 4, clearCache: true });
- assert.equal(report.descriptive.fileCount, 2);
- assert.deepEqual(report.descriptive.locations, ['Phoenix', 'Roswell']);
-
- // Use an OR condition to support both object paths during transition
- const dates = report.descriptive.dates || report.dates;
- assert.deepEqual(dates, ['2024-01-01', '2024-02-14']);
-
- assert.ok(report.descriptive.wordFrequency.location >= 2);
- assert.ok(report.diagnostic.wordUsageByLocation.Roswell.length > 0);
- assert.equal(report.predictive.locationClusterForecast.likelyNextHotspot, 'Phoenix');
- assert.equal(report.prescriptive.recommendations[0].type, 'folder-restructure');
+ assert.equal(report.sourceDirectory, '/tmp/mock-source');
+ assert.equal(report.descriptive.fileCount, 1);
+ assert.deepEqual(receivedOptions, { workers: 4, clearCache: true });
} finally {
- await fs.rm(fixtureRoot, { recursive: true, force: true });
+ if (originalIngestionModule) {
+ require.cache[ingestionModulePath] = originalIngestionModule;
+ } else {
+ delete require.cache[ingestionModulePath];
+ }
+
+ if (originalPipelineModule) {
+ require.cache[pipelineModulePath] = originalPipelineModule;
+ } else {
+ delete require.cache[pipelineModulePath];
+ }
+ }
+});
+
+test('buildDiagnosticAnalytics falls back to relative paths when file names are missing', () => {
+ const diagnostic = buildDiagnosticAnalytics([
+ {
+ relativePath: 'reports/alpha.txt',
+ wordFrequency: { signal: 2, light: 1 },
+ totalWords: 3,
+ uniqueWords: ['signal', 'light']
+ },
+ {
+ relativePath: 'reports/beta.txt',
+ wordFrequency: { signal: 2, glow: 1 },
+ totalWords: 3,
+ uniqueWords: ['signal', 'glow']
+ }
+ ]);
+
+ assert.equal(diagnostic.semanticAnalysis[0].fileName, 'reports/alpha.txt');
+ assert.equal(diagnostic.semanticAnalysis[0].relatedDocuments[0].match, 'reports/beta.txt');
+});
+
+test('generateCsvReport escapes spreadsheet-sensitive values', async () => {
+ const exportsDir = await fs.mkdtemp(path.join(os.tmpdir(), 'uap-analytics-csv-'));
+
+ try {
+ const csvPath = await generateCsvReport(
+ {
+ descriptive: {
+ fileCount: 1,
+ locations: ['=cmd|" /C calc"!A0', 'Phoenix, AZ']
+ },
+ predictive: {
+ locationClusterForecast: {
+ likelyNextHotspot: '@hidden'
+ },
+ keywordFrequencyForecast: {
+ forecastMonth: '2026-06',
+ forecastWordCount: 3
+ }
+ }
+ },
+ exportsDir
+ );
+
+ const csvContent = await fs.readFile(csvPath, 'utf-8');
+ assert.match(csvContent, /"'=cmd\|"" \/C calc""!A0, Phoenix, AZ"/);
+ assert.match(csvContent, /"'\@hidden"/);
+ } finally {
+ await fs.rm(exportsDir, { recursive: true, force: true });
}
});
\ No newline at end of file
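Review note: `src/delivery/csv-generator` is not part of this section, so its implementation is unknown here. The following is a minimal sketch of one escaping rule that would satisfy the assertions in the `generateCsvReport` test above. The function `escapeCsvCell` and the exact set of trigger characters are assumptions, not the project's code.

```javascript
// Hypothetical helper: neutralise values a spreadsheet might treat as a
// formula, then quote the cell using standard CSV double-quote escaping.
function escapeCsvCell(value) {
  let text = String(value ?? '');
  // A leading =, +, -, @, tab, or carriage return can trigger formula
  // evaluation in some spreadsheet applications; prefix an apostrophe.
  if (/^[=+\-@\t\r]/.test(text)) {
    text = `'${text}`;
  }
  // Double any embedded quotes and wrap the whole cell in quotes.
  return `"${text.replace(/"/g, '""')}"`;
}

const escapedFormula = escapeCsvCell('=cmd|" /C calc"!A0, Phoenix, AZ');
const escapedMention = escapeCsvCell('@hidden');
const escapedPlain = escapeCsvCell('Phoenix, AZ');
```

Quoting every cell, including plain ones, keeps commas inside values such as `Phoenix, AZ` from being read as column separators.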