v1.2.0 - Core Engine Optimization & Semantic Analytics | (Sprints 1-3) + Review Feedback/Regression Fixes #11
…tion
- Added TF-IDF analysis to diagnostic analytics for keyword extraction.
- Implemented CSV report generation in the delivery module.
- Improved file ingestion with caching and fingerprinting for efficiency.
- Enhanced predictive analytics with weighted moving average forecasting.
- Updated prescriptive analytics to handle missing metadata more gracefully.
- Introduced GitHub Actions CI pipeline for automated testing across multiple Node.js versions.
- Added advanced CLI flags (--workers, --clear-cache, --format=csv) to the root README.md usage scope.
- Updated docs/architecture.md to detail v1.2.0 pipeline enhancements, including multithreaded worker pool mechanics and semantic vector cross-linking via TF-IDF / Cosine Similarity.
- Verified all documentation structures and ran local test runner pipelines cleanly.
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
This PR expands the Node CLI pipeline with multithreaded ingestion + memoization caching, adds semantic/forecasting enhancements to the analytics tiers, and introduces a CSV delivery format alongside CI automation.
Changes:
- Added ingestion options plumbing (worker count, cache clearing) from CLI → pipeline → ingestion, plus .analytics_cache.json memoization.
- Implemented semantic diagnostics (TF‑IDF + cosine similarity) and updated predictive/prescriptive logic to align with new ingestion output shape.
- Added CSV report generation, updated docs, and introduced a GitHub Actions workflow to run tests + docs checks across Node 18/20/22.
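The fingerprint-based memoization in the first bullet can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `fingerprintFromStats` and `partitionByCache` are hypothetical names, and the size-plus-mtime fingerprint format is an assumption. The `{ fingerprint, data }` entry shape follows the cache writes quoted later in this review.

```javascript
// Hypothetical fingerprint: file size plus modification time (assumed format).
// Either value changes when the file is rewritten, which invalidates the entry.
function fingerprintFromStats(stats) {
  return `${stats.size}-${Math.floor(stats.mtimeMs)}`;
}

// Split candidate files into cache hits and tasks that still need a worker.
// `entries` is an array of { filePath, stats }; `cache` maps filePath -> { fingerprint, data }.
function partitionByCache(entries, cache) {
  const files = [];
  const pathsToProcess = [];
  for (const { filePath, stats } of entries) {
    const fingerprint = fingerprintFromStats(stats);
    // Own-property check so keys such as "constructor" never hit the prototype.
    const cached = Object.hasOwn(cache, filePath) ? cache[filePath] : undefined;
    if (cached && cached.fingerprint === fingerprint) {
      files.push(cached.data); // unchanged file: reuse the memoized result
    } else {
      pathsToProcess.push({ filePath, fingerprint }); // new or modified file
    }
  }
  return { files, pathsToProcess };
}
```

Only the entries in `pathsToProcess` would be handed to the worker pool; everything in `files` skips parsing entirely.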
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| test/pipeline.test.js | Adjusts/extends pipeline tests; currently includes a duplicated analytics-tier test block. |
| src/pipeline.js | Adds options passthrough to ingestion. |
| src/ingestion/worker.js | Reworks worker processing to text parsing + stop-word culling; now returns additional fields per file. |
| src/ingestion/file-ingestion.js | Adds symlink skipping, fingerprint-based cache reads/writes, and passes tasks (with fingerprints) to workers. |
| src/index.js | Adds CLI flags (--workers, --clear-cache, --format=csv) and routes to CSV delivery. |
| src/delivery/csv-generator.js | Introduces CSV export surface. |
| src/analytics/prescriptive.js | Adapts missing-metadata detection/mapping for updated ingestion output. |
| src/analytics/predictive.js | Switches forecasting to weighted moving average + fills missing month intervals. |
| src/analytics/diagnostic.js | Adds TF‑IDF + cosine similarity semantic analysis output. |
| README.md | Adds installation/usage guidance and documents new CLI flags (with some WASM-specific wording). |
| docs/architecture.md | Notes v1.2.0 architecture additions (worker pool, caching, semantic analytics). |
| .gitignore | Ignores .analytics_cache.json. |
| .github/workflows/test.yml | Adds CI job for tests + docs checks across Node versions. |
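For the src/analytics/predictive.js row, a weighted moving average over a gap-filled monthly series could look like the sketch below. The function names, the linear weights, the window size of 3, and the `YYYY-MM` key format are all assumptions for illustration; the PR's actual weighting is not shown here.

```javascript
// Fill gaps so every month between the first and last key has a value (0 when absent).
// Assumes keys are zero-padded 'YYYY-MM' strings, which sort chronologically.
function fillMissingMonths(countsByMonth) {
  const keys = Object.keys(countsByMonth).sort();
  if (keys.length === 0) return [];
  const [firstYear, firstMonth] = keys[0].split('-').map(Number);
  const [lastYear, lastMonth] = keys[keys.length - 1].split('-').map(Number);
  const span = (lastYear - firstYear) * 12 + (lastMonth - firstMonth);
  const series = [];
  for (let offset = 0; offset <= span; offset += 1) {
    const index = firstMonth - 1 + offset;
    const year = firstYear + Math.floor(index / 12);
    const month = (index % 12) + 1;
    const key = `${year}-${String(month).padStart(2, '0')}`;
    series.push({ month: key, value: countsByMonth[key] ?? 0 });
  }
  return series;
}

// Linearly weighted moving average: the newest point gets the largest weight.
function weightedMovingAverage(values, windowSize = 3) {
  const recent = values.slice(-windowSize);
  if (recent.length === 0) return 0;
  let weightedSum = 0;
  let weightTotal = 0;
  recent.forEach((value, index) => {
    const weight = index + 1;
    weightedSum += value * weight;
    weightTotal += weight;
  });
  return weightedSum / weightTotal;
}
```

Filling the gaps first matters: without it, a quiet month would silently drop out of the window and the forecast would treat two distant months as adjacent.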
Comments suppressed due to low confidence (1)
test/pipeline.test.js:83
- This test duplicates the earlier “builds all analytics tiers” test but doesn’t assert any additional behavior. Since generateAnalyticsReport now accepts ingestion options, it’d be more valuable to repurpose this test to cover the new workers/clearCache option plumbing instead of repeating the same assertions.
test('generateAnalyticsReport builds all analytics tiers from text files (descriptive dates path)', async () => {
const fixtureRoot = await createFixtureDirectory();
try {
const report = await generateAnalyticsReport(fixtureRoot);
assert.equal(report.descriptive.fileCount, 2);
assert.deepEqual(report.descriptive.locations, ['Phoenix', 'Roswell']);
assert.deepEqual(report.descriptive.dates, ['2024-01-01', '2024-02-14']);
assert.ok(report.descriptive.wordFrequency.location >= 2);
assert.ok(report.diagnostic.wordUsageByLocation.Roswell.length > 0);
assert.equal(report.predictive.locationClusterForecast.likelyNextHotspot, 'Phoenix');
assert.equal(report.prescriptive.recommendations[0].type, 'folder-restructure');
} finally {
parentPort.on('message', async (task) => {
  try {
    const content = await fs.readFile(task.filePath, 'utf-8');
    const stats = await fs.stat(task.filePath);

// State Caching (Memoization)
const cachePath = path.join(process.cwd(), '.analytics_cache.json');
let cache = {};
if (!options.clearCache) {
  try {
    const cacheData = await fsp.readFile(cachePath, 'utf-8');
    cache = JSON.parse(cacheData);
  } catch (err) {
    cache = {};
  }
}

// Save newly parsed data back to .analytics_cache.json
await fsp.writeFile(cachePath, JSON.stringify(cache, null, 2));

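A direct `writeFile` onto the cache path can leave a truncated JSON file behind if the process is interrupted mid-write. One common mitigation is to write to a temporary file and rename it over the target. The sketch below assumes a hypothetical helper name (`writeCacheAtomically`) and temp-file naming scheme; it is not taken from the PR.

```javascript
const fsp = require('node:fs/promises');
const path = require('node:path');

// Write the cache to a sibling temp file, then rename it over the target.
// A rename within one directory generally replaces the file in a single step,
// so readers see either the old cache or the new one, not a half-written file.
async function writeCacheAtomically(cachePath, cache) {
  const tempPath = path.join(
    path.dirname(cachePath),
    `.${path.basename(cachePath)}.${process.pid}.tmp` // hypothetical naming scheme
  );
  try {
    await fsp.writeFile(tempPath, JSON.stringify(cache, null, 2), 'utf-8');
    await fsp.rename(tempPath, cachePath);
  } catch (err) {
    // Best-effort cleanup so failed writes do not leave temp files around.
    await fsp.rm(tempPath, { force: true });
    throw err;
  }
}
```

Keeping the temp file in the same directory as the target avoids cross-device renames, which would fail rather than replace the file.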
 let timeout;
 const triggerPipeline = () => {
   clearTimeout(timeout);
   timeout = setTimeout(() => {
     process.stdout.write(`\n🔄 File system event detected. Recalculating analytics...\n`);
-    runPipeline(sourceDirectory, format);
-  }, 500); // 500ms buffer
+    runPipeline(sourceDirectory, format, options);
+  }, 500);
 };

 // Bind events
-watcher
-  .on('add', triggerPipeline)
-  .on('change', triggerPipeline)
-  .on('unlink', triggerPipeline);
+watcher.on('add', triggerPipeline).on('change', triggerPipeline).on('unlink', triggerPipeline);
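The debounce above collapses bursts of file events, but a slow pipeline run could still overlap with the next one. A later revision of this snippet introduces a `pipelineQueue` promise; how it is used is not shown, so the following is only a sketch of one way to combine debouncing with a serialized queue. `createDebouncedTrigger` and its `run` parameter are hypothetical.

```javascript
// Build a trigger that waits for a quiet period, then appends one run to a promise chain.
function createDebouncedTrigger(run, delayMs = 500) {
  let timeout;
  let pipelineQueue = Promise.resolve();
  return () => {
    clearTimeout(timeout); // restart the quiet-period timer on every event
    timeout = setTimeout(() => {
      // Chain onto the previous run so two pipeline runs never overlap;
      // the catch keeps one failed run from blocking later ones.
      pipelineQueue = pipelineQueue
        .then(() => run())
        .catch((err) => console.error('Pipeline run failed:', err));
    }, delayMs);
  };
}
```

In watch mode this could be wired up as `const triggerPipeline = createDebouncedTrigger(() => runPipeline(sourceDirectory, format, options));`, assuming `runPipeline` returns a promise.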
return vectorizedFiles.map(fileA => {
  const related = [];
  vectorizedFiles.forEach(fileB => {
    if (fileA.fileName !== fileB.fileName) {
      const score = calculateCosineSimilarity(fileA.vector, fileB.vector);
      if (score > 0.05) { // Threshold for correlation relevancy
        related.push({ match: fileB.fileName, correlationScore: Number(score.toFixed(4)) });
      }
    }
  });
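The body of `calculateCosineSimilarity` is not part of this snippet. As a stand-in for illustration, assuming each vector is a plain object mapping a term to its TF-IDF weight (an assumption about the data shape), cosine similarity could be computed like this:

```javascript
// Stand-in implementation; assumes sparse vectors of the form { term: weight }.
function calculateCosineSimilarity(vectorA, vectorB) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, weight] of Object.entries(vectorA)) {
    normA += weight * weight;
    // Only terms present in both documents contribute to the dot product.
    if (Object.hasOwn(vectorB, term)) dot += weight * vectorB[term];
  }
  for (const weight of Object.values(vectorB)) normB += weight * weight;
  // An empty document has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

With non-negative weights the score falls between 0 (no shared terms) and 1 (same direction), which is the range the `0.05` threshold above is applied to.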
async function generateCsvReport(report, exportsDir) {
  await fs.mkdir(exportsDir, { recursive: true });
  const csvPath = path.join(exportsDir, `report-${Date.now()}.csv`);

  let csvContent = "Category,Metric,Value\n";
  csvContent += `Descriptive,FileCount,${report.descriptive.fileCount}\n`;

  const locations = report.descriptive.locations || report.locations || [];
  csvContent += `Descriptive,UniqueLocations,"${locations.join(', ')}"\n`;

  if (report.predictive?.locationClusterForecast) {
    csvContent += `Predictive,LikelyNextHotspot,${report.predictive.locationClusterForecast.likelyNextHotspot}\n`;
  }
  if (report.predictive?.keywordFrequencyForecast) {
    csvContent += `Predictive,ForecastMonth,${report.predictive.keywordFrequencyForecast.forecastMonth}\n`;
    csvContent += `Predictive,ForecastWordCount,${report.predictive.keywordFrequencyForecast.forecastWordCount}\n`;
  }

  await fs.writeFile(csvPath, csvContent, 'utf-8');
  return csvPath;
}
| 2. **Install dependencies:** | ||
| Because this engine utilizes pre-compiled WebAssembly, there are no complex C++ build tools or `node-gyp` configurations required on Windows. Simply run: | ||
| ```bash |
* `node src/index.js ./my_folder --workers=4` : Manually set the number of WebAssembly worker threads (defaults to max CPU cores).
* `node src/index.js ./my_folder --clear-cache` : Bypasses the `.analytics_cache.json` file and forces a fresh read of all documents.
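How src/index.js parses these flags is not shown in the diff. A minimal sketch of parsing `--workers`, `--clear-cache`, and `--format` into an options object might look like this; `parseCliArgs` and the option names other than `workers` and `clearCache` are hypothetical.

```javascript
// Parse the arguments that follow `node src/index.js` (i.e. process.argv.slice(2)).
function parseCliArgs(argv) {
  const options = {
    sourceDirectory: undefined,
    format: undefined,
    workers: undefined, // left undefined so ingestion can apply its own default
    clearCache: false,
  };
  for (const arg of argv) {
    if (arg === '--clear-cache') {
      options.clearCache = true;
    } else if (arg.startsWith('--workers=')) {
      const parsed = Number.parseInt(arg.slice('--workers='.length), 10);
      // Ignore non-numeric or non-positive values instead of spawning zero workers.
      if (Number.isInteger(parsed) && parsed > 0) options.workers = parsed;
    } else if (arg.startsWith('--format=')) {
      options.format = arg.slice('--format='.length);
    } else if (!arg.startsWith('--') && options.sourceDirectory === undefined) {
      options.sourceDirectory = arg; // first positional argument is the folder
    }
  }
  return options;
}
```

For example, `parseCliArgs(['./my_folder', '--workers=4', '--clear-cache'])` yields `workers: 4` and `clearCache: true`, while an invalid `--workers=abc` leaves `workers` undefined.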
…mory
- Fixes domain logic in predictive analytics by switching the timeline basis from the OS file modification time to actual parsed document dates. This prevents modern download timestamps from invalidating historical UAP forecasting.
- Resolves worker IPC performance bottlenecks by calculating word frequencies directly inside the worker thread pool rather than passing massive raw string arrays across the boundary.
- Mitigates main-thread blocking in diagnostic analytics by capping the O(N²) TF-IDF cosine similarity matrix calculations to a maximum of 500 files.
- Adds backwards-compatibility layers in descriptive and diagnostic modules to gracefully handle legacy cache formats without crashing.
- Refines watch mode path exclusions in the index file to use a strictly scoped regex for the data exports directory.
…ticsBot into Sprint1through3
Based on a comprehensive review of the latest commits on this branch, the codebase shows great progress, particularly with the multi-threaded worker pool management, atomic file-caching setup using temporary files, and structured analytical pipelines. However, before merging this pull request into

🚨 1. Critical Blocker Bug
Chronological Timeline Corrupted by Filesystem Metadata
if(!file.modifiedAt) continue;
const key = monthKey(file.modifiedAt);
const documentDate = (file.dates && file.dates.length > 0) ? file.dates[0] : file.modifiedAt;
if (!documentDate) continue;
const key = monthKey(documentDate);

🛡️ 2. Edge-Cases & Stability Bugs
Prototype Pollution via Arbitrary Document Content
// In descriptive.js:
return items.reduce((counts, item) => {
counts[item] = (counts[item] ?? 0) + 1;
return counts;
}, Object.create(null));

Cache-Bypass Leaks for Unsupported File Extensions
// In worker.js:
parentPort.postMessage({ success: true, filePath: task.filePath, fingerprint: task.fingerprint, result: { skipped: true } });
// In file-ingestion.js filter out skipped items before processing tiers:
if (msg.success && msg.result) {
cache[msg.filePath] = { fingerprint: msg.fingerprint, data: msg.result };
if (!msg.result.skipped) files.push(msg.result);
}

Directory Watcher Over-Ignoring Valid Paths
ignored: [/(^|[\/\\])\../, /node_modules/, /data_exports([\/\\]|$)/]

🚀 3. Performance & Scaling Optimizations
Main-Thread Blocking on
Pull request overview
Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
test/pipeline.test.js:72
- This test duplicates the earlier "generateAnalyticsReport builds all analytics tiers from text files" test (same fixture + assertions). Keeping both adds maintenance overhead without increasing coverage; consider deleting this duplicate or merging into the existing test.
test('generateAnalyticsReport builds all analytics tiers from text files (descriptive dates path)', async () => {
const fixtureRoot = await createFixtureDirectory();
try {
const report = await generateAnalyticsReport(fixtureRoot);
let pipelineQueue = Promise.resolve();
const triggerPipeline = () => {
  clearTimeout(timeout);
  timeout = setTimeout(() => {
    process.stdout.write(`\n🔄 File system event detected. Recalculating analytics...\n`);
const fileA = targetFiles[indexA];
const fileB = targetFiles[indexB];
const score = calculateCosineSimilarity(fileA.vector, fileB.vector);

if (score > 0.05) {
  const correlationScore = Number(score.toFixed(4));
  relatedByIndex[indexA].push({ match: fileB.fileName, correlationScore });
  relatedByIndex[indexB].push({ match: fileA.fileName, correlationScore });
}
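The surrounding loops are not visible in this snippet. Since cosine similarity is symmetric, each pair only needs to be scored once and recorded on both sides. A sketch of the enclosing loop, with the 500-file cap mentioned in the commit message, could look like this. `buildRelatedDocuments` is a hypothetical name, the similarity function is passed in to keep the example self-contained, and taking the first 500 files is an assumption about how the cap is applied.

```javascript
// Score each unordered pair once (indexB starts after indexA) and record it on both files.
function buildRelatedDocuments(vectorizedFiles, similarity, { maxFiles = 500, threshold = 0.05 } = {}) {
  // Cap the input so the O(N²) pair loop stays bounded (assumed: keep the first maxFiles).
  const targetFiles = vectorizedFiles.slice(0, maxFiles);
  const relatedByIndex = targetFiles.map(() => []);
  for (let indexA = 0; indexA < targetFiles.length; indexA += 1) {
    for (let indexB = indexA + 1; indexB < targetFiles.length; indexB += 1) {
      const fileA = targetFiles[indexA];
      const fileB = targetFiles[indexB];
      const score = similarity(fileA.vector, fileB.vector);
      if (score > threshold) {
        const correlationScore = Number(score.toFixed(4));
        relatedByIndex[indexA].push({ match: fileB.fileName, correlationScore });
        relatedByIndex[indexB].push({ match: fileA.fileName, correlationScore });
      }
    }
  }
  return targetFiles.map((file, index) => ({
    fileName: file.fileName,
    relatedDocuments: relatedByIndex[index],
  }));
}
```

Compared with the earlier nested `forEach`, this halves the number of similarity calls; with the cap in place the worst case is 500 × 499 / 2 = 124,750 comparisons.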
return {
  fileName: file.fileName,
  topKeywords: file.topKeywords,
  relatedDocuments: related
};
### v1.2.0 Pipeline Architecture
* **Ingestion (Multithreaded):** Utilizes Node.js `worker_threads` and file-stat fingerprinting (`.analytics_cache.json`) to bypass redundant processing and drastically speed up execution.
* **Semantic Analytics:** Employs a TF-IDF weighting engine to filter generic stop-words and a Cosine Similarity math engine to automatically cluster related UAP documents based on vector distance.

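To make the TF-IDF step in the Semantic Analytics bullet concrete, here is a minimal sketch that turns tokenized documents into weighted term vectors. `computeTfIdfVectors` is a hypothetical name, and the smoothed IDF formula is one common variant chosen for illustration; the formula used in src/analytics/diagnostic.js is not shown in this PR excerpt.

```javascript
// documents: array of token arrays, assumed already lowercased and stop-word filtered.
function computeTfIdfVectors(documents) {
  // Document frequency: in how many documents does each term appear?
  const documentFrequency = new Map();
  for (const tokens of documents) {
    for (const term of new Set(tokens)) {
      documentFrequency.set(term, (documentFrequency.get(term) ?? 0) + 1);
    }
  }
  return documents.map((tokens) => {
    const counts = new Map();
    for (const term of tokens) counts.set(term, (counts.get(term) ?? 0) + 1);
    // Null-prototype object so document words like "constructor" are safe keys.
    const vector = Object.create(null);
    for (const [term, count] of counts) {
      const tf = count / tokens.length;
      // Smoothed IDF (assumed variant): rarer terms get larger weights.
      const idf = Math.log((1 + documents.length) / (1 + documentFrequency.get(term))) + 1;
      vector[term] = tf * idf;
    }
    return vector;
  });
}
```

Terms that appear in every document end up with the minimum IDF of 1, so they contribute less to similarity scores than terms that are specific to a few documents.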
The v1.2.0 AnalyticsBot engine supports multithreading and memoization caching. You can control these via CLI arguments:

* `node src/index.js ./my_folder --workers=4` : Manually set the number of WebAssembly worker threads (defaults to max CPU cores).
* `node src/index.js ./my_folder --clear-cache` : Bypasses the `.analytics_cache.json` file and forces a fresh read of all documents.
const content = await fs.readFile(task.filePath, 'utf-8');
const stats = await fs.stat(task.filePath);
const dates = [];
const locations = [];

if (TEXT_EXTENSIONS.has(extension)) {
  const stream = fs.createReadStream(filePath, { encoding: "utf8" });
  const lineReader = readline.createInterface({ input: stream, crlfDelay: Infinity });
  for await (const line of lineReader) await processTextData(line, words, dates, locations);
  stream.destroy();
} else if (extension === ".pdf") {
  const dataBuffer = await fsp.readFile(filePath);
  let extractedText = "";
// Filter out punctuation, make lowercase, and cull stop words
const rawWords = content
  .replace(/[^\w\s]/g, '')
  .toLowerCase()
  .split(/\s+/)
  .filter(word => word.length > 1 && !STOP_WORDS.has(word));
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
src/ingestion/file-ingestion.js:85
- When all files are served from cache (numWorkers === 0), the function returns before persisting the updated cache to disk. That means stale-entry eviction (and any other cache updates) are lost and .analytics_cache.json can keep growing with removed files.
const maxCores = options.workers || Math.max(1, os.cpus().length - 1);
const numWorkers = Math.min(pathsToProcess.length, maxCores);
if (numWorkers === 0) {
return { sourceDirectory, files };
}
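One way to address this would be to evict stale entries and persist the cache on every return path, including the all-cached one. The sketch below is illustrative only: `evictStaleEntries`, `finishIngestion`, and the injected `persistCache` callback are hypothetical names, not functions from src/ingestion/file-ingestion.js.

```javascript
// Drop cache entries whose files no longer exist in the scanned directory.
// Returns true when at least one entry was removed.
function evictStaleEntries(cache, currentPaths) {
  const live = new Set(currentPaths);
  let changed = false;
  for (const cachedPath of Object.keys(cache)) {
    if (!live.has(cachedPath)) {
      delete cache[cachedPath];
      changed = true;
    }
  }
  return changed;
}

// Hypothetical early-return path for the "everything was cached" case:
// persist the cache if eviction changed it, then return the cached files.
async function finishIngestion({ cache, currentPaths, files, sourceDirectory, persistCache }) {
  if (evictStaleEntries(cache, currentPaths)) {
    await persistCache(cache);
  }
  return { sourceDirectory, files };
}
```

The `numWorkers === 0` branch would then call something like `return finishIngestion({ ... })` instead of returning directly, so deletions in the source folder shrink the cache even when no file needs reparsing.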
// 🚨 FIX: Extract historical dates first, fallback to OS modification if none exist
const documentDate = (file.dates && file.dates.length > 0) ? file.dates[0] : file.modifiedAt;
if (!documentDate) continue;

const key = monthKey(documentDate);
if (!timeline[key]) timeline[key] = { totalWords: 0, locations: {} };
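To show the effect of this fallback end to end, here is a small self-contained sketch that buckets files by month. The `monthKey` body, the `buildTimeline` wrapper, and the `wordCount` field are assumptions for illustration; only the `documentDate` selection mirrors the quoted code.

```javascript
// Hypothetical monthKey: accepts a Date or ISO-like date string, returns 'YYYY-MM' (UTC).
function monthKey(value) {
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) return null; // unparseable date
  return `${date.getUTCFullYear()}-${String(date.getUTCMonth() + 1).padStart(2, '0')}`;
}

function buildTimeline(files) {
  const timeline = {};
  for (const file of files) {
    // Prefer the first date parsed from the document; fall back to the OS timestamp.
    const documentDate = (file.dates && file.dates.length > 0) ? file.dates[0] : file.modifiedAt;
    if (!documentDate) continue;
    const key = monthKey(documentDate);
    if (!key) continue;
    if (!timeline[key]) timeline[key] = { totalWords: 0, locations: {} };
    timeline[key].totalWords += file.wordCount ?? 0; // wordCount is an assumed field
  }
  return timeline;
}
```

A 1947 report downloaded last month lands in the `1947-07` bucket rather than in the month of the download, which is the behavior the fix is after.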
const stringValue = String(value ?? '');
const sanitizedValue = /^[=+\-@]/.test(stringValue) ? `'${stringValue}` : stringValue;
return `"${sanitizedValue.replace(/"/g, '""')}"`;
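Wrapped in a function, the three lines above can be applied to every cell of a row. The names `escapeCsvValue` and `toCsvRow` are hypothetical; the body of `escapeCsvValue` is the quoted snippet unchanged.

```javascript
// Quote every cell, double embedded quotes, and neutralize spreadsheet formulas.
function escapeCsvValue(value) {
  const stringValue = String(value ?? '');
  // A leading =, +, - or @ can be interpreted as a formula by spreadsheet apps,
  // so prefix such values with an apostrophe to force plain-text handling.
  const sanitizedValue = /^[=+\-@]/.test(stringValue) ? `'${stringValue}` : stringValue;
  return `"${sanitizedValue.replace(/"/g, '""')}"`;
}

// Hypothetical row helper: escape each cell and join with commas.
function toCsvRow(cells) {
  return cells.map(escapeCsvValue).join(',');
}
```

One trade-off worth noting: negative numbers such as `-5` also start with `-`, so they are exported as `'-5`. If numeric metrics can be negative, the check may need to exempt values that parse as plain numbers.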
#!/usr/bin/env sh

npm run docs:generate || exit 1
git add docs/ || exit 1
npm test || exit 1
npm run docs:check || exit 1
Summary
- … fileName is missing.
- … (contents: read) to satisfy CodeQL.
- … (prepare script + devDependency) so the committed pre-commit hook runs in standard contributor setups.
Checklist
- … npm run docs:generate.
- … npm run docs:check.
- … docs/legacy-prototype.md.