
v1.2.0 - Core Engine Optimization & Semantic Analytics | (Sprints 1-3) + Review Feedback/Regression Fixes #11

Merged
aj1126 merged 16 commits into main from Sprint1through3
Jun 14, 2026

Conversation

@aj1126 aj1126 commented Jun 13, 2026

Owner

Summary

  • Applied the remaining requested fixes from the linked review thread.
  • Correctly serialized watch-mode pipeline reruns by chaining executions through the pipeline queue.
  • Hardened CSV export with proper quoting and spreadsheet-formula sanitization.
  • Expanded CSV formula sanitization to also block spreadsheet formulas hidden behind leading whitespace.
  • Implemented cache schema version wrapping and legacy-cache rejection.
  • Made cache writes atomic by writing a temporary file and renaming it into place.
  • Ensured cache updates are still persisted when all files are served from cache (zero-worker path).
  • Added stable semantic-analysis filename fallbacks when fileName is missing.
  • Switched worker text ingestion to streamed line-based reading to avoid loading whole files into memory.
  • Restored numeric-token filtering in worker word extraction to prevent skewed frequency analytics.
  • Hardened predictive timeline building to ignore non-ISO NLP date phrases and safely fall back to file modification timestamps.
  • Replaced the duplicated pipeline test with targeted coverage for ingestion option plumbing, CSV escaping, and semantic filename fallback behavior.
  • Updated README and architecture docs to match the current Node.js worker-thread, watch-mode, cache, and export behavior.
  • Added explicit GitHub Actions token permissions (contents: read) to satisfy CodeQL.
  • Added Husky installation wiring (prepare script + devDependency) so the committed pre-commit hook runs in standard contributor setups.

Checklist

  • I reviewed whether this change affects README, architecture docs, or legacy docs.
  • If commands, supported file types, or layout metadata changed, I ran npm run docs:generate.
  • I ran npm run docs:check.
  • If historical Python prototype guidance changed, I updated docs/legacy-prototype.md.

aj1126 added 2 commits June 13, 2026 14:42
…tion

- Added TF-IDF analysis to diagnostic analytics for keyword extraction.
- Implemented CSV report generation in the delivery module.
- Improved file ingestion with caching and fingerprinting for efficiency.
- Enhanced predictive analytics with weighted moving average forecasting.
- Updated prescriptive analytics to handle missing metadata more gracefully.
- Introduced GitHub Actions CI pipeline for automated testing across multiple Node.js versions.
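
For readers unfamiliar with the weighted moving average mentioned in the commit notes above, here is a minimal sketch. The linear weights (1 for the oldest point, n for the newest), the default window of three, and the name `weightedMovingAverage` are assumptions for illustration; this thread does not show the weighting scheme actually used in src/analytics/predictive.js.

```javascript
// Hypothetical helper: linear-weighted moving average over the last `windowSize` points.
// Newer observations receive larger weights (1 for the oldest, n for the newest).
function weightedMovingAverage(series, windowSize = 3) {
  const recent = series.slice(-windowSize);
  if (recent.length === 0) return 0;

  let weightedSum = 0;
  let weightTotal = 0;
  recent.forEach((value, index) => {
    const weight = index + 1;
    weightedSum += value * weight;
    weightTotal += weight;
  });
  return weightedSum / weightTotal;
}

// Monthly word counts, oldest first.
console.log(weightedMovingAverage([0, 0, 6])); // (0*1 + 0*2 + 6*3) / 6 = 3
```

Because the newest month carries the largest weight, a recent spike moves the forecast more than an old one would, which is the usual reason to prefer this over a plain average.
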
@aj1126 aj1126 changed the title from "Sprint1through3" to "v1.2.0 - Core Engine Optimization & Semantic Analytics | (Sprints 1-3)" Jun 13, 2026
aj1126 and others added 6 commits June 13, 2026 15:07
- Added advanced CLI flags (--workers, --clear-cache, --format=csv) to the root README.md usage scope.
- Updated docs/architecture.md to detail v1.2.0 pipeline enhancements, including multithreaded worker pool mechanics and semantic vector cross-linking via TF-IDF / Cosine Similarity.
- Verified all documentation structures and ran local test runner pipelines cleanly.
Co-authored-by: Copilot <copilot@github.com>
@aj1126 aj1126 requested a review from Copilot June 13, 2026 23:25
@aj1126 aj1126 added the bug, documentation, and enhancement labels Jun 13, 2026
@aj1126 aj1126 self-assigned this Jun 13, 2026

Copilot AI left a comment


Pull request overview

This PR expands the Node CLI pipeline with multithreaded ingestion + memoization caching, adds semantic/forecasting enhancements to the analytics tiers, and introduces a CSV delivery format alongside CI automation.

Changes:

  • Added ingestion options plumbing (worker count, cache clearing) from CLI → pipeline → ingestion, plus .analytics_cache.json memoization.
  • Implemented semantic diagnostics (TF‑IDF + cosine similarity) and updated predictive/prescriptive logic to align with new ingestion output shape.
  • Added CSV report generation, updated docs, and introduced a GitHub Actions workflow to run tests + docs checks across Node 18/20/22.
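
As a rough illustration of the TF-IDF step named in the list above, the sketch below computes per-document term weights. The function name `computeTfIdf`, the input shape (arrays of already-filtered tokens), and the unsmoothed `log(N / df)` form of IDF are assumptions; the exact formula in src/analytics/diagnostic.js is not shown in this thread.

```javascript
// Hypothetical sketch: each document is an array of already-filtered tokens.
function computeTfIdf(documents) {
  // Count how many documents contain each term (document frequency).
  const documentFrequencies = Object.create(null);
  for (const tokens of documents) {
    for (const term of new Set(tokens)) {
      documentFrequencies[term] = (documentFrequencies[term] ?? 0) + 1;
    }
  }

  return documents.map((tokens) => {
    const vector = Object.create(null);
    if (tokens.length === 0) return vector;

    const termCounts = Object.create(null);
    for (const term of tokens) termCounts[term] = (termCounts[term] ?? 0) + 1;

    for (const term of Object.keys(termCounts)) {
      const tf = termCounts[term] / tokens.length;
      const idf = Math.log(documents.length / documentFrequencies[term]);
      vector[term] = tf * idf;
    }
    return vector;
  });
}

const vectors = computeTfIdf([
  ['ufo', 'light'],
  ['ufo', 'desert'],
]);
console.log(vectors[0].ufo);   // 0, because the term appears in every document
console.log(vectors[0].light); // 0.5 * ln(2), roughly 0.3466
```

A term that occurs in every document gets a weight of zero, which is how TF-IDF pushes generic vocabulary out of the top keywords.
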

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| test/pipeline.test.js | Adjusts/extends pipeline tests; currently includes a duplicated analytics-tier test block. |
| src/pipeline.js | Adds options passthrough to ingestion. |
| src/ingestion/worker.js | Reworks worker processing to text parsing + stop-word culling; now returns additional fields per file. |
| src/ingestion/file-ingestion.js | Adds symlink skipping, fingerprint-based cache reads/writes, and passes tasks (with fingerprints) to workers. |
| src/index.js | Adds CLI flags (--workers, --clear-cache, --format=csv) and routes to CSV delivery. |
| src/delivery/csv-generator.js | Introduces CSV export surface. |
| src/analytics/prescriptive.js | Adapts missing-metadata detection/mapping for updated ingestion output. |
| src/analytics/predictive.js | Switches forecasting to weighted moving average + fills missing month intervals. |
| src/analytics/diagnostic.js | Adds TF‑IDF + cosine similarity semantic analysis output. |
| README.md | Adds installation/usage guidance and documents new CLI flags (with some WASM-specific wording). |
| docs/architecture.md | Notes v1.2.0 architecture additions (worker pool, caching, semantic analytics). |
| .gitignore | Ignores .analytics_cache.json. |
| .github/workflows/test.yml | Adds CI job for tests + docs checks across Node versions. |

Comments suppressed due to low confidence (1)

test/pipeline.test.js:83

  • This test duplicates the earlier “builds all analytics tiers” test but doesn’t assert any additional behavior. Since generateAnalyticsReport now accepts ingestion options, it’d be more valuable to repurpose this test to cover the new workers / clearCache option plumbing instead of repeating the same assertions.
test('generateAnalyticsReport builds all analytics tiers from text files (descriptive dates path)', async () => {
    const fixtureRoot = await createFixtureDirectory();

    try {
        const report = await generateAnalyticsReport(fixtureRoot);

        assert.equal(report.descriptive.fileCount, 2);
        assert.deepEqual(report.descriptive.locations, ['Phoenix', 'Roswell']);
        
        assert.deepEqual(report.descriptive.dates, ['2024-01-01', '2024-02-14']);
        
        assert.ok(report.descriptive.wordFrequency.location >= 2);
        assert.ok(report.diagnostic.wordUsageByLocation.Roswell.length > 0);
        assert.equal(report.predictive.locationClusterForecast.likelyNextHotspot, 'Phoenix');
        assert.equal(report.prescriptive.recommendations[0].type, 'folder-restructure');
    } finally {

Comment thread src/ingestion/worker.js
Comment on lines +12 to +16
parentPort.on('message', async (task) => {
try {
const content = await fs.readFile(task.filePath, 'utf-8');
const stats = await fs.stat(task.filePath);

Comment thread src/ingestion/worker.js
Comment on lines +26 to +36
// State Caching (Memoization)
const cachePath = path.join(process.cwd(), '.analytics_cache.json');
let cache = {};
if (!options.clearCache) {
try {
const cacheData = await fsp.readFile(cachePath, 'utf-8');
cache = JSON.parse(cacheData);
} catch (err) {
cache = {};
}
}
Comment on lines +103 to +105
// Save newly parsed data back to .analytics_cache.json
await fsp.writeFile(cachePath, JSON.stringify(cache, null, 2));
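
The PR summary mentions two cache hardening steps that the snippets above predate: wrapping the cache in a schema version (rejecting legacy files) and writing atomically via a temporary file plus rename. A minimal sketch of both ideas follows. The constant `CACHE_SCHEMA_VERSION`, the `version`/`entries` field names, and both helper names are assumptions, and synchronous `fs` calls are used only to keep the example short (the PR's code uses the promise API).

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Assumed schema marker; the real constant and field names are not shown in this thread.
const CACHE_SCHEMA_VERSION = 2;

// Returns cached entries, or an empty cache when the file is missing, unparseable,
// or written under a different schema version (legacy-cache rejection).
function loadCache(cachePath) {
  try {
    const parsed = JSON.parse(fs.readFileSync(cachePath, 'utf-8'));
    const entries = parsed?.entries;
    if (parsed?.version !== CACHE_SCHEMA_VERSION || typeof entries !== 'object' || entries === null) {
      return {};
    }
    return entries;
  } catch {
    return {};
  }
}

// Writes to a sibling temporary file, then renames it over the target, so an
// interrupted write does not leave a truncated cache file at the real path.
function saveCacheAtomically(cachePath, entries) {
  const tempPath = `${cachePath}.${process.pid}.tmp`;
  const payload = JSON.stringify({ version: CACHE_SCHEMA_VERSION, entries }, null, 2);
  fs.writeFileSync(tempPath, payload, 'utf-8');
  fs.renameSync(tempPath, cachePath);
}

// Demo in a throwaway directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cache-demo-'));
const demoCachePath = path.join(demoDir, '.analytics_cache.json');
saveCacheAtomically(demoCachePath, { 'a.txt': { fingerprint: '123-456', data: { totalWords: 3 } } });
console.log(loadCache(demoCachePath));
```

Keeping the temporary file in the same directory as the target matters: a rename is only a cheap replace operation when both paths are on the same filesystem.
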

Comment thread src/index.js
Comment on lines 58 to +66
let timeout;
const triggerPipeline = () => {
    clearTimeout(timeout);
    timeout = setTimeout(() => {
        process.stdout.write(`\n🔄 File system event detected. Recalculating analytics...\n`);
        // The ingestion options are now forwarded to each rerun.
        runPipeline(sourceDirectory, format, options);
    }, 500); // 500 ms debounce buffer
};

// Bind events
watcher.on('add', triggerPipeline).on('change', triggerPipeline).on('unlink', triggerPipeline);
Comment thread src/analytics/diagnostic.js Outdated
Comment on lines +80 to +89
return vectorizedFiles.map(fileA => {
const related = [];
vectorizedFiles.forEach(fileB => {
if (fileA.fileName !== fileB.fileName) {
const score = calculateCosineSimilarity(fileA.vector, fileB.vector);
if (score > 0.05) { // Threshold for correlation relevancy
related.push({ match: fileB.fileName, correlationScore: Number(score.toFixed(4)) });
}
}
});
Comment on lines +4 to +24
async function generateCsvReport(report, exportsDir) {
await fs.mkdir(exportsDir, { recursive: true });
const csvPath = path.join(exportsDir, `report-${Date.now()}.csv`);

let csvContent = "Category,Metric,Value\n";
csvContent += `Descriptive,FileCount,${report.descriptive.fileCount}\n`;

const locations = report.descriptive.locations || report.locations || [];
csvContent += `Descriptive,UniqueLocations,"${locations.join(', ')}"\n`;

if (report.predictive?.locationClusterForecast) {
csvContent += `Predictive,LikelyNextHotspot,${report.predictive.locationClusterForecast.likelyNextHotspot}\n`;
}
if (report.predictive?.keywordFrequencyForecast) {
csvContent += `Predictive,ForecastMonth,${report.predictive.keywordFrequencyForecast.forecastMonth}\n`;
csvContent += `Predictive,ForecastWordCount,${report.predictive.keywordFrequencyForecast.forecastWordCount}\n`;
}

await fs.writeFile(csvPath, csvContent, 'utf-8');
return csvPath;
}
Comment thread README.md
Comment on lines +145 to +147
2. **Install dependencies:**
Because this engine utilizes pre-compiled WebAssembly, there are no complex C++ build tools or `node-gyp` configurations required on Windows. Simply run:
```bash
Comment thread README.md Outdated
Comment on lines +211 to +212
* `node src/index.js ./my_folder --workers=4` : Manually set the number of WebAssembly worker threads (defaults to max CPU cores).
* `node src/index.js ./my_folder --clear-cache` : Bypasses the `.analytics_cache.json` file and forces a fresh read of all documents.
Copilot AI changed the title from "v1.2.0 - Core Engine Optimization & Semantic Analytics | (Sprints 1-3)" to "v1.2.0 - Core Engine Optimization & Semantic Analytics | (Sprints 1-3) + Review Feedback Fixes" Jun 13, 2026
aj1126 added 2 commits June 13, 2026 19:48
…mory

- Fixes domain logic in predictive analytics by switching the timeline basis from the OS file modification time to actual parsed document dates. This prevents modern download timestamps from invalidating historical UAP forecasting.
- Resolves worker IPC performance bottlenecks by calculating word frequencies directly inside the worker thread pool rather than passing massive raw string arrays across the boundary.
- Mitigates main-thread blocking in diagnostic analytics by capping the O(N²) TF-IDF cosine similarity matrix calculations to a maximum of 500 files.
- Adds backwards-compatibility layers in descriptive and diagnostic modules to gracefully handle legacy cache formats without crashing.
- Refines watch mode path exclusions in the index file to use a strictly scoped regex for the data exports directory.
aj1126 commented Jun 13, 2026

Owner Author

Based on a comprehensive review of the latest commits on this branch, the codebase shows great progress—particularly with the multi-threaded worker pool management, atomic file-caching setup using temporary files, and structured analytical pipelines.

However, before merging this pull request into main, there is one critical blocker bug regarding the timeline logic, along with three stability edge cases and two major performance bottlenecks that should be resolved to prevent runtime failures and application freezing at scale.


🚨 1. Critical Blocker Bug

Chronological Timeline Corrupted by Filesystem Metadata

  • File: src/analytics/predictive.js
  • Issue: In buildKeywordSeries(files), the timeline key for the Weighted Moving Average (WMA) forecast is determined via file.modifiedAt:
if(!file.modifiedAt) continue;
const key = monthKey(file.modifiedAt);
  • Impact: UAP records are inherently historical datasets (often spanning decades). If a user bulk-downloads or copies a folder of historical reports onto their local machine today, the operating system typically sets their modification timestamps (mtime) to today. Every document then falls into the current calendar month, which leaves the forecast with a single data point and neutralizes the predictive model.
  • The Fix: Leverage the historical document dates already extracted by your worker pool, falling back to modifiedAt only if no embedded date entities are found:
const documentDate = (file.dates && file.dates.length > 0) ? file.dates[0] : file.modifiedAt;
if (!documentDate) continue;
const key = monthKey(documentDate);
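
The fix above takes `file.dates[0]` as-is; the PR summary also mentions ignoring non-ISO date phrases before falling back to the modification timestamp. A sketch of that combined behavior is below. The helper `resolveDocumentDate`, the `ISO_DATE_PATTERN` check, and this particular `monthKey` implementation are assumptions, not the code in src/analytics/predictive.js.

```javascript
// Assumed pattern: accept only strings that start with YYYY-MM-DD.
const ISO_DATE_PATTERN = /^\d{4}-\d{2}-\d{2}/;

// Hypothetical helper: first parseable ISO date in file.dates, else modifiedAt, else null.
function resolveDocumentDate(file) {
  const isoDate = (file.dates ?? []).find((value) => {
    return typeof value === 'string'
      && ISO_DATE_PATTERN.test(value)
      && !Number.isNaN(Date.parse(value));
  });
  return isoDate ?? file.modifiedAt ?? null;
}

// Hypothetical monthKey: "YYYY-MM" in UTC.
function monthKey(dateValue) {
  return new Date(dateValue).toISOString().slice(0, 7);
}

const historical = { dates: ['next Tuesday', '2024-02-14'], modifiedAt: '2026-06-13T12:00:00.000Z' };
const undated = { dates: ['last summer'], modifiedAt: '2026-06-13T12:00:00.000Z' };

console.log(monthKey(resolveDocumentDate(historical))); // "2024-02"
console.log(monthKey(resolveDocumentDate(undated)));    // "2026-06"
```

A caller would still skip the file when `resolveDocumentDate` returns null, matching the `if (!documentDate) continue;` guard in the proposed fix.
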

🛡️ 2. Edge-Cases & Stability Bugs

Prototype Pollution via Arbitrary Document Content

  • Files: src/analytics/descriptive.js, src/analytics/diagnostic.js

  • Issue: You are using standard JavaScript plain objects ({}) as map accumulators for arbitrary textual inputs.

  • Impact: If an ingested text document contains a token that matches an inherited Object.prototype property name, such as "toString", "constructor", or "valueOf", the lookup returns the inherited value instead of undefined, and the pipeline produces type errors or contaminated values:

      • In descriptive.js, counts["toString"] returns the inherited Object.prototype.toString function, so the numeric addition becomes string concatenation.

      • In diagnostic.js, documentFrequencies["toString"] is contaminated the same way, so the division inside the TF-IDF log formula evaluates to NaN.

      • In incrementNestedCount, a group name that matches a built-in property resolves to the inherited built-in rather than a fresh counter object, so nested counts are written onto that shared built-in (or the write fails) instead of onto the accumulator.

  • The Fix: Enforce null-prototype dictionary structures using Object.create(null) for all text-parsing accumulators:

// In descriptive.js:
return items.reduce((counts, item) => {
    counts[item] = (counts[item] ?? 0) + 1;
    return counts;
}, Object.create(null));
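
To make the failure mode concrete, the following sketch runs the same reducer against a plain object and a null-prototype object. The function name `countItems` and the sample tokens are illustrative only.

```javascript
// Same reducer as above, with the accumulator passed in so both variants can be compared.
function countItems(items, accumulator) {
  return items.reduce((counts, item) => {
    counts[item] = (counts[item] ?? 0) + 1;
    return counts;
  }, accumulator);
}

const tokens = ['toString', 'ufo', 'toString'];

const unsafeCounts = countItems(tokens, {});
const safeCounts = countItems(tokens, Object.create(null));

// With {}, the first lookup finds the inherited function, and `function + 1` is a string.
console.log(typeof unsafeCounts.toString); // "string"
// With a null prototype there is nothing to inherit, so the count is numeric.
console.log(safeCounts.toString);          // 2
console.log(Object.keys(safeCounts));      // [ 'toString', 'ufo' ]
```

One trade-off: null-prototype objects have no `hasOwnProperty` or `toString` methods, so downstream code should read them with `Object.keys` or `Object.entries`. A `Map` is a reasonable alternative when the result does not need to be serialized directly as JSON.
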

Cache-Bypass Leaks for Unsupported File Extensions

  • File: src/ingestion/file-ingestion.js
  • Issue: When a worker encounters an unsupported file extension, it gracefully posts success: true back to the pool, but leaves the result property undefined. Your coordinator checks if (msg.success && msg.result) before caching.
  • Impact: Because unsupported items never get added to .analytics_cache.json, they are added to pathsToProcess again on every successive execution. This repeatedly spins up workers for unsupported files during active --watch sessions, wasting processing cycles.
  • The Fix: Explicitly cache skipped files with a distinct flag:
// In worker.js:
parentPort.postMessage({ success: true, filePath: task.filePath, fingerprint: task.fingerprint, result: { skipped: true } });

// In file-ingestion.js filter out skipped items before processing tiers:
if (msg.success && msg.result) {
    cache[msg.filePath] = { fingerprint: msg.fingerprint, data: msg.result };
    if (!msg.result.skipped) files.push(msg.result);
}
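
A small sketch of how the coordinator side could use the skipped flag end to end. Both helper names (`recordWorkerResult`, `needsProcessing`) and the cache-entry shape are assumptions modeled on the snippet above, not the actual code in src/ingestion/file-ingestion.js.

```javascript
// Hypothetical coordinator helper mirroring the message shape in the snippet above.
function recordWorkerResult(msg, cache, files) {
  if (!msg.success || !msg.result) return;
  // Cache every handled path, including skipped ones, so it is not re-queued next run.
  cache[msg.filePath] = { fingerprint: msg.fingerprint, data: msg.result };
  // Only real parse results are passed on to the analytics tiers.
  if (!msg.result.skipped) files.push(msg.result);
}

// Hypothetical check: a path needs a worker only if it is new or its fingerprint changed.
function needsProcessing(filePath, fingerprint, cache) {
  const entry = cache[filePath];
  return !entry || entry.fingerprint !== fingerprint;
}

const cache = Object.create(null);
const files = [];
recordWorkerResult({ success: true, filePath: 'notes.txt', fingerprint: 'f1', result: { fileName: 'notes.txt' } }, cache, files);
recordWorkerResult({ success: true, filePath: 'image.png', fingerprint: 'f2', result: { skipped: true } }, cache, files);

console.log(files.length);                              // 1
console.log(needsProcessing('image.png', 'f2', cache)); // false
console.log(needsProcessing('image.png', 'f3', cache)); // true
```

Code that rebuilds `files` from cached entries on a later run would need the same `skipped` check, otherwise placeholder entries would leak into the report.
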

Directory Watcher Over-Ignoring Valid Paths

  • File: src/index.js
  • Issue: The chokidar watcher utilizes a literal regex segment /data_exports/ to exclude the export directory from infinite recursive loops.
  • Impact: If a user clones this repository into a folder path that contains those characters (e.g., /Users/username/uap_data_exports_project/bot), the entire working directory matches the ignore condition, causing watch-mode to silently fail to initialize.
  • The Fix: Require data_exports to be followed by a path separator or the end of the path:
ignored: [/(^|[\/\\])\../, /node_modules/, /data_exports([\/\\]|$)/]
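
The difference between the patterns is easiest to see by testing them against sample paths. The first two expressions come from this comment; `segmentIgnore` is an assumed stricter variant (not part of the PR) that also requires a separator or start-of-string before the directory name.

```javascript
const looseIgnore = /data_exports/;
const scopedIgnore = /data_exports([\/\\]|$)/;
// Assumption, not from the PR: anchor the start of the path segment as well.
const segmentIgnore = /(^|[\/\\])data_exports([\/\\]|$)/;

const samples = [
  '/Users/username/uap_data_exports_project/bot/notes.txt',
  '/repo/data_exports/report-1.csv',
  '/repo/data_exports',
  '/repo/my_data_exports/report-1.csv',
];

for (const samplePath of samples) {
  console.log(samplePath, {
    loose: looseIgnore.test(samplePath),
    scoped: scopedIgnore.test(samplePath),
    segment: segmentIgnore.test(samplePath),
  });
}
```

The scoped pattern no longer ignores the `uap_data_exports_project` checkout, but it still matches a directory such as `my_data_exports/` because nothing constrains what precedes the name. Whether that matters depends on how the source directories are laid out.
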

🚀 3. Performance & Scaling Optimizations

Main-Thread Blocking on $O(N^2)$ Cosine Similarities

  • File: src/analytics/diagnostic.js
  • Issue: The semantic cross-linking calculation uses a nested synchronous double loop to match every document vector against every other document vector:
for (let indexA = 0; indexA < vectorizedFiles.length; indexA += 1) {
    for (let indexB = indexA + 1; indexB < vectorizedFiles.length; indexB += 1) {
  • Impact: For 100 documents, this performs 4,950 pairwise comparisons (N(N-1)/2). For 5,000 documents, this jumps to 12,497,500. Since this runs synchronously on Node's main event loop, a large corpus will block the event loop for the duration of the calculation, stalling terminal output and delaying watch-mode debounce timers.
  • The Fix: Cap the number of documents fed into the cross-referencing loop (e.g., const targetFiles = vectorizedFiles.slice(0, 500);) or plan to offload the pairwise similarity computation to a background worker thread in Sprint 4.
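
A self-contained sketch of the capped pairwise loop, for reference. The body of `calculateCosineSimilarity` is an assumption (the PR's implementation is not shown in this thread), as are the wrapper name `linkRelatedDocuments` and the `hasOwn` helper; the 500-file cap and the 0.05 threshold are taken from the snippets in this conversation.

```javascript
const hasOwn = (object, key) => Object.prototype.hasOwnProperty.call(object, key);

// Assumed implementation: cosine similarity over sparse term-weight objects.
function calculateCosineSimilarity(vectorA, vectorB) {
  let dotProduct = 0;
  let squaredNormA = 0;
  let squaredNormB = 0;
  for (const term of Object.keys(vectorA)) {
    squaredNormA += vectorA[term] ** 2;
    if (hasOwn(vectorB, term)) dotProduct += vectorA[term] * vectorB[term];
  }
  for (const term of Object.keys(vectorB)) squaredNormB += vectorB[term] ** 2;
  if (squaredNormA === 0 || squaredNormB === 0) return 0;
  return dotProduct / Math.sqrt(squaredNormA * squaredNormB);
}

// Compares each unordered pair once and caps the input at `maxFiles` documents.
function linkRelatedDocuments(vectorizedFiles, maxFiles = 500, threshold = 0.05) {
  const targetFiles = vectorizedFiles.slice(0, maxFiles);
  const relatedByIndex = targetFiles.map(() => []);
  for (let indexA = 0; indexA < targetFiles.length; indexA += 1) {
    for (let indexB = indexA + 1; indexB < targetFiles.length; indexB += 1) {
      const score = calculateCosineSimilarity(targetFiles[indexA].vector, targetFiles[indexB].vector);
      if (score > threshold) {
        const correlationScore = Number(score.toFixed(4));
        relatedByIndex[indexA].push({ match: targetFiles[indexB].fileName, correlationScore });
        relatedByIndex[indexB].push({ match: targetFiles[indexA].fileName, correlationScore });
      }
    }
  }
  return targetFiles.map((file, index) => ({
    fileName: file.fileName,
    relatedDocuments: relatedByIndex[index],
  }));
}

// Number of unordered pairs for n documents.
const pairCount = (n) => (n * (n - 1)) / 2;
console.log(pairCount(100), pairCount(5000), pairCount(500)); // 4950 12497500 124750
```

With the cap at 500 files the loop body runs at most 124,750 times. Note that slicing simply drops documents beyond the cap from cross-linking, so the choice of which 500 to keep affects the result.
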

Inter-Thread Communication Memory Footprint

  • File: src/ingestion/worker.js
  • Issue: The worker extracts, filters, and passes the entire raw array of strings (words) across the thread channel back to the parent port via parentPort.postMessage.
  • Impact: postMessage copies its payload with the structured clone algorithm, so serializing and deserializing very large string arrays causes large transient memory spikes during heavy document ingestion.
  • The Fix: Compute the frequency distribution directly inside the worker thread, passing back a concise map rather than thousands of repeated array tokens:
// Inside worker.js:
const wordFrequency = Object.create(null);
for (const word of words) {
    wordFrequency[word] = (wordFrequency[word] || 0) + 1;
}
// Pass wordFrequency, totalWords, and uniqueWords instead of the raw words array
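
Putting the pieces together, a worker could tokenize each line and update the frequency map incrementally, so the full token array never has to exist at once. The sketch below reuses the cleaning regexes quoted elsewhere in this review; the tiny `STOP_WORDS` set, the helper names, and the reading of "numeric-token filtering" as dropping digit-only tokens are assumptions.

```javascript
// Assumed demo stop-word list; the project's STOP_WORDS set is not shown in this thread.
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'in', 'and', 'was', 'at']);

// Tokenizes one line, updates the shared frequency map, and returns the token count.
function summarizeLine(line, wordFrequency) {
  const tokens = line
    .replace(/[^\w\s]/g, '')
    .toLowerCase()
    .split(/\s+/)
    // Drop one-character tokens, stop words, and (assumed) digit-only tokens.
    .filter((word) => word.length > 1 && !STOP_WORDS.has(word) && !/^\d+$/.test(word));
  for (const word of tokens) {
    wordFrequency[word] = (wordFrequency[word] ?? 0) + 1;
  }
  return tokens.length;
}

// Builds the compact payload a worker could post back instead of the raw words array.
function summarizeText(lines) {
  const wordFrequency = Object.create(null);
  let totalWords = 0;
  for (const line of lines) totalWords += summarizeLine(line, wordFrequency);
  return { wordFrequency, totalWords, uniqueWords: Object.keys(wordFrequency).length };
}

const summary = summarizeText([
  'The craft was seen in 1947 at Roswell.',
  'Roswell lights, 3 witnesses.',
]);
console.log(summary.totalWords, summary.uniqueWords, summary.wordFrequency.roswell); // 6 5 2
```

The payload size now grows with the number of distinct words rather than the document length, which is the point of the recommendation above.
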

Summary Recommendation

Approve with modifications. These changes do not require reworking your macro structure; they are drop-in logic safety guards. Fixing the predictive.js timeline date assignment and wrapping your accumulators in Object.create(null) will ensure a highly reliable merge into main.

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

test/pipeline.test.js:72

  • This test duplicates the earlier "generateAnalyticsReport builds all analytics tiers from text files" test (same fixture + assertions). Keeping both adds maintenance overhead without increasing coverage; consider deleting this duplicate or merging into the existing test.
test('generateAnalyticsReport builds all analytics tiers from text files (descriptive dates path)', async () => {
    const fixtureRoot = await createFixtureDirectory();

    try {
        const report = await generateAnalyticsReport(fixtureRoot);

Comment thread src/index.js
Comment on lines +60 to 64
let pipelineQueue = Promise.resolve();
const triggerPipeline = () => {
clearTimeout(timeout);
timeout = setTimeout(() => {
process.stdout.write(`\n🔄 File system event detected. Recalculating analytics...\n`);
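
The excerpt above is cut off, so here is a minimal sketch of the idea behind `pipelineQueue`: each rerun is chained onto the previous promise so two pipeline executions never overlap. The factory names `createSerializedRunner` and `createDebouncedTrigger` are hypothetical, and the error handling shown is an assumption rather than the PR's behavior.

```javascript
// Hypothetical factory: every call is chained after the previous one finishes.
function createSerializedRunner(task) {
  let queue = Promise.resolve();
  return function enqueue(...args) {
    queue = queue
      .then(() => task(...args))
      .catch((error) => {
        // Swallow the failure so one bad run does not block later runs.
        console.error('Pipeline run failed:', error);
      });
    return queue;
  };
}

// Hypothetical debounce wrapper: collapses bursts of events into one enqueue call.
function createDebouncedTrigger(enqueue, delayMs = 500) {
  let timeout;
  return function trigger(...args) {
    clearTimeout(timeout);
    timeout = setTimeout(() => enqueue(...args), delayMs);
  };
}

// Example wiring (runPipeline, watcher, sourceDirectory, format, and options
// are the names used in the PR's src/index.js excerpts):
// const enqueueRun = createSerializedRunner(() => runPipeline(sourceDirectory, format, options));
// const triggerPipeline = createDebouncedTrigger(enqueueRun);
// watcher.on('add', triggerPipeline).on('change', triggerPipeline).on('unlink', triggerPipeline);
```

Debouncing alone only limits how often a run starts; if a run takes longer than the debounce window, a second run could still start while the first is in flight. Chaining through the queue closes that gap.
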
Comment on lines +4 to +24
async function generateCsvReport(report, exportsDir) {
await fs.mkdir(exportsDir, { recursive: true });
const csvPath = path.join(exportsDir, `report-${Date.now()}.csv`);

let csvContent = "Category,Metric,Value\n";
csvContent += `Descriptive,FileCount,${report.descriptive.fileCount}\n`;

const locations = report.descriptive.locations || report.locations || [];
csvContent += `Descriptive,UniqueLocations,"${locations.join(', ')}"\n`;

if (report.predictive?.locationClusterForecast) {
csvContent += `Predictive,LikelyNextHotspot,${report.predictive.locationClusterForecast.likelyNextHotspot}\n`;
}
if (report.predictive?.keywordFrequencyForecast) {
csvContent += `Predictive,ForecastMonth,${report.predictive.keywordFrequencyForecast.forecastMonth}\n`;
csvContent += `Predictive,ForecastWordCount,${report.predictive.keywordFrequencyForecast.forecastWordCount}\n`;
}

await fs.writeFile(csvPath, csvContent, 'utf-8');
return csvPath;
}
Comment on lines +26 to +36
// State Caching (Memoization)
const cachePath = path.join(process.cwd(), '.analytics_cache.json');
let cache = {};
if (!options.clearCache) {
try {
const cacheData = await fsp.readFile(cachePath, 'utf-8');
cache = JSON.parse(cacheData);
} catch (err) {
cache = {};
}
}
Comment on lines +88 to +96
const fileA = targetFiles[indexA];
const fileB = targetFiles[indexB];
const score = calculateCosineSimilarity(fileA.vector, fileB.vector);

if (score > 0.05) {
const correlationScore = Number(score.toFixed(4));
relatedByIndex[indexA].push({ match: fileB.fileName, correlationScore });
relatedByIndex[indexB].push({ match: fileA.fileName, correlationScore });
}
Comment on lines +105 to +109
return {
fileName: file.fileName,
topKeywords: file.topKeywords,
relatedDocuments: related
};
Comment thread docs/architecture.md
Comment on lines +12 to +15
### v1.2.0 Pipeline Architecture
* **Ingestion (Multithreaded):** Utilizes Node.js `worker_threads` and file-stat fingerprinting (`.analytics_cache.json`) to bypass redundant processing and drastically speed up execution.
* **Semantic Analytics:** Employs a TF-IDF weighting engine to filter generic stop-words and a Cosine Similarity math engine to automatically cluster related UAP documents based on vector distance.

Comment thread README.md
Comment on lines +209 to +212
The v1.2.0 AnalyticsBot engine supports multithreading and memoization caching. You can control these via CLI arguments:

* `node src/index.js ./my_folder --workers=4` : Manually set the number of WebAssembly worker threads (defaults to max CPU cores).
* `node src/index.js ./my_folder --clear-cache` : Bypasses the `.analytics_cache.json` file and forces a fresh read of all documents.
Comment thread src/ingestion/worker.js Outdated
Comment on lines +26 to +37
const content = await fs.readFile(task.filePath, 'utf-8');
const stats = await fs.stat(task.filePath);
const dates = [];
const locations = [];

if (TEXT_EXTENSIONS.has(extension)) {
const stream = fs.createReadStream(filePath, { encoding: "utf8" });
const lineReader = readline.createInterface({ input: stream, crlfDelay: Infinity });
for await (const line of lineReader) await processTextData(line, words, dates, locations);
stream.destroy();
} else if (extension === ".pdf") {
const dataBuffer = await fsp.readFile(filePath);
let extractedText = "";
// Filter out punctuation, make lowercase, and cull stop words
const rawWords = content
.replace(/[^\w\s]/g, '')
.toLowerCase()
.split(/\s+/)
.filter(word => word.length > 1 && !STOP_WORDS.has(word));
Comment on lines +103 to +105
// Save newly parsed data back to .analytics_cache.json
await fsp.writeFile(cachePath, JSON.stringify(cache, null, 2));

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

src/ingestion/file-ingestion.js:85

  • When all files are served from cache (numWorkers === 0), the function returns before persisting the updated cache to disk. That means stale-entry eviction (and any other cache updates) are lost and .analytics_cache.json can keep growing with removed files.
    const maxCores = options.workers || Math.max(1, os.cpus().length - 1);
    const numWorkers = Math.min(pathsToProcess.length, maxCores);
    
    if (numWorkers === 0) {
        return { sourceDirectory, files };
    }

Comment thread src/analytics/predictive.js Outdated
Comment on lines +51 to +56
// 🚨 FIX: Extract historical dates first, fallback to OS modification if none exist
const documentDate = (file.dates && file.dates.length > 0) ? file.dates[0] : file.modifiedAt;
if (!documentDate) continue;

const key = monthKey(documentDate);
if (!timeline[key]) timeline[key] = { totalWords: 0, locations: {} };
Comment on lines +5 to +7
const stringValue = String(value ?? '');
const sanitizedValue = /^[=+\-@]/.test(stringValue) ? `'${stringValue}` : stringValue;
return `"${sanitizedValue.replace(/"/g, '""')}"`;
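
The PR summary notes that the sanitization was later widened to catch formulas hidden behind leading whitespace. A sketch of that variant is below; the function names `escapeCsvValue` and `toCsvRow` are hypothetical, and the only change from the three lines above is the `\s*` in the pattern.

```javascript
// Hypothetical helper: quotes every cell and neutralizes values a spreadsheet
// could interpret as a formula, even when whitespace precedes the trigger character.
function escapeCsvValue(value) {
  const stringValue = String(value ?? '');
  const looksLikeFormula = /^\s*[=+\-@]/.test(stringValue);
  const sanitizedValue = looksLikeFormula ? `'${stringValue}` : stringValue;
  // Double any embedded quotes, then wrap the whole cell in quotes.
  return `"${sanitizedValue.replace(/"/g, '""')}"`;
}

// Hypothetical helper: joins escaped cells into one CSV row.
function toCsvRow(cells) {
  return cells.map((cell) => escapeCsvValue(cell)).join(',');
}

console.log(toCsvRow(['Descriptive', 'UniqueLocations', 'Phoenix, Roswell']));
console.log(toCsvRow(['Predictive', 'LikelyNextHotspot', '  =HYPERLINK("http://example.test")']));
```

One side effect worth knowing: legitimate values that begin with `-` or `+`, such as negative numbers, also receive the leading apostrophe and will be read as text by spreadsheet applications.
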
Comment thread src/ingestion/worker.js Outdated
Comment thread .husky/pre-commit
Comment on lines +1 to +6
#!/usr/bin/env sh

npm run docs:generate || exit 1
git add docs/ || exit 1
npm test || exit 1
npm run docs:check || exit 1
Copilot AI changed the title from "v1.2.0 - Core Engine Optimization & Semantic Analytics | (Sprints 1-3) + Review Feedback Fixes" to "v1.2.0 - Core Engine Optimization & Semantic Analytics | (Sprints 1-3) + Review Feedback/Regression Fixes" Jun 14, 2026
@aj1126 aj1126 merged commit b7ffeba into main Jun 14, 2026
3 checks passed

Labels

bug, documentation, enhancement

Development

Successfully merging this pull request may close these issues.

  • Top keywords are pulling partial words
  • Dates referenced includes invalid content
  • Top Keywords are partial letter pairings from common words

3 participants