Commit 0b8264e

Merge pull request #14 from statsim/refactor/core-separation
Refactor/core separation
2 parents 86787f8 + 3ab9ab4 commit 0b8264e

26 files changed: 1963 additions, 2740 deletions

.github/workflows/deploy.yml

Lines changed: 7 additions & 1 deletion
@@ -27,8 +27,14 @@ jobs:
           cache: npm
       - run: npm ci
       - run: mkdir -p dist && npm run build
+      - run: |
+          mkdir -p _site
+          cp index.html _site/
+          cp -r css fonts dist _site/
+          if [ -f CNAME ]; then cp CNAME _site/; fi
+          touch _site/.nojekyll
       - uses: actions/upload-pages-artifact@v3
         with:
-          path: .
+          path: _site
       - id: deployment
         uses: actions/deploy-pages@v4

AGENTS.md

Lines changed: 24 additions & 5 deletions
@@ -14,27 +14,46 @@ Use **2-space indentation** and **semicolon-free** syntax. Use **single quotes**

 ## Quick start
 - Install: `npm install`
-- Tests: `npm test` (runs Playwright E2E against a local static server)
+- Tests: `npm test` (runs unit tests + Playwright E2E)
 - Lint/format: Not configured; use `npm run build-dev` as a fast sanity build
+- CLI:
+  - `node src/cli/index.js data.csv` (auto: summary on TTY, JSON when piped)
+  - `node src/cli/index.js data.csv --format json|summary|serve`

 ## Repo map
-- `src/main.js`: core browser app (streaming CSV parse, stats aggregation, output rendering)
+- `src/core/`: pure-JS profiling engine (no DOM deps)
+  - `index.js`: `profileStream(readable, opts)` — main API
+  - `constants.js`: shared constants (missing markers, thresholds, stats conventions)
+  - `classify.js`: `classifyValue()`, `getVariableType()` — pure functions
+  - `columns.js`: `initColumns()`, `updateColumns()` — online-stats wrappers
+  - `result.js`: `finalizeResult()` — builds versioned ProfileResult (v1)
+- `src/render/index.js`: DOM/chart rendering (tui-chart), consumes ProfileResult
+- `src/worker/profile-worker.js`: Web Worker — runs core in background thread (file + URL jobs)
+- `src/main.js`: browser entry — DnD/file/url handlers, Worker dispatch, render
+- `src/cli/index.js`: CLI entry — `fs.createReadStream`/stdin → core → formatter (`json|summary|serve`)
+- `src/cli/progress.js`: CLI progress renderer (spinner/progress bar)
+- `src/cli/format-summary.js`: ANSI terminal summary formatter
+- `src/cli/serve.js`: local report server for `--format serve` (`/api/result` + browser open)
 - `index.html`: UI shell and app entrypoint
 - `css/`: app styles and vendor chart CSS
 - `dist/bundle.js`: built browser bundle (generated)
+- `dist/worker-bundle.js`: built worker bundle (generated)
 - `fonts/`: local Roboto font assets
+- `tests/unit/`: tape unit tests for core modules
 - `tests/e2e/`: Playwright browser-level regression tests
 - `tests/support/`: local static server used by E2E tests
-- `docs/architecture.md`: not present in this repo

 ## Definition of done
 - Run: `npm run build` (or `npm run build-dev` during iteration)
-- Add/adjust tests for: browser upload flow, streaming parse behavior, missing-value classification, numeric stats gating, and top-value counting in `src/main.js`
-- If you can’t run tests: explain + add a minimal verification note
+- Add/adjust unit tests for core logic (classify, columns, result) in `tests/unit/`
+- Add/adjust E2E tests for browser upload flow + URL flow in `tests/e2e/`
+- If you can't run tests: explain + add a minimal verification note

 ## Constraints
 - Don’t add new production dependencies without asking.
 - No DB migrations exist in this repo; ask before introducing any persistence layer or migration tooling.
+- Update the `README.md`, `CHANGELOG.md` and `index.html` with any user-facing changes or new features.
+- Commit messages should be clear and descriptive, following the format: `feature|fix|test|docs: short description` (e.g., `feature: add new column type classification`).

 ## Conventions
 - Formatting: no formatter is configured; preserve existing style (2-space indentation, semicolon-free, single quotes)
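
Note on the CLI quick-start entry in AGENTS.md ("auto: summary on TTY, JSON when piped"): the diff for `src/cli/index.js` is not shown on this page, so the following is only a minimal sketch of how such a default could be resolved. The helper name `resolveFormat`, the `'auto'` keyword, and the error message are assumptions made for illustration, not the project's code. `process.stdout.isTTY` is the standard Node.js property for detecting an interactive terminal.

```javascript
// Hypothetical helper: NOT taken from src/cli/index.js.
// The three format names come from the AGENTS.md text above.
const FORMATS = ['json', 'summary', 'serve']

function resolveFormat (requested, isTTY) {
  // No explicit --format: human-readable summary on a terminal,
  // machine-readable JSON when stdout is piped or redirected
  if (requested === undefined || requested === 'auto') {
    return isTTY ? 'summary' : 'json'
  }
  if (!FORMATS.includes(requested)) {
    throw new Error('Unknown format: ' + requested)
  }
  return requested
}
```

A caller would pass the terminal state explicitly, for example `resolveFormat(undefined, Boolean(process.stdout.isTTY))`. Keeping the TTY flag as a parameter keeps the function pure, which fits the unit-test requirement in the "Definition of done" section.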

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Anton Zemlyansky
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 34 additions & 6 deletions
@@ -1,8 +1,36 @@
-## Profile
+## StatSim Profile

-Generate data profiles in the browser. Data is processed locally as a stream. In theory you can process really big files because online algorithms are awesome!
+Generate data profiles in the browser or from the command line. Data is processed locally as a stream using online algorithms, so you can handle very large files without loading them into memory.

-* CSV
-* Count missing values
-* Statistics (min, max, mean, std)
-* Top N
+### Features
+
+* CSV and TSV support
+* Streaming processing via Web Workers (UI stays responsive)
+* Missing value detection (empty, NA, NULL, NaN, etc.)
+* Descriptive statistics (min, max, mean, variance, std)
+* Histograms and top-N value counts
+* Variable type classification (Number, String, Boolean, Categorical, Mixed)
+* Load files from URL via query param: `?file=https://example.com/data.csv`
+* CLI output modes: `summary`, `json`, and `serve`
+* npm package: `@statsim/profile`
+
+### Usage
+
+**Browser:** Open [statsim.com/profile](https://statsim.com/profile/), drag a CSV file, paste a URL, or open a prefilled link such as `?file=https://example.com/data.csv`.
+
+**CLI:**
+```
+npx @statsim/profile data.csv
+sprofile data.csv                    # summary (TTY)
+sprofile data.csv --format json      # raw JSON
+sprofile data.csv --format serve     # open local browser report
+cat data.csv | sprofile --stdin      # stdin mode
+```
+
+### Development
+
+```
+npm install
+npm run build-dev   # build browser bundles
+npm test            # run unit + E2E tests
+```
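
Note on the README claim that data is processed "as a stream using online algorithms": the listed statistics (min, max, mean, variance, std) can all be maintained in constant memory per column. The sketch below uses Welford's online algorithm as one well-known way to do this. It is an illustration only: `createRunningStats` is a hypothetical name, the choice of sample variance (dividing by n - 1) is an assumption, and none of this is the actual online-stats wrapper code in `src/core/columns.js`, which is not shown on this page.

```javascript
// Sketch of Welford's online algorithm. Names are illustrative and
// are NOT the API of the wrappers in src/core/columns.js.
function createRunningStats () {
  let n = 0
  let mean = 0
  let m2 = 0 // running sum of squared deviations from the mean
  let min = Infinity
  let max = -Infinity

  return {
    // Fold one numeric value into the running state in O(1) time
    push (x) {
      n += 1
      const delta = x - mean
      mean += delta / n
      m2 += delta * (x - mean)
      if (x < min) min = x
      if (x > max) max = x
    },
    // Assumption: sample variance (n - 1 denominator)
    summary () {
      const variance = n > 1 ? m2 / (n - 1) : NaN
      return { n, min, max, mean, variance, std: Math.sqrt(variance) }
    }
  }
}
```

Because each value is folded in and then discarded, memory use depends on the number of columns rather than the number of rows, which is what makes profiling files larger than available memory feasible.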

css/main.css

Lines changed: 11 additions & 0 deletions
@@ -62,6 +62,17 @@ dd {
   height: 3px;
 }

+#progress.indeterminate {
+  width: 100% !important;
+  animation: indeterminate 1.5s infinite ease-in-out;
+}
+
+@keyframes indeterminate {
+  0% { opacity: 0.3; }
+  50% { opacity: 1; }
+  100% { opacity: 0.3; }
+}
+
 #output h2 {
   font-size: 42px;
   font-weight: 700;

index.html

Lines changed: 7 additions & 1 deletion
@@ -48,6 +48,12 @@
     <div class="drag-text">
       <h4>Drag & drop a CSV file</h4>
       <p>Or <label for="input" style="color:#039be5; cursor: pointer; font-size: 14px;">choose a file</label></p>
+      <div class="url-input" style="margin-top: 16px;">
+        <input id="url-input" type="text" placeholder="Or paste a CSV URL (https://...)" style="width: 400px; max-width: 80%; padding: 4px 8px; font-size: 14px; border: 1px solid #ccc; border-radius: 3px;">
+        <button id="url-load" style="padding: 4px 12px; font-size: 14px; cursor: pointer; margin-left: 4px;">Load</button>
+      </div>
+      <p style="font-size: 13px; color: #666; margin-top: 8px;">Tip: you can also open <code>?file=https://example.com/data.csv</code></p>
+      <p id="url-error" style="color: #e53935; font-size: 13px; display: none;"></p>
     </div>
   </div>
 </div>
@@ -60,7 +66,7 @@ <h4>Drag & drop a CSV file</h4>
 <h1>Data profiling online</h1>
 <h2>Use this free and open-source web app to profile data and generate visual summaries of your CSV datasets</h2>
 <p>
-In many industries, understanding data is a vital skill. However, most people struggle to recognize patterns and extract insights from raw tabular datasets because we are not computers. That's why data visualization and profiling are invaluable tools, frequently utilized to transform raw numbers into comprehensible elements like charts, trends, and statistics. Data profiling enables the creation of overviews of tabular files, offering detailed information and descriptive statistics for each variable contained in a dataset. <b>StatSim Profile</b> is a browser-based data profiling tool that is free and open-source. It processes files locally without uploading them to a web server and can handle large datasets, even those in gigabytes.
+In many industries, understanding data is a vital skill. However, most people struggle to recognize patterns and extract insights from raw tabular datasets because we are not computers. That's why data visualization and profiling are invaluable tools, frequently utilized to transform raw numbers into comprehensible elements like charts, trends, and statistics. Data profiling enables the creation of overviews of tabular files, offering detailed information and descriptive statistics for each variable contained in a dataset. <b>StatSim Profile</b> is a browser-based data profiling tool that is free and open-source. It processes files locally without uploading them to a web server and can handle large datasets, even those in gigabytes. Processing runs in a Web Worker to keep the UI responsive. You can also load CSV files directly from a URL or use the <a href="https://github.com/statsim/profile">command-line tool</a> with <code>summary</code>, <code>json</code>, and <code>serve</code> output modes.
 </p>
 </div>
 </div>
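
Note on the `?file=https://example.com/data.csv` tip added to `index.html`: the handler that reads this parameter lives in `src/main.js`, whose diff is not shown on this page. As a rough sketch of what reading and validating such a parameter can look like, the helper below uses the standard `URLSearchParams` and `URL` APIs. The function name `getFileUrl` and the decision to accept only `http:` and `https:` URLs are assumptions for illustration, not the project's behavior.

```javascript
// Hypothetical helper: the real handler in src/main.js may differ.
function getFileUrl (search) {
  const value = new URLSearchParams(search).get('file')
  if (!value) return null
  try {
    const url = new URL(value) // throws on relative or malformed input
    // Assumption: only remote http(s) files are accepted
    const ok = url.protocol === 'http:' || url.protocol === 'https:'
    return ok ? url.href : null
  } catch (err) {
    return null
  }
}
```

In the browser this would be called as `getFileUrl(window.location.search)`. If the target URL carries its own query string, it has to be percent-encoded (for example with `encodeURIComponent`) before being placed in `?file=`, otherwise its `&` separators are read as part of the page's own query.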
