Commit a129cd3

Author: Ivan Begtin

Rewrote uniq, frequency and select commands. Updated stats command to support CSV and JSON lines. Added conversion of CSV and JSON lines to Parquet

Parent: dcdd9bf

16 files changed

Lines changed: 12304 additions & 266 deletions

.idea/datum.iml

Lines changed: 1 addition & 1 deletion

.idea/misc.xml

Lines changed: 0 additions & 1 deletion

HISTORY.rst

Lines changed: 5 additions & 1 deletion
@@ -3,11 +3,15 @@
 History
 =======
 
+1.0.10 (2022-01-29)
+-------------------
+* Added encoding and delimiter detection for commands: uniq, select, frequency and headers. Completely rewrote these functions. If options for encoding and delimiter set, they override detected. If not set, detected delimiter and encoding used.
+* Added support of .parquet files to convert to. It's done in a simpliest way using pandas "to_parquet" function.
+
 1.0.9 (2022-01-18)
 ------------------
 * Added support for CSV and BSON files for "stats" command
 
-
 1.0.8 (2021-07-14)
 ------------------
 * Replaced json with orjson for some operations. Keep looking on performance changes and going to replace or json lib calls to orjson
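
The 1.0.10 entry says that explicit encoding and delimiter options override the detected values, and that detection is used otherwise. The detection code itself is not part of the rendered diff, so the following is only a minimal sketch of that precedence rule. It assumes trial decoding for the encoding and the standard library's `csv.Sniffer` for the delimiter; the function names and the candidate lists are hypothetical and are not taken from undatum.

```python
import csv


def detect_encoding(raw, candidates=("utf-8", "cp1251", "latin-1")):
    """Return the first candidate encoding that decodes the bytes without error.

    Assumption: trial decoding stands in for whatever detector undatum uses.
    """
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so it is a safe last resort.
    return "latin-1"


def detect_delimiter(sample, candidates=",;\t|"):
    """Guess the CSV delimiter of a text sample with csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def resolve_options(raw, encoding=None, delimiter=None):
    """Explicit options win; anything left unset falls back to detection."""
    enc = encoding or detect_encoding(raw)
    delim = delimiter or detect_delimiter(raw.decode(enc))
    return enc, delim
```

Calling `resolve_options(raw)` relies entirely on detection, while `resolve_options(raw, delimiter=",")` keeps the detected encoding and forces the delimiter, which mirrors the "-d" and "-e" overrides described in the README changes.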

README.rst

Lines changed: 15 additions & 4 deletions
@@ -21,7 +21,7 @@ Main features
 
 * Common data operations against CSV, JSON lines and BSON files
 * Built-in data filtering
-* Conversion between CSV, JSONl, BSON, XML, XLS, XLSX file types
+* Conversion between CSV, JSONl, BSON, XML, XLS, XLSX, Parquet file types
 * Low memory footprint
 * Support for compressed datasets
 * Advanced statistics calculations
@@ -176,7 +176,8 @@ Commands
 
 Frequency command
 -----------------
-Field value frequency calculator. Returns frequency table for certain field
+Field value frequency calculator. Returns frequency table for certain field.
+This command autodetects delimiter and encoding of CSV files and encoding of JSON lines files by default. You may override it providng "-d" delimiter and "-e" encoding parameters
 
 Get frequencies of values for field *GovSystem* in the list of Russian federal government domains from `govdomains repository <https://github.com/infoculture/govdomains/tree/master/refined>`_
 
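
The frequency table described in this hunk can be illustrated with a short sketch. It is not undatum's implementation: it assumes the CSV text is already decoded and the delimiter is known, and the helper name `field_frequency` and the sample rows are invented for illustration only.

```python
import csv
import io
from collections import Counter


def field_frequency(text, field, delimiter=","):
    """Count how often each value of `field` occurs in CSV text.

    Returns (value, count) pairs, most frequent first.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    counts = Counter(row[field] for row in reader)
    return counts.most_common()


# Made-up sample rows; real GovSystem values are not shown in this diff.
sample = (
    "domain,GovSystem\n"
    "a.example,alpha\n"
    "b.example,beta\n"
    "c.example,alpha\n"
)
table = field_frequency(sample, "GovSystem")
```

Because `csv.DictReader` yields one row at a time, only the counter itself grows with the number of distinct values.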

@@ -192,6 +193,7 @@ Uniq command
 
 Returns all unique files of certain field(s). Accepts parameter *fields* with comma separated fields to gets it unique values.
 Provide single field name to get unique values of this field or provide list of fields to get combined unique values.
+This command autodetects delimiter and encoding of CSV files and encoding of JSON lines files by default. You may override it providng "-d" delimiter and "-e" encoding parameters
 
 
 Returns all unique values of field *regions* in selected JSONl file
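
As a rough illustration of what "combined unique values" means for one or several fields, the sketch below reads JSON lines one record at a time and keeps the first occurrence of each value combination. It is an assumption-laden stand-in rather than the rewritten uniq command: the helper name `unique_values` is hypothetical, the sample records are invented, and field values are assumed to be hashable scalars such as strings or numbers.

```python
import json


def unique_values(lines, fields):
    """Return unique combinations of `fields` in order of first appearance.

    `lines` is any iterable of JSON lines strings, for example an open file.
    """
    seen = set()
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        key = tuple(record.get(name) for name in fields)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Invented sample records for illustration.
records = [
    '{"status": "open", "regions": "north"}',
    '{"status": "open", "regions": "south"}',
    '{"status": "open", "regions": "north"}',
]
single = unique_values(records, ["regions"])
combined = unique_values(records, ["status", "regions"])
```

A single field yields one-element tuples, while a list of fields yields one tuple per distinct combination.
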
@@ -210,7 +212,7 @@ Returns all unique combinations of fields *status* and *regions* in selected JSO
 Convert command
 ---------------
 
-Converts data from one format to another.
+Converts data from one format to another. Supports most common data files
 Supports conversions:
 
 * XML to JSON lines
@@ -221,6 +223,8 @@ Supports conversions:
 * CSV to BSON
 * XLS to BSON
 * JSON lines to CSV
+* CSV to Parquet
+* JSON lines to Parquet
 
 Conversion between XML and JSON lines require flag *tagname* with name of tag which should be converted into single JSON record.
 
@@ -236,6 +240,12 @@ Converts JSON lines file roszdravvendors_final.jsonl to CSV file roszdravvendors
 
     $ undatum convert examples/roszdravvendors_final.jsonl examples/roszdravvendors_final.csv
 
+Converts CSV file feddomains.csv to Parquet file feddomains.parquet
+
+.. code-block:: bash
+
+    $ undatum convert examples/feddomains.csv examples/feddomains.parquet
+
 
 Validate command
 ----------------
@@ -260,6 +270,7 @@ Headers command
 ---------------
 Returns fieldnames of the file. Supports CSV, JSON, BSON file types.
 For CSV file it takes first line of the file and for JSON lines and BSON files it processes number of records provided as *limit* parameter with default value 10000.
+This command autodetects delimiter and encoding of CSV files and encoding of JSON lines files by default. You may override it providng "-d" delimiter and "-e" encoding parameters
 
 Returns headers of JSON lines file with top 10 000 records (default value)
 
@@ -403,4 +414,4 @@ Data types
 JSONl
 -----
 
-JSON lines is a replacement to CSV and JSON files, with JSON flexibility and ability to process data line by line, without loading everithing into memory.
+JSON lines is a replacement to CSV and JSON files, with JSON flexibility and ability to process data line by line, without loading everything into memory.

examples/budgetgovru-fbpgu.jsonl

Lines changed: 3424 additions & 0 deletions
Large diffs are not rendered by default.

examples/trudvac_final_s.jsonl

Lines changed: 1000 additions & 0 deletions
Large diffs are not rendered by default.
