UCHI-DB/comp-datasets

Dataset Collection

Much of our work is based on the analysis and evaluation of real-world datasets. We have built an automated framework that collects datasets, extracts their columns, and organizes and persists the records for further analysis. The framework accepts several input formats, including CSV, TXT, JSON, and MS Excel files. It also recognizes column data types automatically, which allows unattended data collection. For further analysis, the framework provides an API for extracting customized features from the columns.
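To illustrate what column data type recognition can look like, here is a minimal Python sketch. It is not the framework's actual implementation: the function name `infer_column_type`, the type labels, and the single ISO date format are all assumptions made for this example.

```python
from datetime import datetime


def _is_int(value: str) -> bool:
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_float(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value: str) -> bool:
    # Assumption: only the ISO format YYYY-MM-DD is recognized here.
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Return a type label for a column given as a list of strings.

    Hypothetical helper: a column gets the first (most specific) label
    whose check passes for every non-empty value.
    """
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "empty"
    checks = (("integer", _is_int), ("double", _is_float), ("date", _is_date))
    for label, check in checks:
        if all(check(v) for v in non_empty):
            return label
    return "string"
```

The checks run from most to least specific, so a column of whole numbers is labeled `integer` rather than `double`, and anything that matches no check falls back to `string`.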

Using this framework, we have collected around 20,000 columnar datasets from approximately 1,200 tables, with a total size of about 500 GB. These datasets all come from real-world data sources and cover a rich collection of data types (integer, date, price, address, etc.) with diverse data distributions.

Here's a list of data sources the datasets are collected from. Please contact us if you need to download the datasets.

Synthetic Dataset

PIDS uses four synthetic datasets in its experiments:

  • Phone Number Example: (123)456-7890
  • IPv6 Example: 1234:5678:90AB:CDEF:3323:5678:90AB:CDEF
  • Timestamp Example: 2014-06-01 23:14:29 4249.12345
  • Address Example: 123 Maple Street,Suite P,Chicago,Cook County,IL,60012

For Phone Number and IPv6 datasets, the value for each field is randomly sampled from all available values.
The Timestamp dataset is randomly sampled from a 100-year time span, 1970-01-01 to 2069-12-31.
The Address dataset is randomly sampled from a dictionary with 800,000 records.
This repository contains the source code for generating these datasets.
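The sampling described above can be sketched in Python as follows. This is an illustrative sketch, not the repository's generator code: the function names are hypothetical, each digit or hex character is assumed to be drawn uniformly, and the trailing numeric part shown in the Timestamp example is omitted. The Address dataset is left out because it depends on the dictionary of records.

```python
import random
from datetime import datetime, timedelta

HEX_DIGITS = "0123456789ABCDEF"


def random_phone(rng: random.Random) -> str:
    """Phone number in the form (123)456-7890, each digit drawn uniformly."""
    d = "".join(str(rng.randrange(10)) for _ in range(10))
    return f"({d[:3]}){d[3:6]}-{d[6:]}"


def random_ipv6(rng: random.Random) -> str:
    """Eight groups of four uppercase hex digits, separated by colons."""
    groups = ["".join(rng.choice(HEX_DIGITS) for _ in range(4)) for _ in range(8)]
    return ":".join(groups)


def random_timestamp(
    rng: random.Random,
    start: datetime = datetime(1970, 1, 1),
    end: datetime = datetime(2069, 12, 31, 23, 59, 59),
) -> str:
    """Timestamp drawn uniformly (at one-second resolution) from [start, end]."""
    span_seconds = int((end - start).total_seconds())
    moment = start + timedelta(seconds=rng.randint(0, span_seconds))
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are reproducible
    print(random_phone(rng))
    print(random_ipv6(rng))
    print(random_timestamp(rng))
```

Passing an explicit `random.Random` instance keeps generation reproducible for a given seed, which is convenient when the same synthetic data has to be regenerated for repeated experiments.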

About

Compression Datasets
