Dataset Collection

Much of our work is based on the analysis and evaluation of real-world datasets. We have created an automated framework that collects datasets, extracts their columns, and organizes and persists the records for further analysis. The framework accepts various input formats, including CSV, TXT, JSON, and MS Excel files. It also recognizes column data types, which allows unattended data collection. To support further analysis, the framework provides an API for extracting customized features from the columns.
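To illustrate what column data type recognition can look like, here is a minimal sketch in Python. It is not the framework's actual implementation: the function name `infer_column_type`, the type labels, and the single supported date format are all assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical patterns; the real framework may recognize many more types.
_INT_RE = re.compile(r"[+-]?\d+")
_FLOAT_RE = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def _is_date(text):
    # Assumes ISO-style dates (YYYY-MM-DD) only, for brevity.
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Guess one type label for a column given its raw string cells."""
    # Ignore missing and blank cells so they do not block a match.
    cells = [v.strip() for v in values if v is not None and v.strip()]
    if not cells:
        return "empty"
    # Try the most specific type first and fall back to "string".
    if all(_INT_RE.fullmatch(c) for c in cells):
        return "integer"
    if all(_FLOAT_RE.fullmatch(c) for c in cells):
        return "double"
    if all(_is_date(c) for c in cells):
        return "date"
    return "string"
```

A column is assigned a type only when every non-empty cell matches it, so a single stray value demotes the column to `string`. A production collector would likely tolerate a small fraction of mismatches instead.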

Using this framework, we have collected around 20,000 columnar datasets from approximately 1,200 tables, totaling 500 GB of data. These datasets all come from real-world data sources and cover a rich collection of data types (integer, date, price, address, etc.) with diverse data distributions.

The datasets were collected from the data sources listed below. Please contact us if you need to download the datasets.

Government Data Portals
- Open Government Data
- NYC
- Chicago
- Baltimore
- Dallas
- Washington D.C.
- LA
- Maryland
- Louisville
- Oakland
Server Logs
- Argonne Server Logs [Requesting permission to publish]
GIS data
- ArcGIS Open Data
- Esri Open Data
Social Networks
- Bikeshare
- Yelp
Machine Learning Datasets
- UCI Repo

Synthetic Dataset

PIDS uses four synthetic datasets in its experiments:

Phone Number Example: (123)456-7890
IPv6 Example: 1234:5678:90AB:CDEF:3323:5678:90AB:CDEF
Timestamp Example: 2014-06-01 23:14:29 4249.12345
Address Example: 123 Maple Street,Suite P,Chicago,Cook County,IL,60012

For Phone Number and IPv6 datasets, the value for each field is randomly sampled from all available values.
The Timestamp dataset is randomly sampled from a 100-year time span, 1970-01-01 to 2069-12-31.
The Address dataset is randomly sampled from a dictionary with 800,000 records.
This repository contains the source code for generating these datasets.
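As a rough illustration of the sampling described above, the following Python sketch generates one value of each kind. It is not the repository's generator: all function names are hypothetical, the timestamp omits the trailing numeric fields shown in the example, and the address dictionary is a small in-memory stand-in for the 800,000-record dictionary.

```python
import random
from datetime import datetime, timedelta

# Sampling window from the description: 1970-01-01 through 2069-12-31.
START = datetime(1970, 1, 1)
END = datetime(2070, 1, 1)  # exclusive upper bound


def random_phone(rng):
    # Each field is sampled independently from all of its possible values.
    return "({:03d}){:03d}-{:04d}".format(
        rng.randrange(1000), rng.randrange(1000), rng.randrange(10000)
    )


def random_ipv6(rng):
    # Eight 16-bit groups, each rendered as four uppercase hex digits.
    return ":".join("{:04X}".format(rng.randrange(0x10000)) for _ in range(8))


def random_timestamp(rng):
    # Pick a uniformly random second inside the sampling window.
    span_seconds = int((END - START).total_seconds())
    moment = START + timedelta(seconds=rng.randrange(span_seconds))
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def random_address(rng, dictionary):
    # Draw one record from the address dictionary (assumed to be a list).
    return rng.choice(dictionary)
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, for example `random_phone(random.Random(42))`.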
