This project implements privacy-preserving techniques for time-series data, specifically focusing on the (k, P)-anonymity model proposed by Shou et al. in the paper:
"Supporting Pattern-Preserving Anonymization for Time-Series Data"
IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, 2013.
The src directory is organized into four modules:
- Analyzer: Scripts for performing analysis, including deanonymization attacks, query utility testing, and statistical calculations.
- Util: Primitive functions for time-series handling, normalization, and verification tests.
- KAPRA: The implementation of the bottom-up KAPRA algorithm.
- Naive: The implementation of the top-down Naive algorithm.
The primary tool for anonymizing a dataset is kp-anonymity.py, which provides a command-line interface (CLI). Usage instructions can be accessed via:
python3 kp-anonymity.py {kapra/naive} -h
Note: kapra and naive currently share the same instructions.
While kp-anonymity.py serves as the main entry point for users, several auxiliary scripts facilitated the analysis:
- analysis_scalability_utility.py: Generates data for scalability (execution time) and utility metrics, including Value Loss (VL), Pattern Loss (PL), and SAX level.
- Several scripts throughout the repository include a main block to perform specific experiments and data evaluations.
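For readers unfamiliar with the SAX level, the sketch below shows how a series can be turned into a SAX (Symbolic Aggregate approXimation) word. It assumes that "level" means the alphabet size, and the function names (`z_normalize`, `paa`, `sax_word`) are illustrative only; they are not taken from the Util module, whose actual implementation may differ.

```python
from statistics import NormalDist, mean, pstdev


def z_normalize(series):
    """Rescale a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series has no shape, so map it to all zeros.
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each chunk."""
    n = len(series)
    if not 1 <= segments <= n:
        raise ValueError("segments must be between 1 and len(series)")
    chunks = []
    for i in range(segments):
        start = i * n // segments
        end = (i + 1) * n // segments
        chunks.append(mean(series[start:end]))
    return chunks


def sax_word(series, segments, level):
    """Map a series to a SAX word over an alphabet of `level` symbols."""
    if not 2 <= level <= 26:
        raise ValueError("level must be between 2 and 26")
    # Breakpoints split the standard normal distribution into
    # `level` equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / level) for i in range(1, level)]
    symbols = []
    for value in paa(z_normalize(series), segments):
        # The symbol index is the number of breakpoints below the value.
        index = sum(value > b for b in breakpoints)
        symbols.append(chr(ord("a") + index))
    return "".join(symbols)


print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], segments=4, level=4))  # abcd
```

A higher level keeps more of the shape of each series but makes it harder for several series to share the same pattern, which is the trade-off the utility metrics are meant to expose.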
The algorithms are evaluated using two distinct data sources: a synthetic generator for establishing performance baselines and a real-world medical database to assess practical utility.
Synthetic datasets are generated using the dataset_generation notebook, primarily employing a random walk model. This model produces highly diverse time series with significant fluctuations, mirroring scenarios such as financial volatility or physiological measurements. While useful for controlled scalability tests, these datasets often lack the complex correlations found in natural signals.
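As a rough illustration of the random walk model, the sketch below builds each series by adding an independent Gaussian step to the previous value. The Gaussian step distribution, the parameter names, and the functions `random_walk` and `generate_dataset` are assumptions made for this example; the dataset_generation notebook may use different choices.

```python
import random


def random_walk(length, start=0.0, step_scale=1.0, rng=None):
    """Return one series where each value is the previous one plus noise."""
    if length < 1:
        raise ValueError("length must be at least 1")
    rng = rng or random.Random()
    values = [start]
    for _ in range(length - 1):
        values.append(values[-1] + rng.gauss(0.0, step_scale))
    return values


def generate_dataset(num_series, length, seed=0):
    """Generate `num_series` independent random walks of equal length."""
    # A single seeded generator keeps the whole dataset reproducible.
    rng = random.Random(seed)
    return [random_walk(length, rng=rng) for _ in range(num_series)]


dataset = generate_dataset(num_series=5, length=20, seed=42)
print(len(dataset), len(dataset[0]))  # 5 20
```

Because the walks are generated independently, the series share no common structure, which is consistent with the lack of natural correlations noted above.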
To evaluate high-dimensional data with natural similarities, the MIT-BIH Arrhythmia Dataset is used.
- Description: Approximately 109,000 ECG heartbeat signals.
- Quasi-Identifiers (QI): Amplitude measurements at consecutive timestamps.
- Sensitive Attribute (SA): Clinical classification (Normal, Supraventricular, Ventricular, Fusion, Unknown).
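The split between quasi-identifiers and the sensitive attribute can be sketched as a small record type. This example assumes a CSV layout in which each row holds the amplitude values followed by an integer class code; that layout, the `CLASS_LABELS` mapping, and the names `Record` and `parse_records` are hypothetical and should be checked against the actual data files used by the project.

```python
import csv
import io
from dataclasses import dataclass

# Assumed mapping from integer class codes to clinical labels.
CLASS_LABELS = {
    0: "Normal",
    1: "Supraventricular",
    2: "Ventricular",
    3: "Fusion",
    4: "Unknown",
}


@dataclass
class Record:
    qi: list  # amplitude at each consecutive timestamp (quasi-identifier)
    sa: str   # clinical classification (sensitive attribute)


def parse_records(text_stream):
    """Read rows of the form amplitude_1, ..., amplitude_n, class_code."""
    records = []
    for row in csv.reader(text_stream):
        if not row:
            continue  # skip blank lines
        *amplitudes, code = row
        records.append(
            Record(
                qi=[float(a) for a in amplitudes],
                # Codes may be stored as floats such as "2.0".
                sa=CLASS_LABELS[int(float(code))],
            )
        )
    return records


sample = io.StringIO("0.1,0.5,0.9,0\n0.2,0.4,0.3,2.0\n")
for record in parse_records(sample):
    print(record.sa, record.qi)
```

Keeping the two parts separate makes it explicit that anonymization generalizes only the amplitude values, while the classification is published unchanged.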
Detailed theoretical background is provided in the original paper (KP Anonymity.pdf), while the implementation analysis is documented in Analysis_23-01-2026.pdf.
The analysis report covers:
- Definitions of the utility and privacy metrics.
- Scalability and utility assessments using synthetic data.
- Performance comparisons between the Naive and KAPRA approaches.
- A case study on the MIT-BIH database covering query utility and statistical preservation.
- Privacy assessments, including homogeneity risk and deanonymization analysis.
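To give a sense of what a homogeneity check looks like, the sketch below treats a group as homogeneous when all of its records share the same sensitive value, so that membership in the group alone reveals the classification. This is one plausible definition chosen for illustration; the report's exact metric may differ, and the function names are hypothetical.

```python
from collections import Counter


def homogeneity_risk(groups):
    """Fraction of groups whose records all share one sensitive value.

    `groups` is a list of groups, each given as a list of SA labels.
    """
    if not groups:
        return 0.0
    homogeneous = sum(1 for labels in groups if len(set(labels)) == 1)
    return homogeneous / len(groups)


def dominant_share(labels):
    """Share of the most frequent sensitive value inside one group."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


groups = [
    ["Normal", "Normal", "Normal"],
    ["Normal", "Ventricular", "Fusion"],
    ["Unknown", "Unknown"],
    ["Normal", "Normal", "Fusion", "Normal"],
]
print(homogeneity_risk(groups))   # 0.5
print(dominant_share(groups[3]))  # 0.75
```

The per-group dominant share is a softer indicator: a group can fall short of full homogeneity and still let an attacker guess the sensitive value with high confidence.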
L. Shou, X. Shang, K. Chen, G. Chen and C. Zhang, "Supporting Pattern-Preserving Anonymization for Time-Series Data," in IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 877-892, April 2013, doi: 10.1109/TKDE.2011.249.
@ARTICLE{6095556,
author={Shou, Lidan and Shang, Xuan and Chen, Ke and Chen, Gang and Zhang, Chao},
journal={IEEE Transactions on Knowledge and Data Engineering},
title={Supporting Pattern-Preserving Anonymization for Time-Series Data},
year={2013},
volume={25},
number={4},
pages={877-892},
keywords={Couplings;Databases;Publishing;Pattern matching;Data models;Data privacy;Correlation;Privacy;anonymity;pattern;time series},
doi={10.1109/TKDE.2011.249}}