SynMeter

1.0.0

(Logo generated by DALL·E 3)
Systematic Assessment of Tabular Data Synthesis Algorithms

A principled library for tuning, training, and evaluating tabular data synthesis.

What's New

[Nov 24, 2024] We added a new SOTA HP synthesizer, REaLTabFormer, to SynMeter! Try it out!

[Sep 18, 2024] We added a new SOTA HP synthesizer, TabSyn, to SynMeter! Try it out!

Why SynMeter:

- Easy to add new synthesizers, and to tune, train, and evaluate them seamlessly.
- Principled evaluation metrics for fidelity, privacy, and utility, covering both Differentially Private (DP) and Heuristically Private (HP) synthesizers.
- Several SOTA synthesizers, by type:
  - Statistical methods: MST, PrivSyn
  - GAN-based: CTGAN, PATE-GAN
  - VAE-based: TVAE
  - Diffusion-based: TabDDPM, TabSyn, TableDiffusion
  - LLM-based: GReaT, REaLTabFormer

Installation

Create a new conda environment and set it up:

conda create -n synmeter python==3.9
conda activate synmeter
pip install -r requirements.txt # install dependencies
pip install -e . # install SynMeter in editable mode

Set the base directory in ./lib/info/ROOT_DIR:

ROOT_DIR = root_to_synmeter

Usage

Datasets

SynMeter provides 12 standardized datasets with train/val/test splits for benchmarking, which can be downloaded from Google Drive.
You can also use an additional dataset by placing it in ./dataset.
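The source does not specify the file layout SynMeter expects for a custom dataset, so the following is only a minimal sketch of producing reproducible train/val/test splits before placing the data in ./dataset. The function name `split_rows` and the 80/10/10 ratio are assumptions made for illustration, not part of SynMeter.

```python
import random


def split_rows(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows reproducibly and cut them into train/val/test lists.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, no global state change
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Toy table with 100 rows; a real dataset would be read from a file instead.
rows = [{"age": 20 + i, "label": i % 2} for i in range(100)]
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))  # prints: 80 10 10
```

Fixing the seed keeps the three splits identical across runs, which matters when several synthesizers are compared on the same data.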

Tune evaluators for utility evaluations

Machine learning affinity requires machine learning models with tuned hyperparameters. SynMeter provides 8 commonly used machine learning models and their configurations in ./exp/evaluators.
You can tune these evaluators on your own dataset:

python scripts/tune_evaluator.py -d [dataset] -c [cuda]

Tune synthesizer

We provide a unified tuning objective, so any kind of synthesizer can be tuned with a single command:

python scripts/tune_synthesizer.py -d [dataset] -m [synthesizer] -s [seed] -c [cuda]

Train synthesizer

After tuning, a configuration should be recorded in /exp/dataset/synthesizer. SynMeter uses it to train and store the synthesizer:

python scripts/train_synthesizer.py -d [dataset] -m [synthesizer] -s [seed] -c [cuda]

Evaluate synthesizer

Assessing the fidelity of the synthetic data:

python scripts/eval_fidelity.py -d [dataset] -m [synthesizer] -s [seed] -t [target]

Assessing the privacy of the synthetic data:

python scripts/eval_privacy.py -d [dataset] -m [synthesizer] -s [seed]

Assessing the utility of the synthetic data:

python scripts/eval_utility.py -d [dataset] -m [synthesizer] -s [seed]

The evaluation results should be saved under the corresponding directory, /exp/dataset/synthesizer.
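The commands above can be chained for one dataset and synthesizer. The sketch below only assembles the command lines from the script names and flags shown in this section. The helper names `build_pipeline` and `run_pipeline` are hypothetical, and the argument values (`my_dataset`, `my_synthesizer`, `my_target`) are placeholders rather than names SynMeter is known to accept.

```python
import shlex
import subprocess


def build_pipeline(dataset, synthesizer, seed, cuda, target):
    """Return the argv lists for tune -> train -> evaluate, in order."""
    common = ["-d", str(dataset), "-m", str(synthesizer), "-s", str(seed)]
    return [
        ["python", "scripts/tune_synthesizer.py", *common, "-c", str(cuda)],
        ["python", "scripts/train_synthesizer.py", *common, "-c", str(cuda)],
        ["python", "scripts/eval_fidelity.py", *common, "-t", str(target)],
        ["python", "scripts/eval_privacy.py", *common],
        ["python", "scripts/eval_utility.py", *common],
    ]


def run_pipeline(commands, dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing stage


# Placeholder values; substitute real dataset and synthesizer names.
commands = build_pipeline("my_dataset", "my_synthesizer", seed=0,
                          cuda="0", target="my_target")
run_pipeline(commands, dry_run=True)
```

With `dry_run=False`, and when started from the repository root, each stage runs in sequence and `check=True` aborts the chain if a script exits with a non-zero status.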

Customize your own synthesizer

One advantage of SynMeter is that it makes adding a new synthesis algorithm easy. Three steps are needed:

1. Write the new synthesis code as a module in ./synthesizer/my_synthesizer.
2. Create a base configuration in ./exp/base_config.
3. Create a calling Python module in ./synthesizer that contains three functions: train, sample, and tune.

Then, you are free to tune, run, and test the new synthesizer!
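As a rough illustration of the three functions named in the last step, the sketch below implements a toy synthesizer that samples every column independently from its observed values. The function signatures, the `jitter` and `seed` configuration keys, and the scoring function are all assumptions made for this example; the interface SynMeter actually expects is not given here and may differ.

```python
import random


def train(rows, config):
    """Fit the toy model: remember every column's observed values."""
    columns = list(rows[0])
    return {
        "columns": columns,
        "values": {c: [r[c] for r in rows] for c in columns},
        "jitter": float(config.get("jitter", 0.0)),  # hypothetical hyperparameter
        "seed": int(config.get("seed", 0)),
    }


def sample(model, n):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(model["seed"])
    out = []
    for _ in range(n):
        row = {}
        for c in model["columns"]:
            value = rng.choice(model["values"][c])
            if isinstance(value, float):
                value += rng.gauss(0.0, model["jitter"])  # noise on float columns only
            row[c] = value
        out.append(row)
    return out


def tune(rows, candidate_configs, score_fn):
    """Return the candidate config whose samples score lowest under score_fn."""
    best_config, best_score = None, float("inf")
    for config in candidate_configs:
        synthetic = sample(train(rows, config), len(rows))
        score = score_fn(rows, synthetic)
        if score < best_score:
            best_config, best_score = config, score
    return best_config


def mean_gap(real, synth):
    """Toy score: absolute gap between the real and synthetic mean of 'age'."""
    real_mean = sum(r["age"] for r in real) / len(real)
    synth_mean = sum(s["age"] for s in synth) / len(synth)
    return abs(real_mean - synth_mean)


rows = [{"age": float(20 + i), "label": i % 2} for i in range(50)]
candidates = [{"jitter": 0.0}, {"jitter": 5.0}]
best = tune(rows, candidates, mean_gap)
print("selected config:", best)
```

Independent sampling ignores correlations between columns, so this toy model is only useful for checking that a new synthesizer is wired in correctly.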

Methods

Statistical Methods

Method	Type	Description	Reference
MST	DP	The method uses probabilistic graphical models to learn the dependence of low-dimensional marginals for data synthesis.	Paper, Code
PrivSyn	DP	A non-parametric DP synthesizer that iteratively updates the synthetic dataset so that it matches the target noisy marginals.	Paper, Code

Generative Adversarial Networks (GANs)

Method	Type	Description	Reference
CTGAN	HP	A conditional generative adversarial network that can handle tabular data.	Paper, Code
PATE-GAN	DP	The method uses the Private Aggregation of Teacher Ensembles (PATE) framework and applies it to GANs.	Paper, Code

Variational Autoencoders (VAE)

Method	Type	Description	Reference
TVAE	HP	A conditional VAE network which can handle tabular data.	Paper, Code

Diffusion Models

Method	Type	Description	Reference
TabDDPM	HP	Uses a diffusion model for tabular data synthesis.	Paper, Code
TabSyn	HP	Uses a latent diffusion model and a VAE for synthesis.	Paper, Code
TableDiffusion	DP	Generates tabular datasets under differential privacy.	Paper, Code

Large Language Model (LLM)-based Models

Method	Type	Description	Reference
GReaT	HP	Fine-tunes an LLM on a tabular dataset.	Paper, Code
REaLTabFormer	HP	Uses GPT-2 to learn the relational dependence of tabular data.	Paper, Code

⚡ Evaluation Metrics

Fidelity metrics: we use the Wasserstein distance as a principled fidelity metric, computed over all one-way and two-way marginals.
Privacy metrics: we devise the Membership Disclosure Score (MDS) to measure the membership privacy risks of both HP and DP synthesizers.
Utility metrics: we use machine learning affinity and query error to measure the utility of synthetic data.
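To make the fidelity metric concrete, the sketch below computes the 1-Wasserstein distance between the real and synthetic values of a single numerical column, that is, one one-way marginal. It is not SynMeter's implementation: categorical columns, two-way marginals, and any normalization are left out, and the names `wasserstein_1d` and `one_way_fidelity` are hypothetical.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two empirical 1-D distributions.

    Integrates |F_x(t) - F_y(t)| over t, where F is the empirical CDF.
    The two samples may have different sizes.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)  # share of xs <= left
        cdf_y = bisect_right(ys, left) / len(ys)  # share of ys <= left
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def one_way_fidelity(real_rows, synth_rows, columns):
    """Per-column distance between real and synthetic one-way marginals."""
    return {
        c: wasserstein_1d([r[c] for r in real_rows], [s[c] for s in synth_rows])
        for c in columns
    }


real = [{"age": a} for a in (20, 30, 40, 50)]
synth = [{"age": a} for a in (22, 32, 42, 52)]
print(one_way_fidelity(real, synth, ["age"]))  # prints: {'age': 2.0}
```

A distance of 0 means the two marginals coincide. In the toy example every synthetic value is shifted by 2, so the distance is 2.0.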

Please see our paper for details and usage.