Metadata-Version: 2.1
Name: sdgym
Version: 0.3.1
Summary: A framework to benchmark the performance of synthetic data generators for non-temporal tabular data
Home-page: https://github.com/sdv-dev/SDGym
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Description: <p align="left">
          <a href="https://dai.lids.mit.edu">
            <img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
          </a>
          <i>An Open Source Project from the <a href="https://dai.lids.mit.edu">Data to AI Lab, at MIT</a></i>
        </p>
        
        [![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
        [![Travis](https://travis-ci.org/sdv-dev/SDGym.svg?branch=master)](https://travis-ci.org/sdv-dev/SDGym)
        [![PyPi Shield](https://img.shields.io/pypi/v/sdgym.svg)](https://pypi.python.org/pypi/sdgym)
        [![Downloads](https://pepy.tech/badge/sdgym)](https://pepy.tech/project/sdgym)
        
        <img align="center" width=30% src="docs/resources/header.png">
        
        Benchmarking framework for Synthetic Data Generators
        
        * Website: https://sdv.dev
        * Documentation: https://sdv.dev/SDV
        * Repository: https://github.com/sdv-dev/SDGym
        * License: [MIT](https://github.com/sdv-dev/SDGym/blob/master/LICENSE)
        * Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
        
        # Overview
        
        Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data
        generators based on [SDV](https://github.com/sdv-dev/SDV) and [SDMetrics](
        https://github.com/sdv-dev/SDMetrics).
        
        SDGym is a part of the [Synthetic Data Vault](https://sdv.dev/) project.
        
        ## What is a Synthetic Data Generator?
        
        A **Synthetic Data Generator** is a Python function (or method) that takes as input some
        data, which we call the *real* data, learns a model from it, and outputs new *synthetic* data that
        has the same structure and similar mathematical properties as the *real* one.
        
        Please refer to the [synthesizers documentation](SYNTHESIZERS.md) for instructions about how to
        implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how
        to use the ones already included in **SDGym** and see how to run them.
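        
        As a rough illustration of that interface, the sketch below defines a trivial baseline that
        resamples the rows of each real table with replacement. It is not one of the SDGym synthesizers:
        the name `bootstrap_synthesizer` is hypothetical, and the sketch assumes that `real_data` is a
        dictionary of `pandas.DataFrame` objects keyed by table name and that `metadata.get_tables()`
        returns those table names, as in the usage example later in this document.
        
        ```python
        def bootstrap_synthesizer(real_data, metadata):
            """Hypothetical baseline: resample each real table with replacement."""
            synthetic_data = {}
            for table_name in metadata.get_tables():
                table = real_data[table_name]
                # Assumption: `table` is a pandas.DataFrame. `sample` draws rows at
                # random, and replace=True allows the same row to appear twice.
                sampled = table.sample(n=len(table), replace=True)
                synthetic_data[table_name] = sampled.reset_index(drop=True)
        
            # Same keys as the input, one synthetic table per real table.
            return synthetic_data
        ```
        
        No model is learned here; the sketch only shows the expected shape of the input and the output.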
        
        ## Benchmark datasets
        
        **SDGym** evaluates the performance of **Synthetic Data Generators** using *single table*,
        *multi table* and *timeseries* datasets stored as CSV files alongside an [SDV Metadata](
        https://sdv.dev/SDV/user_guides/relational/relational_metadata.html) JSON file.
        
        Further details about the list of available datasets and how to add your own datasets to
        the collection can be found in the [datasets documentation](DATASETS.md).
        
        # Install
        
        **SDGym** can be installed using the following commands:
        
        **Using `pip`:**
        
        ```bash
        pip install sdgym
        ```
        
        **Using `conda`:**
        
        ```bash
        conda install -c sdv-dev -c conda-forge sdgym
        ```
        
        For more installation options, please visit the [SDGym installation guide](INSTALL.md).
        
        # Usage
        
        ## Benchmarking your own Synthesizer
        
        SDGym evaluates **Synthetic Data Generators**, which are Python functions (or classes) that take
        as input some data, which we call the *real* data, learn a model from it, and output new
        *synthetic* data that has the same structure and similar mathematical properties as the *real* one.
        
        As an example, let us define a synthesizer function that applies the
        [GaussianCopula model from SDV](https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html)
        with the `gaussian` distribution.
        
        ```python3
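        from sdv.tabular import GaussianCopula
        
        
        def gaussian_copula(real_data, metadata):
            # Model every column with a gaussian distribution
            gc = GaussianCopula(default_distribution='gaussian')
            # Single-table example: work on the first table in the metadata
            table_name = metadata.get_tables()[0]
            gc.fit(real_data[table_name])
            # Return the synthetic table keyed by its table name
            return {table_name: gc.sample()}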
        ```
        
        |:information_source: You can learn how to create your own synthesizer function [here](SYNTHESIZERS.md).|
        |:-|
        
        We can now try to evaluate this function on the `asia` and `alarm` datasets:
        
        ```python3
        import sdgym
        
        scores = sdgym.run(synthesizers=gaussian_copula, datasets=['asia', 'alarm'])
        ```
        
        |:information_source: You can learn about the different arguments for the `sdgym.run` function [here](BENCHMARK.md).|
        |:-|
        
        The output of the `sdgym.run` function will be a `pd.DataFrame` containing the results obtained
        by your synthesizer on each dataset.
        
        | synthesizer     | dataset | modality     | metric          |      score | metric_time | model_time |
        |-----------------|---------|--------------|-----------------|------------|-------------|------------|
        | gaussian_copula | asia    | single-table | BNLogLikelihood |  -2.842690 |    2.762427 |   0.752364 |
        | gaussian_copula | alarm   | single-table | BNLogLikelihood | -20.223178 |    7.009401 |   3.173832 |
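        
        Since the result is a regular `pd.DataFrame`, it can be summarized with standard pandas
        operations. The sketch below is only an illustration: it rebuilds the two example rows shown
        above by hand instead of calling `sdgym.run`, and the variable names `comparison` and `timing`
        are hypothetical.
        
        ```python
        import pandas as pd
        
        # Rows copied from the example results table above.
        scores = pd.DataFrame({
            'synthesizer': ['gaussian_copula', 'gaussian_copula'],
            'dataset': ['asia', 'alarm'],
            'modality': ['single-table', 'single-table'],
            'metric': ['BNLogLikelihood', 'BNLogLikelihood'],
            'score': [-2.842690, -20.223178],
            'metric_time': [2.762427, 7.009401],
            'model_time': [0.752364, 3.173832],
        })
        
        # One row per dataset and one column per synthesizer, so that several
        # synthesizers can be compared side by side.
        comparison = scores.pivot_table(index='dataset', columns='synthesizer', values='score')
        
        # Total model execution and metric computation time per synthesizer.
        timing = scores.groupby('synthesizer')[['model_time', 'metric_time']].sum()
        
        print(comparison)
        print(timing)
        ```
        
        With more than one synthesizer in the results, `comparison` gets one column per synthesizer.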
        
        ## Benchmarking the SDGym Synthesizers
        
        If you want to run the SDGym benchmark on the SDGym Synthesizers you can directly pass the
        corresponding class, or a list of classes, to the `sdgym.run` function.
        
        For example, if you want to run the complete benchmark suite to evaluate all the existing
        synthesizers you can run (:warning: this will take a lot of time to run!):
        
        ```python
        import sdgym
        from sdgym.synthesizers import (
            CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
            MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
            Uniform, VEEGAN)
        
        all_synthesizers = [
            CLBN,
            CTGAN,
            CopulaGAN,
            HMA1,
            Identity,
            Independent,
            MedGAN,
            PAR,
            PrivBN,
            SDV,
            TVAE,
            TableGAN,
            Uniform,
            VEEGAN,
        ]
        scores = sdgym.run(synthesizers=all_synthesizers)
        ```
        
        For further details about all the arguments and possibilities that the `sdgym.run` function offers,
        please refer to the [benchmark documentation](BENCHMARK.md).
        
        # Additional References
        
        * Datasets used in SDGym are detailed [here](DATASETS.md).
        * How to write a synthesizer is detailed [here](SYNTHESIZERS.md).
        * How to use the benchmark function is detailed [here](BENCHMARK.md).
        * Detailed leaderboard results for all the releases are available [here](
        https://docs.google.com/spreadsheets/d/1iNJDVG_tIobcsGUG5Gn4iLa565vVhz2U/edit).
        
        # The Synthetic Data Vault
        
        <p>
          <a href="https://sdv.dev">
            <img width=30% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/SDV-Logo-Color-Tagline.png?raw=true">
          </a>
          <p><i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a></i></p>
        </p>
        
        * Website: https://sdv.dev
        * Documentation: https://sdv.dev/SDV
        
        
        # History
        
        ## v0.3.1 - 2021-05-20
        
        This release adds new features to store results and cache contents into an S3 bucket
        as well as a script to collect results from a cache dir and compile a single results
        CSV file.
        
        ### Issues closed
        
        * Collect cached results from s3 bucket - [Issue #85](https://github.com/sdv-dev/SDGym/issues/85) by @katxiao
        * Store cache contents into an S3 bucket - [Issue #81](https://github.com/sdv-dev/SDGym/issues/81) by @katxiao
        * Store SDGym results into an S3 bucket - [Issue #80](https://github.com/sdv-dev/SDGym/issues/80) by @katxiao
        * Add a way to collect cached results - [Issue #79](https://github.com/sdv-dev/SDGym/issues/79) by @katxiao
        * Allow reading datasets from private s3 bucket - [Issue #74](https://github.com/sdv-dev/SDGym/issues/74) by @katxiao
        * Typos in the sdgym.run function docstring documentation - [Issue #69](https://github.com/sdv-dev/SDGym/issues/69) by @sbrugman
        
        ## v0.3.0 - 2021-01-27
        
        Major rework of the SDGym functionality to support a collection of new features:
        
        * Add relational and timeseries model benchmarking.
        * Use SDMetrics for model scoring.
        * Update datasets format to match SDV metadata based storage format.
        * Centralize default datasets collection in the `sdv-datasets` S3 bucket.
        * Add options to download and use datasets from different S3 buckets.
        * Rename synthesizers to baselines and adapt to the new metadata format.
        * Add model execution and metric computation time logging.
        * Add optional synthetic data and error traceback caching.
        
        ## v0.2.2 - 2020-10-17
        
        This version adds a rework of the benchmark function and a few new synthesizers.
        
        ### New Features
        
        * New CLI with `run`, `make-leaderboard` and `make-summary` commands
        * Parallel execution via Dask or Multiprocessing
        * Download datasets without executing the benchmark
        * Support for Python 3.6 to 3.8
        
        ### New Synthesizers
        
        * `sdv.tabular.CTGAN`
        * `sdv.tabular.CopulaGAN`
        * `sdv.tabular.GaussianCopulaOneHot`
        * `sdv.tabular.GaussianCopulaCategorical`
        * `sdv.tabular.GaussianCopulaCategoricalFuzzy`
        
        ## v0.2.1 - 2020-05-12
        
        New updated leaderboard and minor improvements.
        
        ### New Features
        
        * Add parameters for PrivBNSynthesizer - [Issue #37](https://github.com/sdv-dev/SDGym/issues/37) by @csala
        
        ## v0.2.0 - 2020-04-10
        
        New Benchmark API and lots of improved documentation.
        
        ### New Features
        
        * The benchmark function now returns a complete leaderboard instead of only one score
        * Class Synthesizers can be directly passed to the benchmark function
        
        ### Bug Fixes
        
        * One hot encoding errors in the Independent, VEEGAN and MedGAN Synthesizers.
        * Proper usage of the `eval` mode during sampling.
        * Fix improperly configured datasets.
        
        ## v0.1.0 - 2019-08-07
        
        First release to PyPI.
        
Keywords: machine learning synthetic data generation benchmark generative models
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6,<3.9
Description-Content-Type: text/markdown
Provides-Extra: dev
Provides-Extra: test
