Metadata-Version: 2.1
Name: tsfaker
Version: 0.14
Summary: Generate fake data conforming to a Table Schema
Home-page: https://gitlab.com/healthdatahub/tsfaker/
Author: Pierre-Alain Jachiet - DREES
Author-email: ld-lab-github@sante.gouv.fr
License: MPL-2.0
Description: ==================
        Table Schema Faker
        ==================
        
        Generate tabular fake data conforming to a `Table Schema <https://frictionlessdata.io/specs/table-schema/>`_.
        
        The `tsfaker <https://pypi.org/project/tsfaker/>`_ library is available on PyPI.
        
        This library was originally developed to generate a synthetic version of the `SNDS database <https://documentation-snds.health-data-hub.fr/introduction/>`_, which contains hundreds of tables; hence **tsfaker** deals efficiently with **foreign keys**.
        
        **Notes:**
        
        We aim to generate fake data conforming to a *schema*, not fake data with realistic statistical information (see the **Related work** section).
        
        This library is in beta and subject to frequent changes (see the **Release notes** section).
        
        
        Usage
        =====
        
        Installation
        ------------
        
        .. code:: bash
        
            $ pip3 install tsfaker
        
        
        Simple usage
        ------------
        Generate 3 rows of fake data from a single table schema file.
        
        .. code:: bash
        
            $ tsfaker https://gitlab.com/healthdatahub/tsfaker/raw/master/tests/schemas/implemented_types.json  --nrows 3 --pretty
              boolean         string            number      integer        date              datetime  year yearmonth
            0       1  haHoKysholbSI    9780230269.512  -7061309068  1914-10-03  1902-04-11T11:21:11Z  1939    196405
            1       0      rLugGhNek    990894536.8945   2529879443  2026-09-08  2015-11-27T16:21:54Z  1932    192909
            2       1         ipqVXm  -4371053960.8987   -529880373  1994-09-27  1937-01-12T18:40:15Z  2021    193303
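        
        For local experiments, a descriptor file can also be written by hand. The following is a minimal sketch, not taken from the tsfaker code base: it assumes the standard Table Schema layout (a ``fields`` array of objects with ``name`` and ``type``), and the field names and the ``my_schema.json`` file name are hypothetical.
        
        .. code:: python
        
            import json
        
            # Hypothetical descriptor: the field names are made up for illustration;
            # the types are among those shown in the example output above.
            descriptor = {
                "fields": [
                    {"name": "patient_id", "type": "integer"},
                    {"name": "is_active", "type": "boolean"},
                    {"name": "birth_date", "type": "date"},
                ]
            }
        
            # Write the descriptor to a local file (hypothetical name).
            with open("my_schema.json", "w") as f:
                json.dump(descriptor, f, indent=2)
        
        Assuming local paths are accepted in the same way as the URL above, the resulting file could then be passed to the command line, for example ``tsfaker my_schema.json --nrows 3 --pretty``.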
        
        
        Advanced usage
        --------------
        
        Show help message.
        
        .. code:: bash
        
            $ tsfaker --help
            Usage: tsfaker [OPTIONS] [SCHEMA_DESCRIPTORS]...
            ...
        
        
        Download example schemas from the **schema-snds** project.
        
        .. code:: bash
        
            $ git clone  https://gitlab.com/healthdatahub/schema-snds && cd schema-snds
        
        
        Generate fake data for all schemas in a **schemas** folder, using the CSV files in a **nomenclatures** folder, and write the results to a **fake_data** folder.
        
        .. code:: bash
        
            $ mkdir fake_data
            $ tsfaker schemas -o fake_data -r nomenclatures
            2019-01-01 00:00:00 :: INFO :: Data generated from descriptor 'schemas/PMSI/PMSI MCO/T_MCOaa_nnE.json' will be written on 'fake_data/PMSI/PMSI MCO/T_MCOaa_nnE.csv'
            2019-01-01 00:00:00 :: INFO :: Data generated from descriptor 'schemas/PMSI/PMSI MCO/T_MCOaa_nnFASTC.json' will be written on 'fake_data/PMSI/PMSI MCO/T_MCOaa_nnFASTC.csv'
            2019-01-01 00:00:00 :: INFO :: Data generated from descriptor 'schemas/PMSI/PMSI SSR/T_SSRaa_nnE.json' will be written on 'fake_data/PMSI/PMSI SSR/T_SSRaa_nnE.csv'
            ...
        
        Related work
        ============
        
        We directly use, or draw inspiration from, the following libraries.
        
        Simple data generators
        ----------------------
        
        - `numpy <https://github.com/numpy/numpy>`_ comes with many functions to generate random data.
        
        - `rstr <https://pypi.org/project/rstr/>`_ and `exrex <https://github.com/asciimoo/exrex>`_ generate random strings matching regular expressions.
        
        - `Faker <https://github.com/joke2k/faker>`_ and `Mimesis <https://mimesis.readthedocs.io/index.html>`_ generate fake data. They both focus on high-level data, such as names, emails or addresses, which does not seem necessary for our purpose.
        
        - `DataScienceFaker <https://github.com/EDS-APHP/dsfaker>`_ generates synthetic data conforming to statistical distributions. It is based on numpy and rstr.
        
        Table generators
        ----------------
        
        - `pydbgen <https://github.com/tirthajyoti/pydbgen>`_ is a shallow wrapper around Faker that generates tables as pandas DataFrames, SQLite tables or Excel files.
        
        - `pySyntheticDatasetGenerator <https://github.com/EDS-APHP/pySyntheticDatasetGenerator>`_ is a wrapper around dsfaker that generates tables with their relations, as described in YAML configuration files.
        
        - `datafiller <https://github.com/memsql/datafiller>`_ generates random data from a database schema. Its API could be interesting.
        
        - `plaitpy <https://github.com/plaitpy/plaitpy>`_ is a fake table generator driven by a YAML configuration file.
        
        
        Realistic data
        --------------
        
        Generating realistic data, i.e. data carrying statistical information, can mean different things in different contexts:
        
        - realistic statistical distribution on single columns,
        - realistic temporal dynamics,
        - realistic correlations between pairs of columns,
        - realistic correlations between pairs of columns from different (joinable) tables,
        - etc.
        
        Hence there is no universal way to generate realistic data. Most approaches follow two steps:
        
        1. learn a statistical model from the real data,
        2. generate data using this model.
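        
        As a toy illustration of these two steps (this is not how tsfaker or any of the tools listed here works), the sketch below fits a normal distribution to a single numeric column and samples new values from it. The input values are made up, and ``statistics.NormalDist`` requires Python 3.8 or later.
        
        .. code:: python
        
            from statistics import NormalDist
        
            # Stand-in for a real numeric column (made-up values).
            real_ages = [34, 51, 47, 62, 29, 58, 45, 39]
        
            # Step 1: learn a statistical model, here only a mean and a standard deviation.
            model = NormalDist.from_samples(real_ages)
        
            # Step 2: generate new values from the model instead of copying real ones.
            fake_ages = model.samples(5, seed=0)
        
        Such a model only captures the distribution of one column; correlations or temporal dynamics would require a richer model.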
        
        The statistical model depends on the context, and is usually not expressed in the form of a generic schema, such as table-schema.
        However, a schema of your data will often be necessary to *configure* this kind of library.
        
        This topic is an active research area, with many articles but few production implementations:
        
        - `DataSynthesizer <https://github.com/DataResponsibly/DataSynthesizer>`_ (`article <https://arxiv.org/abs/1710.08874>`__) learns a differentially private Bayesian network capturing the correlation structure between attributes.
        - `dpgan <https://github.com/alps-lab/dpgan>`_ (`article <https://arxiv.org/pdf/1801.01594.pdf>`__) Differentially Private Releasing via Deep Generative Model.
        - `SDV <https://github.com/HDI-Project/SDV>`_ (`article <https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf>`__) Generative modeling for relational databases.
        - `medGAN <https://github.com/mp2893/medgan>`_ (`article <https://arxiv.org/abs/1703.06490>`__) Generative adversarial network for generating electronic health records.
        - `CTGAN <https://github.com/sdv-dev/CTGAN>`_ (`article <https://arxiv.org/abs/1907.00503>`__) Modeling Tabular data using Conditional GAN
        
        The statistical model may convey sensitive information and personal data.
        This is an important fact to bear in mind, as protecting sensitive information is a common reason to generate fake data in the first place.
        
        Some tools offer ways to mitigate the risk of personal data leakage, but with no formal guarantees.
        Other tools offer formal privacy guarantees through `differential privacy <https://en.wikipedia.org/wiki/Differential_privacy>`_.
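        
        To give an idea of what such a guarantee relies on, here is a minimal sketch of the Laplace mechanism, a basic building block of differential privacy, applied to a counting query. It is illustrative only and unrelated to the tools listed above: the function names and data are hypothetical, and a production implementation would need a vetted library rather than Python's ``random`` module.
        
        .. code:: python
        
            import random
        
            def laplace_noise(scale, rng):
                # The difference of two independent exponential draws with mean ``scale``
                # follows a Laplace distribution centred on 0 with that scale.
                return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        
            def private_count(values, predicate, epsilon, rng):
                # A counting query has sensitivity 1: adding or removing one record
                # changes the true count by at most 1, so the noise scale is 1 / epsilon.
                true_count = sum(1 for v in values if predicate(v))
                return true_count + laplace_noise(1 / epsilon, rng)
        
            rng = random.Random(0)
            ages = [34, 51, 47, 62, 29, 58, 45, 39]  # made-up values
            noisy = private_count(ages, lambda a: a >= 50, epsilon=0.5, rng=rng)
        
        A smaller ``epsilon`` means more noise and therefore stronger privacy, at the cost of a less accurate answer.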
        
        An active line of work is to use Generative Adversarial Networks to generate realistic data, for example dpgan (see above) or `Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing <https://www.ahajournals.org/doi/10.1161/CIRCOUTCOMES.118.005122>`__.
        
        When using neural networks, one can use the `TensorFlow Privacy <https://medium.com/tensorflow/introducing-tensorflow-privacy-learning-with-differential-privacy-for-training-data-b143c5e801b6>`_ library.
        The `PySyft project <https://github.com/OpenMined/PySyft>`_ aims to provide a generic implementation for PyTorch.
        
        Release notes
        =============
        
        Version 0.14
        
        - [Fix] Update command line default value to match Click library version >=8.0
        
        Version 0.13
        
        - [Fix] Adapt maximum default integer value to local system
        
        Version 0.12
        
        - It is possible to specify trueValues and falseValues for boolean type (according to TableSchema standard)
        - Only one item is accepted in trueValues and falseValues arrays
        - It is possible to specify a format for types date and datetime
        
        Version 0.11
        
        - yearmonth type does not follow ISO 8601 format 'YYYY-MM' and is now generated without a dash 'YYYYMM'
        
        Version 0.10
        
        - boolean type is implemented, default values for this type are 0 for False and 1 for True
        
Platform: UNKNOWN
Requires-Python: ~=3.5
Provides-Extra: dev
