Metadata-Version: 2.1
Name: rdt
Version: 0.5.3.dev0
Summary: Reversible Data Transforms
Home-page: https://github.com/sdv-dev/RDT
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Keywords: rdt
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6,<3.9
Description-Content-Type: text/markdown
Provides-Extra: copulas
Provides-Extra: test
Provides-Extra: dev
License-File: LICENSE
License-File: AUTHORS.rst

<p align="left">
  <a href="https://dai.lids.mit.edu">
    <img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
  </a>
  <i>An Open Source Project from the <a href="https://dai.lids.mit.edu">Data to AI Lab, at MIT</a></i>
</p>

[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
[![PyPi Shield](https://img.shields.io/pypi/v/RDT.svg)](https://pypi.python.org/pypi/RDT)
[![Unit Tests](https://github.com/sdv-dev/RDT/actions/workflows/unit.yml/badge.svg)](https://github.com/sdv-dev/RDT/actions/workflows/unit.yml)
[![Downloads](https://pepy.tech/badge/rdt)](https://pepy.tech/project/rdt)
[![Coverage Status](https://codecov.io/gh/sdv-dev/RDT/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/RDT)

<img align="center" width=40% src="docs/images/rdt-logo.png">

* Website: https://sdv.dev
* Documentation: https://sdv.dev/SDV
* Repository: https://github.com/sdv-dev/RDT
* License: [MIT](https://github.com/sdv-dev/RDT/blob/master/LICENSE)
* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)

# Overview

**RDT** is a Python library used to transform data for data science libraries and preserve
the transformations in order to revert them as needed.

# Install

**RDT** is part of the **SDV** project and is automatically installed alongside it. For
details about this process, please visit the
[SDV Installation Guide](https://sdv.dev/SDV/getting_started/install.html).

Optionally, **RDT** can also be installed as a standalone library using the following commands:

**Using `pip`:**

```bash
pip install rdt
```

**Using `conda`:**

```bash
conda install -c sdv-dev -c conda-forge rdt
```

For more installation options, please visit the [RDT Installation Guide](INSTALL.md).


# Quickstart

In this short series of tutorials we will guide you through the steps needed to
get started using **RDT** to transform columns, tables and datasets.

## Transforming a column

In this first guide, you will learn how to use **RDT** in its simplest form, transforming
a single column taken from a `pandas.DataFrame` object.

### 1. Load the demo data

You can load some demo data using the `rdt.get_demo` function, which will return some random
data for you to play with.

```python3
from rdt import get_demo

data = get_demo()
```

This will return a `pandas.DataFrame` with 10 rows and 4 columns, one of each data type supported:

```
   0_int    1_float 2_str          3_datetime
0   38.0  46.872441     b 2021-02-10 21:50:00
1   77.0  13.150228   NaN 2021-07-19 21:14:00
2   21.0        NaN     b                 NaT
3   10.0  37.128869     c 2019-10-15 21:39:00
4   91.0  41.341214     a 2020-10-31 11:57:00
5   67.0  92.237335     a                 NaT
6    NaN  51.598682   NaN 2020-04-01 01:56:00
7    NaN  42.204396     c 2020-03-12 22:12:00
8   68.0        NaN     c 2021-02-25 16:04:00
9    7.0  31.542918     a 2020-07-12 03:12:00
```

Note that the data is random, so your output might look a bit different. Also note that
RDT introduced some null values at random.

### 2. Load the transformer

In this example we will use the datetime column, so let's load a `DatetimeTransformer`.

```python3
from rdt.transformers import DatetimeTransformer

transformer = DatetimeTransformer()
```

### 3. Fit the Transformer

Before being able to transform the data, we need the transformer to learn from it.

We will do this by calling its `fit` method and passing the column that we want to transform.

```python3
transformer.fit(data['3_datetime'])
```

### 4. Transform the data

Once the transformer is fitted, we can pass the data again to its `transform` method in order
to get the transformed version of the data.

```python3
transformed = transformer.transform(data['3_datetime'])
```

The output will be a `numpy.ndarray` with two columns, one with the datetimes transformed
to integer timestamps, and another one indicating with 1s which values were null in the
original data.

```
array([[1.61299380e+18, 0.00000000e+00],
       [1.62672924e+18, 0.00000000e+00],
       [1.59919923e+18, 1.00000000e+00],
       [1.57117554e+18, 0.00000000e+00],
       [1.60414542e+18, 0.00000000e+00],
       [1.59919923e+18, 1.00000000e+00],
       [1.58570616e+18, 0.00000000e+00],
       [1.58405112e+18, 0.00000000e+00],
       [1.61426904e+18, 0.00000000e+00],
       [1.59452352e+18, 0.00000000e+00]])
```
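
To make the two columns concrete, here is a minimal standard-library sketch of the idea. It is
not RDT's implementation: it assumes naive datetimes are read as UTC and that nulls are filled
with the mean of the non-null timestamps, which is what the sample output above suggests (rows 2
and 5 share the same filler value). The name `datetimes_to_rows` is hypothetical.

```python3
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_ns(value):
    """Integer nanoseconds between the epoch and a naive datetime."""
    return (value - EPOCH) // timedelta(microseconds=1) * 1000


def datetimes_to_rows(values):
    """Return [timestamp_ns, is_null] rows for a list of datetimes or None."""
    stamps = [to_ns(value) for value in values if value is not None]
    # Assumption: nulls are replaced by the mean of the non-null timestamps.
    fill = sum(stamps) // len(stamps) if stamps else 0

    rows = []
    for value in values:
        if value is None:
            rows.append([fill, 1])
        else:
            rows.append([to_ns(value), 0])
    return rows


rows = datetimes_to_rows([
    datetime(2021, 2, 10, 21, 50),
    None,
    datetime(2019, 10, 15, 21, 39),
])
print(rows)
```

The second value in each row plays the role of the null indicator column: it is the only place
where the information "this value was missing" survives the transformation.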

### 5. Revert the column transformation

In order to revert the previous transformation, the transformed data can be passed to
the `reverse_transform` method of the transformer:

```python3
reversed_data = transformer.reverse_transform(transformed)
```

The output will be a `pandas.Series` containing the reverted values, which should be exactly
like the original ones.

```
0   2021-02-10 21:50:00
1   2021-07-19 21:14:00
2                   NaT
3   2019-10-15 21:39:00
4   2020-10-31 11:57:00
5                   NaT
6   2020-04-01 01:56:00
7   2020-03-12 22:12:00
8   2021-02-25 16:04:00
9   2020-07-12 03:12:00
dtype: datetime64[ns]
```
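
The reverse direction can be sketched in the same spirit. This is an illustration under the
same assumptions as before (UTC, nanosecond integers), not RDT's code, and `rows_to_datetimes`
is a hypothetical name. The null indicator decides which rows become missing again, so the
filler timestamp stored in those rows is simply discarded.

```python3
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def rows_to_datetimes(rows):
    """Rebuild datetimes (or None) from [timestamp_ns, is_null] rows."""
    restored = []
    for timestamp_ns, is_null in rows:
        if is_null:
            # The stored timestamp is only a filler, so it is ignored.
            restored.append(None)
        else:
            restored.append(EPOCH + timedelta(microseconds=timestamp_ns // 1000))
    return restored


restored = rows_to_datetimes([
    [1612993800000000000, 0],
    [1599199230000000000, 1],
])
print(restored)
```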

## Transforming a table

Once we know how to transform a single column, we can try to go the next level and transform
a table with multiple columns.

### 1. Load the HyperTransformer

In order to manipulate a complete table we will need to load a `rdt.HyperTransformer`.

```python3
from rdt import HyperTransformer

ht = HyperTransformer()
```

### 2. Fit the HyperTransformer

Just like the transformer, the HyperTransformer needs to be fitted before being able to transform
data.

This is done by calling its `fit` method passing the `data` DataFrame.

```python3
ht.fit(data)
```

### 3. Transform the table data

Once the HyperTransformer is fitted, we can pass the data again to its `transform` method in order
to get the transformed version of the data.

```python3
transformed = ht.transform(data)
```

The output will now be another `pandas.DataFrame` with the numerical representation of our
data.

```
    0_int  0_int#1    1_float  1_float#1  2_str    3_datetime  3_datetime#1
0  38.000      0.0  46.872441        0.0   0.70  1.612994e+18           0.0
1  77.000      0.0  13.150228        0.0   0.90  1.626729e+18           0.0
2  21.000      0.0  44.509511        1.0   0.70  1.599199e+18           1.0
3  10.000      0.0  37.128869        0.0   0.15  1.571176e+18           0.0
4  91.000      0.0  41.341214        0.0   0.45  1.604145e+18           0.0
5  67.000      0.0  92.237335        0.0   0.45  1.599199e+18           1.0
6  47.375      1.0  51.598682        0.0   0.90  1.585706e+18           0.0
7  47.375      1.0  42.204396        0.0   0.15  1.584051e+18           0.0
8  68.000      0.0  44.509511        1.0   0.15  1.614269e+18           0.0
9   7.000      0.0  31.542918        0.0   0.45  1.594524e+18           0.0
```
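
The `2_str` column is the least obvious part of this output. Its values are consistent with each
category being mapped to the midpoint of an interval inside `[0, 1]` whose width is the
category's relative frequency. The sketch below works under that assumption and is not RDT's
implementation; the function names are hypothetical and `None` stands in for `NaN`.

```python3
from collections import Counter


def fit_intervals(values):
    """Map each category to a (start, end) interval sized by its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    seen = 0
    # Most frequent categories first; ties keep first-seen order.
    for category, count in counts.most_common():
        intervals[category] = (seen / total, (seen + count) / total)
        seen += count
    return intervals


def transform(values, intervals):
    """Replace each category with the midpoint of its interval."""
    return [sum(intervals[value]) / 2 for value in values]


def reverse_transform(numbers, intervals):
    """Recover the category whose interval contains each number."""
    restored = []
    for number in numbers:
        for category, (start, end) in intervals.items():
            if start <= number < end or number == end == 1.0:
                restored.append(category)
                break
        else:
            raise ValueError(f"{number} is outside [0, 1]")
    return restored


column = ["b", None, "b", "c", "a", "a", None, "c", "c", "a"]
intervals = fit_intervals(column)
encoded = transform(column, intervals)
decoded = reverse_transform(encoded, intervals)
print([round(number, 2) for number in encoded])
```

With the demo column shown earlier, the rounded values match the `2_str` column in the table
above, and `decoded` is equal to the original list. Any number that falls inside a category's
interval maps back to that category, not just the exact midpoint.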

### 4. Revert the table transformation

In order to revert the transformation and recover the original data from the transformed one,
we need to call the `reverse_transform` method of the `HyperTransformer` instance, passing it the
transformed data.

```python3
reversed_data = ht.reverse_transform(transformed)
```

This should output, again, a table that looks exactly like the original one.

```
   0_int    1_float 2_str          3_datetime
0   38.0  46.872441     b 2021-02-10 21:50:00
1   77.0  13.150228   NaN 2021-07-19 21:14:00
2   21.0        NaN     b                 NaT
3   10.0  37.128869     c 2019-10-15 21:39:00
4   91.0  41.341214     a 2020-10-31 11:57:00
5   67.0  92.237335     a                 NaT
6    NaN  51.598682   NaN 2020-04-01 01:56:00
7    NaN  42.204396     c 2020-03-12 22:12:00
8   68.0        NaN     c 2021-02-25 16:04:00
9    7.0  31.542918     a 2020-07-12 03:12:00
```
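
For the numerical columns, the round trip works because the `#1` column records which values
were null. The sketch below illustrates this with the `0_int` values, assuming nulls are filled
with the column mean (47.375 is the mean of the eight non-null values). It is not RDT's code and
all function names are hypothetical. Since `NaN` never compares equal to itself, checking that
two columns match needs a small helper.

```python3
import math


def fill_nulls(values):
    """Return (filled, flags): NaNs replaced by the mean, plus 0/1 null flags."""
    present = [value for value in values if not math.isnan(value)]
    mean = sum(present) / len(present) if present else 0.0
    filled = [mean if math.isnan(value) else value for value in values]
    flags = [1.0 if math.isnan(value) else 0.0 for value in values]
    return filled, flags


def restore_nulls(filled, flags):
    """Put NaN back wherever the flag column marks a null."""
    return [math.nan if flag else value for value, flag in zip(filled, flags)]


def same_column(left, right):
    """Compare two columns, treating NaN as equal to NaN."""
    return len(left) == len(right) and all(
        (math.isnan(a) and math.isnan(b)) or a == b
        for a, b in zip(left, right)
    )


original = [38.0, 77.0, 21.0, 10.0, 91.0, 67.0, math.nan, math.nan, 68.0, 7.0]
filled, flags = fill_nulls(original)
restored = restore_nulls(filled, flags)
print(filled[6], same_column(original, restored))
```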

# The Synthetic Data Vault

<p>
  <a href="https://sdv.dev">
    <img width=30% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/SDV-Logo-Color-Tagline.png?raw=true">
  </a>
  <p><i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a></i></p>
</p>

* Website: https://sdv.dev
* Documentation: https://sdv.dev/SDV


# History

## 0.5.2 - 2021-08-16

This release fixes a couple of bugs introduced by the previous release regarding the
`OneHotEncoder` and the `BooleanTransformer`.

### Issues closed

* BooleanTransformer.reverse_transform sometimes crashes with TypeError - Issue [#210](https://github.com/sdv-dev/RDT/issues/210) by @katxiao
* OneHotEncoder causing shape misalignment in CopulaGAN, CTGAN, and TVAE - Issue [#208](https://github.com/sdv-dev/RDT/issues/208) by @sarahmish
* Boolean.transformer.reverse_transform modifies the input data - Issue [#211](https://github.com/sdv-dev/RDT/issues/211) by @katxiao

## 0.5.1 - 2021-08-11

This release improves the overall performance of the library, both in terms of memory and time consumption.
More specifically, it makes the following modules more efficient: `NullTransformer`, `DatetimeTransformer`,
`LabelEncodingTransformer`, `NumericalTransformer`, `CategoricalTransformer`, `BooleanTransformer` and `OneHotEncodingTransformer`.

It also adds performance-based testing and a script for profiling the performance.

### Issues closed

* Add performance-based testing - Issue [#194](https://github.com/sdv-dev/RDT/issues/194) by @amontanez24
* Audit the NullTransformer - Issue [#192](https://github.com/sdv-dev/RDT/issues/192) by @amontanez24
* Audit DatetimeTransformer - Issue [#189](https://github.com/sdv-dev/RDT/issues/189) by @sarahmish
* Audit the LabelEncodingTransformer - Issue [#184](https://github.com/sdv-dev/RDT/issues/184) by @amontanez24
* Audit the NumericalTransformer - Issue [#181](https://github.com/sdv-dev/RDT/issues/181) by @fealho
* Audit CategoricalTransformer - Issue [#180](https://github.com/sdv-dev/RDT/issues/180) by @katxiao
* Audit BooleanTransformer - Issue [#179](https://github.com/sdv-dev/RDT/issues/179) by @katxiao
* Auditing OneHotEncodingTransformer - Issue [#178](https://github.com/sdv-dev/RDT/issues/178) by @sarahmish
* Create script for profiling - Issue [#176](https://github.com/sdv-dev/RDT/issues/176) by @amontanez24
* Create folder structure for performance testing - Issue [#174](https://github.com/sdv-dev/RDT/issues/174) by @amontanez24

## 0.5.0 - 2021-07-12

This release updates the `NumericalTransformer` by adding a new `rounding` argument.
Users can now obtain numerical values rounded to a given precision, either pre-specified or automatically computed from the given data.

### Issues closed

* Add `rounding` argument to `NumericalTransformer` - Issue [#166](https://github.com/sdv-dev/RDT/issues/166) by @amontanez24 and @csala
* `NumericalTransformer` rounding error with infinity - Issue [#169](https://github.com/sdv-dev/RDT/issues/169) by @amontanez24
* Add min and max arguments to NumericalTransformer - Issue [#106](https://github.com/sdv-dev/RDT/issues/106) by @amontanez24

## 0.4.2 - 2021-06-08

This release adds a new method to the `CategoricalTransformer` to solve a bug where
the transformer becomes unusable after being pickled and unpickled if it had `NaN`
values in the data which it was fit on.

It also fixes some grammar mistakes in the documentation.

### Issues closed

* CategoricalTransformer with NaN values cannot be pickled bug - Issue [#164](https://github.com/sdv-dev/RDT/issues/164) by @pvk-developer and @csala

### Documentation changes

* docs: fix typo - PR [#163](https://github.com/sdv-dev/RDT/issues/163) by @sbrugman

## 0.4.1 - 2021-03-29

This release improves the `HyperTransformer` memory usage when working with a
high number of columns or a high number of categorical values when using one hot encoding.

### Issues closed

* `Boolean`, `Datetime` and `LabelEncoding` transformers fail with 2D `ndarray` - Issue [#160](https://github.com/sdv-dev/RDT/issues/160) by @pvk-developer
* `HyperTransformer`: Memory usage increase when `reverse_transform` is called - Issue [#156](https://github.com/sdv-dev/RDT/issues/156) by @pvk-developer and @AnupamaGangadhar

## 0.4.0 - 2021-02-24

In this release a change in the HyperTransformer allows using it to transform and
reverse transform a subset of the columns seen during training.

The anonymization functionality which was deprecated and not being used has also
been removed along with the Faker dependency.

### Issues closed

* Allow the HyperTransformer to be used on a subset of the columns - Issue [#152](https://github.com/sdv-dev/RDT/issues/152) by @csala
* Remove faker - Issue [#150](https://github.com/sdv-dev/RDT/issues/150) by @csala

## 0.3.0 - 2021-01-27

This release changes the behavior of the `HyperTransformer` to prevent it from
modifying any column in the given `DataFrame` if the `transformers` dictionary
is passed empty.

### Issues closed

* If transformers is an empty dict, do nothing - Issue [#149](https://github.com/sdv-dev/RDT/issues/149) by @csala

## 0.2.10 - 2020-12-18

This release adds a new argument to the `HyperTransformer` which gives control over
which transformers to use by default for each `dtype` if no specific transformer
has been specified for the field.

This is also the first version to be officially released on conda.

### Issues closed

* Add `dtype_transformers` argument to HyperTransformer - Issue [#148](https://github.com/sdv-dev/RDT/issues/148) by @csala
* Makes Copulas an optional dependency - Issue [#144](https://github.com/sdv-dev/RDT/issues/144) by @fealho

## 0.2.9 - 2020-11-27

This release fixes a bug that prevented the `CategoricalTransformer` from working properly
when being passed data that contained numerical data only, without any strings, but also
contained `None` or `NaN` values.

### Issues closed

* KeyError: nan - CategoricalTransformer fails on numerical + nan data only - Issue [#142](https://github.com/sdv-dev/RDT/issues/142) by @csala

## 0.2.8 - 2020-11-20

This release fixes a few minor bugs, including some which prevented RDT from fully working
on Windows systems.

Thanks to these fixes, as well as a new testing infrastructure that has been set up, from now
on RDT is officially supported on Windows systems, as well as on the Linux and macOS systems
which were previously supported.

### Issues closed

* TypeError: unsupported operand type(s) for: 'NoneType' and 'int' - Issue [#132](https://github.com/sdv-dev/RDT/issues/132) by @csala
* Example does not work on Windows - Issue [#114](https://github.com/sdv-dev/RDT/issues/114) by @csala
* OneHotEncodingTransformer producing all zeros - Issue [#135](https://github.com/sdv-dev/RDT/issues/135) by @fealho
* OneHotEncodingTransformer support for lists and lists of lists - Issue [#137](https://github.com/sdv-dev/RDT/issues/137) by @fealho

## 0.2.7 - 2020-10-16

In this release we drop the support for the now officially dead Python 3.5
and introduce a new feature in the DatetimeTransformer which reduces the dimensionality
of the generated numerical values while also ensuring that the reverted datetimes
maintain the same level of time unit precision as the original ones.

* Drop Py35 support - Issue [#129](https://github.com/sdv-dev/RDT/issues/129) by @csala
* Add option to drop constant parts of the datetimes - Issue [#130](https://github.com/sdv-dev/RDT/issues/130) by @csala

## 0.2.6 - 2020-10-05

* Add GaussianCopulaTransformer - Issue [#125](https://github.com/sdv-dev/RDT/issues/125) by @csala
* dtype category error - Issue [#124](https://github.com/sdv-dev/RDT/issues/124) by @csala

## 0.2.5 - 2020-09-18

Minor bugfixing release.

### Bugs Fixed

* Handle NaNs in OneHotEncodingTransformer - Issue [#118](https://github.com/sdv-dev/RDT/issues/118) by @csala
* OneHotEncodingTransformer fails if there is only one category - Issue [#119](https://github.com/sdv-dev/RDT/issues/119) by @csala
* All NaN column produces NaN values enhancement - Issue [#121](https://github.com/sdv-dev/RDT/issues/121) by @csala
* Make the CategoricalTransformer learn the column dtype and restore it back - Issue [#122](https://github.com/sdv-dev/RDT/issues/122) by @csala

## 0.2.4 - 2020-08-08

### General Improvements

* Support Python 3.8 - Issue [#117](https://github.com/sdv-dev/RDT/issues/117) by @csala
* Support pandas >1 - Issue [#116](https://github.com/sdv-dev/RDT/issues/116) by @csala

## 0.2.3 - 2020-07-09

* Implement OneHot and Label encoding as transformers - Issue [#112](https://github.com/sdv-dev/RDT/issues/112) by @csala

## 0.2.2 - 2020-06-26

### Bugs Fixed

* Escape `column_name` in hypertransformer - Issue [#110](https://github.com/sdv-dev/RDT/issues/110) by @csala

## 0.2.1 - 2020-01-17

### Bugs Fixed

* Boolean Transformer fails to revert when there are NO nulls - Issue [#103](https://github.com/sdv-dev/RDT/issues/103) by @JDTheRipperPC

## 0.2.0 - 2019-10-15

This version comes with a brand new API and internal implementation, removing the old
metadata JSON from the user provided arguments, and making each transformer work only
with `pandas.Series` of their corresponding data type.

As part of this change, several transformer names have been changed and a new BooleanTransformer
and a feature to automatically decide which transformers to use based on dtypes have been added.

Unit test coverage has also been increased to 100%.

Special thanks to @JDTheRipperPC and @csala for the great effort put into making this
release possible.

### Issues

* Drop the usage of meta - Issue [#72](https://github.com/sdv-dev/RDT/issues/72) by @JDTheRipperPC
* Make CatTransformer.probability_map deterministic - Issue [#25](https://github.com/sdv-dev/RDT/issues/25) by @csala

## 0.1.3 - 2019-09-24

### New Features

* Add attributes NullTransformer and col_meta - Issue [#30](https://github.com/sdv-dev/RDT/issues/30) by @ManuelAlvarezC

### General Improvements

* Integrate with CodeCov - Issue [#89](https://github.com/sdv-dev/RDT/issues/89) by @csala
* Remake Sphinx Documentation - Issue [#96](https://github.com/sdv-dev/RDT/issues/96) by @JDTheRipperPC
* Improve README - Issue [#92](https://github.com/sdv-dev/RDT/issues/92) by @JDTheRipperPC
* Document RELEASE workflow - Issue [#93](https://github.com/sdv-dev/RDT/issues/93) by @JDTheRipperPC
* Add support to Python 3.7 - Issue [#38](https://github.com/sdv-dev/RDT/issues/38) by @ManuelAlvarezC
* Create way to pass HyperTransformer table dict - Issue [#45](https://github.com/sdv-dev/RDT/issues/45) by @ManuelAlvarezC

## 0.1.2

* Add a numerical transformer for positive numbers.
* Add option to anonymize data on categorical transformer.
* Move the `col_meta` argument from method-level to class-level.
* Move the logic for missing values from the transformers into the `HyperTransformer`.
* Remove unreachable lines in `NullTransformer`.
* Make `NumberTransformer` set the default value to 0 when the column is null.
* Add a CLA for collaborators.
* Refactor the transformers to improve performance.

## 0.1.1

* Improve handling of NaN in NumberTransformer and CatTransformer.
* Add unittests for HyperTransformer.
* Remove unused methods `get_types` and `impute_table` from HyperTransformer.
* Make NumberTransformer enforce dtype int on integer data.
* Make DTTransformer check data format before transforming.
* Add minimal API Reference.
* Merge `rdt.utils` into `HyperTransformer` class. 

## 0.1.0

* First release on PyPI.


