Metadata-Version: 2.1
Name: carefree-data
Version: 0.2.9
Summary: Data processing module implemented with numpy
Home-page: https://github.com/carefree0910/carefree-data
Author: carefree0910
Author-email: syameimaru.saki@gmail.com
License: UNKNOWN
Download-URL: https://github.com/carefree0910/carefree-data/archive/v0.2.9.tar.gz
Keywords: python numpy data-science
Platform: UNKNOWN
Description-Content-Type: text/markdown
License-File: LICENSE

# carefree-data

`carefree-data` is a data processing module implemented with numpy.

#### Update 2021.02.04

`carefree-data` now uses [`datatable`](https://github.com/h2oai/datatable) as its backend, which significantly improves performance on file inputs!

#### Why carefree-data?

`carefree-data` is a data processing module capable of handling 'dirty' and 'messy' datasets.

##### For tabular datasets, `carefree-data` is able to:

+ Elegantly deal with data pre-processing.
    + A `Recognizer` to recognize whether a column is `STRING`, `NUMERICAL` or `CATEGORICAL`.
    + A `Converter` to convert a column into a friendly format (`["one", "two"]` -> `[0, 1]`).
    + A `Processor` to further process columns (`OneHot`, `Normalize`, `MinMax`, ...).
    + All of these transforms can be inverted! (See `tests\unittests\test_tabular.py` -> `test_recover_labels` & `test_recover_features`).
    + And these procedures are all completed **AUTOMATICALLY**!
+ Handle datasets saved in files (`.txt`, `.csv`).
    + For `.txt`, `" "` will be the default `delimiter`.
    + For `.csv`, `","` will be the default `delimiter`, and the first row will be skipped by default.
    + `delimiter`, `label index` and `skip first` can be set manually.
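
To make the "convert, process, invert" idea above concrete, here is a minimal sketch in plain Python. It is not how `carefree-data` implements `Converter` or `Processor` internally; the function names (`fit_categories`, `encode`, `decode`, `min_max`, `inverse_min_max`) are hypothetical and exist only for illustration. It assumes categories are indexed in order of first appearance and that min-max scaling maps a column onto `[0, 1]`.

```python
from typing import Dict, List, Sequence, Tuple


def fit_categories(column: Sequence[str]) -> Dict[str, int]:
    """Give each distinct value an integer index, in order of first appearance."""
    mapping: Dict[str, int] = {}
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(column: Sequence[str], mapping: Dict[str, int]) -> List[int]:
    """Convert string categories into their integer indices."""
    return [mapping[value] for value in column]


def decode(indices: Sequence[int], mapping: Dict[str, int]) -> List[str]:
    """Invert `encode`: recover the original strings from the indices."""
    inverse = {index: value for value, index in mapping.items()}
    return [inverse[index] for index in indices]


def min_max(column: Sequence[float]) -> Tuple[List[float], float, float]:
    """Scale a numerical column onto [0, 1] and return the statistics used."""
    lo, hi = min(column), max(column)
    scale = (hi - lo) or 1.0  # a constant column would otherwise divide by zero
    return [(value - lo) / scale for value in column], lo, hi


def inverse_min_max(scaled: Sequence[float], lo: float, hi: float) -> List[float]:
    """Invert `min_max` using the stored minimum and maximum."""
    scale = (hi - lo) or 1.0
    return [value * scale + lo for value in scaled]
```

For example, `encode(["one", "two", "one"], fit_categories(["one", "two", "one"]))` gives `[0, 1, 0]`, and `decode` maps those indices back to the original strings. The point of the sketch is that every transform keeps the statistics it needs (the category mapping, the minimum and maximum) so that it can be undone later.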

#### Pandas-free

There is one more thing we'd like to mention: `carefree-data` is 'Pandas-free'. Pandas is an open source library providing easy-to-use data structures for structured datasets. Although it is widely used by well-known Machine Learning and Deep Learning libraries, we ultimately decided not to depend on it, for the reasons listed below:

+ `carefree-data` wants to have full control over the data, and Pandas is not flexible enough.
+ `carefree-data` needs higher performance. Pandas is fast, but not as fast as pure numpy (and sometimes cython) code on some critical code paths.
+ Pandas provides many powerful functions, but `carefree-data` doesn't need most of them, which means Pandas is a little 'heavy' for `carefree-data`.

In short, Pandas is a more general library, and that's why we've written our own code to cover our needs instead of using it directly.


> Currently `carefree-data` only supports tabular datasets.


## Installation

`carefree-data` requires Python 3.8 or higher.

```bash
pip install carefree-data
```

or

```bash
git clone https://github.com/carefree0910/carefree-data.git
cd carefree-data
pip install -e .
```


## Basic Usage

### Get scikit-learn datasets

```python
from cfdata.tabular import TabularDataset

iris = TabularDataset.iris()
```

### Read from array / dataset

```python
from cfdata.tabular import *

iris = TabularDataset.iris()
x, y = iris.xy
assert TabularData().read(x, y) == TabularData.from_dataset(iris)
```

### Read from file

```python
from cfdata.tabular import TabularData

file = "/path/to/your/file"
data = TabularData().read(file)
assert data.processed == data.transform(file)
```
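
The delimiter and header conventions described in the feature list (`" "` for `.txt`, `","` for `.csv` with the first row skipped) can be pictured with the standard-library `csv` module. This is only a rough sketch of the convention, not how `carefree-data` parses files (it uses `datatable` as its backend); `read_rows` and its parameters are hypothetical names chosen for this example.

```python
import csv
import io
from typing import List, Optional, TextIO


def read_rows(
    stream: TextIO,
    extension: str,
    delimiter: Optional[str] = None,
    skip_first: Optional[bool] = None,
) -> List[List[str]]:
    """Read rows as lists of strings, choosing defaults from the file extension."""
    is_csv = extension.lower() == ".csv"
    if delimiter is None:
        # Assumed defaults: "," for .csv files, " " for anything else (e.g. .txt).
        delimiter = "," if is_csv else " "
    if skip_first is None:
        # Assumed default: only .csv files have a header row to skip.
        skip_first = is_csv
    rows = [row for row in csv.reader(stream, delimiter=delimiter) if row]
    return rows[1:] if skip_first else rows


# In-memory streams stand in for real files here.
csv_rows = read_rows(io.StringIO("a,b\n1,2\n3,4\n"), ".csv")
txt_rows = read_rows(io.StringIO("1 2\n3 4\n"), ".txt")
```

Both calls yield the same two data rows, and passing `delimiter` or `skip_first` explicitly overrides the extension-based defaults.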


## License

`carefree-data` is MIT licensed, as found in the [`LICENSE`](https://github.com/carefree0910/carefree-data/blob/master/LICENSE) file.

---


