Metadata-Version: 2.1
Name: proxiflow
Version: 0.1.6
Summary: Data Preprocessing flow tool in python
License: MIT
Author: Martin Tomes
Author-email: tomesm@gmail.com
Requires-Python: >=3.10,<3.11
Classifier: Development Status :: 1 - Planning
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering
Requires-Dist: click (>=8.1.3,<9.0.0)
Requires-Dist: numpy (>=1.24.2,<2.0.0)
Requires-Dist: polars (>=0.16.7,<0.17.0)
Requires-Dist: pyaml (>=21.10.1,<22.0.0)
Requires-Dist: scipy (>=1.10.1,<2.0.0)
Project-URL: Bug Tracker, https://github.com/tomesm/proxiflow/issues
Project-URL: Documentation, http://proxiflow.readthedocs.io
Project-URL: Repository, https://github.com/tomesm/proxiflow
Description-Content-Type: text/markdown

[![PyPi version](https://badgen.net/pypi/v/proxiflow/)](https://pypi.org/project/proxiflow)
[![Documentation Status](https://readthedocs.org/projects/proxiflow/badge/?version=latest)](https://proxiflow.readthedocs.io/en/latest/?badge=latest)
[![PyPI download month](https://img.shields.io/pypi/dm/proxiflow.svg)](https://pypi.python.org/pypi/proxiflow/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/tomesm/proxiflow/graphs/commit-activity)
[![PyPI license](https://img.shields.io/pypi/l/proxiflow.svg)](https://pypi.python.org/pypi/proxiflow/)
[![tests](https://github.com/tomesm/proxiflow/actions/workflows/tests.yml/badge.svg)](https://github.com/tomesm/proxiflow/actions/workflows/tests.yml)


# ProxiFlow

ProxiFlow is a data preprocessig tool for machine learning that performs
data cleaning, normalization, and feature engineering.

The biggest advantage if this library (which is basically a wrapper over [polars](https://github.com/pola-rs/polars) data frame) is that it is configurable via YAML configuration file which makes it suitable for MLOps pipelines or for building API requests over it.

## Documentation
Read the full documentation [here](http://proxiflow.readthedocs.io/).

## Usage

To use ProxiFlow, install it via pip:

``` bash
pip install proxiflow
```

You can then call it from the command line:

``` bash
proxiflow --config-file myconfig.yaml --input-file mydata.csv --output-file cleaned_data.csv
```

Here\'s an example of a YAML configuration file:

``` yaml
input_format: csv
output_format: csv

data_cleaning: #mandatory
  # NOTE: Not handling missing values can cause errors during data normalization
  handle_missing_values:
    drop: false
    mean: true # Only Int and Float columns are handled 
    # mode: true # Turned off for now. 

  handle_outliers: true # Only Float columns are handled
  remove_duplicates: true

data_normalization: # mandatory
  min_max: #mandatory but values are not mandatory. It can be left empty
    # Specify columns:
    - Age # not mandatory
  z_score: 
    - Price 
  log:
    - Floors

feature_engineering:
  one_hot_encoding: # mandatory
    - Bedrooms      # not mandatory

  feature_scaling:  # mandatory
    degree: 2       # not mandatory. It specifies the polynominal degree
    columns:        # not mandatory
      - Floors      # not mandatory
```

The above configuration specifies that duplicate rows should be removed
and missing values should be dropped.

## API

ProxiFlow can also be used as a Python library. Here\'s an example:

``` python
import polars as pl
from proxiflow.config import Config
from proxiflow.core import Cleaner

# Load the data
df = pl.read_csv("mydata.csv")

# Load the configuration
config = Config("myconfig.yaml")

# Clean the data
cleaner = Cleaner(config)
cleaned_data = cleaner.clean_data(data)

# Perform data normalization
normalizer = Normalizer(config)
normalized_data = normalizer.normalize(cleaned_data)

# Perform feature engineering
engineer = Engineer(config)
engineered_data = engineer.execute(normalized_data)

# Write the output data
engineered_data.write_csv("cleaned_data.csv")
```

## Log

-   \[x\] Data cleaning
-   \[x\] Data normalization
-   \[x\] Feature engineering
