Metadata-Version: 2.1
Name: proxiflow
Version: 0.1.0
Summary: Data preprocessing flow tool in Python
License: LICENSE
Author: Martin Tomes
Author-email: tomesm@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering
Requires-Dist: black (>=23.1.0,<24.0.0)
Requires-Dist: click (>=8.1.3,<9.0.0)
Requires-Dist: numpy (>=1.24.2,<2.0.0)
Requires-Dist: polars (>=0.16.7,<0.17.0)
Requires-Dist: pyaml (>=21.10.1,<22.0.0)
Description-Content-Type: text/x-rst

Preflow
=======

Preflow is a data preparation tool for machine learning that performs data cleaning, normalization, and feature engineering.

Usage
-----

To use Preflow, install it with ``pip`` from TestPyPI:

.. code-block:: bash

    pip install --find-links ./wheels --index-url https://test.pypi.org/simple preflow

You can then call it from the command line:

.. code-block:: bash

    preflow --config-file myconfig.yaml --input-file mydata.csv --output-file cleaned_data.csv

Here's an example of a YAML configuration file:

.. code-block:: yaml

    data_cleaning:
      remove_duplicates: True
      handle_missing_values:
        drop: True

    data_normalization:
      ...

    feature_engineering:
      ...

The above configuration specifies that duplicate rows should be removed and missing values should be dropped.
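
To make the effect of these two options concrete, here is a minimal plain-Python sketch of the same two steps applied to a small list of rows. It is not Preflow's implementation: the helper names ``remove_duplicates`` and ``drop_missing`` are hypothetical, and it assumes that ``drop: True`` means any row containing a missing value is discarded as a whole.

.. code-block:: python

    # Illustrative only: plain-Python stand-ins for the two cleaning steps.
    # These helpers are hypothetical and are not part of Preflow's API.

    def remove_duplicates(rows):
        """Keep the first occurrence of each row, preserving order."""
        seen = set()
        unique = []
        for row in rows:
            # A dict is unhashable, so build a hashable key from its items.
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                unique.append(row)
        return unique


    def drop_missing(rows):
        """Drop every row that has at least one missing (None) value."""
        return [row for row in rows if all(v is not None for v in row.values())]


    rows = [
        {"id": 1, "height": 1.80},
        {"id": 1, "height": 1.80},  # exact duplicate of the first row
        {"id": 2, "height": None},  # contains a missing value
        {"id": 3, "height": 1.65},
    ]

    # Mirror the configuration: remove duplicates, then drop incomplete rows.
    cleaned = drop_missing(remove_duplicates(rows))
    print(cleaned)

With this sample input, only the rows with ``id`` 1 and 3 remain.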

API
---

Preflow can also be used as a Python library. Here's an example:

.. code-block:: python

    import polars as pl
    from preppy.config import Config
    from preppy.preprocessor import Preprocessor

    # Load the data
    df = pl.read_csv("mydata.csv")

    # Load the configuration
    config = Config("myconfig.yaml")

    # Preprocess the data
    preprocessor = Preprocessor(config)
    cleaned_df = preprocessor.clean_data(df)

    # Write the output data
    cleaned_df.write_csv("cleaned_data.csv")

TODO
----

- [x] Data cleaning
- [ ] Data normalization
- [ ] Feature engineering

Note: only data cleaning is currently implemented; data normalization and feature engineering are not yet available.

