Metadata-Version: 2.1
Name: pyrecdp
Version: 1.0.1b20230328
Summary: A data processing bundle for spark based recommender system operations
Home-page: https://github.com/intel/e2eAIOK/
Author: INTEL AIA
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/intel/e2eAIOK/
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# RecDP v2.0

# INTRODUCTION

## Problem Statement

Data Preparation is an essential step to build AI pipelines 
* key data preparation capabilities: data connector, cleaning, sampling, joining, profiling, feature engineering, low-code/no-code UI, lineage etc. 
* exploration of optimal Data preparation consumes majority of Data Science time

## Solution with RecDP v2.0

* Auto pipeline
    * only 3 lines of codes required
* Pipeline Generator
    * Data Profiling:
        * Auto anomalies detection
        * Auto missing value impute
        * Profiling Visualizzation        
    * Feature Wrangling:
        * feature transformation(datetime, geo_info, text_nlp, url, etc.)
        * multiple data auto joining
        * feature cross(aggregation transformation - sum, avg, count, etc.)
    * export pipeline as JSON file, can be import to other data platform
* Pipeline Runner:
    * spark engine: convert pipeline to spark codes to run
    * pandas engine: convert pipeline to pandas codes to run
    * sql engine: convert pipeline to sql
* DataLoader:
    * parquet, csv, json, database
* FeatureWriter - ML/DL connector:
    * Data Lineage
    * Feature Store
    * numpy, csv, parquet, dgl / pyG graph
![RecDP v2.0 Overview](resources/recdp_intro.png)

## This solution is intended for
citizen data scientists, enterprise users, independent software vendor and partial of cloud service provider.

# Getting Start
## setup with pip
```
git clone --single-branch --branch RecDP_v2.0 https://github.com/intel-innersource/frameworks.bigdata.AIDK.git
cd frameworks.bigdata.AIDK/RecDP
# install dependencies
apt-get update -y &&  DEBIAN_FRONTEND=noninteractive apt-get install -y python3 python3-pip python-is-python3 graphviz
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
# install recdp
python setup.py sdist
pip install dist/pyrecdp-1.0.1.tar.gz

sh start-jupyter.sh
# open browser with http://hostname:8888
```

## run
![nyc taxi demo](https://github.com/intel-innersource/frameworks.bigdata.AIDK/assets/4355494/03d8c2fe-de47-41f9-9fef-8513bc4aaf42)

## modify pipeline (add user defined function or remove operation)
``` python
def filter_with_cond(df):
    df = df[df['Year'] <= 2018]
    return df
operation = {
    "children": [6], # will to append this new op
    "next": [7], # who will be connected to this new op
    "inline_function": filter_with_cond, # function
}
pipeline.add_operation(operation)
```
``` python
operation = {
    "idx": 6, # OP id to be deleted
    "next": [10] # who is connected to the to_be_deleted op
}
pipeline.delete_operation(operation)
```

## export pipeline
``` python
pipeline.export(file_path = "exported_pipeline.json")
```
``` json
{
    "0": {
        "children": null,
        "op": "DataFrame",
        "config": "main_table"
    },
    "1": {
        "children": [0],
        "op": "type_infer",
        "config": [
            ["pickup_datetime",["is_string","is_datetime"]],
            ["pickup_longitude",["is_numeric","is_float"]],
            ["pickup_latitude",["is_numeric","is_float"]],
            ["dropoff_longitude",["is_numeric","is_float"]],
            ["dropoff_latitude",["is_numeric","is_float"]],
            ["passenger_count",["is_numeric","is_int64","is_integer","is_categorical"]]
        ]
    },
    "2": {
        "children": [1],
        "op": "tuple",
        "config": {
            "src": ["pickup_latitude","pickup_longitude"],"dst": "pickup_coordinates"
        }
    },
    "3": {
        "children": [2],
        "op": "tuple",
        "config": {
            "src": ["dropoff_latitude","dropoff_longitude"],"dst": "dropoff_coordinates"
        }
    },
    "4": {
        "children": [3],
        "op": "fillna",
        "config": {
            "pickup_longitude": -1,
            "pickup_latitude": -1,
            "dropoff_longitude": -1,
            "dropoff_latitude": -1,
            "passenger_count": -1
        }
    },
    "5": {
        "children": [4],
        "op": "datetime_feature",
        "config": {
            "pickup_datetime": [
                ["pickup_datetime__day",["featuretools.primitives.standard.transform.datetime.day", "Day"]],
                ["pickup_datetime__month",["featuretools.primitives.standard.transform.datetime.month","Month"]],
                ["pickup_datetime__weekday",["featuretools.primitives.standard.transform.datetime.weekday","Weekday"]],
                ["pickup_datetime__year",["featuretools.primitives.standard.transform.datetime.year","Year"]],
                ["pickup_datetime__hour",["featuretools.primitives.standard.transform.datetime.hour","Hour"]]
            ]
        }
    },
    "6": {
        "children": [5],
        "op": "haversine",
        "config": {
            "['pickup_coordinates', 'dropoff_coordinates']": ["haversine_pickup_coordinates_dropoff_coordinates",["featuretools.primitives.standard.transform.latlong.haversine","Haversine"]]
        }
    },
    "7": {
        "children": [6],
        "op": "drop",
        "config": [
            "pickup_datetime",
            "pickup_coordinates",
            "dropoff_coordinates"
        ]
    },
    "8": {
        "children": [7],
        "op": "lightgbm",
        "config": {
            "label": "fare_amount",
            "metrics": "rmse",
            "objective": "regression",
            "model_file": "lightgbm_regression_label.mdl",
            "method": "predict"
        }
    }
}
```


## Quick Example
* [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/xuechendi/de044d606e6cdc44d3005db2a0ae968d/recdp_fare_prediction.ipynb) - [nyc taxi fare](examples/notebooks/autofe/demo/nyc_taxi_workflow_train.ipynb) - geographic, datetime, lgbm regression

* [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/xuechendi/2b8acb3eda73028635072352560139ba/recdp_fraud_detect.ipynb) - [IBM Card Transaction Fraud Detect](examples/notebooks/autofe/demo/fraud_detect_train.ipynb) - onehot encode, lgbm, binary classification 

* [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/xuechendi/b039b8c0a40fec951b1b09c3fbb93a7b/recdp_social_media.ipynb) - [twitter recsys](examples/notebooks/autofe/demo/twitter_workflow_test.ipynb) - text nlp, datetime feature engineering  

* [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/xuechendi/1e08668ab0e15e1e98c592f284d79dad/recdp_click_through_rate_multiple_tables.ipynb) - [outbrain](examples/notebooks/autofe/demo/outbrain_ctr_workflow_test.ipynb) - multiple table joining

* [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/xuechendi/2f942c30dbf9b63a64cc819a61966e34/recdp_rating_prediction_amazon_review.ipynb) - [amazon product review](examples/notebooks/autofe/demo/amazon_product_review_test.ipynb) - text nlp, datetime, feature-cross

# More Examples - completed example including training

## Auto Feature Engineering vs. featuretools
* [NYC Taxi fare auto data prepration](examples/notebooks/autofe/FeatureWrangler.ipynb): An example to show how RecDP_v2.0 automatically generating datetime and geo features upon 55M records. Tested with both Spark and Pandas(featuretools) as compute engine, show 21x speedup by spark.

## load PIPELINE and execute
* [twitter pipeline re-load and execute](examples/notebooks/autofe/demo/custom_pipeline_twitter.ipynb): An example to show how RecDP_v2.0 reload pipeline from json and do execution - use RecDP as compute engine.

## Realtime Inference pipeline
* [nyc taxi fare](examples/notebooks/autofe/demo/nyc_taxi_workflow_predict.ipynb) - geographic, datetime feature engineering, lgbm

## Data Profiler Examples
* [NYC Taxi fare Profiler](resources/FeatureProfiler_NYC.png): An example to show RecDP_v2.0 to profile data, including infer the potential data type, generate data distribution charts.

* [twitter Profiler](resources/FeatureProfiler_recsys.png): An example to show RecDP_v2.0 to profile data, including infer the potential data type, generate data distribution charts.


## LICENSE
* Apache 2.0

## Dependency
* Spark 3.x
* python 3.*


