Metadata-Version: 2.1
Name: airflow-provider-datarobot
Version: 0.0.4
Summary: DataRobot Airflow provider.
Home-page: http://www.datarobot.com/
Author: Andrius Senulis
Author-email: andrius.senulis@datarobot.com
License: Apache License 2.0
Requires-Python: ~=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# DataRobot Provider for Apache Airflow

This package provides operators, sensors, and a hook that integrates [DataRobot](https://www.datarobot.com) into Apache Airflow.
Using these components you should be able to build the essential DataRobot pipeline - create a project, train models, deploy a model,
score predictions against the model deployment.


## Installation

Prerequisites:
- [Apache Airflow](https://pypi.org/project/apache-airflow/)
- [DataRobot Python API client](https://pypi.org/project/datarobot/)

Install the DataRobot provider:
```
pip install airflow-provider-datarobot
```


## Connection

In the Airflow user interface, create a new DataRobot connection in `Admin > Connections`:

* Connection Type: `DataRobot`
* Connection Id: `datarobot_default` (default)
* API Key: `your-datarobot-api-key`
* DataRobot endpoint URL: `https://app.datarobot.com/api/v2` (default)

Create the API Key in the [DataRobot Developer Tools](https://app.datarobot.com/account/developer-tools) page, `API Keys` section (see [DataRobot Docs](https://app.datarobot.com/docs/api/api-quickstart/api-qs.html#create-a-datarobot-api-token) for more details).

By default, all components use `datarobot_default` connection ID.


## Config JSON for dag run

Operators and sensors use parameters from the [config](https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html?highlight=config#Named%20Arguments_repeat21)
which must be submitted when triggering the dag. Example config JSON with required parameters:

    {
        "training_data": "s3-presigned-url-or-local-path-to-training-data",
        "project_name": "Project created from Airflow",
        "autopilot_settings": {
            "target": "readmitted"
        },
        "deployment_label": "Deployment created from Airflow",
        "score_settings": {
            "intake_settings": {
                "type": "s3",
                "url": "s3://path/to/scoring-data/Diabetes10k.csv",
                "credential_id": "62160b511fb29da8dd5f2c81"
            },
            "output_settings": {
                "type": "s3",
                "url": "s3://path/to/results-dir/Diabetes10k_predictions.csv",
                "credential_id": "62160b511fb29da8dd5f2c81"
            }
        }
    }
    

These config values can be accessed in the `execute()` method of any operator the dag 
in the `context["params"]` variable, e.g. getting a training data you would use this in the operator:

    def execute(self, context: Dict[str, Any]) -> str:
        ...
        training_data = context["params"]["training_data"]
        ...


## Modules

### [Operators](https://github.com/datarobot/airflow-provider-datarobot/blob/main/datarobot_provider/operators/datarobot.py)

- `CreateProjectOperator`

    Creates a DataRobot project and returns its ID.
 
    Required config params:

        training_data: str - pre-signed S3 URL or local path to training dataset
        project_name: str - project name

    In case of an S3 input, the `training_data` value must be a [pre-signed AWS S3 URL](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html).

    For more [project settings](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#project) see the DataRobot docs.

    Returns a project ID.

- `TrainModelsOperator`

    Triggers DataRobot Autopilot to train models.

    Parameters:

        project_id: str - DataRobot project ID

    Required config params:

        "autopilot_settings": {
            "target": "readmitted"
        } 
         
    `target` is a required parameter with the column name which defines the modeling target.
    
    For more [autopilot settings](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#datarobot.models.Project.set_target) see the DataRobot docs.

    Returns `None`.

- `DeployModelOperator`

    Deploys a specified model.

    Parameters:

        model_id: str - DataRobot model ID

    Required config params:

        deployment_label - deployment label/name

    For more [deployment settings](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#datarobot.models.Deployment) see the DataRobot docs.

    Returns a deployment ID.

- `DeployRecommendedModelOperator`

    Deploys a recommended model.

    Parameters:

        project_id: str - DataRobot project ID

    Required config params:

        deployment_label: str - deployment label

    For more [deployment settings](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#datarobot.models.Deployment) see the DataRobot docs.

    Returns a deployment ID.

- `ScorePredictionsOperator`

    Scores batch predictions against the deployment.

    Prerequisites:
    - [S3 credentials added to DataRobot via Python API client](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/reference/admin/credentials.html#s3-credentials).
      You need the `creds.credential_id` for the `credential_id` parameter in the config.
    - OR a Dataset ID in the AI Catalog

    Parameters:
  
        deployment_id: str - DataRobot project ID

    Required config params:

        "score_settings": {
            "intake_settings": {
                "type": "s3",
                "url": "s3://my-bucket/Diabetes10k.csv",
                "credential_id": "62160b511fb29da8dd5f2c81"
            },
            "output_settings": {
                "type": "s3",
                "url": "s3://my-bucket/Diabetes10k_predictions.csv",
                "credential_id": "62160b511fb29da8dd5f2c81"
            }
        }

    Config params for scoring a Dataset in the AI Catalog:

        "score_settings": {
            "intake_settings": {
                "type": "dataset",
                "dataset_id": "<datasetId>",
            },
            "output_settings": {
                ...
            }
        }
    
    For more [batch prediction settings](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#datarobot.models.BatchPredictionJob.score) see the DataRobot docs.

    Returns a batch prediction job ID.

- `GetTargetDriftOperator`

    Gets the target drift from a deployment.

    Parameters:

        deployment_id: str - DataRobot deployment ID

    No config params are required. [Optional params](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#datarobot.models.Deployment.get_target_drift) may be passed in the config as follows:

        "target_drift": {
            ...
        }

    Returns a dict with the target drift data.

- `GetFeatureDriftOperator`

    Gets the feature drift from a deployment.

    Parameters:

        deployment_id: str - DataRobot deployment ID

    No config params are required. [Optional params](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#datarobot.models.Deployment.get_feature_drift) may be passed in the config as follows:

        "feature_drift": {
            ...
        }

    Returns a dict with the feature drift data.

### [Sensors](https://github.com/datarobot/airflow-provider-datarobot/blob/main/datarobot_provider/sensors/datarobot.py)

- `AutopilotCompleteSensor`

    Checks whether the Autopilot has completed.

    Parameters:

        project_id: str - DataRobot project ID

- `ScoringCompleteSensor`

    Checks whether batch scoring has completed.

    Parameters:

        job_id: str - Batch prediction job ID

### [Hooks](https://github.com/datarobot/airflow-provider-datarobot/blob/main/datarobot_provider/hooks/datarobot.py)

- `DataRobotHook`

    A hook for initializing DataRobot Public API client.


## Pipeline

The modules described above allows to construct a standard DataRobot pipeline in an Airflow dag:

    create_project_op >> train_models_op >> autopilot_complete_sensor >> deploy_model_op >> score_predictions_op >> scoring_complete_sensor


## Examples

See the [**examples**](https://github.com/datarobot/airflow-provider-datarobot/blob/main/datarobot_provider/example_dags) directory for the example DAGs.


## Issues

Please submit [issues](https://github.com/datarobot/airflow-provider-datarobot/issues) and [pull requests](https://github.com/datarobot/airflow-provider-datarobot/pulls) in our official repo:
[https://github.com/datarobot/airflow-provider-datarobot](https://github.com/datarobot/airflow-provider-datarobot)

We are happy to hear from you. Please email any feedback to the authors at [support@datarobot.com](mailto:support@datarobot.com).


# Copyright Notice

Copyright 2022 DataRobot, Inc. and its affiliates.

All rights reserved.

This is proprietary source code of DataRobot, Inc. and its affiliates.

Released under the terms of DataRobot Tool and Utility Agreement.
