Metadata-Version: 2.1
Name: ocean-spark-airflow-provider
Version: 0.1.0
Summary: An Airflow plugin and provider to launch and monitor Spark applications on Ocean for Spark
Home-page: https://spot.io/products/ocean-apache-spark/
License: Apache-2.0
Keywords: airflow,provider,spark,ocean
Author: Ocean for Spark authors
Author-email: some.mail@netapp.com
Requires-Python: >=3.6.15,<4.0.0
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: apache-airflow (>=1)
Requires-Dist: requests (>=2.0.0,<3.0.0)
Project-URL: Repository, https://github.com/spotinst/ocean-spark-airflow-provider
Description-Content-Type: text/markdown

# Airflow connector for Ocean Apache Spark

An Airflow plugin and provider to launch and monitor Spark
applications on [Ocean for
Spark](https://spot.io/products/ocean-apache-spark/).

## Compatibility

`ocean-spark-airflow-provider` is compatible with both Airflow 1 and
Airflow 2. It is detected as an Airflow plugin by Airflow 1 and up,
and as a provider by Airflow 2.


## Installation

```shell
pip install ocean-spark-airflow-provider
```

## Usage

For general usage of Ocean for Spark, refer to the [official
documentation](https://docs.spot.io/ocean-spark/getting-started/?id=get-started-with-ocean-for-apache-spark).

### Setting up the connection

In the connection menu, register a new connection of type **Ocean For
Spark**. The default connection name is `ocean_spark_default`. You will
need to have:

 - The Ocean Spark cluster ID of the cluster you just created (of the
   format `osc-e4089a00`). You can find it in the console in the
   [list of
   clusters](https://docs.spot.io/ocean-spark/product-tour/manage-clusters),
   or by calling the [Get Cluster
   List](https://docs.spot.io/api/#operation/OceanSparkClusterList)
   endpoint of the API.
 - [A Spot
   token](https://docs.spot.io/administration/api/create-api-token?id=create-an-api-token)
   to interact with the Spot API.
 
![connection setup dialog](./images/connection_setup.png) 

The **Ocean For Spark** connection type is not available for Airflow
1; instead, create an **HTTP** connection and enter your cluster ID as
the **host** and your API token as the **password**.

You will need to create a separate connection for every Ocean Spark
cluster that you plan to use with Airflow.  In the
`OceanSparkOperator`, you can select which Ocean Spark connection to
use with the `connection_name` argument (defaults to
`ocean_spark_default`).
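If you prefer to configure the connection without the UI, Airflow can also read a connection from an `AIRFLOW_CONN_<CONN_ID>` environment variable encoded as a URI. A minimal sketch, assuming the HTTP-connection fallback described above (the cluster ID and token below are placeholders, not real credentials):

```python
import os
from urllib.parse import quote

# Placeholders: substitute your own Ocean Spark cluster ID and Spot API token.
cluster_id = "osc-e4089a00"
api_token = "my-spot-api-token"

# Airflow reads connections from AIRFLOW_CONN_<CONN_ID> environment variables,
# encoded as a URI: <conn-type>://<login>:<password>@<host>.
# Here the cluster ID is the host and the API token is the password.
uri = f"http://:{quote(api_token)}@{cluster_id}"
os.environ["AIRFLOW_CONN_OCEAN_SPARK_DEFAULT"] = uri
```

The environment variable must be visible to every Airflow component (scheduler and workers) for the connection to resolve.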

### Using the operator

```python
from airflow import __version__ as airflow_version
if airflow_version.startswith("1."):
    # Airflow 1, import as plugin
    from airflow.operators.ocean_spark import OceanSparkOperator
else:
    # Airflow 2
    from ocean_spark.operators import OceanSparkOperator
    
# DAG creation (the `dag` object passed below is assumed to be defined here)
    
spark_pi_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi",
    dag=dag,
    config_overrides={
        "type": "Scala",
        "sparkVersion": "3.2.0",
        "image": "gcr.io/datamechanics/spark:platform-3.2-latest",
        "imagePullPolicy": "IfNotPresent",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
        "arguments": ["10000"],
        "driver": {
            "cores": 1,
        },
        "executor": {
            "cores": 2,
            "instances": 1,
        },
    },
)
```

More examples are available for [Airflow 1](./deploy/airflow1/example_dags) and [Airflow 2](./deploy/airflow2/dags).

## Test locally

You can test the plugin locally using the Docker Compose setup in this
repository. Run `make serve_airflow2` at the root of the repository to
launch an Airflow 2 instance with the provider already installed.

