Metadata-Version: 2.1
Name: airflow-mcd
Version: 0.0.5
Summary: Monte Carlo's Apache Airflow Provider
Home-page: https://www.montecarlodata.com/
Author: Monte Carlo Data, Inc
Author-email: info@montecarlodata.com
License: Apache Software License (Apache 2.0)
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# airflow-mcd

Monte Carlo's alpha Airflow provider.

## Installation

This package requires Python 3.7 or greater and is compatible with Airflow 1.10.14 or greater.

You can install and update using pip. For instance:
```
pip install -U airflow-mcd
```

This package can be added like any other Python dependency to Airflow (e.g. via `requirements.txt`).

## Basic usage

### Hooks:

- **SessionHook**

    Creates a [pycarlo](https://pypi.org/project/pycarlo/) compatible session. This is useful 
    for creating your own operator built on top of our Python SDK.

    This hook expects an Airflow HTTP connection with the Monte Carlo API ID as the "login" and the API token as the
    "password".

    Alternatively, you can define both the Monte Carlo API ID and token in "extra" with the following format:
    ```
    {
        "mcd_id": "<ID>",
        "mcd_token": "<TOKEN>"
    }
    ```
    See [here](https://docs.getmontecarlo.com/docs/creating-an-api-token) for details on how to generate a token.
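    As a sketch, such a connection could be created with the Airflow 2.x CLI (the connection id `mcd_default_session` and the `<ID>`/`<TOKEN>` placeholders are illustrative — substitute your own values):
    ```
    # Create an HTTP connection holding the Monte Carlo API credentials.
    airflow connections add mcd_default_session \
        --conn-type http \
        --conn-login '<ID>' \
        --conn-password '<TOKEN>'
    ```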
  
### Operators:

- **BaseMcdOperator**

  This operator can be extended to build your own operator using our [SDK](https://pypi.org/project/pycarlo/) or any other 
  dependencies. This is useful if you want to implement your own custom logic (e.g. creating custom lineage after a task completes).
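  As a minimal sketch, a custom operator might combine `BaseMcdOperator` with the `SessionHook` and the SDK's client. The constructor signature and the hook's `get_conn()` returning a pycarlo-compatible session are assumptions here — check the source for the exact API:
  ```
  from pycarlo.core import Client

  from airflow_mcd.hooks import SessionHook
  from airflow_mcd.operators import BaseMcdOperator


  class PrintUserOperator(BaseMcdOperator):
      """Illustrative operator that echoes the API user's email."""

      def __init__(self, mcd_session_conn_id: str, **kwargs):
          super().__init__(mcd_session_conn_id=mcd_session_conn_id, **kwargs)
          self.mcd_session_conn_id = mcd_session_conn_id

      def execute(self, context):
          # get_conn() is assumed to return a pycarlo-compatible session.
          session = SessionHook(mcd_session_conn_id=self.mcd_session_conn_id).get_conn()
          client = Client(session=session)
          # Simple GraphQL query, as shown in the pycarlo README.
          print(client('query {getUser {email}}').get_user.email)
  ```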

- **SimpleCircuitBreakerOperator**
   
  This operator can be used to execute a circuit breaker compatible rule (custom SQL monitor) that runs integrity tests 
  before allowing any downstream tasks to execute. If the rule condition is in breach, the operator raises an
  `AirflowFailException` on Airflow versions newer than 1.10.11, as that exception is preferred for tasks that should
  fail without retrying; older Airflow versions raise an `AirflowException`. For instance:
  ```
  from datetime import datetime, timedelta
  
  from airflow import DAG
  
  try:
    from airflow.operators.bash import BashOperator
  except ImportError:
    # For Airflow versions earlier than 2.0.0, where the airflow.operators.bash module does not exist.
    from airflow.operators.bash_operator import BashOperator
  
  from airflow_mcd.operators import SimpleCircuitBreakerOperator
  
  mcd_connection_id = 'mcd_default_session'
  
  with DAG('sample-dag', start_date=datetime(2022, 2, 8), catchup=False, schedule_interval=timedelta(1)) as dag:
      task1 = BashOperator(
          task_id='example_elt_job_1',
          bash_command='echo I am transforming a very important table!',
      )
      breaker = SimpleCircuitBreakerOperator(
          task_id='example_circuit_breaker',
          mcd_session_conn_id=mcd_connection_id,
          rule_uuid='<RULE_UUID>'
      )
      task2 = BashOperator(
          task_id='example_elt_job_2',
          bash_command='echo I am building a very important dashboard from the table created in task1!',
          trigger_rule='none_failed'
      )
  
      task1 >> breaker >> task2
  ```
  This operator expects the following parameters:
    - `mcd_session_conn_id`: A SessionHook compatible connection.
    - `rule_uuid`: UUID of the rule (custom SQL monitor) to execute.

  The following parameters can also be passed:
   - `timeout_in_minutes` [default=5]: Polling timeout in minutes. Note that the Data Collector Lambda has a maximum
        timeout of 15 minutes when executing a query. Queries that take longer are not supported, so we recommend
        filtering the query output to improve performance (e.g. by limiting the WHERE clause). If you expect a query to
        take the full 15 minutes, we recommend padding the timeout to 20 minutes.
   - `fail_open` [default=True]: Prevent errors or timeouts when executing a rule from stopping your pipeline.
        Raises an `AirflowSkipException` if set to True and any issues are encountered. In this case we recommend
        setting the [trigger_rule](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#trigger-rules)
        param for any downstream tasks to `none_failed`.
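  For example, a breaker padded to a 20-minute timeout that skips (rather than fails) on errors might be configured as follows (the task id and rule UUID placeholder are illustrative):
  ```
  from airflow_mcd.operators import SimpleCircuitBreakerOperator

  breaker = SimpleCircuitBreakerOperator(
      task_id='padded_circuit_breaker',
      mcd_session_conn_id='mcd_default_session',
      rule_uuid='<RULE_UUID>',
      timeout_in_minutes=20,  # pad beyond the 15 minute Lambda maximum
      fail_open=True,         # skip (AirflowSkipException) on errors or timeouts
  )
  ```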

## Tests and releases

Locally, `make test` will run all tests. See [README-dev.md](README-dev.md) for additional details on development. When 
ready for review, create a PR against main.

When ready to release, create a new GitHub release with a tag using semantic versioning (e.g. v0.42.0) and CircleCI will 
test and publish to PyPI. Note that an existing version will not be redeployed.

## License

Apache 2.0 - See the [LICENSE](http://www.apache.org/licenses/LICENSE-2.0) for more information.

