# hydro 💧

[![main](https://github.com/christophergrant/delta-hydro/actions/workflows/main.yml/badge.svg)](https://github.com/christophergrant/delta-hydro/actions/workflows/main.yml)
[![codecov](https://codecov.io/gh/christophergrant/delta-hydro/branch/main/graph/badge.svg?token=Z64814CV1E)](https://codecov.io/gh/christophergrant/delta-hydro)

hydro is a collection of Python-based [Apache Spark](https://spark.apache.org/) and [Delta Lake](https://delta.io/) extensions.

See [Key Functionality](#key-functionality-) for concrete use cases.

## Warning ⚠️

hydro is well tested but not yet battle-hardened. Use it at your own risk.

## Installation

```commandline
pip install delta-hydro
```

## Docs 📖

https://christophergrant.github.io/delta-hydro

## Key Functionality 🔑

- De-duplicate a Delta Lake table, in-place, without a full overwrite - [hydro.delta.deduplicate](https://christophergrant.github.io/delta-hydro/api/delta.html#hydro.delta.deduplicate)
- Correctly perform [Slowly Changing Dimensions (SCD)](https://en.wikipedia.org/wiki/Slowly_changing_dimension) (types 1 or 2) on Delta Lake tables - [hydro.delta.scd](https://christophergrant.github.io/delta-hydro/api/delta.html#hydro.delta.scd) and [hydro.delta.bootstrap_scd2](https://christophergrant.github.io/delta-hydro/delta.html#hydro.delta.bootstrap_scd2)
- Issue queries against Delta Log metadata, quickly and efficiently getting things like partition sizes on huge tables - [hydro.delta.partition_stats](https://christophergrant.github.io/delta-hydro/api/delta.html#hydro.delta.partition_stats)
- Other quality of life improvements like [hydro.delta.detail_enhanced](https://christophergrant.github.io/delta-hydro/api/delta.html#hydro.delta.detail_enhanced) and [hydro.spark.fields](https://christophergrant.github.io/delta-hydro/api/spark.html#hydro.spark.fields)
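
For readers new to SCD, the difference between types 1 and 2 can be sketched in plain Python. This is only a conceptual illustration: the `Row` class and the `scd1` and `scd2` helpers are hypothetical names made up for this example, and they are not hydro's API or its implementation. See the linked docs for the real function signatures.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    # Hypothetical row shape, for illustration only
    key: str
    value: str
    start_ts: int
    end_ts: Optional[int] = None  # None marks the current version


def scd1(table: List[Row], key: str, value: str, ts: int) -> List[Row]:
    """Type 1: overwrite the existing row, keeping no history."""
    others = [r for r in table if r.key != key]
    return others + [Row(key, value, ts)]


def scd2(table: List[Row], key: str, value: str, ts: int) -> List[Row]:
    """Type 2: close the current version, then append the new one."""
    closed = [
        replace(r, end_ts=ts) if r.key == key and r.end_ts is None else r
        for r in table
    ]
    return closed + [Row(key, value, ts)]


table = [Row("cust-1", "Austin", start_ts=1)]
type1 = scd1(table, "cust-1", "Denver", ts=5)  # one row, old value gone
type2 = scd2(table, "cust-1", "Denver", ts=5)  # two rows, old one closed
```

Type 1 leaves a single row per key, while type 2 keeps the old row with its `end_ts` filled in, so the history of each key stays queryable.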


## Contributions ✨

Contributions are welcome.

Please [create an issue](https://github.com/christophergrant/delta-hydro/issues/new/choose) and discuss before starting work on a feature to make sure that it aligns with the future of the project.

## Naming 🤓

`hydro` is short for hydrologist: a person who studies water and its movement. Delta Lake, Data Lake, Lakehouse => water.

## ChatGPT and LLMs 🤖

Some of this project's code and documentation was generated by a [Large Language Model](https://en.wikipedia.org/wiki/Language_model) (LLM), namely [ChatGPT](https://chat.openai.com/chat).

We are proud prompt engineers, so we display the prompt that gave us the code in hydro's source ([example](https://github.com/christophergrant/delta-hydro/commit/8d2d84da4930f14caac62c46ea9a1c07a8bdeac4#diff-4665a0f13cae8eb34e13e308ee3935edf0a63f563ac6301038b0d15f95666446R11)).

## APIs

SQL vs. DataFrames is a hot topic in the data space.

SQL certainly has its place in analytic and other ad-hoc use cases, but it lacks the expressive power of an imperative language.

This project is a testament to the mix of imperative and declarative expression that DataFrames allow. Much of this code would be very verbose, or impossible, to express in SQL.
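
As a small illustration of that mix, the sketch below uses ordinary Python to build a declarative join condition over an arbitrary list of key columns. The `merge_condition` helper and its `target` and `source` aliases are hypothetical names chosen for this example, not part of hydro. The only Spark-specific detail is `<=>`, Spark SQL's null-safe equality operator.

```python
from typing import Sequence


def merge_condition(
    keys: Sequence[str], target: str = "target", source: str = "source"
) -> str:
    """Build a null-safe equality condition across every key column."""
    if not keys:
        raise ValueError("at least one key column is required")
    # One clause per key, joined imperatively into one declarative expression
    return " AND ".join(f"{target}.{k} <=> {source}.{k}" for k in keys)


condition = merge_condition(["id", "region"])
```

Writing this by hand in SQL means repeating one clause per key column. Here the column list is just data, so the same function works for two keys or twenty.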
