Metadata-Version: 2.1
Name: data-diff
Version: 0.9.8
Summary: Command-line tool and Python library to efficiently diff rows across two different databases.
Home-page: https://github.com/datafold/data-diff
License: MIT
Author: Datafold
Author-email: data-diff@datafold.com
Requires-Python: >=3.8.0,<4.0.0
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Database :: Database Engines/Servers
Classifier: Typing :: Typed
Provides-Extra: clickhouse
Provides-Extra: duckdb
Provides-Extra: mssql
Provides-Extra: mysql
Provides-Extra: oracle
Provides-Extra: postgresql
Provides-Extra: preql
Provides-Extra: presto
Provides-Extra: redshift
Provides-Extra: snowflake
Provides-Extra: trino
Provides-Extra: vertica
Requires-Dist: attrs (>=23.1.0,<24.0.0)
Requires-Dist: click (>=8.1,<9.0)
Requires-Dist: clickhouse-driver ; extra == "clickhouse"
Requires-Dist: cryptography ; extra == "snowflake"
Requires-Dist: dbt-core (>=1.0.0,<2.0.0)
Requires-Dist: dsnparse (<0.2.0)
Requires-Dist: duckdb ; extra == "duckdb"
Requires-Dist: keyring
Requires-Dist: mashumaro[msgpack] (>=3.8.1,<3.9.0)
Requires-Dist: mysql-connector-python (==8.0.29) ; extra == "mysql"
Requires-Dist: oracledb ; extra == "oracle"
Requires-Dist: preql (>=0.2.19,<0.3.0) ; extra == "preql"
Requires-Dist: presto-python-client ; extra == "presto"
Requires-Dist: psycopg2 ; extra == "postgresql" or extra == "redshift"
Requires-Dist: pydantic (==1.10.12)
Requires-Dist: pyodbc (>=4.0.39,<5.0.0) ; extra == "mssql"
Requires-Dist: rich
Requires-Dist: snowflake-connector-python (>=3.0.2,<4.0.0) ; extra == "snowflake"
Requires-Dist: tabulate (>=0.9.0,<0.10.0)
Requires-Dist: toml (>=0.10.2,<0.11.0)
Requires-Dist: trino (>=0.314.0,<0.315.0) ; extra == "trino"
Requires-Dist: typing-extensions (>=4.0.1)
Requires-Dist: urllib3 (<2)
Requires-Dist: vertica-python ; extra == "vertica"
Project-URL: Repository, https://github.com/datafold/data-diff
Description-Content-Type: text/markdown

<p align="center">
    <a href="https://datafold.com/"><img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="30%" /></a>
</p>

<h2 align="center">
data-diff: Compare datasets fast, within or across SQL databases

![data-diff-logo](docs/data-diff-logo.png)
</h2>
<br>

# Use Cases

## Data Migration & Replication Testing
Compare source to target and check for discrepancies when moving data between systems:
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)


## Data Development Testing
Test SQL code and preview changes by comparing development/staging environment data to production:
1. Make a change to some SQL code
2. Run the SQL code to create a new dataset
3. Compare the dataset with its production version or another iteration

  <p align="left">
  <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
  </p>

<details>
<summary> data-diff integrates with dbt Core to seamlessly compare local development to production datasets

 </summary>

![data-development-testing](docs/development_testing.png)

</details>

> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)

:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**

**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**

Also available in a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Datafold.datafold-vscode)

Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support


# How it works

When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

## `joindiff`
- Recommended for comparing data within the same database
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
- Fully relies on the underlying database engine for computation
- Requires both datasets to be queryable with a single SQL query
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset

## `hashdiff`
- Recommended for comparing datasets across different databases
- Can also be helpful in diffing very large tables with few expected differences within the same database
- Employs a divide-and-conquer algorithm based on hashing and binary search
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
- Time complexity approximates COUNT(*) operation when there are few differences
- Performance degrades when datasets have a large number of differences

More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)

# Get started

## Validating dbt model changes between dev and prod
⚡ Looking to use `data-diff` in dbt development? Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/open_source/) to get started!

## Compare data tables between databases
🔀 To compare data between databases, install `data-diff` with specific database adapters, e.g.:

```
pip install data-diff 'data-diff[postgresql,snowflake]' -U
```

Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:

```bash
data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  <table> \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  <TABLE> \
  -k <primary key column> \
  -c <columns to compare> \
  -w <filter condition>
```

Run `data-diff` with a `toml` configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:

```toml
## DATABASE CONNECTION ##
[database.duckdb_connection] 
  driver = "duckdb"
  # filepath = "datafold_demo.duckdb" # local duckdb file example
  # filepath = "md:" # default motherduck connection example
  filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
  database = "datafold_demo"

[database.snowflake_connection]
  driver = "snowflake"
  database = "DEV"
  user = "sung"
  password = "${SNOWFLAKE_PASSWORD}" # or "<PASSWORD_STRING>"
  # the info below is only required for snowflake
  account = "${ACCOUNT}" # by33919
  schema = "DEVELOPMENT"
  warehouse = "DEMO"
  role = "DEMO_ROLE"

## RUN PARAMETERS ##
[run.default]
  verbose = true

## EXAMPLE DATA DIFF JOB ##
[run.demo_xdb_diff]
  # Source 1 ("left")
  1.database = "duckdb_connection"
  1.table = "development.raw_orders"

  # Source 2 ("right")
  2.database = "snowflake_connection"
  2.table = "RAW_ORDERS" # note that snowflake table names are case-sensitive

  verbose = false
```

```bash
# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>

# run the configured data-diff job
data-diff --conf datadiff.toml \
  --run demo_xdb_diff \
  -k "id" \
  -c status

# output example
- 1, completed
+ 1, returned
```

Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.


# Supported databases


| Database      | Status | Connection string                                                                                                                   |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| PostgreSQL >=10 |  🟢   | `postgresql://<user>:<password>@<host>:5432/<database>`                                                                             |
| MySQL         |  🟢     | `mysql://<user>:<password>@<hostname>:5432/<database>`                                                                              |
| Snowflake     |  🟢     | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
| BigQuery      |  🟢     | `bigquery://<project>/<dataset>`                                                                                                    |
| Redshift      |  🟢     | `redshift://<username>:<password>@<hostname>:5439/<database>`                                                                       |
| DuckDB        |  🟢   | `duckdb://<dbname>@<filepath>`                                                                                          |
| MotherDuck        |  🟢   | `duckdb://<dbname>@<filepath>`                                                                                                   |
| Oracle        |  🟡   | `oracle://<username>:<password>@<hostname>/servive_or_sid`                                                                          |
| Presto        |  🟡   | `presto://<username>:<password>@<hostname>:8080/<database>`                                                                         |
| Databricks    |  🟡   | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>`                                                      |
| Trino         |  🟡   | `trino://<username>:<password>@<hostname>:8080/<database>`                                                                          |
| Clickhouse    |  🟡   | `clickhouse://<username>:<password>@<hostname>:9000/<database>`                                                                     |
| Vertica       |  🟡   | `vertica://<username>:<password>@<hostname>:5433/<database>`                                                                        |
| ElasticSearch |  📝    |                                                                                                                                     |
| Planetscale   |  📝    |                                                                                                                                     |
| Pinot         |  📝    |                                                                                                                                     |
| Druid         |  📝    |                                                                                                                                     |
| Kafka         |  📝    |                                                                                                                                     |
| SQLite        |  📝    |                                                                                                                                     |

* 🟢: Implemented and thoroughly tested.
* 🟡: Implemented, but not thoroughly tested yet.
* ⏳: Implementation in progress.
* 📝: Implementation planned. Contributions welcome.

Your database not listed here?

- Contribute a [new database adapter](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst) – we accept pull requests!
- [Get in touch](https://www.datafold.com/demo) about enterprise support and adding new adapters and features


<br>

## Contributors

We thank everyone who contributed so far!

<a href="https://github.com/datafold/data-diff/graphs/contributors">
  <img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />
</a>

<br>

## Analytics

* [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)

<br>

## License

This project is licensed under the terms of the [MIT License](https://github.com/datafold/data-diff/blob/master/LICENSE).

