Metadata-Version: 2.1
Name: data-check
Version: 0.15.0
Summary: simple data validation
Home-page: https://andrjas.github.io/data_check/
License: MIT
Keywords: data,validation,testing,quality
Author: Andreas Rjasanow
Author-email: andrjas@gmail.com
Requires-Python: >=3.8,<3.12
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Other Audience
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Database
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Testing
Provides-Extra: mssql
Provides-Extra: mysql
Provides-Extra: oracle
Provides-Extra: postgres
Requires-Dist: Faker (==17.0.0)
Requires-Dist: Jinja2 (==3.1.2)
Requires-Dist: SQLAlchemy (==1.4.46)
Requires-Dist: click (==8.1.3)
Requires-Dist: click-default-group (==1.2.2)
Requires-Dist: colorama (==0.4.6)
Requires-Dist: cx_Oracle (==8.3.0); extra == "oracle"
Requires-Dist: numpy (==1.24.2)
Requires-Dist: openpyxl (==3.1.1)
Requires-Dist: pandas (==1.5.3)
Requires-Dist: psycopg2-binary (==2.9.5); extra == "postgres"
Requires-Dist: pymysql[rsa] (==1.0.2); extra == "mysql"
Requires-Dist: pyodbc (==4.0.35); extra == "mssql"
Requires-Dist: python-dateutil (==2.8.2)
Requires-Dist: pyyaml (==6.0)
Project-URL: Repository, https://github.com/andrjas/data_check
Description-Content-Type: text/markdown

# data_check

data_check is a simple data validation tool. In its most basic form it will execute SQL queries and compare the results against CSV or Excel files. But there are more advanced features:

## Features

* [CSV checks](https://andrjas.github.io/data_check/csv_checks/): compare SQL queries against CSV files
* Excel support: Use Excel (xlsx) instead of CSV
* multiple environments (databases) in the configuration file
* [populate tables](https://andrjas.github.io/data_check/loading_data/) from CSV or Excel files
* [execute any SQL files on a database](https://andrjas.github.io/data_check/sql/)
* more complex [pipelines](https://andrjas.github.io/data_check/pipelines/)
* run any script/command (via pipelines)
* simplified checks for [empty datasets](https://andrjas.github.io/data_check/csv_checks/#empty-dataset-checks) and [full table comparison](https://andrjas.github.io/data_check/csv_checks/#full-table-checks)
* [lookups](https://andrjas.github.io/data_check/csv_checks/#lookups) to reuse the same data in multiple queries
* [test data generation](https://andrjas.github.io/data_check/test_data/)

## Database support

data_check should work with any database that works with [SQLAlchemy](https://docs.sqlalchemy.org/en/14/dialects/). Currently data_check is tested against PostgreSQL, MySQL, SQLite, Oracle and Microsoft SQL Server.

## Quickstart

You need Python 3.8 or above to run data_check. The easiest way to install data_check is via [pipx](https://github.com/pipxproject/pipx):

`pipx install data-check`

The data_check Git repository is also a sample data_check project. Clone the repository, switch to the folder and run data_check:

```
git clone git@github.com:andrjas/data_check.git
cd data_check/example
data_check
```

This will run the tests in the _checks_ folder using the default connection as set in data_check.yml.

See the [documentation](https://andrjas.github.io/data_check) how to install data_check in different environments with additional database drivers and other usages of data_check.

## Project layout

data_check has a simple layout for projects: a single configuration file and a folder with the test files. You can also organize the test files in subfolders.

    data_check.yml    # The configuration file
    checks/           # Default folder for data tests
        some_test.sql # SQL file with the query to run against the database
        some_test.csv # CSV file with the expected result
        subfolder/    # Tests can be nested in subfolders

## CSV checks

This is the default mode when running data_check. data_check expects a SQL file and a CSV file. The SQL file will be executed against the database and the result is compared with the CSV file. If they match, the test is passed, otherwise it fails.

## Pipelines

If data_check finds a file named _data\_check\_pipeline.yml_ in a folder, it will treat this folder as a pipeline check. Instead of running [CSV checks](#csv-checks) it will execute the steps in the YAML file.

Example project with a pipeline:

    data_check.yml
    checks/
        some_test.sql                # this test will run in parallel to the pipeline test
        some_test.csv
        sample_pipeline/
            data_check_pipeline.yml  # configuration for the pipeline
            data/
                my_schema.some_table.csv       # data for a table
            data2/
                some_data.csv        # other data
            some_checks/             # folder with CSV checks
                check1.sql
                check1.csl
                ...
            run_this.sql             # a SQL file that will be executed
            cleanup.sql
        other_pipeline/              # you can have multiple pipelines that will run in parallel
            data_check_pipeline.yml
            ...

The file _sample\_pipeline/data\_check\_pipeline.yml_ can look like this:

```yaml
steps:
    # this will truncate the table my_schema.some_table and load it with the data from data/my_schema.some_table.csv
    - load: data
    # this will execute the SQL statement in run_this.sql
    - sql: run_this.sql
    # this will append the data from data2/some_data.csv to my_schema.other_table
    - load:
        file: data2/some_data.csv
        table: my_schema.other_table
        mode: append
    # this will run a python script and pass the connection name
    - cmd: "python3 /path/to/my_pipeline.py --connection {{CONNECTION}}"
    # this will run the CSV checks in the some_checks folder
    - check: some_checks
```

Pipeline checks and simple CSV checks can coexist in a project.

## Documentation

See the [documentation](https://andrjas.github.io/data_check) how to setup data_check, how to create a new project and more options.

## License

[MIT](LICENSE)

