Metadata-Version: 2.1
Name: data-raven
Version: 1.0.0
Summary: A Python framework for building data quality tests
Home-page: https://github.com/tekkabroley/data-raven
Author: Alex Broley
Author-email: alex.broley@gmail.com
License: UNKNOWN
Description: # data-raven
        
        ## Description
        A toolbox of flexible database connectors and test methods used to measure data integrity of datasets and database 
        tables.
        * Build data quality tests which can be inserted into an existing Python script or run as a stand-alone script. 
        * Send outcome notifications to messaging and logging applications. 
        * Halt pipelines and raise exceptions when needed.
        
        ## Prerequisites
        Python 3.6+
        
        sqlalchemy>=1.3.19
        
        psycopg2
        
        pymysql
        
        ## Installing
        `pip install data-raven`
        
        ## A simple data quality test script
        
        In this example we build a script to test the `name`, `price` and `product_id` columns from the Postgres table `Orders`.
        This table has the following DDL:
        ```buildoutcfg
        create table Orders (
        id int,
        name varchar(50),
        order_ts varchar(26),
        product_id int,
        price float
        );
        ```
        
        Here's the test script.
        ```buildoutcfg
        import os
        
        from dataraven.connections import PostgresConnector
        from dataraven.data_quality_operators import SQLNullCheckOperator
        
        
        def main():
            # initialize logging
            lazy_logger = lambda msg: print(msg + '\n')
        
            # database connection credentials
            user = os.environ["user"]
            password = os.environ["password"]
            host = os.environ["host"]
            dbname = os.environ["dbname"]
            port = os.environ["port"]
        
            # postgres database connector
            conn = PostgresConnector(user, password, host, dbname, port, logger=lazy_logger)
            dialect = "postgres"
        
            # test thresholds
            threshold0 = 0
            threshold1 = 0.01
            threshold5 = 0.05
        
            ##### TEST ORDERS TABLE #####
            # Table to be tested
            from_clause = "test_schema.Orders"
        
            # Conditional logic to be applied to input data
            date = "2020-09-08"
            where_clause = [f"date(order_ts) = '{date}'"]
        
            # Columns to be tested in target table
            columns = ("name", "product_id", "price")
        
            # Threshold value to be applied to each column
            threhold = {"name": threshold1, "product_id": threshold0, "price": threshold5}
        
            # Hard fail condition set on specific columns
            hard_fail = {"product_id": True}
        
            # Execute the null check test on each column in columns, on the above table
            SQLNullCheckOperator(conn, dialect, from_clause, threhold, *columns, where=where_clause, logger=lazy_logger,
                                 hard_fail=hard_fail)
        
        
        if __name__ == "__main__":
            main()
        ```
        
        # Documentation
        ## Database Support
        * Postgres
        * MySQL
        
        ## Data Quality Tests
        Data quality tests are used to measure the integrity of specified columns within a table or document. Every data 
        quality test will return `'test_pass'` or `'test_fail'` depending on the given measure and threshold.
        
        ### Data Quality Operators
        Each operator will log the test results using the function passed in the `logger` parameter. If no logger is found then
        these log messages will be swallowed. 
        
        Each operator has a `test_results` attribute which exposes the results from the underlying test. `test_results` is a 
        `dict` object with the following structure:
        ```buildoutcfg
        {
            COLUMN NAME: {
                "result": 'test_pass' or 'test_fail',
                "measure": THE MEASURED VALUE OF COLUMN NAME,
                "threshold": THE THRESHOLD VALUE SPECIFIED FOR TEST,
                "result_msg": TEST RESULT MESSAGE
            }
        }   
        ```
        
        #### SQL Operators
        All SQL operators have the following required parameters:
        * `conn` - The database connection object.
        * `dialect` - The SQL dialect for the given database. Accepted values are `postgres` or `mysql`.
        * `from_` - The schema and table name of table to be tested.
        * `threshold` - The threshold specified for a given test or collection of tests. This parameter can be numeric or a 
        `dict` object. If `threshold` is numeric then this value will be applied to all columns being tested by the operator. 
        If `threshold` is a `dict` then each `threshold` value will be referenced by column name. All columns being passed to the
        operator must have a specified threshold value. If `threshold` is a `dict` it must have the following structure:
        ```buildoutcfg
        {
            COLUMN NAME: NUMERIC VALUE
        }
        ```
        * `columns` - The column names entered as comma separated positional arguments.
        
        All SQL operators have the following optional parameters:
        * `logger` - The logging function. If None is passed then logged messages will be swallowed.
        * `where` - Conditional logic to be applied to table specified in `from_`.
        * `hard_fail` - Specifies if an operator which has a test which results in `'test_fail'` should terminate the current 
        process. This parameter
        can be passed as a literal or a `dict` object. If `hard_fail` is set to `True` then every test being performed by the 
        given operator which results in `'test_fail'` will terminate the current process. If `hard_fail` is a `dict` object then
        each `hard_fail` value will be referenced by column name. Only those columns with a `hard_fail` value of `True` will 
        terminate the process upon test failure. If `hard_fail` is a `dict` it must have the following structure:
        ```buildoutcfg
        {
            COLUMN NAME: BOOLEAN VALUE
        }
        ```
        * `use_ansi` - If true then compile measure query to ANSI standards.
         
        `SQLNullCheckOperator` - Test the proportion of null values for each column contained in `columns`. 
        
        `SQLDuplicateCheckOperator` - Test the proportion of duplicate values for each column contained in `columns`.
        
        `SQLSetDuplicateCheckOperator` - Test the number of duplicate values across all columns passed to the `columns`
        parameter simultaniously. This measure is equivalent to counting the number of rows returned from a `SELECT DISTINCT` on 
        all columns and dividing by the total number of rows.
        
        #### CSV Operators
        All CSV operators have the following required parameters:
        * `from_` - The path to CSV file to be tested.
        * `threshold` - Same as defined above for SQL operators.
        * `columns` - the column names entered as comma separated positional arguments.
        
        All CSV operators have the following optional parameters:
        * `delimiter` -  The delimiter used to separate values specified in the file refeneced by the `from_` parameter. 
        * `hard_fail` - Same as defined above for SQL operators.
        * `fieldnames` - A sequence of all column names for CSV file specified in `from_` parameter. To be used if the specified
        file does not have column headers.
        * `reducer_kwargs` - Key word arguments passed to the measure reducer function.
        
        `CSVNullCheckOperator` - Test the proportion of `NULL` values for each column contained in `columns`.
        
        `CSVDuplicateCheckOperator` - Test the proportion of duplicate values for each column contained in `columns`.
        
        `CSVSetDuplicateCheckOperator` - Test the number of duplicate values across all columns passed to the `columns`
        parameter simultaniously.
        
        #### Custom Operators
        `CustomSQLDQOperator` - Executes the test passed by the `custom_test` parameter on each column contained in `columns`. 
        The `CustomSQLDQOperator` class has the following required parameters:
        * `conn` - The database connection object.
        * `custom_test` - The SQL query to be executed. The `custom_test` query is required to return a column labeled `result` 
        which takes value `'test_pass'` or `'test_fail'`. The `custom_test` query should also return columns `measure`, which
        provides the measured column value, and `threshold`, which gives the threshold used in the test. If these columns are
        present then these values will be logged and returned in the `test_results` attribute. If `measure` and `threshold` are
        not returned by the `custom_test` query then these values will be logged as `None`, and will be given in the 
        `test_results` attribute as `None`. `custom_test` can also be a query template with placeholders `{column}` and 
        `{threshold}` for variable column names and threshold values.  
        * `description` - The description of the data quality test being performed. The description is may contain 
        placeholders `{column}` and `{threshold}` for the optional parameters `columns` and `threshold`, if they are passed
        to the `CustomSQLDQOperator`. In this case then a test description will be generated for each `column` in `columns` and
        for each value of `threshold`.
        
        The `CustomSQLDQOperator` class has the following optional parameters:
        * `columns` - a comma separated list of column arguments.
        * `threhsold` - Same as defined above for SQL operators.
        * `hard_fail` - Same as defined above for SQL operators.
        * `test_desc_kwargs` - Key word arguments for formatting the test description.
        
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
