Metadata-Version: 2.1
Name: dslibrary
Version: 0.0.4
Summary: Data Science Framework & Abstractions
Home-page: UNKNOWN
License: Apache
Description: # DSLIBRARY
        
        ## Installation
        
            # normal install
            pip install dslibrary
        
            # to include a robust set of data connectors:
            pip install dslibrary[all]
        
        ## Data Science Framework and Abstraction of Data Details
        
        Data science code is supposed to focus on the data, but it frequently gets bogged down in repetitive tasks like
        juggling parameters, working out file formats, and connecting to cloud data sources.  This library proposes some ways
        to make those parts of life a little easier, and to make the resulting code a little shorter and more readable.
        
        Some of this project's goals:
         * make it possible to create 'situation agnostic' code which runs unchanged across many platforms, against many data
           sources and in many data formats
         * remove the need to code some of the most often repeated mundane chores, such as parameter parsing, read/write in
           different file formats with different formatting options, cloud data access
         * enhance the ability to run and test code locally
         * support higher security and cross-cloud data access
         * compatibility with mlflow.tracking, with the option to delegate to mlflow or not
        
        If you use dslibrary with no configuration, it falls back to the straightforward behaviors a person would expect
        while doing local development.  But it can be configured to operate in a wide range of environments.
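        
        For example, with no configuration in place, the following reads and writes ordinary local files and picks up
        parameters from the command line (a minimal sketch; the file names are placeholders):
        
            import dslibrary as dsl
        
            params = dsl.get_parameters()          # falls back to parsing command-line arguments
            df = dsl.load_dataframe("data.csv")    # reads ./data.csv as a plain local CSV file
            dsl.write_resource("out.csv", df)      # writes ./out.csv alongside it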
        
        
        ## Data Cleaning Example
        
        Here's a simple data cleaning example.  You can run it from the command line, or call its clean() method, and it
        will clip the values in a column of the supplied data to an upper limit.  But so far it only works on local files,
        it only supports one file format (CSV), and it uses all the default formatting options, which will not always work.
        
            # clean_it.py
            import pandas
        
            def clean(upper=100, input_file="in.csv", output_file="out.csv"):
                df = pandas.read_csv(input_file)
                df.loc[df.x > upper, 'x'] = upper
                # index=False: don't write the row index as an extra column
                df.to_csv(output_file, index=False)
        
            if __name__ == "__main__":
                # INSERT ARGUMENT PARSING CODE HERE
                clean(...)
        
        Here it is converted to use dslibrary:
        
            import dslibrary as dsl
        
            def clean(upper: float = 100, input_file="in.csv", output_file="out.csv"):
                df = dsl.load_dataframe(input_file)
                df.loc[df.x > upper, 'x'] = upper
                dsl.write_resource(output_file, df)
        
            if __name__ == "__main__":
                clean(**dsl.get_parameters())
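        
        Because the argument parsing and file I/O now live in dslibrary, the function is also easy to test directly with
        plain local files (the file names here are placeholders):
        
            import clean_it
        
            clean_it.clean(upper=10, input_file="test_in.csv", output_file="test_out.csv")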
        
        Now if we execute that code through dslibrary's ModelRunner class, we can point it to data in the cloud, and set
        different file formatting options:
        
            from dslibrary import ModelRunner
            import clean_it
        
            runner = ModelRunner()
            runner.set_parameter("upper", 50)
            runner.set_input("in.csv", "s3://bucket/raw.csv", format_options={"delim_whitespace": True})
            runner.set_output("out.csv", "s3://bucket/clipped.csv", format_options={"sep": "\t"})
        
            # run the method directly
            runner.run_method(clean_it.clean)
        
        Or we can invoke it as a subprocess:
        
            runner.run_local("path/to/clean_it.py")
        
        This will also work with notebooks:
        
            runner.run_local("path/to/clean_it.ipynb")
        
        
        ## More examples
        
        ### Report a metric about some data
        
        Report the average temperature of some data (by default the result is written as line-by-line JSON to
        "__metrics__" in the current working directory):
        
            import dslibrary as dsl
            data = dsl.load_dataframe("input")
            with dsl.start_run():
                dsl.log_metric("avg_temp", data.temperature.mean())
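        
        With no runner configured, this reads "input" from the current working directory and appends one JSON object per
        metric to a local "__metrics__" file.  A quick way to inspect that file (the exact record fields may vary by
        version):
        
            import json
        
            with open("__metrics__") as f:
                for line in f:
                    print(json.loads(line))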
        
        Call it with some SQL data:
        
            from dslibrary import ModelRunner
            runner = ModelRunner()
            runner.set_input(
                "input",
                uri="mysql://username:password@mysql-server/climate",
                sql="select temperature from readings order by timestamp desc limit 1000"
            )
            runner.run_local("avg_temp.py")
        
        Change format & filename for metrics output (format is implied by filename):
        
            runner.set_output(dslibrary.METRICS_ALIAS, "metrics.csv", format_options={"sep": "\t"})
        
        We could send the metrics to mlflow instead:
        
            runner = ModelRunner(mlflow=True)
        
        
        ## Data Security and Cross-Cloud Data
        
        The normal way of accessing data in the cloud is to store cloud service provider (CSP) credentials in, say,
        "~/.aws/credentials", so that the intervening library can read and write S3 buckets.  You have to make sure this
        setup is done, ensure the right packages are in your environment, and write your code accordingly.  Here are the
        main problems:
        
        ### Do you trust the code?
        
        The code often has access to those credentials.  Maybe you trust the code not to "lift" those credentials and use
        them elsewhere, maybe you don't.  Maybe you can ensure those credentials are locked down to no more than s3 bucket
        read access, or maybe you can't.  Even secret management systems will still expose the credentials to the code.
        
        The solution dslibrary facilitates is to have a different, trusted system perform the data access.  In dslibrary
        there is an extensible/customizable way to "transport" data access to another system.  By setting an environment 
        variable or two (one for the remote URL, another for an access token), the data read and write operations can be 
        managed by that other system.  Before executing the code, one sends the URIs, credentials and file format information 
        to the data access system.
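        
        For example, the orchestrating side might configure the target system and token before launching the code.  The
        variable names below are hypothetical, for illustration only; the transport classes described next provide the
        actual mechanism:
        
            import os
        
            # hypothetical names -- one variable for the remote URL, another for an access token
            os.environ["DSLIBRARY_TRANSPORT_URL"] = "https://data-access.internal/api"
            os.environ["DSLIBRARY_TRANSPORT_TOKEN"] = "<access token>"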
        
        The `transport.to_rest` class will send dslibrary calls to a REST service.
        
        The `transport.to_volume` class will send dslibrary calls through a shared volume to a Kubernetes sidecar.
        
        ### The setup is annoying
        
        It can be time consuming to ensure that every system running the code has this setup in place, and the same
        hardware may need to access different accounts for the same CSP.  And if code running in one CSP needs to access
        data in another CSP, there is no automated setup you can count on.
        
        The usual solution is to require all the data science code to add support for some particular cloud provider and
        to accept credentials as parameters.  It's a lot of overhead.
        
        The way dslibrary aims to help is by separating out
        all the information about a particular data source or target and providing ways to bundle and un-bundle it so that it
        can be sent where it is needed.  The data science code itself should not have to worry about these settings or need
        to change just because the data moved or changed format.
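        
        In practice, this bundling is what ModelRunner already does: the orchestrating code attaches URIs, credentials and
        format options to named inputs and outputs, and the data science code refers only to those names.  A sketch reusing
        the patterns from the examples above (the URIs, credentials and script name are placeholders):
        
            from dslibrary import ModelRunner
        
            runner = ModelRunner()
            # all connection details stay on the orchestrating side
            runner.set_input("input", uri="s3://bucket/raw.csv", format_options={"sep": "\t"})
            runner.set_output("output", uri="mysql://username:password@mysql-server/results")
            runner.run_local("some_model.py")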
        
        
        # COPYRIGHT
        
        (c) Accenture 2021
        
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
Provides-Extra: all
