Metadata-Version: 2.1
Name: dslibrary
Version: 0.0.18
Summary: Data Science Framework & Abstractions
Home-page: UNKNOWN
License: Apache
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
Provides-Extra: all

# DSLIBRARY

## Installation

    # normal install
    pip install dslibrary

    # to include a robust set of data connectors:
    pip install dslibrary[all]

## Data Science Framework and Abstraction of Data Details

Data science code is supposed to focus on the data, but it frequently gets bogged down in repetitive tasks like
juggling parameters, working out file formats, and connecting to cloud data sources.  This library proposes some ways
to make those parts of life a little easier, and to make the resulting code a little shorter and more readable.

Some of this project's goals:
 * make it possible to create 'situation agnostic' code which runs unchanged across many platforms, against many data
   sources and in many data formats
 * remove the need to code some of the most frequently repeated chores, such as parameter parsing, reading and
   writing different file formats with different formatting options, and cloud data access
 * enhance the ability to run and test code locally
 * support higher security and cross-cloud data access
 * compatibility with mlflow.tracking, with the option to delegate to mlflow or not

With no configuration, dslibrary falls back to the straightforward behaviors you would expect during local
development.  It can also be configured to operate in a wide range of environments.


## Data Cleaning Example

Here's a simple data cleaning example.  You can run it from the command line, or call its clean() function, and it
will cap the values in a column of the supplied data at an upper limit.  But so far it only works on local files, it
only supports one file format (CSV), and it uses all the default formatting arguments, which will not always work.

    # clean_it.py
    import pandas

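    # 'in' is a reserved word in Python, so the file arguments are named in_file/out_file
    def clean(upper=100, in_file="in.csv", out_file="out.csv"):
        df = pandas.read_csv(in_file)
        df.loc[df.x > upper, 'x'] = upper
        df.to_csv(out_file)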

    if __name__ == "__main__":
        # INSERT ARGUMENT PARSING CODE HERE
        clean(...)
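
The argument parsing placeholder above is typically filled with something like the following sketch, which uses
only the standard library.  The option names and the keyword names `in_file` and `out_file` are assumptions made
for illustration (`in` itself is a reserved word in Python and cannot be used as an argument name).

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options into a dict of keyword arguments for clean()."""
    parser = argparse.ArgumentParser(description="Cap the values in column x of a CSV file.")
    parser.add_argument("--upper", type=float, default=100, help="maximum allowed value")
    # dest= is needed because 'in' cannot be used as an attribute or argument name
    parser.add_argument("--in", dest="in_file", default="in.csv", help="input CSV file")
    parser.add_argument("--out", dest="out_file", default="out.csv", help="output CSV file")
    # vars() turns the Namespace into a plain dict, ready for clean(**kwargs)
    return vars(parser.parse_args(argv))
```

The main block would then call `clean(**parse_args())`.  This is the kind of repeated boilerplate that dslibrary
aims to remove.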

Here it is converted to use dslibrary:

    import dslibrary as dsl

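    # 'in' is a reserved word in Python, so the file arguments are named in_file/out_file
    def clean(upper: float=100, in_file="in.csv", out_file="out.csv"):
        df = dsl.load_dataframe(in_file)
        df.loc[df.x > upper, 'x'] = upper
        dsl.write_resource(out_file, df)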

    if __name__ == "__main__":
        clean(**dsl.get_parameters())

Now if we execute that code through dslibrary's ModelRunner class, we can point it to data in the cloud, and set
different file formatting options:

    from dslibrary import ModelRunner
    import clean_it

    runner = ModelRunner()
    runner.set_parameter("upper", 50)
    runner.set_input("in.csv", "s3://bucket/raw.csv", format_options={"delim_whitespace": True})
    runner.set_output("out.csv", "s3://bucket/clipped.csv", format_options={"sep": "\t"})

    # run the method directly
    runner.run_method(clean_it.clean)

Or we can invoke it as a subprocess:

    runner.run_local("path/to/clean_it.py")

This will also work with notebooks:

    runner.run_local("path/to/clean_it.ipynb")


## More examples

### Report a metric about some data

Report the average temperature of some data.  By default, metrics are written as line-by-line JSON to "__metrics__"
in the current working directory:

    import dslibrary as dsl
    data = dsl.load_dataframe("input")
    with dsl.start_run():
        dsl.log_metric("avg_temp", data.temperature.mean())

Call it with some SQL data:

    from dslibrary import ModelRunner
    runner = ModelRunner()
    runner.set_input(
        "input",
        uri="mysql://username:password@mysql-server/climate",
        sql="select temperature from readings order by timestamp desc limit 1000"
    )
    runner.run_local("avg_temp.py")
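
After the run, the default line-by-line JSON output can be inspected with the standard library alone.  The sketch
below assumes only that each non-empty line is one JSON document.  The field names in the sample text are
hypothetical and may not match what dslibrary actually writes.

```python
import json


def read_metrics(text):
    """Parse line-by-line JSON: one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample content; the real field names may differ.
sample = '{"name": "avg_temp", "value": 21.5}\n{"name": "avg_temp", "value": 22.0}\n'
records = read_metrics(sample)
```

If "__metrics__" is a plain file, its contents could be passed in with `read_metrics(open("__metrics__").read())`.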

Change format & filename for metrics output (format is implied by filename):

    import dslibrary
    runner.set_output(dslibrary.METRICS_ALIAS, "metrics.csv", format_options={"sep": "\t"})

We could send the metrics to mlflow instead:

    runner = ModelRunner(mlflow=True)


## Data Security and Cross-Cloud Data

The usual way of accessing data in the cloud is to store cloud service provider (CSP) credentials in, say,
"~/.aws/credentials", so that the intervening library can read and write S3 buckets.  You have to make sure this setup
is done, that the right packages are in your environment, and that your code is written accordingly.  Here are the
main problems:

### Do you trust the code?

The code often has access to those credentials.  Maybe you trust the code not to "lift" those credentials and use
them elsewhere, maybe you don't.  Maybe you can ensure those credentials are locked down to no more than s3 bucket
read access, or maybe you can't.  Even secret management systems will still expose the credentials to the code.

The solution dslibrary facilitates is to have a different, trusted system perform the data access.  In dslibrary
there is an extensible/customizable way to "transport" data access to another system.  By setting an environment 
variable or two (one for the remote URL, another for an access token), the data read and write operations can be 
managed by that other system.  Before executing the code, one sends the URIs, credentials and file format information 
to the data access system.

The `transport.to_rest` class will send dslibrary calls to a REST service.

The `transport.to_volume` class will send dslibrary calls through a shared volume to a Kubernetes sidecar.
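
The general pattern can be sketched as follows.  This is not dslibrary's implementation: the class names and the
environment variable names `EXAMPLE_DATA_URL` and `EXAMPLE_DATA_TOKEN` are made up for illustration.  It only shows
how two environment variables can switch the code from direct access to delegated access without any change to the
data science code itself.

```python
import os


class LocalAccess:
    """Reads data directly; the calling code needs its own credentials."""

    def describe(self):
        return "local"


class RemoteAccess:
    """Forwards read/write requests to a trusted service (hypothetical stub)."""

    def __init__(self, url, token):
        self.url = url
        self.token = token

    def describe(self):
        # Only the access token is held here; data-source credentials stay with the service.
        return f"remote via {self.url}"


def choose_access(env=os.environ):
    # EXAMPLE_DATA_URL and EXAMPLE_DATA_TOKEN are made-up variable names.
    url = env.get("EXAMPLE_DATA_URL")
    token = env.get("EXAMPLE_DATA_TOKEN")
    if url and token:
        return RemoteAccess(url, token)
    return LocalAccess()
```

With neither variable set, `choose_access()` returns the local implementation, which matches the idea that an
unconfigured environment behaves like ordinary local development.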

### The setup is annoying

It can be time consuming to ensure that every system running the code has this credential configuration in place,
and one system may need to access multiple accounts for the same CSP.  If you are on one CSP trying to access data in
another, there is no automated setup you can count on.

The usual solution is to require that all the data science code add support for some particular cloud provider,
and accept credentials as parameters.  It's a lot of overhead.

The way dslibrary aims to help is by separating out all the information about a particular data source or target and
providing ways to bundle and un-bundle it so that it can be sent where it is needed.  The data science code itself
should not have to worry about these settings or need to change just because the data moved or changed format.
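
As a rough illustration of bundling and un-bundling, the sketch below keeps the description of one data source in a
small record and round-trips it through JSON.  The `ResourceSpec` class and the `bundle` and `unbundle` functions are
hypothetical names, not part of dslibrary.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ResourceSpec:
    """Everything needed to reach one data source or target (hypothetical structure)."""
    name: str
    uri: str
    format_options: dict = field(default_factory=dict)


def bundle(specs):
    """Serialize a list of specs to a JSON string that can be sent elsewhere."""
    return json.dumps([asdict(spec) for spec in specs])


def unbundle(text):
    """Rebuild the specs on the receiving side."""
    return [ResourceSpec(**item) for item in json.loads(text)]


specs = [ResourceSpec("in.csv", "s3://bucket/raw.csv", {"sep": "\t"})]
restored = unbundle(bundle(specs))
```

The code that consumes "in.csv" never sees the URI or the format options, so moving the data only changes the bundle.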


# COPYRIGHT

(c) Accenture 2021


