Metadata-Version: 2.1
Name: red-panda
Version: 1.0.0
Summary: Data science on the cloud
Home-page: https://github.com/jucyai/red-panda
Author: Jiachen Yao
Maintainer: Jiachen Yao
License: MIT
Project-URL: Code, https://github.com/jucyai/red-panda
Project-URL: Issue tracker, https://github.com/jucyai/red-panda/issues
Description: # Red Panda
        
        ![image](https://raw.githubusercontent.com/jucyai/red-panda/dev/red-panda.jpg)
        
        [![docs](https://readthedocs.org/projects/red-panda/badge/?version=dev)](https://red-panda.readthedocs.io/en/dev/?badge=dev)
        [![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/jucyai/red-panda/blob/master/LICENSE)
        
        Easily interact with the cloud (AWS) in your data science workflow.
        
        ## Features
        
        - DataFrame/files to and from S3 and Redshift.
        - Run queries on Redshift in Python.
        - Use built-in Redshift admin queries, such as checking running queries and errors.
        - Use Redshift utility functions to easily accomplish common tasks such as creating a table.
        - Manage files on S3.
        - Query data on S3 directly with Athena.
        - Pandas DataFrame utility functions.
        
        ## Installation
        
        ```sh
        pip install red-panda
        ```
        
        ## Using red-panda
        
        Import `RedPanda` from `red_panda` and create an instance. If you create the instance with `dryrun=True` (e.g. `rp = RedPanda(redshift_conf, aws_conf, dryrun=True)`), `red-panda` will print the planned queries instead of executing them.
        
        ```python
        from red_panda import RedPanda
        
        redshift_conf = {
            "user": "awesome-developer",
            "password": "strong-password",
            "host": "awesome-domain.us-east-1.redshift.amazonaws.com",
            "port": 5432,
            "dbname": "awesome-db",
        }
        
        aws_conf = {
            "aws_access_key_id": "your-aws-access-key-id",
            "aws_secret_access_key": "your-aws-secret-access-key",
            # "aws_session_token": "temporary-token-if-you-have-one",
        }
        
        rp = RedPanda(redshift_conf, aws_conf)
        ```
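        
        For example, a dry-run instance can be used to preview the generated SQL before anything touches the cluster (a minimal sketch; the query is illustrative):
        
        ```python
        # With dryrun=True, red-panda prints the planned queries instead of
        # executing them.
        rp_dry = RedPanda(redshift_conf, aws_conf, dryrun=True)
        rp_dry.run_query("drop table if exists test_table")  # printed, not run
        ```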
        
        Load your Pandas DataFrame into Redshift as a new table.
        
        ```python
        import pandas as pd
        
        df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
        
        s3_bucket = "s3-bucket-name"
        s3_path = "parent-folder/child-folder"  # optional; omit if you have no subfolders
        s3_file_name = "test.csv"  # optional; randomly generated if not provided
        rp.df_to_redshift(df, "test_table", bucket=s3_bucket, path=s3_path, append=False)
        ```
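        
        Calling `df_to_redshift` with `append=True` adds the rows to an existing table instead of recreating it (based on the parameter shown above):
        
        ```python
        # Append to test_table rather than dropping and recreating it.
        rp.df_to_redshift(df, "test_table", bucket=s3_bucket, path=s3_path, append=True)
        ```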
        
        It is also possible to:
        
        - Upload a DataFrame or flat file to S3.
        - Delete files from S3.
        - Load S3 data into Redshift.
        - Unload a Redshift query result to S3.
        - Obtain a Redshift query result as a DataFrame.
        - Run any query on Redshift.
        - Download an S3 file to a local file.
        - Read an S3 file into memory as a DataFrame.
        - Run built-in Redshift admin queries, such as getting running query information.
        - Use utility functions such as `create_table` to quickly create tables in Redshift.
        - Run queries against S3 data directly with Athena using `AthenaUtils`.
        - Use features separately with `RedshiftUtils`, `S3Utils`, `AthenaUtils`.
        
        ```python
        s3_key = s3_path + "/" + s3_file_name
        
        # DataFrame uploaded to S3
        rp.df_to_s3(df, s3_bucket, s3_key)
        
        # Delete a file on S3
        rp.delete_from_s3(s3_bucket, s3_key)
        
        # Upload a local file to S3
        df.to_csv("test_data.csv", index=False)
        rp.file_to_s3("test_data.csv", s3_bucket, s3_key)
        
        # Populate a Redshift table from S3 files
        # Use a dictionary for the column definition; here we minimally define only data_type
        redshift_column_definition = {
            "col1": {"data_type": "int"},
            "col2": {"data_type": "int"},
        }
        rp.s3_to_redshift(
            s3_bucket, s3_key, "test_table", column_definition=redshift_column_definition
        )
        
        # Unload Redshift query result to S3
        sql = "select * from test_table"
        rp.redshift_to_s3(sql, s3_bucket, s3_path + "/unload", prefix="unloadtest_")
        
        # Obtain Redshift query result as a DataFrame
        df = rp.redshift_to_df("select * from test_table")
        
        # Run queries on Redshift
        rp.run_query("create table test_table_copy as select * from test_table")
        
        # Download S3 file to local
        rp.s3_to_file(s3_bucket, s3_key, "local_file_name.csv")
        
        # Read S3 file in memory as DataFrame
        df = rp.s3_to_df(s3_bucket, s3_key, delimiter=",") # csv file in this example
        
        # Since we are only going to use Redshift functionalities, we can just use RedshiftUtils
        from red_panda.red_panda import RedshiftUtils
        ru = RedshiftUtils(redshift_conf)
        
        # Run built-in Redshift admin queries, such as checking load errors
        load_errors = ru.get_load_error(as_df=True)
        
        # Use utility functions such as create_table to quickly create tables in Redshift
        ru.create_table("test_table", redshift_column_definition, sortkey=["col2"], drop_first=True)
        ```
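        
        The Athena path is not shown above. The sketch below assumes `AthenaUtils` accepts the same AWS credentials dict and exposes a query method that returns a DataFrame; the method and parameter names here are assumptions, so check the API documentation for the exact signatures.
        
        ```python
        # Hypothetical sketch: run_query and s3_staging_dir are assumed names,
        # not the confirmed AthenaUtils API.
        from red_panda.red_panda import AthenaUtils
        
        au = AthenaUtils(aws_conf, s3_staging_dir="s3://s3-bucket-name/athena-results")
        df = au.run_query("select * from test_table limit 10")
        ```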
        
        For full API documentation, visit <https://red-panda.readthedocs.io/en/latest/>.
        
        ## TODO
        
        In no particular order:
        
        - Support more data formats for COPY; currently only delimited files are supported.
        - Support more data formats for S3 to DataFrame; currently only delimited files are supported.
        - Improve tests and docs.
        - Better ways of inferring Redshift data types from DataFrame dtypes.
        - Explore using `S3 Transfer Manager`'s `upload_fileobj` for `df_to_s3` to take advantage of automatic multipart upload (see the sketch after this list).
        - Add COPY from S3 manifest file, in addition to COPY from S3 source path.
        - Support multi-cloud.
        - Take advantage of Redshift slices for parallel processing. Split files for COPY.
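        
        For the `upload_fileobj` item above, a minimal sketch of the idea in plain `boto3` (the helper is illustrative, not the library's implementation, and assumes AWS credentials are already configured):
        
        ```python
        import io
        
        import boto3
        import pandas as pd
        from boto3.s3.transfer import TransferConfig
        
        def df_to_s3_multipart(df: pd.DataFrame, bucket: str, key: str) -> None:
            """Serialize a DataFrame to CSV in memory and stream it to S3."""
            buffer = io.BytesIO(df.to_csv(index=False).encode("utf-8"))
            # upload_fileobj goes through the S3 Transfer Manager, which switches
            # to multipart upload once the payload exceeds multipart_threshold.
            config = TransferConfig(multipart_threshold=8 * 1024 * 1024)
            boto3.client("s3").upload_fileobj(buffer, bucket, key, Config=config)
        ```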
        
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
