Metadata-Version: 2.1
Name: stdflow
Version: 0.0.64
Summary: [alpha] A package that transforms your notebooks and Python files into pipeline steps by standardizing the data input / output.
Home-page: https://github.com/CR/stdflow
Author: Cyprien Ricque
Author-email: Cyprien Ricque <ricque.cyprien@gmail.com>
License: Apache Software License 2.0
Keywords: data science,data,flow,data flow
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

# stdflow

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Data flow tool that transforms your notebooks and Python files into
pipeline steps by standardizing the data input / output (for data
science projects).

Create clean data flow pipelines just by replacing your `pd.read_csv()`
and `df.to_csv()` with `sf.load()` and `sf.save()`.

## Install

``` sh
pip install stdflow
```

## How to use

### Pipelines

``` python
from stdflow import StepRunner
from stdflow.pipeline import Pipeline

# Pipeline with 2 steps

dm = "../demo_project/notebooks/"

ingestion_ppl = Pipeline([
    StepRunner(dm + "01_ingestion/countries.ipynb"), 
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
])

# === OR ===
ingestion_ppl = Pipeline(
    StepRunner(dm + "01_ingestion/countries.ipynb"), 
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
)

# === OR ===
ingestion_ppl = Pipeline()
ingestion_ppl.add_step(StepRunner(dm + "01_ingestion/countries.ipynb"))
# OR
ingestion_ppl.add_step(dm + "01_ingestion/world_happiness.ipynb")


ingestion_ppl
```


    ================================
                PIPELINE            
    ================================

    STEP 1
        path: ../demo_project/notebooks/01_ingestion/countries.ipynb
        vars: {}

    STEP 2
        path: ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
        vars: {}

    ================================

Run the pipeline

``` python
ingestion_ppl.run()
```

    DEBUG:stdflow.environ_manager:setting variables {}
    DEBUG:stdflow.environ_manager:setting variables {}

    ===============================
        61.../demo_project/notebooks/01_ingestion/countries.ipynb
    ===============================
    Variables: {}
        Path: countries.ipynb
        Duration: 0 days 00:00:00.771219
        Env: {}
    Notebook executed successfully.


    ===============================
        61.../demo_project/notebooks/01_ingestion/world_happiness.ipynb
    ===============================
    Variables: {}
        Path: world_happiness.ipynb
        Duration: 0 days 00:00:00.644832
        Env: {}
    Notebook executed successfully.

## Load and save data

**Specify everything**

``` python
import stdflow as sf
import pandas as pd

# load data from ./data/twitter/france/step_raw/v_1/countries of the world.csv
df = sf.load(
   root="./data", 
   attrs=['twitter', 'france'], # or attrs='twitter/france'
   step='raw', 
   version='1', 
   file_name='countries of the world.csv',
   method=pd.read_csv  # or method='csv'
)

# export data to ./data/twitter/france/step_processed/v_1/countries.csv
sf.save(
   df, 
   root="./data", 
   attrs=['twitter', 'france'], 
   step='processed', 
   version='1', 
   file_name='countries.csv', 
   method=pd.DataFrame.to_csv  # or method='csv', or any function that takes the object to export as its first argument
)
```

Each time you perform a save, a `metadata.json` file is created in the
export folder. It keeps track of how your data was created, along with
other information.
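
To check what a save recorded, you can open the file with the standard
library. The sketch below is only an illustration: `read_metadata` and
`top_level_keys` are hypothetical helpers (not part of stdflow), the
folder path is a placeholder, and the schema of `metadata.json` is not
described here, so the code only lists top-level keys.

``` python
import json
from pathlib import Path
from typing import Any, List, Optional


def read_metadata(folder: str) -> Optional[Any]:
    """Hypothetical helper: parse metadata.json in `folder`, or return None if absent."""
    path = Path(folder) / "metadata.json"
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def top_level_keys(metadata: Any) -> List[str]:
    """Return the sorted top-level keys when the metadata is a JSON object."""
    return sorted(metadata) if isinstance(metadata, dict) else []


# Placeholder path: point it at whichever folder sf.save() wrote to
export_dir = "./data/twitter/france/step_processed/v_1"
metadata = read_metadata(export_dir)
print(top_level_keys(metadata))
```

If the folder has no `metadata.json`, the sketch prints an empty list
instead of raising.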

**More Convenient Method**

``` python
import stdflow as sf

# use package level default values
sf.root = "./data"
sf.attrs = ['twitter', 'france']  # if needed use attrs_in and attrs_out
sf.step_in = 'raw'
sf.step_out = 'processed'

df = sf.load()  
# ! root / attrs / step : used from default values set above
# ! version : the last version was automatically used. default: ":last"
# ! file_name : the file, alone in the folder, was automatically found
# ! method : was automatically used from the file extension

sf.save(df)
# ! root / attrs / step : used from default values set above
# ! version: used default %Y%m%d%H%M format
# ! file_name: used from the input (because only one file)
# ! method : inferred from file name
```
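
The two version defaults mentioned in the comments above can be pictured
with a small standard-library sketch. It does not use stdflow:
`default_version` and `last_version` are hypothetical helpers, and
comparing versions as plain strings is an assumption that may differ
from how stdflow actually resolves `":last"`.

``` python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_version(now: Optional[datetime] = None) -> str:
    """Hypothetical helper: build a timestamp version in %Y%m%d%H%M format."""
    return (now or datetime.now()).strftime("%Y%m%d%H%M")


def last_version(step_dir: Path) -> Optional[str]:
    """Hypothetical helper: return the greatest version among `v_*` sub-folders.

    Assumption: versions are compared as plain strings, which matches
    chronological order only for fixed-width timestamps.
    """
    versions = sorted(
        p.name[len("v_"):]
        for p in step_dir.iterdir()
        if p.is_dir() and p.name.startswith("v_")
    )
    return versions[-1] if versions else None


# Demo on a throwaway folder that mimics a step directory
with tempfile.TemporaryDirectory() as tmp:
    step_dir = Path(tmp)
    for v in ("202301010000", "202307040905"):
        (step_dir / f"v_{v}").mkdir()
    latest = last_version(step_dir)

print(default_version(datetime(2023, 7, 4, 9, 5)))  # 202307040905
print(latest)  # 202307040905
```

Fixed-width timestamps sort the same way as strings and as dates, which
is why a timestamp default pairs well with a "pick the last one" rule.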

Note that everything we did at package level can also be done with the
`Step` class:

``` python
from stdflow import Step

step = Step(root="./data", attrs=['twitter', 'france'], step_in='raw', step_out='processed')
# or set after
step.root = "./data"
# ...

df = step.load(version=':last', file_name=":auto", verbose=True)

step.save(df, verbose=True)
```

## Do not

- Save in the same directory from different steps, because this will
  erase the metadata from the previous step.

## Data visualization

``` python
import stdflow as sf
sf.save({'what?': "very cool data"},..., export_viz_tool=True) # exports viz folder
```

This command exports a `metadata_viz` folder next to the data you
exported. The metadata to display is saved in the `metadata.json` file.

To display it, you need both the file and the folder on your local
machine (download them if you are working on a server).

Then open the HTML file from your file explorer. It should open in your
browser and let you upload the `metadata.json` file.

## Data Organization

### Format

Data folder organization is systematic and is used by the load and save
functions. It follows this format:

`root_data_folder/attrs_1/attrs_2/…/attrs_n/step_name/version/file_name`

where:

- root_data_folder: is the path to the root of your data folder, and is
  not exported in the metadata
- attrs: information to classify your dataset (e.g. country, client, …)
- step_name: name of the step. always starts with `step_`
- version: version of the step. always starts with `v_`
- file_name: name of the file. can be anything

Each folder is the output of a step. It contains a `metadata.json` file
with information about all files in the folder and how they were
generated. It can also contain an HTML page (if you set
`html_export=True` in `save()`) that lets you visualize the pipeline and
your metadata.
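
The layout described in this section can be sketched with `pathlib`.
This is an illustration only: `build_data_path` is a hypothetical helper
(not part of stdflow) that applies the `step_` and `v_` prefixes listed
above.

``` python
from pathlib import Path
from typing import Sequence, Union


def build_data_path(
    root: str,
    attrs: Union[str, Sequence[str]],
    step: str,
    version: str,
    file_name: str,
) -> Path:
    """Hypothetical helper: assemble root/attrs.../step_<step>/v_<version>/file_name."""
    # attrs may be given as a list or as a single "a/b" string
    attr_parts = attrs.split("/") if isinstance(attrs, str) else list(attrs)
    return Path(root).joinpath(*attr_parts, f"step_{step}", f"v_{version}", file_name)


path = build_data_path(
    "./data", ["twitter", "france"], "raw", "1", "countries of the world.csv"
)
print(path.as_posix())  # data/twitter/france/step_raw/v_1/countries of the world.csv
```

Note that `pathlib` drops the leading `./` when it normalizes the root.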

### Pipeline

A pipeline is composed of steps. Each step should export its data with
the `export_tabular_data` function, which performs the export in a
standard way. A step can be:

- a file: Jupyter notebook
- a Python file (upcoming)
- a Python function (upcoming)

### Recommended steps

You can set up any step you want. However, just like any tool, there are
good, bad, and common ways to use it.

The recommended way to use it is:

1.  Load
    - Use a custom load function to load your raw datasets if needed
    - Fix column names
    - Fix values
      - Except those for which you would like to test multiple methods
        that impact ML models.
    - Fix column types
2.  Merge
    - Merge data from multiple sources
3.  Transform
    - Pre-processing step along with most plots and analysis
4.  Feature engineering (a step that is likely to see many iterations).
    *The output of this step goes into the model.*
    - Create features
    - Fill missing values
5.  Model
    - This step likely contains a grid search and therefore outputs
      multiple resulting datasets
    - Train model
    - Evaluate model (or moved to a separate step)
    - Save model

**Best Practices**:

- Do not use `sf.reset` as part of your final code.
- In one step, export only to one path (except the version), meaning for
  one step only one combination of attrs and step_name.
- Do not set sub-dirs within the export (i.e. the version folder is the
  last depth). If you need a similar operation for different datasets,
  create pipelines.

## Tests

Tests are run with pytest. Run it from the project root:

`pytest`
