Metadata-Version: 2.1
Name: pyspark-ds-toolbox
Version: 0.1.4
Summary: A Pyspark companion for data science tasks.
Home-page: https://github.com/viniciusmsousa/pyspark-ds-toolbox
License: GPL-3.0-only
Author: vinicius.sousa
Author-email: vinisousa04@gmail.com
Requires-Python: >=3.7.1,<3.10
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: h2o (>=3.34.0,<4.0.0)
Requires-Dist: matplotlib (>=3.5.1,<4.0.0)
Requires-Dist: numpy (==1.21.0)
Requires-Dist: pandas (>=1.3.4,<2.0.0)
Requires-Dist: pyarrow (>=6.0.1,<7.0.0)
Requires-Dist: pyspark (>=3.2)
Requires-Dist: seaborn (>=0.11.2,<0.12.0)
Requires-Dist: typeguard (>=2.13.2,<3.0.0)
Project-URL: Documentation, https://viniciusmsousa.github.io/pyspark-ds-toolbox/index.html
Project-URL: Repository, https://github.com/viniciusmsousa/pyspark-ds-toolbox
Description-Content-Type: text/markdown

# Pyspark DS Toolbox

<!-- badges: start -->
[![Lifecycle:
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![PyPI Latest Release](https://img.shields.io/pypi/v/pyspark-ds-toolbox.svg)](https://pypi.org/project/pyspark-ds-toolbox/)
[![CodeFactor](https://www.codefactor.io/repository/github/viniciusmsousa/pyspark-ds-toolbox/badge)](https://www.codefactor.io/repository/github/viniciusmsousa/pyspark-ds-toolbox)
[![Codecov test coverage](https://codecov.io/gh/viniciusmsousa/pyspark-ds-toolbox/branch/main/graph/badge.svg)](https://codecov.io/gh/viniciusmsousa/pyspark-ds-toolbox?branch=main)
[![Package Tests](https://github.com/viniciusmsousa/pyspark-ds-toolbox/actions/workflows/package-tests.yml/badge.svg)](https://github.com/viniciusmsousa/pyspark-ds-toolbox/actions)
<!-- badges: end -->


This package provides a set of tools to support day-to-day data science work with Spark. The documentation can be found [here](https://viniciusmsousa.github.io/pyspark-ds-toolbox/index.html).


## Installation

Directly from PyPI:
```
pip install pyspark-ds-toolbox
```

Or from GitHub:
```
pip install git+https://github.com/viniciusmsousa/pyspark-ds-toolbox.git
```
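
To confirm which version ended up in the active environment after either install command, a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8) could look like the following. The helper name `installed_version` is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "pyspark-ds-toolbox"):
    """Return the installed version string of a distribution, or None if absent.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version())
```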

## Organization

The package is currently organized by the nature of the task: data wrangling, model/prediction evaluation, and so on.

```
pyspark_ds_toolbox     # Main Package
├─ causal_inference    # Sub-package dedicated to Causal Inference
│  ├─ diff_in_diff.py   # Module Diff in Diff
│  └─ ps_matching.py    # Module Propensity Score Matching
├─ ml                  # Sub-package dedicated to ML
│  ├─ data_prep.py      # Module for Data Preparation
│  ├─ eval.py           # Module for model/prediction evaluation
│  └─ shap_values.py    # Module for estimating SHAP values
├─ wrangling.py        # Module for general Data Wrangling
└─ stats               # Sub-package dedicated to basic statistical functionality
   └─ association.py    # Association metrics module
```
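
The tree above maps directly onto dotted import paths. As a rough sketch, the snippet below lists those paths and checks whether each one can be located in the current environment. The module paths are read off the tree and assume it matches the installed release. The helper `module_available` is hypothetical, and no functions inside the modules are assumed here.

```python
import importlib.util

# Dotted module paths derived from the directory tree above (assumed to
# match the installed version of the package).
MODULES = [
    "pyspark_ds_toolbox.causal_inference.diff_in_diff",
    "pyspark_ds_toolbox.causal_inference.ps_matching",
    "pyspark_ds_toolbox.ml.data_prep",
    "pyspark_ds_toolbox.ml.eval",
    "pyspark_ds_toolbox.ml.shap_values",
    "pyspark_ds_toolbox.wrangling",
    "pyspark_ds_toolbox.stats.association",
]


def module_available(name: str) -> bool:
    """Hypothetical helper: True if `name` can be located, False otherwise.

    Note that find_spec imports the parent packages of a dotted name, so
    this also fails gracefully when the top-level package is not installed.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package cannot be imported.
        return False


for name in MODULES:
    status = "found" if module_available(name) else "missing"
    print(f"{name}: {status}")
```

Once a module is reported as found, it can be imported in the usual way, for example `from pyspark_ds_toolbox.ml import eval`. The functions each module exposes are described in the documentation linked above.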


