Metadata-Version: 2.1
Name: pyjanitor
Version: 0.29.2
Summary: Tools for cleaning pandas DataFrames
Home-page: https://github.com/pyjanitor-devs/pyjanitor
Author: pyjanitor devs
Author-email: ericmajinglong@gmail.com
License: MIT
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: AUTHORS.md
Requires-Dist: natsort
Requires-Dist: pandas_flavor
Requires-Dist: multipledispatch
Requires-Dist: scipy
Provides-Extra: dev
Requires-Dist: pip-tools; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: isort>=4.3.18; extra == "dev"
Requires-Dist: black>=19.3b0; extra == "dev"
Requires-Dist: darglint; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs; extra == "docs"
Requires-Dist: polars; extra == "docs"
Requires-Dist: mkdocs-material; extra == "docs"
Requires-Dist: mkdocstrings>=0.19.0; extra == "docs"
Requires-Dist: mkdocstrings-python; extra == "docs"
Requires-Dist: ipython>7.31.1; extra == "docs"
Requires-Dist: biopython; extra == "docs"
Requires-Dist: tqdm; extra == "docs"
Requires-Dist: unyt; extra == "docs"
Requires-Dist: pyspark; extra == "docs"
Provides-Extra: test
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: pytest-xdist; extra == "test"
Requires-Dist: pytest>=3.4.2; extra == "test"
Requires-Dist: hypothesis>=4.4.0; extra == "test"
Requires-Dist: interrogate; extra == "test"
Requires-Dist: pandas-vet; extra == "test"
Requires-Dist: polars; extra == "test"
Requires-Dist: py>=1.10.0; extra == "test"
Provides-Extra: biology
Requires-Dist: biopython; extra == "biology"
Provides-Extra: chemistry
Requires-Dist: tqdm; extra == "chemistry"
Provides-Extra: engineering
Requires-Dist: unyt; extra == "engineering"
Provides-Extra: spark
Requires-Dist: pyspark; extra == "spark"
Provides-Extra: all
Requires-Dist: mkdocs; extra == "all"
Requires-Dist: pytest>=3.4.2; extra == "all"
Requires-Dist: tqdm; extra == "all"
Requires-Dist: interrogate; extra == "all"
Requires-Dist: polars; extra == "all"
Requires-Dist: black>=19.3b0; extra == "all"
Requires-Dist: pytest-xdist; extra == "all"
Requires-Dist: ipython>7.31.1; extra == "all"
Requires-Dist: pytest-cov; extra == "all"
Requires-Dist: pyspark; extra == "all"
Requires-Dist: pandas-vet; extra == "all"
Requires-Dist: isort>=4.3.18; extra == "all"
Requires-Dist: py>=1.10.0; extra == "all"
Requires-Dist: mkdocstrings-python; extra == "all"
Requires-Dist: hypothesis>=4.4.0; extra == "all"
Requires-Dist: biopython; extra == "all"
Requires-Dist: pip-tools; extra == "all"
Requires-Dist: flake8; extra == "all"
Requires-Dist: pre-commit; extra == "all"
Requires-Dist: mkdocs-material; extra == "all"
Requires-Dist: unyt; extra == "all"
Requires-Dist: mkdocstrings>=0.19.0; extra == "all"
Requires-Dist: darglint; extra == "all"


`pyjanitor` is a Python implementation of the R package [`janitor`][janitor], and
provides a clean API for cleaning data.

[janitor]: https://github.com/sfirke/janitor

## Quick start

- Installation: `conda install -c conda-forge pyjanitor`. Read more installation instructions [here](https://pyjanitor-devs.github.io/pyjanitor/#installation).
- Check out the collection of [general functions](https://pyjanitor-devs.github.io/pyjanitor/api/functions/).

## Why janitor?

Originally a port of the R package,
`pyjanitor` has evolved from a set of convenient data cleaning routines
into an experiment with the [`method chaining`][mc] paradigm.

[mc]: https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69

Data preprocessing usually consists of a series of steps
that involve transforming raw data into an understandable/usable format.
These series of steps need to be run in a certain sequence to achieve success.
We take a base data file as the starting point,
and perform actions on it,
such as removing null/empty rows,
replacing them with other values,
adding/renaming/removing columns of data,
filtering rows and others.
More formally, these steps along with their relationships
and dependencies are commonly referred to as a Directed Acyclic Graph (DAG).

The `pandas` API has been invaluable for the Python data science ecosystem,
and implements method chaining of a subset of methods as part of the API.
For example, resetting indexes (`.reset_index()`),
dropping null values (`.dropna()`), and more,
are accomplished via the appropriate `pd.DataFrame` method calls.

Inspired by the ease-of-use
and expressiveness of the `dplyr` package
of the R statistical language ecosystem,
we have evolved `pyjanitor` into a language
for expressing the data processing DAG for `pandas` users.

## Installation

`pyjanitor` is currently installable from PyPI:

```bash
pip install pyjanitor
```

`pyjanitor` also can be installed by the conda package manager:

```bash
conda install pyjanitor -c conda-forge
```

`pyjanitor` can be installed by the pipenv environment manager too. This requires enabling prerelease dependencies:

```bash
pipenv install --pre pyjanitor
```

`pyjanitor` requires Python 3.6+.

## Functionality

Current functionality includes:

- Cleaning columns name (multi-indexes are possible!)
- Removing empty rows and columns
- Identifying duplicate entries
- Encoding columns as categorical
- Splitting your data into features and targets (for machine learning)
- Adding, removing, and renaming columns
- Coalesce multiple columns into a single column
- Date conversions (from matlab, excel, unix) to Python datetime format
- Expand a single column that has delimited, categorical values
  into dummy-encoded variables
- Concatenating and deconcatenating columns, based on a delimiter
- Syntactic sugar for filtering the dataframe based on queries on a column
- Experimental submodules for finance, biology, chemistry, engineering, and pyspark
