Metadata-Version: 2.1
Name: split-folders
Version: 0.4.0
Summary: Split folders with files (e.g. images) into training, validation and test (dataset) folders.
Home-page: https://github.com/jfilter/split-folders
License: MIT
Keywords: machine-learning,training-validation-test,datasets,folders
Author: Johannes Filter
Author-email: hi@jfilter.de
Requires-Python: >=3.6
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Utilities
Project-URL: Repository, https://github.com/jfilter/split-folders
Description-Content-Type: text/markdown

# `split-folders` [![Build Status](https://travis-ci.com/jfilter/split-folders.svg?branch=master)](https://travis-ci.com/jfilter/split-folders) [![PyPI](https://img.shields.io/pypi/v/split-folders.svg)](https://pypi.org/project/split-folders/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/split-folders.svg)](https://pypi.org/project/split-folders/)  [![PyPI - Downloads](https://img.shields.io/pypi/dm/split-folders)](https://pypistats.org/packages/split-folders)

Split folders with files (e.g. images) into **train**, **validation** and **test** (dataset) folders.

The input folder should have the following format:

```
input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...
```

In order to give you this:

```
output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
```

This should get you started to do some serious deep learning on your data. [Read here](https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set) why it's a good idea to split your data intro three different sets.

-   Split files into a training set and a validation set (and optionally a test set).
-   Works on any file types.
-   The files get shuffled.
-   A [seed](https://docs.python.org/3/library/random.html#random.seed) makes splits reproducible.
-   Allows randomized [oversampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) for imbalanced datasets.
-   Optionally group files by prefix.
-   (Should) work on all operating systems.

## Install

```bash
pip install split-folders
```

If you are working with a large amount of files, you may want to get a progress bar. Install [tqdm](https://github.com/tqdm/tqdm) in order to get visual updates for copying files.

```bash
pip install split-folders tqdm
```

## Usage

You can use `split-folders` as Python module or as a Command Line Interface (CLI).

If your datasets is balanced (each class has the same number of samples), choose `ratio` otherwise `fixed`. NB: oversampling is turned off by default.

### Module

```python
import splitfolders  # or import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
splitfolders.ratio("input_folder", output="output", seed=1337, ratio=(.8, .1, .1), group_prefix=None) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
splitfolders.fixed("input_folder", output="output", seed=1337, fixed=(100, 100), oversample=False, group_prefix=None) # default values
```

Occasionally you may have things that comprise more than a single file (e.g. picture (.png) + annotation (.txt)).
`splitfolders` lets you split files into equally-sized groups based on their prefix.
Set `group_prefix` to the length of the group (e.g. `2`).
But now *all* files should be part of groups.

### CLI

```
Usage:
    splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] folder_with_images
Options:
    --output        path to the output folder. defaults to `output`. Get created if non-existent.
    --ratio         the ratio to split. e.g. for train/val/test `.8 .1 .1` or for train/val `.8 .2`.
    --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                    the training set. e.g. for train/val/test `100 100` or for train/val `100`.
    --seed          set seed value for shuffling the items. defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets, works only with --fixed.
    --group_prefix  split files into equally-sized groups based on their prefix
Example:
    splitfolders --ratio .8 .1 .1 folder_with_images
```

Instead of the command `splitfolders` you can also use `split_folders` or `split-folders`.

## Development

Install and use [poetry](https://python-poetry.org/).

## Contributing

If you have a **question**, found a **bug** or want to propose a new **feature**, have a look at the [issues page](https://github.com/jfilter/split-folders/issues).

**Pull requests** are especially welcomed when they fix bugs or improve the code quality.


## License

MIT

