Metadata-Version: 2.1
Name: public-datasets
Version: 0.0.2
Summary: Public datasets loaders
Home-page: UNKNOWN
Author: DEEL
Author-email: collaborateurs.du.projet.deel@irt-saintexupery.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Public datasets plugin

This project implements the public datasets (CIFAR10 / SVHN / MNIST) plugin for the DEEL dataset manager.

A DEEL dataset plugin is an extension of the `Dataset` class defined in the [DEEL dataset manager project](https://github.com/deel-ai/deel_dataset_manager).
It provides access to specific dataset files through the `load` method and other defined modes.

The public datasets (CIFAR10 / SVHN / MNIST) plugin uses the default mode `path` to load the following files:

- MNIST:
    - `train-images-idx3-ubyte.gz`,
    - `train-labels-idx1-ubyte.gz`,
    - `t10k-images-idx3-ubyte.gz`,
    - `t10k-labels-idx1-ubyte.gz`,

- CIFAR10:
    - `cifar-10-python.tar.gz`,

- SVHN:
    - `housenumbers/train.tar.gz`,
    - `housenumbers/test.tar.gz`,
    - `housenumbers/extra.tar.gz`,

These files are downloaded over HTTP.

## Installation

The latest release can be installed from PyPI. All required Python packages are installed as dependencies.

```bash
pip install public-datasets
```

Otherwise, the package can be installed from the Git repository over SSH or HTTPS, but you will have to enter your credentials manually:

```bash
# SSH version (with proper SSH key setup):
pip install git+ssh://git@github.com/deel-ai/public_datasets.git

# HTTPS version:
pip install git+https://github.com/deel-ai/public_datasets.git
```

**Note:**

- CIFAR10 dataset loading name is `cifra10`,
- SVHN dataset loading name is `svhn`,
- MNIST dataset loading name is `mnist`.

## Examples of usage

### Basic usage

To load one of the public datasets (CIFAR10 / SVHN / MNIST), you can simply do:

```python
import deel.datasets

# Load the default mode of mnist dataset:
mnist_data_path = deel.datasets.load("mnist")

# Load the default mode of svhn dataset:
svhn_data_path = deel.datasets.load("svhn")

# Load the default mode of cifra10 dataset:
cifra10_data_path = deel.datasets.load("cifra10")
```

The `deel.datasets.load` function is the basic entry point to access the datasets.
By passing `with_info=True`, extra information can be retrieved as a Python
dictionary. This information is not standardized, so each dataset may provide
different entries.

The `mode` argument can be used to load different versions of the dataset. By default,
only the `path` mode is available; it returns the path to the local folder
containing the dataset.
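
Since the `path` mode returns a local folder, its content can be inspected with the standard library. The following is a minimal sketch, not part of the plugin: `list_dataset_files` is a hypothetical helper, and it assumes that the value returned by `deel.datasets.load` can be passed to `pathlib.Path`.

```python
from pathlib import Path


def list_dataset_files(dataset_path):
    """Return the files under dataset_path as sorted, relative POSIX paths.

    Hypothetical helper, not part of deel.datasets.
    """
    root = Path(dataset_path)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Usage with the plugin (requires deel.datasets to be installed; the
# returned value is assumed to be path-like):
#   import deel.datasets
#   for name in list_dataset_files(deel.datasets.load("mnist")):
#       print(name)
```

The helper only reads the folder, so it can be run safely on any dataset that has already been downloaded.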

### Command line utilities

The `deel-datasets` package comes with some command line utilities that can be accessed using:

```
python -m deel.datasets ARGS...
```

The `--help` option can be used to view the full capabilities of the command line program.
By default, the program uses the configuration at `$HOME/.deel/config.yml`, but the `-c`
argument can be used to specify a custom configuration file.
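
The lookup rule described above can be mirrored in a script that needs to know which configuration file is in effect. This is only a sketch of that rule under stated assumptions: `resolve_config_path` is a hypothetical helper, and it does not reproduce how the command line program itself resolves its configuration.

```python
from pathlib import Path


def resolve_config_path(custom=None):
    """Return the configuration file path following the rule described above.

    Hypothetical helper: uses `custom` when given (the role played by `-c`),
    otherwise falls back to $HOME/.deel/config.yml.
    """
    if custom is not None:
        return Path(custom).expanduser()
    return Path.home() / ".deel" / "config.yml"


# Default location, then a custom file (the second path is only an example).
print(resolve_config_path())
print(resolve_config_path("~/projects/deel/config.yml"))
```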

The following commands are available (not exhaustive):

- `list` &mdash; List the available datasets. If the configuration specifies a remote provider
  (e.g., WebDAV), this will list the datasets available remotely. To list the datasets already
  downloaded, you can use the `--local` option.

```bash
$ python -m deel.datasets list
Listing datasets at https://datasets.deel.ai:
  dataset-a: 3.0.1 [latest], 3.0.0
  dataset-b: 1.0 [latest]
  dataset-c: 1.0 [latest]
$ python -m deel.datasets list --local
Listing datasets at /opt/datasets:
  dataset-a: 3.0.1 [latest], 3.0.0
  dataset-c: 1.0 [latest]
```

- `download NAME[:VERSION]` &mdash; Download the specified dataset. If the configuration
  does not specify a remote provider, this does nothing except output some information.
  The `:VERSION` can be omitted, in which case `:latest` is implied. To force the re-download
  of a dataset, the `--force` option can be used.
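
The `NAME[:VERSION]` convention can be illustrated with a few lines of Python. This is a sketch of the rule stated above, not the parser used by `deel.datasets`; `parse_dataset_spec` is a hypothetical name, and treating an empty version (as in `mnist:`) like an omitted one is an assumption.

```python
def parse_dataset_spec(spec):
    """Split a NAME[:VERSION] string into a (name, version) pair.

    Hypothetical helper: the version defaults to "latest" when it is
    omitted or empty (the empty case is an assumption).
    """
    name, sep, version = spec.partition(":")
    if not name:
        raise ValueError(f"missing dataset name in {spec!r}")
    return name, (version if sep and version else "latest")


print(parse_dataset_spec("mnist"))
print(parse_dataset_spec("svhn:1.0.0"))
```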

#### MNIST case

```bash
$ python -m deel.datasets download mnist
Fetching mnist...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-j83d2nc3 because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train-images-idx3-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.45M/9.45M [00:00<00:00, 11.2Mbytes/s]
Extracting train-images-idx3-ubyte.gz: 44.9Mbytes [00:00, 248Mbytes/s]
train-labels-idx1-ubyte.gz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.2k/28.2k [00:00<00:00, 13.4Mbytes/s]
Extracting train-labels-idx1-ubyte.gz: 58.6kbytes [00:00, 132Mbytes/s]
t10k-images-idx3-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.57M/1.57M [00:00<00:00, 9.04Mbytes/s]
Extracting t10k-images-idx3-ubyte.gz: 7.48Mbytes [00:00, 246Mbytes/s]
t10k-labels-idx1-ubyte.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.44k/4.44k [00:00<00:00, 55.2Mbytes/s]
Extracting t10k-labels-idx1-ubyte.gz: 9.77kbytes [00:00, 59.0Mbytes/s]
convert train images: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60000/60000 [00:10<00:00, 5554.88it/s]
convert test images: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 5526.01it/s]
Dataset mnist loaded and stored at '/home/<user>/.deel/datasets/mnist/1.0.0'.
```

#### SVHN case

```bash
$ python -m deel.datasets download svhn
Fetching svhn...
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-gl2vzgmi because the default path (/home/<user>/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
train_32x32.mat: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 174M/174M [00:47<00:00, 3.84Mbytes/s]
test_32x32.mat: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61.3M/61.3M [00:25<00:00, 2.50Mbytes/s]
extra_32x32.mat: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.24G/1.24G [05:32<00:00, 4.00Mbytes/s]
.....
Dataset svhn loaded and stored at '/home/<user>/.deel/datasets/svhn/1.0.0'.
```
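
Both download logs end by reporting a storage folder of the form `<root>/<name>/<version>`. Assuming that layout holds for every dataset, the locally stored versions can be listed with a short script. `local_versions` is a hypothetical helper, and the `~/.deel/datasets` root is taken from the logs above; it may differ depending on the configuration.

```python
from pathlib import Path


def local_versions(root):
    """Map each dataset name under root to its version folder names.

    Assumes the <root>/<name>/<version> layout seen in the logs above.
    Version names are sorted as plain strings, not as version numbers.
    """
    root = Path(root)
    found = {}
    if not root.is_dir():
        return found
    for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        versions = sorted(v.name for v in dataset_dir.iterdir() if v.is_dir())
        if versions:
            found[dataset_dir.name] = versions
    return found


# Root folder taken from the logs above; adjust it to your configuration.
print(local_versions(Path.home() / ".deel" / "datasets"))
```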


