Metadata-Version: 2.1
Name: aitoolbox
Version: 1.8.0
Summary: PyTorch Model Training and Experiment Tracking Framework
Home-page: https://github.com/mv1388/AIToolbox
Author: Marko Vidoni
Author-email: 
License: MIT
Keywords: PyTorch,deep learning,research,train loop
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.9.0
Description-Content-Type: text/markdown
License-File: LICENSE

# AI Toolbox

[![PyPI version](https://badge.fury.io/py/aitoolbox.svg)](https://badge.fury.io/py/aitoolbox)
[![Package testing](https://github.com/mv1388/aitoolbox/actions/workflows/package_testing.yml/badge.svg)](https://github.com/mv1388/aitoolbox/actions/workflows/package_testing.yml)
[![Documentation Status](https://readthedocs.org/projects/aitoolbox/badge/?version=latest)](https://aitoolbox.readthedocs.io/en/latest/?badge=latest)
&nbsp; &nbsp;
[![codebeat badge](https://codebeat.co/badges/04217a3f-a838-418f-8f14-66cf6ae1b03d)](https://codebeat.co/projects/github-com-mv1388-aitoolbox-master)
[![CodeFactor](https://www.codefactor.io/repository/github/mv1388/aitoolbox/badge)](https://www.codefactor.io/repository/github/mv1388/aitoolbox)

[**Documentation**](https://aitoolbox.readthedocs.io/en/latest/)


AIToolbox is a framework which helps you train deep learning models in PyTorch and quickly iterate experiments. 
It hides the repetitive technicalities of training the neural nets and 
frees you to focus on interesting part of devising new models. 
In essence, it offers a keras-style train loop abstraction which can be used for higher 
level training process while still allowing the manual control on the lower 
level when desired.

In addition to orchestrating the model training loop the framework also helps you keep track of different 
experiments by automatically saving models in a structured traceable way and creating performance reports. 
These can be stored both locally or on AWS S3 (Google Cloud Storage in beta) which makes the library 
very useful when training on the GPU instance on AWS. Instance can be 
automatically shut down when training is finished and all the results 
are safely stored on S3.


## Installation
To install the AIToolbox package execute:
```bash
pip install aitoolbox
```

If you want to install the most recent version from github repository, first clone the package repository and 
then install via the `pip` command:
```bash
git clone https://github.com/mv1388/aitoolbox.git

pip install ./aitoolbox
```

AIToolbox package can be also provided as a dependency in the `requirements.txt` file. This can be done by 
just specifying the `aitoolbox` dependency. On the other hand, to automatically
download the current master branch from github include the following dependency specification in the requirements.txt:
```bash
git+https://github.com/mv1388/aitoolbox#egg=aitoolbox
```  


## TrainLoop

[`TrainLoop`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.train_loop.train_loop.html#aitoolbox.torchtrain.train_loop.train_loop.TrainLoop)
is the main abstraction for PyTorch neural net training. At its core
it handles the batch feeding of data into the model, calculating loss and updating parameters for a specified number of epochs.
To learn how to define the TrainLoop supported PyTorch model please look at the [Model](#model) section bellow.

After the model is created, the simplest way to train it via the TrainLoop abstraction is by doing the following:
```python
from aitoolbox.torchtrain.train_loop import *

tl = TrainLoop(model,
               train_loader, val_loader, test_loader,
               optimizer, criterion)

model = tl.fit(num_epochs=10)
```

AIToolbox includes a few more advanced derivations of the basic TrainLoop
which automatically handle the experiment tracking by creating model
checkpoints, performance reports, example predictions, etc. All of this can be saved just on the local drive
or can also be automatically stored on AWS S3. Currently implemented advanced 
[`TrainLoops`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.train_loop.train_loop_tracking.html) 
are 
[`TrainLoopCheckpoint`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.train_loop.train_loop_tracking.html#aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopCheckpoint), 
[`TrainLoopEndSave`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.train_loop.train_loop_tracking.html#aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopEndSave) and 
[`TrainLoopCheckpointEndSave`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.train_loop.train_loop_tracking.html#aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopCheckpointEndSave).
Here, 'Checkpoint' stands for checkpointing after each epoch, while 'EndSave' will only persist and evaluate at the very end of the training. 

For the most complete experiment tracking it is recommended to use the 
[`TrainLoopCheckpointEndSave`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.train_loop.train_loop_tracking.html#aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopCheckpointEndSave) 
option. 
The optional use of the *result packages* needed for the neural net performance evaluation is explained in 
the [experiment section](#experiment) bellow.
```python
from aitoolbox.torchtrain.train_loop import *

TrainLoopCheckpointEndSave(
    model,
    train_loader, validation_loader, test_loader,
    optimizer, criterion,
    project_name, experiment_name, local_model_result_folder_path,
    hyperparams, val_result_package=None, test_result_package=None,
    cloud_save_mode='s3', bucket_name='models', cloud_dir_prefix=''
)
```

Check out a full [TrainLoop training & experiment tracking example](https://github.com/mv1388/aitoolbox/blob/master/examples/TrainLoop_use/trainloop_fully_tracked_experiment.py).


## Multi-GPU training

All TrainLoop versions in addition to single GPU also support multi-GPU training to achieve even faster training.
Following the core PyTorch setup, two multi-GPU training approaches are available: 
`DataParallel` and `DistributedDataParallel`.

### DataParallel (DP)

To use DataParallel-like multiGPU training with TrainLoop just set the TrainLoop's `gpu_mode` parameter to `'dp'`:
```python
from aitoolbox.torchtrain.train_loop import *

model = ... # TTModel

TrainLoop(
    model,
    train_loader, val_loader, test_loader,
    optimizer, criterion,
    gpu_mode='dp'
).fit(num_epochs=10)
```

Check out a full [DataParallel training example](https://github.com/mv1388/aitoolbox/blob/master/examples/dp_ddp_training/dp_training.py).

### DistributedDataParallel (DDP)

Distributed training on multiple GPUs via DistributedDataParallel is enabled by the TrainLoop itself under
the hood by wrapping the model (`TTModel`, [more in Model section](#model)) into `DistributedDataParallel`.
TrainLoop also automatically spawns multiple processes and initializes them. Inside each spawned process
the model and all other necessary training components are moved to the correct GPU belonging to a specific
process. Lastly, TrainLoop also automatically adds the PyTorch `DistributedSampler` to each of the provided
data loaders in order to ensure different data batches go to different GPUs and there is no overlap.

To enable distributed training via DistributedDataParallel, the user has to set the TrainLoop's `gpu_mode` 
parameter to `'ddp'`.
```python
from aitoolbox.torchtrain.train_loop import *

model = ... # TTModel

TrainLoop(
    model,
    train_loader, val_loader, test_loader,
    optimizer, criterion,
    gpu_mode='ddp'
).fit(num_epochs=10, callbacks=None,
      ddp_model_args=None,
      num_nodes=1, node_rank=0, num_gpus=torch.cuda.device_count())
```

Check out a full [DistributedDataParallel training example](https://github.com/mv1388/aitoolbox/blob/master/examples/dp_ddp_training/ddp_training.py).

## Automatic Mixed Precision training (AMP)

All the TrainLoop versions also support training with Automatic Mixed Precision (*AMP*). In the past this required
using the [Nvidia apex](https://github.com/NVIDIA/apex) extension but from _PyTorch 1.6_ onwards AMP functionality 
is built into core PyTorch and no separate instalation is needed. 
Current version of AIToolbox already supports the use of built-in PyTorch AMP. 

The user only has to set the TrainLoop parameter `use_amp` to `use_amp=True` in order to use the default 
AMP initialization and start training the model in the mixed precision mode. If the user wants to specify 
custom AMP `GradScaler` initialization parameters, these should be provided as a dict parameter 
`use_amp={'init_scale': 2.**16, 'growth_factor': 2.0, ...}` to the TrainLoop. 
All AMP initializations and training related steps are then handled automatically by the TrainLoop. 

You can read more about different AMP details in the 
[PyTorch AMP documentation](https://pytorch.org/docs/stable/notes/amp_examples.html).

### Single-GPU mixed precision training
Example of single-GPU AMP setup:
```python
from aitoolbox.torchtrain.train_loop import *

model = ... # TTModel

TrainLoop(
    model, ...,
    optimizer, criterion, 
    use_amp=True
).fit(num_epochs=10)
``` 

Check out a full [AMP single-GPU training example](https://github.com/mv1388/aitoolbox/blob/master/examples/amp_training/single_GPU_training.py).

### Multi-GPU DDP mixed precision training
When training in the multi-GPU setting, the setup is mostly the same as in the single-GPU. 
All the user has to do is set accordingly the `use_amp` parameter of the TrainLoop and to switch its `gpu_mode`
parameter to `'ddp'`. 
Under the hood, TrainLoop will initialize the model and the optimizer for AMP and start training using 
DistributedDataParallel approach.

Example of multi-GPU AMP setup:
```python
from aitoolbox.torchtrain.train_loop import *

model = ... # TTModel

TrainLoop(
    model, ...,
    optimizer, criterion,
    gpu_mode='ddp',
    use_amp=True
).fit(num_epochs=10)
``` 

Check out a full [AMP multi-GPU DistributedDataParallel training example](https://github.com/mv1388/aitoolbox/blob/master/examples/amp_training/mutli_GPU_training.py).


## Model

To take advantage of the TrainLoop abstraction the user has to define their model as a class which is a standard way
in core PyTorch as well. The only difference is that for TrainLoop supported training the model class has 
to be inherited from the AIToolbox specific 
[`TTModel`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.model.html#aitoolbox.torchtrain.model.TTModel) 
base class instead of PyTorch `nn.Module`.

`TTModel` itself inherits from the normally used `nn.Module` class thus our models still
retain all the expected PyTorch enabled functionality. The reason for using the TTModel super class is that
TrainLoop requires users to implement two additional methods which describe how each batch of data
is fed into the model when calculating the loss in the training mode and when making the predictions in the 
evaluation mode.

The code below shows the general skeleton all the TTModels have to follow to enable them to be trained 
with the TrainLoop:
```python
from aitoolbox.torchtrain.model import TTModel

class MyNeuralModel(TTModel):
    def __init__(self):
        # model layers, etc.

    def forward(self, x_data_batch):
        # The same method as required in the base PyTorch nn.Module
        ...
        # return prediction
        
    def get_loss(self, batch_data, criterion, device):
        # Get loss during training stage, called from fit() in TrainLoop
        ...
        # return batch loss

    def get_predictions(self, batch_data, device):
        # Get predictions during evaluation stage 
        # + return any metadata potentially needed for evaluation
        ...
        # return predictions, true_targets, metadata
```

## Callbacks

For advanced applications the basic logic offered in different default TrainLoops might not be enough.
Additional needed logic can be injected into the training procedure by using 
[`callbacks`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.callbacks.html)
and providing them as a parameter list to TrainLoop's `fit(callbacks=[callback_1, callback_2, ...])` function. 

AIToolbox by default already offers a wide selection of different useful callbacks. However when
some completely new functionality is desired the user can also implement their own callbacks by 
inheriting from the base callback object 
[`AbstractCallback`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.torchtrain.callbacks.abstract.html#aitoolbox.torchtrain.callbacks.abstract.AbstractCallback). 
All that the user has to do is to implement corresponding methods to execute the new callback 
at the desired point in the train loop, such as: start/end of batch, epoch, training.


## experiment

### Result Package

This is the definition of the model evaluation procedure on the task we are experimenting with.
Result packages available out of the box can be found in the 
[`result_package` module](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.experiment.result_package.html)
where we have implemented several 
[basic, general result packages](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.experiment.result_package.basic_packages.html). 
Furthermore, for those dealing with NLP, result packages for
several widely researched NLP tasks such as translation, QA can be found as part of the 
[`NLP` module](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.nlp.experiment_evaluation.NLP_result_package.html)
module. Last but not least, as the framework was built with extensibility in mind and thus 
if needed the users can easily define their own result packages with custom evaluations by extending the base
[`AbstractResultPackage`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.experiment.result_package.abstract_result_packages.html#aitoolbox.experiment.result_package.abstract_result_packages.AbstractResultPackage). 
 
Under the hood the result package executes one or more 
[`metrics`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.experiment.core_metrics.html) 
objects which actually 
calculate the performance metric calculation. Result package object is thus used as a wrapper 
around potentially multiple performance calculations which are needed for our task. The metrics
which are part of the specified result package are calculated by calling the `prepare_result_package()` method 
of the result package which we are using to evaluate model's performance.

### Experiment Saver 

The experiment saver saves the model architecture as well as model performance evaluation results and training history. 
This can be done at the end of each epoch as a model checkpointing or at the end of training.

Normally not really a point of great interest when using the TrainLoop interface as it is hidden under the hood.
However as AIToolbox was designed to be modular one can decide to write their own training loop logic but
just use the provided experiment saver module to help with the experiment tracking and model saving.
For PyTorch users we recommend using the 
[`FullPyTorchExperimentS3Saver`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.experiment.experiment_saver.html#aitoolbox.experiment.experiment_saver.FullPyTorchExperimentS3Saver) 
which has also been most thoroughly tested. 
The experiment is saved by calling the `save_experiment()` function from the selected experiment saver and 
providing the trained model and the evaluated result package containing the calculated performance results.


## cloud

All of these modules are mainly hidden under the hood when using different experiment tracking
abstractions. However, if desired and only the cloud saving functionality is needed it is easy to use them
as standalone modules in some desired downstream application.

### AWS 

Functionality for saving model architecture and training results to S3 either during 
training or at the training end. On the other hand, the module also offers the dataset downloading
from the S3 based dataset store. This is useful when we are experimenting with datasets and have only a slow
local connection, thus scp/FTP is out of the picture.

### Google Cloud

Same functionality as for AWS S3 but for Google Cloud Storage. 
Implemented, however, not yet tested in practice. 


## nlp

Currently, mainly used for the performance evaluation 
[`result packages`](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.nlp.experiment_evaluation.NLP_result_package.html) 
needed for different NLP tasks, such as Q&A, summarization, machine translation. 

For the case of e.g. NMT the module also provides 
[attention heatmap plotting](https://aitoolbox.readthedocs.io/en/latest/api/aitoolbox.nlp.experiment_evaluation.attention_heatmap.html)
which is often helpful for gaining addition insights into the seq2seq model. The heatmap plotter
creates attention heatmap plots for every validation example and saves them as pictures to disk 
(potentially also to cloud).

Lastly, the nlp module also provides several rudimentary NLP data processing functions.


## AWS GPU instance prep and management bash scripts

As some of the tasks when training models on the AWS cloud GPU are quite repetitive, the package
also includes several useful bash scripts to automatize tasks such as instance initialization and
bootstrapping, experiment file updating, remote AIToolbox installation updating, etc.

For further information look into the [`/bin/AWS`](/bin/AWS/) folder and read 
the provided [README](/bin/AWS/README.md).


# Examples of package usage

Look into the [`/examples`](/examples) folder for starters. 
Will be adding more examples of different training scenarios.
