Metadata-Version: 2.1
Name: end2endML
Version: 0.8.0
Summary: Automate data analysis pipelines for data analyst
Home-page: https://gitlab.com/YipengUva/end2endml_pkg
Author: Yipeng Song
Author-email: yipeng.song@hotmail.com
License: UNKNOWN
Project-URL: Bug Tracker, https://gitlab.com/YipengUva/end2endml_pkg/issues
Keywords: data analysis,machine learning,automation
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6, <4
Description-Content-Type: text/markdown
License-File: LICENSE.txt

# end2endML package

The end2endML Python package implements all the components required to define automated data analysis pipelines: data preprocessing, data splitting, model selection, model fitting and model evaluation. The pipelines use some of the most commonly used machine learning algorithms.

### Installation

Install the end2endML package by running:

```
pip install end2endML
```

on the command line of a Linux system or in the Anaconda Prompt on a Windows system. If you don't have root privileges, you may need to add `--user` to the command above (`pip install --user end2endML`). pip then installs the package in your home directory, which doesn't require root privileges.

### User guide

The user guide is available at https://end2endml.readthedocs.io/en/latest/.

### TODO

- ~~Add feature extraction to the models.~~
  - Feature extraction is only implemented for linear models, SVM and neural networks. It is not implemented for tree-based methods.
  - The number of components is treated as a hyperparameter during model selection.
- ~~Implement a unit test suite to run automated tests for every update.~~
- ~~Currently, if we specify a gradient boosting model for imbalanced classification, both RUSBoost and EasyEnsemble, which differ in how the undersampling is implemented, are selected and trained. Need to find a way to let the user choose between them.~~
- ~~If the trained model already uses 10 cores, letting the CV procedure use another 10 cores is OK in general. However, it can be a problem for EasyEnsemble models when the data set is large. Fix it by setting n_jobs of the CV procedure to None for the EasyEnsemble model.~~
- ~~Add a function to check whether the preprocessed data is available. If it is available, there is no need to preprocess the data again. Maybe this is not a good idea, as sometimes we use different parameters to control the data preprocessing, and re-preprocessing does not take much time.~~
- ~~Bug. The data analysis pipeline should be able to remove infinite values in X and y.~~
- ~~When cat_threshold is set to 2, which means numerical variables with only a few unique values are not treated as categorical, y is not transformed to the object data type, and the automated data analysis procedure then treats the problem as a regression task.~~
- ~~We should re-save the preprocessed data sets every time. Currently, if the function detects that the preprocessed data has already been saved, it does not save it again. This can lead to serious issues when the data preprocessing parameters change. In addition, saving does not take much time, so we should always save the preprocessed data.~~
- ~~For binary classification and regression problems, the saved feature importances should be one-dimensional rather than two-dimensional.~~
- ~~--user, why~~
- ~~Keep track of all the preprocessing steps, so we can apply the exact same preprocessing steps to new data.~~
- Add Dan and Mengzhe to the author list. Their agreement has not been obtained yet, so for now they are only included in the credits.
- ~~Print out time~~ 
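
The completed item above about infinite values in X and y can be illustrated with a minimal sketch. It is not the package's actual implementation: the helper name `drop_non_finite_rows` is hypothetical, and it assumes X is a pandas DataFrame and y is a pandas Series sharing the same index.

```python
import numpy as np
import pandas as pd


def drop_non_finite_rows(X: pd.DataFrame, y: pd.Series):
    """Drop rows where y or any numeric column of X is +inf or -inf.

    Hypothetical helper for illustration only; not part of the end2endML API.
    Assumes X and y share the same index. NaN values are left untouched.
    """
    # Only numeric columns can hold infinite values.
    numeric = X.select_dtypes(include=[np.number])
    bad_X = np.isinf(numeric).any(axis=1)

    # Check y only when it is numeric (e.g. a regression target).
    if pd.api.types.is_numeric_dtype(y):
        bad_y = np.isinf(y)
    else:
        bad_y = pd.Series(False, index=y.index)

    keep = ~(bad_X | bad_y)
    return X.loc[keep], y.loc[keep]
```

Rows are dropped from X and y together so that the two stay aligned. Whether to drop such rows or replace the infinite values instead depends on the data set.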



