Metadata-Version: 2.1
Name: shapkit
Version: 0.0.1
Summary: Interpret machine learning predictions using agnostic local feature importance based on Shapley Values.
Home-page: https://github.com/sgrah-oss/shapkit/tree/master/
Author: Simon Grah
Author-email: simon.grah.pro@gmail.com
License: MIT License
Description: # Shapkit
        > Interpret machine learning predictions using agnostic local feature importance based on Shapley Values. 
        
        
        ## Overview
        
        ### Objective
        
        Machine Learning is enjoying increasing success in many applications: medical, marketing, defense, cyber security, transport. It is becoming a key tool in critical systems. However, models are often very complex and highly non-linear. This is problematic, especially for critical systems, because end-users need to fully understand the decisions of an algorithm (e.g. why an alert has been triggered, why a person has a high probability of cancer recurrence, etc.). One solution is to offer an interpretation for each individual prediction based on attribute relevance. Shapley Values distribute contributions fairly across attributes in order to explain the difference between the predicted value for an observation and a base value (e.g. the average prediction of a reference population).
        
        The method used is:
        * **agnostic**: no particular information on the model is needed; it works with black-box algorithms. We only need to define a reward function (e.g. the model output).
        * **local**: the explanation is computed at the instance level. Thus, each interpretation is associated with a given prediction.
        * More suitable for **tabular data** with meaningful features.
        
        ### A concrete use case: COMPAS
        
        > *COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a popular commercial algorithm used by judges and parole officers for scoring criminal defendant’s likelihood of reoffending (recidivism)*
        
        Assume that we have trained a machine learning model to predict the probability of recidivism of a given individual. The algorithm is quite effective, but it only returns a probability score without any detail on how it made its choice.
        We would like to know how each attribute (characteristic) influences the model output. Furthermore, the contributions explain the difference between the individual prediction and the mean prediction over a set of references. These references are defined by the user (e.g. for classification, interesting references can be chosen among instances predicted in other classes).
        
        <img alt="Shapley Values interpretation of a COMPAS prediction" width="700" caption="In this example, the facts that this person has committed 6 prior crimes, is African-American, is 27 years old and has a Post Sentence legal status mainly explain why the model predicted such a probability score. Contributions can also be negative, e.g. his probation custody status pushes the model towards a low probability of recidivism." src="nbs/images/shap_readme_illustration.png">
        
        This picture displays the kind of interpretation associated with a given prediction for an individual x. The estimated probability of recidivism is about 0.75 (deep blue arrow). The individual's attributes (or characteristics) are shown on the y-axis. Based on a set of chosen references (here, the references are predicted as non-recidivist by the model), we compute the contribution (Shapley Value) of each attribute, i.e. its influence on the model output.
        Those contributions have some interesting properties. Indeed, the sum of all contributions equals the difference between the output for the individual x (0.75) and the mean output over the references (0.13).
        
        In this example, the facts that this person has committed 6 prior crimes, is African-American, is 27 years old and has a Post Sentence legal status mainly explain why the model predicted such a probability score. Contributions can also be negative, e.g. his probation custody status pushes the model towards a low probability of recidivism.
        
        ## Install
        
        ```
        pip install shapkit
        ```
        
        ## Dependencies
        
        * [python3](https://www.python.org/downloads/) (>= 3.6)
        * [numpy](https://numpy.org/) (>= 1.17.2)
        * [pandas](https://pandas.pydata.org/) (>= 0.25.3)
        * [matplotlib](https://matplotlib.org/) (>= 2.2.3)
        * [seaborn](https://seaborn.pydata.org/) (>= 0.9.0)
        * [tqdm](https://github.com/tqdm/tqdm) [optional] (>= 4.26.0)
        
        ## How to use
        
        The method is a post-hoc explanation, so you do not have to change your routine. First, train your model:
        ```python
        model.fit(X_train, y_train)
        ```
        
        Then, define your reward function `fc` (e.g. simply your model output):
        ```python
        fc = lambda x: model.predict_proba(x)
        ```
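        As an aside (assuming a scikit-learn-style model, which the snippet above suggests), note that `predict_proba` returns one probability per class; for a scalar reward you will often keep a single class column. A toy sketch:

        ```python
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        # Toy model standing in for your own; any estimator with predict_proba works
        X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)
        model = LogisticRegression().fit(X_train, y_train)

        # Keep only the positive-class column so fc maps one instance to one number
        fc = lambda z: model.predict_proba(z.reshape(1, -1))[0, 1]
        ```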
        
        Select an instance `x` for which you need an interpretation. Also pick one or several `reference(s)` (a single instance or a dataset of individuals).
        If the number of features is not too high (say, fewer than 10), you can compute the exact Shapley Values.
        ```python
        true_shap = ShapleyValues(x=x, fc=fc, ref=reference)
        ```
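        To see what the exact computation does, here is a minimal hand-rolled sketch (not the shapkit implementation): the Shapley Value of feature `j` is a weighted average, over all coalitions `S` of the remaining features, of the change in reward when feature `j` is switched from its reference value to its value in `x`.

        ```python
        from itertools import combinations
        from math import factorial

        def exact_shapley(x, ref, fc):
            """Exact Shapley Values for switching features of `ref` to those of `x`.
            `fc` maps a list of feature values to a scalar reward."""
            d = len(x)
            phi = [0.0] * d
            for j in range(d):
                others = [k for k in range(d) if k != j]
                for size in range(d):
                    # Classic Shapley weight for a coalition of this size
                    w = factorial(size) * factorial(d - size - 1) / factorial(d)
                    for S in combinations(others, size):
                        z = [x[k] if k in S else ref[k] for k in range(d)]
                        without_j = fc(z)
                        z[j] = x[j]          # switch feature j to its value in x
                        with_j = fc(z)
                        phi[j] += w * (with_j - without_j)
            return phi
        ```

        For an additive reward such as `sum`, each value reduces to the feature's own difference from the reference, and the values always sum to `fc(x) - fc(ref)` (the efficiency property illustrated in the COMPAS example).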
        
        If the dimension exceeds about 15, you may need approximation algorithms to estimate the Shapley Values.
        
        * Monte Carlo algorithm:
        
        ```python
        mc_shap = MonteCarloShapley(x=x, fc=fc, ref=reference, n_iter=1000)
        ```
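        The Monte Carlo estimator can be sketched with the classic permutation-sampling scheme (again a simplified stand-in, not shapkit's code): draw random feature orderings, switch features from the reference to `x` one at a time, and average each feature's marginal change in reward.

        ```python
        import random

        def monte_carlo_shapley(x, ref, fc, n_iter=1000, seed=0):
            """Permutation-sampling Monte Carlo estimate of Shapley Values (sketch)."""
            rng = random.Random(seed)
            d = len(x)
            phi = [0.0] * d
            for _ in range(n_iter):
                order = list(range(d))
                rng.shuffle(order)        # random feature ordering
                z = list(ref)             # start from the reference
                prev = fc(z)
                for j in order:           # switch features to x one by one
                    z[j] = x[j]
                    cur = fc(z)
                    phi[j] += cur - prev  # marginal contribution of feature j
                    prev = cur
            return [p / n_iter for p in phi]
        ```

        Because each iteration touches every feature once, the cost is about `n_iter * d` reward evaluations instead of the exponential cost of the exact formula.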
        
        
        * Projected Stochastic Gradient Descent algorithm:
        
        ```python
        sgd_est = SGDshapley(d, C=y.max())  # d is the number of features
        sgd_shap = sgd_est.sgd(x=x, fc=fc, r=reference, n_iter=5000, step=.1, step_type="sqrt")
        ```
        
        ## Code and description
        
        This library is based on [nbdev](http://nbdev.fast.ai/).
        > nbdev is a library that allows you to fully develop a library in Jupyter Notebooks, putting all your code, tests and documentation in one place. That is: you now have a true literate programming environment, as envisioned by Donald Knuth back in 1983!
        
        Code, descriptions, small examples and tests are all put together in Jupyter notebooks in the `nbs` folder.
        
        Useful commands from `nbdev`:
        
        * Build the lib by converting all notebooks in the `nbs` folder to .py files
        ```
         nbdev_build_lib
        ```
        
        
        * Run all tests in parallel
        ```
        nbdev_test_nbs
        ```
        
        
        * Build docs
        ```
        nbdev_build_docs
        ```
        
        ## Tutorial
        
        Notebook demos are available in the `tutorials` folder.
        
        ## License
        
        Shapkit is licensed under the terms of the MIT License (see the file LICENSE).
        
Keywords: feature Importance Shapley Values
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
