Metadata-Version: 2.1
Name: AutoCarver
Version: 3.0.5
Summary: Automatic Carving of Features with Optimal Association
Home-page: https://github.com/mdefrance/AutoCarver
Author: Mario DEFRANCE
Author-email: defrancemario@gmail.com
License: MIT
Project-URL: Bug Tracker, https://github.com/mdefrance/AutoCarver/issues
Description: # AutoCarver
        
        **AutoCarver** is an approach for maximising a qualitative feature's association with a binary target feature while reducing its number of distinct modalities.
        It can also be used to discretize quantitative features, which are first cut into quantiles.
        
        Modalities/values of features are carved/regrouped according to a specific order computed from their types:
          - *Qualitative features* are grouped based on target rate per modality.
          - *Qualitative ordinal features* are grouped based on the specified modality order.
          - *Quantitative features* are grouped based on the order of their values.
         
        Uses Tschuprow's T or Cramér's V to find the optimal carving (regrouping) of modalities/values of features.
        
        `AutoCarver` is an `sklearn` transformer.
        
        Only implemented for binary classification problems.
        
        ## Install
        
        AutoCarver can be installed from [PyPI](https://pypi.org/project/AutoCarver):
        
        ```
        pip install --upgrade autocarver
        ```
        
        ## Complete Example
        
        #### Setting up Samples
        
        `AutoCarver` tests the robustness of carvings on a specific sample. For this purpose, the use of an out-of-time sample is recommended.
        
        ```python
        # defining training and testing sets
        X_train, y_train = ...
        X_test, y_test = ...
        X_val, y_val = ...
        ```
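        
        In the absence of a real dataset, here is a minimal sketch of such splits, using hypothetical feature names (`'amount'`, `'age'`) and scikit-learn's `train_test_split`:
        
        ```python
        import numpy as np
        import pandas as pd
        from sklearn.model_selection import train_test_split
        
        # synthetic sample: two features and a binary target
        rng = np.random.default_rng(0)
        n = 1000
        df = pd.DataFrame({
            'amount': rng.exponential(100.0, n),
            'age': rng.choice(['0-18', '18-30', '30-50', '50+'], n),
            'target': rng.integers(0, 2, n),
        })
        X, y = df.drop(columns='target'), df['target']
        
        # held-out test sample used to check carving robustness
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0
        )
        ```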
        
        #### Formatting features to be carved
        
        All features need to be discretized via a `Discretizer` so that `AutoCarver` can group their modalities. The following parameters must be set for `Discretizer`:
        
        - `min_freq`, should be set from 0.01 (more precise) to 0.05 (faster, increased stability).
          - *For qualitative features:* minimal frequency of a modality; less frequent modalities are grouped into the `default_value='__OTHER__'` modality.
          - *For qualitative ordinal features:* less frequent modalities are grouped with the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the `values_orders` dictionary).
        
        - `q`, should be set from 10 (faster) to 20 (more precise).
          - *For quantitative features:* number of quantiles to initially cut the feature into. Values more frequent than `1/q` are set as their own group and the remaining frequency is cut into proportionally fewer quantiles (`q := max(round(non_frequent * q), 1)`).
        
        - `values_orders`
          - *For qualitative ordinal features:* `dict` of feature names and `GroupedList` of their ordered values. Modalities less frequent than `min_freq` are automatically grouped with the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
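        
        As an illustration of the quantile rule above, a small sketch of the effective quantile count (this helper is not part of the library's API):
        
        ```python
        # effective number of quantiles once over-represented values are set aside:
        # q := max(round(non_frequent * q), 1)
        def effective_q(non_frequent: float, q: int) -> int:
            """non_frequent: share of rows not covered by values more frequent than 1/q."""
            return max(round(non_frequent * q), 1)
        
        # with q=20, if 25% of the rows belong to over-represented values,
        # the remaining 75% is cut into 15 quantiles
        print(effective_q(0.75, 20))  # 15
        
        # at least one quantile is always kept
        print(effective_q(0.02, 20))  # 1
        ```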
        
        At this step, all `numpy.nan` are kept as their own modality.
        
        For qualitative features, unknown modalities passed to `Discretizer.transform` (i.e. not seen by `Discretizer.fit`) are automatically grouped into the `default_value='__OTHER__'` modality.
        
        By default, samples are modified in place rather than copied (recommended for large datasets). Use `copy=True` if you want a new `DataFrame` to be returned.
        
        ```python
        from AutoCarver.Discretizers import Discretizer
        
        # specifying features to be carved
        quantitatives = ['amount', 'distance', 'length', 'height']
        qualitatives = ['age', 'type', 'grade', 'city']
        
        # specifying orders of categorical ordinal features
        values_orders = {
            'age': ['0-18', '18-30', '30-50', '50+'],
            'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
        }
        
        # pre-processing of features into categorical ordinal features
        discretizer = Discretizer(quantitatives, qualitatives, min_freq=0.02, q=20, values_orders=values_orders)
        discretizer.fit_transform(X_train, y_train)
        discretizer.transform(X_test)
        
        # updating features' values orders (at this step, every feature is qualitative ordinal)
        values_orders = discretizer.values_orders
        ```
        
        #### Automatic Carving of features
        
        All specified features can now be automatically carved into an association-maximising grouping of their modalities while reducing their number. The following parameters must be set for `AutoCarver`:
        
        - `sort_by`, association measure used to find the optimal combination of grouped modalities.
          - Use `sort_by='cramerv'` for more modalities, less robust.
          - Use `sort_by='tschuprowt'` for fewer, more robust modalities.
          - **Tip:** a combination of features carved with `sort_by='cramerv'` and features carved with `sort_by='tschuprowt'` can sometimes prove better than either alone.
        
        - `max_n_mod`, maximum number of modalities for the carved features (excluding `numpy.nan`). All possible combinations of up to `max_n_mod` groups of modalities will be tested. Should be set from 4 (faster) to 6 (more precise).
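        
        For reference, a standalone sketch of the two association measures named above, computed from a feature-vs-target contingency table (an illustration, not AutoCarver's internal implementation):
        
        ```python
        import numpy as np
        
        def chi2_stat(table: np.ndarray) -> float:
            """Pearson's chi-squared statistic of a contingency table."""
            expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
            return float(((table - expected) ** 2 / expected).sum())
        
        def cramerv(table: np.ndarray) -> float:
            """Cramér's V: chi2 normalised by n * min(rows - 1, cols - 1)."""
            r, c = table.shape
            return float(np.sqrt(chi2_stat(table) / (table.sum() * min(r - 1, c - 1))))
        
        def tschuprowt(table: np.ndarray) -> float:
            """Tschuprow's T: chi2 normalised by n * sqrt((rows - 1) * (cols - 1))."""
            r, c = table.shape
            return float(np.sqrt(chi2_stat(table) / (table.sum() * np.sqrt((r - 1) * (c - 1)))))
        
        # counts of (modality, target) pairs: 3 modalities x binary target
        table = np.array([[30, 10], [20, 20], [10, 30]])
        print(cramerv(table))     # ~0.408
        print(tschuprowt(table))  # ~0.343
        ```
        
        With more than two modalities against a binary target, Tschuprow's T divides by a larger normalising factor than Cramér's V, which is why it tends to select fewer, more robust modalities.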
        
        At this step, all `numpy.nan` are grouped with the best non-NaN modality (once modalities have been grouped). Use `keep_nans=True` if you want `numpy.nan` to remain as a specific modality.
        
        
        ```python
        from AutoCarver.AutoCarver import AutoCarver
        
        # initiating AutoCarver
        auto_carver = AutoCarver(values_orders, sort_by='cramerv', max_n_mod=5, verbose=True)
        
        # fitting on training sample 
        # a test sample can be specified to evaluate carving robustness
        auto_carver.fit_transform(X_train, y_train, X_test, y_test)
        
        # applying transformation on test sample
        auto_carver.transform(X_test)
        ```
        <p align="left">
          <img width="500" src="/docs/auto_carver_fit.PNG" />
        </p>
        
        #### Storing, reusing an AutoCarver
        
        The `Discretizer` and `AutoCarver` steps can be stored in a `Pipeline`, which can then be saved as a `pickle` file.
        
        ```python
        from pickle import dump
        from sklearn.pipeline import Pipeline
        
        # storing Discretizer
        pipe = [('Discretizer', discretizer)]
        
        # storing fitted AutoCarver in a Pipeline
        pipe += [('AutoCarver', auto_carver)]
        pipe = Pipeline(pipe)
        
        # storing as pickle file
        with open('my_pipe.pkl', 'wb') as file:
            dump(pipe, file)
        ```
        
        The stored `Pipeline` can then be used to transform new datasets.
        
        ```python
        from pickle import load
        
        # restoring the pipeline
        with open('my_pipe.pkl', 'rb') as file:
            pipe = load(file)
        
        # applying pipe to a validation set or in production
        X_val = pipe.transform(X_val)
        ```
        
Platform: UNKNOWN
Description-Content-Type: text/markdown
