Metadata-Version: 2.1
Name: chatintents
Version: 0.0.1
Summary: ChatIntents automatically clusters and labels short text intent messages.
Home-page: https://github.com/dborrelli/chat-intents
Author: David Borrelli
License: UNKNOWN
Description: # chat-intents
        
        ChatIntents provides a method for automatically clustering short text documents containing dialogue intents and applying descriptive group labels to them. It uses [UMAP](https://github.com/lmcinnes/umap) to reduce the dimensionality of user-supplied document embeddings and [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) to perform the clustering. Hyperparameters are tuned automatically with a Bayesian search (using [hyperopt](https://github.com/hyperopt/hyperopt)) that minimizes an objective function subject to user-supplied bounds.
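        
        For intuition, the core reduce-then-cluster step looks roughly like the following sketch. This is an illustration of the general approach, not the package's internal code; `embeddings` is assumed to be a NumPy array of document embeddings (as produced in the Quick Start below), and the hyperparameter values are placeholders:
        
        ```
        import umap
        import hdbscan
        
        # Reduce the high-dimensional sentence embeddings to a small number of
        # dimensions so that density-based clustering behaves well
        reduced = umap.UMAP(n_neighbors=15,
                            n_components=5,
                            metric='cosine',
                            random_state=42).fit_transform(embeddings)
        
        # Cluster in the reduced space; HDBSCAN labels low-density points as
        # noise (-1) rather than forcing them into a cluster
        clusters = hdbscan.HDBSCAN(min_cluster_size=5,
                                   metric='euclidean',
                                   cluster_selection_method='eom').fit(reduced)
        ```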
        
        **See the associated Medium post for additional description and motivation.**
        
        ## Installation
        
        Installation can be done using PyPI:
        
        ```
        pip install chatintents
        ```
        
        **Note:** Depending on your system setup and environment, you may encounter an error when pip installs HDBSCAN (a failure to build the hdbscan wheel). This is a [known issue](https://github.com/scikit-learn-contrib/hdbscan/issues/293) with HDBSCAN and has several possible solutions. If you are already using a conda virtual environment, an easy solution is to conda install HDBSCAN *before* installing the chatintents package using:
        
        ```
        conda install -c conda-forge hdbscan
        ```
        
        ### Sentence embeddings
        The `chatintents` package doesn't include or prescribe a method for creating the sentence embeddings of the documents. Two popular pre-trained embedding models, as shown in the [tutorial notebook](https://github.com/dborrelli/chat-intents/blob/main/notebooks/chatintents_tutorial.ipynb), are the [Universal Sentence Encoder (USE)](https://tfhub.dev/google/universal-sentence-encoder/4) and [Sentence Transformers](https://www.sbert.net/).
        
        Sentence Transformers can be installed with:
        
        ```
        pip install -U sentence-transformers
        ```
        
        The Universal Sentence Encoder requires installing TensorFlow and TensorFlow Hub:
        
        ```
        pip install tensorflow
        pip install --upgrade tensorflow-hub
        ```
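        
        Once installed, generating USE embeddings is a one-liner; a minimal sketch, assuming `all_intents` is a list of message strings as in the Quick Start below:
        
        ```
        import tensorflow_hub as hub
        
        # Load the Universal Sentence Encoder from TF Hub (module URL linked above)
        embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
        
        # Returns a 2D tensor of shape (num_messages, 512); convert to NumPy
        embeddings_use = embed(all_intents).numpy()
        ```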
        
        ## Quick Start
        The example below uses a Sentence Transformer model to embed the messages and then creates a `ChatIntents` instance:
        
        ```
        import chatintents
        from chatintents import ChatIntents
        
        from sentence_transformers import SentenceTransformer
        
        # docs is a pandas DataFrame with a 'text' column of intent messages
        all_intents = list(docs['text'])
        
        # Embed the messages with a pre-trained Sentence Transformer model
        st_model = SentenceTransformer('all-mpnet-base-v2')
        embeddings = st_model.encode(all_intents)
        
        model = ChatIntents(embeddings, 'st1')
        ```
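        
        The snippet above assumes `docs` is a pandas DataFrame with a `text` column of intent messages; a minimal stand-in for experimentation:
        
        ```
        import pandas as pd
        
        # Toy dataset; any DataFrame with a 'text' column of short messages works
        docs = pd.DataFrame({'text': ["book a flight to boston",
                                      "what is the weather tomorrow",
                                      "cancel my hotel reservation"]})
        ```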
        
        Creating a `ChatIntents` instance requires an embedding representation of all documents and a short string identifier for the model (no spaces).
        
        ### Generating clusters
        Methods are provided for generating clusters from user-supplied hyperparameters, from a random search, and from a Bayesian search.
        
        #### User-supplied hyperparameters and manual scoring
        ```
        clusters = model.generate_clusters(n_neighbors = 15, 
                                           n_components = 5, 
                                           min_cluster_size = 5, 
                                           min_samples = None,
                                           random_state=42)
        
        labels, cost = model.score_clusters(clusters)
        ```
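        
        Since the package clusters with HDBSCAN, the returned `clusters` object can be inspected directly; a quick sketch, assuming it is a fitted HDBSCAN clusterer whose `labels_` attribute holds one integer label per document (-1 marks noise):
        
        ```
        import numpy as np
        
        # Count documents per cluster; -1 is HDBSCAN's label for noise points
        unique_labels, counts = np.unique(clusters.labels_, return_counts=True)
        print(dict(zip(unique_labels, counts)))
        ```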
        
        #### Random search
        To run 100 evaluations with randomly selected hyperparameter values within user-supplied ranges:
        ```
        space = {
                "n_neighbors": range(12,16),
                "n_components": range(3,7),
                "min_cluster_size": range(2,15),
                "min_samples": range(2,15)
            }
        
        df_random = model.random_search(space, 100)
        ```
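        
        The returned `df_random` is a pandas DataFrame with one row per evaluated run, so the best combinations can be surfaced by sorting on the objective; the `'cost'` column name here is an assumption based on the scoring above:
        
        ```
        # Show the five lowest-cost runs (lower cost is better)
        print(df_random.sort_values(by='cost').head())
        ```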
        
        #### Bayesian search
        Perform a Bayesian search of the hyperparameter space using hyperopt and user-supplied upper and lower bounds for the number of expected clusters:
        ```
        from hyperopt import hp
        
        hspace = {
            "n_neighbors": hp.choice('n_neighbors', range(3,16)),
            "n_components": hp.choice('n_components', range(3,16)),
            "min_cluster_size": hp.choice('min_cluster_size', range(2,16)),
            "min_samples": None,
            "random_state": 42
        }
        
        label_lower = 30
        label_upper = 100
        max_evals = 100
        
        model.bayesian_search(space=hspace,
                              label_lower=label_lower, 
                              label_upper=label_upper, 
                              max_evals=max_evals)
        ```
        Running the `bayesian_search` method on a model instance saves the best parameters and the best clusters to that instance as attributes. For example:
        
        ```
        >>> model.best_params
        
        {'min_cluster_size': 5,
         'min_samples': None,
         'n_components': 11,
         'n_neighbors': 3,
         'random_state': 42}
        ```
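        
        The best clustering itself can then be inspected the same way as a manual run; a sketch assuming the instance stores it in a `best_clusters` attribute holding the fitted HDBSCAN object:
        
        ```
        # Number of clusters found by the best run (excluding HDBSCAN's -1 noise label)
        labels = model.best_clusters.labels_
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(n_clusters)
        ```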
        
        ### Applying labels to best clusters from Bayesian search
        After running the `bayesian_search` method to identify the best clusters for a given embedding model, descriptive labels can then be applied with:
        ```
        df_summary, labeled_docs = model.apply_and_summarize_labels(docs[['text']])
        ```
        This yields two results: a `df_summary` dataframe summarizing the count and descriptive label of each group:
        
        ![alt text](https://github.com/dborrelli/chat-intents/blob/main/images/table_extracted_labels.png)
        
        and a `labeled_docs` dataframe with each document in the dataset and its associated cluster number and descriptive label:
        
        ![alt text](https://github.com/dborrelli/chat-intents/blob/main/images/labeled_docs.png)
        
        ### Evaluating performance if ground truth is known
        Two methods are also supplied for evaluating and comparing the performance of different models if the ground truth labels happen to be known:
        
        ```
        models = [model_use, model_st1, model_st2, model_st3]
        
        df_comparison, labeled_docs_all_models = chatintents.evaluate_models(docs[['text', 'category']],
                                                                             models)
        ```
        An example `df_comparison` dataframe comparing model performance is shown below:
        
        ![alt text](https://github.com/dborrelli/chat-intents/blob/main/images/results_comparison.png)
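        
        For reference, standard external metrics for scoring predicted cluster labels against known categories are available in scikit-learn; a self-contained sketch (illustrative only, and independent of how `evaluate_models` computes its comparison):
        
        ```
        from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
        
        # Toy example: the prediction is the same partition as the ground truth,
        # just with permuted label ids, so both scores are 1.0
        true_labels = [0, 0, 1, 1, 2]
        pred_labels = [1, 1, 0, 0, 2]
        print(adjusted_rand_score(true_labels, pred_labels))
        print(normalized_mutual_info_score(true_labels, pred_labels))
        ```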
        
        ## Tutorial
        See this [tutorial notebook](https://github.com/dborrelli/chat-intents/blob/main/notebooks/chatintents_tutorial.ipynb) for an example of using the `chatintents` package for comparing four different models on a dataset.
Keywords: nlp clustering document embeddings unsupervised intents
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
