Metadata-Version: 2.1
Name: gaussianlda
Version: 0.2.5
Summary: Implementation of Gaussian LDA topic model, with efficiency tricks
Home-page: https://github.com/markgw/gaussianlda
Author: Mark Granroth-Wilding
License: UNKNOWN
Project-URL: Based on, https://github.com/rajarshd/Gaussian_LDA
Project-URL: Funding, https://www.newseye.eu/
Description: # Gaussian LDA
        Another implementation of the paper 
        [Gaussian LDA for Topic Models with Word Embeddings](http://rajarshd.github.io/papers/acl2015.pdf).
        
        This is a Python implementation based as closely as possible on 
        the [Java implementation](https://github.com/rajarshd/Gaussian_LDA) 
        released by the paper's authors.
        
        ## Installation
        
        You'll first need to install the ``choldate`` package, [following its installation 
        instructions](https://github.com/modusdatascience/choldate). (It's not 
        possible to include this as a dependency for the PyPI package.)
        
        Then install gaussianlda using pip:
        ```
        pip install gaussianlda
        ```
        
        ## Usage
        
        The package provides two classes for training Gaussian LDA:
         * Cholesky only, `gaussianlda.GaussianLDATrainer`: Simple Gibbs sampler 
           with optional Cholesky decomposition trick.
         * Cholesky+aliasing, `gaussianlda.GaussianLDAAliasTrainer`: 
           Cholesky decomposition (not optional) and the Vose aliasing trick.
        
        The trainer is prepared by instantiating the training class with the following arguments:
         * *corpus*: List of documents, where each document is a list of int IDs 
           of words. These are IDs into the vocabulary and the embeddings matrix.
         * *vocab_embeddings*: (V, D) Numpy array, where V is the number of words 
           in the vocabulary and D is the dimensionality of the embeddings.
         * *vocab*: Vocabulary, given as a list of words, whose positions correspond 
           to the indices used in the data. This is not strictly needed for training, 
           but is used to output topics.
         * *num_tables*: Number of topics to learn.
         * *alpha*, *kappa*: Hyperparameters of the doc-topic Dirichlet and 
           the inverse Wishart prior.
         * *save_path*: Path to write the model out to after each iteration.
         * *mh_steps* (aliasing only): Number of Metropolis-Hastings steps for 
           each topic sample.
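        
        If your documents start out as lists of word strings, they need to be 
        converted to int IDs before being passed as *corpus*. The following is a 
        minimal sketch of one way to do that. The helper `encode_corpus` is 
        hypothetical (it is not part of gaussianlda), and skipping out-of-vocabulary 
        words is an assumption made here, on the grounds that such words have no row 
        in the embeddings matrix.
        
        ```python
        def encode_corpus(tokenized_docs, vocab):
            """Map tokenized documents to lists of int IDs into `vocab`.
        
            Hypothetical helper, not part of gaussianlda. Words missing from
            `vocab` are skipped, since they would have no embedding row.
            """
            word_to_id = {word: idx for idx, word in enumerate(vocab)}
            return [
                [word_to_id[word] for word in doc if word in word_to_id]
                for doc in tokenized_docs
            ]
        
        
        vocab = "money business bank finance sheep cow goat pig".split()
        docs = [
            "money bank business business finance".split(),
            "pig cow sheep tractor pig".split(),  # "tractor" is out of vocabulary
        ]
        corpus = encode_corpus(docs, vocab)
        print(corpus)  # [[0, 2, 1, 1, 3], [7, 5, 4, 7]]
        ```
        
        Whatever encoding you use, row `i` of *vocab_embeddings* must hold the 
        embedding of `vocab[i]`, so that the IDs index both consistently.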
        
        Then you set the sampler running for a specified number of iterations 
        over the training data by calling `trainer.sample(num_iters)`.
        
        ## Example
        
        ```python
        import numpy as np
        from gaussianlda import GaussianLDAAliasTrainer
        
        # A small vocabulary as a list of words
        vocab = "money business bank finance sheep cow goat pig".split()
        # A random embedding for each word
        # Really, you'd want to load something more useful!
        # (np.random.sample takes no dtype argument, so cast afterwards)
        embeddings = np.random.sample((8, 100)).astype(np.float32)
        corpus = [
            [0, 2, 1, 1, 3, 0, 6, 1],
            [3, 1, 1, 3, 7, 0, 1, 2],
            [7, 5, 4, 7, 7, 4, 6],
            [5, 6, 1, 7, 7, 5, 6, 4],
        ]
        # Prepare a trainer
        trainer = GaussianLDAAliasTrainer(
            corpus, embeddings, vocab, 2, 0.1, 0.1
        )
        # Set training running
        trainer.sample(10)
        ```
Keywords: gaussian lda topic-model machine-learning
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.2
Description-Content-Type: text/markdown
