Metadata-Version: 2.1
Name: tokenizer_xm
Version: 1.0
Summary: Tokenizing with options to include contractions, lemmatize and stem.
Home-page: https://github.com/ALaughingHorse/tokenizer_xm
Author: Xiao Ma
Author-email: Marshalma0923@gmail.com
License: MIT
Download-URL: https://github.com/ALaughingHorse/tokenizer_xm/archive/v_10.tar.gz
Description: 
        # Introduction 
        
        This package is an aggregation of several packages I found useful for text pre-processing including gensim and ntlk. I put them together to create a more comprehensive and convenient pipeline. 
        
        # Installation
        
        ```
        pip install tokenizer_xm
        ```
        
        # Usage
        
        ## Processing a single text string
        
        
        ```python
        from tokenizer_xm import TextPreProcessor
        import string
        
        # An example text
        example_text = "This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used."
        ```
        
        
        ```python
        print("Original text:")
        print(example_text)
        print("---")
        
        print("Simple Preprocessed:")
        print("---")
        tk = TextPreProcessor(text=example_text, lemma_flag=False, stem_flag=False, stopwords=[])
        print(tk.process())
        print("---")
        
        print("Pre-processing with regular contractions (e.g. I've -> I have):")
        # In this package, I included a dictionary of regular contractions for your convenience
        tk = TextPreProcessor(text=example_text, lemma_flag=False, stem_flag=False, \
                              contractions=[], stopwords=[])
        print(tk.process())
        print("---")
        
        print("Pre-processing with lemmatization:")
        tk = TextPreProcessor(text=example_text, lemma_flag=True, stem_flag=False, \
                              stopwords=[])
        print(tk.process())
        print("---")
        
        print("Pre-processing with lemmatization and stemming:")
        # This package uses the SnowballStemmer from ntlk.stem. I will try to make it customizable later
        tk = TextPreProcessor(text=example_text, lemma_flag=True, stem_flag=True, \
                               stopwords=[])
        print(tk.process())
        print("---")
        
        print("Adding stop words")
        # This package uses the SnowballStemmer from ntlk.stem. I will try to make it customizable later
        tk = TextPreProcessor(text=example_text, lemma_flag=True, stem_flag=True, \
                                stopwords=["this",'be',"an",'it'])
        print(tk.process())
        print("---")
        ```
        
            Original text:
            This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used.
            ---
            Simple Preprocessed:
            ---
            ['this', 'is', 'an', 'amazing', 'product', 'i', 'have', 'been', 'using', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'has', 'it', 'is', 'clearly', 'better', 'than', 'any', 'other', 'products', 'i', 'have', 'used']
            ---
            Pre-processing with regular contractions (e.g. I've -> I have):
            ['this', 'is', 'an', 'amazing', 'product', 'i', 'have', 'been', 'using', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'has', 'it', 'is', 'clearly', 'better', 'than', 'any', 'other', 'products', 'i', 'have', 'used']
            ---
            Pre-processing with lemmatization:
            ['this', 'be', 'an', 'amaze', 'product', 'i', 'have', 'be', 'use', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clearly', 'better', 'than', 'any', 'other', 'product', 'i', 'have', 'use']
            ---
            Pre-processing with lemmatization and stemming:
            ['this', 'be', 'an', 'amaz', 'product', 'i', 'have', 'be', 'use', 'it', 'for', 'almost', 'a', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clear', 'better', 'than', 'ani', 'other', 'product', 'i', 'have', 'use']
            ---
            Adding stop words
            ['amaz', 'product', 'i', 'have', 'use', 'for', 'almost', 'a', 'year', 'now', 'and', 'have', 'clear', 'better', 'than', 'ani', 'other', 'product', 'i', 'have', 'use']
            ---
        
        
        ## The order of stop words removal and lemmatization/stemming
        
        The current algorithm **performs lemmatization and stem before stop-words removal**. Thus,
        
        1. You need to be carefull when defining a list of stop words. For example, including the term "product" will also remove the term "production" if you set the stem_flag to True or the term "products" if you set lemma_flag to True.
        
        2. When the lemma_flag is set to True, terms like "is" and "are" will be lemmatized to "be". And if "be" is not in the list of stopwords, it will remain. It is recommended that you process the list of stop-words as well if you decide to perform lemmatization
        
        
        ```python
        """
        Example
        """
        
        text = "products, production, is"
        stop_words = ['product','is']
        tk = TextPreProcessor(text = text, lemma_flag= False, stem_flag = False, \
                               stopwords=stop_words)
        # Use the .txt_pre_pros_all method instead when the input is a corpus
        print(tk.process())
        ```
        
            ['products', 'production']
        
        
        
        ```python
        tk = TextPreProcessor(text = text, lemma_flag= True, stem_flag = False, \
                               stopwords=stop_words)
        # Use the .txt_pre_pros_all method instead when the input is a corpus
        print(tk.process())
        ```
        
            ['production', 'be']
        
        
        
        ```python
        tk = TextPreProcessor(text = text, lemma_flag= True, stem_flag = True, \
                               stopwords=stop_words)
        # Use the .txt_pre_pros_all method instead when the input is a corpus
        print(tk.process())
        ```
        
            ['be']
        
        
Keywords: text preprocessing,tokenize,NLP
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
