Metadata-Version: 2.1
Name: meguru-tokenizer
Version: 0.2.0
Summary: simple tokenizer for tensorflow 2.x and PyTorch
Home-page: https://github.com/MokkeMeguru/meguru_tokenizer
Author: MokkeMeguru
Author-email: meguru.mokke@gmail.com
License: MIT license
Description: # meguru tokenizer
        
        # installation and initialization
        
        ```shell
        pip install meguru_tokenizer
        sudachipy link -t full
        ```
        
        # Abstruction of Usage
        
        1.  Preprocess Using Each Tokenizer
            e.g. sentencepiece preprocess / sudachi preprocess
        2.  Tokenize in your code using its Tokenizer
            - basis    
              see. [official docs](https://mokkemeguru.github.io/meguru_tokenizer/index.html)
            - Tensorflow    
        	  see. [tutorial](./tutorials/01_tokenize_tf.ipynb)
            - TODO: PyTorch
        
        # RealWorld Example
        
        ```python
        import meguru_tokenizer.whitespace_tokenizer import WhitespaceTokenizer
        import pprint
        
        sentences = [
            "Hello, I don't know how to use it?",
            "Tensorflow is awesome!",
            "it is good framework.",
        ]
        
        # define tokenizer and vocaburary
        tokenizer = WhitespaceTokenizer(lower=True)
        vocab = Vocab()
        
        # build vocaburary
        for sentence in sentences:
            vocab.add_vocabs(tokenizer.tokenize(sentence))
        vocab.build_vocab()
        
        # set vocaburary into tokenizer to enable encoding
        tokenizer.vocab = vocab
        
        # save vocaburary information
        vocab.dump_vocab(Path("vocab.txt"))
        print("vocabs:")
        pprint.pprint(vocab.i2w)
        
        # tokenize
        print("tokenized sentence")
        pprint.pprint(tokenizer.tokenize_list(sentences))
        
        # [['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
        #  ['tensorflow', 'is', 'awesome', '!'],
        #  ['it', 'is', 'good', 'framework', '.']]
        
        # encode
        print("encoded sentence")
        pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
        
        # [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
        
        print("decoded sentence")
        pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
        # ["hello , i do n't know how to use it ?",
        #  'tensorflow is awesome !',
        #  'it is good framework .']
        
        vocab_size = len(vocab)
        
        # restore the vocaburary from dumped file
        print("reload from dump file")
        vocab = Vocab()
        vocab.load_vocab(Path("vocab.txt"))
        assert vocab_size == len(vocab)
        
        tokenizer = WhitespaceTokenizer(vocab=vocab)
        pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
        
        # [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
        
        # vocaburary with minimum frequency limitation
        vocab = Vocab()
        for sentence in sentences:
            vocab.add_vocabs(tokenizer.tokenize(sentence))
        vocab.build_vocab(min_freq=2)
        assert vocab_size != len(vocab)
        
        # vocaburary with maximum voaburary size
        vocab = Vocab()
        for sentence in sentences:
            vocab.add_vocabs(tokenizer.tokenize(sentence))
        vocab.build_vocab(vocab_size=10)
        assert 10 == len(vocab)
        ```
        
Keywords: tensorflow,pytorch,tokenizer,nlp
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.5
Description-Content-Type: text/markdown
