Metadata-Version: 2.1
Name: Japanese-BPE-Tokenizer
Version: 0.1.2
Summary: MeCab-based BPE tokenizer for Japanese text
Home-page: http://github.com/stsuchi/Japanese-BPE-Tokenizer
Author: Shiro T.
License: MIT
Description: # Byte Pair Encoding for Japanese Language
        
        ## Summary
        The package tokenizes Japanese text by combining MeCab with the Byte Pair Encoding (BPE) algorithm.
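        
        To illustrate the BPE half of this combination, here is a minimal pure-Python sketch of how BPE merges can be learned from words that have already been segmented. It is not this package's implementation: the function name `learn_bpe_merges` and the hand-written word list (a stand-in for MeCab output) are assumptions made for the example.
        
        ```python
        from collections import Counter
        
        
        def learn_bpe_merges(words, num_merges):
            """Learn up to num_merges BPE merges from pre-segmented words."""
            # Represent each word as a tuple of symbols, weighted by its frequency.
            vocab = Counter(tuple(word) for word in words)
            merges = []
            for _ in range(num_merges):
                # Count adjacent symbol pairs across the whole corpus.
                pairs = Counter()
                for symbols, freq in vocab.items():
                    for pair in zip(symbols, symbols[1:]):
                        pairs[pair] += freq
                if not pairs:
                    break
                best = max(pairs, key=pairs.get)
                merges.append(best)
                # Replace every occurrence of the best pair with one merged symbol.
                new_vocab = Counter()
                for symbols, freq in vocab.items():
                    merged, i = [], 0
                    while i < len(symbols):
                        if symbols[i:i + 2] == best:
                            merged.append(symbols[i] + symbols[i + 1])
                            i += 2
                        else:
                            merged.append(symbols[i])
                            i += 1
                    new_vocab[tuple(merged)] += freq
                vocab = new_vocab
            return merges, vocab
        
        
        # Hypothetical, hand-written stand-in for MeCab output (already segmented).
        words = ["東京", "東京", "東京都", "京都"]
        merges, vocab = learn_bpe_merges(words, num_merges=2)
        ```
        
        Each learned merge is a pair of adjacent symbols; applying the merges in order to new text reproduces the subword segmentation learned from the training corpus.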
        
        ## Usage
        To train a new tokenizer, import the training class:
        ```
        from jpn_bpe_tokenizer import MecabBPETrainTokenizer
        ```
        
        Create an instance:
        ```
        tokenizer = MecabBPETrainTokenizer()
        ```
        
        Call `train` with the training text (`files`), `vocab_size`, and `special_tokens`:
        ```
        tokenizer.train(files,
                        vocab_size=52000,
                        special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"])
        ```
        
        Save the trained config files to a directory (here, `config`):
        ```
        tokenizer.save_model('config')
        ```
        
        Once you have the config files, load them into a `MecabBPETokenizer` object to tokenize new text:
        ```
        from jpn_bpe_tokenizer import MecabBPETokenizer
        
        tokenizer = MecabBPETokenizer('config/vocab.json','config/merges.txt')
        ```
        
Platform: UNKNOWN
Description-Content-Type: text/markdown
