Metadata-Version: 2.1
Name: genz-tokenize
Version: 1.0.9
Summary: Subword tokenizer
Home-page: https://github.com/nghiemIUH/genz-tokenize
Author: Van Nghiem
Author-email: vannghiem848@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE

# Genz Tokenize



[Github](https://github.com/nghiemIUH/genz-tokenize)



## Installation



    pip install genz-tokenize



## Using the standard tokenizer



```python
>>> from genz_tokenize import Tokenize
# use the vocab bundled with the library
>>> tokenize = Tokenize()
>>> print(tokenize(['sinh_viên công_nghệ', 'hello'], maxlen=5))
# [[1, 288, 433, 2, 0], [1, 20226, 2, 0, 0]]
>>> print(tokenize.decode([1, 288, 2]))
# <s> sinh_viên </s>
# use your own vocab
>>> tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')
```
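To give an intuition for what happens under the hood, here is a minimal sketch of the byte-pair-encoding (BPE) merge idea that subword tokenizers of this kind are based on. The merge table below is hypothetical and purely illustrative; it is not the library's `bpe.codes` or its actual implementation.

```python
# Minimal illustration of greedy BPE merging, not the library's code.
def bpe_segment(word, merges):
    """Split a word into characters, then apply merge rules in priority order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                # merge the adjacent pair into one symbol
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical merge table, most frequent pairs first.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_segment("lower", merges))  # ['low', 'er']
```

A real `bpe.codes` file is simply a much larger list of such learned merge pairs, ordered by frequency in the training corpus.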



## Using the tokenizer with BERT models from the transformers library



```python
>>> from genz_tokenize import TokenizeForBert
# use the vocab bundled with the library
>>> tokenize = TokenizeForBert()
>>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True))
# {'input_ids': [[1, 287, 432, 2, 0], [1, 20225, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}
# use your own vocab
>>> tokenize = TokenizeForBert.fromFile('vocab.txt', 'bpe.codes')
```



### You can build your own vocab with [subword-nmt (learn-joint-bpe-and-vocab)](https://github.com/rsennrich/subword-nmt)
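For example, assuming a plain-text training corpus named `corpus.txt` (a hypothetical file name), the `vocab.txt` and `bpe.codes` files that `fromFile` expects could be produced along these lines:

```shell
# install the tool, then learn BPE merge rules and a vocabulary jointly
pip install subword-nmt
subword-nmt learn-joint-bpe-and-vocab \
    --input corpus.txt \
    -s 10000 \
    -o bpe.codes \
    --write-vocabulary vocab.txt
```

Here `-s 10000` is an illustrative number of merge operations; tune it to your corpus size.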



