Metadata-Version: 2.1
Name: basicthainlp
Version: 0.3.2
Summary: Basic nlp for thai
Home-page: UNKNOWN
Author: bablueza
Author-email: bablueza@gmail.com
License: MIT
Platform: UNKNOWN
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE.txt

# Colab
https://drive.google.com/file/d/1G7OUNsCC-B5XHNd8V5Et1ZKpJp4R66hg/view?usp=share_link
================================================================================
# Update
## 0.3.1
* Add POS Tagging
## 0.2.7
* Add wrap function get_ps
## 0.2.1
* Add Token Identification
================================================================================
# Token Identification
## Example code TokenIden
```
from basicthainlp import TokenIden
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
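# Identify the token types in the raw text
tokenIdenList = TID.tagTokenIden(textTest)
# Split the text into tokens, with one tag per token
textTokenList, tagList = TID.toTokenList(textTest, tokenIdenList)
# Print every token except those tagged 'otherSymb' or 'space'
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x, y)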
```
## Example code TokenIden: Add dict
The input is a folder containing files, each of which is a single-column list of words.<br>
The resulting tag matches the file name.
### Example dict
input<br>
--abbreviation.txt # the resulting tag is abbreviation<br>
----กค.<br>
----สค.<br>
----กพ.<br>
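
The folder layout above can be created with a few lines of standard-library Python. This is only a sketch: the helper name `write_dict_folder` is hypothetical (it is not part of basicthainlp), and it assumes the files are UTF-8 text with one word per line.

```python
from pathlib import Path


def write_dict_folder(folder, dicts):
    """Write one <tag>.txt file per tag, one word per line.

    Hypothetical helper, not part of basicthainlp. UTF-8 encoding and
    the one-word-per-line layout are assumptions.
    """
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    written = []
    for tag, words in dicts.items():
        path = root / f"{tag}.txt"
        path.write_text("\n".join(words) + "\n", encoding="utf-8")
        written.append(path.name)
    return written


# The file name (without .txt) is the tag the words will receive.
print(write_dict_folder("input", {"abbreviation": ["กค.", "สค.", "กพ."]}))
```

The example below then loads this folder with `DTK.readFloder('input')`.
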
```
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
textTest = "the 1..  .25 \"(12,378.36 / -78.9%) = 76,909\tcontain iphone 13 45. +-*/ -5 12.10.226.38.25 กค. สิงหา%/<=>  6 's\n"
tokenIdenList = TID.tagTokenIden(textTest)
# >>> Add dict 
DTK = DictToken()
DTK.readFloder('input')
tokenIdenList = DTK.rep_dictToken(textTest,tokenIdenList)
# <<< Add dict
textTokenList,tagList = TID.toTokenList(textTest,tokenIdenList)
for x, y in zip(textTokenList, tagList):
    if y != 'otherSymb' and y != 'space':
        print(x,y)
```
================================================================================
# Word token to Pseudo Morpheme Segmentation
- Not recommended for long Thai sentences. Segment the text into words first, or use it together with Token Identification.
## Example code PmSeg
```
from basicthainlp import PmSeg
ps = PmSeg()

textTest = 'รัฐราชการ'
data_list = ps.word2DataList(textTest)
print(data_list)
pred = ps.dataList2pmSeg(data_list)
print(list(textTest))
print(pred[0])
print(ps.pmSeg2List(list(textTest),pred[0]))
```
```
[['ร', 'Ccc'], ['ั', 'Vu'], ['ฐ', 'C'], ['ร', 'Ccc'], ['า', 'Vm'], ['ช', 'C'], ['ก', 'C'], ['า', 'Vm'], ['ร', 'Ccc']]
['ร', 'ั', 'ฐ', 'ร', 'า', 'ช', 'ก', 'า', 'ร']
['B', 'I', 'C', 'B', 'I', 'C', 'B', 'I', 'I']
['รัฐ', 'ราช', 'การ']
```
## Example code PmSeg: use with Token Identification
```
from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
def get_ps(textInput):
    tokenIdenList = TID.tagTokenIden(textInput)
    tokenIdenList = DTK.rep_dictToken(textInput, tokenIdenList)
    textTokenList, tagList = TID.toTokenList(textInput, tokenIdenList)
    # ['otherSymb','mathSymb','punc','th_char','th_mym','en_char','digit','order','url','whitespace','space','newline','abbreviation','ne']
    # newTokenList = TID.replaceTag(['digit=<digit>'],textTokenList,tagList)
    newTokenList = []
    for textToken, tag in zip(textTokenList, tagList):
        if tag == 'th_char':
            data_list = ps.word2DataList(textToken)
            pred = ps.dataList2pmSeg(data_list)
            psList = ps.pmSeg2List(list(textToken), pred[0])
            newTokenList.extend(psList)
        else:
            newTokenList.append(textToken)
    return newTokenList
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(textTest))
```
### Or use the basicthainlp wrapper function, which works the same way as the code above
```
from basicthainlp import PmSeg
from basicthainlp import TokenIden, DictToken
from basicthainlp import get_ps
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
ps = PmSeg()
textTest = 'ติดตามข่าวล่าสุด 12 iphone'
print(get_ps(tid_cls=TID,dtk_cls=DTK,ps_cls=ps,textInput=textTest))
print(get_ps(textInput=textTest))
```
================================================================================
# POS Tagging
POS tagging works on pseudo morpheme (pm) tokens: the pm tokens are tagged, and the tagged tokens are then combined into words with POS tags.
## Example code PosTag
```
from basicthainlp import TokenIden,DictToken
from basicthainlp import PmSeg
from basicthainlp import PosTag
TID = TokenIden()
DTK = DictToken()
DTK.readFloder('input')
PS = PmSeg()
textTest = 'จากนั้นคนร้ายก็ได้ขับมุ่งไปทางถนนเจริญกรุง' 
pos_cls = PosTag(tid_cls=TID,dtk_cls=DTK,ps_cls=PS)
ps_list,tag_list = pos_cls.tagPOS(textTest)
print(ps_list)
print(tag_list)
word_list,pos_list = pos_cls.psSeg2WS(ps_list,tag_list)
print(word_list)
print(pos_list)
```

