Metadata-Version: 2.1
Name: nlp-id
Version: 0.1.10.3
Summary: Kumparan's NLP Services
Home-page: https://github.com/kumparan/nlp-id
Author: Frandy Eddy
Author-email: eddy.frandy@gmail.com
License: MIT
Keywords: Indonesian,Bahasa,NLP
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
License-File: LICENSE

# Kumparan's NLP Services

`nlp-id` is a collection of modules which provides various functions for Natural Language Processing for Bahasa Indonesia. This repository contains all source code related to NLP services.

## Installation

To install `nlp-id`, use the following command:

    $ pip install nlp-id     


## Usage

Description on how to use the lemmatizer, tokenizer, POS-tagger, etc. will be explained in more detail in this section.

### Lemmatizer

Lemmatizer is used to get the root words from every word in a sentence.

    from nlp_id.lemmatizer import Lemmatizer 
    lemmatizer = Lemmatizer() 
    lemmatizer.lemmatize('Saya sedang mencoba') 
    # saya sedang coba 
    
### Tokenizer

Tokenizer is used to convert text into tokens of word, punctuation, number, date, email, URL, etc. 
There are two kinds of tokenizer in this repository, **standard tokenizer** and **phrase tokenizer**. 
The **standard tokenizer** tokenizes the text into separate tokens where the word tokens are single-word tokens.
Tokens that started with *ku-* or ended with *-ku*, *-mu*, *-nya*, *-lah*, *-kah* will be split if it is personal pronoun or particle.

    from nlp_id.tokenizer import Tokenizer 
    tokenizer = Tokenizer() 
    tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
    # ['Lionel', 'Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta', 'Pusat', '.']

    tokenizer.tokenize('Lionel Messi pergi ke rumahmu di daerah Jakarta Pusat.') 
    # ['Lionel', 'Messi', 'pergi', 'ke', 'rumah', 'mu', 'di', 'daerah', 'Jakarta', 'Pusat', '.']
    
The **phrase tokenizer** tokenizes the text into separate tokens where the word tokens are phrases (single or multi-word tokens). 

    from nlp_id.tokenizer import PhraseTokenizer 
    tokenizer = PhraseTokenizer() 
    tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
    # ['Lionel Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta Pusat', '.']
    
### POS Tagger

POS tagger is used to obtain the Part-Of-Speech tag from a text.
There are two kinds of POS tagger in this repository, **standard POS tagger** and **phrase POS tagger**. 
The tokens in **standard POS Tagger** are single-word tokens, while the tokens in **phrase POS Tagger** are phrases (single or multi-word tokens).

    from nlp_id.postag import PosTag
    postagger = PosTag() 
    postagger.get_pos_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
    # [('Lionel', 'NNP'), ('Messi', 'NNP'), ('pergi', 'VB'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),  
      ('Jakarta', 'NNP'), ('Pusat', 'NNP'), ('.', 'SYM')]
    
    postagger.get_phrase_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
    # [('Lionel Messi', 'NP'), ('pergi', 'VP'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'), 
      ('Jakarta Pusat', 'NP'), ('.', 'SYM')]

    
Description of tagset used for POS Tagger:

| No. | Tag | Description | Example |
|:-----:|:-----:|:--------|:------------|
| 1 | ADV | Adverbs. Includes adverb, modal, and auxiliary verb | sangat, hanya, justru, boleh, harus, mesti|
| 2 | CC  | Coordinating conjunction. Coordinating conjunction links two or more syntactically equivalent parts of a sentence. Coordinating conjunction can link independent clauses, phrases, or words. | dan, tetapi, atau |
| 3 | DT  | Determiner/article. A grammatical unit which limits the potential referent of a noun phrase, whose basic role is to mark noun phrases as either definite or indefinite.| para, sang, si |
| 4 | FW | Foreign word. Foreign word is a word which comes from foreign language and is not yet included in Indonesian dictionary| online, e-commerce |
| 5 | IN  | Preposition. A preposition links word or phrase and constituent in front of that preposition and results prepositional phrase. | dalam, dengan, di, ke|
| 6 | JJ | Adjective. Adjectives are words which describe, modify, or specify some properties of the head noun of the phrase | bersih, panjang, jauh, marah |
| 7 | NEG | Negation | tidak, belum, jangan |
| 8 | NN | Noun. Nouns are words which refer to human, animal, thing, concept, or understanding | meja, kursi, monyet, perkumpulan |
| 9 | NNP | Proper Noun. Proper noun is a specific name of a person, thing, place, event, etc. | Indonesia, Jakarta, Piala Dunia, Idul Fitri, Jokowi |
| 10 | NUM  | Number. Includes cardinal and ordinal number | 9876, 2019, 0,5, empat |
| 11 | PR  | Pronoun. Includes personal pronoun and demonstrative pronoun | saya, kami, kita, kalian, ini, itu |
| 12 | RP  | Particle. Particle which confirms interrogative, imperative, or declarative sentences | pun, lah, kah|
| 13 | SC  | Subordinating Conjunction. Subordinating conjunction links two or more clauses and one of the clauses is a subordinate clause. | sejak, jika, seandainya, dengan, bahwa, yang|
| 14 | SYM | Symbols and Punctuations  | +,%,@ |
| 15 | UH | Interjection. Interjection expresses feeling or state of mind and has no relation with other words syntactically. | ayo, mari, aduh|
| 16 | VB | Verb. Includes transitive verbs, intransitive verbs, active verbs, passive verbs, and copulas. | tertidur, bekerja, membaca |
| 17 | WH | Question words | siapa, apa, kapan, bagaimana |
| 18 | ADJP | Adjective Phrase. A group of words headed by an adjective that describes a noun or a pronoun | sangat tinggi |
| 19 | DP | Date Phrase. Date written with whitespaces | 1 Januari 2020 |
| 20 | NP | Noun Phrase. A phrase that has a noun (or indefinite pronoun) as its head | Jakarta Pusat, Lionel Messi |
| 21 | NUMP | Number Phrase.  | 10 juta |
| 22 | VP | Verb Phrase. A syntactic unit composed of at least one verb and its dependents | tidak makan |

### Stopword

`nlp-id` also provide list of Indonesian stopword.

    from nlp_id.stopword import StopWord 
    stopword = StopWord() 
    stopword.get_stopword() 
    # [{list_of_nlp_id_stopword}]    

Stopword Removal is used to remove every Indonesian stopword from the given text.

    from nlp_id.stopword import StopWord 
    text = "Lionel Messi pergi Ke pasar di area Jakarta Pusat" # single sentence
    stopword = StopWord() 
    stopword.remove_stopword(text)
    # Lionel Messi pergi pasar area Jakarta Pusat  
    
    paragraph = "Lionel Messi pergi Ke pasar di area Jakarta Pusat itu. Sedangkan Cristiano Ronaldo ke pasar Di area Jakarta Selatan. Dan mereka tidak bertemu begini-begitu."
    stopword.remove_stopword(text)
    # Lionel Messi pergi pasar area Jakarta Pusat. Cristiano Ronaldo pasar area Jakarta Selatan. bertemu.
    
## Training and Evaluation

Our model is trained using stories from kumparan as the dataset. We managed to get ~95% accuracy on our test set.
    
## Citation
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4556870.svg)](https://doi.org/10.5281/zenodo.4556870)


