Metadata-Version: 2.1
Name: tetun_lid
Version: 0.0.8
Summary: Tetun Language Identification Model
Author-email: Gabriel de Jesus <gabriel.dejesus@timornews.tl>
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.7
Description-Content-Type: text/markdown

### Tetun LID
The Tetun Language Identification (Tetun LID) model is a machine learning model that automatically identifies the language of a given text. It is designed to recognize four languages commonly spoken in Timor-Leste: Tetun, Portuguese, Indonesian, and English.


Using a combination of machine learning algorithms and linguistic features, Tetun LID was trained on a large corpus of text data to recognize the characteristics and linguistic patterns of each language. Because it distinguishes among these four languages, it is useful for anyone working with multilingual text from Timor-Leste in natural language processing (NLP) and information retrieval (IR), for example in language-specific search engines, sentiment analysis, and machine translation.

### Installation

To install Tetun LID, run the following command in your console:

```
pip install tetun-lid
```

### Dependencies

The Tetun LID package depends on the following packages:

* joblib
* scikit-learn
* Unidecode

To install the dependencies, use the following commands:

```
pip install joblib
pip install scikit-learn
pip install Unidecode
```
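
As a quick sanity check after installation, the following sketch reports which of the dependencies cannot be found in the current environment. The helper name `missing_modules` is hypothetical and not part of Tetun LID. Note that the import names differ from the pip names for two of the packages: scikit-learn is imported as `sklearn` and Unidecode as `unidecode`.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names in module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names, not pip names: scikit-learn -> sklearn, Unidecode -> unidecode.
missing = missing_modules(["joblib", "sklearn", "unidecode"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are available.")
```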

### Usage

To use Tetun LID, import `lid` from the `tetunlid` package:

1. To predict the language of a single sentence:

```python

from tetunlid import lid

input_text = "Sé mak hamriik iha ne'ebá?"
output = lid.predict_language(input_text)

print(output)
```

The output will be:

```
Tetun
```

2. To see the probabilities behind a prediction, such as why the text above was identified as Tetun, use the `predict_detail()` function.

```python

from tetunlid import lid

input_list_of_str = ["Sé mak hamriik iha ne'ebá?"]
output_detail = lid.predict_detail(input_list_of_str)
print('\n'.join(output_detail))
```

The output will be:

```
Input text: "Sé mak hamriik iha ne'ebá?"
Probability:
        English: 0.0007
        Indonesian: 0.0007
        Portuguese: 0.0006
        Tetun: 0.9980
Thus, the input text is "Tetun" with a confidence level of 99.80%.
```

`Note`: both the input parameter and the output of `predict_detail()` are a `List[str]` (a list of strings). To view the result in the console, print it with a `for` loop or with `join()`, as in the example above.

3. You can pass texts in multiple languages as input. Observe the following example:

```python
from tetunlid import lid

multiple_langs = ["Ha'u ema baibain", "I am quite busy",
                  "Kamu malas sekali", "Vou sair daqui"]

output = [(ml, lid.predict_language(ml)) for ml in multiple_langs]
print(output)
```

The output will be:

```
[("Ha'u ema baibain", 'Tetun'), ('I am quite busy', 'English'), ('Kamu malas sekali', 'Indonesian'), ('Vou sair daqui', 'Portuguese')]
```

You can use a `for` loop, or any similar approach, to print each result on its own line in the console:

```python
from tetunlid import lid

input_texts = ["Ha'u ema baibain", "I am quite busy",
               "Kamu malas sekali", "Vou sair daqui"]

for input_text in input_texts:
    lang = lid.predict_language(input_text)
    print(f"{input_text} ({lang})")
```

The output will be:

```
Ha'u ema baibain (Tetun)
I am quite busy (English)
Kamu malas sekali (Indonesian)
Vou sair daqui (Portuguese)
```
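
When classifying many texts, it can be handy to group them by predicted language. The sketch below shows one way to do that. `group_by_language` is a hypothetical helper, not part of Tetun LID, and it takes the prediction function as a parameter so that it works with any callable that maps a string to a label. A stand-in predictor is used here for illustration; in practice you would pass `lid.predict_language`.

```python
from collections import defaultdict


def group_by_language(texts, predict):
    """Group texts by the label that predict(text) returns."""
    groups = defaultdict(list)
    for text in texts:
        groups[predict(text)].append(text)
    return dict(groups)


# Stand-in predictor for illustration only. For real predictions, use:
#   from tetunlid import lid
#   group_by_language(texts, lid.predict_language)
known_labels = {"Ha'u ema baibain": "Tetun", "I am quite busy": "English"}


def fake_predict(text):
    return known_labels.get(text, "Unknown")


texts = ["Ha'u ema baibain", "I am quite busy"]
for language, items in group_by_language(texts, fake_predict).items():
    print(f"{language}: {items}")
```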

To see the details for each input, pass the whole list to `predict_detail()`, the same function illustrated above:

```python

from tetunlid import lid

input_texts = ["Ha'u ema baibain", "I am quite busy",
               "Kamu malas sekali", "Vou sair daqui"]

output_multiple_detail = lid.predict_detail(input_texts)
print('\n'.join(output_multiple_detail))
```

The output will be:

```
Input text: "Ha'u ema baibain"
Probability:
        English: 0.0027
        Indonesian: 0.0028
        Portuguese: 0.0024
        Tetun: 0.9920
Thus, the input text is "Tetun" with a confidence level of 99.20%.


Input text: "I am quite busy"
Probability:
        English: 0.9974
        Indonesian: 0.0007
        Portuguese: 0.0015
        Tetun: 0.0004
Thus, the input text is "English" with a confidence level of 99.74%.


Input text: "Kamu malas sekali"
Probability:
        English: 0.0001
        Indonesian: 0.9997
        Portuguese: 0.0001
        Tetun: 0.0001
Thus, the input text is "Indonesian" with a confidence level of 99.97%.



Input text: "Vou sair daqui"
Probability:
        English: 0.0034
        Indonesian: 0.0030
        Portuguese: 0.9912
        Tetun: 0.0023
Thus, the input text is "Portuguese" with a confidence level of 99.12%.
```

4. You can also use Tetun LID to predict the language of each line in a file that contains text in various languages. Here is an example:

```python
from pathlib import Path
from tetunlid import lid


file_path = Path("myfile/example.txt")

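try:
    # Read the file line by line, stripping surrounding whitespace.
    with file_path.open('r', encoding='utf-8') as f:
        contents = [line.strip() for line in f]
except FileNotFoundError:
    print(f"File not found at: {file_path}")
    contents = []  # Nothing to classify if the file is missing.

# Pair each line with its predicted language.
output = [(content, lid.predict_language(content)) for content in contents]
print(output)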
```

There are other ways to read the file contents that achieve the same output.
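
For instance, one alternative is `Path.read_text()` followed by `splitlines()`. The sketch below wraps this in a hypothetical helper, `read_nonempty_lines`, which is not part of Tetun LID. Unlike the example above, it also skips blank lines so that empty strings are not passed to the model. The file path is a placeholder.

```python
from pathlib import Path


def read_nonempty_lines(path):
    """Return the stripped, non-empty lines of a UTF-8 text file.

    Returns an empty list if the file does not exist.
    """
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return []
    return [line.strip() for line in text.splitlines() if line.strip()]


contents = read_nonempty_lines("myfile/example.txt")
print(contents)
# Each entry can then be passed to lid.predict_language(), as in the example above.
```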

### Additional notes

1. Please follow the instructions above as written, and make sure that all the dependencies are installed.
2. If you encounter `AttributeError: 'list' object has no attribute 'predict_proba'`, the package may not have been installed correctly. Please send me an email, and I will guide you on how to handle the error.
3. Please make sure that you are using the latest version of Tetun LID. To get the latest version, run this command in your console: `pip install --upgrade tetun-lid`.
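
To check which version is currently installed before upgrading, the standard library's `importlib.metadata` module can be used (Python 3.8 or later). This is a minimal sketch: `installed_version` is a hypothetical helper, and the distribution name `tetun-lid` is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None if the package is not installed.
print(installed_version("tetun-lid"))
```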