Metadata-Version: 2.1
Name: pylade
Version: 0.2.1
Summary: PyLaDe - Language Detection tool written in Python.
Home-page: https://github.com/fievelk/pylade
License: MIT
Author: Pierpaolo Pantone
Author-email: 24alsecondo@gmail.com
Requires-Python: >=3.7,<=3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: nltk (>=3.8.1,<4.0.0)
Project-URL: Repository, https://github.com/fievelk/pylade
Description-Content-Type: text/markdown

# PyLaDe

[![Build Status](https://travis-ci.org/fievelk/pylade.svg?branch=master)](https://travis-ci.org/fievelk/pylade)

`pylade` is a lightweight language detection tool written in Python. The tool provides a ready-to-use command-line interface, along with more complex scaffolding for customized tasks.

The current version of `pylade` implements the *Cavnar-Trenkle N-Gram-based approach*. The tool can also be extended with custom language identification implementations.
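For readers unfamiliar with the Cavnar-Trenkle approach, here is a minimal, self-contained sketch of the general technique: build a ranked profile of the most frequent character n-grams for each language, then pick the language whose profile has the smallest "out-of-place" distance from the profile of the input text. This is not `pylade`'s actual code. The function names are illustrative, and the roles given here to `limit` (profile size) and `error_value` (penalty for an n-gram missing from a language profile) are assumptions about how the similarly named options used later in this README behave.

```python
from collections import Counter


def ngram_profile(text, max_n=5, limit=300):
    """Return the `limit` most frequent character n-grams, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:limit]]


def out_of_place(text_profile, language_profile, error_value):
    """Sum the rank differences; n-grams absent from the language profile cost `error_value`."""
    rank = {gram: position for position, gram in enumerate(language_profile)}
    return sum(
        abs(position - rank[gram]) if gram in rank else error_value
        for position, gram in enumerate(text_profile)
    )


def detect(text, language_profiles, limit=300, error_value=300):
    """Return the language whose profile is closest to the profile of `text`."""
    text_profile = ngram_profile(text, limit=limit)
    return min(
        language_profiles,
        key=lambda lang: out_of_place(text_profile, language_profiles[lang], error_value),
    )


# Toy profiles built from one sentence each; real models need far more text.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "it": ngram_profile("la volpe veloce salta sopra il cane pigro e il gatto"),
}
print(detect("the dog and the fox", profiles))
```

A larger `limit` keeps rarer n-grams in each profile, which generally costs more time and memory per comparison.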

- [Installation](#installation)
- [Usage](#usage)
  - [Train a model on a training set](#train-a-model-on-a-training-set)
  - [Evaluate a model on a test set](#evaluate-a-model-on-a-test-set)
  - [Detect language of a text using a trained model](#detect-language-of-a-text-using-a-trained-model)
  - [Custom implementations and corpora](#custom-implementations-and-corpora)
- [Development and testing](#development-and-testing)
- [Notes](#notes)
- [References](#references)


## Installation

You can install `pylade` using pip:

```bash
$ pip install pylade
```


## Usage

For quick use, run the following command from a terminal:

```console
$ pylade "Put text here"
en
```
Done!

If you want to go deeper and use more advanced features, keep reading. **Note:** you can get more information about each of the following commands with the `--help` flag.

### Train a model on a training set

```console
$ pylade_train \
    training_set.csv \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output model.json \
    --train-args '{"limit": 5000, "verbose": "True"}'
```

`--train-args` is a dictionary of arguments to be passed to the `train()` method of the chosen implementation (`CavnarTrenkleImpl` in the example above). For an accurate description of the arguments please refer to the `train()` method docstring.
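Quoting a dictionary by hand on the command line is error-prone, so it can help to generate the argument from Python. The sketch below assumes that the value of `--train-args` is parsed as JSON (the example above is valid JSON, but this README does not state the parser). It reuses the flags and values from the example and only builds the command string; it does not run it.

```python
import json
import shlex

# Same values as in the example above; "True" is kept as a string on purpose.
train_args = {"limit": 5000, "verbose": "True"}

# json.dumps guarantees double-quoted keys and strings, as JSON requires.
encoded = json.dumps(train_args)

# shlex.quote wraps the JSON so a POSIX shell passes it through as one argument.
command = " ".join([
    "pylade_train",
    "training_set.csv",
    "--implementation", "CavnarTrenkleImpl",
    "--corpus-reader", "TwitterCorpusReader",
    "--output", "model.json",
    "--train-args", shlex.quote(encoded),
])
print(command)
```

The same pattern applies to any other option that takes a dictionary.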

**Note:** to define a new training set, follow the format of the file `tests/test_files/training_set_example.csv`.

### Evaluate a model on a test set

```console
$ pylade_eval \
    test_set.csv \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output results.json \
    --eval-args '{"languages": ["it", "de"], "error_values": 8000}'
```

`--eval-args` is a dictionary of arguments to be passed to the `evaluate()` method of the chosen implementation (`CavnarTrenkleImpl` in the example above). For an accurate description of the arguments please refer to the `evaluate()` method docstring.

### Detect language of a text using a trained model

```console
$ pylade \
    "Put text here" \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --output detected_language.txt \
    --predict-args '{"error_value": 8000}'
```

`--predict-args` is a dictionary of arguments to be passed to the `predict_language()` method of the chosen implementation (`CavnarTrenkleImpl` in the example above). For an accurate description of the arguments please refer to the `predict_language()` method docstring.

### Custom implementations and corpora

Different language detection approaches can be implemented by creating new classes that inherit from the `Implementation` class. This class should be treated as an interface whose methods must be implemented by the inheriting class.

Customized corpus readers can be created the same way, inheriting from the `CorpusReader` interface instead.
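As a rough illustration of this pattern, the sketch below defines a stand-in `Implementation` base class and a toy subclass. Everything here is hypothetical: the stand-in is declared locally so the example runs on its own, and the method signatures, the shape of the training data, and the returned model are assumptions, not `pylade`'s real interface. Check the actual `Implementation` class and its docstrings before writing a real subclass.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Implementation(ABC):
    """Local stand-in for pylade's `Implementation` interface (signatures are assumed)."""

    @abstractmethod
    def train(self, labeled_texts, **kwargs): ...

    @abstractmethod
    def evaluate(self, model, labeled_texts, **kwargs): ...

    @abstractmethod
    def predict_language(self, text, model, **kwargs): ...


class CommonWordsImpl(Implementation):
    """Toy detector: picks the language whose frequent words overlap the text most."""

    def train(self, labeled_texts, limit=50):
        counts = {}
        for language, text in labeled_texts:
            counts.setdefault(language, Counter()).update(text.lower().split())
        # The "model" is just the `limit` most common words per language.
        return {lang: [w for w, _ in c.most_common(limit)] for lang, c in counts.items()}

    def evaluate(self, model, labeled_texts):
        # Fraction of texts whose predicted language matches the label.
        hits = sum(self.predict_language(t, model) == lang for lang, t in labeled_texts)
        return hits / len(labeled_texts)

    def predict_language(self, text, model):
        words = set(text.lower().split())
        return max(model, key=lambda lang: len(words & set(model[lang])))


impl = CommonWordsImpl()
model = impl.train([("en", "the cat and the dog"), ("it", "il gatto e il cane")])
print(impl.predict_language("the dog", model))
```

Using an abstract base class means that forgetting to implement one of the methods fails at instantiation time rather than at the first call.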


## Development and testing

You can install development requirements using Poetry (`poetry install`). This will also install requirements needed for testing.

To run tests, just run `tox` from the package root folder.


## Notes

The default model (`data/model.json`) has been trained using `limit = 5000`. This value provides a good balance between computational performance and accuracy. Note that the best value might be different if you use your own data to train a new model.


## References

- Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." *Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval* (1994): 161-175.

