Metadata-Version: 2.1
Name: pylade
Version: 0.2.0
Summary: PyLaDe - Language Detection tool written in Python.
Home-page: https://github.com/fievelk/pylade
License: MIT
Author: Pierpaolo Pantone
Author-email: 24alsecondo@gmail.com
Requires-Python: >=3.7,<=3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: nltk (>=3.8.1,<4.0.0)
Project-URL: Repository, https://github.com/fievelk/pylade
Description-Content-Type: text/markdown

# PyLaDe

[![Build Status](https://travis-ci.org/fievelk/pylade.svg?branch=master)](https://travis-ci.org/fievelk/pylade)

`pylade` is a lightweight language detection tool written in Python. The tool provides a ready-to-use command-line interface, along with a more complex scaffolding for customized tasks.

The current version of `pylade` implements the *Cavnar-Trenkle N-Gram-based approach*. However, the tool can be further expanded with customized language identification implementations.
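
To give an idea of how the Cavnar-Trenkle approach works, here is a minimal, self-contained sketch of its two core steps: building a ranked character n-gram profile and comparing profiles with the "out-of-place" distance. It is an illustration only, not the code used by `pylade`. The function names are hypothetical, and the `limit` and `error_value` parameters merely echo the names used in the CLI examples of this README; their exact semantics in `pylade` may differ.

```python
from collections import Counter


def ngram_profile(text, limit=300, max_n=5):
    """Rank the `limit` most frequent character n-grams (n = 1..max_n).

    Returns a dict mapping each n-gram to its rank (0 = most frequent).
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries, as in the original paper
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:limit]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, error_value):
    """Sum of rank differences; n-grams missing from the language profile
    contribute a fixed penalty (`error_value`)."""
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            distance += abs(rank - lang_profile[gram])
        else:
            distance += error_value
    return distance


def detect(text, lang_profiles, limit=300, error_value=None):
    """Return the language whose profile is closest to the text's profile."""
    if error_value is None:
        error_value = limit  # assumption: penalize missing n-grams by the profile size
    doc_profile = ngram_profile(text, limit)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang], error_value),
    )


# Toy training data (hypothetical); real models need far larger corpora.
training_texts = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "it": "il gatto e il cane sono nella casa con la famiglia e gli "
          "altri animali della fattoria",
}
profiles = {lang: ngram_profile(text, limit=1000) for lang, text in training_texts.items()}
print(detect("the dog and the cat", profiles, limit=1000))
```

A larger `limit` keeps more n-grams per profile, which tends to improve accuracy at the cost of memory and comparison time.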

## Requirements

- Python 3.7 through 3.11 (inclusive)
- `nltk`

## Installation

Clone the repository and install it locally with pip:

```bash
$ git clone git@github.com:fievelk/PyLaDe.git
$ cd PyLaDe
$ pip install .
```

## Usage

For quick use, run the following command from a terminal:

```bash
pylade "Put text here"
# en
```
Done!

If you want to go deeper and use the more advanced features, keep reading. **Note:** you can get more information about each of the following commands with the `--help` flag.

### Train a model on a training set

```bash
pylade_train \
    training_set.csv \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output model.json \
    --train-args '{"limit": 5000, "verbose": "True"}'
```

`--train-args` is a dictionary of arguments passed to the `train()` method of the chosen implementation (`CavnarTrenkleImpl` in the example above). For a full description of the arguments, refer to the `train()` method docstring.

**NOTE**: to define a new training set, you can check the format of the file `tests/test_files/training_set_example.csv`.


### Evaluate a model on a test set

```bash
pylade_eval \
    test_set.csv \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output results.json \
    --eval-args '{"languages": ["it", "de"], "error_values": 8000}'
```

`--eval-args` is a dictionary of arguments passed to the `evaluate()` method of the chosen implementation (`CavnarTrenkleImpl` in the example above). For a full description of the arguments, refer to the `evaluate()` method docstring.

### Detect language of a text using a trained model

```bash
pylade \
    "Put text here" \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --output detected_language.txt \
    --predict-args '{"error_value": 8000}'
```

`--predict-args` is a dictionary of arguments passed to the `predict_language()` method of the chosen implementation (`CavnarTrenkleImpl` in the example above). For a full description of the arguments, refer to the `predict_language()` method docstring.
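
If you call the CLI from a Python script, quoting the dictionary argument by hand is error-prone. The sketch below builds the same command as above as an argument list. It assumes the dictionary is passed as a JSON string, as the examples in this README suggest; `build_pylade_command` is a hypothetical helper, not part of `pylade`.

```python
import json
import shlex


def build_pylade_command(text, model_path, predict_args):
    """Build the argument list for the `pylade` CLI call shown above."""
    return [
        "pylade",
        text,
        "--model", model_path,
        "--implementation", "CavnarTrenkleImpl",
        # Assumption: the CLI expects the dictionary as a JSON string.
        "--predict-args", json.dumps(predict_args),
    ]


command = build_pylade_command("Put text here", "model.json", {"error_value": 8000})

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))

# To actually run it (requires `pylade` to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting issues with the spaces and double quotes inside the text and the dictionary.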

## Info

The default model (`data/model.json`) has been trained using `limit = 5000`. This value provides a good balance between computational performance and accuracy. Please note that this might change if you use your own data to train a new model.

## Tests

Run `tox` from the package root to execute the tests.

Tests with `tox` require the following dependencies:

- `tox`
- `pytest`

## Customization

Different language detection approaches can be implemented by creating new classes that inherit from the `Implementation` class. Treat this class as an interface: the inheriting class is expected to implement its methods.

Customized corpus readers can be created the same way, inheriting from the `CorpusReader` interface instead.
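
As a rough illustration of the pattern, the sketch below defines a toy detector on top of a local stand-in for the `Implementation` interface. The stand-in exists only to keep the example self-contained: the method names come from this README, but the signatures, the model format, and the `CharFrequencyImpl` class are assumptions. Check the real `Implementation` class and its docstrings before writing your own subclass.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Implementation(ABC):
    """Local stand-in for pylade's `Implementation` interface.

    Signatures are assumed for this example and may not match the real class.
    """

    @abstractmethod
    def train(self, labeled_texts):
        """Build and return a model from (text, language) pairs."""

    @abstractmethod
    def predict_language(self, text, model):
        """Return the most likely language for `text`."""

    @abstractmethod
    def evaluate(self, model, labeled_texts):
        """Return an accuracy score on (text, language) pairs."""


class CharFrequencyImpl(Implementation):
    """Toy detector based on single-character frequencies (hypothetical)."""

    def train(self, labeled_texts):
        model = {}
        for text, language in labeled_texts:
            model.setdefault(language, Counter()).update(text.lower())
        return model

    def predict_language(self, text, model):
        chars = Counter(text.lower())

        def overlap(language):
            profile = model[language]
            total = sum(profile.values())
            # Weight each character of the text by its relative frequency
            # in the language profile (missing characters count as 0).
            return sum(count * profile[ch] / total for ch, count in chars.items())

        return max(model, key=overlap)

    def evaluate(self, model, labeled_texts):
        correct = sum(
            self.predict_language(text, model) == language
            for text, language in labeled_texts
        )
        return correct / len(labeled_texts)


impl = CharFrequencyImpl()
model = impl.train([("aaa", "x"), ("bbb", "y")])
print(impl.predict_language("aa", model))
```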

## References

- Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." *Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval* (1994): 161-175.

