Metadata-Version: 2.1
Name: iamtokenizing
Version: 0.5.0
Summary: Simple tokenizers: n-grams and chargrams splitting, white space splitting, or splitting using configurable REGEX expression. Based on Span and Token objects from the tokenspan package.
Home-page: https://framagit.org/nlp/iamtokenizing/
Author: François Konschelle - IAM CHU Bordeaux France
Author-email: via.issue@only.please
License: GNU GENERAL PUBLIC LICENSE v.3
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Tokenization for language processing

This package contains generic, configurable tools for cutting a string into sub-parts (cf. [Wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)), called `Token`, and for grouping them into sequences called `Tokens`. A `Token` is a sub-string of a parent string (say, the initial complete text), with associated ranges of non-overlapping characters. The number of associated ranges is arbitrary. A `Tokens` is a collection of `Token` objects. These two classes make it possible to attach a collection of attributes to any `Token` in a versatile way, and to pass these attributes from one object to the next while cutting a `Token` into sub-parts (collected as `Tokens`) and eventually re-merging them into a larger `Token`.
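The idea of a sub-string described by ranges of a parent string can be sketched with the standard library alone. The snippet below does not use the package API: the variable names (`word_ranges`, `multi_ranges`, `multi_token`) are hypothetical, and joining the parts with a space is an arbitrary choice made for this illustration.

```python
import re

text = "A tokenizer cuts a string into parts"

# Each word is described by a (start, stop) range of positions in the parent string.
word_ranges = [match.span() for match in re.finditer(r"\S+", text)]
words = [text[start:stop] for start, stop in word_ranges]

# A single token may gather several non-overlapping ranges of the same parent string.
# Here the second and fifth words are combined (illustrative choice).
multi_ranges = [word_ranges[1], word_ranges[4]]
multi_token = " ".join(text[start:stop] for start, stop in multi_ranges)

print(words)
print(multi_ranges, multi_token)
```

Keeping the ranges rather than only the extracted strings is what lets each part be traced back to its position in the original text.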

The `Token` and `Tokens` classes allow basic tokenization of text, such as word splitting, n-gram splitting, and char-gram splitting of arbitrary size. In addition, they make it possible to combine several non-overlapping sub-strings into a single `Token`, and to attach arbitrary attributes to these parts. Two `Token` objects can be compared in terms of their attributes and/or ranges. Basic set operations can also be applied to them: the operators `+`, `-`, `*` and `/` correspond to the union, difference, intersection and symmetric difference of Python sets, where the sets are the ranges of positions in the parent string.
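As a rough illustration of char-gram splitting and of the set operations on position ranges, here is a minimal sketch using only built-in Python sets. It is not the package's implementation: the helper `extract` is hypothetical, and the native set operators `|`, `-`, `&` and `^` stand in for the `+`, `-`, `*` and `/` operators described above.

```python
text = "tokenization"

# Char-grams of size 3: every window of three consecutive characters.
n = 3
chargrams = [text[i:i + n] for i in range(len(text) - n + 1)]

# Two overlapping ranges of positions in the parent string.
a = set(range(0, 5))  # positions of "token"
b = set(range(3, 8))  # positions of "eniza"


def extract(positions):
    """Hypothetical helper: rebuild the sub-string covered by a set of positions."""
    return "".join(text[i] for i in sorted(positions))


union = extract(a | b)           # plays the role of `+`
difference = extract(a - b)      # plays the role of `-`
intersection = extract(a & b)    # plays the role of `*`
symmetric_diff = extract(a ^ b)  # plays the role of `/`

print(chargrams)
print(union, difference, intersection, symmetric_diff)
```

Working on sets of positions rather than on the strings themselves keeps the operations well defined even when the same characters occur at several places in the parent string.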

## Installation

 - The documentation is available on [https://nlp.frama.io/iamtokenizing/](https://nlp.frama.io/iamtokenizing/)
 - The PyPI package is available on [https://pypi.org/project/iamtokenizing/](https://pypi.org/project/iamtokenizing/)
 - The official repository is on [https://framagit.org/nlp/iamtokenizing](https://framagit.org/nlp/iamtokenizing)

### From the Python Package Index (PyPI)

To install the package, simply run

```bash
pip install iamtokenizing
```

### From the repository

The official repository is on https://framagit.org/nlp/iamtokenizing

Clone (or download) the repository, then install the package using `pip`:

```bash
git clone https://framagit.org/nlp/iamtokenizing.git
cd iamtokenizing/
pip install .
```

Once the package is installed, the tests can be run using

```bash
cd tests/
python3 -m unittest -v
```

(the `-v` flag is optional; it enables verbose output).

## Basic examples

Basic examples can be found in the [documentation](https://nlp.frama.io/iamtokenizing/).

## Versions

 - Versions before 0.4 only provide the `Token` and `Tokens` classes. These were later split into three classes, named `Span`, `Token` and `Tokens`. Importantly, the methods `Token.append` and `Token.remove` no longer exist in later versions. They have been replaced by `Token.append_range`, `Token.append_ranges`, `Token.remove_range` and `Token.remove_ranges`.
 - Version 0.4 adds the class `Span` alongside `Token` and `Tokens`. `Span` handles the splitting of a given string into sub-parts, whereas `Token` and `Tokens` now consume `Span` objects and handle the attributes of the `Token`.
 - From version 0.5, the basic tools `Span`, `Token` and `Tokens` have been split out of the `iamtokenizing` package (see https://pypi.org/project/iamtokenizing/). Only the advanced tokenizers are now present in the package `iamtokenizing`, which depends on the package `tokenspan`. The objects `Span`, `Token` and `Tokens` can be used as before from the newly deployed package `tokenspan`, available on https://pypi.org/project/tokenspan/.

## About us

Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.

You are kindly encouraged to report any problem, and to propose improvements and/or suggestions to the authors, via issues or merge requests.

Last version: June 03, 2021

