Metadata-Version: 2.1
Name: tokenspan
Version: 0.5.0
Summary: Basic tools to tokenize a string (i.e. to construct atomic entities/sub-strings of it) for Natural Language Processing (NLP). Also useful for annotation, tree parsing, entity linking, ... (in fact, anything that links a string or its sub-parts to another object). Key concepts are versatility with respect to other libraries, and freedom to define many concepts on top of a string.
Home-page: https://framagit.org/nlp/tokenspan
Author: François Konschelle - IAM CHU Bordeaux France
Author-email: via.issue@only.please
License: GNU GENERAL PUBLIC LICENSE v.3
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Tokenization for language processing

This package contains some generic, configurable tools to cut a string into sub-parts (cf. [Wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)), called `Token`, and to group them into sequences called `Tokens`. A `Token` is a sub-string of a parent string (say, the initial complete text), together with its associated ranges of non-overlapping characters; the number of associated ranges is arbitrary. A `Tokens` is a collection of `Token`. These two classes make it possible to attach a collection of attributes to any `Token` in a versatile way, and to pass these attributes from one object to the next while cutting a `Token` into sub-parts (collected as a `Tokens`) and eventually re-merging them into larger `Token`.

The `Token` and `Tokens` classes allow basic tokenization of text, such as word splitting, n-gram splitting, and char-gram splitting of arbitrary size. In addition, they allow associating several non-overlapping sub-strings with a given `Token`, and attaching arbitrary attributes to these parts. One can compare two `Token` objects in terms of their attributes and/or ranges. One can also apply basic mathematical and logical operations to them (+, -, *, /), corresponding to the union, difference, intersection and symmetric difference implemented by Python sets; here the sets are the ranges of positions in the parent string.
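To see why keeping the positions matters, here is a plain-Python contrast (not using the library): a bare sub-string loses all memory of where it came from, which is precisely what a `Token` preserves.

```python
s = 'Simple string for demonstration and for illustration.'

word = s[14:17]    # 'for' -- a plain str; the position information is gone
word == s[36:39]   # True: the two occurrences are indistinguishable as str

# a Token keeps both the parent string and the ranges of positions, so the
# two occurrences of 'for' compare as different (see the basic example below)
```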

## Installation

### From Python Package Index (PIP)

Simply run

```bash
pip install tokenspan
```

### From the repository

The official repository is at https://framagit.org/nlp/tokenspan. To install the package from the repository, run the following commands

```bash
git clone https://framagit.org/nlp/tokenspan.git
cd tokenspan/
pip install .
```

Once installed, one can run the tests using

```bash
cd tests/
python3 -m unittest -v
```

(the verbosity flag `-v` is optional).

## Philosophy of this library

In `tokenspan`, one thinks of a string as a collection of integers: the positions of its characters. For instance:

```python
'Simple string for demonstration and for illustration.' # the parent string
'01234567891123456789212345678931234567894123456789512' # the positions

'       string                       for illustration ' # the Span span1
'       789112                       678 412345678951 ' # the ranges

'Simple                                               ' # the Span span2
'012345                                               ' # the ranges
```

Defining the `Span` `'string for illustration'` consists in selecting the positions `[range(7, 13), range(36, 39), range(40, 52)]` from the parent string, and the `Span` `'Simple'` is defined by the positions `[range(0, 6)]`. Underneath, each of these `Span` keeps the principal string methods, e.g. `lower()`, `upper()`, `islower()`, ... So a `Span` is primarily a list of ranges on top of a string, which still behaves like a string.
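The correspondence between ranges and text can be checked with plain Python (this is just the definition of position selection, not the library's API):

```python
string = 'Simple string for demonstration and for illustration.'
ranges = [range(7, 13), range(36, 39), range(40, 52)]

# pick out each selected sub-string and join the non-contiguous parts
# with single spaces, as str() does on a Span
' '.join(string[r.start:r.stop] for r in ranges)
# 'string for illustration'
```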

In addition, one can view the above ranges as sets of positions. It then becomes quite easy to perform basic operations on `Span` objects; for instance, the addition of two `Span`

```python
str(span1 + span2)
# returns 'Simple string for illustration'
```

is interpreted as the union of their respective sets of positions.
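This interpretation can be reproduced with plain Python sets, as a conceptual sketch of what the library computes (this is not its actual implementation):

```python
from itertools import groupby

string = 'Simple string for demonstration and for illustration.'
span1_positions = set(range(7, 13)) | set(range(36, 39)) | set(range(40, 52))
span2_positions = set(range(0, 6))

# union of the two sets of positions, as for span1 + span2
union = sorted(span1_positions | span2_positions)

# group consecutive positions back into contiguous runs of characters
runs = [list(g) for _, g in groupby(enumerate(union), lambda t: t[1] - t[0])]
' '.join(''.join(string[i] for _, i in run) for run in runs)
# 'Simple string for illustration'
```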

In addition to these logical operations, there are a few utilities, such as the possibility to split or slice a `Span` into several `Span` objects, as long as they are all related to the same parent string.

## Basic example

Below, we give a simple example of how to use the `Token` and `Tokens` classes.

```python
import re
from tokenspan import Token

string = 'Simple string for demonstration and for illustration.'
initial_token = Token(string)

# char-gram generation
chargrams = initial_token.slice(0, len(initial_token), 3)
str(chargrams[2])
# returns 'mpl'

# each char-gram keeps a memory of the initial string
chargrams[2].string
# returns 'Simple string for demonstration and for illustration.'

cuts = [range(r.start(), r.end()) for r in re.finditer(r'\w+', string)]
tokens = initial_token.split(cuts)
# --> this is a Tokens instance, not a Token one! (see documentation for explanation)

# tokens keeps the cut-out parts (separators included) but behaves like a list,
# so one takes only the odd elements to recover the words
interesting_tokens = tokens[1::2]

# n-gram construction
ngram = interesting_tokens.slice(0, len(interesting_tokens), 2)
ngram[2]
# returns Token('for demonstration', 2 ranges)
str(ngram[2])
# returns 'for demonstration'
ngram[2].ranges
# returns [range(14, 17), range(18, 31)]
ngram[2].subTokens
# returns the Tokens instance composed of the Token 'for' and the Token 'demonstration'

# add attributes to a Token
tok0 = interesting_tokens[0]
tok0.setattr('name_of_attribute', {'some_key': 'some_value'})
# and read the attribute back
tok0.name_of_attribute
# returns {'some_key': 'some_value'}

# are the two 'for' Token objects the same?
interesting_tokens[2] == interesting_tokens[-2]
# returns False, because they are not at the same position

# basic operations among Token objects
for_for = interesting_tokens[2] + interesting_tokens[-2]
str(for_for)
# returns 'for for'
for_for.ranges
# returns [range(14, 17), range(36, 39)]
for_for.string
# returns 'Simple string for demonstration and for illustration.'
# to check the positions of the two 'for' Token:
#        '01234567890...456...01234567890.....678.0123456789012'

# also available (a sketch of these operators follows this example):
# tok1 + tok2 : union of the sets tok1.ranges and tok2.ranges
# tok1 - tok2 : difference of tok1.ranges and tok2.ranges
# tok1 * tok2 : intersection of tok1.ranges and tok2.ranges
# tok1 / tok2 : symmetric difference of tok1.ranges and tok2.ranges

# reconstruction of a Token
simple_demonstration = interesting_tokens[0:5:3].join()
# one could have done interesting_tokens.join(0,5,3) as well

# it contains two non-overlapping sub-parts
str(simple_demonstration)
# returns 'Simple demonstration'

# basic string methods from Python are still available
simple_demonstration.lower()
# returns 'simple demonstration'
```
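The remaining operators from the list above follow the same set semantics. Continuing the session, the outputs below are what that semantics implies (a sketch rather than verified output):

```python
# difference: keep only the positions of for_for absent from the second 'for'
str(for_for - interesting_tokens[-2])
# expected: 'for' (only range(14, 17) remains)

# intersection: common positions of for_for and the first 'for'
(for_for * interesting_tokens[2]).ranges
# expected: [range(14, 17)]

# symmetric difference: positions belonging to exactly one of the two Token
(for_for / interesting_tokens[2]).ranges
# expected: [range(36, 39)]
```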

Other examples can be found in the [documentation](https://nlp.frama.io/tokenspan/).

## Comparison with other Python libraries

A comparison with some other NLP libraries (nltk, gensim, spaCy, gateNLP, ...) can be found in the [documentation](https://nlp.frama.io/tokenspan/comparison_other_libraries.html).

## Versions

 - Versions before 0.4 only provide the `Token` and `Tokens` classes. These have since been split into three classes, named `Span`, `Token` and `Tokens`. Importantly, the methods `Token.append` and `Token.remove` no longer exist in later versions; they have been replaced by `Token.append_range`, `Token.append_ranges`, `Token.remove_range` and `Token.remove_ranges`.
 - Version 0.4 adds the class `Span` alongside `Token` and `Tokens`. `Span` handles the splitting of a given string into sub-parts, whereas `Token` and `Tokens` now consume `Span` objects and handle the attributes of the `Token`.
 - As of version 0.5, the basic tools `Span`, `Token` and `Tokens` have been split out of the `iamtokenizing` package (see https://pypi.org/project/iamtokenizing/). Only the advanced tokenizers now remain in `iamtokenizing`, which depends on the package `tokenspan`. The objects `Span`, `Token` and `Tokens` can be imported as before from the newly deployed package `tokenspan`, available at https://pypi.org/project/tokenspan/.

## About us

Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.

You are kindly encouraged to contact the authors by opening an issue on the [official repository](https://framagit.org/nlp/tokenspan), and to propose improvements and/or suggestions via issues or merge requests.

Last version: August 5, 2021

