Metadata-Version: 2.1
Name: nlpiper
Version: 0.3.0
Summary: NLPiper, a lightweight package integrated with a universe of frameworks to pre-process documents.
Home-page: https://github.com/dlite-tools/NLPiper
License: MIT
Keywords: NLP,CL,natural language processing,computational linguistics,parsing,tokenizing,linguistics,language,natural language,text analytics,deep-learning 
Author: Tomás Osório
Maintainer: Carlos Alves, Daniel Ferrari, Tomás Osório
Requires-Python: >=3.8,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Freely Distributable
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Provides-Extra: all
Provides-Extra: bs4
Provides-Extra: gensim
Provides-Extra: hunspell
Provides-Extra: nltk
Provides-Extra: numpy
Provides-Extra: sacremoses
Provides-Extra: spacy
Provides-Extra: stanza
Provides-Extra: torchtext
Requires-Dist: bs4 (>=0.0.1,<0.0.2); extra == "bs4" or extra == "all"
Requires-Dist: cyhunspell (>=2.0.2,<3.0.0); extra == "hunspell" or extra == "all"
Requires-Dist: gensim (>=4.1.2,<5.0.0); extra == "gensim" or extra == "all"
Requires-Dist: nltk (>=3.5,<4.0); extra == "nltk" or extra == "all"
Requires-Dist: numpy (>=1.22.2,<2.0.0); extra == "numpy" or extra == "all"
Requires-Dist: pydantic (>=1.8.0,<2.0.0)
Requires-Dist: sacremoses (>=0.0.49,<0.0.50); extra == "sacremoses" or extra == "all"
Requires-Dist: spacy (>=3.2.4,<4.0.0); extra == "spacy" or extra == "all"
Requires-Dist: stanza (>=1.3.0,<2.0.0); extra == "stanza" or extra == "all"
Requires-Dist: torchtext (>=0.12.0,<0.13.0); extra == "torchtext" or extra == "all"
Project-URL: Documentation, https://github.com/dlite-tools/NLPiper/blob/main/README.md
Project-URL: Repository, https://github.com/dlite-tools/NLPiper
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://raw.githubusercontent.com/dlite-tools/NLPiper/main/docs/imgs/nlpiper.png" />
</p>

[![Test](https://github.com/dlite-tools/NLPiper/actions/workflows/test.yml/badge.svg)](https://github.com/dlite-tools/NLPiper/actions/workflows/test.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![codecov](https://codecov.io/gh/dlite-tools/NLPiper/branch/main/graph/badge.svg?token=PK513BHBVC)](https://codecov.io/gh/dlite-tools/NLPiper)
![Package Version](https://img.shields.io/pypi/v/NLPiper)
![Python Version](https://img.shields.io/pypi/pyversions/NLPiper)

NLPiper is a package that brings together different NLP tools and applies their transformations to a target document.

## Goal
Lightweight package integrated with a universe of frameworks to pre-process documents.

---
## Installation

You can install NLPiper from PyPI with `pip` or your favorite package manager:

    pip install nlpiper

---

## Optional Dependencies

Some **transformations** require the installation of additional packages.
The following table describes the optional dependencies that can be installed:

| Package                                                                                                   | Description
|---                                                                                                        |---
| <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank"><code>bs4</code></a>     | Used in **CleanMarkup** to remove HTML and XML from the document.
| <a href="https://www.nltk.org/install.html" target="_blank"><code>nltk</code></a>                         | Used in **RemoveStopWords** to remove stop words from the document.
| <a href="https://github.com/alvations/sacremoses" target="_blank"><code>sacremoses</code></a>             | Used in **MosesTokenizer** to tokenize the document using Sacremoses.

To install the optional dependency needed for your use case, run:

    pip install 'nlpiper[<package>]'

You can install all of the optional dependencies at once with:

    pip install 'nlpiper[all]'


## Usage

### Define a Pipeline:

```python
>>> from nlpiper.core import Compose
>>> from nlpiper.transformers import cleaners, normalizers, tokenizers
>>> pipeline = Compose([
...                    cleaners.CleanNumber(),
...                    tokenizers.BasicTokenizer(),
...                    normalizers.CaseTokens()
... ])
>>> pipeline
Compose([CleanNumber(), BasicTokenizer(), CaseTokens(mode='lower')])
```

### Create a Document and inspect its structure:
```python
>>> from nlpiper.core import Document
>>> doc = Document("The following character is a number: 1 and the next one is not a.")
>>> doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number: 1 and the next one is not a.',
    tokens=None,
    embedded=None,
    steps=[]
)
```

### Apply Pipeline to a Document:
```python
>>> doc = pipeline(doc)
>>> doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number:  and the next one is not a.',
    tokens=[
        Token(original='The', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='following', cleaned='following', lemma=None, stem=None, embedded=None),
        Token(original='character', cleaned='character', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='a', cleaned='a', lemma=None, stem=None, embedded=None),
        Token(original='number:', cleaned='number:', lemma=None, stem=None, embedded=None),
        Token(original='and', cleaned='and', lemma=None, stem=None, embedded=None),
        Token(original='the', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='next', cleaned='next', lemma=None, stem=None, embedded=None),
        Token(original='one', cleaned='one', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='not', cleaned='not', lemma=None, stem=None, embedded=None),
        Token(original='a.', cleaned='a.', lemma=None, stem=None, embedded=None)
    ],
    embedded=None,
    steps=['CleanNumber()', 'BasicTokenizer()', "CaseTokens(mode='lower')"]
)
```

### Available Transformers
#### Cleaners
Clean the document as a whole, e.g. remove HTML markup, accents, or emails.

- `CleanURL`: remove URL from the text.
- `CleanEmail`: remove email from the text.
- `CleanNumber`: remove numbers from text.
- `CleanPunctuation`: remove punctuation from text.
- `CleanEOF`: remove end of file from text.
- `CleanMarkup`: remove HTML or XML from text.
- `CleanAccents`: remove accents from the text.
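As a simplified illustration of what a cleaner does (hypothetical code, not the package's actual implementation), the `CleanNumber` behaviour shown in the Usage section above can be sketched with a regular expression:

```python
import re

def clean_number(text: str) -> str:
    # Strip every digit run; surrounding whitespace is left untouched,
    # which is why two spaces remain where "1" used to be.
    return re.sub(r"\d+", "", text)

print(clean_number("The following character is a number: 1 and the next one is not a."))
# 'The following character is a number:  and the next one is not a.'
```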

#### Tokenizers
Tokenize a document after cleaning is done (split the document into tokens).

- `BasicTokenizer`: split tokens by spaces in the text.
- `MosesTokenizer`: split tokens using the [Moses tokenizer](https://github.com/alvations/sacremoses).
- `StanzaTokenizer`: split tokens using the [Stanza tokenizer](https://github.com/stanfordnlp/stanza).
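A minimal sketch of what `BasicTokenizer` is described to do (a hypothetical illustration, not the package's implementation):

```python
def basic_tokenize(text: str) -> list:
    # Split the cleaned text on whitespace; each piece becomes a token.
    return text.split()

print(basic_tokenize("the next one is not a."))
# ['the', 'next', 'one', 'is', 'not', 'a.']
```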

#### Normalizers
Apply at the token level, e.g. remove stop-words, spell-check, etc.

- `CaseTokens`: lower- or upper-case all tokens.
- `RemovePunctuation`: remove punctuation from the resulting tokens.
- `RemoveStopWords`: remove stop-word tokens.
- `VocabularyFilter`: only allow tokens from a pre-defined vocabulary.
- `Stemmer`: get the stem of each token.
- `SpellCheck`: spell-check each token. If a maximum distance is given, the Levenshtein distance between the token and the suggested word is computed; the token is replaced by the suggestion when the distance is within the maximum, and kept otherwise. If no maximum distance is given, a misspelled token is replaced by an empty string.
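The `SpellCheck` replacement rule can be sketched in plain Python with a classic Levenshtein edit distance (a hypothetical illustration; `suggestion` and `max_distance` here stand in for whatever the spell-checker backend provides, and the token is assumed misspelled):

```python
from typing import Optional

def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def resolve(token: str, suggestion: str, max_distance: Optional[int]) -> str:
    # With a maximum distance: replace the token only when the suggestion
    # is close enough, otherwise keep it.  Without one, a misspelled token
    # becomes an empty string.
    if max_distance is None:
        return ""
    return suggestion if levenshtein(token, suggestion) <= max_distance else token

print(resolve("speling", "spelling", max_distance=2))  # 'spelling'
print(resolve("speling", "spelling", max_distance=0))  # 'speling'
```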

#### Embeddings
Apply at the token level, converting tokens into embeddings.

- `GensimEmbeddings`: use Gensim word embeddings.
- `TorchTextEmbeddings`: apply word embeddings using the torchtext models `GloVe`, `CharNGram`, and `FastText`.

#### Document
`Document` is a dataclass that contains all the information used during text preprocessing.

Document attributes:
- `original`: original text to be processed.
- `cleaned`: initialized with the original text; this is the attribute on which the `Cleaners` and `Tokenizers` operate.
- `tokens`: list of tokens obtained using a `Tokenizer`.
- `steps`: list of transforms applied to the document.
- `embedded`: document embedding.

`Token` attributes:
- `original`: original token.
- `cleaned`: initialized with the original token, then modified by the `Normalizers`.
- `lemma`: token lemma (need to use a normalizer or tokenizer to obtain).
- `stem`: token stem (need to use a normalizer to obtain).
- `ner`: token entity (need to use a normalizer or tokenizer to obtain).
- `embedded`: token embedding.
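Roughly, the two structures can be pictured as the following dataclasses (a simplified sketch of the fields listed above, not the package's actual definitions):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Token:
    original: str
    cleaned: str = ""
    lemma: Optional[str] = None
    stem: Optional[str] = None
    ner: Optional[str] = None
    embedded: Optional[Any] = None

    def __post_init__(self):
        # `cleaned` starts as a copy of the original token.
        if not self.cleaned:
            self.cleaned = self.original

@dataclass
class Document:
    original: str
    cleaned: str = ""
    tokens: Optional[List[Token]] = None
    embedded: Optional[Any] = None
    steps: List[str] = field(default_factory=list)

    def __post_init__(self):
        # `cleaned` starts as a copy of the original text.
        if not self.cleaned:
            self.cleaned = self.original

doc = Document("The following character is a number: 1")
print(doc.cleaned)   # same as doc.original until a Cleaner runs
```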

#### Compose
Compose applies the chosen transformers to a given document.
It enforces the order in which the transformers can be applied: first the Cleaners, then the Tokenizers, and finally the Normalizers and Embeddings.
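The ordering constraint can be sketched as a monotonicity check over stage ranks (hypothetical names, not the package's internals; treating Normalizers and Embeddings as sharing the final rank is an assumption):

```python
# Cleaners come first, then Tokenizers, then Normalizers/Embeddings.
STAGE_RANK = {"cleaner": 0, "tokenizer": 1, "normalizer": 2, "embedding": 2}

def valid_order(stages):
    # The pipeline is valid when stage ranks never decrease.
    ranks = [STAGE_RANK[s] for s in stages]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))

print(valid_order(["cleaner", "tokenizer", "normalizer"]))  # True
print(valid_order(["tokenizer", "cleaner"]))                # False
```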

It is possible to create a `Compose` using the steps from a processed document:
```python
>>> doc.steps
['CleanNumber()', 'BasicTokenizer()', "CaseTokens(mode='lower')"]
>>> new_pipeline = Compose.create_from_steps(doc.steps)
>>> new_pipeline
Compose([CleanNumber(), BasicTokenizer(), CaseTokens(mode='lower')])
```
It is also possible to roll back steps applied to a document:
```python
>>> new_doc = Compose.rollback_document(doc, 2)
>>> new_doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number:  and the next one is not a.',
    tokens=None,
    embedded=None,
    steps=['CleanNumber()']
)
>>> doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number:  and the next one is not a.',
    tokens=[
        Token(original='The', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='following', cleaned='following', lemma=None, stem=None, embedded=None),
        Token(original='character', cleaned='character', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='a', cleaned='a', lemma=None, stem=None, embedded=None),
        Token(original='number:', cleaned='number:', lemma=None, stem=None, embedded=None),
        Token(original='and', cleaned='and', lemma=None, stem=None, embedded=None),
        Token(original='the', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='next', cleaned='next', lemma=None, stem=None, embedded=None),
        Token(original='one', cleaned='one', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='not', cleaned='not', lemma=None, stem=None, embedded=None),
        Token(original='a.', cleaned='a.', lemma=None, stem=None, embedded=None)
    ],
    embedded=None,
    steps=['CleanNumber()', 'BasicTokenizer()', "CaseTokens(mode='lower')"]
)
```

---

## Development Installation

```
git clone https://github.com/dlite-tools/NLPiper.git
cd NLPiper
poetry install
```

To install an [optional dependency](#optional-dependencies) you can run:


    poetry install --extras <package>


To install all the optional dependencies run:


    poetry install --extras all


---

## Contributions

All contributions, bug reports, bug fixes, documentation improvements,
enhancements and ideas are welcome.

A detailed overview on how to contribute can be found in the
[contributing guide](CONTRIBUTING.md)
on GitHub.

---

## Issues

Go [here](https://github.com/dlite-tools/NLPiper/issues) to submit feature
requests or report bugs.

---

## License and Credits

`NLPiper` is licensed under the [MIT license](LICENSE) and is written and
maintained by Tomás Osório ([@tomassosorio](https://github.com/tomassosorio)), Daniel Ferrari ([@FerrariDG](https://github.com/FerrariDG)) and Carlos Alves ([@cmalves](https://github.com/cmalves)).

