Metadata-Version: 2.1
Name: take-text-preprocess
Version: 0.0.6b1
Summary: Text Preprocessor
Author: Data & Analytics Research
Author-email: analytics.dar@take.net
Maintainer: daresearch
Maintainer-email: analytics.dar@take.net
License: MIT License
Keywords: text,preprocessing
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown

# Take Text Pre-Process #

This package is a tool for pre-processing a sentence.

The basic functionality available in this package is:
* Convert to lower case
* Remove non-ASCII characters
* Add a space between punctuation and words

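As a rough illustration of what the three basic steps mean, the sketch below applies them with the standard library only. It is not the package's implementation: the function name `basic_pre_process_sketch` and the exact rules (for example, treating every character in `string.punctuation` as punctuation) are assumptions made for this example.

```python
import re
import string


def basic_pre_process_sketch(sentence: str) -> str:
    """Illustrative only: lower-case, drop non-ASCII, space out punctuation."""
    lowered = sentence.lower()
    # Drop every character outside the ASCII range.
    ascii_only = lowered.encode("ascii", errors="ignore").decode("ascii")
    # Surround each punctuation mark with spaces so it becomes its own token.
    punctuation_class = "([" + re.escape(string.punctuation) + "])"
    spaced = re.sub(punctuation_class, r" \1 ", ascii_only)
    # Collapse the extra whitespace introduced by the previous steps.
    return " ".join(spaced.split())


print(basic_pre_process_sketch("Bom dia, meu ẞ caro"))  # bom dia , meu caro
```

The actual output of `pre_process` may differ in details such as whitespace handling.
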
The customizable functionality available is:
* Replace URLs with a token
* Replace emails with a token
* Replace numbers with a token
* Replace codes (mixed numbers and letters) with a token
* Remove symbols
* Replace abbreviations
* Keep emojis

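To make the idea of "replacing by a token" concrete, here is a minimal sketch using regular expressions. The patterns, the token strings, and the helper `replace_by_tokens` are hypothetical and chosen only for illustration; the package may use different patterns and different token names.

```python
import re

# Hypothetical patterns and token names, for illustration only.
PATTERNS = {
    "URL": re.compile(r"https?://\S+|www\.\S+"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Word boundaries keep mixed codes such as "91234abc" from matching.
    "NUMBER": re.compile(r"\b\d+(?:[.,-]\d+)*\b"),
}


def replace_by_tokens(sentence: str, options: list) -> str:
    """Replace every match of each selected pattern with the pattern's name."""
    for name in options:
        sentence = PATTERNS[name].sub(name, sentence)
    return sentence


print(replace_by_tokens("Bom dia, meu https://www.take.net caro", ["URL"]))
# Bom dia, meu URL caro
```

Passing a list of option names mirrors how the optional steps are selected in the usage examples of this README.
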
## Installation
TakeTextPreProcess can be installed from PyPI:

```bash
pip install take-text-preprocess
```

## Usage

### Basic pre-process
To use the basic pre-process:
```python
from take_text_preprocess.presentation import pre_process
sentence = 'Bom dia, meu ẞ caro'
pre_process(sentence)
```

### Customize pre-process
To use the customized pre-processing, pass a list of all the optional pre-processing steps you want to apply.

The following examples show all the customized pre-processes available.
* URL
```python
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['URL']
sentence = 'Bom dia, meu https://www.take.net  caro'
pre_process(sentence, optional_tokenization)
```

* EMAIL
```python
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['EMAIL']
sentence = 'Bom dia, meu teste@gmail.com  caro'
pre_process(sentence, optional_tokenization)
```

* NUMBER
```python
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['NUMBER']
sentence = 'Este é um número 99999-9999'
pre_process(sentence, optional_tokenization)
```

* CODE
```python
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['CODE']
sentence = 'Este é um código 91234abc'
pre_process(sentence, optional_tokenization)
```

* SYMBOL
```python
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['SYMBOL']
sentence = 'Este é um símbolo %'
pre_process(sentence, optional_tokenization)
```

* ABBR
```python
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['ABBR']
sentence = 'Este é uma abreviação vc'
pre_process(sentence, optional_tokenization)
```

* EMOJI
```python
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['EMOJI']
sentence = 'Este é um emoji 😀'
pre_process(sentence, optional_tokenization)
```

## Contribute
If this is your first time contributing to this project, first create the virtual environment with the following command:
    
    conda env create -f env/environment.yml
   
Then activate the environment:

    conda activate taketextpreprocess
    
To test your modifications, build the package and install the generated wheel:

    pip install dist\take-text-preprocess-VERSION-py3-none-any.whl --force-reinstall
    
Then run the tests:

    pytest
