Metadata-Version: 2.1
Name: data-preprocessors
Version: 0.25.0
Summary: An easy to use tool for Data Preprocessing specially for Text Preprocessing
Home-page: https://github.com/MusfiqDehan/data-preprocessors
License: MIT
Keywords: nlp,data-preprocessors,data-preprocessing,text-preprocessing,data-science,textfile,musfiqdehan
Author: Md. Musfiqur Rahaman
Author-email: musfiqur.rahaman@northsouth.edu
Requires-Python: >=3.7.1,<4.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Communications
Classifier: Topic :: Education
Classifier: Topic :: Software Development
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: bnlp-toolkit (>=3.1.2,<4.0.0)
Requires-Dist: nltk (>=3.7,<4.0)
Requires-Dist: pandas (==1.3.0)
Project-URL: Repository, https://github.com/MusfiqDehan/data-preprocessors
Description-Content-Type: text/markdown

<div align="center">
    
<img src="https://github.com/MusfiqDehan/data-preprocessors/raw/master/branding/logo.png">

<p>Data Preprocessors</p>

<sub>An easy-to-use tool for data preprocessing, especially text preprocessing</sub>

<!-- Badges -->

<!-- [<img src="https://deepnote.com/buttons/launch-in-deepnote-small.svg">](PROJECT_URL) -->
[![](https://img.shields.io/pypi/v/data-preprocessors.svg)](https://pypi.org/project/data-preprocessors/)
[![Downloads](https://img.shields.io/pypi/dm/data-preprocessors)](https://pepy.tech/project/data-preprocessors)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mJuRfIz__uS3xoFaBsFn5mkLE418RU19?usp=sharing)
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/keras-team/keras-io/blob/master/examples/vision/ipynb/mnist_convnet.ipynb)

</div>

## **Table of Contents**

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Features](#features)
    - [Split Textfile](#split-textfile)
    - [Build Parallel Corpus](#build-parallel-corpus)
    - [Separate Parallel Corpus](#separate-parallel-corpus)
    - [Remove Punctuation](#remove-punctuation)
    - [Space Punctuation](#space-punctuation)
    - [Text File to List](#text-file-to-list)
    - [List to Text File](#list-to-text-file)
    - [Count Characters of a Sentence](#)
    - [Count Words of Sentence](#)
    - [Count No of Lines in a Text File](#)
    - **[Apply Any Function in a Full Text File](#apply-a-function-in-whole-text-file)**

    

## **Installation**
Install the latest stable release:<br>
**For Windows**<br>
```
pip install -U data-preprocessors
```

**For Linux/WSL2**<br>
```
pip3 install -U data-preprocessors
```

## **Quick Start**

```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla
```

## **Features**

### Split Textfile

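This function splits a text file into three separate text files: train, validation, and test. The `train_size`, `val_size`, and `test_size` arguments set the fraction of lines that goes into each file. Set `shuffle=True` to shuffle the lines randomly before splitting, and change `seed` to get a different shuffle.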

```python
from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)

# Total lines:  500
# Train set size:  300
# Validation set size:  100
# Test set size:  100
```
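
To make the split logic concrete, here is a minimal standard-library sketch of a shuffled 60/20/20 split. It is not the library's implementation: `split_lines` is a hypothetical helper, and assigning the rounding remainder to the test set is an assumption made for this illustration.

```python
import random


def split_lines(lines, train_size=0.6, val_size=0.2, shuffle=True, seed=42):
    """Hypothetical helper (not part of data-preprocessors).

    Returns (train, val, test) lists. Whatever is left after the train
    and validation slices goes to the test set (an assumption).
    """
    lines = list(lines)  # copy so the caller's list is not reordered
    if shuffle:
        # A dedicated Random instance keeps the shuffle reproducible per seed.
        random.Random(seed).shuffle(lines)
    n_train = round(len(lines) * train_size)
    n_val = round(len(lines) * val_size)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]
    return train, val, test


lines = [f"line {i}" for i in range(500)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))
# 300 100 100
```

Calling `split_lines` again with the same `seed` returns the same three lists, which is the usual reason for exposing a seed in a splitting function.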

### Separate Parallel Corpus

This function separates a combined `src_tgt_file` into a `src_file` and a `tgt_file`. The `separator` argument marks the boundary between the source and target text.

```python
from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
```
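
As an illustration of the idea, the sketch below splits individual lines on the separator. It assumes each line of the combined file looks like `source ||| target`; that layout, the whitespace stripping, and the helper name `split_parallel_line` are assumptions for this example, not the library's documented behavior.

```python
def split_parallel_line(line, separator="|||"):
    """Hypothetical helper: split one 'source ||| target' line into a pair."""
    src, found, tgt = line.rstrip("\n").partition(separator)
    if not found:
        raise ValueError(f"separator {separator!r} not found in line: {line!r}")
    # Stripping the surrounding spaces is a choice made for this sketch.
    return src.strip(), tgt.strip()


combined = ["good morning ||| bonjour\n", "thank you ||| merci\n"]
pairs = [split_parallel_line(line) for line in combined]
src_lines = [src for src, _ in pairs]
tgt_lines = [tgt for _, tgt in pairs]
print(src_lines)
print(tgt_lines)
# ['good morning', 'thank you']
# ['bonjour', 'merci']
```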

### Remove Punctuation

This function removes the punctuation from a single line of text.

```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla
```
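
For readers curious how punctuation removal can be done without the package, here is a rough standard-library equivalent. It only covers the ASCII characters in `string.punctuation`, so it is a sketch of the technique and may not match `tp.remove_punc` for other scripts; `strip_ascii_punctuation` is a hypothetical name.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_ascii_punctuation(text):
    """Hypothetical helper: drop ASCII punctuation, keep everything else."""
    return text.translate(_PUNCT_TABLE)


cleaned = strip_ascii_punctuation("bla! bla- ?bla ?bla.")
print(cleaned)
# bla bla bla bla
```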

### Space Punctuation

This function adds one space on both sides of each punctuation mark, which makes the sentence easier to tokenize. It works on a single line of text, but it can also be applied to a full text file.

```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)

# Prints the sentence with a space on both sides of each punctuation mark
```
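
The spacing step can be approximated with a regular expression, as in the sketch below. It treats only ASCII `string.punctuation` as punctuation and collapses repeated spaces afterwards; both choices, and the name `pad_ascii_punctuation`, are assumptions for illustration, so the output of `tp.space_punc` may differ.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
_PUNCT_RE = re.compile(f"([{re.escape(string.punctuation)}])")


def pad_ascii_punctuation(text):
    """Hypothetical helper: put one space on each side of punctuation."""
    padded = _PUNCT_RE.sub(r" \1 ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(padded.split())


spaced = pad_ascii_punctuation("bla! bla- ?bla ?bla.")
print(spaced)
# bla ! bla - ? bla ? bla .
```

After this step, a plain `spaced.split()` yields words and punctuation marks as separate tokens.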

### Text File to List

Convert any text file into a list.

```python
from data_preprocessors import text_preprocessor as tp
mylist = tp.text2list(myfile_path="myfile.txt")
```

### List to Text File

Convert any list into a text file (for example, `myfile.txt`).

```python
from data_preprocessors import text_preprocessor as tp
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
```
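
The two conversions above are inverses of each other. The round trip can be sketched with the standard library as follows; `read_lines` and `write_lines` are hypothetical stand-ins, and the one-item-per-line layout with stripped newlines is an assumption, not a statement about how `text2list` and `list2text` behave.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Hypothetical stand-in: one list item per line, newline removed."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(items, path):
    """Hypothetical stand-in: write each list item on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(f"{item}\n" for item in items)


# Write a list to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "myfile.txt"
    write_lines(["first line", "second line"], path)
    round_trip = read_lines(path)

print(round_trip)
# ['first line', 'second line']
```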

### Apply a function in whole text file

Replace `function_name` with any function, and it will be applied to the whole text file.

```python
from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name,  # placeholder: pass your own function here, e.g. tp.remove_punc
    myfile_path="myfile.txt",
    modified_file_path="modified_file.txt"
)
```
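
One plausible way to picture this is a loop that reads the input file line by line, passes each line through the function, and writes the result to the output file. The sketch below does exactly that with the standard library. Applying the function per line, stripping the newline first, and the name `apply_per_line` are all assumptions for illustration, not the library's confirmed behavior.

```python
import tempfile
from pathlib import Path


def apply_per_line(func, src_path, dst_path):
    """Hypothetical helper: write func(line) for every line of src_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # func sees the line without its trailing newline.
            dst.write(func(line.rstrip("\n")) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    src_file = Path(tmp) / "myfile.txt"
    dst_file = Path(tmp) / "modified_file.txt"
    src_file.write_text("Hello World\nGood Morning\n", encoding="utf-8")
    apply_per_line(str.lower, src_file, dst_file)
    result = dst_file.read_text(encoding="utf-8")

print(result)
# hello world
# good morning
```

Any function that takes a string and returns a string fits this pattern, which is why a single wrapper can reuse per-line helpers on whole files.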


