Metadata-Version: 2.1
Name: ternaus_cleantext
Version: 0.0.1
Summary: Clean text from extra spaces and special symbols as in the CLIP model.
Home-page: https://github.com/ternaus/ternaus-cleantext
Author: Vladimir Iglovikov
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Description-Content-Type: text/markdown
License-File: LICENSE

[![Sourcery](https://img.shields.io/badge/Sourcery-enabled-brightgreen)](https://sourcery.ai)

# Cleantextclip
A library for preparing text for machine learning and NLP tasks. It originated from the text preprocessing used for
the CLIP model, with a few extra rules added.

## Installation
```bash
pip install -U ternaus_cleantext
```


Cleans text in a similar way to the CLIP model, but more strictly:

1. Unescapes HTML entities
2. Removes HTML tags
3. Removes URLs
4. Removes extra whitespace
5. Converts text to lower case

```python
from ternaus_cleantext.ternaus_cleantext import clean_text
print(clean_text("This is a test https://ternaus.com <b>bold</b>"))
```
returns
`this is a test bold`
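
For intuition, the steps listed above can be approximated with the standard library alone. This is a rough sketch, not the library's actual implementation: the function name `clean_text_sketch` and the three regular expressions are hypothetical, and it assumes that the first step means decoding HTML entities with `html.unescape`, as CLIP's own preprocessing does.

```python
import html
import re

# Hypothetical patterns for illustration only; the library's real rules may differ.
TAG_RE = re.compile(r"<[^>]+>")           # anything that looks like an HTML tag
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # http(s) links and bare www. links
SPACE_RE = re.compile(r"\s+")             # any run of whitespace


def clean_text_sketch(text: str) -> str:
    """Approximate the listed cleaning steps using only the standard library."""
    text = html.unescape(text)              # 1. decode entities such as &amp;
    text = TAG_RE.sub(" ", text)            # 2. drop HTML tags
    text = URL_RE.sub(" ", text)            # 3. drop URLs
    text = SPACE_RE.sub(" ", text).strip()  # 4. collapse extra whitespace
    return text.lower()                     # 5. lower-case the result


print(clean_text_sketch("This is a test https://ternaus.com <b>bold</b>"))
```

Tags are removed before URLs in this sketch so that a tag written directly after a link, with no space in between, is not swallowed by the URL pattern.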
