Metadata-Version: 2.1
Name: toxine
Version: 1.0.13
Summary: Tiny preprocessor for Russian text
Home-page: https://github.com/fostroll/toxine
Author: Sergei Ternovykh
Author-email: fostroll@gmail.com
License: BSD
Description: <div align="right"><strong>RuMor: Russian Morphology project</strong></div>
        <h2 align="center">Toxine: a tiny Python NLP library for Russian text preprocessing</h2>
        
        [![PyPI Version](https://img.shields.io/pypi/v/toxine?color=blue)](https://pypi.org/project/toxine/)
        [![Python Version](https://img.shields.io/pypi/pyversions/toxine?color=blue)](https://www.python.org/)
        [![License: BSD-3](https://img.shields.io/badge/License-BSD-brightgreen.svg)](https://opensource.org/licenses/BSD-3-Clause)
        
        Part of the ***RuMor*** project. ***Toxine*** provides a pipeline for
        preprocessing and tokenizing texts in *Russian*. It also performs preliminary
        entity tagging. Highlights:
        
        * Extracting emojis, emails, dates, phones, URLs, HTML/XML fragments etc.
        * Tagging/removing tokens with disallowed symbols
        * Normalizing punctuation
        * Tokenization (via *NLTK*)
        * Russian *Wikipedia* tokenizer
        * [*brat*](https://brat.nlplab.org/) annotations support
        
        ## Installation
        
        ### pip
        
        ***Toxine*** supports *Python 3.5* or later. To install it via *pip*, run:
        ```sh
        $ pip install toxine
        ```
        
        If you already have an earlier version of ***Toxine*** installed, upgrade it with:
        ```sh
        $ pip install toxine -U
        ```
        
        ### From Source
        
        Alternatively, you can install ***Toxine*** from the source in this *git
        repository*:
        ```sh
        $ git clone https://github.com/fostroll/toxine.git
        $ cd toxine
        $ pip install -e .
        ```
        This gives you access to examples that are not included in the *PyPI* package.
        
        ## Setup
        
        ***Toxine*** relies on *NLTK* and requires its *punkt* data to be downloaded.
        If you haven't downloaded it yet, start the *Python* interpreter and run:
        ```python
        >>> import nltk
        >>> nltk.download('punkt')
        ```
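        
        If you prefer to script this step, the check can be sketched as below. This is
        a minimal illustration, not part of ***Toxine***: the helper name
        `ensure_punkt` and its parameters are hypothetical, and it assumes the data
        lives under the *NLTK* resource path `tokenizers/punkt`. It only calls
        `nltk.download('punkt')` when `nltk.data.find` reports that the data is
        missing.
        ```python
        def ensure_punkt(find, download, resource="tokenizers/punkt", package="punkt"):
            """Download `package` only when `find` cannot locate `resource`.
        
            Hypothetical helper: returns True if a download was triggered,
            False if the data was already available.
            """
            try:
                # nltk.data.find raises LookupError when the resource is absent.
                find(resource)
            except LookupError:
                download(package)
                return True
            return False
        
        
        def main():
            # Imported lazily so the helper itself does not require NLTK.
            import nltk
        
            ensure_punkt(nltk.data.find, nltk.download)
        ```
        
        Calling `main()` once, for instance from an installation script, has the same
        effect as the interactive commands above, but skips the download when the
        data is already present.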
        
        ## Usage
        
        [Text Preprocessor](https://github.com/fostroll/toxine/blob/master/doc/README_TEXT_PREPROCESSOR.md)
        
        [Wrapper for tokenized *Wikipedia*](https://github.com/fostroll/toxine/blob/master/doc/README_WIKIPEDIA.md)
        
        [*brat* annotations support](https://github.com/fostroll/toxine/blob/master/doc/README_BRAT.md)
        
        ## Examples
        
        You can find them in the `examples` directory of the ***Toxine*** *GitHub*
        repository.
        
        ## License
        
        ***Toxine*** is released under the BSD License. See the
        [LICENSE](https://github.com/fostroll/toxine/blob/master/LICENSE) file for
        more details.
        
Keywords: natural-language-processing nlp preprocessing
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.5
Description-Content-Type: text/markdown
