Metadata-Version: 2.1
Name: texta-parsers
Version: 2.7.19
Summary: texta-parsers
Home-page: UNKNOWN
Author: TEXTA
Author-email: info@texta.ee
License: GPLv3
Description: # texta-parsers
        
        A Python package for file parsing.
        
        The main class in the package is **DocParser**. The package also supports sophisticated parsing of emails which is implemented in class **EmailParser**. If you only need to parse emails then you can specify it with parameter `parse_only_extensions`. It is possible to use **EmailParser** independently as well but then attachments will not be parsed. 
        
        
        ## Requirements
        
        Most of the file types are parsed with **[tika](http://tika.apache.org/)**. Other tools that are required:
        
        | Tool | File Type |
        |---|---|
        | pst-utils | .pst  |
        | digidoc-tool | .ddoc .bdoc .asics .asice |
        | rar-nonfree  | .rar |
        | lxml | XML HTML |
        
        Installation of required packages on Ubuntu/Debian:
        
        `sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y`
        
        `sudo sh install-digidoc.sh`
        
        Requires our custom version of Apache TIKA with relevant Tesseract language packs installed:
        
        `sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest`
        
        ## Installation
        
        Base install (without MLP & Face Analyzer):
        
        `pip install texta-parsers`
        
        Install with MLP:
        
        `pip install texta-parsers[mlp]`
        
        
        Install with whole bundle:
        
        `pip install texta-parsers[mlp]`
        
        
        ## Testing
        
        `python -m  pytest -rx -v tests`
        
        
        ## Description
        
        #### DocParser
        
        A file parser. Input can either be in bytes or a path to the file as a string. See [user guide](https://git.texta.ee/texta/email-parser/-/wikis/DocParser/User-Guide/Getting-started) more information. DocParser also includes EmailParser.
        
        #### EmailParser
        
        For parsing email messages and mailboxes. Supported file formats are Outlook Data File (**.pst**), mbox (**.mbox**) and EML (**.eml**). Can be used separately from DocParser but then attachments are not parsed.
        User guide can be found [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/User-Guide/Getting-started) and documentation [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/Documentation/1.2.1).
        
Platform: UNKNOWN
Description-Content-Type: text/markdown
Provides-Extra: mlp
