Metadata-Version: 1.1
Name: docSilhouette
Version: 0.1.0
Summary: Document aesthetics and text extractor
Home-page: https://github.com/fabraz/docSilhouette
Author: Fabricio Braz
Author-email: fabricio.braz@gmail.com
License: MIT License
Description: # docSilhouette
        :tada: docSilhouette
        
        ## What is it? 
        
        This library wraps pytesseract and adds features for text processing. Specifically, it takes the bounding boxes produced by Tesseract and extracts layout information from them, such as the position of each text block within its page and within the document.
        
        It also applies a greedy algorithm to organize words into blocks: it first processes lines, then the groups of words exposed by the Tesseract dataframe.
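        
        The two-pass idea described above can be sketched as follows. This is an illustrative sketch only, not docSilhouette's actual implementation: the helper names `group_words_into_lines` and `merge_lines_into_blocks`, the `max_gap` threshold, and the single-column assumption are all hypothetical. The word records mimic a subset of the columns returned by `pytesseract.image_to_data` (`block_num`, `par_num`, `line_num`, `left`, `top`, `height`, `text`).
        
        ```python
        from itertools import groupby
        
        
        def group_words_into_lines(words):
            """First pass: group word records by Tesseract's line keys."""
            key = lambda w: (w["block_num"], w["par_num"], w["line_num"])
            lines = []
            for _, group in groupby(sorted(words, key=key), key=key):
                group = sorted(group, key=lambda w: w["left"])  # left-to-right order
                lines.append({
                    "text": " ".join(w["text"] for w in group),
                    "top": min(w["top"] for w in group),
                    "bottom": max(w["top"] + w["height"] for w in group),
                })
            # Assumes a single-column page, so top-to-bottom order is enough.
            return sorted(lines, key=lambda line: line["top"])
        
        
        def merge_lines_into_blocks(lines, max_gap=10):
            """Second pass: greedily extend the current block while the vertical
            gap to the next line is at most max_gap pixels (hypothetical threshold)."""
            blocks = []
            for line in lines:
                if blocks and line["top"] - blocks[-1]["bottom"] <= max_gap:
                    blocks[-1]["text"] += " " + line["text"]
                    blocks[-1]["bottom"] = max(blocks[-1]["bottom"], line["bottom"])
                else:
                    blocks.append(dict(line))
            return blocks
        
        
        # Hand-written sample records; real ones would come from Tesseract.
        words = [
            {"block_num": 1, "par_num": 1, "line_num": 1, "left": 10, "top": 10, "height": 12, "text": "Universal"},
            {"block_num": 1, "par_num": 1, "line_num": 1, "left": 90, "top": 10, "height": 12, "text": "Language"},
            {"block_num": 1, "par_num": 1, "line_num": 2, "left": 10, "top": 26, "height": 12, "text": "Model"},
            {"block_num": 2, "par_num": 1, "line_num": 1, "left": 10, "top": 80, "height": 12, "text": "Abstract"},
        ]
        blocks = merge_lines_into_blocks(group_words_into_lines(words))
        print([b["text"] for b in blocks])
        ```
        
        The greedy step never revisits an earlier decision: each line either joins the most recent block or starts a new one, which keeps the pass linear in the number of lines.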
        
        ## How to use
        
        Install the library with pip:
        
        ```shell
        pip install docSilhouette
        ```
        
        Then you can use it:
        
        ```python
        from docSilhouette.docSilhouette import docSilhouette
        doc = docSilhouette('./tests/assets/single_page.pdf')
        doc.setup()
        print(doc.get_text(True))
        ```
        
        You might see output like the following:
        
        ```shell
        xxP001
        xxQ00_00 xxbob Universal Language Model Fine-tuning for Text Classification
        xxeob xxQ00_03
        ```
        
        ## Special Tokens
        
        * `xxP001`: Page number
        * `xxbob`: Begin of block
        * `xxeob`: End of block
        * `xxQ01_00`: Quadrant of a block, where the first pair of digits (`01`) is the line (row) of the page matrix and the second pair (`00`) is the column. Check out the image below, which shows a page with the matrix plotted on it. When quadrants are enabled, every block has one `xxQ` token for the quadrant where the block begins and another for the quadrant where it ends. The following example highlights the quadrants of the block ``1 Introduction``, which starts at line 3, column 0 and ends at line 3, column 1. Refer to the image below for a more detailed example.
        
        ```shell
        xxQ03_00 xxbob 1 Introduction
        xxeob xxQ03_01
        ```
        
        * `xxbcet`: Begin of centered text line
        * `xxecet`: End of centered text line
        
        ![](imgs/2022-04-23-15-08-27.png)
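        
        To make the token layout concrete, here is a minimal sketch that reads the marked-up text back into structured records. It assumes the output has exactly the shape shown in the examples above (a quadrant token, `xxbob`, the block text, `xxeob`, a closing quadrant token). The function `parse_blocks` and both regular expressions are hypothetical helpers and not part of docSilhouette.
        
        ```python
        import re
        
        # One quadrant token, e.g. xxQ03_00 -> line 3, column 0.
        QUADRANT = re.compile(r"xxQ(\d+)_(\d+)")
        # A whole block: start quadrant, xxbob, text, xxeob, end quadrant.
        BLOCK = re.compile(
            r"(xxQ\d+_\d+)\s+xxbob\s+(.*?)\s*xxeob\s+(xxQ\d+_\d+)", re.DOTALL
        )
        
        
        def parse_blocks(marked_text):
            """Return one dict per block with its text and (line, column) quadrants."""
            blocks = []
            for start, text, end in BLOCK.findall(marked_text):
                blocks.append({
                    "text": text,
                    "start": tuple(int(n) for n in QUADRANT.fullmatch(start).groups()),
                    "end": tuple(int(n) for n in QUADRANT.fullmatch(end).groups()),
                })
            return blocks
        
        
        sample = "xxP001\nxxQ03_00 xxbob 1 Introduction\nxxeob xxQ03_01\n"
        print(parse_blocks(sample))
        ```
        
        For the sample string, the single parsed block carries the text `1 Introduction` with start quadrant `(3, 0)` and end quadrant `(3, 1)`.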
        
        
        ## License
        MIT
Keywords: document OCR visual features
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing
Classifier: License :: OSI Approved :: MIT License
