Metadata-Version: 2.1
Name: pysin
Version: 1.0.6
Summary: PySin is a toolbox for text retrieval in unstructured document datasets. It contains both a multi-type text extractor and a search engine. To test them, you can use the medical prescription generator that is also provided.
Home-page: https://github.com/arkhn/PySin
Author: Jean-Baptiste Laval
Author-email: contact@arkhn.com
License: Apache License 2.0
Download-URL: https://github.com/arkhn/PySin/archive/1.0.6.tar.gz
Description: # PySin
        
        PySin is a toolbox for text retrieval in unstructured document datasets. It contains both a multi-type text extractor and a search engine. To test them, you can use the medical prescription generator that is also provided.
        
        
        ## OS Dependencies
        
        ### Debian, Ubuntu, and friends
        
        ```sh
        sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
        ```
        
        ### Fedora, Red Hat, and friends
        
        ```sh
        sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
        ```
        
        ### macOS
        
        ```sh
        brew install pkg-config poppler
        ```
        
        Conda users may also need libgcc:
        
        ```sh
        conda install -c anaconda libgcc
        ```
        
        ### Windows
        
        Currently tested only when using conda:
        
        - Install the Microsoft Visual C++ Build Tools
        - Install poppler through conda:
        
        ```sh
        conda install -c conda-forge poppler
        ```
        
        
        ## Install
        
        ```sh
        pip install pysin
        ```
        
        
        ## Search engine
        
        ### Arguments
        
        The `search` function takes five arguments.
        
        Positional arguments:
        - `query`: your query
        - `input_path`: the path of the directory to search in
        - `output_path`: the path of the directory to write the results to
        
        Keyword arguments:
        - `scale`: takes the value *row* or *doc*, depending on whether the query should be satisfied by a single row or by a whole document. The *row* scale is more precise whereas the *doc* scale is faster. Defaults to *row*.
        - `update_cache`: `True` to update the cached files (for example if files have been added to the folder since the last search), else `False`. Defaults to `True`. If you're working with a huge amount of data that doesn't change, set `update_cache` to `False`.
        
        To search for the word 'word' within the files of the folder 'path/to/data/', writing the results to the folder 'path/to/results/', just run:
        
        ```python
        from pysin import search
        search('word', 'path/to/data/', 'path/to/results/')
        ```
        
        
        ### Queries
        
        To search for any one of several words, just write them side by side in the query.
        
        ```python
        search('word1 word2 word3', 'path/to/data/', 'path/to/results/')
        ```
        
        To search for the files that contain 'mandatory' and also 'foo' or 'bar' (but not necessarily both at the same time), just type:
        
        ```python
        search('+mandatory foo bar', 'path/to/data/', 'path/to/results/', scale='doc')
        ```
        
        The same query syntax holds at the *row* scale, but the semantics differ: the previous command might return a document that contains 'mandatory' on the first row and 'foo' on the last one, whereas at the *row* scale only occurrences where 'mandatory' **AND** 'foo' (and/or 'bar') appear on the same row are returned.
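
        The difference between the two scales can be sketched with a toy stand-in for the matching logic (this is only an illustration of the semantics, not PySin's actual implementation; the `matches` helper below is made up):
        
        ```python
        # Toy illustration of the '+mandatory foo bar' semantics (not PySin's code).
        def matches(text, required, optional):
            words = text.split()
            # every required word must be present, and at least one optional word
            return all(w in words for w in required) and any(w in words for w in optional)
        
        doc_rows = ["mandatory appears here", "and foo appears there"]
        
        # doc scale: the query is evaluated against the document as a whole ...
        doc_hit = matches(" ".join(doc_rows), ["mandatory"], ["foo", "bar"])
        
        # ... whereas at the row scale each row must satisfy the query on its own.
        row_hits = [row for row in doc_rows if matches(row, ["mandatory"], ["foo", "bar"])]
        
        print(doc_hit, row_hits)  # the document matches, but no single row does
        ```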
        
        To search for the rows that contain 'mandatory' but not 'forbidden', type:
        
        ```python
        search('mandatory -forbidden', 'path/to/data/', 'path/to/results/')
        ```
        
        To search for an expression of several words, use quotes:
        
        ```python
        search('"complex expression"', 'path/to/data/', 'path/to/results/')
        ```
        
        
        You can combine everything into a single query:
        
        ```python
        search('+mandatory choice1 choice2 "choice3" -"not this one" +"another mandatory"', 'path/to/data/', 'path/to/results/')
        ```
        
        ### Results
        
        When a search is launched, a folder is created at `output_path` containing two files:
        - `results.csv`: at the *row* scale, each row corresponds to one occurrence and contains the path to the file, the row number of the occurrence, and its context. At the *doc* scale, it only contains the paths of the matching files.
        - `folders.json`: gives the number of occurrences in each folder, using a tree structure
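
        At the *row* scale, `results.csv` can then be post-processed like any CSV file. A minimal sketch, assuming the column order described above (file path, row number, context; the exact header names are not documented here, and the sample content is invented):
        
        ```python
        import csv
        import io
        
        # Sample content in the layout described above: path, row number, context.
        # In practice, open the real results.csv from your output folder instead.
        sample = io.StringIO(
            "path/to/data/report.txt,12,... the searched word appears here ...\n"
            "path/to/data/note.txt,3,... another occurrence ...\n"
        )
        
        occurrences = [(path, int(row), context) for path, row, context in csv.reader(sample)]
        for path, row, context in occurrences:
            print(f"{path}:{row}: {context}")
        ```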
        
        
        
        ## Extractor
        
        The extractor preprocesses all the files to enable searching, by converting the supported files into cached txt files. The supported types are csv, doc, docx, html, md, pdf, rtf, txt and xml.
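
        Conceptually, this kind of extractor dispatches on the file extension. A minimal stand-in (not PySin's code; the handler names are made up, and real handlers for .pdf, .doc, .rtf, etc. would plug into the table, e.g. via poppler, hence the OS dependencies above):
        
        ```python
        from pathlib import Path
        import tempfile
        
        def read_plain(path):
            """Stand-in handler for plain-text formats."""
            return Path(path).read_text(encoding="utf-8", errors="replace")
        
        # Toy dispatch table: extension -> text-extraction function.
        HANDLERS = {ext: read_plain for ext in (".csv", ".md", ".txt")}
        
        def extract_text(path):
            handler = HANDLERS.get(Path(path).suffix.lower())
            if handler is None:
                raise ValueError(f"unsupported file type: {path}")
            return handler(path)
        
        sample = Path(tempfile.mkdtemp()) / "sample.txt"
        sample.write_text("hello", encoding="utf-8")
        print(extract_text(sample))  # hello
        ```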
        
        To extract all the files within a folder, just run:
        
        ```python
        from pysin import extract
        extract('path/to/data')
        ```
        
        To erase all the cached files, just run:
        
        ```python
        from pysin import reset_cache
        reset_cache('path/to/data')
        ```
        
        
        
        ## Medical prescriptions generator
        
        The generator relies on data from the [faker](https://pypi.org/project/Faker/) module. To generate 19 fake medical prescriptions in the folder 'path/to/data', just run:
        
        ```python
        from pysin import generate
        generate(19, 'path/to/data')
        ```
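
        To show the shape of such a generator (N files written into a folder), here is a stdlib-only toy stand-in; it is not PySin's generator, and the field names below are invented:
        
        ```python
        import random
        import tempfile
        from pathlib import Path
        
        def generate_fake(n, folder, seed=0):
            """Toy stand-in for generate(): writes n fake prescription files."""
            rng = random.Random(seed)
            drugs = ["paracetamol", "ibuprofen", "amoxicillin"]
            out = Path(folder)
            out.mkdir(parents=True, exist_ok=True)
            for i in range(n):
                text = (f"Prescription #{i}\n"
                        f"Drug: {rng.choice(drugs)}\n"
                        f"Dose: {rng.randint(1, 3)} per day\n")
                (out / f"prescription_{i}.txt").write_text(text, encoding="utf-8")
        
        folder = tempfile.mkdtemp()
        generate_fake(19, folder)
        print(len(list(Path(folder).glob("*.txt"))))  # 19
        ```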
        
        
        
        ## Soft mode
        
        The search engine and the extractor can also be used as standalone command-line tools. For the search engine, just run:
        
        ```sh
        $ python src/search.py +mandatory choice1 choice2 "choice3" -"not this one" +"another mandatory" --input_path path/to/data/ --output_path path/to/results/
        ```
        
        To search at the *doc* scale, just add the argument `--d`.
        
        The extractor can be used like this:
        
        ```sh
        $ python src/extractor.py path/to/data/
        ```
        
        To clear the cached files, just add the `--reset` argument:
        
        ```sh
        $ python src/extractor.py --reset path/to/data/
        ```
        
        ### Trick
        
        If you run many searches in one folder, say `absolute/path/to/data/`, always writing the results to the same folder, say `absolute/path/to/results/`, and always at the same scale, say *row*, then you can create a shortcut by running the following commands:
        
        ```sh
        $ echo alias search=\'python /absolute/path/to/search.py --input_path /absolute/path/to/data/ --output_path /absolute/path/to/results\' >> ~/.bashrc
        $ source ~/.bashrc
        ```
        
        You can then run a search from any location by typing:
        
        ```sh
        $ search +mandatory choice1 choice2 "choice3" -"not this one" +"another mandatory"
        ```
        
        **WARNING:** before doing this, make sure that a `search` alias doesn't already exist, for example by running `search` and checking that the shell reports something like:
        
        ```sh
        search: command not found
        ```
        
        
        
        ## Example
        
        You can test this module using the `example.py` script.
        
        
        
        ## TODO
        - multithreaded search
        - improve medication notation
        - new document types
        - adapt .doc extraction to the Windows environment
Keywords: arkhn,text retrieval,search engine,text extraction,dataset generator,medical
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
