[![Build status](https://github.com/sanskrit-coders/doc_curation/workflows/Python%20package/badge.svg)](https://github.com/sanskrit-coders/doc_curation/actions)
[![Documentation Status](https://readthedocs.org/projects/doc_curation/badge/?version=latest)](http://doc_curation.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/doc_curation.svg)](https://badge.fury.io/py/doc_curation)

## doc curation

A package for curating doc file collections. Prominent features:

- Scrape texts from various sites, such as Wikisource. See the example [here](https://github.com/sanskrit-coders/doc_curation/blob/master/curation_projects/misc/wikisource.py). (PS: Consider contributing to the [raw_etexts repo](https://github.com/sanskrit/raw_etexts).)
- OCR PDFs with Google Drive. The PDF is automatically split into 25-page parts, and each part is OCRed individually. See the usage example [here](https://github.com/sanskrit-coders/doc_curation/blob/master/curation_projects/pdf_tasks.py) and the function [here](https://github.com/sanskrit-coders/doc_curation/blob/master/doc_curation/pdf.py#L13).
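
To illustrate the 25-page splitting mentioned above, here is a minimal sketch of how page ranges could be computed. The function name `page_chunks` is hypothetical and is not part of this package; the package's own splitting logic may differ.

```python
def page_chunks(num_pages, chunk_size=25):
    """Return (first_page, last_page) pairs, 1-indexed and inclusive.

    Hypothetical helper for illustration only; not part of doc_curation.
    """
    if num_pages < 0 or chunk_size < 1:
        raise ValueError("num_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, num_pages))
        for start in range(1, num_pages + 1, chunk_size)
    ]


# A 60-page PDF yields two full parts and one shorter final part.
print(page_chunks(60))  # [(1, 25), (26, 50), (51, 60)]
```

Each pair could then be passed to whatever tool extracts that page range before the part is sent for OCR.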

## For users
* [Autogenerated Docs on readthedocs (might be broken)](http://doc_curation.readthedocs.io/en/latest/).
* Manually and periodically generated docs are [here](https://sanskrit-coders.github.io/doc_curation/build/html/).
* For detailed examples and help, please see individual module files in this package.


## Installation or upgrade:
* For the stable version: `pip install doc_curation -U`
* For the latest code: `pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U`
* [PyPI page](https://pypi.python.org/pypi/doc_curation).

## Usage:
* Enable the Google Drive API and download a service account key file that has Google Drive API access.
```python
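from doc_curation.pdf import drive_ocr

# Paths to the PDF to OCR and to the service account key file.
pdf_file = '/home/file.pdf'
key_file = '/home/key.json'
# Split the PDF into parts and OCR each part via Google Drive.
drive_ocr.split_and_ocr_on_drive(pdf_file, key_file)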
```

### Using `google_vision_pdf.py` to OCR PDFs into text files
* Follow the instructions here: https://cloud.google.com/vision/docs/before-you-begin.
* Set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of the JSON file containing your service account key.
* Example:
```
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
```

* Invoke the script, passing in the input file. For example:

```
python3 google_vision_pdf.py --input-file <input.pdf>
```
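
If the script fails with an authentication error, it may help to check the credentials setup first. The following is a minimal sketch, not part of this package: the helper name `check_credentials` is hypothetical, and it only assumes that a service account key file is a JSON object whose `type` field is `service_account`.

```python
import json
import os


def check_credentials(env=os.environ):
    """Return a description of the problem, or None if the setup looks usable.

    Hypothetical helper for illustration only; not part of doc_curation.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"no file found at {path}"
    try:
        with open(path, encoding="utf-8") as key_file:
            key = json.load(key_file)
    except ValueError:
        # Covers malformed JSON as well as undecodable bytes.
        return f"{path} is not valid JSON"
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return f"{path} does not look like a service account key"
    return None


print(check_credentials() or "credentials look OK")
```

This only inspects the local file; it does not confirm that the Vision API is enabled for the project or that the key is still active.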

# For contributors

## Contact

Have a problem or question? Please head to [GitHub](https://github.com/sanskrit-coders/doc_curation).

## Packaging

* `~/.pypirc` should have your PyPI login credentials.
```
python setup.py bdist_wheel
twine upload dist/* --skip-existing
```

## Build documentation
- Sphinx HTML docs can be generated with `cd docs; make html`.

## Testing
Run `pytest` in the root directory.

## Auxiliary tools
- ![Build status](https://github.com/sanskrit-coders/doc_curation/workflows/Python%20package/badge.svg)
- [![Documentation Status](https://readthedocs.org/projects/doc_curation/badge/?version=latest)](http://doc_curation.readthedocs.io/en/latest/?badge=latest)
- [pyup](https://pyup.io/account/repos/github/sanskrit-coders/doc_curation/)
