Metadata-Version: 2.1
Name: pycurator
Version: 0.1b3
Summary: Collect data from popular data repositories with ease.
Home-page: https://github.com/michaelbaluja/PyCurator
Author: Michael Baluja
Author-email: michaelbaluja@protonmail.com
Maintainer-email: Michael Baluja <michaelbaluja@protonmail.com>
License: This software is Copyright © 2021-2022 The Regents of the University of California.
        All Rights Reserved.
        
        Permission to copy, modify, and distribute this software and its documentation
        for educational, research and non-profit purposes, without fee, and without a
        written agreement is hereby granted, provided that the above copyright notice,
        this paragraph and the following three paragraphs appear in all copies.
        
        Permission to make commercial use of this software may be obtained by contacting:
        
            Office of Innovation & Commercialization
            9500 Gilman Drive, Mail Code 0910
            University of California
            La Jolla, CA 92093-0910
            (858) 534-5815 | invent@ucsd.edu
        
        This software program and documentation are copyrighted by The Regents of the
        University of California. The software program and documentation are supplied
        “as is”, without any accompanying services from The Regents. The Regents does
        not warrant that the operation of the program will be uninterrupted or
        error-free. The end-user understands that the program was developed for
        research purposes and is advised not to rely exclusively on the program for
        any reason.
        
        IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
        DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
        LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION,
        EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
        SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY
        WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
        MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
        HEREUNDER IS ON AN “AS IS” BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO
        OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
        MODIFICATIONS.
        
Project-URL: repository, https://github.com/michaelbaluja/PyCurator
Keywords: curation,API,data collection
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: excel
Provides-Extra: repositories
Provides-Extra: format
Provides-Extra: test
License-File: LICENSE

# PyCurator
Making data extraction and curation as easy as py.

PyCurator allows users to easily query research repositories without the trouble of reading through API
documentation. Data curation is now as easy as ```$ pycurator```. Whether you want the ease of clicking
some buttons and getting the data or the flexibility of modifying query format, PyCurator provides a simple
UI for quickly retrieving data that is built on top of an extensible collection of Web and API scraper classes.

## Supported Repositories
PyCurator currently supports the following repositories. Authentication is required only for Kaggle,
though authenticating with Dryad may shorten runtime because rate-limiting is relaxed for authenticated requests.
 

| Repository           | Authentication                                                                               |
|----------------------|----------------------------------------------------------------------------------------------|
| Dataverse            |                                                                                              |
| Dryad                | [Dryad](https://github.com/CDL-Dryad/dryad-app/blob/main/documentation/apis/api_accounts.md) |
| Figshare             |                                                                                              |
| Kaggle               | [Kaggle](https://www.kaggle.com/docs/api#authentication)                                     |
| OpenML               |                                                                                              |
| Papers With Code     |                                                                                              |
| Zenodo               |                                                                                              |

If there's a repository that you would like to see added to the list, check out the [Contributions](#contributions) section.

## Installation and use
### Installation
Dependencies are provided in the ```requirements.txt``` file.
We recommend creating a virtual environment to avoid conflicts with the packages
in your current workspace.

PyCurator requires Python 3.10 or later.
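
A quick way to confirm that your interpreter meets this requirement is sketched below. The `meets_minimum` helper is a hypothetical name used only for illustration and is not part of PyCurator; the minimum version is taken from the requirement stated above.

```python
import sys

# Minimum version, taken from the requirement stated above.
MIN_VERSION = (3, 10)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    # Hypothetical helper for illustration; not part of PyCurator.
    return tuple(version_info[:2]) >= minimum


if meets_minimum():
    print("This Python version is new enough for PyCurator.")
else:
    print("PyCurator needs Python 3.10 or later; please upgrade.")
```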

To install and run PyCurator, paste the following commands into your terminal:
```bash
pip install pycurator
pycurator
```

### Use
#### Repository Selection
After running the commands above, you will see the landing page, which contains licensing, funding, and
copyright information. Clicking ```Continue``` brings you to the repository selection page:

![Repository Selection Page](https://raw.githubusercontent.com/michaelbaluja/PyCurator/main/images/repo_selection.png "Repository Selection Page")

#### Parameter Selection
Clicking on one of the repositories brings up the parameters used for querying its API and
saving your results. The parameters vary by repository.

![Parameter Selection](https://raw.githubusercontent.com/michaelbaluja/PyCurator/main/images/param_selection.png "Figshare Parameter Selection")

These parameters are outlined below:

| Parameter      | Description                                                                                                                      |
|----------------|----------------------------------------------------------------------------------------------------------------------------------|
| Save Directory | Where to save results. Defaults to "/data/{repo_name}/{search_term}_{search_type}.json" within the PyCurator /data subdirectory. |
| Search Terms   | Search term(s) to query. Terms should be separated with a comma, and multi-word terms should be wrapped in quotes.               |
| Search Types   | Type of objects to query.                                                                                                        |

After all required parameters are provided, the ```Run``` button is activated.
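
The comma-and-quote convention for Search Terms and the default Save Directory template can be illustrated with a short sketch. This is not PyCurator's actual implementation: the helper names `parse_search_terms` and `default_save_path`, the use of Python's `csv` module, the `articles` search type, and treating the default path as relative are all assumptions made for illustration.

```python
import csv
from pathlib import PurePosixPath


def parse_search_terms(raw: str) -> list[str]:
    """Split comma-separated terms, keeping quoted multi-word terms whole."""
    # Hypothetical helper; csv handles the quoting rules for us.
    reader = csv.reader([raw], skipinitialspace=True)
    return [term.strip() for term in next(reader) if term.strip()]


def default_save_path(repo_name: str, search_term: str, search_type: str) -> PurePosixPath:
    """Fill in the template data/{repo_name}/{search_term}_{search_type}.json."""
    # Assumption: the default path is relative to the PyCurator directory.
    return PurePosixPath("data") / repo_name / f"{search_term}_{search_type}.json"


terms = parse_search_terms('genomics, "machine learning", climate')
paths = [default_save_path("figshare", term, "articles") for term in terms]

for path in paths:
    print(path)
```

Here the quoted term `machine learning` is kept as a single search term instead of being split into two.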

#### Run Page
![Run Page](https://raw.githubusercontent.com/michaelbaluja/PyCurator/main/images/run_page.png "Run Page")

The run page shows high-level status updates in the main window. These include the beginning and end
of processes, rate-limiting issues, runtime completion, and saving confirmation. Below the main window are real-time status updates for the
specific query being completed, as well as a progress bar for the high-level task. During tasks that have
a fixed duration, such as metadata querying or some web scraping, a fixed-length progress bar shows
how far the task has progressed. During tasks that have an indeterminate duration, a cycling progress bar
indicates that work is continuing.

At the bottom are the navigation buttons. To avoid unnecessary queries, the ```Back``` button is disabled
during runtime and is enabled again after completion. The ```Stop``` button interrupts runtime and stops querying.
After runtime completion or interruption, the ```Stop``` button is replaced by the ```Exit``` button, which lets you
safely terminate the program.

## Contributions
### Bugs
Please note that as of Spring 2022, PyCurator is still under active development. If you come across a bug or problem,
open an issue that describes what you're experiencing.

### Extension
Know of an API that you think should be included in PyCurator? Create a Pull Request outlining
the API and why you think it would be beneficial, and make sure to follow the format set out
by the existing Scraper classes.

## Funding and acknowledgements
The initial development of this program was funded by the Librarians Association of The University of California (LAUC) and UC San Diego Library Research Data Curation Program (RDCP).

Thank you to Matt Peters, Dan LaSusa, John Chen, Joshua Weimer, and Amy Ly for their feedback during testing of early iterations of PyCurator.


