Metadata-Version: 2.1
Name: opendata-pipeline
Version: 0.2.1
Summary: A pipeline for processing open medical examiner's data using GitHub Actions CI/CD.
License: MIT
Author: Nick Anthony
Author-email: nanthony007@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Dist: aiohttp (>=3.8.1,<4.0.0)
Requires-Dist: orjson (>=3.8.0,<4.0.0)
Requires-Dist: pandas (>=1.4.4,<2.0.0)
Requires-Dist: pydantic[dotenv] (>=1.10.2,<2.0.0)
Requires-Dist: requests (>=2.28.1,<3.0.0)
Requires-Dist: rich (>=12.5.1,<13.0.0)
Requires-Dist: typer (>=0.6.1,<0.7.0)
Description-Content-Type: text/markdown

TODO:

- [ ] Add A LOT more PRINT statements
- [x] Add comments
- [ ] Add documentation (README and docs site)
  - The latter will be necessary once we move to Dockerfiles and Actions
- [ ] Add tests 😅
  - [ ] including CLI tests
- [x] Use arcgis package for geocoding
  - [ ] Use batch geocoding (had problem with Token... can register as anonymous user?)
- [x] ~~Use Socrata package (register API key) for data fetching from datasets published on Socrata~~
  - [x] Use `requests` package for data fetching from datasets published on odata
- [x] Use github python package to keep config.yaml updated after successful runs
  - [ ] Can also use to update JS datafiles at end of analysis (see below)
  - [x] Just used requests and api directly
    - These should be very small and generated by pandas analysis of the data
- [ ] results should be in a github release (data files) (can zip them)
  - [ ] Use GH CLI in bash script because pre-installed in Actions
  - [ ] We can then just use the OctoKit JS package to point to the LINKS of the files and when you click on them it will download them
  - [ ] then web page to enable file downloads and show some graphs (basic --> records over time for each dataset)
    -  What charting framework to use?
    -  Need an action to update the frontend codebase with the new data
       -  Store in JSON format
-  [ ] ~~Add website to Socrata key~~
-  [x] Make a container to run the whole pipeline (so no downloads for users)
   -  [ ] Host on GHCR
-  [x] MAKE OUR OWN UNIQUE IDENTIFIERS FOR ALL DATASETS COMBINED
   -  [x] SAME COLUMN NAME IN ALL DATASETS, THEN DON'T HAVE TO PROVIDE IDENTIFIER COLUMN IN config.yaml
   -  [x] Also allows for better merging of datasets (i.e. records + drugs + geo)
- [ ] ~~Do we want to publish a Web API as well?~~
  - [ ] ~~Then we would need a DB~~
- [x] No Windows support due to drug extraction tool usage
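
The shared-identifier item above can be sketched with pandas roughly as follows. This is only an illustration, not the pipeline's actual code: the column name `record_id`, the helper `add_record_ids`, the dataset names, and the sample rows are all hypothetical.

```python
import pandas as pd

# Hypothetical name for the identifier column shared by every dataset.
ID_COLUMN = "record_id"


def add_record_ids(df: pd.DataFrame, dataset_name: str) -> pd.DataFrame:
    """Return a copy of df with a dataset-prefixed unique identifier column."""
    out = df.reset_index(drop=True).copy()
    out[ID_COLUMN] = [f"{dataset_name}-{i}" for i in range(len(out))]
    return out


# Two made-up datasets, combined after each gets the same identifier column.
records = pd.concat(
    [
        add_record_ids(pd.DataFrame({"death_date": ["2022-01-03", "2022-01-04"]}), "dataset_a"),
        add_record_ids(pd.DataFrame({"death_date": ["2022-02-10"]}), "dataset_b"),
    ],
    ignore_index=True,
)

# Made-up drug extraction output keyed on the same identifier.
drugs = pd.DataFrame(
    {
        ID_COLUMN: ["dataset_a-0", "dataset_a-0", "dataset_b-0"],
        "drug": ["fentanyl", "cocaine", "heroin"],
    }
)

# Because every table uses the same column name, no per-dataset identifier
# column has to be configured; validate= checks that record ids are unique.
merged = records.merge(drugs, on=ID_COLUMN, how="left", validate="one_to_many")
print(merged)
```

A left merge keeps records that have no extracted drugs, which is probably what we want when joining records + drugs + geo.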

I think, if my math is right, we can do ~20 minutes / day of Actions (about 600 minutes per month, against the free limit of 2,000 minutes per month).
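
A quick way to sanity-check that budget, using only the two figures quoted above (20 minutes per day and 2,000 free minutes per month). The 31-day month is an assumption chosen as the worst case, and the variable names are made up for this sketch.

```python
# Figures from the note above; the 31-day month is a worst-case assumption.
FREE_MINUTES_PER_MONTH = 2_000
PLANNED_MINUTES_PER_DAY = 20
DAYS_PER_MONTH = 31

# Total minutes the planned daily run would use in the longest month.
monthly_usage = PLANNED_MINUTES_PER_DAY * DAYS_PER_MONTH

# Largest daily run that would still fit inside the free limit.
daily_budget = FREE_MINUTES_PER_MONTH / DAYS_PER_MONTH

print(f"planned usage: {monthly_usage} of {FREE_MINUTES_PER_MONTH} minutes")
print(f"daily headroom: {daily_budget:.1f} minutes")
```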


**Note:** it is very important to `git pull` often to stay up to date with `config.yaml`.
