Metadata-Version: 2.1
Name: scraparser
Version: 0.0.2
Summary: A simplified PDF table scraping and parsing tool
Home-page: https://gitlab.com/yookoala/scraparser
Author: Koala Yeung
Author-email: koalay@gmail.com
License: UNKNOWN
Description: # COVID-19 Hong Kong Data Scraper
        
        A data scraper that scrapes CSV data from the Hong Kong government
        and feeds it into various destinations for data analysis.
        
        ## Prerequisites
        
        Install all the packages specified in [requirements.txt](requirements.txt). 
        
        Using [venv](https://docs.python.org/3/library/venv.html) is recommended for
        package management.
        
        ```
        python3 -m venv .venv
        . .venv/bin/activate
        ```
        
        Install all packages with pip:
        
        ```
        pip install -r requirements.txt
        ```
        
        
        ## Example Use
        
        ### Basic Scraping
        
        To scrape the latest local situation report:
        
        ```
        python3 scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
        | python3 scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案"
        ```
        
        The downloaded PDF file and the parsed CSV file will be stored in:
        
        ```
        ./data/local_situation_covid19_tc.<time-string>.pdf
        ./data/local_situation_covid19_tc.<time-string>.csv
        ```
        
        The `time-string` is formatted as `YYYY-MM-DD-HHmmss`.
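        
        As an illustration of that naming scheme, here is a minimal sketch of how such
        filenames could be built. The helper names `time_string` and `output_names` are
        hypothetical and not part of scraparser, and whether the tool uses local time or
        UTC is not stated here.
        
        ```python
        from datetime import datetime
        
        
        def time_string(moment: datetime) -> str:
            """Format a datetime as YYYY-MM-DD-HHmmss."""
            return moment.strftime("%Y-%m-%d-%H%M%S")
        
        
        def output_names(stem: str, moment: datetime) -> tuple:
            """Build the PDF and CSV paths for one scrape (hypothetical helper)."""
            stamp = time_string(moment)
            return (f"./data/{stem}.{stamp}.pdf", f"./data/{stem}.{stamp}.csv")
        ```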
        
        ### Parse Previously Downloaded PDF Report
        
        To parse an existing PDF file on your local computer:
        
        ```
        python3 scraparser scrap-location-situation-pdf --file=path/to/somename.pdf
        ```
        
        The parsed CSV file will be stored at `path/to/somename.csv`.
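        
        The output path is the input path with its extension swapped. A minimal sketch
        of that mapping, assuming a hypothetical helper named `csv_path_for` (the tool's
        own implementation may differ):
        
        ```python
        from pathlib import Path
        
        
        def csv_path_for(pdf_path: str) -> Path:
            """Return the CSV path that sits next to the given PDF file."""
            return Path(pdf_path).with_suffix(".csv")
        ```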
        
        
        ### Utility to Fix or Modify Parsed CSV
        
        It is very difficult to read tables from PDF files correctly. Common errors include:
        
        * **Column underflow / overflow**
        
          The content of a cell spills over into the previous or next cell.
        
        * **Row overflow**
        
          The content of a cell (usually one wrapped across multiple lines) spills over
          to create a phantom row with only 1 content-filled cell.
        
        To fix these issues, use the following subcommands:
        
        #### `sort`
        
        The command takes CSV filenames either from arguments or from `STDIN` (one filename
        per line):
        
        ```
        python scraparser sort --column=0 --sort-as-number --in-place ./data/local_situation_covid19_tc.<time-string>.csv
        ```
        
        This command will:
        
        1. Read the file and parse the 1st column (the `--column` parameter
           takes a zero-based column index, like a Python list index).
        2. Sort all rows by the 1st column.
        3. Save the sorted result back to the input file.
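        
        The steps above can be sketched with the standard `csv` module. This is only an
        illustration under assumptions: the function names `sort_rows` and
        `sort_csv_text` are hypothetical, the first row is assumed to be a header, and
        scraparser itself may be implemented differently.
        
        ```python
        import csv
        import io
        
        
        def sort_rows(rows, column=0, as_number=False):
            """Return rows sorted by one zero-based column."""
            def key(row):
                cell = row[column]
                # Numeric sorting avoids "10" being ordered before "2".
                return float(cell) if as_number else cell
            return sorted(rows, key=key)
        
        
        def sort_csv_text(text, column=0, as_number=False, has_header=True):
            """Sort CSV text by one column, keeping the header row first."""
            rows = list(csv.reader(io.StringIO(text)))
            header, body = (rows[:1], rows[1:]) if has_header else ([], rows)
            out = io.StringIO()
            writer = csv.writer(out, lineterminator="\n")
            writer.writerows(header + sort_rows(body, column, as_number))
            return out.getvalue()
        ```
        
        Sorting as text would place case `10` before case `2`, which is why the example
        command passes `--sort-as-number`.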
        
        
        #### `fix-column-underflow`
        
        The command takes CSV filenames either from arguments or from `STDIN` (one filename
        per line):
        
        ```
        python scraparser fix-column-underflow --column=5 --in-place ./data/local_situation_covid19_tc.<time-string>.csv
        ```
        
        This command will:
        
        1. Automatically read all the valid contents in the 6th column (the `--column`
           parameter takes a zero-based column index, like a Python list index).
        2. Read every row and check if the cell in that column is empty (`math.isnan()`).
        3. If so, check the column before it (the 5th column in our case) and see if it is
           suffixed by any valid content found in step (1).
        4. Split the content correctly between the 5th and 6th columns.
        5. Save the fixed result back to the input file.
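        
        A minimal sketch of the steps above on plain lists of strings. It assumes empty
        cells are empty strings (the tool itself checks with `math.isnan()`), and the
        function name `fix_column_underflow` is used here for illustration only.
        
        ```python
        def fix_column_underflow(rows, column):
            """Move content that spilled into the previous column back into `column`."""
            # Step 1: every non-empty value already in the target column counts as valid.
            # Longest values are tried first so the most specific suffix wins.
            valid = sorted({row[column] for row in rows if row[column]}, key=len, reverse=True)
            fixed = []
            for row in rows:
                row = list(row)
                before = row[column - 1]
                if not row[column]:
                    # Steps 2-4: look for a valid value at the end of the previous cell.
                    for value in valid:
                        # Assumption: something must remain in the previous cell.
                        if before.endswith(value) and len(before) > len(value):
                            row[column - 1] = before[: -len(value)].rstrip()
                            row[column] = value
                            break
                fixed.append(row)
            return fixed
        ```
        
        Because the valid values are learned from the file itself, a value that never
        appears cleanly in the target column cannot be recovered this way.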
        
        #### `fix-date-column-underflow`
        
        The command takes CSV filenames either from arguments or from `STDIN` (one filename
        per line):
        
        ```
        python scraparser fix-date-column-underflow --column=1 --format=DD/MM/YYYY --in-place ./data/local_situation_covid19_tc.<time-string>.csv
        ```
        
        This command will:
        
        1. Read every row and check if the cell in the 2nd column is empty (`math.isnan()`).
        2. If so, check the column before it (the 1st column in our case) and see if it is
           suffixed by a string that matches the specified date format.
        3. Split the content correctly between the 1st and 2nd columns.
        4. Save the fixed result back to the input file.
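        
        The steps above can be sketched with a regular expression built from the format
        string. This is an assumption-laden illustration: only the `DD`, `MM` and `YYYY`
        tokens are handled, empty cells are assumed to be empty strings, and the names
        `date_suffix_pattern` and `fix_date_column_underflow` are hypothetical.
        
        ```python
        import re
        
        # Regex fragments for the format tokens this sketch understands.
        TOKENS = {"YYYY": r"\d{4}", "MM": r"\d{2}", "DD": r"\d{2}"}
        
        
        def date_suffix_pattern(fmt):
            """Compile a pattern that splits 'prefix + date' for a format like DD/MM/YYYY."""
            parts = re.split(r"(YYYY|MM|DD)", fmt)
            body = "".join(TOKENS.get(part, re.escape(part)) for part in parts)
            return re.compile(r"^(.*?)\s*(" + body + r")$")
        
        
        def fix_date_column_underflow(rows, column, fmt="DD/MM/YYYY"):
            """Move a date that spilled into the previous column back into `column`."""
            pattern = date_suffix_pattern(fmt)
            fixed = []
            for row in rows:
                row = list(row)
                match = pattern.match(row[column - 1]) if not row[column] else None
                # Only split when some content remains for the previous column.
                if match and match.group(1):
                    row[column - 1], row[column] = match.group(1), match.group(2)
                fixed.append(row)
            return fixed
        ```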
        
        #### `fix-empty-rows`
        
        The command takes CSV filenames either from arguments or from `STDIN` (one filename
        per line):
        
        ```
        python scraparser fix-empty-rows --in-place ./data/local_situation_covid19_tc.<time-string>.csv
        ```
        
        This command will:
        
        1. Read every row and find all rows with all but 1 cell empty (`math.isnan()`).
        2. For each such row, append the content of that 1 cell to the cell directly above it.
        3. Drop all "phantom rows" found in step (1).
        4. Save the fixed result back to the input file.
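        
        A minimal sketch of the steps above, assuming empty cells are empty strings and
        that the wrapped text is joined with a configurable separator (the `sep`
        parameter and the function name `fix_empty_rows` are assumptions made for this
        illustration):
        
        ```python
        def fix_empty_rows(rows, sep=""):
            """Merge phantom rows (exactly one filled cell) into the row above them."""
            fixed = []
            for row in rows:
                filled = [index for index, cell in enumerate(row) if cell]
                if len(filled) == 1 and fixed:
                    # Phantom row: append its only cell to the cell directly above,
                    # then drop the row by not adding it to the result.
                    index = filled[0]
                    fixed[-1][index] += sep + row[index]
                else:
                    fixed.append(list(row))
            return fixed
        ```
        
        A phantom row at the very top of the file has no row above it, so this sketch
        leaves it in place.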
        
        ## Advanced Piping Usage
        
        ### Parse and Show Result Data
        
        To fix all the issues in the CSV file parsed from the local situation report:
        
        **Linux**
        
        ```
        python3 scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
        | python3 scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
        | python3 scraparser fix-date-column-underflow --column=1 --in-place \
        | python3 scraparser fix-column-underflow --column=6 --in-place \
        | python3 scraparser fix-column-underflow --column=5 --in-place \
        | python3 scraparser fix-empty-rows --in-place \
        | python3 scraparser sort --in-place \
        | xargs -I{} xdg-open "{}"
        ```
        
        **macOS**
        
        ```
        python3 scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
        | python3 scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
        | python3 scraparser fix-date-column-underflow --column=1 --in-place \
        | python3 scraparser fix-column-underflow --column=6 --in-place \
        | python3 scraparser fix-column-underflow --column=5 --in-place \
        | python3 scraparser fix-empty-rows --in-place \
        | python3 scraparser sort --in-place \
        | xargs -I{} open "{}"
        ```
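        
        The pipelines above work because each stage can read filenames from `STDIN`;
        passing the result on to `xargs` suggests that each stage also writes the
        filename it processed to standard output. A minimal sketch of that pattern,
        where `input_filenames` and `run_stage` are hypothetical names and the echoing
        behaviour is an assumption inferred from the examples:
        
        ```python
        import sys
        
        
        def input_filenames(args, stdin):
            """Return filenames from arguments, falling back to one per line on stdin."""
            if args:
                return list(args)
            return [line.strip() for line in stdin if line.strip()]
        
        
        def run_stage(fix, args, stdin=sys.stdin, stdout=sys.stdout):
            """Apply `fix` to each file, then pass the filename to the next stage."""
            for name in input_filenames(args, stdin):
                fix(name)                 # edit the file in place
                print(name, file=stdout)  # hand the filename down the pipe
        ```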
        
        ### Parse Data then Update Google Sheet
        
        This will overwrite the current data in the specified range. If there are not enough rows in
        the Google Sheet, the sheet will be expanded automatically.
        
        Assuming you have defined the variable `$GOOGLE_SHEET_ID` and the target sheet
        'CHP/DH Local Situation Input' exists:
        
        ```
        python3 scraparser scrap "https://www.chp.gov.hk/files/pdf/local_situation_covid19_tc.pdf" \
        | python3 scraparser parse-pdf-to-csv --headers="個案編號,報告日期,發病日期,性別,年齡,入住醫院名稱,住院/出院/死亡,香港/非香港居民,個案分類,確診/疑似個案" \
        | python3 scraparser fix-date-column-underflow --column=1 --in-place \
        | python3 scraparser fix-column-underflow --column=6 --in-place \
        | python3 scraparser fix-column-underflow --column=5 --in-place \
        | python3 scraparser fix-empty-rows --in-place \
        | python3 scraparser sort --in-place \
        | python3 scraparser googlesheet "$GOOGLE_SHEET_ID" update --range="'CHP/DH Local Situation Input'!A2:Z" 
        ```
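        
        The `--range` value uses A1 notation, where a sheet title containing spaces or
        special characters is wrapped in single quotes. A minimal sketch of building
        such a range string, with `a1_range` being a hypothetical helper that is not
        part of scraparser:
        
        ```python
        def a1_range(sheet_title, cells):
            """Build an A1-notation range such as 'My Sheet'!A2:Z."""
            # A single quote inside a quoted sheet title is written twice.
            escaped = sheet_title.replace("'", "''")
            return f"'{escaped}'!{cells}"
        ```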
        
        
        ## Development
        
        ### Generating distribution archives
        
        Run these commands to generate the distribution folder `dist`:
        
        ```
        python3 -m pip install --user --upgrade setuptools wheel
        python3 setup.py sdist bdist_wheel
        ```
        
        If you have [make](https://www.gnu.org/software/make/) on your system, you may simply
        run:
        
        ```
        make dist
        ```
        
        ## License
        
        Licensed under [the MIT License](LICENSE). A copy of the license is available in this repository.
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
