Metadata-Version: 2.1
Name: kraken-extract-from-html
Version: 0.0.15
Summary: Kraken Extract From HTML
Home-page: https://github.com/tactik8/kraken_extract_from_html2
Author: Tactik8
Author-email: info@tactik8.com
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
Requires-Dist: beautifulsoup4
Requires-Dist: extruct
Requires-Dist: html2text
Requires-Dist: w3lib
Requires-Dist: boilerpy3

# Extract from HTML


## What it does
Extracts the following from HTML:
- URLs
- emails
- images
- tables
- structured data (schema.org)
- text
- title
- feeds


## How to use

### Using the API

#### Send a URL (GET)
Send the page URL as the query parameter 'url'.
The service retrieves the content at that URL and returns the extracted data.
If 'contentUrl' is also provided, the content is retrieved from 'contentUrl' instead, while 'url' is still used for the attributes of the extracted data.
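
As a minimal sketch of the GET call described above, the request URL can be assembled with the standard library. The endpoint `https://example.com/extract` is a placeholder (this README does not state where the API is hosted), the helper names are hypothetical, and the JSON response format is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: replace with the address where the API is deployed.
API_BASE = "https://example.com/extract"


def build_request_url(page_url, content_url=None):
    """Build the GET URL carrying 'url' and, optionally, 'contentUrl'."""
    params = {"url": page_url}
    if content_url is not None:
        # The content is read from 'contentUrl', while 'url' remains the page URL.
        params["contentUrl"] = content_url
    return API_BASE + "?" + urlencode(params)


def fetch_extractions(page_url, content_url=None):
    """Send the request and decode the body (assumes the API answers with JSON)."""
    with urlopen(build_request_url(page_url, content_url)) as response:
        return json.load(response)


# Only builds and prints the URL; no network call is made here.
print(build_request_url("https://www.petro-canada.ca/en/business/rack-prices"))
```

Calling `fetch_extractions(...)` performs the actual request once `API_BASE` points at a running instance.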


#### Send a WebContent object (POST)
The content is taken either from the 'text' field or, alternatively, retrieved from the URL in 'archivedAt'.

```
{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}

```
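
A WebContent object like the one above could be built and posted roughly as follows. This is a sketch under assumptions: the endpoint is a placeholder, the helper names are hypothetical, the `application/json` content type is assumed, and passing 'text' as a plain string is a guess (the other fields in the example are lists).

```python
import json
from urllib.request import Request

# Hypothetical endpoint: replace with the address where the API is deployed.
API_BASE = "https://example.com/extract"


def make_web_content(page_url, archived_at=None, text=None):
    """Build a WebContent record shaped like the example in this README."""
    record = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_at is not None:
        # The README example repeats the archived location in both fields.
        record["url"] = [archived_at]
        record["archivedAt"] = [archived_at]
    if text is not None:
        # Assumption: inline HTML is passed as a plain string.
        record["text"] = text
    return record


def build_post_request(record):
    """Wrap the record in a POST request (JSON body is an assumption)."""
    body = json.dumps(record).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(API_BASE, data=body, headers=headers, method="POST")


request = build_post_request(
    make_web_content(
        "https://www.petro-canada.ca/en/business/rack-prices",
        text="<html><body><p>Example</p></body></html>",
    )
)
print(request.get_method(), request.full_url)
```

The request is only constructed here; passing it to `urllib.request.urlopen(request)` would send it.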

### Using the library
Given the URL of a page and its HTML content, returns a list of records containing the extractions.

`from kraken_extract_from_html import kraken_extract_from_html as k`

`records = k.get(url, html)`
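
Put together, a small script might look like the sketch below. Only the import and `k.get(url, html)` come from this README; the `extract` and `count_by_type` helpers are hypothetical, and the idea that each record is a dict carrying an '@type' key is an assumption about the output shape, not documented behaviour.

```python
from collections import Counter

SAMPLE_URL = "https://example.com/page"
SAMPLE_HTML = """<html>
  <head><title>Example page</title></head>
  <body>
    <a href="https://example.com/contact">Contact</a>
    <p>Write to someone@example.com</p>
  </body>
</html>"""


def extract(url, html):
    """Run the library on a page; requires kraken-extract-from-html to be installed."""
    # Imported here so the helper below stays usable without the package.
    from kraken_extract_from_html import kraken_extract_from_html as k
    return k.get(url, html)


def count_by_type(records):
    """Tally records by '@type' (assumed key); anything else counts as 'unknown'."""
    return Counter(
        r.get("@type", "unknown") if isinstance(r, dict) else "unknown"
        for r in records
    )


# Example with hand-written records, so no installation is needed to try the helper.
print(count_by_type([{"@type": "webPage"}, {"@type": "webPage"}, {}]))
```

With the package installed, `count_by_type(extract(SAMPLE_URL, SAMPLE_HTML))` gives a quick overview of what was extracted.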
