# -*- coding: utf-8 -*-
from setuptools import setup

package_dir = {'': 'src'}

packages = ['spacy_html_tokenizer']

package_data = {'': ['*']}

install_requires = ['selectolax>=0.3.6,<0.4.0', 'spacy>=3.2.2,<4.0.0']

entry_points = {
    'spacy_tokenizers': [
        'html_tokenizer = spacy_html_tokenizer.html_tokenizer:create_html_tokenizer',
    ],
}

setup_kwargs = {
    'name': 'spacy-html-tokenizer',
    'version': '0.1.3',
    'description': 'An HTML-friendly spaCy tokenizer',
    'long_description': '# HTML-friendly spaCy Tokenizer\n\nIt\'s not an [HTML tokenizer](https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#tokenization), but a tokenizer that works with text that happens to be embedded in HTML. \n\n**Install**\n\n```\npip install spacy-html-tokenizer\n```\n\n## How it works\n\nUnder the hood we use [`selectolax`](https://github.com/rushter/selectolax) to parse HTML. From there, common elements used for styling within traditional text elements (e.g. `<b>` or `<span>` inside of a `<p>`) are [unwrapped](https://selectolax.readthedocs.io/en/latest/parser.html#selectolax.parser.HTMLParser.unwrap_tags), meaning the text contained within those elements becomes nested inside their parent elements. You can change this with the `unwrapped_tags` argument to the constructor. Tags used for non-text content, such as `<script>` and `<style>`, are removed. Then the text is extracted from each remaining terminal node that contains text. These texts are then tokenized with the standard tokenizer defaults and then combined into a single `Doc`. The end result is a `Doc`, but each element\'s text from the original document is also a [sentence](https://spacy.io/api/doc#sents), so you can iterate through each element\'s text with `doc.sents`.\n\n## Example\n\n```python\nimport spacy\nfrom spacy_html_tokenizer import create_html_tokenizer\n\nnlp = spacy.blank("en")\nnlp.tokenizer = create_html_tokenizer()(nlp)\n\nhtml = """<h2>An Ordered HTML List</h2>\n<ol>\n    <li><b>Good</b> coffee. There\'s another sentence here</li>\n    <li>Tea and honey</li>\n    <li>Milk</li>\n</ol>"""\n\ndoc = nlp(html)\nfor sent in doc.sents:\n    print(sent.text, "-- N Tokens:", len(sent))\n\n# An Ordered HTML List -- N Tokens: 4\n# Good coffee. There\'s another sentence here -- N Tokens: 8\n# Tea and honey -- N Tokens: 3\n# Milk -- N Tokens: 1\n```\n\nIn the prior example, we didn\'t have any other sentence boundary detection components. However, this will also work with downstream sentence boundary detection components -- e.g.\n\n```python\nnlp = spacy.load("en_core_web_sm")  # has parser for sentence boundary detection\nnlp.tokenizer = create_html_tokenizer()(nlp)\n\ndoc = nlp(html)\nfor sent in doc.sents:\n    print(sent.text, "-- N Tokens:", len(sent))\n\n# An Ordered HTML List -- N Tokens: 4\n# Good coffee. -- N Tokens: 3\n# There\'s another sentence here -- N Tokens: 5\n# Tea and honey -- N Tokens: 3\n# Milk -- N Tokens: 1\n```\n\n### Comparison\n\nWe\'ll compare parsing [Explosion\'s About page](https://explosion.ai/about) with and without the HTML tokenizer.\n\n```python\nimport requests\nimport spacy\nfrom spacy_html_tokenizer import create_html_tokenizer\nfrom selectolax.parser import HTMLParser\n\nabout_page_html = requests.get("https://explosion.ai/about").text\n\nnlp_default = spacy.load("en_core_web_lg")\nnlp_html = spacy.load("en_core_web_lg")\nnlp_html.tokenizer = create_html_tokenizer()(nlp_html)\n\n# text from HTML - used for non-HTML default tokenizer\nabout_page_text = HTMLParser(about_page_html).text()\n\ndoc_default = nlp_default(about_page_text)\ndoc_html = nlp_html(about_page_html)\n```\n\n#### View first sentences of each\n\nWith standard tokenizer on text extracted from HTML\n\n```python\nlist(sent.text for sent in doc_default.sents)[:5]\n```\n\n```python\n[\'AboutSoftware & DemosCustom SolutionsBlog & NewsAbout usExplosion is a software company specializing in developer tools for Artificial\\nIntelligence and Natural Language Processing.\',\n\'We’re the makers of\\nspaCy, one of the leading open-source libraries for advanced\\nNLP and Prodigy, an annotation tool for radically efficient\\nmachine teaching.\',\n\'\\n\\n\',\n\'Ines Montani CEO, FounderInes is a co-founder of Explosion and a core developer of the spaCy NLP library and the Prodigy annotation tool.\',\n\'She has helped set a new standard for user experience in developer tools for AI engineers and researchers.\']\n```\n\nWith HTML Tokenizer on HTML\n\n```python\nlist(sent.text for sent in doc_html.sents)[:10]\n```\n\n```python\n[\'About us · Explosion\',\n \'About\',\n \'Software\',\n \'&\',\n \'Demos\',\n \'Custom Solutions\',\n \'Blog & News\',\n \'About us\',\n \'Explosion is a software company specializing in developer tools for Artificial Intelligence and Natural Language Processing.\',\n \'We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP and Prodigy, an annotation tool for radically efficient machine teaching.\']\n```\n\nWhat about the last sentence?\n\n```python\nlist(sent.text for sent in doc_default.sents)[-1]\n\n# We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP.NavigationHomeAbout usSoftware & DemosCustom SolutionsBlog & NewsOur SoftwarespaCy · Industrial-strength NLPProdigy · Radically efficient annotationThinc · Functional deep learning© 2016-2022 Explosion · Legal & Imprint/*<![CDATA[*/window.pagePath="/about";/*]]>*//*<![CDATA[*/window.___chunkMapping={"app":["/app-ac229f07fa81f29e0f2d.js"],"component---node-modules-gatsby-plugin-offline-app-shell-js":["/component---node-modules-gatsby-plugin-offline-app-shell-js-461e7bc49c6ae8260783.js"],"component---src-components-post-js":["/component---src-components-post-js-cf4a6bf898db64083052.js"],"component---src-pages-404-js":["/component---src-pages-404-js-b7a6fa1d9d8ca6c40071.js"],"component---src-pages-blog-js":["/component---src-pages-blog-js-1e313ce0b28a893d3966.js"],"component---src-pages-index-js":["/component---src-pages-index-js-175434c68a53f68a253a.js"],"component---src-pages-spacy-tailored-pipelines-js":["/component---src-pages-spacy-tailored-pipelines-js-028d0c6c19584ef0935f.js"]};/*]]>*/\n```\n\nYikes. How about HTML Tokenizer?\n\n```python\nlist(sent.text for sent in doc_html.sents)[-1]\n\n# \'© 2016-2022 Explosion · Legal & Imprint\'\n```\n',
    'author': 'Peter Baumgartner',
    'author_email': '5107405+pmbaumgartner@users.noreply.github.com',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/pmbaumgartner/spacy-html-tokenizer',
    'package_dir': package_dir,
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'entry_points': entry_points,
    'python_requires': '>=3.7,<4.0',
}


setup(**setup_kwargs)
