# -*- coding: utf-8 -*-
from setuptools import setup

package_dir = {'': 'src'}

modules = ['bpe_summarizer']

install_requires = [
    'nltk>=3.5,<4.0',
    'scipy>=1.4.1,<2.0.0',
    'transformers>=2.11.0,<3.0.0',
]

setup_kwargs = {
    'name': 'bpe-summarizer',
    'version': '0.2.0',
    'description': 'This summarizer attempts to leverage Byte Pair Encoding (BPE) tokenization and the Bart vocabulary to filter text by semantic meaningfulness.',
    'long_description': '## BPE Summarizer\n\n![CI](https://github.com/crodriguez1a/bpe-summarizer/workflows/CI/badge.svg?branch=master)\n\nThis summarizer attempts to leverage Byte Pair Encoding (BPE) tokenization and the Bart vocabulary to filter text by semantic meaningfulness.\n\nBPE text representation is a subword-level approach to tokenization that aims to efficiently reuse parts of words while retaining semantic value.\n\nThe algorithm is based on the frequency of n-gram pairs. More frequent pairs are represented by larger tokens.\n\nThis project explores the assumption that token size correlates strongly with semantic meaningfulness. This summarization approach intends to surface the most meaningful sentences by comparing token values and retaining sentences from the original text that include meaningful tokens within a specified percentile.\n\n## Install\n\n```\npip install bpe-summarizer\n```\n\n## Usage\n\n```python\nfrom bpe_summarizer import bpe_summarize\n\nbpe_summarize(article, percentile=99)\n```\n\n## Parameters\n\nParameter|Definition|Default|Type\n--|--|--|--\n`document` | A text blob with sentences delineated by punctuation | `None` | `String`\n`percentile` | Sentences that include tokens in the top kth percentile will remain after summarization | `99` | `Float`\n`tokenizer` | A [huggingface](https://github.com/huggingface/tokenizers) `PreTrainedTokenizer` instance that relies on byte-pair-encoding | `BartTokenizer.from_pretrained("facebook/bart-large")` | `transformers.PreTrainedTokenizer`\n`apply_intra_sentence` | If `True`, summarization will be applied at both the document level and the sentence level | `False` | `Boolean`\n`intra_sentence_percentile`| When `apply_intra_sentence` is `True`, this percentile will be applied to individual sentences | `50`* | `Float`\n\n* *Note: `intra_sentence_percentile` is ignored if its value represents less than the percentile score of the mean of tokens, otherwise the percentile score of the mean is used.*\n\n## Examples\n\n**Human Summary**\n\n<blockquote>\n\nBuilding Deep Dependency Structures Using A Wide-Coverage CCG Parser\n\nThis paper describes a wide-coverage statistical parser that uses Combinatory Categorial Grammar (CCG) to derive dependency structures.\n\nThe parser differs from most existing wide-coverage treebank parsers in capturing the long-range dependencies inherent in constructions such as coordination, extraction, raising and control, as well as the standard local predicate-argument dependencies.\n\nA set of dependency structures used for training and testing the parser is obtained from a treebank of CCG normal-form derivations, which have been derived (semi-) automatically from the Penn Treebank. The parser correctly recovers over 80% of labelled dependencies, and around 90% of unlabelled dependencies.\n\nWe provide examples showing how heads can fill dependency slots during a derivation, and how long-range dependencies can be recovered through unification of co-indexed head variables.\n\nWe define predicate argument structure for CCG in terms of the dependencies that hold between words with lexical functor categories and their arguments.\n</blockquote>\n\n**BPE Summary**\n\n<blockquote>\n\nBuilding Deep Dependency Structures Using A Wide-Coverage CCG Parser\n\nThis paper describes a wide-coverage statistical parser that uses Combinatory Categorial Grammar (CCG) to derive dependency structures.\n\nThe parser differs from most existing wide-coverage treebank parsers in capturing the long-range dependencies inherent in constructions such as coordination, extraction, raising and control, as well as the standard local predicate-argument dependencies.\n\nA set of dependency structures used for training and testing the parser is obtained from a treebank of CCG normal-form derivations, which have been derived (semi-) automatically from the Penn Treebank. The parser correctly recovers over 80% of labelled dependencies, and around 90% of unlabelled dependencies. However, the dependencies are typically derived from a context-free phrase structure.\n</blockquote>\n\n## Evaluation\n\nTo evaluate the quality of the summarization, we apply a [semantic similarity metric](https://www.tensorflow.org/api_docs/python/tf/keras/losses/cosine_similarity) to compare auto-summarized examples with human summaries from the [scisummnet dataset](https://cs.stanford.edu/~myasu/projects/scisumm_net/). Text was represented using [sentence-level embeddings](https://tfhub.dev/google/universal-sentence-encoder/4). Figure 1 charts the results from the BPE Summarizer as compared to [widely used](https://huggingface.co/transformers/model_doc/bart.html) summarization techniques. It performed competitively and completed summarization in one one-hundredth of a second as compared to 55 seconds* over 100 samples.\n\n![Side-by-side with widely used summarizer](notebooks/hf_bart_comparison.png)\n<p style="text-align: center;"><small>Fig. 1. Evaluation alongside a widely used summarizer</small></p>\n\n<small>\\*Performance evaluation was done using a CPU, and the competing technique was applied after stripping down to use only the [summarization component](https://github.com/huggingface/transformers/blob/70bc3ead4f0b08e8cadd1805ada2a22f0c302399/src/transformers/pipelines.py#L1476).</small>\n\n**References:**\n- [Language Models are Unsupervised Multitask Learners, Radford et al.](paper/language_models_are_unsupervised_multitask_learners.pdf)\n- [Huggingface/GPT Tokenizer](https://github.com/huggingface/transformers/blob/827d6d6ef071029cfe82838a18dab046b5813976/src/transformers/tokenization_gpt2.py)\n- [GPT-2/Encoder](https://github.com/openai/gpt-2/blob/master/src/encoder.py)\n- [Comparing Transformers and Tokenizers, Németh](https://towardsdatascience.com/comparing-transformer-tokenizers-686307856955)\n- [Huggingface Bart Summarization Pipeline](https://huggingface.co/transformers/model_doc/bart.html)\n',
    # The long description above is Markdown; declare it so PyPI renders it as such.
    'long_description_content_type': 'text/markdown',
    'author': 'crodriguez1a',
    'author_email': 'crodriguez1a@gmail.com',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/crodriguez1a/bpe-summarizer',
    'package_dir': package_dir,
    'py_modules': modules,
    'install_requires': install_requires,
    'python_requires': '>=3.7,<4.0',
}


setup(**setup_kwargs)
