# -*- coding: utf-8 -*-
from setuptools import setup

package_dir = \
{'': 'src'}

packages = \
['pytorch_ie',
 'pytorch_ie.core',
 'pytorch_ie.data',
 'pytorch_ie.data.datamodules',
 'pytorch_ie.data.datasets',
 'pytorch_ie.data.datasets.hf_datasets',
 'pytorch_ie.models',
 'pytorch_ie.models.genre',
 'pytorch_ie.models.modules',
 'pytorch_ie.taskmodules',
 'pytorch_ie.utils']

package_data = \
{'': ['*']}

install_requires = \
['datasets>=2.4.0,<3.0.0',
 'huggingface-hub>=0.5.1,<0.6.0',
 'pytorch-lightning>=1.6.1,<2.0.0',
 'torchmetrics>=0.8.0,<0.9.0',
 'transformers>=4.18.0,<5.0.0']

setup_kwargs = {
    'name': 'pytorch-ie',
    'version': '0.12.0',
    'description': 'State-of-the-art Information Extraction in PyTorch',
    'long_description': '# PyTorch-IE: State-of-the-art Information Extraction in PyTorch\n\n[![PyPI](https://img.shields.io/pypi/v/pytorch-ie.svg)][pypi status]\n[![Status](https://img.shields.io/pypi/status/pytorch-ie.svg)][pypi status]\n[![Python Version](https://img.shields.io/pypi/pyversions/pytorch-ie)][pypi status]\n[![License](https://img.shields.io/pypi/l/pytorch-ie)][license]\n\n[![Read the documentation at https://pytorch-ie.readthedocs.io/](https://img.shields.io/readthedocs/pytorch-ie/latest.svg?label=Read%20the%20Docs)][read the docs]\n[![Tests](https://github.com/christophalt/pytorch-ie/workflows/Tests/badge.svg)][tests]\n[![Codecov](https://codecov.io/gh/christophalt/pytorch-ie/branch/main/graph/badge.svg)][codecov]\n\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)][pre-commit]\n[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)][black]\n\n[pypi status]: https://pypi.org/project/pytorch-ie/\n[read the docs]: https://pytorch-ie.readthedocs.io/\n[tests]: https://github.com/christophalt/pytorch-ie/actions?workflow=Tests\n[codecov]: https://app.codecov.io/gh/christophalt/pytorch-ie\n[pre-commit]: https://github.com/pre-commit/pre-commit\n[black]: https://github.com/psf/black\n\n## 🤯 What\'s this about?\n\nThis is an experimental framework that aims to combine the lessons learned from five years of information extraction research.\n\n-   **Focus on the core task:** The main goal is to develop information extraction methods not dataset loading and evaluation logic. We use external well-maintained libraries for non-core functionality. PyTorch-Lightning for training and logging, Huggingface datasets for dataset reading, and Huggingface evaluate for evaluation (coming soon).\n-   **Sharing is caring:** Being able to quickly and easily share models is key to promote your work and facilitate further research. All models developed in PyTorch-IE can be easily shared via the Huggingface model hub. This further allows to quickly build demos based on Huggingface spaces, gradio or streamlit.\n-   **Unified document format:** A unified document format allows for quick experimentation on any dataset or task.\n-   **Beyond sentence level:** Most information extraction frameworks assume text inputs at a sentence granularity. We do not make any assumption on the granularity but generally aim for document-level information extraction.\n-   **Beyond unstructured text:** Unstructured text is only one possible area for information extraction. We developed the framework to also support information extraction from semi-structured text (e.g. HTML), two-dimensional text (e.g. OCR\'d images), and images.\n-   **Character-level annotation and evaluation:** Many information extraction frameworks annotate and evaluate on a token level. We believe that annotation and evaluation should be done on a character level as this also considers the suitability of the tokenizer for the task.\n-   **Make no assumptions on the structure of models:** The last years have seen many different and creative approaches to information extraction and a framework that imposes a structure on those will most certainly be to limiting. With PyTorch-iE you have full control over how a document is prepared for a model and how the model is structured. The logic is self-contained and thus can be easily shared and inspected by others. The only assumption we make is that the input is a document and the output are targets (training) or annotations (inference).\n\n## 🔭 Demos\n\n| Task                                                       | Link                                                                                                                                                                  |\n| ---------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| Named Entity Recognition (Span-based)                      | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/pie/NER)                               |\n| Joint Named Entity Recognition and Relation Classification | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/pie/Joint-NER-and-Relation-Extraction) |\n\n## 🚀️ Quickstart\n\n```console\n$ pip install pytorch-ie\n```\n\n## 🥧 Concepts & Architecture\n\nPyTorch-IE builds on three core concepts: the **📃 Document**, the **🔤 ⇔ 🔢 Taskmodule**, and the **🧮 Model**. In a\nnutshell, the Document says how your data is structured, the Model defines your trainable logic and the Taskmodule\nconverts from one end to the other. All three concepts are represented as abstract classes that should be used to\nderive use-case specific versions. In the following, they are explained in detail.\n\n<details>\n<summary>\n\n### 📃 Document\n\n</summary>\n\nThe `Document` class is a special `dataclass` that defines the document model. Derivations can contain several\nelements:\n\n-   **Data fields** like strings to represent one or multiple texts or arrays for image data. These elements can be\n    arbitrary python objects.\n-   **Annotation fields** like labeled spans for entities or labeled tuples of spans for relations. These elements have\n    to be of a certain container type `AnnotationList` that is dynamically typed with the actual annotation type, e.g.\n    `entities: AnnotationList[LabeledSpan]`. Furthermore, annotation elements define one or multiple annotation `targets`.\n    An annotation target is either a data element or another annotation container. Internally, targets are used to construct the\n    annotation graph, i.e. data elements and annotation containers are the nodes and targets define the edges. The\n    annotation graph defines the (de-)serialization order and what is accessible from within an annotation. To\n    facilitate the setup of annotation containers, there is the `annotation_field()` method.\n-   **Other fields** to save metadata, ids, etc. They are not constrained in any way, but can not be accessed from within\n    annotations.\n\n<details>\n\n<summary>\n\n#### Example Document Model\n\n</summary>\n\n```python\nfrom typing import Optional\nfrom pytorch_ie.core import Document, AnnotationList, annotation_field\nfrom pytorch_ie.annotations import LabeledSpan, BinaryRelation, Label\n\nclass MyDocument(Document):\n    # data fields (any field that is targeted by an annotation fields)\n    text: str\n    # annotation fields\n    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")\n    relations: AnnotationList[BinaryRelation] = annotation_field(target="entities")\n    label: AnnotationList[Label] = annotation_field()\n    # other fields\n    doc_id: Optional[str] = None\n```\n\nNote that the `label` is a special annotation field that does not define a target because it belongs to the whole document.\nYou can also have more complex constructs, like annotation fields that target multiple other fields by using\n`annotation_field(targets)` or `annotation_field(named_targets)`. The latter is useful if you want to access the\ntargets by name from within the annotation, see below for an example.\n\n</details>\n\n#### Annotations\n\nThere are several predefined **annotation types** in `pytorch_ie.annotations`, however, feel free to define your own.\nAnnotations have to be dataclasses that subclass `pytorch_ie.core.Annotation`. They also need to be hashable and\nimmutable. The following is a simple example:\n\n```python\n@dataclass(eq=True, frozen=True)\nclass SimpleLabeledSpan(Annotation):\n    start: int\n    end: int\n    label: str\n```\n\n<details>\n<summary>\n\n##### Accessing Target Content\n\n</summary>\n\nWe can expand the above example a little to have a nice string representation:\n\n```python\n@dataclass(eq=True, frozen=True)\nclass LabeledSpan(Annotation):\n    start: int\n    end: int\n    label: str\n\n    def __str__(self) -> str:\n        if self.targets is None:\n            return ""\n        return str(self.target[self.start : self.end])\n```\n\nThe content of `self.target` is lazily assigned as soon as the annotation is added to a document.\n\nNote that this now expects a single `collections.abc.Sequence` as `target`, e.g.:\n\n```python\nmy_spans: AnnotationList[Span] = annotation_field(target="<NAME_OF_THE_SEQUENCE_FIELD>")\n```\n\nIf we have multiple targets, we need to define target names to access them. For this, we need to set the special\nfield `TARGET_NAMES`:\n\n```python\n@dataclass(eq=True, frozen=True)\nclass Alignment(Annotation):\n    TARGET_NAMES = ("text1", "text2")\n    start1: int\n    end1: int\n    start2: int\n    end2: int\n\n    def __str__(self) -> str:\n        if self.targets is None:\n            return ""\n        # we can access the `named_targets` which has the keys defined in `TARGET_NAMES`\n        span1 = self.named_targets["text1"][self.start1 : self.end1]\n        span2 = self.named_targets["text2"][self.start2 : self.end2]\n        return f\'span1="{span1}" is aligned with span2="{span2}"\'\n```\n\nThis requires to define the annotation container as follows:\n\n```python\nclass MyDocumentWithAlignment(Document):\n    text_a: str\n    text_b: str\n    # `named_targets` defines the mapping from `TARGET_NAMES` to data fields\n    my_alignments: AnnotationList[Alignment] = annotation_field(named_targets={"text1": "text_a", "text2": "text_b"})\n```\n\nNote that `text1` and `text2` can also target the same field.\n\n</details>\n<details>\n<summary>\n\n##### (De-)Serialization of Annotations\n\n</summary>\n\nAs usual for dataclasses, annotations can be converted to json like objects with `.asdict()`. However, they can be\nalso created with `MyAnnotation.fromdict(dct, annotation_store)`. Both methods are required because documents and\ntheir annotations are created on the fly when working with PIE datasets (see below).\n\nSometimes, it is required to overwrite both methods. This is the case when targeting another annotation field. Consider\nthe following example where `head` and `tail` are entries from another annotation field:\n\n```python\n@dataclass(eq=True, frozen=True)\nclass BinaryRelation(Annotation):\n    head: Span\n    tail: Span\n    label: str\n\n    def asdict(self) -> Dict[str, Any]:\n        # Convert the annotations to their ids.\n        # We use the _asdicts() method with overrides to avoid converting the original\n        # entries to dicts in the first place (this can slow down the preprocessing a lot).\n        dct = self._asdict(overrides={"head": self.head._id, "tail": self.tail._id})\n        return dct\n\n    @classmethod\n    def fromdict(\n        cls,\n        dct: Dict[str, Any],\n        annotation_store: Optional[Dict[int, Annotation]] = None,\n    ):\n        # copy to not modify the input\n        tmp_dct = dict(dct)\n        # get the annotations by their ids\n        tmp_dct["head"] = resolve_annotation(tmp_dct["head"], store=annotation_store)\n        tmp_dct["tail"] = resolve_annotation(tmp_dct["tail"], store=annotation_store)\n        return super().fromdict(tmp_dct, annotation_store)\n```\n\nHere it is necessary to replace the referenced `Span` annotations with their ids during serialization because\nwe save them already in the respective annotation field. Thus, we also have to replace the ids with the actual\nannotations during construction. This can be easily done with the helper method\n`resolve_annotation(id_or_annotation, store)`.\n\n</details>\n</details>\n<details>\n<summary>\n\n### 🔤 ⇔ 🔢 Taskmodule\n\n</summary>\n\nThe taskmodule is responsible for converting documents to model inputs and back. For that purpose, it requires the\nuser to implement the following methods:\n\n-   `encode_input`: Taking one document, create one or multiple `TaskEncoding`s. A `TaskEncoding` represents an\n    example that will be passed to the model later on. It is a container holding `inputs`, optional `targets`, the\n    original `document`, and `metadata`. Note that `encode_input` should not assign a value to `targets`.\n-   `encode_target`: This gets a single `TaskEncoding` and should produce a target encoding that will be assigned\n    to `targets` later on. As such, it is called only during training / evaluation, but not for inference. Note that,\n    this is allowed to return None. In this case, the respective `TaskEncoding` will not be passed to the model at all.\n-   `collate`: Taking a batch of `TaskEncoding`s, this should produce a batch input for the model. Note that this has to\n    work with available targets (training and evaluation) and without them (inference).\n-   `unbatch_output`: This gets a batch output from the model and should rearrange that into a sequence of `TaskOutput`s.\n    In that means it can be understood as the opposite to `collate`. The number of `TaskOutput`s should match the\n    number of `TaskEncoding`s that got into the batch because we align them later on for easy creation of new annotations.\n-   `create_annotations_from_output`: This gets a single `TaskEncoding` with its corresponding `TaskOutput` and\n    should yield tuples each consisting of an annotation field name and an annotation. The annotations will be added\n    as predictions to the annotation field with the respective name.\n-   `prepare` (OPTIONAL): This will get the train dataset, i.e. a Sequence or Iterable of Documents, and can be used\n    to calculate additional parameters like the list of all available labels, etc.\n\nYou can find some predefined taskmodules for _text-_ and _token classification_, _text classification based relation\nextraction_, _joint entity and relation classification_ and other use cases in the package\n[`pytorch_ie.taskmodules`](src/pytorch_ie/taskmodules). Especially, have a look at the\n[SimpleTransformerTextClassificationTaskModule](src/pytorch_ie/taskmodules/simple_transformer_text_classification.py)\nthat is well documented and should provide a good starting point to implement your own one.\n\n</details>\n<details>\n<summary>\n\n### 🧮 Model\n\n</summary>\n\nPyTorch-IE models are meant to do the heavy lifting training and inference. They are\n[Pytorch-Lightning modules](https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html),\nenhanced with some functionality to ease persisting them, see [Reusability and Sharing](#reusability-and-sharing).\n\nYou can find some predefined models for transformer based _text-_ and _token classification_, _sequence generation_,\nand other use cases in the package [`pytorch_ie.models`](src/pytorch_ie/models).\n\n</details>\n\n### Reusability and Sharing\n\nTaskmodules and Models provide some functionality to ease reusability and reproducibility. Especially, they provide\nthe methods `save_pretrained()` and `from_pretrained()` that can be used to save their specification, i.e. their\n**config**, and available model wights to disc and exactly re-create them again from that data.\n\n<details>\n<summary>\n\n#### Huggingface Hub and Extended Configs\n\n</summary>\n\nThese methods come along\nwith integration to the [Huggingface Hub](https://huggingface.co/docs/hub/index). By passing `push_to_hub=True` to\n`save_pretrained()`, the taskmodule / model is directly pushed to the Hub and can be loaded again with the respective\nidentifier (see the [Examples](examples) for how to do so). However, to work properly, each taskmodule / model has to\ncorrectly implement the `_config()` getter method. Per default, it returns all parameters passed to the `__init__`\nmethod if this calls `save_hyperparameters()` which is very recommended. But you may have created some further\nparameters that should be persisted, for instance a label-to-id mapping. In this case, `_config()` should be\noverwritten to take this into account:\n\n```python\ndef _config(self) -> Dict[str, Any]:\n    # add the label-to-id mapping to the config\n    config = super()._config()\n    config["label_to_id"] = self.label_to_id\n    return config\n```\n\nFurthermore, you can use the property `is_from_pretrained` to know if the taskmodule / model is just loaded or created\nfrom scratch. This may be useful, for instance, to avoid downloading a model from Huggingface Transformers when you\nin fact want to load your own trained model from disc via `from_pretrained`:\n\n```python\nfrom transformers import AutoConfig, AutoModel\n\nhf_config = AutoConfig.from_pretrained(model_name_or_path)\n# If this is already trained, just create an empty transformer model. The weights are loaded afterwards\n# via the pytorch_ie.Model.from_pretrained() logic.\nif self.is_from_pretrained:\n    self.model = AutoModel.from_config(config=hf_config)\n# Otherwise, download the whole model from the Huggingface Hub.\nelse:\n    self.model = AutoModel.from_pretrained(model_name_or_path, config=hf_config)\n```\n\n</details>\n\nIn short, each taskmodule / model implementation should:\n\n-   call `save_hyperparameters()` in `__init__` to save all constructor arguments,\n-   pass remaining `__init__` kwargs (keyword arguments) to its super to not break some other helpful functionality\n    (e.g. `is_from_pretrained`), and\n-   overwrite `_config()` if additional parameters are calculated, e.g. from the dataset.\n\n## ⚡️ Examples: Prediction\n\n**The following examples work out of the box. No further setup like manually downloading a model is needed!**\n\n**Note:** Setting `num_workers=0` in the pipeline is only necessary when running an example in an\ninteractive python session. The reason is that multiprocessing doesn\'t play well with the interactive python\ninterpreter, see [here](https://docs.python.org/3/library/multiprocessing.html#using-a-pool-of-workers)\nfor details.\n\n### Span-classification-based Named Entity Recognition\n\n```python\nfrom dataclasses import dataclass\n\nfrom pytorch_ie.annotations import LabeledSpan\nfrom pytorch_ie.auto import AutoPipeline\nfrom pytorch_ie.core import AnnotationList, annotation_field\nfrom pytorch_ie.documents import TextDocument\n\n@dataclass\nclass ExampleDocument(TextDocument):\n    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")\n\ndocument = ExampleDocument(\n    "“Making a super tasty alt-chicken wing is only half of it,” said Po Bronson, general partner at SOSV and managing director of IndieBio."\n)\n\n# see below for the long version\nner_pipeline = AutoPipeline.from_pretrained("pie/example-ner-spanclf-conll03", device=-1, num_workers=0)\n\nner_pipeline(document)\n\nfor entity in document.entities.predictions:\n    print(f"{entity} -> {entity.label}")\n\n# Result:\n# IndieBio -> ORG\n# Po Bronson -> PER\n# SOSV -> ORG\n```\n\n<details>\n<summary>\nTo create the same pipeline as above without `AutoPipeline`\n</summary>\n\n```python\nfrom pytorch_ie.auto import AutoTaskModule, AutoModel\nfrom pytorch_ie.pipeline import Pipeline\n\nmodel_name_or_path = "pie/example-ner-spanclf-conll03"\nner_taskmodule = AutoTaskModule.from_pretrained(model_name_or_path)\nner_model = AutoModel.from_pretrained(model_name_or_path)\nner_pipeline = Pipeline(model=ner_model, taskmodule=ner_taskmodule, device=-1, num_workers=0)\n```\n\n</details>\n\n<details>\n<summary>\nOr, without `Auto` classes at all\n</summary>\n\n```python\nfrom pytorch_ie.pipeline import Pipeline\nfrom pytorch_ie.models import TransformerSpanClassificationModel\nfrom pytorch_ie.taskmodules import TransformerSpanClassificationTaskModule\n\nmodel_name_or_path = "pie/example-ner-spanclf-conll03"\nner_taskmodule = TransformerSpanClassificationTaskModule.from_pretrained(model_name_or_path)\nner_model = TransformerSpanClassificationModel.from_pretrained(model_name_or_path)\nner_pipeline = Pipeline(model=ner_model, taskmodule=ner_taskmodule, device=-1, num_workers=0)\n```\n\n</details>\n<details>\n<summary>\n\n### Text-classification-based Relation Extraction\n\n</summary>\n\n```python\nfrom dataclasses import dataclass\n\nfrom pytorch_ie.annotations import BinaryRelation, LabeledSpan\nfrom pytorch_ie.auto import AutoPipeline\nfrom pytorch_ie.core import AnnotationList, annotation_field\nfrom pytorch_ie.documents import TextDocument\n\n\n@dataclass\nclass ExampleDocument(TextDocument):\n    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")\n    relations: AnnotationList[BinaryRelation] = annotation_field(target="entities")\n\ndocument = ExampleDocument(\n    "“Making a super tasty alt-chicken wing is only half of it,” said Po Bronson, general partner at SOSV and managing director of IndieBio."\n)\n\nre_pipeline = AutoPipeline.from_pretrained("pie/example-re-textclf-tacred", device=-1, num_workers=0)\n\nfor start, end, label in [(65, 75, "PER"), (96, 100, "ORG"), (126, 134, "ORG")]:\n    document.entities.append(LabeledSpan(start=start, end=end, label=label))\n\nre_pipeline(document, batch_size=2)\n\nfor relation in document.relations.predictions:\n    print(f"({relation.head} -> {relation.tail}) -> {relation.label}")\n\n# Result:\n# (Po Bronson -> SOSV) -> per:employee_of\n# (Po Bronson -> IndieBio) -> per:employee_of\n# (SOSV -> Po Bronson) -> org:top_members/employees\n# (IndieBio -> Po Bronson) -> org:top_members/employees\n```\n\n</details>\n\n## ⚡️ Examples: Training\n\n<details>\n\n<summary>\n\n### Span-classification-based Named Entity Recognition\n\n</summary>\n\n```python\nimport pytorch_lightning as pl\nfrom pytorch_lightning.callbacks import ModelCheckpoint\nfrom torch.utils.data import DataLoader\n\nimport datasets\nfrom pytorch_ie.models.transformer_span_classification import TransformerSpanClassificationModel\nfrom pytorch_ie.taskmodules.transformer_span_classification import (\n    TransformerSpanClassificationTaskModule,\n)\n\npl.seed_everything(42)\n\nmodel_output_path = "./model_output/"\nmodel_name = "bert-base-cased"\nnum_epochs = 10\nbatch_size = 32\n\n# Get the PIE dataset consisting of PIE Documents that will be used for training (and evaluation).\ndataset = datasets.load_dataset(\n    path="pie/conll2003",\n)\ntrain_docs, val_docs = dataset["train"], dataset["validation"]\n\nprint("train docs: ", len(train_docs))\nprint("val docs: ", len(val_docs))\n\n# Create a PIE taskmodule.\ntask_module = TransformerSpanClassificationTaskModule(\n    tokenizer_name_or_path=model_name,\n    max_length=128,\n)\n\n# Prepare the taskmodule with the training data. This may collect available labels etc.\n# The result of this should affect the state of the taskmodule config which will be\n# persisted (and can be loaded) later on.\ntask_module.prepare(train_docs)\n\n# Persist the taskmodule. This writes the taskmodule config as a json file into the\n# model_output_path directory. The config contains all constructor parameters to\n# re-create the taskmodule at this state (via AutoTaskmodule.from_pretrained(model_output_path)).\ntask_module.save_pretrained(model_output_path)\n\n# Use the taskmodule to encode the train and dev sets. This may use the text and\n# available annotations of the documents.\ntrain_dataset = task_module.encode(train_docs, encode_target=True, as_dataset=True)\nval_dataset = task_module.encode(val_docs, encode_target=True, as_dataset=True)\n\n# Create the dataloaders. Note that the taskmodule provides the collate function!\ntrain_dataloader = DataLoader(\n    train_dataset,\n    batch_size=batch_size,\n    shuffle=True,\n    collate_fn=task_module.collate,\n)\n\nval_dataloader = DataLoader(\n    val_dataset,\n    batch_size=batch_size,\n    shuffle=False,\n    collate_fn=task_module.collate,\n)\n\n# Create the PIE model. Note that we use the number of entries in the previously\n# collected label_to_id mapping to set the number of classes to predict.\nmodel = TransformerSpanClassificationModel(\n    model_name_or_path=model_name,\n    num_classes=len(task_module.label_to_id),\n    t_total=len(train_dataloader) * num_epochs,\n    learning_rate=1e-4,\n)\n\n# Optionally, set up a model checkpoint callback. See here for further information:\n# https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.ModelCheckpoint.html\n# checkpoint_callback = ModelCheckpoint(\n#     monitor="val/f1",\n#     dirpath=model_output_path,\n#     filename="zs-ner-{epoch:02d}-val_f1-{val/f1:.2f}",\n#     save_top_k=1,\n#     mode="max",\n#     auto_insert_metric_name=False,\n#     save_weights_only=True,\n# )\n\n# Create the pytorch-lightning trainer. See here for further information:\n# https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.trainer.trainer.Trainer.html\ntrainer = pl.Trainer(\n    fast_dev_run=False,\n    max_epochs=num_epochs,\n    gpus=0,\n    checkpoint_callback=False,\n    # callbacks=[checkpoint_callback],\n    precision=32,\n)\n# Start the training.\ntrainer.fit(model, train_dataloader, val_dataloader)\n\n# Persist the trained model. This will save the model weights and the model config that allows\n# to re-create the model at this state (via AutoModel.from_pretrained(model_output_path)).\n# model.save_pretrained(model_output_path)\n```\n\n</details>\n\n## 📚 Datasets\n\nWe parse all datasets into a common format that can be loaded directly from the model hub via Huggingface datasets. The documents are cached in an arrow table and serialized / deserialized on the fly. Any changes or preprocessing applied to the documents will be cached as well.\n\n```python\nimport datasets\n\ndataset = datasets.load_dataset("pie/conll2003")\n\nprint(dataset["train"][0])\n# >>> CoNLL2003Document(text=\'EU rejects German call to boycott British lamb .\', id=\'0\', metadata={})\n\ndataset["train"][0].entities\n# >>> AnnotationList([LabeledSpan(start=0, end=2, label=\'ORG\', score=1.0), LabeledSpan(start=11, end=17, label=\'MISC\', score=1.0), LabeledSpan(start=34, end=41, label=\'MISC\', score=1.0)])\n\nentity = dataset["train"][0].entities[1]\n\nprint(f"[{entity.start}, {entity.end}] {entity}")\n# >>> [11, 17] German\n```\n\n<details>\n<summary><b>How to create your own Pytorch-IE dataset</b></summary>\n\nPyTorch-IE datasets are built on top of Huggingface datasets. For instance, consider the\n[conll2003 from the Huggingface Hub](https://huggingface.co/datasets/conll2003) and especially their respective\n[dataset loading script](https://huggingface.co/datasets/conll2003/blob/main/conll2003.py). To create a PyTorch-IE\ndataset from that, you have to implement:\n\n1. A Document class. This will be the type of individual dataset examples.\n\n```python\n@dataclass\nclass CoNLL2003Document(TextDocument):\n    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")\n```\n\nHere we derive from `TextDocument` that has a simple `text` string as base annotation target. The `CoNLL2003Document`\nadds one single annotation list called `entities` that consists of `LabeledSpan`s which reference the `text` field of\nthe document. You can add further annotation types by adding `AnnotationList` fields that may also reference (i.e.\n`target`) other annotations as you like. See [\'pytorch_ie.annotations`](src/pytorch_ie/annotations.py) for predefined\nannotation types.\n\n2. A dataset config. This is similar to\n   [creating a Huggingface dataset config](https://huggingface.co/docs/datasets/dataset_script#multiple-configurations).\n\n```python\nclass CoNLL2003Config(datasets.BuilderConfig):\n    """BuilderConfig for CoNLL2003"""\n\n    def __init__(self, **kwargs):\n        """BuilderConfig for CoNLL2003.\n        Args:\n          **kwargs: keyword arguments forwarded to super.\n        """\n        super().__init__(**kwargs)\n```\n\n3. A dataset builder class. This should inherit from\n   [`pytorch_ie.data.builder.GeneratorBasedBuilder`](src/pytorch_ie/data/builder.py) which is a wrapper around the\n   [Huggingface dataset builder class](https://huggingface.co/docs/datasets/v2.4.0/en/package_reference/builder_classes#datasets.GeneratorBasedBuilder)\n   with some utility functionality to work with PyTorch-IE `Documents`. The key elements to implement are: `DOCUMENT_TYPE`,\n   `BASE_DATASET_PATH`, and `_generate_document`.\n\n```python\nclass Conll2003(pytorch_ie.data.builder.GeneratorBasedBuilder):\n    # Specify the document type. This will be the class of individual dataset examples.\n    DOCUMENT_TYPE = CoNLL2003Document\n\n    # The Huggingface identifier that points to the base dataset. This may be any string that works\n    # as path with Huggingface `datasets.load_dataset`.\n    BASE_DATASET_PATH = "conll2003"\n\n    # The builder configs, see https://huggingface.co/docs/datasets/dataset_script for further information.\n    BUILDER_CONFIGS = [\n        CoNLL2003Config(\n            name="conll2003", version=datasets.Version("1.0.0"), description="CoNLL2003 dataset"\n        ),\n    ]\n\n    # [Optional] Define additional keyword arguments which will be passed to `_generate_document` below.\n    def _generate_document_kwargs(self, dataset):\n        return {"int_to_str": dataset.features["ner_tags"].feature.int2str}\n\n    # Define how a Pytorch-IE Document will be created from a Huggingface dataset example.\n    def _generate_document(self, example, int_to_str):\n        doc_id = example["id"]\n        tokens = example["tokens"]\n        ner_tags = [int_to_str(tag) for tag in example["ner_tags"]]\n\n        text, ner_spans = tokens_and_tags_to_text_and_labeled_spans(tokens=tokens, tags=ner_tags)\n\n        document = CoNLL2003Document(text=text, id=doc_id)\n\n        for span in sorted(ner_spans, key=lambda span: span.start):\n            document.entities.append(span)\n\n        return document\n```\n\nThe full script can be found here: [datasets/conll2003/conll2003.py](datasets/conll2003/conll2003.py). Note, that to\nload the dataset with `datasets.load_dataset`, the script has to be located in a directory with the same name (as it\nis the case for standard Huggingface dataset loading scripts).\n\n</details>\n\n<!-- github-only -->\n\n✨📚✨ [Read the full documentation](https://pytorch-ie.readthedocs.io/)\n\n## 🔧 Development Setup\n\n## 🏅 Acknowledgements\n\n-   This package is based on the [sourcery-ai/python-best-practices-cookiecutter](https://github.com/sourcery-ai/python-best-practices-cookiecutter) and [cjolowicz/cookiecutter-hypermodern-python](https://github.com/cjolowicz/cookiecutter-hypermodern-python) project templates.\n\n## 📃 Citation\n\nIf you find the framework useful please consider citing it:\n\n```bibtex\n@misc{alt2022pytorchie,\n    author={Christoph Alt, Arne Binder},\n    title = {PyTorch-IE: State-of-the-art Information Extraction in PyTorch},\n    year = {2022},\n    publisher = {GitHub},\n    journal = {GitHub repository},\n    howpublished = {\\url{https://github.com/ChristophAlt/pytorch-ie}}\n}\n```\n\n[license]: https://github.com/christophalt/pytorch-ie/blob/main/LICENSE\n',
    'author': 'Christoph Alt',
    'author_email': 'christoph.alt@posteo.de',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/christophalt/pytorch-ie',
    'package_dir': package_dir,
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'python_requires': '>=3.9,<4.0',
}


setup(**setup_kwargs)
