# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['seqal']

package_data = \
{'': ['*']}

install_requires = \
['flair==0.8', 'torch==1.7.1']

setup_kwargs = {
    'name': 'seqal',
    'version': '0.2.2',
    'description': 'Sequence labeling active learning framework for Python',
    'long_description': '# SeqAL\n\n<!-- <p align="center">\n  <a href="https://github.com/BrambleXu/seqal/actions?query=workflow%3ACI">\n    <img src="https://img.shields.io/github/workflow/status/BrambleXu/seqal/CI/main?label=CI&logo=github&style=flat-square" alt="CI Status" >\n  </a>\n  <a href="https://seqal.readthedocs.io">\n    <img src="https://img.shields.io/readthedocs/seqal.svg?logo=read-the-docs&logoColor=fff&style=flat-square" alt="Documentation Status">\n  </a>\n  <a href="https://codecov.io/gh/BrambleXu/seqal">\n    <img src="https://img.shields.io/codecov/c/github/BrambleXu/seqal.svg?logo=codecov&logoColor=fff&style=flat-square" alt="Test coverage percentage">\n  </a>\n</p> -->\n<p align="center">\n  <a href="https://github.com/BrambleXu/seqal/actions?query=workflow%3ACI">\n    <img src="https://img.shields.io/github/workflow/status/BrambleXu/seqal/CI/main?label=CI&logo=github&style=flat-square" alt="CI Status" >\n  </a>\n  <a href="https://python-poetry.org/">\n    <img src="https://img.shields.io/badge/packaging-poetry-299bd7?style=flat-square&logo=data:image/png" alt="Poetry">\n  </a>\n  <a href="https://github.com/ambv/black">\n    <img src="https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square" alt="black">\n  </a>\n  <a href="https://github.com/pre-commit/pre-commit">\n    <img src="https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white&style=flat-square" alt="pre-commit">\n  </a>\n</p>\n<p align="center">\n  <a href="https://pypi.org/project/seqal/">\n    <img src="https://img.shields.io/pypi/v/seqal.svg?logo=python&logoColor=fff&style=flat-square" alt="PyPI Version">\n  </a>\n  <img src="https://img.shields.io/pypi/pyversions/seqal.svg?style=flat-square&logo=python&amp;logoColor=fff" alt="Supported Python versions">\n  <img src="https://img.shields.io/pypi/l/seqal.svg?style=flat-square" alt="License">\n</p>\n\nSeqAL is a sequence labeling active learning framework based on Flair.\n\n## Installation\n\nInstall this via pip (or your favourite package manager):\n\n`pip install seqal`\n\n\n## Usage\n\n### Prepare data\n\nThe tagging scheme is the IOB scheme.\n\n```\n    U.N. NNP I-ORG\nofficial NN  O\n   Ekeus NNP I-PER\n   heads VBZ O\n     for IN  O\n Baghdad NNP I-LOC\n       . .   O\n```\n\nEach line contains four fields: the word, its partof-speech tag and its named entity tag. Words tagged with O are outside of named entities. \n\n### Examples\n\nBecause SeqAL is based on flair, we heavily recommend to read the [tutorial](https://github.com/flairNLP/flair/blob/5c4231b30865bf4426ba8076eb91492d329c8a9b/resources/docs/TUTORIAL_1_BASICS.md) of flair first. \n\n```python\nimport json\n\nfrom flair.embeddings import StackedEmbeddings, WordEmbeddings\n\nfrom seqal.active_learner import ActiveLearner\nfrom seqal.datasets import ColumnCorpus, ColumnDataset\nfrom seqal.query_strategies import mnlp_sampling\n\n# 1. get the corpus\ncolumns = {0: "text", 1: "pos", 2: "ner"}\ndata_folder = "../conll"\ncorpus = ColumnCorpus(\n    data_folder,\n    columns,\n    train_file="seed.data",\n    dev_file="dev.data",\n    test_file="test.data",\n)\n```\n\nFirst we need to create the corpus. `date_folder` is the directry path where we store datasets. `seed.data` contains NER labels, which usually just a small part of data (around 2% of total train data). `dev.data` and `test.data` should contains NER labels for evaluation. All three kinds of data should follow the IOB scheme. But if you have 4 columns, you can just change `columns` to specify the tag column.\n\n\n```python\n# 2. tagger params\ntagger_params = {}\ntagger_params["tag_type"] = "ner"  # what tag do we want to predict?\ntagger_params["hidden_size"] = 256\nembedding_types = [WordEmbeddings("glove")]\nembeddings = StackedEmbeddings(embeddings=embedding_types)\ntagger_params["embeddings"] = embeddings\n\n# 3. Trainer params\ntrainer_params = {}\ntrainer_params["max_epochs"] = 10\ntrainer_params["mini_batch_size"] = 32\ntrainer_params["learning_rate"] = 0.01\ntrainer_params["train_with_dev"] = True\n\n# 4. initialize learner\nlearner = ActiveLearner(tagger_params, mnlp_sampling, corpus, trainer_params)\n```\n\nThis part is where we set the parameters for sequence tagger and trainer. The above setup can conver most of situations. If you want to add more paramters, I recommend to the read [SequenceTagger](https://github.com/flairNLP/flair/blob/master/flair/models/sequence_tagger_model.py#L68) and [ModelTrainer](https://github.com/flairNLP/flair/blob/master/flair/trainers/trainer.py#L42) in flair.\n\n\n```python\n# 5. initial training\nlearner.fit(save_path="output/init_train")\n```\n\nThe initial training will be trained on the seed data.\n\n```python\n# 6. prepare data pool\npool_columns = {0: "text", 1: "pos"}\npool_file = data_folder + "/pool.data"\ndata_pool = ColumnDataset(pool_file, pool_columns)\nsents = data_pool.sentences\n```\nHere we prepare the unlabeled data pool.\n\n```python\n# 7. query data\nquery_number = 1\nsents, query_samples = learner.query(sents, query_number, token_based=True)\n```\n\nWe can query samples from data pool by the `learner.query()` method. `query_number` means how many sentence we want to query. But if we set `token_based=True`, the `query_number` means how many tokens we want to query. For the sequence labeling task, we usually set `token_based=True`.\n\n`query_samples` is a list that contains queried sentences (the Sentence class in flair). `sents` contains the rest of unqueried sentences.\n\n```\nIn [1]: query_samples[0].to_plain_string()\nOut[1]: \'I love Berlin .\'\n```\n\nWe can get the text by calling `to_plain_strin()` method and put it into the interface for human annotation.\n\n\n```python\n# 8. obtaining labels for "query_samples" by the human\nquery_labels = [\n      {\n        "text": "I love Berlin .",\n        "labels": [{"start_pos": 7, "text": "Berlin", "label": "S-LOC"}]\n      },\n      {\n        "text": "This book is great.",\n        "labels": []\n      }\n]\n\n\nannotated_sents = assign_labels(query_labels)\n```\n`query_labels` is the label information of a sentence after annotation by human. We use such information to create Flair Sentence class by calling `assign_labels()` method.\n\nFor more detail, see [Adding labels to sentences](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_1_BASICS.md#adding-labels-to-sentences)\n\n\n```python\n# 9. retrain model with new labeled data\nlearner.teach(annotated_sents, save_path=f"output/retrain")\n```\n\nFinally, we call `learner.teach()` to retrain the model. The `annotated_sents` will be added to `corpus.train` automatically.\n\nIf you want to run the workflow in a loop, you can take a look at the `examples` folders.\n\n\n## Construct envirement locally\n\nIf you want to make a PR or implement something locally, you can follow bellow instruction to construct the development envirement locally.\n\nFirst we create a environment "seqal" based on the `environment.yml` file.\n\nWe use conda as envirement management tool, so install it first.\n\n```\nconda env create -f environment.yml\n```\n\nThen we activate the environment.\n\n```\nconda activate seqal\n```\n\nInstall poetry for dependency management.\n\n```\ncurl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -\n```\n\nAdd poetry path in your shell configure file (`bashrc`, `zshrc`, etc.)\n```\nexport PATH="$HOME/.poetry/bin:$PATH"\n```\n\nInstalling dependencies from `pyproject.toml`.\n\n```\npoetry install\n```\n\nYou can make development locally now.\n\nIf you want to delete the local envirement, run below command.\n```\nconda remove --name seqal --all\n```\n\n## Performance\n\nSee [performance.md](./docs/source/performance.md) for detail.\n\n\n## Contributors ✨\n\nThanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):\n\n<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->\n<!-- prettier-ignore-start -->\n<!-- markdownlint-disable -->\n<!-- markdownlint-enable -->\n<!-- prettier-ignore-end -->\n<!-- ALL-CONTRIBUTORS-LIST:END -->\n\nThis project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!\n\n## Credits\n\n- [Cookiecutter](https://github.com/audreyr/cookiecutter)\n- [browniebroke/cookiecutter-pypackage](https://github.com/browniebroke/cookiecutter-pypackage)\n- [flairNLP/flair](https://github.com/flairNLP/flair)\n- [modal](https://github.com/modAL-python/modAL)\n',
    'author': 'Xu Liang',
    'author_email': 'liangxu006@gmail.com',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/BrambleXu/seqal',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'python_requires': '>=3.8,<4.0',
}


setup(**setup_kwargs)
