Metadata-Version: 2.1
Name: ua-gec
Version: 1.0.0
Summary: UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian language
Home-page: https://github.com/grammarly/ua-gec
Author: Oleksiy Syvokon
Author-email: oleksiy.syvokon@gmail.com
License: License :: OSI Approved :: CC-BY-4.0
Description: # UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
        
        This package contains the UA-GEC data and code to work with it.
        
        
        ## Python library
        
        The `ua_gec` Python package bundles the corpus data with the code for working with it.
        
        ### Getting started
        
        The simplest way to install the package is with `pip`:
        
        ```
            $ pip install ua_gec==1.0
        ```
        
        Alternatively, you can install it from source:
        
        ```
            $ cd python
            $ python setup.py develop
        ```
        
        
        ### Iterating through corpus
        
        Once installed, you may get annotated documents from Python code:
        
        ```python
            
            >>> from ua_gec import Corpus
            >>> corpus = Corpus(partition="train")
            >>> for doc in corpus:
            ...     print(doc.source)
            ...     print(doc.target)
            ...     print(doc.annotated)
            ...     print(doc.meta.region)
        ```
        
        
        ### Working with annotations
        
        [The docs are under construction]
        
        
        ## Train-test split
        
        We expect users of the corpus to train and tune their models on the train split
        only (of course, you are free to further split it into train-dev or use
        cross-validation). Use the test split for reporting scores of your final
        models.  Never optimize on the test set. Do not tune hyperparameters on it. And
        please, do not use it for model selection in any way.
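        
        The train-dev split mentioned above could be done along these lines. This is
        only a sketch: `split_train_dev`, `dev_fraction`, and `seed` are hypothetical
        names introduced for illustration and are not part of the `ua_gec` package.
        It assumes nothing about the documents except that they can be collected
        into a list.
        
        ```python
        import random
        
        
        def split_train_dev(docs, dev_fraction=0.1, seed=42):
            """Hold out a fraction of the documents as a dev set.
        
            Hypothetical helper, not part of ua_gec. Documents are shuffled
            with a fixed seed so that the split is reproducible.
            """
            docs = list(docs)
            rng = random.Random(seed)
            rng.shuffle(docs)
            # Keep at least one document in the dev set.
            n_dev = max(1, int(len(docs) * dev_fraction))
            return docs[n_dev:], docs[:n_dev]
        
        
        # Possible usage with the train partition (requires ua_gec to be installed):
        #   from ua_gec import Corpus
        #   train_docs, dev_docs = split_train_dev(Corpus(partition="train"))
        ```
        
        Splitting by whole documents keeps all sentences of a text on the same side
        of the split. The test partition is never touched by this helper.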
        
        See the [Statistics](#statistics) section for per-split statistics.
        
        
        ## Annotation format
        
        Annotated files are text files that use the following in-text annotation format:
        `{error=>edit:::error_type=Tag}`, where `error` and `edit` stand for the text item before
        and after correction, respectively, and `Tag` denotes an error category
        (`Grammar`, `Spelling`, `Punctuation`, or `Fluency`).
        
        Example of an annotated text:
        ```
            I {likes=>like:::error_type=Grammar} turtles.
        ```
        
        An accompanying Python package, `ua_gec`, provides many tools for working with
        annotated texts. See its documentation for details.
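        
        To make the format concrete, here is a minimal sketch that reads it with a
        regular expression. It is not the `ua_gec` implementation: `ANNOTATION_RE`,
        `to_source`, `to_target`, and `list_edits` are hypothetical names, and the
        sketch assumes that annotations are not nested and that tags consist of
        word characters only.
        
        ```python
        import re
        
        # One in-text annotation: {error=>edit:::error_type=Tag}
        # Assumption: annotations are not nested, so braces never occur inside one.
        ANNOTATION_RE = re.compile(
            r"\{(?P<error>[^{}]*?)=>(?P<edit>[^{}]*?):::error_type=(?P<tag>\w+)\}"
        )
        
        
        def to_source(annotated):
            """Return the text before correction (keep the `error` side)."""
            return ANNOTATION_RE.sub(lambda m: m.group("error"), annotated)
        
        
        def to_target(annotated):
            """Return the text after correction (keep the `edit` side)."""
            return ANNOTATION_RE.sub(lambda m: m.group("edit"), annotated)
        
        
        def list_edits(annotated):
            """Return (error, edit, tag) tuples in order of appearance."""
            return [
                (m.group("error"), m.group("edit"), m.group("tag"))
                for m in ANNOTATION_RE.finditer(annotated)
            ]
        
        
        text = "She {go=>goes:::error_type=Grammar} home."
        print(to_source(text))   # She go home.
        print(to_target(text))   # She goes home.
        print(list_edits(text))  # [('go', 'goes', 'Grammar')]
        ```
        
        For real work, prefer the tools shipped with the `ua_gec` package, which may
        handle cases this sketch ignores.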
        
        
        ## Statistics
        
        UA-GEC contains:
        
        | Split | Documents | Sentences |  Tokens | Authors |
        |:-----:|:---------:|----------:|--------:|:-------:|
        | train | 851       | 18,225    | 285,247 | 416     |
        |  test | 160       | 2,490     | 43,432  | 76      |
        | TOTAL | 1,011     | 20,715    | 328,779 | 492     |
        
        The corpus statistics can be generated by running a script from the Python
        package (note that the `ua-gec` package must be installed first):
        
        ```
        $ python ./python/ua_gec/stats.py
        ```
        
        
        ## Contributing
        
        * The data collection is an ongoing activity. You can always contribute
          your Ukrainian writings or complete one of the writing tasks at
          https://ua-gec-dataset.grammarly.ai/
        
        * Code improvements and documentation are welcome. Please open a pull request.
        
        
        ## Contacts
        
        * oleksiy.syvokon@grammarly.com
        
Keywords: gec ukrainian dataset corpus grammatical error correction grammarly
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: Ukrainian
Classifier: License :: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Provides-Extra: test
