Metadata-Version: 2.1
Name: gr-nlp-toolkit
Version: 0.0.2
Summary: A Transformer-based Natural Language Processing Pipeline for Greek
Home-page: https://github.com/nlpaueb/gr-nlp-toolkit
Author: nlpaueb
Author-email: p3170148@aueb.gr, p3170039@aueb.gr
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Natural Language :: Greek
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# gr-nlp-toolkit

A Transformer-based Natural Language Processing pipeline for Greek. The toolkit
achieves state-of-the-art accuracy on Greek and offers predictions for
Named Entity Recognition, Part-of-Speech Tagging, Morphological Tagging,
and Dependency Parsing.

## Installation

You can install the toolkit by executing the following in the command line:
```sh
pip install gr-nlp-toolkit
```

## Usage

To use the toolkit, first initialize a `Pipeline`, specifying which processors you need. Each processor
annotates the text with the annotations of a specific task.

- To obtain Part-of-Speech and Morphological Tagging annotations, add the `pos` processor
- To obtain Named Entity Recognition annotations, add the `ner` processor
- To obtain Dependency Parsing annotations, add the `dp` processor

```python
from gr_nlp_toolkit import Pipeline
nlp = Pipeline("pos,ner,dp") # Use ner,pos,dp processors
# nlp = Pipeline("ner,dp") # Use only ner and dp processors
```

The first time you use a processor, that processor's data files are downloaded and cached in the `.cache`
folder of your home directory, so you will not have to download them again.

## Generating the annotations

After creating the pipeline, you can annotate a text by calling the pipeline object directly, which invokes its `__call__` method.

```python
doc = nlp('Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro το 2021')
```
A `Document` object is then created and annotated. The original text is split into tokens.

## Accessing the annotations

The following code shows how you can access the annotations generated by the toolkit.

```python
for token in doc.tokens:
  token.text # the text of the token
  
  token.ner # the named entity label in IOBES encoding : str
  
  token.upos # the UPOS tag of the token
  token.feats # the morphological features for the token
  
  token.head # the head of the token
  token.deprel # the dependency relation between the current token and its head
```

`token.ner` is set by the `ner` processor, `token.upos` and `token.feats` are set by the `pos` processor,
and `token.head` and `token.deprel` are set by the `dp` processor.
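
Because `token.ner` holds one IOBES label per token, you may want to merge consecutive tokens into whole entities. The following is a minimal sketch of one way to do that, not part of the toolkit. It assumes labels of the form `PREFIX-TYPE` (for example `B-LOC`, `E-LOC`, `S-ORG`) with `O` for tokens outside any entity. The helper `iobes_to_spans` and the labels in `example` are hypothetical and hand-written for illustration; they are not actual toolkit output.

```python
from typing import List, Optional, Tuple


def iobes_to_spans(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (text, label) pairs into (entity_type, entity_text) spans.

    Assumes IOBES labels written as "B-TYPE", "I-TYPE", "E-TYPE", "S-TYPE" or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for text, label in pairs:
        prefix, _, entity_type = label.partition("-")
        if prefix == "S":
            # A single-token entity is complete on its own.
            spans.append((entity_type, text))
            current_words, current_type = [], None
        elif prefix == "B":
            # Start collecting a multi-token entity.
            current_words, current_type = [text], entity_type
        elif prefix in ("I", "E") and current_type == entity_type:
            current_words.append(text)
            if prefix == "E":
                # The end tag closes the entity.
                spans.append((entity_type, " ".join(current_words)))
                current_words, current_type = [], None
        else:
            # "O" or an inconsistent tag: drop any partial entity.
            current_words, current_type = [], None
    return spans


# Hand-written (text, label) pairs; with the toolkit you would build them as
# [(token.text, token.ner) for token in doc.tokens].
example = [
    ("Η", "O"),
    ("Ιταλία", "S-LOC"),
    ("κέρδισε", "O"),
    ("στο", "O"),
    ("Νέο", "B-LOC"),
    ("Γουέμπλεϊ", "E-LOC"),
]
spans = iobes_to_spans(example)
```

With real annotations, the label names depend on the tag set used by the `ner` processor, so check the values of `token.ner` before relying on specific entity types.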

Note that to get the `Token` object that is the head of another token, you need to access
`doc.tokens[head-1]`. This is because token numbering starts from 1; when `token.head` is 0,
the token is the root of the dependency tree.
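
The head lookup can be wrapped in a small helper that handles the root case explicitly. This matters because, in Python, `doc.tokens[0 - 1]` would silently return the last token instead of failing. The sketch below is illustrative only: `Token` and `Document` here are simplified stand-ins defined so the example runs without downloading models, `head_token` is a hypothetical helper, and the `head` and `deprel` values are hand-written rather than produced by the toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Simplified stand-in exposing only the fields used below."""
    text: str
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str


@dataclass
class Document:
    """Simplified stand-in holding the list of tokens."""
    tokens: List[Token]


def head_token(doc: Document, token: Token) -> Optional[Token]:
    """Return the head Token of `token`, or None if `token` is the root."""
    if token.head == 0:
        return None
    # Heads are numbered from 1, Python lists from 0.
    return doc.tokens[token.head - 1]


# Hand-written example values for illustration.
doc = Document(tokens=[
    Token("Η", 2, "det"),
    Token("Ιταλία", 3, "nsubj"),
    Token("κέρδισε", 0, "root"),
])

for token in doc.tokens:
    head = head_token(doc, token)
    print(token.text, token.deprel, head.text if head is not None else "ROOT")
```

The same helper should work on a real `Document` returned by the pipeline, provided the `dp` processor was included so that `token.head` is set.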


