Metadata-Version: 2.1
Name: lambre
Version: 1.1.0
Summary: a tool to measure the grammatical well-formedness of multilingual texts
Home-page: https://github.com/adithya7/lambre
Author: Adithya Pratapa, Antonios Anastasopoulos, Shruti Rijhwani, Aditi Chaudhary, David R. Mortensen, Graham Neubig and Yulia Tsvetkov
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/adithya7/lambre/issues
Keywords: multilingual,text-generation,evaluation
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.6
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# Morphological Evaluation of NLG

L'AMBRE is a tool to measure the grammatical well-formedness of texts generated by NLG systems. It analyzes the dependency parses of the text using morpho-syntactic rules, and returns a well-formedness score. This tool utilizes the Surface Syntactic Universal Dependency ([SUD](https://surfacesyntacticud.github.io)) project both for extracting rules as well as parsing, and is therefore applicable across languages. See our [EMNLP 2021 paper](https://aclanthology.org/2021.emnlp-main.570/) for more details.

## Getting Started

### Install

```bash
python -m pip install .
```

### Try L'AMBRE

For a given input text file, `lambre` computes a morpho-syntactic well-formedness score [0-1]. The following method first downloads the parsers and rule sets for the specified language before computing the document-level score. See the output folder (`out`) for error visualizations.

```python
>>> import lambre
>>> lambre.score("ru", "data/txt/ru.txt") # Russian
# lambre.score("de", "data/txt/de.txt")
# lambre.score("en", "data/txt/en.txt")
```

L'AMBRE can also be used from command line. See `lambre --help` for more options.

```bash
lambre ru data/txt/ru.txt
```

## Morpho-syntactic Rules

`lambre` currently supports two rule sets, `chaudhary-etal-2021` (see [Chaudhary et al., 2020](https://aclanthology.org/2020.emnlp-main.422/), [2021](https://aclanthology.org/2021.emnlp-main.553/)) and `pratapa-etal-2021` (see [Pratapa et al., 2021](https://aclanthology.org/2021.emnlp-main.570)). The former is the default, but the rule set can be specified using `--rule-set` option.

## Visualization Examples

Along with the overall L'AMBRE score, we write the erroneous sentences to the output folder `out/errors`. We provide two visualizations, i) plain text (`errors.txt`), ii) HTML (`errors/*.html`). For plain text visualization, we use the [ipymarkup](https://github.com/natasha/ipymarkup) tool. We use [brat](https://brat.nlplab.org/) and [Universal Dependencies](https://universaldependencies.org/introduction.html#contributors) for HTML visualizations.

Below is a sample run on 1000 example Hindi sentences from the [Samanantar corpus](https://indicnlp.ai4bharat.org/samanantar/).

```python
>>> import lambre
>>> lambre.score("hi", "examples/hi_sents_1k.txt")
generating SUD parse for the input document
storing .conllu file at out/hi_sents_1k.conllu
computing document-level lambre score
lambre score: 0.8821
writing grammatical errors to out/errors
```

A few erroneous sentences from this corpus (as detected by L'AMBRE):

```
Input sentence: संख्या की स्टाफ प्रशिक्षित में हिंदी/वाले जानने हिंदी
                 (Stenography Hindi in trained persons of No.)
                 
                 तीनवर्षतक अग्रनीत किए| जाने के बाद व्यपगत हुए| आरक्षणों की| संख्या
                 (of after forward No. reservations lapsed carrying for 3 years)
```

Below, we show the visualizations of word order related errors for the above two sentences. We also generate separate files for agreement and case marking (see [examples/](examples/) for full HTML outputs).

![errors in word order](examples/wordorder_1.png "word order")

![errors in word order](examples/wordorder_2.png "word order")

## Parser

We provide SUD parsers trained using [Stanza](https://stanfordnlp.github.io/stanza/) toolkit. See section 4 in [our paper](https://aclanthology.org/2021.emnlp-main.570) for more details.

## Supported Languages

We currently support the following languages. `lambre` automatically downloads the necessary language-specific resources (when available).

| Language     | Code | Language     | Code | Language     | Code | Language     | Code |
| --------     | ---- | --------     | ---- | --------     | ---- | --------     | ---- |
| Catalan      | ca   | Spanish      | es   | Italian      | it   | Russian      | ru   |
| Czech        | cs   | Estonian     | et   | Latvian      | lv   | Slovenian    | sl   |
| Danish       | da   | Persian      | fa   | Dutch        | nl   | Swedish      | sv   |
| German       | de   | French       | fr   | Polish       | pl   | Ukrainian    | uk   |
| Greek        | el   | Hindi        | hi   | Portuguese   | pt   | Urdu         | ur   |
| English      | en   | Indonesian   | id   | Romanian     | ro   |              |      |

To manually download rules or parsers for a given language,

```python
>>> import lambre
>>> lambre.download("ru") # Russian
```

## Reference

If you find this toolkit helpful in your research, consider citing our paper,

```bib
@inproceedings{pratapa-etal-2021-evaluating,
    title = "Evaluating the Morphosyntactic Well-formedness of Generated Texts",
    author = "Pratapa, Adithya  and
      Anastasopoulos, Antonios  and
      Rijhwani, Shruti  and
      Chaudhary, Aditi  and
      Mortensen, David R.  and
      Neubig, Graham  and
      Tsvetkov, Yulia",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.570",
    pages = "7131--7150",
}
```

We also encourage you to cite the original works for the `chaudhary-etal-2021` ruleset, [Chaudhary et al., 2020](https://aclanthology.org/2020.emnlp-main.422/) and [Chaudhary et al., 2021](https://aclanthology.org/2021.emnlp-main.553/).

## License

L'AMBRE is available under MIT License. The code for training parsers is adapted from [stanza](https://github.com/stanfordnlp/stanza), which is available under Apache License, Version 2.0.

## Issues

For any issues, questions or requests, please use the [Github Issue Tracker](https://github.com/adithya7/lambre/issues).


