Metadata-Version: 2.1
Name: lmproof
Version: 0.2.1
Summary: Language model powered proof reader for correcting contextual errors in natural language.
Home-page: https://github.com/sai-prasanna/lmproof
License: Apache-2.0
Keywords: NLP,language model,Grammatical error correction
Author: sai-prasanna
Author-email: sai.r.prasanna@gmail.com
Requires-Python: >=3.6
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: dataclasses (>=0.7); python_version >= "3.6" and python_version < "3.7"
Requires-Dist: lemminflect (>=0.2.0)
Requires-Dist: spacy (>=2.0)
Requires-Dist: symspellpy (>=6.0)
Requires-Dist: torch (>=1.2.0)
Requires-Dist: transformers (>=2.0)
Project-URL: Repository, https://github.com/sai-prasanna/lmproof
Description-Content-Type: text/markdown

## lmproof - Language Model Proof Reader

Library to do proof-reading corrections for Grammatical Errors, spelling errors, confused word errors and other errors using pre-trained Language Models.

## Usage
Install spacy model by `python -m spacy download en`.
Then try out this snippet.

``` python
import lmproof
proof_reader = lmproof.load("en")
source = "The foxes living on the Shire is brown.'"
corrected = proof_reader.proofread(source) # "The foxes living in the Shire are brown."
```

## How it works?

We use the language model based scoring approach mentioned in [Christopher Bryant and Ted Briscoe. 2018](http://aclweb.org/anthology/W18-0529) with few changes.

Unlike many approaches to GEC, this approach does NOT require annotated training data and mainly depends on a monolingual language model. The program works by iteratively comparing certain words in a text against alternative candidates and applying a correction if one of these candidates is more probable than the original word. These correction candidates are variously generated by a word inflection library or are otherwise defined manually. Currently, this system only corrects:

    Non-words (e.g. freind and informations)
    Morphology (e.g. eat, ate, eaten, eating, etc.)
    Common Determiners and Prepositions (e.g. the, a, in, at, to, etc.)
    Commonly Confused Words (e.g. bear/bare, lose/loose, etc.)

This work builds upon https://github.com/chrisjbryant/lmgec-lite/

## Components

### Language Models
* [transformers](https://github.com/huggingface/transformers)
### Inflection generators
* [LemmInflect](https://github.com/bjascob/LemmInflect) is used to lemmatize and generate inflections for candidate proposals to the language model.
### Spell Checker
* [symspellpy](https://github.com/mammothb/symspellpy) is used for obtaining spell check candidates.

The components are highly modularised to facilitate experimentation with newer scorers and support more languages.
Pre-trained language models for other languages, inflectors, common error patterns can be easily added to support more languages.

## TODOs

* Use edits in existing GEC corpus to generate candidates.
* Tests
* Publish benchmarks of the model.
* Think of simple ways to generate insertion candidates.
* Add more languages.

