Metadata-Version: 2.1
Name: daluke
Version: 0.0.5
Summary: A Danish-speaking language model with entity-aware self-attention
Home-page: https://github.com/peleiden/daLUKE
Author: Søren Winkel Holm, Asger Laurits Schultz
Author-email: s18911@dtu.dk, s183912@dtu.dk
License: MIT License
Download-URL: https://pypi.org/project/daluke/
Description: # DaLUKE: The Entity-aware, Danish Language Model
        
        <img src="https://raw.githubusercontent.com/peleiden/daluke/master/daluke-mascot.png" align="right"/>
        
        [![pytest](https://github.com/peleiden/daLUKE/actions/workflows/pytest.yml/badge.svg?branch=master)](https://github.com/peleiden/daLUKE/actions/workflows/pytest.yml)
        
        Implementation of the knowledge-enhanced transformer [LUKE](https://github.com/studio-ousia/luke) pretrained on the Danish Wikipedia and evaluated on named entity recognition (NER).
        
        ## Installation
        
        ```
        pip install daluke
        ```
        To include the optional requirements needed for training and general analysis:
        ```
        pip install daluke[full]
        ```
        Python 3.8 or newer is required.
        
        ## Explanation
        For an explanation of the model, see our [bachelor's thesis](https://peleiden.github.io/bug-free-guacamole/main.pdf) or the original [LUKE paper](https://www.aclweb.org/anthology/2020.emnlp-main.523/).
        
        ## Usage
        ### Inference on simple NER or masked language modeling (MLM) examples
        
        #### Python
        To perform NER predictions:
        ```py
        from daluke import AutoNERDaLUKE, predict_ner
        
        daluke = AutoNERDaLUKE()
        
        document = "Det Kgl. Bibliotek forvalter Danmarks største tekstsamling, der strækker sig fra middelalderen til det nyeste litteratur."
        iob_list = predict_ner(document, daluke)
        ```
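        
        The name `iob_list` suggests that `predict_ner` returns token-level IOB tags. As a minimal sketch, assuming one tag per whitespace-separated token (an assumption, not confirmed here), the hypothetical helper `group_iob_entities` below merges such tags into entity spans. The tags in the example are hand-written for illustration and are not model output.
        
        ```py
        from typing import List, Optional, Tuple
        
        
        def group_iob_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
            """Merge token-level IOB tags into (entity text, entity type) pairs."""
            entities: List[Tuple[str, Optional[str]]] = []
            current_tokens: List[str] = []
            current_type: Optional[str] = None
            for token, tag in zip(tokens, tags):
                prefix, _, entity_type = tag.partition("-")
                # A "B" tag, or an "I" tag of a different type, starts a new entity
                starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
                if prefix == "O" or starts_new:
                    # Close the entity that was being built, if any
                    if current_tokens:
                        entities.append((" ".join(current_tokens), current_type))
                    current_tokens, current_type = [], None
                if prefix in ("B", "I"):
                    current_tokens.append(token)
                    current_type = entity_type
            # Close an entity that runs to the end of the text
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            return entities
        
        
        # Hand-written example tags, not actual DaLUKE output
        tokens = ["Det", "Kgl.", "Bibliotek", "forvalter", "Danmarks"]
        tags = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
        print(group_iob_entities(tokens, tags))
        # [('Det Kgl. Bibliotek', 'ORG'), ('Danmarks', 'LOC')]
        ```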
        
        To test MLM predictions:
        ```py
        from daluke import AutoMLMDaLUKE, predict_mlm
        
        daluke = AutoMLMDaLUKE()
        document = "Professor i astrofysik, [MASK] [MASK], udtaler til avisen, at den nye måling sandsynligvis ikke er en fejl."
        # Empty list => No entity annotations in the string
        best_prediction, table = predict_mlm(document, list(), daluke)
        ```
        
        #### CLI
        ```bash
        daluke ner --text "Thomas Delaney fører Danmark til sejr ved EM i fodbold."
        daluke masked --text "Slutresultatet af kampen mellem Danmark og Rusland bliver [MASK]-[MASK]."
        ```
        On Windows, or on systems where `#!/usr/bin/env python3` does not resolve to the correct Python interpreter, use the command `python -m daluke.api.cli` instead of `daluke`.
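        
        When calling the CLI from a Python script, the choice between the two entry points can be wrapped in a small helper. The following is a minimal sketch: `daluke_command` is a hypothetical name, not part of the package, and it only uses the subcommands and the `--text` flag shown above. Building the command as a list avoids shell quoting problems with the input text.
        
        ```py
        import sys
        from typing import List
        
        
        def daluke_command(task: str, text: str, use_module: bool = False) -> List[str]:
            """Build the argument list for a DaLUKE CLI call ("ner" or "masked")."""
            if task not in ("ner", "masked"):
                raise ValueError(f"Unknown task: {task}")
            # Use `python -m daluke.api.cli` when the `daluke` script is not usable
            prefix = [sys.executable, "-m", "daluke.api.cli"] if use_module else ["daluke"]
            return prefix + [task, "--text", text]
        
        
        # Running the command requires daluke to be installed, for example:
        # import subprocess
        # subprocess.run(daluke_command("ner", "Thomas Delaney fører Danmark til sejr.", use_module=True), check=True)
        ```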
        
        ### Training DaLUKE yourself
        
        This part shows how to recreate the entire DaLUKE training pipeline from dataset preparation to fine-tuning.
        This guide is designed to be run in a bash shell.
        If you use Windows, you will probably have to make some modifications to the shell scripts used.
        
        ```bash
        # Download forked luke submodule
        git submodule update --init --recursive
        # Install requirements
        pip install -r requirements.txt
        pip install -r optional-requirements.txt
        pip install -r luke/requirements.txt
        
        # Build dataset
        # The script performs all the steps of building the dataset, including downloading the Danish Wikipedia
        # You only need to modify DATA_PATH to where you want the data to be saved
        # Be aware that this takes several hours
        dev/build_data.sh
        
        # Start pretraining using default hyperparameters
        python daluke/pretrain/run.py <DATA_PATH> -c configs/pretrain-main.ini --save-every 5 --epochs 150 --name daluke --fp16
        # Optional: Make plots of pretraining
        python daluke/plot/plot_pretraining.py <DATA_PATH>/daluke
        
        # Fine-tune on DaNE
        python daluke/collect_modelfile.py <DATA_PATH>/daluke <DATA_PATH>/ner/daluke.tar.gz
        python daluke/ner/run.py <DATA_PATH>/ner/daluke -c configs/main-finetune.ini --model <DATA_PATH>/ner/daluke.tar.gz --name finetune --eval
        # Evaluate on DaNE test set
        python daluke/ner/run_eval.py <DATA_PATH>/ner/daluke/finetune --model <DATA_PATH>/ner/daluke/finetune/daluke_ner_best.tar.gz
        # Optional: Fine-tuning plots
        python daluke/plot/plot_finetune_ner.py <DATA_PATH>/ner/daluke/finetune/train-results
        ```
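        
        The commands above pass around several locations derived from `DATA_PATH`. As a rough sketch of that layout, the hypothetical helper `pipeline_paths` below collects them in one place. The dictionary keys are names chosen for illustration; only the path components themselves are taken from the commands.
        
        ```py
        from pathlib import PurePosixPath
        from typing import Dict
        
        
        def pipeline_paths(data_path: str) -> Dict[str, PurePosixPath]:
            """Return the locations used by the training commands for a given DATA_PATH."""
            root = PurePosixPath(data_path)
            ner_dir = root / "ner" / "daluke"
            finetune_dir = ner_dir / "finetune"
            return {
                # Pretraining output, from `--name daluke`
                "pretrain_dir": root / "daluke",
                # Archive written by collect_modelfile.py
                "model_archive": root / "ner" / "daluke.tar.gz",
                # Location passed to ner/run.py
                "ner_dir": ner_dir,
                # Fine-tuning output, from `--name finetune`
                "finetune_dir": finetune_dir,
                "best_ner_model": finetune_dir / "daluke_ner_best.tar.gz",
                "finetune_results": finetune_dir / "train-results",
            }
        
        
        for key, path in pipeline_paths("/path/to/data").items():
            print(f"{key}: {path}")
        ```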
        
        
        # History
        
        ## 0.0.5
        - Added batching to NER forward passes in the Python API
        
        ## 0.0.4
        - Added a Python API for maintaining a stateful model and performing CWR, MLM and NER predictions
        
        ## 0.0.3: Finalization of Bachelor's Project
        - Allowed specifying entity spans in masked word prediction CLI
        
        ## 0.0.2
        - Made the CLI work on Windows
        
        ## 0.0.1
        - Released a simple single-example CLI
        
Keywords: nlp,ai,pytorch,ner
Platform: UNKNOWN
Description-Content-Type: text/markdown
Provides-Extra: full
