Metadata-Version: 2.1
Name: frenchnlp
Version: 0.1.8
Summary: State of the art toolchain for natural language processing in French
Home-page: https://github.com/xiaoouwang/frenchnlp
Author: Xiaoou WANG
Author-email: xiaoouwangfrance@gmail.com
License: MIT
Download-URL: https://pypi.org/project/frenchnlp
Description: # French NLP Toolkit
        
        State-of-the-art toolkit for natural language processing (NLP) in French, based on CamemBERT/FlauBERT.
        
        - [x] sentence similarity measure
        
        * For sentence similarity that performs better than average pooling or the [cls] token, see
        
        Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” ArXiv:1908.10084 [Cs], August 27, 2019. http://arxiv.org/abs/1908.10084.
        
        * For a real-world application of sentence similarity, see
        
        Xiaoou Wang, Xingyu Liu, Yimei Yue. “Mesure de similarité textuelle pour l’évaluation automatique de copies d’étudiants.” TALN-RECITAL 2021. [Download](https://xiaoouwang.github.io/xowang/TALN-RECITAL_2021_paper_74.pdf)
        
        - [ ] text classification
        
        ## How to use the package
        
        ```python
        from frenchnlp import *
        from transformers import AutoTokenizer, AutoModel
        import torch
        ```
        
        ## Transformer-based sentence similarity measure (using CamemBERT as an example)
        
        ### Using the [cls] token
        
        `bert_compare_cls(model,tokenizer,sentence1,sentence2)`
        
        ```python
        fr_tokenizer = AutoTokenizer.from_pretrained('camembert-base')
        fr_model = AutoModel.from_pretrained('camembert-base')
        
        sentences = [
            "J'aime les chats.",
            "Je déteste les chats.",
            "J'adore les chats."
        ]
        
        for i in range(1,3):
            print(f"similarité sémantique entre\n{sentences[0]}\n{sentences[i]}")
            print(bert_compare_cls(fr_model,fr_tokenizer,sentences[0],sentences[i]))
        ```
        
        Output:
        
        ```
        similarité sémantique entre
        J'aime les chats.
        Je déteste les chats.
        0.9145417
        similarité sémantique entre
        J'aime les chats.
        J'adore les chats.
        0.9809468
        ```
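        
        For intuition, the [cls] approach can be sketched without loading a model: take the first token vector of each sentence's hidden states and compare the two vectors. This is a minimal sketch, not the frenchnlp implementation. It assumes the score is a cosine similarity, which this README does not state. `cosine_similarity` and `cls_vector` are hypothetical helpers, and the toy matrices stand in for real model output.
        
        ```python
        import math
        
        def cosine_similarity(u, v):
            """Cosine of the angle between two equal-length vectors."""
            dot = sum(a * b for a, b in zip(u, v))
            norm_u = math.sqrt(sum(a * a for a in u))
            norm_v = math.sqrt(sum(b * b for b in v))
            return dot / (norm_u * norm_v)
        
        def cls_vector(token_vectors):
            """Return the first token's vector (the <s> token plays the [cls] role in CamemBERT)."""
            return token_vectors[0]
        
        # Toy hidden states, one row per token (hypothetical values, not real CamemBERT output).
        hidden_a = [[1.0, 0.0], [0.2, 0.9], [0.5, 0.5]]
        hidden_b = [[0.0, 2.0], [0.1, 0.8], [0.4, 0.6]]
        
        # Only the first rows are compared; the remaining token vectors are ignored.
        score = cosine_similarity(cls_vector(hidden_a), cls_vector(hidden_b))
        print(score)
        ```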
        
        ### Average pooling
        
        `compare_bert_average(model,tokenizer,sent1,sent2)`
        
        ```python
        fr_tokenizer = AutoTokenizer.from_pretrained('camembert-base')
        fr_model = AutoModel.from_pretrained('camembert-base')
        
        for i in range(1,3):
            print(f"similarité sémantique entre\n{sentences[0]}\n{sentences[i]}")
            print(compare_bert_average(fr_model,fr_tokenizer,sentences[0],sentences[i]))
        ```
        
        Output:
        
        ```
        similarité sémantique entre
        J'aime les chats.
        Je déteste les chats.
        0.9145417
        similarité sémantique entre
        J'aime les chats.
        J'adore les chats.
        0.9809468
        ```
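        
        Average pooling replaces the single first-token vector with the mean of all token vectors in the sentence. The sketch below illustrates the idea on toy data. `mean_pool` is a hypothetical helper, and skipping padded positions through an attention mask is an assumption, since this README does not say how frenchnlp handles padding.
        
        ```python
        def mean_pool(token_vectors, attention_mask):
            """Average the token vectors whose mask entry is 1, ignoring padding."""
            kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
            dim = len(token_vectors[0])
            return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
        
        # Toy hidden states (hypothetical values); the last row is a padded position.
        hidden = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
        mask = [1, 1, 0]
        
        pooled = mean_pool(hidden, mask)
        print(pooled)  # [2.0, 4.0]
        ```
        
        The pooled vectors of two sentences can then be compared with any vector similarity, for example a cosine.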
        
        ### Using multilingual sentence embeddings
        
        See the Reimers and Gurevych (2019) reference above for background on sentence embeddings.
        
        `compare_sent_transformer(model,sent1,sent2)`
        
        ```python
        from sentence_transformers import SentenceTransformer
        
        sent_model = SentenceTransformer('stsb-xlm-r-multilingual')
        
        for i in range(1,3):
            print(f"similarité sémantique entre\n{sentences[0]}\n{sentences[i]}")
            print(compare_sent_transformer(sent_model,sentences[0],sentences[i]))
        ```
        
        Output:
        
        ```
        similarité sémantique entre
        J'aime les chats.
        Je déteste les chats.
        0.46124768
        similarité sémantique entre
        J'aime les chats.
        J'adore les chats.
        0.9557947
        ```
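        
        The loops above compare one reference sentence against several candidates. A small wrapper can sort the candidates by score, as in the sketch below. `rank_by_similarity` and `toy_score` are hypothetical names, not part of frenchnlp, and the word-overlap scorer is only a stand-in so the example runs without downloading a model.
        
        ```python
        def rank_by_similarity(reference, candidates, score_fn):
            """Sort candidates from most to least similar to the reference."""
            scored = [(candidate, score_fn(reference, candidate)) for candidate in candidates]
            return sorted(scored, key=lambda pair: pair[1], reverse=True)
        
        def toy_score(sent1, sent2):
            """Stand-in scorer (hypothetical): shared words divided by total distinct words."""
            words1 = set(sent1.lower().split())
            words2 = set(sent2.lower().split())
            return len(words1 & words2) / len(words1 | words2)
        
        ranking = rank_by_similarity(
            "J'aime les chats.",
            ["Je déteste les chiens.", "J'adore les chats."],
            toy_score,
        )
        for sentence, score in ranking:
            print(sentence, score)
        ```
        
        With the package installed, the stand-in could presumably be swapped for `lambda a, b: compare_sent_transformer(sent_model, a, b)`.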
Keywords: text mining,nlp,corpus,french
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
