Metadata-Version: 2.1
Name: compling
Version: 0.0.5
Summary: compling is a Python module that provides Natural Language Processing and Computational Linguistics functionality for working with human language data.
Home-page: https://github.com/FrancescoPeriti/compling
Author: Francesco Periti
Author-email: peritifrancesco@gmail.com
License: UNKNOWN
Description: # compling
        
        **compling** is a Python module that provides Natural Language Processing and Computational Linguistics functionality for working with human language data. It incorporates various Data and Text Mining features from other popular libraries (e.g. spaCy, NLTK, sklearn, ...) in order to arrange a pipeline for the analysis of corpora of JSON documents.
        
        ### Documentation
        See the documentation: http://pycompling.altervista.org/.
        
        ### Installation
        ```sh
        pip install compling
        ```
        You also need to download the spaCy model for your corpus language.
        See the available models here: https://spacy.io/models.
        By default, **compling** expects you to download the _sm_ models. You can still choose to download larger models, but remember to edit the [_config.ini_](#config.ini) file accordingly, so everything works properly.
        
        For example, if the language of your documents is English, you could run:
        ```sh
        $ python -m spacy download en_core_web_sm
        ```
        ### config.ini
        The functionalities offered by **compling** may require a large variety of parameters. To make them easier to use, default values are provided for many parameters:
        - some can be changed directly in the function invocation: many functions provide optional parameters;
        - others are stored in the _config.ini_ file, a configuration file that holds the values of some special parameters characterizing the processing of your corpora (e.g. _the language of the documents in your corpus_).
        
        Here is a preview:
        ```ini
        [Corpus]
        ;The language of documents in your corpus.
        language = english
        
        ;The standard iso639 of 'language'.
        ;See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes .
        iso639 = en
        
        ;Documents in your corpus store their text in this key.
        text_key = text
        
        ;Documents in your corpus store their date values as string in this format.
        ;For a complete list of formatting directives, see: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior.
        date_format = %d/%m/%Y
        
        [Document_record]
        ;Document records metadata:
        
        ;If lower==1, a lowercase version will be stored for each document.
        lower = 0
        
        ;If lemma==1, a version with tokens replaced by their lemmas will be stored for each document.
        lemma = 0
        
        ;If stem==1, a version with tokens replaced by their stems will be stored for each document.
        stem = 0
        
        ;If negations==1, a version where negated tokens are preceded by the 'NOT_' prefix will be stored for each document.
        negations = 1
        
        ;If named_entities==1, the occurring named entities will be stored in a list for each document.
        named_entities = 1
        ; ...
        ```
        **compling** provides the ConfigManager class to help you handle this file.
        
        The available methods are shown below.
        ```python
        class ConfigManager:
            def __init__(self) -> None:
                """Constructor: creates a ConfigManager object."""
        
            def load(self) -> None:
                """Loads content of config.ini file."""
        
            def cat(self) -> None:
                """Shows the content of the config.ini file as plain-text."""
        
            def updates(self, config:dict) -> None:
                """Updates some values of some sections."""
        
            def update(self, section, k, v) -> None:
                """Update a k field with a v value in the s section."""
        
            def reset(self) -> None:
                """Reset the config.ini file to default conditions."""
        
            def whereisconfig(self) -> str:
                """Shows the config.ini file location."""
        ```
        **Example of usage**
        ```python
        from compling.config import ConfigManager
        cm = ConfigManager()
        
        # the documents of my corpus are Italian
        cm.updates({'Corpus': {'language': 'italian', 'iso639': 'it'}})
        
        # I want to keep a lowercase version of each document
        cm.update('Document_record', 'lower', '1')
        
        # restore the default configuration
        cm.reset()
        ```
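        You can also inspect the current configuration and find out where the config.ini file lives on disk, using the methods listed above:
        ```python
        # show the content of config.ini as plain text
        cm.cat()
        
        # print the location of the config.ini file
        print(cm.whereisconfig())
        ```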
        ### Tree structure
        The **compling** tree structure is shown below. 
        Different fonts are used: **bold**, for packages; _italic_, for files; Capitalized, for available classes.
        
        ----------------------
        * **compling**
            - example-corpus: folder containing a sample corpus.
                - vatican-publications
                    -  _[...]_
            - _config.ini_: configuration file.
            - _config.py_
            - **nlptoolkit**
                + NLP
            - **analysis** 
                + **lexical**
                    +  **tokenization**
                        + [Tokenizer](#Tokenization)
                    +  **vectorization**
                        +  [Vectorizer](#Vectorization)
                        +  [VSM](#Vectorization)
                    +  **unsupervised_learning**
                        +  **clustering**
                            +  [KMeans](#Unsupervised-Learning)
                            +  [Linkage](#Unsupervised-Learning)
                        +  **dimensionality_reduction**
                            +  [PCA](#Unsupervised-Learning)
                            +  [TruncatedSVD](#Unsupervised-Learning)
                + **sentiment**
                    + **lexicon**
                        + Vader
                        + Sentiwordnet
                    + [SentimentAnalyzer](#Sentiment-Analysis)
            - **embeddings**
                + **word**
                    + [Word2vec](#Embeddings) 
                    + [Fasttext](#Embeddings) 
                + **document**
                    + [Doc2vec](#Embeddings) 
                    
        ### Example of usage
        As an example, let's use the Vatican Publications corpus.
        ```python
        import pkg_resources
        corpus_path = pkg_resources.resource_filename('compling', 'example-corpus/vatican-publications')
        
        def doc_iterator(path:str):
            """Yields json documents."""
            import os, json
        
            for root, dirs, files in os.walk(path):
                for file in files:
                    if file.endswith('.json'):
                        with open(os.path.join(root, file), mode='r', encoding='utf-8') as f_json:
                            data = json.load(f_json)
                            yield data
        ```
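        For a quick sanity check, you could peek at the first document of the sample corpus. This sketch assumes each JSON document stores its text under the default _text_key_, i.e. 'text' (see [config.ini](#config.ini)):
        ```python
        # take a look at the first JSON document of the sample corpus
        first_doc = next(doc_iterator(corpus_path))
        print(sorted(first_doc.keys()))
        print(first_doc['text'][:200])
        ```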
        See the documentation for more details.
        #### Tokenization
        Tokenization converts the input text into a stream of tokens, where each token is a separate word, punctuation sign, number/amount, date, etc.
        
        **compling** provides a _Tokenizer_ class that tokenizes a stream of json documents.
        
        A Tokenizer object converts the corpus documents into a stream of:
        
           * _tokens_: tokens occurring in those documents. Each token is characterized by:
              * _token_id_: unique token identifier;
              * _sent_id_: unique sentence identifier. The id of the sentence the token occurs in;
              * _para_id_: unique paragraph identifier. The id of the paragraph the token occurs in;
              * _doc_id_: unique document identifier. The id of the document the token occurs in;
              * _text_: the text of the token;
              * a large variety of _optional meta-information_ (e.g. PoS tag, dep tag, lemma, stem, ...);
           * _sentences_ : sentences occurring in those documents. Each sentence is characterized by:
              * _sent_id_: unique sentence identifier;
              * _para_id_: unique paragraph identifier. The id of the paragraph the sentence occurs in;
              * _doc_id_: unique document identifier. The id of the document the sentence occurs in;
              * _text_: the text of the sentence;
              * a large variety of _optional meta-information_ (e.g. lemma, stem, ...);
           * _paragraphs_: paragraphs occurring in those documents. Each paragraph is characterized by:
              * _para_id_: unique paragraph identifier;
              * _doc_id_: unique document identifier. The id of the document the paragraph occurs in;
              * _text_: the text of the paragraph;
              * a large variety of _optional meta-information_ (e.g. lemma, stem, ...);
           * _documents_: Each document is characterized by:
              * _doc_id_: unique document identifier;
              * _text_: the text of the document;
              * a large variety of _optional meta-information_ (e.g. lemma, stem, ...);
        
        A Tokenizer object is also able to retrieve frequent n-grams to be considered as unique tokens.
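        
        As an illustration, a single token record could look roughly like the sketch below. It is built only from the fields listed above; the exact keys and optional meta-information depend on your configuration:
        ```python
        # hypothetical token record (field names taken from the list above)
        token_record = {
            'token_id': 0,      # unique token identifier
            'sent_id': 0,       # id of the sentence the token occurs in
            'para_id': 0,       # id of the paragraph the token occurs in
            'doc_id': 0,        # id of the document the token occurs in
            'text': 'church',   # the text of the token
            # ... optional meta-information, e.g. 'pos', 'dep', 'lemma', 'stem', ...
        }
        ```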
        
        **Example of usage**
        See the [Example of usage](#Example-of-usage) section above for the **_doc_iterator_** function.
        ```python
        from compling.analysis.lexical.tokenization import Tokenizer
        
        # new Tokenizer
        json_docs_stream_input = doc_iterator(corpus_path)
        json_docs_stream_output = doc_iterator(corpus_path)
        t = Tokenizer()
        
        # let's consider frequent bigrams as unique tokens
        json_docs_stream_output = t.ngrams2tokens(json_docs_stream_input, json_docs_stream_output, n=2)
        
        # run tokenization
        tokenization_records = t.run(json_docs_stream_output)
        
        token_records = list()
        sentence_records = list()
        paragraph_records = list()
        document_records = list()
        
        # you could store the records: if your corpus is large, tokenization could take a long time.
        for doc in tokenization_records:
            token_records.extend(doc['tokens'])
            sentence_records.extend(doc['sentences'])
            paragraph_records.extend(doc['paragraphs'])
            document_records.extend(doc['documents'])
        ```
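        As the comment in the example above suggests, you may want to persist the records so you don't have to repeat the tokenization. Here is a minimal sketch that dumps them to plain JSON files (the file names are arbitrary and any other storage backend would work just as well):
        ```python
        import json
        
        record_sets = {'tokens': token_records,
                       'sentences': sentence_records,
                       'paragraphs': paragraph_records,
                       'documents': document_records}
        
        # write each record list to its own JSON file
        for name, records in record_sets.items():
            with open(name + '.json', mode='w', encoding='utf-8') as f_out:
                json.dump(records, f_out)
        ```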
        #### Vectorization
        The process of converting text into vectors is called vectorization.
        The vectorized corpus documents make up the Vector Space Model, which can have a sparse or dense representation.
        
        **compling** provides a _Vectorizer_ class that, given corpus tokens records, vectorizes the corpus documents.
        
        A Vectorizer object allows you to create vectors by grouping tokens on an arbitrary field.
        E.g., grouping tokens by:
        - _'doc_id'_: you're creating document vectors;
        - _'sent_id'_: you're creating sentence vectors;
        - _'author'_: you're creating author vectors (each token must have an 'author' field);
        - ...
        
        You can also choose which text field of the tokens to group by, e.g.:
        - lemma
        - text
        - stem
        - ...
        
        It offers several weighting functions to set the vector component values (a minimal tf-idf sketch follows this list), such as:
        - One-hot encoding
        - Tf
        - TfIdf
        - Mutual Information
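        
        As a reminder of how such a weight is computed, here is a minimal tf-idf sketch, independent of **compling**'s own implementation (the toy corpus and variable names are purely illustrative):
        ```python
        import math
        from collections import Counter
        
        # toy corpus: each "document" is a bag of tokens
        docs = {'doc1': ['love', 'faith', 'love'],
                'doc2': ['faith', 'work'],
                'doc3': ['work', 'work', 'love']}
        
        def tfidf(docs):
            n_docs = len(docs)
            # document frequency: number of documents each token occurs in
            df = Counter(token for tokens in docs.values() for token in set(tokens))
            vectors = {}
            for doc_id, tokens in docs.items():
                tf = Counter(tokens)
                vectors[doc_id] = {t: (count / len(tokens)) * math.log(n_docs / df[t])
                                   for t, count in tf.items()}
            return vectors
        
        print(tfidf(docs)['doc1'])
        ```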
        
        You can specify the vectorization representation format: a Term x Document matrix or a Postings list.
        
        You can also inspect the Vector Space Model. 
        
        **compling** provides a Vector Space Model (_VSM_) class. It allows you to analyze the distance between vectors.
        
        **Example of usage**
        ```python
        from compling.analysis.lexical.vectorization import Vectorizer
        
        # new Vectorizer
        v = Vectorizer(token_field='lemma', group_by_field='author')
        
        # stream of author vectors
        vector_stream = v.run('tfidf', token_records)
        ```
        ```python
        from compling.analysis.lexical.vectorization import VSM
        
        # stream to list
        vector_list = list(vector_stream)
        
        # new VSM object
        v = VSM(vectors=vector_list, id_field='author')
        
        # calculates the vector distance matrix between vectors.
        v.distance(metric='euclidean')
        
        # plot the distance matrix as a heatmap
        v.plot()
        
        # top n values for each vector id
        v.topn(n=10)
        ```
        #### Unsupervised Learning
        Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision.
        
        **compling** provides these classes:
        - _KMeans_
        - _Linkage_
        - _PCA_
        - _TruncatedSVD_
        
        **Example of usage**
        ```python
        from compling.analysis.lexical.unsupervised_learning.clustering import KMeans
        
        # new KMeans object
        kmeans = KMeans(vectors=vector_list, id_field='author')
        
        # run kmeans: 4 clusters
        clusters = kmeans.run(k=4)
        ```
        ```python
        from compling.analysis.lexical.unsupervised_learning.clustering import Linkage
        
        # new Linkage object
        linkage = Linkage(vectors=vector_list, id_field='author')
        
        # run hierarchical clustering
        linkage.run(method='complete')
        
        # plot the dendrogram showing the set of all possible clusters
        linkage.plot()
        ```
        ```python
        from compling.analysis.lexical.unsupervised_learning.dimensionality_reduction import PCA
        
        # new PCA object
        pca = PCA(vectors=vector_list, id_field='author')
        
        # run PCA: reduction to 2 components.
        pca.run(n=2)
        
        # plot 2D vectors
        pca.plot()
        ```
        ```python
        from compling.analysis.lexical.unsupervised_learning.dimensionality_reduction import TruncatedSVD
        
        # new TruncatedSVD object
        truncateSVD = TruncatedSVD(vectors=vector_list, id_field='author')
        
        # run TruncatedSVD: reduction to 2 components.
        truncateSVD.run(n=2)
        
        # plot 2D vectors
        truncateSVD.plot()
        ```
        #### Sentiment Analysis
        **compling** implements a _SentimentAnalyzer_ class that allows you to perform sentiment analysis through a lexicon-based approach.
        
        SentimentAnalyzer uses a summation strategy: the polarity level of a document is calculated as the sum of the polarities of all the words in the document.
        
        The analysis detects negation patterns and reverses the polarity of negated tokens.
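        
        To make the summation strategy concrete, here is a toy sketch, independent of **compling**'s actual implementation: word polarities are summed, and tokens marked as negated (here with the 'NOT_' prefix, as produced by the _negations_ option in [config.ini](#config.ini)) have their polarity reversed:
        ```python
        # toy polarity lexicon (illustrative values)
        toy_lexicon = {'good': 1.0, 'bad': -1.0, 'love': 2.0}
        
        def document_polarity(tokens):
            total = 0.0
            for token in tokens:
                negated = token.startswith('NOT_')
                word = token[len('NOT_'):] if negated else token
                polarity = toy_lexicon.get(word, 0.0)
                # reverse the polarity of negated tokens
                total += -polarity if negated else polarity
            return total
        
        print(document_polarity(['love', 'NOT_bad']))  # 2.0 + 1.0 = 3.0
        ```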
        
        Providing a regex, you can filter sentences/paragraphs/documents to analyze.
        
        Providing a POS list and/or a dep list, you can filter the words whose polarities will be summed.
        
        At the moment, the analysis is only available for English documents.
        
        **Example of usage**
        ```python
        from compling.analysis.sentiment import SentimentAnalyzer
        from compling.analysis.sentiment.lexicon import Vader
        
        # new SentimentAnalyzer. 
        # polarity of documents as sum of VERB, NOUN, PROPN, ADJ token polarities.
        s = SentimentAnalyzer(token_records, text_field='lemma', group_by_field='author',
                              id_index_field='para_id', # you can filter some paragraphs
                              pos=('VERB', 'NOUN', 'PROPN', 'ADJ')) 
        
        # polarity of documents as sum of VERB, NOUN, PROPN, ADJ token polarities occurring in paragraphs filtered by regex_pattern.
        s.filter(paragraph_records, regex_pattern="^.*(work).*$")
        
        # new Lexicon    
        lexicon = Vader()
          
        # run sentiment analysis 
        polarities, words = s.run(lexicon=lexicon)
        ```
        #### Embeddings
        An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words or documents.
        
        **compling** incorporates some gensim classes, such as Word2vec, Fasttext and Doc2vec.
        
        **Example of usage**
        ```python
        from compling.embeddings.words import Word2vec
        
        # new Word2vec
        w = Word2vec(index=sentence_records, text_field='text')
        
        # build Word2vec model
        w.run()
                
        love_sim = w.most_similar('love')
        ```
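        Fasttext is listed among the word-level embeddings but is not demonstrated above. Assuming it mirrors the Word2vec interface (the import path and parameters below are assumptions, not confirmed here), a sketch might look like this:
        ```python
        from compling.embeddings.words import Fasttext
        
        # new Fasttext
        w = Fasttext(index=sentence_records, text_field='text')
        
        # build Fasttext model
        w.run()
        
        love_sim = w.most_similar('love')
        ```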
        ```python
        from compling.embeddings.documents import Doc2vec
        
        # new Doc2vec
        w = Doc2vec(index=sentence_records, id_field='author', text_field='text')
        
        # build Doc2vec model
        w.run()
        
        paulvi_sim = w.most_similar("Paul VI")
        ```
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
