Metadata-Version: 2.1
Name: LFExtractor
Version: 1.0.1
Summary: A corpus-linguistic tool to extract and search for linguistic features
Home-page: https://github.com/jaaack-wang/ling_feature_extractor
Author: Jack Wang
Author-email: jackwang196531@gmail.com
License: MIT License
Description: # ling_feature_extractor
        ## Description
        - A corpus-linguistic tool to extract and search for linguistic features in a text or a corpus.
        - There are 95 built-in linguistic features in the main version, versus 98 in the Thesis_Project version. The three removed features (words per utterance, number of utterances, and number of overlaps) are not generally available in an ordinary corpus.
        - Over two thirds of these features come from Biber et al. (2006), with 42 features also present in Biber (1988). These features are generally known as part of the Multi-Dimensional (MD) analysis framework.
        - The program is mainly tested on two online accessible corpora, namely the [British Academic Spoken Corpus](http://www.reading.ac.uk/AcaDepts/ll/base_corpus/) and the [Michigan Corpus of Academic English](https://quod.lib.umich.edu/cgi/c/corpus/corpus?page=home;c=micase;cc=micase), but due to copyright concerns, the [test_sample](https://github.com/jaaack-wang/ling_feature_extractor/tree/main/test_sample) is used here. 
        
        ## Prerequisites
        
        - `Computer Languages`: 
           - Python 3.6+: check with cmd: `python --version` or `python3 --version` ([Download Page](https://www.python.org/downloads/)); 
           - Java 1.8+: check with cmd: `java -version` ([Download Page](https://www.java.com/en/download/)). 
        - `Python packages`
        
        | Package | Description | Pip download | 
        | :---: | :---: | :---: |
        | [stanfordcorenlp](https://github.com/Lynten/stanford-corenlp) | A Python wrapper for StanfordCoreNLP | `pip/pip3 install stanfordcorenlp` |
        | [pandas](https://pandas.pydata.org) | Used for storing extracted feature frequencies  | `pip/pip3 install pandas` |
        
        Besides, built-in packages are heavily employed in the program, especially the built-in `re` package for Regular Expression.
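        Since the feature regexes run over POS-tagged text (in the `word_TAG` format), the matching mechanics can be sketched with the built-in `re` module alone; the tagged sample string below is illustrative:
        
        ```python
        import re
        
        # a miniature POS-tagged sample in the word_TAG format the program produces
        tagged = "so_RB you_PRP know_VBP ,_, I_PRP think_VBP you_PRP know_VBP him_PRP"
        
        # the same kind of pattern used below for the feature 'you know'
        matches = re.findall(r"you_\S+ know_\S+", tagged)
        print(matches)  # ['you_PRP know_VBP', 'you_PRP know_VBP']
        ```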
        
        ## Installation
        - Download the project directly from this page and `cd` into the project folder.
        - By pip: `pip/pip3 install LFExtractor`
        
        ## Usage
        ### path to StanfordCoreNLP
        **Please specify _the directory to StanfordCoreNLP_ in `text_processor.py` under the `LFE` folder when first using the program.**
        - [X] `nlp = StanfordCoreNLP("/path/to/StanfordCoreNLP/")` 
        
        Example: `nlp = StanfordCoreNLP("/Users/wzx/p_package/stanford-corenlp-4.1.0")`
        
        ### Dealing with a corpus of files
        ```python 
        from LFE.extractor import CorpusLFE
        
        lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')
        # by default, saves frequency data, the tagged corpus, and the extracted features
        lfe.corpus_feature_fre_extraction()    # defaults: lfe.corpus_feature_fre_extraction(normalized_rate=100, save_tagged_corpus=True, save_extracted_features=True, left=0, right=0)
        # change the normalized rate, turn off the tagged corpus, and display the extracted features with a specified context window
        lfe.corpus_feature_fre_extraction(1000, False, True, 2, 3)    # frequencies normalized per 1000 words; extracted features shown with 2 words of left context and 3 of right
        
        # get frequency data only
        lfe.corpus_feature_fre_extraction(save_tagged_corpus=False, save_extracted_features=False)
        # get tagged corpus only
        lfe.save_tagged_corpus()
        # get extracted feature only
        lfe.save_corpus_extracted_features()   # lfe.save_corpus_extracted_features(left=0, right=0)
        # set how many context words to display on each side of the target pattern
        lfe.save_corpus_extracted_features(2, 3)
        
        # extract and save specific linguistic feature by feature name
        # to see the built-in features' names, use `show_feature_names()`
        from LFE.extractor import *
        print(show_feature_names())   # Six letter words and longer, Contraction, Agentless passive, By passive...
        # specify which feature to extract and save
        lfe.save_corpus_one_extracted_feature_by_name('Six letter words and longer')
        
        # extract and save specific linguistic feature by feature regex, for example, 'you know' 
        lfe.save_corpus_one_extracted_feature_by_regex(r'you_\S+ know_\S+', 2, 2, feature_name='You Know')  # extract the phrase 'you know' with 2 context words on each side; note the '_\S+' after each word, since the corpus is automatically POS-tagged
        # for more complex structures, features_set.py can be utilized, for example, to extract the "article + adj + noun" structure
        from LFE import features_set as fs
        ART = fs.ART
        ADJ = fs.ADJ
        NOUN = fs.NOUN
        lfe.save_corpus_one_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')
        # result example (use test_sample): away_RB by_IN	【 the_DT whole_JJ thing_NN 】	In_IN fact_NN 
        ```
        
        ### Dealing with a text
        ```python
        from LFE import extractor as ex
        
        # check the functionalities contained in ex by dir(ex)
        # show built-in feature names
        print(ex.show_feature_names())   # Six letter words and longer, Contraction, Agentless passive, By passive...
        # get built-in features' regex by its name
        print(ex.get_feature_regex_by_name('Contraction'))  # (n't| '\S\S?)_[^P]\S+
        # get built-in features' names by regex
        print(ex.get_feature_name_by_regex(r"(n't| '\S\S?)_[^P]\S+"))  # Contraction
        
        # text processing
        # tagged file
        ex.save_single_tagged_text('/path/to/the/file')
        # cleaned file
        ex.save_single_cleaned_text('/path/to/the/file')
        
        # display extracted feature by name
        res = ex.display_extracted_feature_by_name('/path/to/the/file', 'Contraction', left=0, right=0)
        print(res)  #  's_VBZ, n't_NEG, 've_VBP...
        # save the result
        ex.save_extracted_feature_by_name('/path/to/the/file', 'Contraction', left=0, right=0)
        
        # display extracted feature by regex, for example, noun phrase
        from LFE import features_set as fs
        
        ART = fs.ART
        ADJ = fs.ADJ
        NOUN = fs.NOUN
        res = ex.display_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')
        print(res)  # One_CD is_VBZ	【 the_DT extraordinary_JJ evidence_NN 】	of_IN human_JJ
        # save the result
        ex.save_extracted_feature_by_regex(rf'{ART} {ADJ} {NOUN}', 2, 2, 'Noun phrase')
        
        # get the frequency data of all the linguistic features for a file 
        res = ex.get_single_file_feature_fre(file_path, normalized_rate=100, save_tagged_file=True, save_extracted_features=True, left=0, right=0)
        print(res)
        ```
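        The `normalized_rate` parameter follows the usual corpus-linguistic normalization: the raw count is divided by the text length and scaled to a fixed basis (per 100 words by default). A minimal sketch, assuming that standard formula; the function name is illustrative, not part of LFE:
        
        ```python
        def normalized_frequency(raw_count, total_words, normalized_rate=100):
            """Frequency per `normalized_rate` words (e.g. per 100 or per 1000 words)."""
            return raw_count / total_words * normalized_rate
        
        # e.g. 7 contractions in a 350-word file, normalized per 100 words
        print(normalized_frequency(7, 350))  # 2.0
        ```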
        
        ### Dealing with a part of a corpus
        ```python
        from LFE.extractor import *
        
        lfe = CorpusLFE('/directory/to/the/corpus/under/analysis/')
        # get the file path list, then select the files you want to examine and build a sub-list
        fp_list = lfe.get_filepath_list()
        # loop through the sub-list and apply the functions shown above to get the results you want
        ```
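        The selection-and-loop step can be sketched with the standard library: filter the path list with `fnmatch`, then pass each chosen file to the per-file functions shown earlier. The `select_files` function and the sample paths are illustrative, not part of LFE:
        
        ```python
        from fnmatch import fnmatch
        
        def select_files(filepath_list, pattern="*.txt"):
            """Keep only the corpus files whose paths match the given glob pattern."""
            return [fp for fp in filepath_list if fnmatch(fp, pattern)]
        
        # with LFE, the list would come from: fp_list = lfe.get_filepath_list()
        fp_list = ["corpus/lecture_01.txt", "corpus/notes.md", "corpus/lecture_02.txt"]
        for fp in select_files(fp_list, "corpus/lecture_*.txt"):
            # e.g. ex.save_single_tagged_text(fp), or any per-file function shown above
            print(fp)  # prints corpus/lecture_01.txt, then corpus/lecture_02.txt
        ```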
        
Keywords: Computational linguistics,Corpus-linguistic tool,NLP,Natural Language Processing
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
