Metadata-Version: 2.1
Name: lexicalrichness
Version: 0.1.6
Summary: A small module to compute textual lexical richness (aka lexical diversity).
Home-page: https://github.com/LSYS/lexicalrichness
Author: Lucas Shen YS
Author-email: lucas@lucasshen.com
License: MIT license
Download-URL: https://github.com/LSYS/lexicalrichness/archive/0.1.6.tar.gz
Description: ===============
        LexicalRichness
        ===============
        .. image:: https://badge.fury.io/py/lexicalrichness.svg
                :target: https://pypi.org/project/lexicalrichness/
        .. image:: https://img.shields.io/conda/vn/conda-forge/lexicalrichness.svg
                :target: https://anaconda.org/conda-forge/lexicalrichness
        .. image:: https://img.shields.io/conda/pn/conda-forge/lexicalrichness   
        	:target: https://anaconda.org/conda-forge/lexicalrichness
        .. image:: https://badgen.net/github/release/Naereen/Strapdown.js
                :target: https://github.com/LSYS/LexicalRichness.js/releases
        
        .. image:: https://github.com/LSYS/LexicalRichness/blob/housekeep/images/ghdocbadge.svg
                :target: https://github.com/LSYS/LexicalRichness/blob/master/README.rst
        	
        .. image:: https://www.codefactor.io/repository/github/lsys/lexicalrichness/badge
                :target: https://www.codefactor.io/repository/github/lsys/lexicalrichness  
        .. image:: https://img.shields.io/lgtm/grade/python/g/LSYS/LexicalRichness.svg?logo=lgtm&logoWidth=18)
                :target: https://lgtm.com/projects/g/LSYS/LexicalRichness/context:python
        
        .. image:: https://img.shields.io/pypi/pyversions/lexicalrichness   
        	:target: https://img.shields.io/pypi/pyversions/lexicalrichness  
        .. image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
           :target: https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity
        
        .. |Maintenance yes| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
           :target: https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity
        	
        
        .. image:: https://img.shields.io/badge/PRs-welcome-brightgreen.svg
                :target: http://makeapullrequest.com
        .. image:: https://img.shields.io/badge/License-MIT-blue.svg
                :target: https://lbesson.mit-license.org
        .. image:: https://mybinder.org/badge_logo.svg
                :target: https://mybinder.org/v2/gh/LSYS/lexicaldiversity-example/main?labpath=example.ipynb
        	
        .. image:: https://zenodo.org/badge/132715931.svg
           :target: https://zenodo.org/badge/latestdoi/132715931
           
        A small python module to compute textual lexical richness (aka lexical diversity) measures.
        
        Lexical richness refers to the range and variety of vocabulary deployed in a text by a speaker/writer (McCarthy and Jarvis 2007). Lexical richness is used interchangeably with lexical diversity, lexical variation, lexical density, and vocabulary richness and is measured by a wide variety of indices. Uses include (but not limited to) measuring writing quality, vocabulary knowledge (Å iÅ¡kovÃ¡ 2012), speaker competence, and socioeconomic status (McCarthy and Jarvis 2007).
        
        
        
        1. Installation
        ---------------
        **Install using PIP**
        
        .. code-block:: bash
        
        	pip install lexicalrichness
        
        If you encounter, 
        
        .. code-block:: python
        
        	ModuleNotFoundError: No module named 'textblob'
        
        install textblob:
        
        .. code-block:: bash
        
        	pip install textblob
        
        *Note*: This error should only exist for :code:`versions <= v0.1.3`. Fixed in 
        `v0.1.4 <https://github.com/LSYS/LexicalRichness/releases/tag/0.1.4>`__ by `David Lesieur <https://github.com/davidlesieur>`__ and `Christophe Bedetti <https://github.com/cbedetti>`__.
        
        
        **Install from Conda-Forge**
        
        *LexicalRichness* is now also available on conda-forge. If you have are using the `Anaconda <https://www.anaconda.com/distribution/#download-section>`__ or `Miniconda <https://docs.conda.io/en/latest/miniconda.html>`__ distribution, you can create a conda environment and install the package from conda.
        
        .. code-block:: bash
        
        	conda create -n lex
        	conda activate lex 
        	conda install -c conda-forge lexicalrichness
        
        *Note*: If you get the error :code:`CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'` with :code:`conda activate lex` in *Bash* either try
        
        	* :code:`conda activate bash` in the *Anaconda Prompt* and then retry :code:`conda activate lex` in *Bash*
        	* or just try :code:`source activate lex` in *Bash*
        
        **Install manually using Git and GitHub**
        
        .. code-block:: bash
        
        	git clone https://github.com/LSYS/LexicalRichness.git
        	cd LexicalRichness
        	pip install .
        
        **Run from the cloud**
        
        Try the package on the cloud (without setting anything up on your local machine) by clicking the icon here:  
        
        |mybinder|
        
        .. |mybinder| image:: https://mybinder.org/badge_logo.svg
         :target: https://mybinder.org/v2/gh/LSYS/lexicaldiversity-example/main?labpath=example.ipynb
        
        2. Quickstart
        -------------
        
        .. code-block:: python
        
        	>>> from lexicalrichness import LexicalRichness
        
        	# text example
        	>>> text = """Measure of textual lexical diversity, computed as the mean length of sequential words in
                    		a text that maintains a minimum threshold TTR score.
        
                    		Iterates over words until TTR scores falls below a threshold, then increase factor
                    		counter by 1 and start over. McCarthy and Jarvis (2010, pg. 385) recommends a factor
                    		threshold in the range of [0.660, 0.750].
                    		(McCarthy 2005, McCarthy and Jarvis 2010)"""
        
        	# instantiate new text object (use the tokenizer=blobber argument to use the textblob tokenizer)
        	>>> lex = LexicalRichness(text)
        
        	# Return word count.
        	>>> lex.words
        	57
        
        	# Return (unique) word count.
        	>>> lex.terms
        	39
        
        	# Return type-token ratio (TTR) of text.
        	>>> lex.ttr
        	0.6842105263157895
        
        	# Return root type-token ratio (RTTR) of text.
        	>>> lex.rttr
        	5.165676192553671
        
        	# Return corrected type-token ratio (CTTR) of text.
        	>>> lex.cttr
        	3.6526846651686067
        
        	# Return mean segmental type-token ratio (MSTTR).
        	>>> lex.msttr(segment_window=25)
        	0.88
        
        	# Return moving average type-token ratio (MATTR).
        	>>> lex.mattr(window_size=25)
        	0.8351515151515151
        
        	# Return Measure of Textual Lexical Diversity (MTLD).
        	>>> lex.mtld(threshold=0.72)
        	46.79226361031519
        
        	# Return hypergeometric distribution diversity (HD-D) measure.
        	>>> lex.hdd(draws=42)
        	0.7468703323966486
        
        	# Return Herdan's lexical diversity measure.
        	>>> lex.Herdan
        	0.9061378160786574
        
        	# Return Summer's lexical diversity measure.
        	>>> lex.Summer
        	0.9294460323356605
        
        	# Return Dugast's lexical diversity measure.
        	>>> lex.Dugast
        	43.074336212149774
        
        	# Return Maas's lexical diversity measure.
        	>>> lex.Maas
        	0.023215679867353005
        	
        3. Use LexicalRichness in your own pipeline
        -------------------------------------------
        :code:`LexicalRichness` comes packaged with minimal preprocessing + tokenization for a quick start. 
        
        But for intermediate users, you likely have your preferred :code:`nlp_pipeline`:
        
        .. code-block:: python
        
        	# Your preferred preprocessing + tokenization pipeline
        	def nlp_pipeline(text):
        		...
        		return list_of_tokens
        
        Use :code:`LexicalRichness` with your own :code:`nlp_pipeline`:
        
        .. code-block:: python
        
        	# Initiate new LexicalRichness object with your preprocessing pipeline as input
        	lex = LexicalRichness(text, preprocesser=None, tokenizer=nlp_pipeline)
        
        	# Compute lexical richness
        	mtld = lex.mtld()
        	
        Or use :code:`LexicalRichness` at the end of your pipeline and input the :code:`list_of_tokens` with :code:`
        preprocesser=None` and :code:`tokenizer=None`:
        	
        .. code-block:: python
        
        	# Preprocess the text
        	list_of_tokens = nlp_pipeline(text)
        	
        	# Initiate new LexicalRichness object with your list of tokens as input
        	lex = LexicalRichness(list_of_tokens, preprocesser=None, tokenizer=None)
        
        	# Compute lexical richness
        	mtld = lex.mtld()	
        
        4. Attributes
        -------------
        
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``wordlist``            | list of words                                                   		      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``words``  		  | number of words (w) 				   			      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``terms``		  | number of unique terms (t)			                                      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``preprocessor``        | preprocessor used		                                                      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``tokenizer``           | tokenizer used		                                                      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``ttr``		  | type-token ratio computed as t / w (Chotlos 1944, Templin 1957)         	      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``rttr``	          | root TTR computed as t / sqrt(w) (Guiraud 1954, 1960)                             |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``cttr``	          | corrected TTR computed as t / sqrt(2w) (Carrol 1964)		              |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``Herdan`` 	          | log(t) / log(w) (Herdan 1960, 1964)                                               |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``Summer``    	  | log(log(t)) / log(log(w)) Summer (1966)                                           |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``Dugast``          	  | (log(w) ** 2) / (log(w) - log(t) Dugast (1978)				      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``Maas`` 	          | (log(w) - log(t)) / (log(w) ** 2) Maas (1972)                                     |
        +-------------------------+-----------------------------------------------------------------------------------+
        
        5. Methods
        ----------
        
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``msttr``            	  | Mean segmental TTR (Johnson 1944)						      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``mattr``  		  | Moving average TTR (Covington 2007, Covington and McFall 2010)		      |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``mtld``		  | Measure of Lexical Diversity (McCarthy 2005, McCarthy and Jarvis 2010)            |
        +-------------------------+-----------------------------------------------------------------------------------+
        | ``hdd``                 | HD-D (McCarthy and Jarvis 2007)                                                   |
        +-------------------------+-----------------------------------------------------------------------------------+
        
        **Assessing method docstrings**
        
        .. code-block:: python
        
        	>>> import inspect
        
        	# docstring for hdd (HD-D)
        	>>> print(inspect.getdoc(LexicalRichness.hdd))
        
        	Hypergeometric distribution diversity (HD-D) score.
        
        	For each term (t) in the text, compute the probabiltiy (p) of getting at least one appearance
        	of t with a random draw of size n < N (text size). The contribution of t to the final HD-D
        	score is p * (1/n). The final HD-D score thus sums over p * (1/n) with p computed for
        	each term t. Described in McCarthy and Javis 2007, p.g. 465-466.
        	(McCarthy and Jarvis 2007)
        
        	Parameters
        	__________
        	draws: int
        	    Number of random draws in the hypergeometric distribution (default=42).
        
        	Returns
        	_______
        	float
        	
        Alternatively, just do
        
        .. code-block:: python
        
        	>>> print(lex.hdd.__doc__)
        	
        	Hypergeometric distribution diversity (HD-D) score.
        
                    For each term (t) in the text, compute the probabiltiy (p) of getting at least one appearance
                    of t with a random draw of size n < N (text size). The contribution of t to the final HD-D
                    score is p * (1/n). The final HD-D score thus sums over p * (1/n) with p computed for
                    each term t. Described in McCarthy and Javis 2007, p.g. 465-466.
                    (McCarthy and Jarvis 2007)
        
                    Parameters
                    ----------
                    draws: int
                        Number of random draws in the hypergeometric distribution (default=42).
        
                    Returns
                    -------
                    float	
        
        6. Contributing
        ---------------
        **Author**
        
        `Lucas Shen <https://www.lucasshen.com/>`__
        
        **Contributors**
        
        * `Christophe Bedetti <https://github.com/cbedetti>`__
        * `David Lesieur <https://github.com/davidlesieur>`__
        
        Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given. 
        See here for `how to contribute  <./CONTRIBUTING.rst>`__ to this project.
        See here for `Contributor Code of
        Conduct <http://contributor-covenant.org/version/1/0/0/>`__.
        
        7. Citing
        ---------
        If you have used this codebase and wish to cite it, please cite as below.
        
        Codebase:
        
        .. code-block:: bib
        
        	@software{lex,
        	author = {Shen, Lucas},
        	doi = {10.5281/zenodo.6607008},
        	license = {MIT license},
        	title = {{LexicalRichness: A small module to compute textual lexical richness}},
        	url = {https://github.com/LSYS/lexicalrichness},
        	year = {2022}
        	}
        
        Documentation on formulations and algorithms:
        
        .. code-block:: bib
        
        	@techreport{accuracybias, 
        	title={Measuring Political Media Slant Using Text Data},
        	author={Shen, Lucas},
        	url={https://www.lucasshen.com/research/media.pdf}
        	}
        
        
        The package is released under the `MIT
        License <https://opensource.org/licenses/MIT>`__.
        
        
        =======
        History
        =======
        
        0.1.2 (2018-05-09)
        ------------------
        
        * First release on PyPI.
        
        0.1.3 (2018-05-27)
        ------------------
        
        * Minor fix for compatibility issue with hyphens (ascii) in python 2.
        
        0.1.4 (2021-11-13)
        ------------------
        * Add textblob in setup.py as requirements (to fix "ModuleNotFoundError: No module named 'textblob'") [Christophe Bedetti].
        * Make preprocessing and tokenization optional [David Lesieur].
Keywords: lexical diversity,lexical richness,vocabulary diversity,lexical density,lexical
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/x-rst
