Metadata-Version: 2.1
Name: deltas
Version: 0.6.1
Summary: An experimental diff library for generating operation deltas that represent the difference between two sequences of comparable items.
Home-page: https://github.com/halfak/deltas
Author: Aaron Halfaker
Author-email: aaron.halfaker@gmail.com
License: MIT
Description: 
        # Deltas
        
        An open licensed (MIT) library for generating deltas (a.k.a. sequences of operations) that represent the difference between two sequences of comparable tokens.
        
        * **Installation:**  ``pip install deltas``
        * **Repo**: <http://github.com/halfak/Deltas>
        * **Documentation**: <http://pythonhosted.org/deltas>
        * Note: this library requires Python 3.3 or newer.
        
        This library is intended to make experimental difference detection strategies easily available. Two strategies are currently available:
        
        `deltas.sequence_matcher.diff(a, b)`:  
        A shameless wrapper around `difflib.SequenceMatcher` to get it to work within the structure of *deltas*.
        
        `deltas.segment_matcher.diff(a, b, segmenter=None)`:  
        A generalized difference detector that is designed to detect block moves and copies based on the use of a ``Segmenter``.
        
        **Example:**
        
        ```python
        from deltas import segment_matcher, text_split
        a = text_split.tokenize("This is some text. This is some other text.")
        b = text_split.tokenize("This is some other text. This is some text.")
        operations = segment_matcher.diff(a, b)
        
        for op in operations:
            print(op.name, repr(''.join(a[op.a1:op.a2])),
                  repr(''.join(b[op.b1:op.b2])))
        ```
        
        which prints:
        
        ```
        equal 'This is some other text.' 'This is some other text.'
        insert ' ' ' '
        equal 'This is some text.' 'This is some text.'
        delete ' ' ''
        ```
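        
        For comparison, the `difflib`-backed strategy can be swapped in directly; a minimal sketch, assuming `sequence_matcher` is importable from the top-level package just as `segment_matcher` is above:
        
        ```python
        from deltas import sequence_matcher, text_split
        
        a = text_split.tokenize("This is some text. This is some other text.")
        b = text_split.tokenize("This is some other text. This is some text.")
        
        # Same operation interface as segment_matcher.diff, but backed by
        # difflib.SequenceMatcher, so block moves tend to surface as
        # delete/insert pairs rather than being matched as equal segments.
        for op in sequence_matcher.diff(a, b):
            print(op.name, repr(''.join(a[op.a1:op.a2])),
                  repr(''.join(b[op.b1:op.b2])))
        ```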
        
        ## Tokenization
        
        By default, Deltas tokenizes text by regular-expression splitting. CJK tokenization is also included: if at least 1/4 of the text (the default threshold) consists of Japanese or Korean characters, the corresponding language-specific tokenizer is used; otherwise, the Chinese tokenizer is used (see the sketch after the list below).
        
        * Chinese Tokenizer - Jieba
        * Japanese Tokenizer - Sudachi
        * Korean Tokenizer - KoNLPy(Okt)
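        
        A rough sketch of that selection heuristic (hypothetical names; the real logic lives in `deltas.tokenizers` and may differ in detail):
        
        ```python
        import re
        
        JA_RE = re.compile(r'[\u3040-\u30ff]')   # Hiragana + Katakana
        KO_RE = re.compile(r'[\uac00-\ud7af]')   # Hangul syllables
        
        def pick_cjk_tokenizer(text, threshold=0.25):
            """Pick a tokenizer by the share of Japanese/Korean characters."""
            n = max(len(text), 1)
            if len(JA_RE.findall(text)) / n >= threshold:
                return "japanese"   # Sudachi
            elif len(KO_RE.findall(text)) / n >= threshold:
                return "korean"     # KoNLPy (Okt)
            else:
                return "chinese"    # Jieba
        ```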
        
        **Tokenization example:**
        
        ```python
        import mwapi
        import deltas
        import deltas.tokenizers
        
        # Example titles: "中国" ("China", zh), "俳句" ("Haiku", ja), "김치" ("Kimchi", ko)
        session = mwapi.Session("https://zh.wikipedia.org")
        doc = session.get(action="query", prop="revisions", titles="中国",
                          rvprop="content", rvslots="main", formatversion=2)
        text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']
        
        # text processed only by the regexp tokenizer
        tokenized_text = deltas.tokenizers.wikitext_split.tokenize(text)
        # text processed by the regexp tokenizer with CJK post-processing
        tokenized_text_cjk = deltas.tokenizers.wikitext_split_w_cjk.tokenize(text)
        ```
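        
        The resulting token lists plug straight into the matchers above. Continuing the session from the previous snippet (illustrative only; the appended Japanese string is made up):
        
        ```python
        from deltas import segment_matcher
        
        tokenize = deltas.tokenizers.wikitext_split_w_cjk.tokenize
        
        # Diff the article against a trivially edited copy of itself.
        a = tokenize(text)
        b = tokenize(text + "補足です。")
        
        for op in segment_matcher.diff(a, b):
            if op.name != "equal":
                print(op.name, repr(''.join(b[op.b1:op.b2])))
        ```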
        
        **For improved Japanese tokenizer accuracy, please install the full Sudachi dictionary:**
        
        ```bash
        pip install sudachidict_full
        # and link sudachi to dict
        sudachipy link -t full
        ```
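        
        To check that SudachiPy picks up the linked dictionary, it can be invoked directly (a quick sanity check, independent of Deltas):
        
        ```python
        from sudachipy import dictionary
        
        # Builds a tokenizer from the currently linked dictionary; this
        # raises an error if no dictionary is available.
        tokenizer = dictionary.Dictionary().create()
        print([m.surface() for m in tokenizer.tokenize("すもももももももものうち")])
        ```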
        
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Utilities
Classifier: Topic :: Scientific/Engineering
Description-Content-Type: text/markdown
