Metadata-Version: 2.1
Name: pyuegc
Version: 15.0.0
Summary: An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29
Home-page: https://github.com/mlodewijck/pyuegc
Author: Marc Lodewijck
Author-email: mlodewijck@gmail.com
License: MIT
Project-URL: Bug Reports, https://github.com/mlodewijck/pyuegc/issues
Project-URL: Source, https://github.com/mlodewijck/pyuegc/
Keywords: Unicode,Unicode grapheme clusters,extended grapheme cluster,EGC,grapheme cluster,graphemes,segmentation
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Internationalization
Classifier: Topic :: Text Processing
Classifier: Topic :: Utilities
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# pyuegc
An implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into **extended grapheme clusters** (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation”. This package supports version&nbsp;15.0 of the Unicode standard (released on September&nbsp;13, 2022). It has been thoroughly tested against the [Unicode test file](https://www.unicode.org/Public/15.0.0/ucd/auxiliary/GraphemeBreakTest.txt).

### Installation
The easiest method to install is using pip:
```shell
pip install pyuegc
```

### UCD version
To get the version of the Unicode character database currently used:
```python
>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'15.0.0'
```

### Example usage
```python
from pyuegc import EGC

for s in ["e\u0301le\u0300ve", "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒", "기운찰만하다"]:
    egc = EGC(s)
    print(f"{len(s):>2}, {len(egc)}: {egc}")

# Output
#  7, 5: ['é', 'l', 'è', 'v', 'e']
# 20, 5: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# 15, 6: ['기', '운', '찰', '만', '하', '다']


s = "ai\u0302ne\u0301e"  # aînée
print("".join(reversed(s)))
print("".join(reversed(EGC(s))))

# Output
# éen̂ia -> wrong (diacritics are messed up)
# eénîa -> right (regardless of the Unicode normalization form)
```

### Related resources
This implementation is based on the following resources:
- [“Grapheme Clusters”, in the Unicode core specification, version&nbsp;15.0.0](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G52443)
- [Unicode Standard Annex #29: Unicode Text Segmentation, version&nbsp;41](https://www.unicode.org/reports/tr29/tr29-41.html)

### Licenses
The code is available under the [MIT license](https://github.com/mlodewijck/pyuegc/blob/main/LICENSE).

Usage of Unicode data files is governed by the [UNICODE TERMS OF USE](https://www.unicode.org/copyright.html), a copy of which is included as [UNICODE-LICENSE](https://github.com/mlodewijck/pyuegc/blob/main/UNICODE-LICENSE).


