Metadata-Version: 2.1
Name: unicategories
Version: 0.1.2
Summary: Unicode category database
Home-page: https://gitlab.com/ergoithz/unicategories
Author: Felipe A. Hernandez
Author-email: ergoithz@gmail.com
License: MIT
Keywords: unicode
Platform: any
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*
Description-Content-Type: text/markdown
Provides-Extra: codestyle
Provides-Extra: coverage
Provides-Extra: tests
License-File: LICENSE

# unicategories

Unicode category database, generated and cached on setup.

This module exposes a category dictionary containing `RangeGroup` instances,
containing all unicode category character ranges detected on your system.

## Example

```python
from unicategories import categories

upperchars = categories['Lu'].characters()  # iterator
print('Unicode uppercase caracters are "%s"' % ''.join(upperchars))
# Unicode uppercase caracters are "ABCDEF..."
```

## RangeGroup

Immutable iterable (based on tuple, with some useful methods) of (start, end)
tuples being, like python's `range`, open at the end.

This method have been chosen for memory efficiency, storing individually all
characters on memory would take a lot of memory.

RangeGroup class provides the following methods:

### `range_group.characters()`
`type: () -> typing.Iterator[str]`
```rst
Get iterator with all characters on this range group.

:returns: iterator of characters (str of size 1)
```

### `range_group.codes()`
`type: () -> typing.Iterator[int]`
```rst
Get iterator for all unicode code points contained in this range group.

:returns: iterator of character indexes (int)
```

### `range_group.has(character)`
`type: (typing.Union[str, int]) -> bool`
```rst
Get if character (or character code point) is part of this range group.

:param character: character or unicode code point to look for
:returns: True if character is contained by any range, False otherwise
```

## Unicode categories

Taken from [wikipedia](https://en.wikipedia.org/wiki/Template:General_Category_(Unicode)).

| Value  | Category Major, minor      | Basic type     | Character assigned     | Fixed                                                       | Remarks                                                                                                                   |
|--------|----------------------------|----------------|------------------------|-------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| Lu     | Letter, uppercase          | Graphic        | Character              |                                                             |                                                                                                                           |
| Ll     | Letter, lowercase          | Graphic        | Character              |                                                             |                                                                                                                           |
| Lt     | Letter, titlecase          | Graphic        | Character              |                                                             | Ligatures containing uppercase followed by lowercase letters (e.g., `ǅ` , `ǈ` , `ǋ` , and `ǲ` )                           |
| Lm     | Letter, modifier           | Graphic        | Character              |                                                             |                                                                                                                           |
| Lo     | Letter, other              | Graphic        | Character              |                                                             |                                                                                                                           |
| Mn     | Mark, nonspacing           | Graphic        | Character              |                                                             |                                                                                                                           |
| Mc     | Mark, spacing combining    | Graphic        | Character              |                                                             |                                                                                                                           |
| Me     | Mark, enclosing            | Graphic        | Character              |                                                             |                                                                                                                           |
| Nd     | Number, decimal digit      | Graphic        | Character              |                                                             | All these, and only these, have Numeric Type = De                                                                         |
| Nl     | Number, letter             | Graphic        | Character              |                                                             | Numerals composed of letters or letterlike symbols (e.g., Roman numerals )                                                |
| No     | Number, other              | Graphic        | Character              |                                                             | E.g., vulgar fractions , superscript and subscript digits                                                                 |
| Pc     | Punctuation, connector     | Graphic        | Character              |                                                             | Includes "_" underscore                                                                                                   |
| Pd     | Punctuation, dash          | Graphic        | Character              |                                                             | Includes several hyphen characters                                                                                        |
| Ps     | Punctuation, open          | Graphic        | Character              |                                                             | Opening bracket characters                                                                                                |
| Pe     | Punctuation, close         | Graphic        | Character              |                                                             | Closing bracket characters                                                                                                |
| Pi     | Punctuation, initial quote | Graphic        | Character              |                                                             | Opening quotation mark . Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage |
| Pf     | Punctuation, final quote   | Graphic        | Character              |                                                             | Closing quotation mark. May behave like Ps or Pe depending on usage                                                       |
| Po     | Punctuation, other         | Graphic        | Character              |                                                             |                                                                                                                           |
| Sm     | Symbol, math               | Graphic        | Character              |                                                             |                                                                                                                           |
| Sc     | Symbol, currency           | Graphic        | Character              |                                                             |                                                                                                                           |
| Sk     | Symbol, modifier           | Graphic        | Character              |                                                             |                                                                                                                           |
| So     | Symbol, other              | Graphic        | Character              |                                                             |                                                                                                                           |
| Zs     | Separator, space           | Graphic        | Character              |                                                             | Includes the space, but not TAB , CR , or LF , which are Cc                                                               |
| Zl     | Separator, line            | Format         | Character              |                                                             | Only U+2028 LINE SEPARATOR (LSEP)                                                                                         |
| Zp     | Separator, paragraph       | Format         | Character              |                                                             | Only U+2029 PARAGRAPH SEPARATOR (PSEP)                                                                                    |
| Cc     | Other, control             | Control        | Character              | Fixed 65                                                    | No name     , `<control>`                                                                                                 |
| Cf     | Other, format              | Format         | Character              |                                                             | Includes the soft hyphen , control characters to support bi-directional text , and language tag characters                |
| Cs     | Other, surrogate           | Surrogate      | Not (but abstract)     | Fixed 2,048                                                 | No name     , `<surrogate>`                                                                                               |
| Co     | Other, private use         | Private-use    | Not (but abstract)     | Fixed 137,468 total: 6,400 in BMP , 131,068 in Planes 15–16 | No name     , `<private-use>`                                                                                             |
| Cn     | Other, not assigned        | Noncharacter   | Not                    | Fixed 66                                                    | No name     , `<noncharacter>`                                                                                            |
| Cn     | Other, not assigned        | Reserved       | Not                    | Not fixed                                                   | No name     , `<reserved>`                                                                                                |

In addition to that, unicategories provide general categories `L`, `M`, `N`, `P`, `S`, `Z` and `C`.
