Metadata-Version: 2.1
Name: dom-tokenizers
Version: 0.0.7
Summary: DOM-aware tokenizers for 🤗 Hugging Face language models
Author: Gary Benson
License: Apache-2.0
Project-URL: Source, https://github.com/gbenson/dom-tokenizers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-magic
Requires-Dist: tokenizers
Requires-Dist: transformers
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: datasets; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Provides-Extra: train
Requires-Dist: datasets; extra == "train"
Requires-Dist: pillow; extra == "train"

<p style="float: right">
    <a href="https://badge.fury.io/py/dom-tokenizers">
         <img alt="Build" src="https://badge.fury.io/py/dom-tokenizers.svg">
    </a>
    <a href="https://github.com/gbenson/dom-tokenizers/blob/master/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/gbenson/dom-tokenizers.svg?color=blue">
    </a>
</p>

# DOM tokenizers

DOM-aware tokenizers for Hugging Face language models.

## Installation

### With PIP

```sh
pip install dom-tokenizers[train]
```

### From sources

```sh
git clone https://github.com/gbenson/dom-tokenizers.git
cd dom-tokenizers
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e .[dev,train]
```

## Train a tokenizer

### On the command line

Check everything's working using a small dataset of around 300 examples:

```sh
train-tokenizer gbenson/interesting-dom-snapshots
```

Train a tokenizer with a 10,000-token vocabulary using a dataset of
4,536 examples and upload it to the Hub:

```sh
train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
huggingface-cli login
huggingface-cli upload dom-tokenizer-10k
```
