Metadata-Version: 2.1
Name: revizor
Version: 0.1.1
Summary: Ecommerce product title recognition package
Home-page: https://github.com/bureaucratic-labs/revizor
License: MIT
Keywords: natural language processing
Author: Dima Veselov
Author-email: d.a.veselov@yandex.ru
Requires-Python: >=3.8,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Scientific/Engineering :: Artificial Intelligence
Classifier: Text Processing :: Linguistic
Requires-Dist: flair (>=0.8.0,<0.9.0)
Requires-Dist: razdel (>=0.5.0,<0.6.0)
Project-URL: Repository, https://github.com/bureaucratic-labs/revizor
Description-Content-Type: text/markdown

# revizor [![Test & Lint](https://github.com/bureaucratic-labs/revizor/actions/workflows/test-and-lint.yml/badge.svg)](https://github.com/bureaucratic-labs/revizor) [![codecov](https://codecov.io/gh/bureaucratic-labs/revizor/branch/main/graph/badge.svg?token=YHND3N25LI)](https://codecov.io/gh/bureaucratic-labs/revizor)

This package solves the task of splitting a product title string into components such as `type`, `brand`, `model` and `article` (also known as SKU or product code).  
Think of it as classic named entity recognition, applied to product titles.

## Install

`revizor` requires Python **3.8+** on **Linux or macOS**. Windows **isn't supported** yet, but contributions are welcome.

```bash
$ pip install revizor
```

## Usage

```python
from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"
```

## Boring numbers

These are taken directly from the flair training log:
```
Corpus: "Corpus: 138959 train + 15440 dev + 51467 test sentences"
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
```
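
The per-class and aggregate scores in the log can be re-derived from the `tp`/`fp`/`fn` counts. The snippet below is a small sanity check that assumes the standard definitions of precision, recall and F1; it is not flair's own evaluation code, and the `counts` and `prf` names are made up for this illustration.

```python
from typing import Dict, Tuple

# (tp, fp, fn) per class, copied from the training log above
counts: Dict[str, Tuple[int, int, int]] = {
    "ARTICLE": (9893, 1899, 3268),
    "BRAND": (47977, 2335, 514),
    "MODEL": (35187, 11824, 9995),
    "TYPE": (25044, 637, 443),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) using the standard definitions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_class = {name: prf(*c) for name, c in counts.items()}

# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

# Micro F1: pool the counts over all classes first, then compute F1 once
total_tp, total_fp, total_fn = (sum(column) for column in zip(*counts.values()))
micro_f1 = prf(total_tp, total_fp, total_fn)[2]

for name, (precision, recall, f1) in per_class.items():
    print(f"{name:<8} precision: {precision:.4f} - recall: {recall:.4f} - f1-score: {f1:.4f}")
print(f"F1-score (micro) {micro_f1:.4f}")
print(f"F1-score (macro) {macro_f1:.4f}")
```

Rounded to four decimals, these values agree with the figures reported in the log.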

## Dataset

The model was trained on an automatically annotated corpus. Since it may be affected by DMCA, we won't publish it.  
But we can give a hint on how to obtain it, can't we?  
A dataset can be created by scraping any large marketplace, like goods, yandex.market or ozon.  
We extract the product title and the table with product info, then parse the brand and model strings from that table.  
Now we have the product title, brand and model. Then we can split the product title by the brand string, e.g.:

```python
product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

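# Split on the first occurrence of the brand only, then trim the leftover spaces
product_type, product_model_plus_some_random_info = (
    part.strip() for part in product_title.split(brand, maxsplit=1)
)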

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'
```
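
To get from that split to token-level labels, one option is to tag every token according to where it falls relative to the brand and model strings. The sketch below is only an illustration under several assumptions: `annotate_title` is a hypothetical helper (not part of `revizor`), tokens are split on whitespace, everything before the brand is treated as the product type, and `ARTICLE` is left out because the product info table described above does not provide it.

```python
from typing import List, Tuple


def annotate_title(title: str, brand: str, model: str) -> List[Tuple[str, str]]:
    """Label whitespace tokens of `title` as TYPE, BRAND, MODEL or O (hypothetical helper)."""
    before, found, after = title.partition(brand)
    if not found:
        # The brand is missing from the title, so there is nothing to anchor on
        return [(token, "O") for token in title.split()]

    # Assumption: everything before the brand is the product type
    labeled = [(token, "TYPE") for token in before.split()]
    labeled += [(token, "BRAND") for token in brand.split()]

    after = after.strip()
    if model and after.startswith(model):
        labeled += [(token, "MODEL") for token in model.split()]
        after = after[len(model):]

    # Whatever is left (memory size, color, ...) gets no label
    labeled += [(token, "O") for token in after.split()]
    return labeled


labels = annotate_title(
    "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray",
    brand="Apple",
    model="iPhone 12 Pro",
)
for token, label in labels:
    print(token, label)
```

Titles where the brand is absent, or where the model string does not directly follow the brand, end up partly or fully unlabeled here; a real pipeline would need to filter or handle such cases.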

## License

This package is licensed under the MIT license.

