Metadata-Version: 2.1
Name: hash-chunker
Version: 0.1.4
Summary: Generator that yields hash chunks for distributed data processing.
Home-page: https://github.com/whysage/hash_chunker
License: MIT
Author: Volodymyr Kochetkov
Author-email: whysages@gmail.com
Requires-Python: >=3.7,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Project-URL: Bug Tracker, https://github.com/whysage/hash_chunker/issues
Project-URL: Repository, https://github.com/whysage/hash_chunker
Description-Content-Type: text/markdown

# Hash Chunker

Generator that yields hash chunks for distributed data processing.

### TLDR

```shell
pip install hash-chunker
```

```python
from hash_chunker import HashChunker

chunks = list(HashChunker().get_chunks(chunk_size=1000, all_items_count=2000))

assert chunks == [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]
```
### Description

Imagine a situation where you need to process a huge number of data rows in parallel.
Each row has a hash field, and the task is to use it for chunking.

Possible reasons for using a hash field instead of an integer id field:
- There is no auto-increment id field.
- The id field has large gaps (1, 2, 3, 100500, 100501, 1000000).
- Chunking by id would split data that must stay in one chunk across different chunks
(in user behavioral analytics, the id can auto-increment across all users' actions,
while the user_session hash is tied to a concrete user, so chunking by id may split
one user session across several chunks).
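The idea behind hash chunking can be sketched without the library: split the hex hash
space into equal ranges and hand each range to a worker. This is a minimal illustration
of the concept, not the library's actual implementation (function name and details are
hypothetical):

```python
def hash_chunks(chunk_count, hash_len=10):
    """Yield (start, stop) hex boundaries covering the whole hash space.

    Sketch only -- illustrates the principle, not hash-chunker's internals.
    """
    space = 16 ** hash_len  # total number of values a hash of this length can take
    step = space // chunk_count
    for i in range(chunk_count):
        start = format(i * step, f"0{hash_len}x")
        # Last chunk is closed at the maximum hash value so nothing is missed.
        if i == chunk_count - 1:
            stop = "f" * hash_len
        else:
            stop = format((i + 1) * step, f"0{hash_len}x")
        yield (start, stop)


chunks = list(hash_chunks(2))
# Matches the TLDR example above:
# [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]
```

Each `(start, stop)` pair can then be turned into a query filter such as
`WHERE hash > :start AND hash <= :stop` and dispatched to a separate worker.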

