Metadata-Version: 2.1
Name: datachunks
Version: 0.0.5
Summary: Data chunking for humans, particularly for data engineers
Home-page: https://github.com/amaslyaev/datachunks/
License: MIT
Author: Alexander Maslyeav
Author-email: maslyaev@gmail.com
Requires-Python: >=3.7,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Project-URL: Repository, https://github.com/amaslyaev/datachunks/
Description-Content-Type: text/markdown

# datachunks

Data chunking for humans, particularly for data engineers. Makes batched data
processing less painful and a little bit more joyful.

## Install
datachunks requires Python 3.8 or newer. Install it from PyPI:
```shell
$ pip install datachunks
```

## Usage
datachunks implements two chunking strategy:
1. "Pull" strategy. Wrap your source stream with <code>chunks</code> generator and consume a chunked data stream.
2. "Push" strategy. Create a special "feeder" object that will send data chunks to a specified consumer function.

First strategy is simple an sutable for most of applications, but second gives more flexibility in building non-trivial
in-memory processing pipelines.

### <code>chunks</code> and <code>achunks</code> functions
These functions implement the "pull" chunking strategy resectively for synchronous and async/await apllications.
```python
from datachunks import chunks

for chunk in chunks(range(12), 5):
    print(chunk)
```
Expected output:
```
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11]
```
Asynchronous version example:
```python
import asyncio
from datachunks import achunks

async def arange(*args, **kwargs):
    for i in range(*args, **kwargs):
        yield i

async def achunks_demo():
    async for chunk in achunks(arange(12), 5):
        print(chunk)

asyncio.run(achunks_demo())
```
Expected output:
```
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11]
```

### Using "push" strategy
"Push" strategy is implemented in <code>ChunkingFeeder</code> and <code>AsyncChunkingFeeder</code> objects.

Consider the situation we decided to process odd and even numbers separately. For this purpose we create two feeders
and <code>put</code> into them odd and even values. 
```python
from datachunks import ChunkingFeeder

with ChunkingFeeder(lambda c: print(f'evens: {c}'), 5) as print_evens_feeder, \
        ChunkingFeeder(lambda c: print(f'odds: {c}'), 5) as print_odds_feeder:
    for i in range(25):
        if i % 2 == 0:
            print_evens_feeder.put(i)
        else:
            print_odds_feeder.put(i)
```
Expected output:
```
evens: [0, 2, 4, 6, 8]
odds: [1, 3, 5, 7, 9]
evens: [10, 12, 14, 16, 18]
odds: [11, 13, 15, 17, 19]
odds: [21, 23]
evens: [20, 22, 24]
```
Additional features:
- It is guarandeed that all data is delivered to the callback functions after the context exit.
- It is possible to produce additional items in callback function. It allows to build flexible and even recursive data processing, but of course it is your responsibility to avoid infinite recursion.
- By default <code>ChunkingFeeder</code> calls its chunk consumer synchronously. To use multithreading specify the <code>workers_num</code> parameter.
- To use multiprocessing set the <code>multiprocessing</code> parameter to <code>True</code> in addition to <code>workers_num</code> parameter.
- The <code>AsyncChunkingFeeder</code> also supports the <code>workers_num</code> parameter, but does not support <code>multiprocessing</code>.

## ETL example
Consider a simple ETL task: we have an <code>orders.jsonl</code> file that we need to upload to some Mongo database. Sending objects one-by-one is too slow, and file is too big to opload it in one big batch. So we are going to split this data to chunks of reasonable size.

Function <code>read_jsonl</code> reads the file and yields objects one-by-one:
```python
import json

def read_jsonl():
    with open('orders.jsonl', 'r', encoding='utf-8') as jsonl:
        for jsoned_obj in jsonl:
            if jsoned_obj:
                yield json.loads(jsoned_obj)
```
The following function pulls objects through chunks generator and send objects to some MongoDB:
```python
from datachunks import chunks

def transfer_orders(db_connection):
    for chunk in chunks(read_jsonl(), chunk_size=200):
        db_connection.orders.insert_many(chunk)
```
After a while we decided to store purchase and sales orders into different MongoDB collections, so let's use two chunking feeders:
```python
from datachunks import ChunkingFeeder

class TransferOrders():
    def __init__(self, db_connection):
        self.db_connection = db_connection

    def store_purchase_orders(self, chunk):
        self.db_connection.purchase_orders.insert_many(chunk)

    def store_sales_orders(self, chunk):
        self.db_connection.sales_orders.insert_many(chunk)

    def do_transfer(self):
        with ChunkingFeeder(self.store_purchase_orders, 100, workers_num=1) as purchase_feeder, \
                ChunkingFeeder(self.store_purchase_orders, 500, workers_num=1) as sales_feeder:
            for order in read_jsonl():
                if order.get('order_type') == 'purchase':
                    purchase_feeder.put(order)
                elif order.get('order_type') == 'sales':
                    sales_feeder.put(order)

def transfer_orders(db_connection):
    TransferOrders(db_connection).do_transfer()
```
The <code>pymongo</code> library is thread-safe, so it makes sense to speed up our process by storing data in separate threads.

