Metadata-Version: 2.1
Name: cogdata
Version: 0.0.7
Summary: A lightweight data management and preprocessing tool.
Home-page: https://github.com/Sleepychord/cogdata
Author: Ming Ding, Yuxiang Chen, Wendi Zheng
Author-email: dm_thu@qq.com
License: MIT license
Platform: UNKNOWN
Requires-Python: >=3.5
Description-Content-Type: text/markdown
License-File: LICENSE

# Cogdata

## Install
```
pip install cogdata
sudo `which install_unrarlib.sh`
```
## Directory Structure
```
.
├── cogdata_task_task1
│   ├── cogdata_config.json (indicating a task path)
│   ├── merged.bin
│   ├── dataset1
│   │   ├── dataset1.bin
│   │   └── meta_info.json
│   └── dataset2
│       ├── dataset2.bin
│       └── meta_info.json
├── dataset1
│   ├── cogdata_info.json (indicating a dataset path)
│   ├── dataset1.json
│   └── dataset1.rar
└── dataset2
    ├── cogdata_info.json
    ├── dataset2.json
    └── dataset2.zip
```
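
The tree above distinguishes task folders from dataset folders by a marker file. The following is a minimal sketch of that convention, not cogdata's own code: the helper names `classify_folder` and `scan_workspace` are hypothetical, and only the two marker file names are taken from the tree.

```python
from pathlib import Path

# Marker file names as shown in the directory tree above.
TASK_MARKER = "cogdata_config.json"
DATASET_MARKER = "cogdata_info.json"


def classify_folder(path):
    """Return 'task', 'dataset', or 'unknown' depending on the marker file found."""
    folder = Path(path)
    if (folder / TASK_MARKER).is_file():
        return "task"
    if (folder / DATASET_MARKER).is_file():
        return "dataset"
    return "unknown"


def scan_workspace(root):
    """Classify every immediate subfolder of `root` (hypothetical helper)."""
    return {
        p.name: classify_folder(p)
        for p in sorted(Path(root).iterdir())
        if p.is_dir()
    }
```

Calling `scan_workspace(".")` in a folder laid out like the tree above would label `cogdata_task_task1` as a task and `dataset1`, `dataset2` as datasets.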

## Pipeline
The motivation of this project is to provide lightweight APIs for large-scale, NN-based data processing, e.g. ImageTokenization. The abstraction has 3 parts:
* **Dataset**: A raw dataset from another organization, in one of various formats (e.g. rar, zip). Its information is recorded in `cogdata_info.json` in the dataset's folder.
* **Task**: A task is a collection of "configs, results for different datasets, logs, merged results, and evenly split results". The config of a task is recorded in `cogdata_config.json`. The state (processed, hanging/running, unprocessed) of each dataset in the task is recorded in `meta_info.json`.
* **DataSaver**: The format of the saved results. The first option is our `BinarySaver`, which saves plain bytes with a fixed length per sample, so the file can be read or memory-mapped very quickly. The DataSaver config is stored together with the task config in `cogdata_config.json`.
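
To illustrate why fixed-length plain bytes are fast to read, here is a minimal sketch that memory-maps such a file with NumPy. It is not cogdata's reader: the function name `read_fixed_length_bin` is hypothetical, and the `int32` default is an assumption (the `--dtype` and `--length_per_sample` options of `create_task` suggest both values are set per task).

```python
import numpy as np


def read_fixed_length_bin(path, length_per_sample, dtype=np.int32):
    """Memory-map a flat binary file as a (num_samples, length_per_sample) array.

    Assumes the file is a headerless concatenation of samples, each holding
    exactly `length_per_sample` items of `dtype`.
    """
    flat = np.memmap(path, dtype=dtype, mode="r")
    if flat.size % length_per_sample != 0:
        raise ValueError("file size is not a multiple of the sample length")
    # Reshaping a memmap gives a view; no data is loaded until it is indexed.
    return flat.reshape(-1, length_per_sample)
```

Because every sample has the same size, sample `i` starts at a byte offset that can be computed directly, so `samples[i]` touches only the pages that hold that sample.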

### Commands
```
cogdata create_dataset  [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name
```
Alias: `cogdata data ...`. `data_format` is chosen from the class names in `cogdata.datasets`, e.g. `StreamingRarDataset`. The text-related options are only needed for text-image datasets.

```
cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
                           [--dtype {int32,int64,float32,uint8,bool}]
                           task_id
```
Alias: `cogdata task ...`. `task_type` and `saver_type` are chosen from class names in cogdata, e.g. `ImageTextTokenizationTask` or `BinarySaver`.
```
cogdata process [-h] --task_id TASK_ID [--nproc NPROC] [--dataloader_num_workers DATALOADER_NUM_WORKERS]
                       [--batch_size BATCH_SIZE] [--ratio RATIO]
                       [datasets [datasets ...]]
```
The i-th process is bound to the i-th GPU.
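
One common way to get such a one-to-one binding is to restrict each worker's view of the GPUs through `CUDA_VISIBLE_DEVICES`. The sketch below shows that approach only; whether cogdata uses this mechanism is an assumption, and `worker_env` is a hypothetical helper.

```python
import os


def worker_env(rank, base_env=None):
    """Build the environment for worker `rank` so it only sees GPU `rank`.

    Assumption: the binding is done via CUDA_VISIBLE_DEVICES; cogdata itself
    may select the device differently.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return env
```

With `--nproc 4`, such a scheme would start workers 0 to 3 and pass `worker_env(rank)` to each, so `--nproc` should not exceed the number of GPUs on the machine.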

```
cogdata merge [-h] --task_id TASK_ID
```
Merge all the processed data.
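
Conceptually, merging fixed-length binary results amounts to concatenating the per-dataset files. The sketch below assumes the layout from the directory tree (`<task>/<dataset>/<dataset>.bin` merged into `<task>/merged.bin`) and a sorted concatenation order; `merge_bins` is a hypothetical helper, and the real command may additionally consult `meta_info.json` to skip unprocessed datasets.

```python
import shutil
from pathlib import Path


def merge_bins(task_dir, output_name="merged.bin"):
    """Concatenate every <dataset>/<dataset>.bin under task_dir into one file."""
    task_dir = Path(task_dir)
    out_path = task_dir / output_name
    # One candidate .bin per dataset subfolder, in sorted (deterministic) order.
    parts = sorted(p / f"{p.name}.bin" for p in task_dir.iterdir() if p.is_dir())
    with open(out_path, "wb") as out:
        for part in parts:
            if not part.is_file():
                continue  # dataset folder without a result file yet
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streamed copy, no full load
    return out_path
```

Since all samples share one length and dtype, the concatenated file can be read exactly like a single dataset's result.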

```
cogdata list [-h] [--task_id TASK_ID]
```
List all the current datasets in this folder.
```
cogdata clean [-h] [--task_id TASK_ID]
```
Clean the unfinished states of the task.
### Customized Tasks
Add `--extra_code PATH_TO_CODE` after `cogdata` (e.g. `cogdata --extra_code ../examples/convert2tar_task.py [task or process]`) to execute and register your own task before running the command. See `examples/` for details.

## TODO List

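* [ ] Support text processing in multiple formats
* [ ] Write more detailed Sphinx docstring documentation
* [ ] Finer-grained parameter management; generalize tokenization
* [ ] Slides & video introduction
* [ ] Merge video processing [Wenyi]
* [ ] Merge object detection [Zhuoyi]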




