Metadata-Version: 2.1
Name: split_dataset
Version: 0.4.4
Summary: A package for HDF5-based chunked arrays
Home-page: https://github.com/portugueslab/split_dataset
Author: Vilim Stih & Luigi Petrucco @portugueslab
Author-email: luigi.petrucco@gmail.com
License: GNU General Public License v3
Keywords: split_dataset
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.5
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE


[![Python Version](https://img.shields.io/pypi/pyversions/split_dataset.svg)](https://pypi.org/project/split_dataset)
[![PyPI](https://img.shields.io/pypi/v/split_dataset.svg)](
    https://pypi.python.org/pypi/split_dataset)
[![Tests](https://img.shields.io/github/workflow/status/portugueslab/split_dataset/tests)](
    https://github.com/portugueslab/split_dataset/actions)
[![Coverage Status](https://coveralls.io/repos/github/portugueslab/split_dataset/badge.svg?branch=master)](https://coveralls.io/github/portugueslab/split_dataset?branch=master)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)



A minimal package for saving and reading large HDF5-based chunked arrays.

This package was developed in the [`Portugues lab`](http://www.portugueslab.com) for volumetric calcium imaging data. `split_dataset` is used extensively in the calcium imaging analysis package [`fimpy`](https://github.com/portugueslab/fimpy); the microscope control libraries [`sashimi`](https://github.com/portugueslab/sashimi) and [`brunoise`](https://github.com/portugueslab/brunoise) save files as split datasets.

[`napari-split-dataset`](https://github.com/portugueslab/napari-split-dataset) supports the visualization of `SplitDataset` objects in `napari`.

## Why use split datasets?
Split datasets are numpy-like arrays saved over multiple h5 files. The concept is similar to that of e.g. [zarr arrays](https://zarr.readthedocs.io/en/stable/); however, relying on h5 files allows for partial reading even within a single file. This is crucial for visualizing volumetric time series, the main application `split_dataset` was developed for (see [this discussion](https://github.com/zarr-developers/zarr-python/issues/521) on the limitations of zarr arrays).

# Structure of a split dataset
A split dataset is a folder that contains multiple numbered h5 files (one file per chunk) and a json metadata file with information on the shape of the full dataset and of its chunks.
The h5 files are saved using the [flammkuchen](https://github.com/portugueslab/flammkuchen) library (a fork of [deepdish](https://deepdish.readthedocs.io/en/latest/)). Each file contains a dictionary with the data under the `stack` key.

`SplitDataset` objects can then be instantiated from the dataset path, and numpy-style indexing can be used to load data as numpy arrays. Any number of dimensions and any block size are supported in principle; the package has been used mainly with 3D and 4D arrays.
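
To illustrate why chunked storage makes partial reads cheap, here is a minimal sketch of how an interval along one axis can be mapped onto fixed-size chunks. The function `chunks_for_slice` and its parameters are hypothetical and are not part of the `split_dataset` API; the package's own indexing logic may differ.

```python
def chunks_for_slice(start, stop, chunk_len, full_len):
    """Map the half-open interval [start, stop) along one axis onto fixed-size chunks.

    Returns a list of (chunk_index, local_start, local_stop) tuples, where the
    local bounds are relative to the beginning of each chunk.
    """
    # Clip the request to the extent of the full axis.
    start = max(0, start)
    stop = min(stop, full_len)

    pieces = []
    pos = start
    while pos < stop:
        chunk_index = pos // chunk_len
        chunk_start = chunk_index * chunk_len
        chunk_end = min(chunk_start + chunk_len, full_len)
        local_start = pos - chunk_start
        local_stop = min(stop, chunk_end) - chunk_start
        pieces.append((chunk_index, local_start, local_stop))
        pos = chunk_end  # continue from the first element of the next chunk
    return pieces


# A 100-frame axis stored in chunks of 20 frames; request frames 35 to 62.
pieces = chunks_for_slice(35, 63, chunk_len=20, full_len=100)
print(pieces)  # [(1, 15, 20), (2, 0, 20), (3, 0, 3)]
```

Only three of the five chunks are touched, and within the first and last of them only a part of the data needs to be read.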



## Minimal example
```python
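from split_dataset import SplitDataset

# Load a split dataset from its folder (placeholder path):
path_to_dataset = "path/to/dataset"
ds = SplitDataset(path_to_dataset)

# Retrieve data in an interval along the first axis (placeholder bounds, 4D dataset assumed):
n_start, n_end = 0, 10
data_array = ds[n_start:n_end, :, :, :]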
```

## Creating split datasets
New split datasets can be created with the `split_dataset.save_to_split_dataset` function, provided that the original data is fully loaded in memory. Alternatively, e.g. for time-series acquisitions, a split dataset can be saved one chunk at a time. It is enough to save correctly formatted .h5 files with `flammkuchen`, together with the corresponding json metadata file describing the shape of the full split dataset (this is [what happens in sashimi](https://github.com/portugueslab/sashimi/blob/01046f2f24483ab702be379843a1782ababa7d2d/sashimi/processes/streaming_save.py#L186)).
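
The bookkeeping for chunk-at-a-time saving can be sketched as follows. `ChunkBuffer` and its writer callback are hypothetical names introduced for illustration and are not part of `split_dataset` or `sashimi`. A real writer would save each chunk with `flammkuchen` under the `stack` key and write the json metadata once the final shape is known. Writing a shorter final chunk is an assumption of this sketch.

```python
class ChunkBuffer:
    """Collect frames and hand over full chunks to a writer callback (hypothetical helper)."""

    def __init__(self, frames_per_chunk, write_chunk):
        self.frames_per_chunk = frames_per_chunk
        self.write_chunk = write_chunk  # called as write_chunk(chunk_index, frames)
        self.frames = []
        self.n_chunks = 0
        self.n_frames = 0

    def add(self, frame):
        """Buffer one frame and write a chunk as soon as the buffer is full."""
        self.frames.append(frame)
        self.n_frames += 1
        if len(self.frames) == self.frames_per_chunk:
            self.flush()

    def flush(self):
        """Write any buffered frames as one (possibly shorter) chunk."""
        if self.frames:
            self.write_chunk(self.n_chunks, self.frames)
            self.n_chunks += 1
            self.frames = []


# Stand-in writer that only records the index and length of each chunk.
written = []
buffer = ChunkBuffer(
    frames_per_chunk=4,
    write_chunk=lambda index, frames: written.append((index, len(frames))),
)
for t in range(10):
    buffer.add([t])  # a real frame would be an image or a volume
buffer.flush()  # write the last, incomplete chunk

print(written)  # [(0, 4), (1, 4), (2, 2)]
```

After the acquisition ends, `buffer.n_frames` and `buffer.n_chunks` hold the totals needed to describe the full dataset in the metadata file.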


# TODO
* provide utilities for partial saving of split datasets
* support for more advanced indexing (step and vector indexing)
* support for cropping a `SplitDataset`
* support for resolution and frequency metadata


# History

### 0.4.0 (2021-03-23)
* Added support to use a `SplitDataset` as data in a `napari` layer.

...

### 0.1.0 (2020-05-06)
* First release on PyPI.


Credits
-------

Part of this package was inspired by [Cookiecutter](https://github.com/audreyr/cookiecutter) and the [cookiecutter-pypackage](https://github.com/audreyr/cookiecutter-pypackage) template.



