# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['chunkr']

package_data = \
{'': ['*']}

install_requires = \
['fsspec>=2022.7.1,<2023.0.0',
 'paramiko>=2.11.0,<3.0.0',
 'pyarrow>=11.0.0,<12.0.0']

setup_kwargs = {
    'name': 'chunkr',
    'version': '0.3.0',
    'description': 'A library for chunking different types of data files.',
    'long_description': '# chunkr\n[![PyPI version][pypi-image]][pypi-url]\n<!-- [![Build status][build-image]][build-url] -->\n<!-- [![Code coverage][coverage-image]][coverage-url] -->\n<!-- [![GitHub stars][stars-image]][stars-url] -->\n[![Support Python versions][versions-image]][versions-url]\n\n\nA python library for the purpose of chunking different types of data files, without having to store the whole file in memory.\n\nchunkr creates chunks from the source file with a user defined chunk size, then returns an iterator to loop over the resulting batches sequentially.\n\nThe type of a resulting batch is PyArrow\'s [Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow-table) due to PyArrow\'s [performance](https://towardsdatascience.com/stop-using-pandas-to-read-write-data-this-alternative-is-7-times-faster-893301633475) in reading & writing data files.\n\nIt\'s also possible to create a directory which contains the chunks as parquet files (currently only parquet is possible, new suggestions are welcomed), which will be cleaned up automatically when the user is done with the resulting files.\n\nCurrently supported input formats: csv, parquet\n\n# Getting started\n\n```bash\npip install chunkr\n```\n\n# Usage\n\n## Iterate over resulting batches\n\nCSV input:\n\n```py\nfrom chunkr import create_csv_chunk_iter\n\nwith create_csv_chunk_iter(path, chunk_size, storage_options, **extra_args) as chunk_iter:\n    # process chunks\n    for chunk in chunk_iter:\n        # process chunk.to_pandas() or sth\n\n```\n\nParquet:\n\n```py\nfrom chunkr import create_parquet_chunk_iter\n\nwith create_parquet_chunk_iter(path, chunk_size, storage_options, **extra_args) as chunk_iter:\n    # process chunks\n    for chunk in chunk_iter:\n        # process chunk.to_pandas() or sth\n\n```\n\nparameters:\n\n- path (str): the path of the input (local, sftp etc, see [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) for possible inputs, not everything is supported though)\n- chunk_size (int, optional): number of records in a chunk. Defaults to 100_000.\n- storage_options (dict, optional): extra options to pass to the underlying storage e.g. username, password etc. Defaults to None.\n- extra_args (dict, optional): extra options passed on to the parsing system, file type specific\n\n\n## Create a directory containing the chunks as Parquet files\n\nCSV input:\n\n```py\nfrom chunkr import create_csv_chunk_dir\n\nwith create_csv_chunk_dir(input_filepath, output_dir, chunk_size, storage_options, write_options, exclude, **extra_args) as chunks_dir:\n    # process chunk files inside dir\n    pd.read_parquet(file) for file in chunks_dir.iterdir()\n    # the directory will be deleted when the context manager exits \n```\n\nor Parquet:\n\n```py\nfrom chunkr import create_csv_chunk_dir\n\nwith create_csv_chunk_dir(input_filepath, output_dir, chunk_size, storage_options, write_options, exclude, **extra_args) as chunks_dir:\n    # process chunk files inside dir\n    pd.read_parquet(file) for file in chunks_dir.iterdir()\n    # the directory will be deleted when the context manager exits\n```\n\n\nparameters:\n\n- path (str): the path of the input (local, sftp etc, see fsspec for possible input)\n- output_path (str): the path of the directory to output the chunks to\n- chunk_size (int, optional): number of records in a chunk. Defaults to 100_000.\n- storage_options (dict, optional): extra options to pass to the underlying storage e.g. username, password etc. Defaults to None.\n- write_options (dict, optional): extra options for writing the chunks passed to PyArrow\'s [write_table()](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html) function. Defaults to None.\n- extra_args (dict, optional): extra options passed on to the parsing system, file specific\n\n>**Note**: currently chunkr only supports parquet as the output chunk files format\n\n# Additional examples\n\n\n## CSV input\n\nSuppose you want to chunk a csv file of 1 million records into 10 parquet pieces, you can do the following:\n\nCSV extra args are passed to PyArrows [Parsing Options](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions)\n\n```py\nfrom chunkr import create_csv_chunk_dir\nimport pandas as pd\n\nwith create_csv_chunk_dir(\n            \'path/to/file\',\n            \'temp/output\',\n            chunk_size=100_000,\n            quote_char=\'"\',\n            delimiter=\',\',\n            escape_char=\'\\\\\',\n    ) as chunks_dir:\n\n        assert 1_000_000 == sum(\n            len(pd.read_parquet(file)) for file in chunks_dir.iterdir()\n        )\n```\n\n## Parquet input\n\nParquet extra args are passed to PyArrows [iter_batches()](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches) function\n\n```py\nfrom chunkr import create_parquet_chunk_dir\nimport pandas as pd\n\nwith create_parquet_chunk_dir(\n            \'path/to/file\',\n            \'temp/output\',\n            chunk_size=100_000,\n            columns=[\'id\', \'name\'],\n    ) as chunks_dir:\n\n        assert 1_000_000 == sum(\n            len(pd.read_parquet(file)) for file in chunks_dir.iterdir()\n        )\n```\n\n## Reading file(s) inside an archive (zip, tar)\n\nreading multiple files from a zip archive is possible, for csv files in `/folder_in_archive/*.csv` within an archive `csv/archive.zip` you can do:\n\n```py\nfrom chunkr import create_csv_chunk_iter\nimport pandas as pd\n\npath = \'zip://folder_in_archive/*.csv::csv/archive.zip\'\nwith create_csv_chunk_iter(path) as chunk_iter:\n    assert 1_000_000 == sum(len(chunk) for chunk in chunk_iter)\n```\n\nThe only exception is when particularly reading a csv file from a tar.gz, there can be **only 1 csv file** within the archive:\n\n```py\nfrom chunkr import create_csv_chunk_iter\nimport pandas as pd\n\npath = \'tar://*.csv::csv/archive_single.tar.gz\'\nwith create_csv_chunk_iter(path) as chunk_iter:\n    assert 1_000_000 == sum(len(chunk) for chunk in chunk_iter)\n```\n\nbut it\'s okay for other file types like parquet:\n\n```py\nfrom chunkr import create_parquet_chunk_iter\nimport pandas as pd\n\npath = \'tar://partition_idx=*/*.parquet::test/parquet/archive.tar.gz\'\nwith create_parquet_chunk_iter(path) as chunk_iter:\n    assert 1_000_000 == sum(len(chunk) for chunk in chunk_iter)\n```\n\n## Reading from an SFTP remote system\n\nTo authenticate to the SFTP server, you can pass the credentials via storage_options:\n\n```py\nfrom chunkr import create_parquet_chunk_iter\nimport pandas as pd\n\nsftp_path = f"sftp://{sftpserver.host}:{sftpserver.port}/parquet/pyarrow_snappy.parquet"\n\nwith create_parquet_chunk_iter(\n        sftp_path,\n        storage_options={\n            "username": "user",\n            "password": "pw",\n        }\n    ) as chunk_iter:\n    assert 1_000_000 == sum(len(chunk) for chunk in chunk_iter)\n```\n\nReading from a URL\n\n```py\nfrom chunkr import create_parquet_chunk_iter\nimport pandas as pd\n\nurl = "https://example.com/1mil.parquet"\n\nwith create_parquet_chunk_iter(url) as chunk_iter:\n    assert 1_000_000 == sum(len(chunk) for chunk in chunk_iter)\n```\n\n<!-- Badges -->\n\n[pypi-image]: https://img.shields.io/pypi/v/chunkr\n[pypi-url]: https://pypi.org/project/chunkr/\n[build-image]: https://github.com/1b5d/chunkr/actions/workflows/build.yaml/badge.svg\n[build-url]: https://github.com/1b5d/chunkr/actions/workflows/build.yaml\n[coverage-image]: https://codecov.io/gh/1b5d/chunkr/branch/main/graph/badge.svg\n[coverage-url]: https://codecov.io/gh/1b5d/chunkr/\n[stars-image]: https://img.shields.io/github/stars/1b5d/chunkr\n[stars-url]: https://github.com/1b5d/chunkr\n[versions-image]: https://img.shields.io/pypi/pyversions/chunkr\n[versions-url]: https://pypi.org/project/chunkr/\n',
    'author': '1b5d',
    'author_email': '8110504+1b5d@users.noreply.github.com',
    'maintainer': 'None',
    'maintainer_email': 'None',
    'url': 'https://github.com/1b5d/chunkr',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'python_requires': '>=3.8,<4.0',
}


setup(**setup_kwargs)
