Metadata-Version: 2.4
Name: sql2arrow
Version: 0.1.0
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python
Classifier: Environment :: Console
Summary: This is a Python library that provides convenient and high-performance methods to parse INSERT SQL statements into Arrow arrays.
Keywords: arrow,sql,mysql,rust
Author: zhan zhang
Author-email: goalzz85@gmail.com
Maintainer-email: zhan zhang <goalzz85@gmail.com>
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: homepage, https://github.com/goalzz85/sql2arrow

# SQL2Arrow

SQL2Arrow is a Python library that provides convenient, high-performance methods for parsing SQL INSERT statements into Arrow arrays. It is useful for analyzing data dumped by mysqldump or similar tools.

## How to use

### Parsing a SQL string
```python
import sql2arrow

sql_str = '''
INSERT INTO `region` VALUES
	('', '', '2023-01-31 18:00:48', '2023-01-31 18:00:48', ''),
	('1541947646568607746', 'region name', '2022-06-29 08:52:21', '2022-06-29 08:52:21', 'D99'),
	('1541947680890597378', 'region name1', '2022-06-29 08:52:29', '2022-06-29 08:52:29', 'D98'),
	('620422117205', 'region name7', '2021-10-25 18:23:48', '2021-10-25 18:23:48', 'D620422117');
'''

columns = [
    sql2arrow.Column("region_code", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("region_name", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("create_time", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("update_time", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("parent_region_code", sql2arrow.ArrowTypes.utf8())
]

arrow_data = sql2arrow.parse_sql(sql_str, columns)
```


### Parsing SQL files
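
The example in this section reads gzip-compressed dump files. To try it without a real database, a small test file can be generated with the standard library. This is only a minimal sketch: the helper names `sql_quote` and `write_sample_dump`, the sample rows, and the output path are illustrative assumptions and not part of sql2arrow.

```python
import gzip
import os
import tempfile

# Hypothetical sample rows, shaped like the `region` table used in this README.
SAMPLE_ROWS = [
    ("1541947646568607746", "region name", "2022-06-29 08:52:21", "2022-06-29 08:52:21", "D99"),
    ("1541947680890597378", "region name1", "2022-06-29 08:52:29", "2022-06-29 08:52:29", "D98"),
]


def sql_quote(value: str) -> str:
    # Minimal MySQL-style quoting: escape backslashes and single quotes.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def write_sample_dump(path: str, rows=SAMPLE_ROWS) -> str:
    # Build one multi-row INSERT statement, as mysqldump does by default.
    values = ",\n".join("\t(" + ", ".join(sql_quote(v) for v in row) + ")" for row in rows)
    sql = "INSERT INTO `region` VALUES\n" + values + ";\n"
    # "wt" opens the gzip stream in text mode.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(sql)
    return sql


sample_path = os.path.join(tempfile.mkdtemp(), "region.sql_0.gz")
sample_sql = write_sample_dump(sample_path)
print(sample_path)
```

The printed path can then be used as an entry of `sql_paths` in the example that follows.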

```python
import sql2arrow

sql_paths = [
    "region.sql_0.gz",
    "region.sql_1.gz",
    "region.sql_2.gz",
    "region.sql_3.gz",
    "region.sql_4.gz",
    "region.sql_5.gz",
    "region.sql_6.gz",
]

columns = [
    sql2arrow.Column("region_code", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("region_name", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("create_time", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("update_time", sql2arrow.ArrowTypes.utf8()),
    sql2arrow.Column("parent_region_code", sql2arrow.ArrowTypes.utf8())
]


partition_func_spec = sql2arrow.partition.IcebergPartitionFuncSpec()
partition_func_spec.add_partition("region_code", sql2arrow.partition.IcebergTransforms.bucket(30))

# Load the files and partition the rows with the partition function spec.
partitioned_arrs = sql2arrow.load_sqls_with_partition_func(
    sql_paths,
    columns,
    partition_func_spec,
    sql2arrow.CompressionType.GZIP,
    sql2arrow.Dialect.MYSQL,
)

# Load the files one by one, without partitioning.
arrs = sql2arrow.load_sqls(
    sql_paths, columns, sql2arrow.CompressionType.GZIP, sql2arrow.Dialect.MYSQL
)
```
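
For readers unfamiliar with `bucket(30)`: the Apache Iceberg specification defines the bucket transform as a 32-bit Murmur3 hash of the value (for strings, of the UTF-8 bytes), masked to a non-negative integer and taken modulo the number of buckets. The sketch below is a plain-Python illustration of that definition. It assumes that sql2arrow follows the Iceberg specification here, and it is not the library's own implementation; the function names `murmur3_32` and `iceberg_bucket` are illustrative.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), returned as an unsigned integer."""
    mask = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & mask

    def mix(k: int) -> int:
        k = (k * c1) & mask
        k = rotl(k, 15)
        return (k * c2) & mask

    h = seed & mask
    body_len = len(data) - len(data) % 4
    # Process the input in 4-byte little-endian blocks.
    for i in range(0, body_len, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask
    # Mix in the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))
    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def iceberg_bucket(value: str, num_buckets: int) -> int:
    """Bucket index for a string value, per the Iceberg bucket transform."""
    return (murmur3_32(value.encode("utf-8")) & 0x7FFFFFFF) % num_buckets


for code in ["1541947646568607746", "1541947680890597378", "620422117205"]:
    print(code, "->", iceberg_bucket(code, 30))
```

Under this definition, equal `region_code` values always map to the same bucket, and every bucket index falls in the range 0 to 29.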


## arro3

SQL2Arrow uses arro3 as its default Python library for Apache Arrow. Thanks to the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html), arro3 arrays can be passed to other libraries that support the interface, including PyArrow, Polars (v1.2+), pandas (v2.2+), nanoarrow, and more, without copying the underlying memory.

```python
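# Continues from the file-loading example above, where `arrs` is the result of load_sqls.

import pyarrow as pa

# Column names, in the same order as `columns`.
names = ["region_code", "region_name", "create_time", "update_time", "parent_region_code"]
tables = [pa.Table.from_arrays(a, names=names) for a in arrs]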
```

## Limitations

### Dialect
Only MySQL INSERT statements are currently supported; PostgreSQL support is planned.
