Metadata-Version: 2.1
Name: pyarrow_ops
Version: 0.0.3
Summary: Useful data crunching tools for pyarrow
Home-page: https://github.com/TomScheffers/pyarrow_ops
Author: Tom Scheffers
Author-email: tom@youngbulls.nl 
License: APACHE
Download-URL: https://pypi.org/project/pyarrow-ops/
Description: # Pyarrow ops
        Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, implemented using only numpy. For convenience, function naming and behavior try to replicate those of the Pandas API. Performance is decent, but it can be improved significantly by utilizing pyarrow.compute functions or by improving the numpy algorithms.
        
        Current use cases:
        - Data operations like joins, groupby, filters & drop_duplicates
        - Pre-processing for ML applications
        
        ## Installation
        
        Use the package manager [pip](https://pip.pypa.io/en/stable/) to install pyarrow_ops.
        
        ```bash
        pip install pyarrow_ops
        ```
        
        ## Usage
        See the test_*.py files for runnable examples.
        
        Data operations:
        ```python
        import pyarrow as pa 
        from pyarrow_ops import join, filters, groupby, head, drop_duplicates
        
        # Create pyarrow.Table
        t = pa.Table.from_pydict({
            'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
            'Max Speed': [380., 370., 24., 26., 24.]
        })
        head(t) # Use head to print, like df.head()
        
        # Drop duplicates based on column values
        d = drop_duplicates(t, on=['Animal'], keep='first')
        
        # Groupby iterable
        for key, value in groupby(t, ['Animal']):
            print(key)
            head(value)
        
        # Group by aggregate functions
        g = groupby(t, ['Animal']).sum()
        g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})
        
        # Filter using a list of (column, operation, value) predicate tuples
        f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])
        
        # Join operations (currently performs inner join)
        t2 = pa.Table.from_pydict({
            'Animal': ['Falcon', 'Parrot'],
            'Age': [10, 20]
        })
        j = join(t, t2, on=['Animal'])
        ```
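        For intuition, each (column, operation, value) predicate can be thought of as producing a boolean mask over the column, with the masks combined afterwards. The sketch below illustrates that idea in plain numpy; the names (`predicate_mask`, `OPS`) are illustrative, not pyarrow_ops internals.

        ```python
        import numpy as np

        # Map comparison operators to numpy ufuncs; membership tests use np.isin.
        OPS = {
            '<':  np.less,
            '<=': np.less_equal,
            '>':  np.greater,
            '>=': np.greater_equal,
            '=':  np.equal,
            '!=': np.not_equal,
        }

        def predicate_mask(values, op, value):
            """Evaluate a single (column, op, value) predicate as a boolean mask."""
            if op == 'in':
                return np.isin(values, value)
            if op == 'not in':
                return ~np.isin(values, value)
            return OPS[op](values, value)

        speeds = np.array([380., 370., 24., 26., 24.])
        mask = predicate_mask(speeds, '<', 25)
        print(mask.tolist())  # [False, False, True, False, True]
        ```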
        
        ML Preprocessing (note: personal tests showed a ~5x speed-up compared to pandas on large datasets):
        ```python
        import pyarrow as pa 
        from pyarrow_ops import head, TableCleaner
        
        # Training data
        t1 = pa.Table.from_pydict({
            'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
            'Max Speed': [380., 370., None, 26., 24.],
            'Value': [2000, 1500, 10, 30, 20],
        })
        
        # Create TableCleaner & register columns to be processed
        cleaner = TableCleaner()
        cleaner.register_numeric('Max Speed', impute='min', clip=True)
        cleaner.register_label('Animal', categories=['Goose', 'Falcon'])
        cleaner.register_one_hot('Animal')
        
        # Clean table and split into train/test
        X, y = cleaner.clean_table(t1, label='Value')
        X_train, X_test, y_train, y_test = cleaner.split(X, y)
        
        # Train a model + Save cleaner settings
        cleaner_dict = cleaner.to_dict()
        
        # Prediction data
        t2 = pa.Table.from_pydict({
            'Animal': ['Falcon', 'Goose', 'Parrot', 'Parrot'],
            'Max Speed': [380., 10., None, 26.]
        })
        new_cleaner = TableCleaner().from_dict(cleaner_dict)
        X_pred = new_cleaner.clean_table(t2)
        ```
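        To make the `impute='min'` option concrete, here is a minimal numpy sketch of min-imputation on a column with missing values. This is illustrative of the idea only, not the TableCleaner's actual internals.

        ```python
        import numpy as np

        def impute_min(values):
            """Replace missing values with the column minimum (sketch)."""
            arr = np.array(values, dtype=float)   # None becomes nan
            fill = np.nanmin(arr)                 # minimum over non-missing values
            return np.where(np.isnan(arr), fill, arr)

        print(impute_min([380., 370., None, 26., 24.]).tolist())
        # [380.0, 370.0, 24.0, 26.0, 24.0]
        ```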
        
        ### To Do's
        - [x] Improve groupby speed by not creating copies of the table
        - [x] Add ML cleaning class
        - [ ] Add unit tests using pytest
        - [ ] Add window functions on Grouping class
        - [ ] Improve speed of groupby by avoiding for loops
        - [ ] Allow functions to be methods of pa.Table* (e.g. t.groupby())
        - [ ] Extend the pq.ParquetDataset with methods
        - [ ] Add more join options (left, right, outer, full, cross)
        
        *One of the main difficulties is that the pyarrow classes are written in C and do not have a `__dict__` attribute, which hinders inheritance and the addition of methods.
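        The same limitation can be illustrated in plain Python: instances of classes without a per-instance `__dict__` (here mimicked with `__slots__`; C extension types like pa.Table behave analogously) reject new attributes, so methods cannot simply be monkey-patched on.

        ```python
        class Slotted:
            __slots__ = ('data',)  # stand-in for a __dict__-less C extension type

        obj = Slotted()
        obj.data = [1, 2, 3]      # declared slot: assignment works
        try:
            obj.groupby = lambda cols: None   # undeclared attribute: rejected
        except AttributeError:
            print("cannot attach new methods to a __dict__-less instance")
        ```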
        
        ## Relation to pyarrow
        In the future, many of these functions may be made obsolete by enhancements to the pyarrow package, but for now this is a convenient alternative to switching back and forth between pyarrow and pandas.
        
        ## Contributing
        Pull requests are very welcome. However, I believe that 80% of the utility lies in 20% of the code, and I personally get lost reading the trenches of the pandas source code. If you would like to seriously improve this work, please let me know!
Keywords: arrow,pyarrow,data
Platform: UNKNOWN
Description-Content-Type: text/markdown
