Metadata-Version: 2.1
Name: dynamo-pandas
Version: 1.3.0
Summary: Make working with pandas dataframe and AWS DynamoDB easy.
Home-page: https://github.com/DrGFreeman/dynamo-pandas
Author: Julien de la Bruère-Terreault
Author-email: drgfreeman@tuta.io
Maintainer: Julien de la Bruère-Terreault
Maintainer-email: drgfreeman@tuta.io
License: MIT
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: boto3
License-File: LICENSE

[![unit-tests-linux](https://github.com/drgfreeman/dynamo-pandas/actions/workflows/linux-checks.yml/badge.svg)](https://github.com/DrGFreeman/dynamo-pandas/actions/workflows/linux-checks.yml)
[![unit-tests-windows](https://github.com/drgfreeman/dynamo-pandas/actions/workflows/windows-checks.yml/badge.svg)](https://github.com/DrGFreeman/dynamo-pandas/actions/workflows/windows-checks.yml)
[![Documentation Status](https://readthedocs.org/projects/dynamo-pandas/badge/?version=latest)](https://dynamo-pandas.readthedocs.io/en/latest/?badge=latest)

# dynamo-pandas
Make working with pandas data and AWS DynamoDB easy.

## Motivation
This package aims at making the transfer of data between pandas dataframes and DynamoDB as simple as possible. To meet this goal, the package offers two key features:
1. Automatic conversion of pandas data types to DynamoDB supported data types.
1. A simple, high-level interface to *put* data from a dataframe into a DynamoDB table and *get* all or selected items from a table into a dataframe.


## Documentation

The project's documentation is available at https://dynamo-pandas.readthedocs.io/.


## Requirements
* `python>=3.7`
* `pandas>=1`
* `boto3`

## Installation

```
python -m pip install dynamo-pandas
```

This installs the package and its dependencies except for `boto3`, which is left out by default to avoid an unnecessary installation when building Lambda layers.

To include `boto3` as part of the installation, add the `boto3` "extra" this way:

```
python -m pip install dynamo-pandas[boto3]
```

## Example Usage

Consider the pandas DataFrame below.


```python
>>> print(players_df)

      player_id           last_play       play_time  rating  bonus_points
0    player_one 2021-01-18 22:47:23 2 days 17:41:55     4.3             3
1    player_two 2021-01-19 19:07:54 0 days 22:07:34     3.8             1
2  player_three 2021-01-21 10:22:43 1 days 14:01:19     2.5             4
3   player_four 2021-01-22 13:51:12 0 days 03:45:49     4.8          <NA>
```

The columns of the dataframe use different data types, some of which are not natively supported by DynamoDB, such as numpy's `datetime64` and `timedelta64` and pandas' nullable integers.


```python
>>> players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   player_id     4 non-null      object
 1   last_play     4 non-null      datetime64[ns]
 2   play_time     4 non-null      timedelta64[ns]
 3   rating        4 non-null      float64
 4   bonus_points  3 non-null      Int8
dtypes: Int8(1), datetime64[ns](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 264.0+ bytes
```
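
For readers who want to follow along, a dataframe with the same columns and dtypes can be built with pandas alone. This is a sketch only: the README does not show how `players_df` was created, so the construction below is an assumption that merely reproduces the printed values.

```python
import pandas as pd

# Assumed construction of the example dataframe; only the values and dtypes
# come from the output shown above.
players_df = pd.DataFrame(
    {
        "player_id": ["player_one", "player_two", "player_three", "player_four"],
        "last_play": pd.to_datetime(
            [
                "2021-01-18 22:47:23",
                "2021-01-19 19:07:54",
                "2021-01-21 10:22:43",
                "2021-01-22 13:51:12",
            ]
        ),
        "play_time": pd.to_timedelta(
            ["2 days 17:41:55", "0 days 22:07:34", "1 days 14:01:19", "0 days 03:45:49"]
        ),
        "rating": [4.3, 3.8, 2.5, 4.8],
        # Nullable integer column: None becomes <NA> instead of forcing floats.
        "bonus_points": pd.array([3, 1, 4, None], dtype="Int8"),
    }
)

print(players_df.dtypes)
```

The `Int8` extension dtype is what lets `bonus_points` hold a missing value while staying an integer column.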

Storing the rows of this dataframe to DynamoDB requires multiple data type conversions.

```python
>>> from dynamo_pandas import put_df, get_df, keys
```

The `put_df` function adds or updates the rows of a dataframe in the specified table, taking care of the required type conversions. The table must already exist and the dataframe must contain the primary key column(s).

```python
>>> put_df(players_df, table="players")
```

The `get_df` function retrieves the items matching the specified key(s) from the table into a dataframe.


```python
>>> df = get_df(table="players", keys=[{"player_id": "player_three"}, {"player_id": "player_one"}])
>>> print(df)

   bonus_points     player_id            last_play  rating        play_time
0             4  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1             3    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55
```

In the case where only a partition key is used, the `keys` function simplifies the generation of the keys list.


```python
>>> df = get_df(table="players", keys=keys(player_id=["player_two", "player_four"]))
>>> print(df)

   bonus_points    player_id            last_play  rating        play_time
0           1.0   player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
1           NaN  player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
```
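
The keys list is one dictionary per item, as in the explicit list passed to `get_df` earlier. The pure-Python sketch below illustrates that shape for a single partition key. `partition_keys` is a hypothetical stand-in written for this illustration; it is not the library's `keys` function and makes no claim about how that function is implemented.

```python
def partition_keys(name, values):
    """Build one single-attribute key dict per value (hypothetical helper)."""
    return [{name: value} for value in values]


key_list = partition_keys("player_id", ["player_two", "player_four"])
print(key_list)
# [{'player_id': 'player_two'}, {'player_id': 'player_four'}]
```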

The data types returned by the `get_df` function are basic types and no automatic type conversion is attempted.


```python
>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      float64
 1   player_id     2 non-null      object
 2   last_play     2 non-null      object
 3   rating        2 non-null      float64
 4   play_time     2 non-null      object
dtypes: float64(2), object(3)
memory usage: 208.0+ bytes
```

The `dtype` parameter of the `get_df` function allows specifying the desired data types.

```python
>>> df = get_df(
...     table="players",
...     keys=keys(player_id=["player_two", "player_four"]),
...     dtype={
...         "bonus_points": "Int8",
...         "last_play": "datetime64[ns, UTC]",
...         # "play_time": "timedelta64[ns]"  # See note below.
...     }
... )
```

**Note**: Due to a known bug in pandas, timedelta strings cannot currently be converted back to the Timedelta type via this parameter (ref. https://github.com/pandas-dev/pandas/issues/38509). Use the `pandas.to_timedelta` function instead:


```python
>>> import pandas as pd
>>> df.play_time = pd.to_timedelta(df.play_time)
>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      Int8
 1   player_id     2 non-null      object
 2   last_play     2 non-null      datetime64[ns, UTC]
 3   rating        2 non-null      float64
 4   play_time     2 non-null      timedelta64[ns]
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes
```
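
The same conversions can be tried without a DynamoDB table. The sketch below uses pandas only and starts from a hand-built frame whose values mimic the untyped `get_df` output shown earlier. The frame `raw` and the explicit conversion calls are assumptions for illustration, not what `get_df` does internally.

```python
import pandas as pd

# Hand-built stand-in for the untyped result: strings and floats only.
raw = pd.DataFrame(
    {
        "bonus_points": [1.0, float("nan")],
        "player_id": ["player_two", "player_four"],
        "last_play": ["2021-01-19 19:07:54", "2021-01-22 13:51:12"],
        "rating": [3.8, 4.8],
        "play_time": ["0 days 22:07:34", "0 days 03:45:49"],
    }
)

converted = raw.assign(
    # Integer-valued floats with NaN map to nullable integers with <NA>.
    bonus_points=raw["bonus_points"].astype("Int8"),
    # Parse the timestamp strings as timezone-aware UTC datetimes.
    last_play=pd.to_datetime(raw["last_play"], utc=True),
    # Parse the duration strings with to_timedelta, as in the note above.
    play_time=pd.to_timedelta(raw["play_time"]),
)

print(converted.dtypes)
```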

Omitting the `keys` parameter performs a scan of the table and returns all the items.


```python
>>> df = get_df(table="players")
>>> print(df)

   bonus_points     player_id            last_play  rating        play_time
0           4.0  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1           NaN   player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
2           3.0    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55
3           1.0    player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
```

## License

Released under the terms of the [MIT License](LICENSE).

