Metadata-Version: 2.1
Name: cartwright
Version: 0.0.2
Summary: A recurrent neural network paired with heuristic methods that automatically infer geospatial, temporal and feature columns
License: LGPL-3.0-or-later
Author: Kyle Marsh
Author-email: kyle@jataware.com
Requires-Python: >=3.7,<4.0
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: arrow (==1.0.3)
Requires-Dist: faker (>=14.0)
Requires-Dist: fuzzywuzzy (==0.18.0)
Requires-Dist: joblib (==1.0.1)
Requires-Dist: numpy (>=1.19)
Requires-Dist: pandas (>=1.1)
Requires-Dist: pydantic (==1.8.2)
Requires-Dist: python-levenshtein (==0.20.7)
Requires-Dist: scipy (>=1.5)
Requires-Dist: torch (>=1.8)
Requires-Dist: torchvision (>=0.9)
Description-Content-Type: text/markdown


# Cartwright
![Tests](https://github.com/jataware/geotime_classify/actions/workflows/tests.yml/badge.svg)

Cartwright categorizes spatial and temporal features in a dataset.

Cartwright uses natural language processing and heuristic
functions to determine the best-guess categorization of a feature.
Given a dataframe that is expected to contain geospatial and temporal
columns, the goal of this project is to automatically infer:

-   Country
-   Admin levels (0 through 3)
-   Timestamp (from arbitrary formats)
-   Latitude
-   Longitude
-   Dates (including format)
-   Time resolution for date columns


The model and transformation code can be used locally by installing
the pip package or by downloading the GitHub repo and following the directions
found in /docs.

# Simple use case

Cartwright can classify the features of a dataframe, which helps
automate tasks that normally require a human in the loop.
As a simple example, suppose we have a data pipeline that ingests dataframes and
creates a standard timeseries plot or a map with datapoints. The problem is that these new dataframes
are not standardized, and we have no way of knowing which columns contain date or location data.
By using Cartwright we can automatically infer which columns are dates or coordinate values and
continue with our pipeline.

Here is the example dataframe:

| x_value  |  y_value   | date_value | precip_value |
|:---------|:----------:|-----------:|--------|
| 7.942658 | 107.240322 | 07/14/1992 | .2     |
| 7.943745 | 137.240633 | 07/15/1992 | .1     |
| 7.943725 | 139.240664 | 07/16/1992 | .3     |


Python code example and output:
    
    from cartwright import categorize

    # Classify every column of the CSV file
    cartwright = categorize.CartwrightClassify()
    categorizations = cartwright.columns_categorized(path="path/to/csv.csv")
    for column, category in categorizations.items():
        print(column, category)

The output below shows that x_value and y_value were assigned the geo category, with the subcategories latitude and longitude respectively. In some cases the two are impossible to tell apart, since every valid latitude value is also a valid longitude value. For the date feature, the category is time and the subcategory is date. The inferred format is correct, and the time resolution was identified as one day.


    x_value {'category': <Category.geo: 'geo'>, 'subcategory': <Subcategory.latitude: 'latitude'>, 'format': None, 'time_resolution': {'resolution': None, 'unit': None, 'density': None, 'error': None}, 'match_type': [<Matchtype.LSTM: 'LSTM'>], 'fuzzyColumn': None}
    
    y_value {'category': <Category.geo: 'geo'>, 'subcategory': <Subcategory.longitude: 'longitude'>, 'format': None, 'time_resolution': {'resolution': None, 'unit': None, 'density': None, 'error': None}, 'match_type': [<Matchtype.LSTM: 'LSTM'>], 'fuzzyColumn': None}

    date_value {'category': <Category.time: 'time'>, 'subcategory': <Subcategory.date: 'date'>, 'format': '%m/%d/%Y', 'time_resolution': {'resolution': TimeResolution(uniformity=<Uniformity.PERFECT: 1>, unit=<TimeUnit.day: 86400.0>, density=1.0, error=0.0), 'unit': None, 'density': None, 'error': None}, 'match_type': [<Matchtype.LSTM: 'LSTM'>], 'fuzzyColumn': None}

    precip_value {'category': None, 'subcategory': None, 'format': None, 'time_resolution': {'resolution': None, 'unit': None, 'density': None, 'error': None}, 'match_type': [], 'fuzzyColumn': None}
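
A pipeline usually needs to know which column plays which role rather than read the printed output. The following is a minimal sketch of that lookup, assuming the result behaves like a mapping from column name to a dict with the keys shown above. The `categorizations` literal is a hypothetical stand-in that uses plain strings in place of the enum members, and `columns_by_subcategory` is an illustrative helper, not part of Cartwright.

```python
# Hypothetical stand-in for the mapping returned by columns_categorized();
# plain strings replace the enum members shown in the output above.
categorizations = {
    "x_value": {"category": "geo", "subcategory": "latitude", "format": None},
    "y_value": {"category": "geo", "subcategory": "longitude", "format": None},
    "date_value": {"category": "time", "subcategory": "date", "format": "%m/%d/%Y"},
    "precip_value": {"category": None, "subcategory": None, "format": None},
}


def columns_by_subcategory(categorizations):
    """Map each detected subcategory to the list of columns that carry it."""
    roles = {}
    for column, info in categorizations.items():
        subcategory = info.get("subcategory")
        if subcategory is not None:  # skip columns that were not categorized
            roles.setdefault(subcategory, []).append(column)
    return roles


roles = columns_by_subcategory(categorizations)
print(roles)
```

Columns without a detected subcategory, such as precip_value here, are left out, so the remaining columns can be treated as ordinary features.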

With this information we can now convert the date values to timestamps and plot a timeseries with the other features.
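
As a rough illustration of that conversion, the sketch below parses the example dates with the inferred `%m/%d/%Y` format using only the standard library. The names `raw_dates` and `to_timestamps` are illustrative, and treating the dates as UTC is an assumption made here, since the source data carries no timezone. In a pandas pipeline, `pandas.to_datetime` with its `format` argument would serve the same purpose.

```python
from datetime import datetime, timezone

# Format string taken from the 'format' field of the date_value result
date_format = "%m/%d/%Y"

# Values of the date_value column from the example dataframe
raw_dates = ["07/14/1992", "07/15/1992", "07/16/1992"]


def to_timestamps(values, fmt):
    """Parse date strings with fmt and return POSIX timestamps in seconds.

    Assumption: the dates are interpreted as midnight UTC.
    """
    return [
        datetime.strptime(value, fmt).replace(tzinfo=timezone.utc).timestamp()
        for value in values
    ]


timestamps = to_timestamps(raw_dates, date_format)

# Spacing between consecutive rows, in seconds
steps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
print(steps)
```

For this data every step is 86400 seconds, which agrees with the one-day time resolution reported for date_value.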


