Metadata-Version: 2.1
Name: datamatch
Version: 0.0.1
Summary: Data matching utilities
Home-page: https://github.com/pckhoi/datamatch
Author: Khoi Pham
Author-email: pckhoi@gmail.com
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/pckhoi/datamatch/issues
Description: # Datamatch
        
        Data matching is the process of identifying similar entities, or matches, across different datasets. This package provides utilities for data matching. A data matching system has three components:
        
        - **Index** which divides the data into buckets such that only records within a single bucket need to be matched against each other. This improves matching time substantially as records with no chance of matching are never considered.
        - **Similarity function** which computes a similarity score for a pair of values.
        - **Matcher** which fetches record pairs from the index, computes a similarity score for each field of every pair using the similarity functions, and finally combines those scores into a single score per pair.
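        
        The three components above can be sketched in plain Pandas. This is only an illustration of the idea, not this package's implementation: the helper names (`string_similarity`, `bucket_pairs`, `match`) are hypothetical, `difflib.SequenceMatcher` stands in for a proper similarity function such as Jaro-Winkler, and averaging the per-field scores is an assumed way of combining them.
        
        ```python
        from difflib import SequenceMatcher
        from itertools import product
        
        import pandas as pd
        
        
        def string_similarity(a: str, b: str) -> float:
            """Similarity function: score two values between 0 and 1."""
            return SequenceMatcher(None, a, b).ratio()
        
        
        def bucket_pairs(left: pd.DataFrame, right: pd.DataFrame, key: str):
            """Index: yield (left_id, right_id) only for records sharing a key value."""
            right_groups = right.groupby(key).groups
            for value, left_ids in left.groupby(key).groups.items():
                # Records whose key value is absent from `right` produce no pairs.
                yield from product(left_ids, right_groups.get(value, []))
        
        
        def match(left, right, key, fields, threshold):
            """Matcher: average per-field scores and keep pairs at or above threshold."""
            results = []
            for i, j in bucket_pairs(left, right, key):
                scores = [string_similarity(left.at[i, f], right.at[j, f]) for f in fields]
                score = sum(scores) / len(scores)
                if score >= threshold:
                    results.append((i, j, score))
            return results
        
        
        # Made-up sample records, for illustration only.
        left = pd.DataFrame({
            "first_name": ["john", "mary"],
            "last_name": ["smith", "jones"],
            "year_of_birth": [1980, 1975],
        })
        right = pd.DataFrame({
            "first_name": ["jon", "maria"],
            "last_name": ["smith", "lopez"],
            "year_of_birth": [1980, 1990],
        })
        
        matches = match(left, right, "year_of_birth", ["first_name", "last_name"], 0.8)
        print(matches)
        ```
        
        In this sketch only the two records born in 1980 are ever compared. The other records share no bucket, so they are skipped without any similarity computation, which is where the time saving from indexing comes from.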
        
        This package works mostly with Pandas data structures, so using Pandas is mandatory.
        
        ## Basic usage
        
        ```python
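        import pandas as pd
        from datamatch import ThresholdMatcher, ColumnsIndex, JaroWinklerSimilarity
        
        # Illustrative inputs: both frames carry the indexed and compared columns.
        cols = ['first_name', 'last_name', 'year_of_birth']
        df1 = pd.DataFrame([['john', 'smith', 1980]], columns=cols)
        df2 = pd.DataFrame([['jon', 'smith', 1980]], columns=cols)
        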
        matcher = ThresholdMatcher(df1, df2, ColumnsIndex(['year_of_birth']), {
            'first_name': JaroWinklerSimilarity(),
            'last_name': JaroWinklerSimilarity()
        })
        print(matcher.get_index_pairs_within_thresholds(0.8))
        ```
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
