Metadata-Version: 2.1
Name: datamatch
Version: 0.0.2
Summary: Data matching utilities
Home-page: https://github.com/pckhoi/datamatch
Author: Khoi Pham
Author-email: pckhoi@gmail.com
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/pckhoi/datamatch/issues
Description: # Datamatch
        
        Data matching is the process of identifying similar entities or matches across different datasets. This package provides utilities for data matching. There are 3 components to a data matching system:
        
        - **Index** which divides the data into buckets such that only records within a single bucket need to be matched against each other. This improves matching time substantially as records with no chance of matching are never considered.
        - **Similarity function** which computes a similarity score when given 2 values.
        - **Matcher** which fetches record pairs from the index and for every pair computes similarity score for each field using the similarity functions, and finally combines the similarity scores into a single score for each pair.
        
        This package mostly interface with Pandas so using Pandas is mandatory.
        
        ## Basic usage
        
        Match records between 2 frames
        
        ```python
        from datamatch import (
            ThresholdMatcher, ColumnsIndex, JaroWinklerSimiarity, DateSimilarity
        )
        
        matcher = ThresholdMatcher(ColumnsIndex(['year_of_birth']), {
            'first_name': JaroWinklerSimilarity(),
            'last_name': JaroWinklerSimilarity()
        }, df1, df2)
        
        # print sample pairs within different score ranges
        print(matcher.get_sample_pairs())
        
        # print all pairs that score 0.8 or higher
        print(matcher.get_all_pairs(0.8))
        
        # save matching result to an excel file
        matcher.save_pairs_to_excel('match_result.xlsx', 0.8)
        
        # get index pairs of matched records
        matches = matcher.get_index_pairs_within_thresholds(0.8)
        ```
        
        To deduplicate records in a frame, pass in just one frame instead of 2
        
        ```python
        matcher = ThresholdMatcher(ColumnsIndex(['year_of_birth']), {
            'first_name': JaroWinklerSimilarity(),
            'last_name': JaroWinklerSimilarity()
        }, df)
        ```
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
