Metadata-Version: 2.1
Name: anonlink
Version: 0.15.0
Summary: Anonymous linkage using cryptographic hashes and bloom filters
Home-page: https://github.com/data61/anonlink
License: Apache
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Security :: Cryptography
Description-Content-Type: text/x-rst
Provides-Extra: test
License-File: LICENSE


.. image:: https://dev.azure.com/data61/Anonlink/_apis/build/status/data61.anonlink?branchName=main
    :target: https://dev.azure.com/data61/Anonlink/_build/latest?definitionId=3&branchName=main



.. image:: https://codecov.io/gh/data61/anonlink/branch/main/graph/badge.svg
    :target: https://codecov.io/gh/data61/anonlink


.. image:: https://requires.io/github/data61/anonlink/requirements.svg?branch=main
    :target: https://requires.io/github/data61/anonlink/requirements/?branch=main


.. image:: https://pepy.tech/badge/anonlink
    :target: https://pepy.tech/project/anonlink


A Python (and optimised C++) implementation of **anonymous linkage** using
*cryptographic linkage keys* as described by Rainer Schnell, Tobias
Bachteler, and Jörg Reiher in `A Novel Error-Tolerant Anonymous Linking
Code <http://grlc.german-microsimulation.de/wp-content/uploads/2017/05/downloadwp-grlc-2011-02.pdf>`__.

``anonlink`` computes similarity scores, and/or best guess matches between sets
of *cryptographic linkage keys* (hashed entity records).

Use `clkhash <https://github.com/data61/clkhash>`__ to create cryptographic linkage keys
from personally identifiable data.

Installation
============

Install a precompiled wheel from PyPi::

    pip install anonlink

Or (if your system has a C++ compiler) you can locally install from source::

    pip install -r requirements.txt
    pip install -e .


Benchmark
---------

You can run the benchmark with:

::

    $ python -m anonlink.benchmark
    Anonlink benchmark -- see README for explanation
    ------------------------------------------------

    Threshold: 0.5, All results returned
    Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
           |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
    -------+--------+------------------+-------------------------+-------------
      1000 |   1000 |    1e6  (50.73%) |  0.762  (49.2% / 50.8%) |     2.669
      2000 |   2000 |    4e6  (51.04%) |  3.696  (42.6% / 57.4%) |     2.540
      3000 |   3000 |    9e6  (50.25%) |  8.121  (43.5% / 56.5%) |     2.548
      4000 |   4000 |   16e6  (50.71%) | 15.560  (41.1% / 58.9%) |     2.504

    Threshold: 0.5, Top 100 matches per record returned
    Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
           |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
    -------+--------+------------------+-------------------------+-------------
      1000 |   1000 |    1e6  ( 6.86%) |  0.170  (85.9% / 14.1%) |     6.846
      2000 |   2000 |    4e6  ( 3.22%) |  0.384  (82.9% / 17.1%) |    12.561
      3000 |   3000 |    9e6  ( 2.09%) |  0.612  (81.6% / 18.4%) |    18.016
      4000 |   4000 |   16e6  ( 1.52%) |  0.919  (78.7% / 21.3%) |    22.135
      5000 |   5000 |   25e6  ( 1.18%) |  1.163  (80.8% / 19.2%) |    26.592
      6000 |   6000 |   36e6  ( 0.97%) |  1.535  (75.4% / 24.6%) |    31.113
      7000 |   7000 |   49e6  ( 0.82%) |  1.791  (80.6% / 19.4%) |    33.951
      8000 |   8000 |   64e6  ( 0.71%) |  2.095  (81.5% / 18.5%) |    37.466
      9000 |   9000 |   81e6  ( 0.63%) |  2.766  (72.5% / 27.5%) |    40.389
     10000 |  10000 |  100e6  ( 0.56%) |  2.765  (81.7% / 18.3%) |    44.277
     20000 |  20000 |  400e6  ( 0.27%) |  7.062  (86.2% / 13.8%) |    65.711

    Threshold: 0.7, All results returned
    Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
           |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
    -------+--------+------------------+-------------------------+-------------
      1000 |   1000 |    1e6  ( 0.01%) |  0.009  (99.0% /  1.0%) |   113.109
      2000 |   2000 |    4e6  ( 0.01%) |  0.033  (98.7% /  1.3%) |   124.076
      3000 |   3000 |    9e6  ( 0.01%) |  0.071  (99.1% /  0.9%) |   128.515
      4000 |   4000 |   16e6  ( 0.01%) |  0.123  (99.0% /  1.0%) |   131.654
      5000 |   5000 |   25e6  ( 0.01%) |  0.202  (99.1% /  0.9%) |   124.999
      6000 |   6000 |   36e6  ( 0.01%) |  0.277  (99.0% /  1.0%) |   131.403
      7000 |   7000 |   49e6  ( 0.01%) |  0.368  (98.9% /  1.1%) |   134.428
      8000 |   8000 |   64e6  ( 0.01%) |  0.490  (99.0% /  1.0%) |   131.891
      9000 |   9000 |   81e6  ( 0.01%) |  0.608  (99.0% /  1.0%) |   134.564
     10000 |  10000 |  100e6  ( 0.01%) |  0.753  (99.0% /  1.0%) |   134.105
     20000 |  20000 |  400e6  ( 0.01%) |  2.905  (98.8% /  1.2%) |   139.294

    Threshold: 0.7, Top 100 matches per record returned
    Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
           |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
    -------+--------+------------------+-------------------------+-------------
      1000 |   1000 |    1e6  ( 0.01%) |  0.009  (99.0% /  1.0%) |   111.640
      2000 |   2000 |    4e6  ( 0.01%) |  0.033  (98.6% /  1.4%) |   122.060
      3000 |   3000 |    9e6  ( 0.01%) |  0.074  (99.1% /  0.9%) |   123.237
      4000 |   4000 |   16e6  ( 0.01%) |  0.124  (99.0% /  1.0%) |   130.204
      5000 |   5000 |   25e6  ( 0.01%) |  0.208  (99.1% /  0.9%) |   121.351
      6000 |   6000 |   36e6  ( 0.01%) |  0.275  (99.0% /  1.0%) |   132.186
      7000 |   7000 |   49e6  ( 0.01%) |  0.373  (99.0% /  1.0%) |   132.650
      8000 |   8000 |   64e6  ( 0.01%) |  0.496  (99.1% /  0.9%) |   130.125
      9000 |   9000 |   81e6  ( 0.01%) |  0.614  (99.0% /  1.0%) |   133.216
     10000 |  10000 |  100e6  ( 0.01%) |  0.775  (99.1% /  0.9%) |   130.230
     20000 |  20000 |  400e6  ( 0.01%) |  2.939  (98.9% /  1.1%) |   137.574


The tables are interpreted as follows. Each table measures the throughput
of the Dice coefficient comparison function. The four tables correspond to
two different choices of "matching threshold" and "result limiting".

These parameters have been chosen to characterise two different performance
scenarios. Since the data used for comparisons is randomly generated, the
first threshold value (`0.5`) will cause about 50% of the candidates to
"match", while the second threshold value (`0.7`) will cause ~0.01% of the
candidates to match (these values are reported in the "match %" column).
Where the table heading includes "All results returned", all matches above
the threshold are returned and passed to the solver.
With the threshold of `0.5`, the large number of matches means that much
of the time is spent keeping the candidates in order. Next we limit the
number of matches per record to the top 100 - which also must be above the
threshold.

In the final two tables we use the threshold value of `0.7`, this very
effectively filters the number of candidate matches down. Here the throughput
is determined primarily by the comparison code itself, adding the top 100
filter has no major impact.

Finally, the Total Time column includes indications as to the
proportion of time spent calculating the (sparse) similarity matrix
`comparisons` and the proportion of time spent `matching` in the
greedy solver. This latter is determined by the size of the similarity
matrix, which will be approximately `#comparisons * match% / 100`.

Tests
=====

Run unit tests with `pytest`:

::

    $ pytest
    ====================================== test session starts ======================================
    platform linux -- Python 3.6.4, pytest-3.2.5, py-1.4.34, pluggy-0.4.0
    rootdir: /home/hlaw/src/n1-anonlink, inifile:
    collected 71 items

    tests/test_benchmark.py ...
    tests/test_bloommatcher.py ..............
    tests/test_e2e.py .............ss....
    tests/test_matcher.py ..x.....x......x....x..
    tests/test_similarity.py .........
    tests/test_util.py ...

    ======================== 65 passed, 2 skipped, 4 xfailed in 4.01 seconds ========================

To enable slightly larger tests add the following environment variables:

-  INCLUDE_10K
-  INCLUDE_100K

Limitations
-----------

-  The linkage process has order n^2 time complexity - although algorithms exist to
   significantly speed this up. Several possible speedups are described
   in `Privacy Preserving Record Linkage with PPJoin <http://dbs.uni-leipzig.de/file/P4Join-BTW2015.pdf>`__.


Discussion
----------

If you run into bugs, you can file them in our `issue tracker <https://github.com/data61/anonlink/issues>`__
on GitHub.

There is also an `anonlink mailing list <https://groups.google.com/forum/#!forum/anonlink>`__
for development discussion and release announcements.

Wherever we interact, we strive to follow the `Python Community Code of Conduct <https://www.python.org/psf/codeofconduct/>`__.

Citing
======

Anonlink is designed, developed and supported by `CSIRO's Data61 <https://www.data61.csiro.au/>`__. If you use any part
of this library in your research, please cite it using the following BibTex entry::

    @misc{Anonlink,
      author = {CSIRO's Data61},
      title = {Anonlink Private Record Linkage System},
      year = {2017},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/data61/anonlink}},
    }


License
-------

Copyright 2017 CSIRO (Data61)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


