Metadata-Version: 2.1
Name: anti-clustering
Version: 0.2.1
Summary: Generic Anti-Clustering
Author: Matthias Als
Author-email: mata@ecco.com
Requires-Python: >=3.8,<3.11
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Dist: numpy (>=1.23.1,<1.24.0)
Requires-Dist: ortools (==9.3.10497)
Requires-Dist: pandas (>=1.4.4,<1.5.0)
Requires-Dist: scikit-learn (>=1.1.1,<1.2.0)
Requires-Dist: scipy (>=1.9.0,<1.10.0)
Description-Content-Type: text/markdown

# Anti-clustering

A generic Python library for solving the anti-clustering problem. While clustering algorithms will achieve high similarity within a cluster and low similarity between clusters, the anti-clustering algorithms will achieve the opposite; namely to minimise similarity within a cluster and maximise the similarity between clusters.
Currently, a handful of algorithms are implemented in this library:
* An exact approach using a BIP formulation.
* An enumerated exchange heuristic.
* A simulated annealing heuristic.

Keep in mind anti-clustering is computationally difficult problem and may run slow even for small instance sizes. The current ILP does not finish in reasonable time when anti-clustering the Iris dataset (150 data points).

The two former approaches are implemented as described in following paper:\
*Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts.
Psychological Methods, 26(2), 161–174. [DOI](https://doi.org/10.1037/met0000301). [Preprint](https://psyarxiv.com/3razc/)* \
The paper is accompanied by a library for the R programming language: [anticlust](https://github.com/m-Py/anticlust).

Differently to the [anticlust](https://github.com/m-Py/anticlust) R package, this library currently only have one objective function. 
In this library the objective will maximise intra-cluster distance: Euclidean distance for numerical columns and Hamming distance for categorical columns.

## Use cases
Within software testing, anti-clustering can be used for generating test and control groups in AB-testing.
Example: You have a webshop with a number of users. The webshop is undergoing active development and you have a new feature coming up. 
This feature should be tested against as many different users as possible without testing against the entire user-base. 
For that you can create a maximally diverse subset of the user-base to test against (the A group). 
The remaining users (B group) will not test this feature. For dividing the user-base you can use the anti-clustering algorithms. 
A and B groups should be as similar as possible to have a reliable basis of comparison, but internally in group A (and B) the elements should be as dissimilar as possible.

This is just one use case, probably many more exists.

## Installation

The anti-clustering package is available on [PyPI](https://pypi.org/project/anti-clustering/). To install it, run the following command:

```bash
pip install anti-clustering
```

The package currently supports Python 3.8 and above. 

## Usage
The input to the algorithm is a Pandas dataframe with each row representing a data point. The output is the same dataframe with an extra column containing integer encoded cluster labels. Below is an example based on the Iris dataset:
```python
from anti_clustering import ExactClusterEditingAntiClustering
from sklearn import datasets
import pandas as pd

iris_data = datasets.load_iris(as_frame=True)
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

algorithm = ExactClusterEditingAntiClustering()

df = algorithm.run(
    df=iris_df,
    numerical_columns=list(iris_df.columns),
    categorical_columns=None,
    num_groups=2,
    destination_column='Cluster'
)
```

## Contributions
If you have any suggestions or have found a bug, feel free to open issues. If you have implemented a new algorithm or know how to tweak the existing ones; PRs are very appreciated.

## License
This library is licensed under the Apache 2.0 license.

