
# pySTAD 

This is a Python implementation of [STAD](https://ieeexplore.ieee.org/document/9096616/) for the exploration and visualisation of high-dimensional data. It is based on the [R version](https://github.com/vda-lab/stad).

## Background

[STAD](https://ieeexplore.ieee.org/document/9096616/) is a dimensionality reduction algorithm that generates an abstract representation of high-dimensional data by giving each data point a location in a graph that preserves the distances in the original high-dimensional space. The STAD graph is built on the minimum spanning tree (MST), to which new edges are added until the correlation between distances in the graph and distances in the original dataset is maximized. Additionally, STAD supports filter functions (lenses) to analyse data from new perspectives, emphasizing traits in the data that would otherwise remain hidden.
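The idea of starting from the MST and adding edges while tracking a correlation can be illustrated with a short sketch. This is not pySTAD's implementation: `stad_sketch` and `candidate_counts` are hypothetical names, and the sketch assumes that graph distance is measured as the number of hops on an unweighted graph and that the correlation is Pearson's. It also tries only a handful of candidate edge counts instead of using an optimiser.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def stad_sketch(points, candidate_counts):
    """Toy STAD-like sweep (illustrative only, not the pySTAD API).

    Returns (edges, correlation) for the best candidate edge count.
    Assumes all points are distinct, so no pairwise distance is zero.
    """
    dist = squareform(pdist(points))      # dense pairwise Euclidean distances
    n = dist.shape[0]
    iu = np.triu_indices(n, k=1)          # index pairs of the upper triangle

    # Step 1: the MST gives a connected starting graph with n - 1 edges.
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray()
    in_mst = (mst + mst.T) > 0

    # Step 2: rank the remaining pairs from shortest to longest distance.
    extra = [(i, j) for i, j in zip(*iu) if not in_mst[i, j]]
    extra.sort(key=lambda pair: dist[pair])

    best_edges, best_corr = None, -np.inf
    for k in candidate_counts:
        adj = in_mst.copy()
        for i, j in extra[:k]:
            adj[i, j] = adj[j, i] = True

        # Step 3: hop counts in the graph versus the original distances.
        hops = shortest_path(csr_matrix(adj.astype(float)),
                             directed=False, unweighted=True)
        corr = np.corrcoef(hops[iu], dist[iu])[0, 1]
        if corr > best_corr:
            best_edges, best_corr = np.argwhere(np.triu(adj, k=1)), corr
    return best_edges, best_corr


# Usage: 40 random 3D points, trying a few numbers of extra edges.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(40, 3))
demo_edges, demo_corr = stad_sketch(demo_points, [0, 20, 80, 200])
print(len(demo_edges), "edges, correlation", round(float(demo_corr), 3))
```

Because the MST edges are always kept, every candidate graph stays connected and all hop counts are finite.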

### Topological data analysis

Topological data analysis (TDA) aims to describe the geometric structures present in data. A dataset is interpreted as a point cloud, where each point is sampled from an underlying geometric object. TDA tries to recover and describe the geometry of that object in terms of features that are invariant ["under continuous deformations, such as stretching, twisting, crumpling and bending, but not tearing or gluing"](https://en.wikipedia.org/wiki/Topology). Two geometries that can be deformed into each other without tearing or gluing are *homeomorphic* (for instance, a donut and a coffee mug). Typically, TDA describes the *holes* in a geometry, formalised as [Betti numbers](https://en.wikipedia.org/wiki/Betti_number).
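As a small worked example of Betti numbers, consider a graph viewed as a one-dimensional shape: the zeroth Betti number counts connected components, and the first counts independent cycles, which equals edges minus vertices plus components. The helper `graph_betti_numbers` below is an illustrative name and not part of pySTAD; it assumes a simple undirected graph with no duplicate edges.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def graph_betti_numbers(n_vertices, edges):
    """Return (b0, b1) for a simple undirected graph given as an edge list."""
    rows = [i for i, _ in edges]
    cols = [j for _, j in edges]
    adj = csr_matrix((np.ones(len(edges)), (rows, cols)),
                     shape=(n_vertices, n_vertices))
    # b0: number of connected components.
    b0, _ = connected_components(adj, directed=False)
    # b1: number of independent cycles (edges - vertices + components).
    b1 = len(edges) - n_vertices + b0
    return int(b0), int(b1)


# A square has one component and one hole; a path has no hole.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
path = [(0, 1), (1, 2)]
print(graph_betti_numbers(4, square))
print(graph_betti_numbers(3, path))
```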


Like other TDA algorithms, STAD constructs a graph that describes the structure of the data. However, the output of STAD should be interpreted as a data-visualisation result, rather than a topological description of the data's structure. Other TDA algorithms, like [mapper](https://github.com/scikit-tda/kepler-mapper), do produce topological results. However, they rely on aggregating the data, whereas STAD encodes the original data points as vertices in a graph.

### Dimensionality reduction

Compared to dimensionality reduction algorithms like t-SNE and UMAP, STAD produces a more flexible description of the data: a graph can be drawn using different layouts, and a user can interact with it. In addition, STAD's projections retain the global structure of the data, although the graph tends to underestimate the distances between far-apart data points. t-SNE and UMAP, by contrast, emphasize the relation of data points with their closest neighbors over that with distant data points.

<p style="text-align:center;"><img src="./assets/dimensionality_reduction_comparison.png" width="90%" /></p>

Figure from [Alcaide & Aerts (2020)](https://ieeexplore.ieee.org/document/9096616/).

## Installation

pySTAD can be installed with:
```bash
pip install pystad
```
This installs the following dependencies:
- numpy
- scipy
- python-igraph
- pandas

The example notebooks have additional dependencies:
- matplotlib
- networkx
- scikit-learn
- jupyterlab
- ipywidgets

These can be installed with pip or conda. Enabling ipywidgets in JupyterLab takes two more steps:
- First, install nodejs using conda:
```bash
conda install -c conda-forge nodejs
```
- Then install the jupyter lab extension:
```bash
jupyter labextension install @jupyter-widgets/jupyterlab-manager
```

## Examples

Please see the example notebooks for demonstrations of STAD and interactive exploration dashboards. The code below provides a quick-start:

```Python
import stad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import triu
from sklearn.metrics.pairwise import euclidean_distances

# Horse point-cloud dataset (columns x, y, z), subsampled to 500 points
data = pd.read_csv('./examples/data/horse.csv', header=0)
data = data.sample(n=500)
dist = triu(euclidean_distances(data), k = 1)

plt.scatter(data.z, data.y, s=5, c=data.x)
plt.show()

## STAD without lens
network_no_lens, detail = stad.stad(dist)
stad.draw_network_matplotlib(network_no_lens, detail)
plt.show()
stad.draw_correlations_matplotlib(detail)
plt.show()

## STAD with lens
network_lens, detail = stad.stad(dist, lens_values = data['x'], lens_bins = 3)
stad.draw_network_matplotlib(network_lens, detail)
plt.show()
stad.draw_correlations_matplotlib(detail)
plt.show()
```
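The quick-start passes `stad.stad` a sparse matrix that holds each pairwise distance once, in the strict upper triangle. A minimal sketch of what that input looks like for four points is shown below. It uses scipy's `pdist` and `squareform` in place of scikit-learn's `euclidean_distances`, an assumption made only to keep the example light on dependencies; the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import triu
from scipy.spatial.distance import pdist, squareform

# Four corners of a 3-by-4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])

# Full symmetric distance matrix, then keep only entries above the diagonal.
full_dist = squareform(pdist(points, metric='euclidean'))
sparse_dist = triu(full_dist, k=1)

# Each of the n * (n - 1) / 2 point pairs is stored exactly once.
print(sparse_dist.nnz)
print(sparse_dist.toarray())
```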

## Compared to the R-implementation

The [R implementation](https://github.com/vda-lab/stad) supports two-dimensional filters (lenses) and uses simulated annealing to optimise the output graph. This implementation currently supports only one-dimensional lenses. In addition to simulated annealing, it also supports linear and logistic sweeps.