Data input formats¶
Contents
1 Pre-requirements
1.1 Import dependencies
1.2 Notebook configuration
2 Overview
3 Points
3.1 2D NumPy array of shape (n, d)
4 Distances
4.1 2D NumPy array of shape (n, n)
5 Neighbourhoods
6 Densitygraph
Pre-requirements¶
Import dependencies¶
[1]:
import sys
import matplotlib as mpl
import numpy as np
import cnnclustering.cnn as cnn # CNN clustering
[2]:
# Version information
print(sys.version)
3.8.3 (default, May 15 2020, 15:24:35)
[GCC 8.3.0]
Notebook configuration¶
[3]:
# Matplotlib configuration
mpl.rc_file(
    "matplotlibrc",
    use_default_template=False
)
[3]:
# Axis property defaults for the plots
ax_props = {
    "xlabel": None,
    "ylabel": None,
    "xlim": (-2.5, 2.5),
    "ylim": (-2.5, 2.5),
    "xticks": (),
    "yticks": (),
    "aspect": "equal"
}
# Line plot property defaults
line_props = {
    "linewidth": 0,
    "marker": '.',
}
Overview¶
A data set of \(n\) points can primarily be represented through point coordinates in a \(d\)-dimensional space, or in terms of a pairwise distance matrix (of arbitrary metric). Secondarily, the data set can be described by neighbourhoods (in a graph structure) with respect to a specific radius cutoff. Furthermore, the neighbourhoods can be trimmed into a density graph that holds, for each point, its density-connected points rather than its neighbours. The memory demand of these input forms and the speed at which they can be clustered vary. Currently, the cnnclustering.cnn module can deal with the following data structures (\(n\): number of points, \(d\): number of dimensions).
Points
    2D NumPy array of shape (n, d), holding point coordinates
Distances
    2D NumPy array of shape (n, n), holding pairwise distances
Neighbourhoods
    1D NumPy array of shape (n,) of 1D NumPy arrays of shape (<= n,), holding point indices
    Python list of length (n) of Python sets of length (<= n), holding point indices
    Sparse graph with 1D NumPy array of shape (<= n²), holding point indices, and 1D NumPy array of shape (n,), holding neighbourhood start indices
Density graph
    1D NumPy array of shape (n,) of 1D NumPy arrays of shape (<= n,), holding point indices
    Python list of length (n) of Python sets of length (<= n), holding point indices
    Sparse graph with 1D NumPy array of shape (<= n²), holding point indices, and 1D NumPy array of shape (n,), holding connectivity start indices
The different input structures are wrapped by corresponding classes so that they can be handled as attributes of a CNN cluster object. Different kinds of input formats corresponding to the same data set are bundled in a Data object.
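To make the neighbourhood formats listed above more concrete, the following sketch builds them by hand for a made-up toy data set using only the Python standard library. It does not use the cnnclustering classes; the variable names, the Euclidean metric, and the choice to exclude each point from its own neighbourhood are assumptions made for this illustration.

```python
import math

# Toy data set: four points in 2D (made up for illustration)
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 3.0)]
r = 1.0  # radius cutoff

n = len(points)

# Neighbourhoods as a list of sets: entry i holds the indices of all
# points within r of point i (excluding i itself, an assumption of
# this sketch)
neighbourhoods = [
    {j for j in range(n) if j != i and math.dist(points[i], points[j]) <= r}
    for i in range(n)
]

# Sparse layout: one flat sequence of neighbour indices plus, for each
# point, the position at which its neighbourhood starts in that sequence
indices = []
starts = []
for neighbours in neighbourhoods:
    starts.append(len(indices))
    indices.extend(sorted(neighbours))

print(neighbourhoods)
print(indices)
print(starts)
```

In the sparse layout, the neighbourhood of point i occupies the flat sequence from starts[i] up to starts[i + 1] (or up to the end for the last point), so an isolated point simply contributes an empty slice.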
Points¶
2D NumPy array of shape (n, d)¶
The cnn module provides the class Points to handle data set point coordinates. Instances of type Points behave essentially like NumPy arrays.
[19]:
points = cnn.Points()
print("Representation of points: ", repr(points))
print("Points are Numpy arrays: ", isinstance(points, np.ndarray))
Representation of points: Points([], dtype=float64)
Points are Numpy arrays: True
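If your data points are already available as a 2D NumPy array, the conversion into Points is straightforward and does not require any copying, so changes made through Points are also visible in the original array. Note that the dtype of Points is for now fixed to np.float_ (that is, np.float64; the np.float_ alias was removed in NumPy 2.0).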
[42]:
original_points = np.array([[0, 0, 0],
                            [1, 1, 1]], dtype=np.float64)
points = cnn.Points(original_points)
points[0, 0] = 1
points
[42]:
Points([[1., 0., 0.],
[1., 1., 1.]])
[43]:
original_points
[43]:
array([[1., 0., 0.],
[1., 1., 1.]])
1D sequences are interpreted as a single point on initialisation.
[45]:
points = cnn.Points(np.array([0, 0, 0]))
points
[45]:
Points([[0., 0., 0.]])
Other sequences, such as lists, also work as input, but note that this requires a copy.
[47]:
original_points = [[0, 0, 0],
                   [1, 1, 1]]
points = cnn.Points(original_points)
points
[47]:
Points([[0., 0., 0.],
[1., 1., 1.]])
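The difference between inputs that can be wrapped without copying and inputs that need a copy can be checked in plain NumPy. The sketch below uses np.asarray and np.shares_memory as a stand-in; it is an assumption of this illustration that Points behaves comparably, and the variable names are hypothetical.

```python
import numpy as np

float_array = np.array([[0, 0, 0], [1, 1, 1]], dtype=np.float64)
int_array = np.array([[0, 0, 0], [1, 1, 1]], dtype=np.int64)
nested_list = [[0, 0, 0], [1, 1, 1]]

# np.asarray only copies when it has to
from_float = np.asarray(float_array, dtype=np.float64)
from_int = np.asarray(int_array, dtype=np.float64)
from_list = np.asarray(nested_list, dtype=np.float64)

# Matching dtype: the result shares memory with the input (no copy)
print(np.shares_memory(float_array, from_float))
# Different dtype: the data has to be converted, which creates a copy
print(np.shares_memory(int_array, from_int))
# A list has no array buffer to share, so a new float64 array is built
print(from_list.dtype)
```

Passing data that is already a float64 array therefore avoids duplicating a potentially large coordinate set in memory.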
Points can be used to represent data sets distributed over multiple parts. Parts could constitute independent measurements that should be clustered together but remain separated for later analyses. Internally, Points always stores the underlying point coordinates as a (vertically stacked) 2D array. Points.edges is used to track the number of points belonging to each part. The alternative constructor Points.from_parts can be used to deduce edges from parts of points passed as a sequence of 2D sequences.
[64]:
points = cnn.Points.from_parts([[[0, 0, 0],
                                 [1, 1, 1]],
                                [[2, 2, 2],
                                 [3, 3, 3]]])
points
[64]:
Points([[0., 0., 0.],
[1., 1., 1.],
[2., 2., 2.],
[3., 3., 3.]])
[65]:
points.edges # 2 parts, 2 points each
[65]:
array([2, 2])
Trying to set edges manually to a sequence that is not consistent with the total number of points will raise an error. Setting the edges of an empty Points object is, however, allowed and can be used to store part information even when no points are loaded.
[66]:
points.edges = [2, 3]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-4bd144cf309c> in <module>
----> 1 points.edges = [2, 3]
~/CNN/cnnclustering/cnn.py in edges(self, x)
810
811 if (n != 0) and (sum_edges != n):
--> 812 raise ValueError(
813 f"Part edges ({sum_edges} points) do not match data points "
814 f"({n} points)"
ValueError: Part edges (5 points) do not match data points (4 points)
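The consistency check visible in the traceback can be reproduced in isolation. The function below is a hypothetical stand-alone helper (check_edges is not part of cnnclustering) that mirrors the condition and the error message shown above, including the rule that any edges are accepted while no points are loaded.

```python
def check_edges(edges, n):
    """Hypothetical helper: accept edges only if they add up to n points.

    An empty data set (n == 0) accepts any edges, mirroring the
    condition shown in the traceback.
    """
    sum_edges = sum(edges)
    if (n != 0) and (sum_edges != n):
        raise ValueError(
            f"Part edges ({sum_edges} points) do not match data points "
            f"({n} points)"
        )
    return list(edges)


check_edges([2, 2], 4)  # consistent with 4 points
check_edges([2, 3], 0)  # no points loaded: accepted

try:
    check_edges([2, 3], 4)  # 5 != 4: rejected
except ValueError as error:
    print(error)
```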
Points.by_parts can be used to retrieve the parts again one by one.
[70]:
for part in points.by_parts():
    print(f"{part} \n")
[[0. 0. 0.]
[1. 1. 1.]]
[[2. 2. 2.]
[3. 3. 3.]]
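The bookkeeping behind stacked parts can be sketched in plain NumPy. This is an illustration under the assumption that edges simply holds the length of each part, as in the output above; it is not the implementation used by Points.from_parts or Points.by_parts, and the toy parts are made up.

```python
import numpy as np

# Two parts of different length (made up for illustration)
parts = [
    np.array([[0, 0, 0], [1, 1, 1]], dtype=np.float64),
    np.array([[2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=np.float64),
]

# Store everything as one vertically stacked 2D array ...
stacked = np.vstack(parts)
# ... and remember how many points belong to each part
edges = np.array([len(part) for part in parts])

# Split positions are the cumulative part lengths, without the last one
split_at = np.cumsum(edges)[:-1]
recovered = np.split(stacked, split_at)

for part in recovered:
    print(part, "\n")
```

Keeping a single stacked array lets the whole data set be clustered at once, while the part lengths are enough to separate the parts again afterwards.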
To provide one possible way to calculate neighbourhoods from points, Points has a thin method wrapper for scipy.spatial.cKDTree. Calling it sets Points.tree, which is used by CNN.calc_neighbours_from_cKDTree. The user is encouraged to use any other external method instead.
[75]:
points.cKDTree()
points.tree
[75]:
<scipy.spatial.ckdtree.cKDTree at 0x7f0f6d3f3900>
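For orientation, this is roughly what a radius query on such a tree looks like when scipy.spatial.cKDTree is used directly. The sketch bypasses Points and CNN entirely; the radius value is an arbitrary choice, and how CNN.calc_neighbours_from_cKDTree post-processes the result is not shown here.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([[0, 0, 0],
                   [1, 1, 1],
                   [2, 2, 2],
                   [3, 3, 3]], dtype=np.float64)
r = 2.0  # radius cutoff, chosen arbitrarily for this sketch

tree = cKDTree(points)

# For every query point, collect the indices of all tree points within r.
# Because the query points are the tree points themselves, each point
# also finds itself.
neighbours = tree.query_ball_point(points, r)

for index, found in enumerate(neighbours):
    print(index, sorted(found))
```

Consecutive points are about 1.73 apart here, so with a radius of 2 each point finds itself and its direct neighbours along the diagonal.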
Distances¶
2D NumPy array of shape (n, n)¶
The cnn module provides the class Distances to handle data set pairwise distances as a dense matrix. Instances of type Distances behave (like Points) much like NumPy arrays.
[79]:
distances = cnn.Distances([[0, 1], [1, 0]])
distances
[79]:
Distances([[0., 1.],
[1., 0.]])
Distances do not support an edges attribute, i.e. they cannot represent part information. Use the edges of an associated Points instance instead.
Pairwise distances can be calculated for \(n\) points within a data set from a Points instance, for example with CNN.calc_dist, resulting in a matrix of shape (\(n\), \(n\)). They can also be calculated between \(n\) points in one and \(m\) points in another data set, resulting in a relative distance matrix (map matrix) of shape (\(n\), \(m\)). In the latter case, Distances.reference should be used to keep track of the CNN object carrying the second data set. Such a map matrix can be used to predict cluster labels for a data set based on the fitted cluster labels of another set.
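The two matrix shapes described above can be illustrated in plain NumPy. The helper below is a hypothetical stand-in that assumes the Euclidean metric; it is not CNN.calc_dist, and the toy coordinates are made up.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between the rows of a (n, d) and b (m, d)."""
    # Broadcasting yields an (n, m, d) array of coordinate differences
    difference = a[:, np.newaxis, :] - b[np.newaxis, :, :]
    return np.linalg.norm(difference, axis=-1)


data = np.array([[0, 0], [3, 4], [6, 8]], dtype=np.float64)  # n = 3
other = np.array([[0, 0], [0, 1]], dtype=np.float64)         # m = 2

# Distances within one data set: square and symmetric, shape (n, n)
distances = pairwise_distances(data, data)
# Distances between two data sets: map matrix of shape (n, m)
map_matrix = pairwise_distances(data, other)

print(distances.shape)
print(map_matrix.shape)
```

Note that the dense (n, n) matrix grows quadratically with the number of points, which is one reason why the memory demand of the input formats differs.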