{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data input formats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Pre-requirements\" data-toc-modified-id=\"Pre-requirements-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Pre-requirements</a></span><ul class=\"toc-item\"><li><span><a href=\"#Import-dependencies\" data-toc-modified-id=\"Import-dependencies-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Import dependencies</a></span></li><li><span><a href=\"#Notebook-configuration\" data-toc-modified-id=\"Notebook-configuration-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Notebook configuration</a></span></li></ul></li><li><span><a href=\"#Overview\" data-toc-modified-id=\"Overview-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href=\"#Points\" data-toc-modified-id=\"Points-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Points</a></span><ul class=\"toc-item\"><li><span><a href=\"#2D-NumPy-array-of-shape-(n,-d)\" data-toc-modified-id=\"2D-NumPy-array-of-shape-(n,-d)-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>2D NumPy array of shape (<em>n</em>, <em>d</em>)</a></span></li></ul></li><li><span><a href=\"#Distances\" data-toc-modified-id=\"Distances-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Distances</a></span><ul class=\"toc-item\"><li><span><a href=\"#2D-NumPy-array-of-shape-(n,-n)\" data-toc-modified-id=\"2D-NumPy-array-of-shape-(n,-n)-4.1\"><span class=\"toc-item-num\">4.1&nbsp;&nbsp;</span>2D NumPy array of shape (<em>n</em>, <em>n</em>)</a></span></li></ul></li><li><span><a href=\"#Neighbourhoods\" data-toc-modified-id=\"Neighbourhoods-5\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>Neighbourhoods</a></span></li><li><span><a href=\"#Densitygraph\" data-toc-modified-id=\"Densitygraph-6\"><span class=\"toc-item-num\">6&nbsp;&nbsp;</span>Densitygraph</a></span></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pre-requirements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T13:28:29.353908Z",
     "start_time": "2020-06-09T13:28:28.875899Z"
    }
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "import matplotlib as mpl\n",
    "\n",
    "import cnnclustering.cnn as cnn  # CNN clustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T13:28:31.783588Z",
     "start_time": "2020-06-09T13:28:31.780355Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.8.3 (default, May 15 2020, 15:24:35) \n",
      "[GCC 8.3.0]\n"
     ]
    }
   ],
   "source": [
    "# Version information\n",
    "print(sys.version)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Notebook configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T13:29:19.030216Z",
     "start_time": "2020-06-09T13:29:19.021536Z"
    }
   },
   "outputs": [],
   "source": [
    "# Matplotlib configuration\n",
    "mpl.rc_file(\n",
    "    \"matplotlibrc\",\n",
    "    use_default_template=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-08T11:04:19.629939Z",
     "start_time": "2020-06-08T11:04:19.619848Z"
    }
   },
   "outputs": [],
   "source": [
    "# Axis property defaults for the plots\n",
    "ax_props = {\n",
    "    \"xlabel\": None,\n",
    "    \"ylabel\": None,\n",
    "    \"xlim\": (-2.5, 2.5),\n",
    "    \"ylim\": (-2.5, 2.5),\n",
    "    \"xticks\": (),\n",
    "    \"yticks\": (),\n",
    "    \"aspect\": \"equal\"\n",
    "}\n",
    "\n",
    "# Line plot property defaults\n",
    "line_props = {\n",
    "    \"linewidth\": 0,\n",
    "    \"marker\": '.',\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A data set of $n$ points can primarily be represented through point coordinates in a $d$-dimensional space, or in terms of a pairwise distance matrix (of arbitrary metric). Secondarily, the data set can be described by neighbourhoods (in a graph structure) with respect to a specific radius cutoff. Furthermore, it is possible to trim the neighbourhoods into a density graph containing density connected points rather then neighbours for each point. The memory demand of the input forms and the speed at which they can be clustered varies. Currently the `cnnclustering.cnn` module can deal with the following data structures ($n$: number of points, $d$: number of dimensions)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Points__\n",
    "  \n",
    "  - 2D NumPy array of shape (*n*, *d*), holding point coordinates\n",
    "\n",
    "__Distances__\n",
    "\n",
    "  - 2D NumPy array of shape (*n*, *n*), holding pairwise distances\n",
    "  \n",
    "__Neighbourhoods__\n",
    "\n",
    "  - 1D Numpy array of shape (*n*,) of 1D Numpy arrays of shape (<= *n*,), holding point indices\n",
    "  - Python list of length (*n*) of Python sets of length (<= *n*), holding point indices\n",
    "  - Sparse graph with 1D NumPy array of shape (<= *n²*), holding point indices, and 1D NumPy array of shape (*n*,), holding neighbourhood start indices\n",
    "  \n",
    "__Density graph__\n",
    "\n",
    "  - 1D Numpy array of shape (*n*,) of 1D Numpy arrays of shape (<= *n*,), holding point indices\n",
    "  - Python list of length (*n*) of Python sets of length (<= *n*), holding point indices\n",
    "  - Sparse graph with 1D NumPy array of shape (<= *n²*), holding point indices, and 1D NumPy array of shape (*n*,), holding connectivity start indices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The different input structures are wrapped by corresponding classes to be handled as attributes of a `CNN` cluster object. Different kinds of input formats corresponding to the same data set are bundled in an `Data` object."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2D NumPy array of shape (*n*, *d*)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T13:38:24.825701Z",
     "start_time": "2020-06-09T13:38:24.820167Z"
    }
   },
   "source": [
    "The `cnn` module provides the class `Points` to handle data set point coordinates. Instances of type `Points` behave essentially like NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T13:51:00.057109Z",
     "start_time": "2020-06-09T13:51:00.041521Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Representation of points:  Points([], dtype=float64)\n",
      "Points are Numpy arrays:   True\n"
     ]
    }
   ],
   "source": [
    "points = cnn.Points()\n",
    "print(\"Representation of points: \", repr(points))\n",
    "print(\"Points are Numpy arrays:  \", isinstance(points, np.ndarray))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have your data points already in the format of a 2D NumPy array, the conversion into `Points` is straightforward and does not require any copying. Note that the dtype of `Points` is for now fixed to `np.float_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:08:02.743358Z",
     "start_time": "2020-06-09T14:08:02.722019Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Points([[1., 0., 0.],\n",
       "        [1., 1., 1.]])"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_points = np.array([[0, 0, 0],\n",
    "                            [1, 1, 1]], dtype=np.float_)\n",
    "points = cnn.Points(original_points)\n",
    "points[0, 0] = 1\n",
    "points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:08:04.908069Z",
     "start_time": "2020-06-09T14:08:04.890473Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1., 0., 0.],\n",
       "       [1., 1., 1.]])"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1D sequences are interpreted as a single point on initialisation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:09:13.954928Z",
     "start_time": "2020-06-09T14:09:13.947732Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Points([[0., 0., 0.]])"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "points = cnn.Points(np.array([0, 0, 0]))\n",
    "points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other sequences like lists do work as input, too but consider that this requires a copy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:11:44.034862Z",
     "start_time": "2020-06-09T14:11:44.030212Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Points([[0., 0., 0.],\n",
       "        [1., 1., 1.]])"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_points = [[0, 0, 0],\n",
    "                   [1, 1, 1]]\n",
    "points = cnn.Points(original_points)\n",
    "points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Points` can be used to represent data sets distributed over multiple parts. Parts could constitute independent measurements that should be clustered together but remain separated for later analyses. Internally `Points` stores the underlying point coordinates always as a (vertically stacked) 2D array. `Points.edges` is used to track the number of points belonging to each part. The alternative constructor `Points.from_parts` can be used to deduce `edges` from parts of points passed as a sequence of 2D sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:34:57.510045Z",
     "start_time": "2020-06-09T14:34:57.495326Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Points([[0., 0., 0.],\n",
       "        [1., 1., 1.],\n",
       "        [2., 2., 2.],\n",
       "        [3., 3., 3.]])"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "points = cnn.Points.from_parts([[[0, 0, 0],\n",
    "                                 [1, 1, 1]],\n",
    "                                [[2, 2, 2],\n",
    "                                 [3, 3, 3]]])\n",
    "points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:34:57.741356Z",
     "start_time": "2020-06-09T14:34:57.728562Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, 2])"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "points.edges  # 2 parts, 2 points each"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Trying to set `edges` manually to a sequence not consistent with the total number of points, will raise an error. Setting the `edges` of an empty `Points` object is, however, allowed and can be used to store part information even when no points are loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:34:58.262197Z",
     "start_time": "2020-06-09T14:34:58.243633Z"
    }
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Part edges (5 points) do not match data points (4 points)",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-66-4bd144cf309c>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mpoints\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0medges\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m3\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32m~/CNN/cnnclustering/cnn.py\u001b[0m in \u001b[0;36medges\u001b[1;34m(self, x)\u001b[0m\n\u001b[0;32m    810\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    811\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mn\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0msum_edges\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[0mn\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 812\u001b[1;33m             raise ValueError(\n\u001b[0m\u001b[0;32m    813\u001b[0m                 \u001b[1;34mf\"Part edges ({sum_edges} points) do not match data points \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    814\u001b[0m                 \u001b[1;34mf\"({n} points)\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mValueError\u001b[0m: Part edges (5 points) do not match data points (4 points)"
     ]
    }
   ],
   "source": [
    "points.edges = [2, 3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Points.by_parts` can be used to retrieve the parts again one by one. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:38:23.058112Z",
     "start_time": "2020-06-09T14:38:23.052497Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0. 0. 0.]\n",
      " [1. 1. 1.]] \n",
      "\n",
      "[[2. 2. 2.]\n",
      " [3. 3. 3.]] \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for part in points.by_parts():\n",
    "    print(f\"{part} \\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To provide one possible way to calculate neighbourhoods from points, `Points` has a thin method wrapper\n",
    "for `scipy.spatial.cKDTree`. This will set `Points.tree` which is used by `CNN.calc_neighbours_from_cKDTree`. The user is encouraged to use any other external method instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:45:24.238328Z",
     "start_time": "2020-06-09T14:45:24.221649Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<scipy.spatial.ckdtree.cKDTree at 0x7f0f6d3f3900>"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "points.cKDTree()\n",
    "points.tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2D NumPy array of shape (*n*, *n*)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T13:38:24.825701Z",
     "start_time": "2020-06-09T13:38:24.820167Z"
    }
   },
   "source": [
    "The `cnn` module provides the class `Distances` to handle data set pairwise distances as a dense matrix. Instances of type `Distances` behave (like `Points`) much like NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T14:56:08.026576Z",
     "start_time": "2020-06-09T14:56:08.012575Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Distances([[0., 1.],\n",
       "           [1., 0.]])"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "distances = cnn.Distances([[0, 1], [1, 0]])\n",
    "distances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Distances` do not support an `edges` attribute, i.e. can not represent part information. Use the `edges` of an associated `Points` instance instead."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pairwise `Distances` can be calculated for $n$ points within a data set from a `Points` instance for example with `CNN.calc_dist`, resulting in a matrix of shape ($n$, $n$). They can be also calculated between $n$ points in one and $m$ points in another data set, resulting in a relative distance matrix (map matrix) of shape ($n$, $m$). In the later case `Distances.reference` should be used to keep track of the `CNN` object carrying the second data set. Such a map matrix can be used to predict cluster labels for a data set based on the fitted cluster labels of another set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Neighbourhoods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Densitygraph"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "164.988px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
