{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# More Value Expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import ibis\n",
    "\n",
    "ibis.options.interactive = True\n",
    "\n",
    "connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Type casting\n",
    "\n",
    "The Ibis type system is pretty basic and will get better (and more documented over time). It maps directly onto the current Impala type system\n",
    "\n",
    "- `int8`\n",
    "- `int16`\n",
    "- `int32`\n",
    "- `int64`\n",
    "- `boolean`\n",
    "- `float`\n",
    "- `double`\n",
    "- `string`\n",
    "- `timestamp`\n",
    "- `decimal($precision, $scale)`\n",
    "\n",
    "These type names can be used to cast from one type to another"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries = connection.table('countries')\n",
    "countries\n",
    "connection.table('gdp')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries = connection.table('countries')\n",
    "countries.population.cast('float').sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries.area_km2.cast('int32').sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Case / if-then-else expressions\n",
    "\n",
    "\n",
    "We support a number of variants of the SQL-equivalent `CASE` expression, and will add more API functions over time to meet different use cases and enhance the expressiveness of any branching-based value logic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr = (countries.continent\n",
    "        .case()\n",
    "        .when('AF', 'Africa')\n",
    "        .when('AN', 'Antarctica')\n",
    "        .when('AS', 'Asia')\n",
    "        .when('EU', 'Europe')\n",
    "        .when('NA', 'North America')\n",
    "        .when('OC', 'Oceania')\n",
    "        .when('SA', 'South America')\n",
    "        .else_(countries.continent)\n",
    "        .end()\n",
    "        .name('continent_name'))\n",
    "\n",
    "expr.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the `else_` default condition is not provided, any values not matching one of the conditions will be `NULL`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr = (countries.continent\n",
    "        .case()\n",
    "        .when('AF', 'Africa')\n",
    "        .when('AS', 'Asia')\n",
    "        .when('EU', 'Europe')\n",
    "        .when('NA', 'North America')\n",
    "        .when('OC', 'Oceania')\n",
    "        .when('SA', 'South America')\n",
    "        .end()\n",
    "        .name('continent_name_with_nulls'))\n",
    "\n",
    "expr.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To test for an arbitrary series of boolean conditions, use the `case` API method and pass any boolean expressions potentially involving columns of the table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr = (ibis.case()\n",
    "        .when(countries.population > 25_000_000, 'big')\n",
    "        .when(countries.population < 5_000_000, 'small')\n",
    "        .else_('medium')\n",
    "        .end()\n",
    "        .name('size'))\n",
    "\n",
    "countries['name', 'population', expr].limit(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Simple ternary-cases (like the Python `X if COND else Y`) can be written using the `ifelse` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr = ((countries.continent == 'AS')\n",
    "        .ifelse('Asia', 'Not Asia')\n",
    "        .name('is_asia'))\n",
    "\n",
    "countries['name', 'continent', expr].limit(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set membership\n",
    "\n",
    "\n",
    "The `isin` and `notin` functions are like their pandas counterparts. These can take:\n",
    "\n",
    "- A list of value expressions, either literal values or other column expressions\n",
    "- An array/column expression of some kind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "is_america = countries.continent.isin(['NA', 'SA'])\n",
    "countries[is_america].continent.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also check for membership in an array. Here is an example of filtering based on the top 3 (ignoring ties) most frequently-occurring values in the `string_col` column of alltypes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_continents = countries.continent.value_counts().limit(3).continent\n",
    "top_continents_filter = countries.continent.isin(top_continents)\n",
    "expr = countries[top_continents_filter]\n",
    "\n",
    "expr.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a common enough operation that we provide a special analytical filter function `topk`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries.continent.topk(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cool, huh? More on `topk` later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Null-ness\n",
    "\n",
    "Like their pandas equivalents, the `isnull` and `notnull` functions return True values if the values are null, or non-null, respectively. For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr = (countries.continent\n",
    "        .case()\n",
    "        .when('AF', 'Africa')\n",
    "        .when('EU', 'Europe')\n",
    "        .when('AS', 'Asia')\n",
    "        .end()\n",
    "        .name('top_continent_name'))\n",
    "\n",
    "expr.isnull().value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Functions like `isnull` can be combined with `case` expressions or functions like `ifelse` to replace null values with some other value. `ifelse` here will use the first value supplied for any `True` value and the second value for any `False` value. Either value can be a scalar or array. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr2 = expr.isnull().ifelse('Other continent', expr).name('continent')\n",
    "expr2.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distinct-based operations\n",
    "\n",
    "\n",
    "Ibis supports using `distinct` to remove duplicate rows or values on tables or arrays. For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries['continent'].distinct()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries.continent.distinct()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can be combined with `count` to form a reduction metric:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metric = (countries.continent\n",
    "          .distinct().count()\n",
    "          .name('num_continents'))\n",
    "metric"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is common enough to have a shortcut `nunique`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries.continent.nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String operations\n",
    "\n",
    "\n",
    "What's supported is pretty basic right now. We intend to support the full gamut of regular expression munging with a nice API, though in some cases some work will be required on SQLite's backend to support everything. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries[['name']].limit(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the moment, basic substring operations (`substr`, with conveniences `left` and `right`) and Python-like APIs such as `lower` and `upper` (for case normalization) are supported. So you could count first letter occurrences in a string column like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr = countries.name.lower().left(1).name('first_letter')\n",
    "expr.value_counts().sort_by(('count', False)).limit(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For fuzzy and regex filtering/searching, you can use one of the following\n",
    "\n",
    "- `like`, works as the SQL `LIKE` keyword\n",
    "- `rlike`, like `re.search` or SQL `RLIKE`\n",
    "- `contains`, like `x in str_value` in Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries[countries.name.like('%GE%')].name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries[countries.name.lower().rlike('.*ge.*')].name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries[countries.name.lower().contains('ge')].name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Timestamp operations\n",
    "\n",
    "\n",
    "Date and time functionality is relatively limited at present compared with pandas, but we'll get there. The main things we have right now are\n",
    "\n",
    "- Field access (year, month, day, ...)\n",
    "- Timedeltas\n",
    "- Comparisons with fixed timestamps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "independence = connection.table('independence')\n",
    "\n",
    "independence[independence.independence_date, independence.independence_date.month().name('month')].limit(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Somewhat more comprehensively"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_field(f):\n",
    "    return getattr(independence.independence_date, f)().name(f)\n",
    "\n",
    "fields = ['year', 'month', 'day']  # datetime fields can also use: 'hour', 'minute', 'second', 'millisecond'\n",
    "projection = [independence.independence_date] + [get_field(x) for x in fields]\n",
    "independence[projection].limit(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For timestamp arithmetic and comparisons, check out functions in the top level `ibis` namespace. This include things like `day` and `second`, but also the `ibis.timestamp` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "independence[independence.independence_date.min(),\n",
    "             independence.independence_date.max(),\n",
    "             independence.count().name('nrows')].distinct()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "independence[independence.independence_date > '2000-01-01'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some backends support adding offsets, for example `independence.independence_date + ibis.interval(days=1)` or `ibis.now() - independence.independence_date`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
