{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Aggregating and joining data\n",
    "\n",
    "This is the second introductory tutorial to Ibis. If you are new to Ibis, you may want to start\n",
    "by the first tutorial, _01-Introduction-to-Ibis_.\n",
    "\n",
    "In the first tutorial, we saw how to operate on the data of a table. We will work again with\n",
    "the `countries` table as we did previously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                     name continent   area_km2  population\n",
       "0                 Andorra        EU      468.0       84000\n",
       "1    United Arab Emirates        AS    82880.0     4975593\n",
       "2             Afghanistan        AS   647500.0    29121286\n",
       "3     Antigua and Barbuda        NA      443.0       86754\n",
       "4                Anguilla        NA      102.0       13254\n",
       "..                    ...       ...        ...         ...\n",
       "247                 Yemen        AS   527970.0    23495361\n",
       "248               Mayotte        AF      374.0      159042\n",
       "249          South Africa        AF  1219912.0    49000000\n",
       "250                Zambia        AF   752614.0    13460305\n",
       "251              Zimbabwe        AF   390580.0    13061000\n",
       "\n",
       "[252 rows x 4 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import ibis\n",
    "\n",
    "\n",
    "ibis.options.interactive = True\n",
    "\n",
    "connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))\n",
    "countries = connection.table('countries')\n",
    "\n",
    "countries['name', 'continent', 'area_km2', 'population']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Expressions\n",
    "\n",
    "We will continue by exploring the data by continent. We will start by creating an expression\n",
    "with the continent names, since our table only contains the abbreviations.\n",
    "\n",
    "An expression is one or more operations performed over the data. They can be used to retrieve the\n",
    "data or to build more complex operations.\n",
    "\n",
    "In this case we will use a `case` conditional statement to replace values depending on a condition.\n",
    "A `case` expression will return a case builder, and must be followed by one or more `when` calls,\n",
    "optionally an `else_` call, and must end with a call to `end`, to complete the full expression.\n",
    "The expression where `case` is called (`countries['continent']` in this case)\n",
    "is evaluated to see if it's equal to any of the first arguments of the calls to `when`. And the second\n",
    "argument is returned. If the value does not match any of the `when` values, the value of `else_` is returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0             Europe\n",
       "1               Asia\n",
       "2               Asia\n",
       "3      North America\n",
       "4      North America\n",
       "           ...      \n",
       "247             Asia\n",
       "248           Africa\n",
       "249           Africa\n",
       "250           Africa\n",
       "251           Africa\n",
       "Name: tmp, Length: 252, dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "continent_name = (countries['continent'].case()\n",
    "                                        .when('NA', 'North America')\n",
    "                                        .when('SA', 'South America')\n",
    "                                        .when('EU', 'Europe')\n",
    "                                        .when('AF', 'Africa')\n",
    "                                        .when('AS', 'Asia')\n",
    "                                        .when('OC', 'Oceania')\n",
    "                                        .when('AN', 'Anctartica')\n",
    "                                        .else_('Unknown continent')\n",
    "                                        .end()\n",
    "                                        .name('continent_name'))\n",
    "continent_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What we did is take the values of the column `countries['continent']`, and we created a calculated\n",
    "column with the names of the continents, as defined in the `when` methods.\n",
    "\n",
    "This calculated column is an expression. The computations didn't happen when defining the `continent_name`\n",
    "variable, and the results are not stored. They have been computed when we printed its content.\n",
    "\n",
    "We can see that by checking the type of `continent_name`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ibis.expr.types.StringColumn"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(continent_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next tutorial we will see more about eager and lazy mode, and when operations are being\n",
    "executed. For now we can think that the query to the database happens only when we want to see\n",
    "the results.\n",
    "\n",
    "The important part is that now we can use our `continent_name` expression in other expressions.\n",
    "For example, since this is a column (a `StringColumn` to be specific), we can use it as a column\n",
    "to query the countries table.\n",
    "\n",
    "Note that when we created the expression we added `.name('continent_name')` to it, so the column\n",
    "has a name when being returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                     name continent_name   area_km2  population\n",
       "0                 Andorra         Europe      468.0       84000\n",
       "1    United Arab Emirates           Asia    82880.0     4975593\n",
       "2             Afghanistan           Asia   647500.0    29121286\n",
       "3     Antigua and Barbuda  North America      443.0       86754\n",
       "4                Anguilla  North America      102.0       13254\n",
       "..                    ...            ...        ...         ...\n",
       "247                 Yemen           Asia   527970.0    23495361\n",
       "248               Mayotte         Africa      374.0      159042\n",
       "249          South Africa         Africa  1219912.0    49000000\n",
       "250                Zambia         Africa   752614.0    13460305\n",
       "251              Zimbabwe         Africa   390580.0    13061000\n",
       "\n",
       "[252 rows x 4 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries['name', continent_name, 'area_km2', 'population']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just for illustration, let's repeat the same query, but renaming the expression to `continent`\n",
    "when using it in the list of columns to fetch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                     name      continent   area_km2  population\n",
       "0                 Andorra         Europe      468.0       84000\n",
       "1    United Arab Emirates           Asia    82880.0     4975593\n",
       "2             Afghanistan           Asia   647500.0    29121286\n",
       "3     Antigua and Barbuda  North America      443.0       86754\n",
       "4                Anguilla  North America      102.0       13254\n",
       "..                    ...            ...        ...         ...\n",
       "247                 Yemen           Asia   527970.0    23495361\n",
       "248               Mayotte         Africa      374.0      159042\n",
       "249          South Africa         Africa  1219912.0    49000000\n",
       "250                Zambia         Africa   752614.0    13460305\n",
       "251              Zimbabwe         Africa   390580.0    13061000\n",
       "\n",
       "[252 rows x 4 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries['name', continent_name.name('continent'), 'area_km2', 'population']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregating data\n",
    "\n",
    "Now, let's group our data by continent, and let's find the total population of each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "  continent_name  total_population\n",
       "0         Africa        1021238685\n",
       "1     Anctartica               170\n",
       "2           Asia        4130584841\n",
       "3         Europe         750724554\n",
       "4  North America         540204371\n",
       "5        Oceania          36067549\n",
       "6  South America         400143568"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries.group_by(continent_name).aggregate(countries['population'].sum().name('total_population'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see how Asia is the most populated country, followed by Africa. Antarctica is the least populated,\n",
    "as we would expect.\n",
    "\n",
    "The code to aggregate has two main parts:\n",
    "- The `group_by` method, that receive the column, expression or list of them to group by\n",
    "- The `aggregate` method, that receives an expression with the reduction we want to apply\n",
    "\n",
    "To make things a bit clearer, let's first save the reduction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6878963738"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total_population = countries['population'].sum().name('total_population')\n",
    "total_population"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, if we perform the operation directly, we will get the sum of the total in the column.\n",
    "\n",
    "But if we take the `total_population` expression as the parameter of the `aggregate` method, then the total is computed\n",
    "over every group defined by the `group_by` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "  continent_name  total_population\n",
       "0         Africa        1021238685\n",
       "1     Anctartica               170\n",
       "2           Asia        4130584841\n",
       "3         Europe         750724554\n",
       "4  North America         540204371\n",
       "5        Oceania          36067549\n",
       "6  South America         400143568"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries.group_by(continent_name).aggregate(total_population)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to compute two aggregates at the same time, we can pass a list to the `aggregate` method.\n",
    "\n",
    "For illustration, we use the `continent` column, instead of the `continent_names` expression. We can\n",
    "use both column names and expressions, and also a list with any of them (e.g. `[continent_names, 'name']`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "  continent  total_population  average_area\n",
       "0        AF        1021238685  5.234534e+05\n",
       "1        AN               170  2.802439e+06\n",
       "2        AS        4130584841  6.196685e+05\n",
       "3        EU         750724554  4.293017e+05\n",
       "4        NA         540204371  5.836313e+05\n",
       "5        OC          36067549  3.044157e+05\n",
       "6        SA         400143568  1.272751e+06"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries.group_by('continent').aggregate([total_population,\n",
    "                                           countries['area_km2'].mean().name('average_area')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Joining data\n",
    "\n",
    "Now we are going to get the total gross domestic product (GDP) for each continent. In this case, the GDP data\n",
    "is not in the same table `countries`, but in a table `gdp`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "     country_code  year         value\n",
       "0             ABW  1986  4.054634e+08\n",
       "1             ABW  1987  4.876025e+08\n",
       "2             ABW  1988  5.964236e+08\n",
       "3             ABW  1989  6.953044e+08\n",
       "4             ABW  1990  7.648871e+08\n",
       "...           ...   ...           ...\n",
       "9995          SVK  2002  3.513034e+10\n",
       "9996          SVK  2003  4.681659e+10\n",
       "9997          SVK  2004  5.733202e+10\n",
       "9998          SVK  2005  6.278531e+10\n",
       "9999          SVK  2006  7.070810e+10\n",
       "\n",
       "[10000 rows x 3 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdp = connection.table('gdp')\n",
    "gdp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table contains information for different years, we can easily check the range with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1960, 2017)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdp['year'].min(), gdp['year'].max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we are going to join this data with the `countries` table so we can obtain the continent\n",
    "of each country. The `countries` table has several different codes for the countries. Let's find out which\n",
    "one matches the three letter code in the `gdp` table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    iso_alpha2 iso_alpha3  iso_numeric fips                  name\n",
       "0           AD        AND           20   AN               Andorra\n",
       "1           AE        ARE          784   AE  United Arab Emirates\n",
       "2           AF        AFG            4   AF           Afghanistan\n",
       "3           AG        ATG           28   AC   Antigua and Barbuda\n",
       "4           AI        AIA          660   AV              Anguilla\n",
       "..         ...        ...          ...  ...                   ...\n",
       "247         YE        YEM          887   YM                 Yemen\n",
       "248         YT        MYT          175   MF               Mayotte\n",
       "249         ZA        ZAF          710   SF          South Africa\n",
       "250         ZM        ZMB          894   ZA                Zambia\n",
       "251         ZW        ZWE          716   ZI              Zimbabwe\n",
       "\n",
       "[252 rows x 5 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries['iso_alpha2', 'iso_alpha3', 'iso_numeric', 'fips', 'name']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `country_code` in `gdp` corresponds to `iso_alpha2` in the `countries` table. We can also see\n",
    "how the `gdp` table has `10,000` rows, while `countries` has `252`. We will start joining the\n",
    "two tables by the codes that match, discarding the codes that do not exist in both tables.\n",
    "This is called an inner join."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "     iso_alpha2 iso_alpha3  iso_numeric fips      name           capital  \\\n",
       "0            AD        AND           20   AN   Andorra  Andorra la Vella   \n",
       "1            AD        AND           20   AN   Andorra  Andorra la Vella   \n",
       "2            AD        AND           20   AN   Andorra  Andorra la Vella   \n",
       "3            AD        AND           20   AN   Andorra  Andorra la Vella   \n",
       "4            AD        AND           20   AN   Andorra  Andorra la Vella   \n",
       "...         ...        ...          ...  ...       ...               ...   \n",
       "9482         ZW        ZWE          716   ZI  Zimbabwe            Harare   \n",
       "9483         ZW        ZWE          716   ZI  Zimbabwe            Harare   \n",
       "9484         ZW        ZWE          716   ZI  Zimbabwe            Harare   \n",
       "9485         ZW        ZWE          716   ZI  Zimbabwe            Harare   \n",
       "9486         ZW        ZWE          716   ZI  Zimbabwe            Harare   \n",
       "\n",
       "      area_km2  population continent country_code  year         value  \n",
       "0        468.0       84000        EU          AND  1970  7.861921e+07  \n",
       "1        468.0       84000        EU          AND  1971  8.940982e+07  \n",
       "2        468.0       84000        EU          AND  1972  1.134082e+08  \n",
       "3        468.0       84000        EU          AND  1973  1.508201e+08  \n",
       "4        468.0       84000        EU          AND  1974  1.865587e+08  \n",
       "...        ...         ...       ...          ...   ...           ...  \n",
       "9482  390580.0    13061000        AF          ZWE  2013  1.909102e+10  \n",
       "9483  390580.0    13061000        AF          ZWE  2014  1.949552e+10  \n",
       "9484  390580.0    13061000        AF          ZWE  2015  1.996312e+10  \n",
       "9485  390580.0    13061000        AF          ZWE  2016  2.054868e+10  \n",
       "9486  390580.0    13061000        AF          ZWE  2017  2.281301e+10  \n",
       "\n",
       "[9487 rows x 12 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries_and_gdp = countries.inner_join(gdp,\n",
    "                                         predicates=countries['iso_alpha3'] == gdp['country_code'])\n",
    "countries_and_gdp[countries, gdp]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We joined the table with the information for all years. Now we are going to just take the information about the last available year, 2017."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    country_code  year         value\n",
       "0            ABW  2017  2.700559e+09\n",
       "1            AFG  2017  2.019176e+10\n",
       "2            AGO  2017  1.221238e+11\n",
       "3            ALB  2017  1.302506e+10\n",
       "4            AND  2017  3.013387e+09\n",
       "..           ...   ...           ...\n",
       "242          XKX  2017  7.227700e+09\n",
       "243          YEM  2017  2.681870e+10\n",
       "244          ZAF  2017  3.495541e+11\n",
       "245          ZMB  2017  2.586814e+10\n",
       "246          ZWE  2017  2.281301e+10\n",
       "\n",
       "[247 rows x 3 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdp_2017 = gdp.filter(gdp['year'] == 2017)\n",
    "gdp_2017"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Joining with the new expression we get:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    iso_alpha2 iso_alpha3  iso_numeric fips          name           capital  \\\n",
       "0           AW        ABW          533   AA         Aruba        Oranjestad   \n",
       "1           AF        AFG            4   AF   Afghanistan             Kabul   \n",
       "2           AO        AGO           24   AO        Angola            Luanda   \n",
       "3           AL        ALB            8   AL       Albania            Tirana   \n",
       "4           AD        AND           20   AN       Andorra  Andorra la Vella   \n",
       "..         ...        ...          ...  ...           ...               ...   \n",
       "196         XK        XKX            0   KV        Kosovo          Pristina   \n",
       "197         YE        YEM          887   YM         Yemen             Sanaa   \n",
       "198         ZA        ZAF          710   SF  South Africa          Pretoria   \n",
       "199         ZM        ZMB          894   ZA        Zambia            Lusaka   \n",
       "200         ZW        ZWE          716   ZI      Zimbabwe            Harare   \n",
       "\n",
       "      area_km2  population continent country_code  year         value  \n",
       "0        193.0       71566        NA          ABW  2017  2.700559e+09  \n",
       "1     647500.0    29121286        AS          AFG  2017  2.019176e+10  \n",
       "2    1246700.0    13068161        AF          AGO  2017  1.221238e+11  \n",
       "3      28748.0     2986952        EU          ALB  2017  1.302506e+10  \n",
       "4        468.0       84000        EU          AND  2017  3.013387e+09  \n",
       "..         ...         ...       ...          ...   ...           ...  \n",
       "196    10908.0     1800000        EU          XKX  2017  7.227700e+09  \n",
       "197   527970.0    23495361        AS          YEM  2017  2.681870e+10  \n",
       "198  1219912.0    49000000        AF          ZAF  2017  3.495541e+11  \n",
       "199   752614.0    13460305        AF          ZMB  2017  2.586814e+10  \n",
       "200   390580.0    13061000        AF          ZWE  2017  2.281301e+10  \n",
       "\n",
       "[201 rows x 12 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries_and_gdp = countries.inner_join(gdp_2017,\n",
    "                                         predicates=countries['iso_alpha3'] == gdp_2017['country_code'])\n",
    "countries_and_gdp[countries, gdp_2017]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have called the `inner_join` method of the `countries` table and passed\n",
    "the `gdp` table as a parameter. The method receives a second parameter, `predicates`, that is used to specify\n",
    "how the join will be performed. In this case we want the `iso_alpha3` column in `countries` to\n",
    "match the `country_code` column in `gdp`. This is specified with the expression\n",
    "`countries['iso_alpha3'] == gdp['country_code']`.\n",
    "\n",
    "In the example `countries_and_gdp` is not an expression that can be executed directly. The result of a join\n",
    "can cause conflicts if both tables have column names in common. Before we can execute the query, we need\n",
    "to materialize the join, by calling the `.materialize()` method, or by selecting the columns to be returned."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
