{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Ibis\n",
    "\n",
    "Ibis is a Python framework to access data and perform analytical computations from different sources, in a standard way.\n",
    "\n",
    "In a way, you can think of Ibis as writing SQL in Python, with a focus on analytics, more than simply accessing data. And aside from SQL databases, you can use it with other backends, including big data systems.\n",
    "\n",
    "Why not simply use SQL instead? SQL is great and widely used. However, SQL has different flavors for different database engines, and SQL is very difficult to maintain when your queries are very complex. Ibis solves both problems by standardizing your code across backends and making it maintainable. Since Ibis is Python, you can structure your code in different files, functions, name variables, write tests, etc.\n",
    "\n",
    "This tutorial will guide you through Ibis features and provide practical examples. Some knowledge of Python is assumed and knowledge of SQL will be helpful but not required.\n",
    "\n",
    "Ibis is open source - if anything can be improved in this tutorial, or in Ibis itself, please open an issue in the [Ibis GitHub repository](https://github.com/ibis-project/ibis/) or open a pull request with the fix."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting started\n",
    "\n",
    "To start using Ibis, you need a Python environment with Ibis installed. If you don't know how to create an environment, we recommend following the [setup instructions](https://ibis-project.org/getting_started.html) in the Ibis website.\n",
    "\n",
    "Once you have your environment ready, to start using Ibis simply import the `ibis` module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ibis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make it things easier in this tutorial, we will be using _Ibis interactive mode_. For production code, that will rarely be the case. More details on Ibis non-interactive (aka lazy) mode are covered in the third tutorial, _Expressions, lazy mode and logging queries_.\n",
    "\n",
    "To set the interactive mode, use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "ibis.options.interactive = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next thing we need is to create a connection object. The connection defines where the data is stored and where the computations will be performed.\n",
    "\n",
    "For a comparison to pandas, this is not the same as where the data is imported from (e.g. `pandas.read_sql`). pandas loads data into memory and performs the computations itself. Ibis won't load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will _ask_ the backend to perform the computations.\n",
    "\n",
    "In this tutorial we will be using a SQLite connection for its simplicity (no installation is needed). But Ibis can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL, MySQL,...).\n",
    "\n",
    "To create a SQL connection to our example SQLite database `data/geography.db`, use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "database_file_path = os.path.join('data', 'geography.db')\n",
    "\n",
    "connection = ibis.sqlite.connect(database_file_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that if you installed Ibis with `pip` instead of `conda`, you may need to install the SQLite backend separately with `pip install ibis-framework[sqlite]`.\n",
    "\n",
    "### Exploring the data\n",
    "\n",
    "To list the tables in the `connection` object, we can use the `.list_tables()` method. If you are using Jupyter, you can see all the methods and attributes of the `connection` object by writing `connection.` and pressing the `<TAB>` key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['countries', 'gdp']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "connection.list_tables()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These two tables include data about countries, and about GDP by country and year.\n",
    "\n",
    "The data from countries has been obtained from [GeoNames](https://www.geonames.org/countries/).\n",
    "The GDP table will be used in the next tutorial, and the data has been obtained from the\n",
    "[World Bank website](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD).\n",
    "\n",
    "Next, we want to access a specific table in the database. We can create a handler to the `countries` table with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries = connection.table('countries')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To list the columns of the `countries` table, we can use the `columns` attribute.\n",
    "\n",
    "Again, Jupyter users can see all the methods and attributes of the `countries` object by typing `countries.` and pressing `<TAB>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['iso_alpha2',\n",
       " 'iso_alpha3',\n",
       " 'iso_numeric',\n",
       " 'fips',\n",
       " 'name',\n",
       " 'capital',\n",
       " 'area_km2',\n",
       " 'population',\n",
       " 'continent']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now access a sample of the data. Let's focus on the `name`, `continent` and `population` columns to start with. We can visualize the values of the columns with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                     name continent  population\n",
       "0                 Andorra        EU       84000\n",
       "1    United Arab Emirates        AS     4975593\n",
       "2             Afghanistan        AS    29121286\n",
       "3     Antigua and Barbuda        NA       86754\n",
       "4                Anguilla        NA       13254\n",
       "..                    ...       ...         ...\n",
       "247                 Yemen        AS    23495361\n",
       "248               Mayotte        AF      159042\n",
       "249          South Africa        AF    49000000\n",
       "250                Zambia        AF    13460305\n",
       "251              Zimbabwe        AF    13061000\n",
       "\n",
       "[252 rows x 3 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries['name', 'continent', 'population']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table is too big for all the results to be displayed, and we probably don't want to see all of them at once anyway. For this reason, just the beginning and the end of the results is displayed. Often, the number of rows will be so large that this operation could take a long time.\n",
    "\n",
    "To check how many rows a table has, we can use the `.count()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "252"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To fetch just a subset of the rows, we can use the `.limit(n)` method, where `n` is the number of samples we want. In this case we will fetch the first `3` countries from the table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                   name continent  population\n",
       "0               Andorra        EU       84000\n",
       "1  United Arab Emirates        AS     4975593\n",
       "2           Afghanistan        AS    29121286"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries['name', 'continent', 'population'].limit(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filters and order\n",
    "\n",
    "Now that we've got an intuition of the data available in the table `countries`, we will extract some information from it by applying filters and sorting the data.\n",
    "\n",
    "Let's focus on a single continent. We can see a list of unique continents in the table using the `.distinct()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    EU\n",
       "1    AS\n",
       "2    NA\n",
       "3    AF\n",
       "4    AN\n",
       "5    SA\n",
       "6    OC\n",
       "Name: continent, dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries['continent'].distinct()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will focus on Asia (`AS` in the table). We can identify which rows belong to Asian countries using the standard Python `==` operator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      False\n",
       "1       True\n",
       "2       True\n",
       "3      False\n",
       "4      False\n",
       "       ...  \n",
       "247     True\n",
       "248    False\n",
       "249    False\n",
       "250    False\n",
       "251    False\n",
       "Name: tmp, Length: 252, dtype: bool"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries['continent'] == 'AS'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result has a value `True` for rows where the condition is true, and the value `False` when it's not.\n",
    "\n",
    "We can provide this expression to the method `.filter()`, and save the result in the variable `asian_countries` for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                              name continent  population\n",
       "0             United Arab Emirates        AS     4975593\n",
       "1                      Afghanistan        AS    29121286\n",
       "2                          Armenia        AS     2968000\n",
       "3                       Azerbaijan        AS     8303512\n",
       "4                       Bangladesh        AS   156118464\n",
       "5                          Bahrain        AS      738004\n",
       "6                           Brunei        AS      395027\n",
       "7                           Bhutan        AS      699847\n",
       "8          Cocos [Keeling] Islands        AS         628\n",
       "9                            China        AS  1330044000\n",
       "10                         Georgia        AS     4630000\n",
       "11                       Hong Kong        AS     6898686\n",
       "12                       Indonesia        AS   242968342\n",
       "13                          Israel        AS     7353985\n",
       "14                           India        AS  1173108018\n",
       "15  British Indian Ocean Territory        AS        4000\n",
       "16                            Iraq        AS    29671605\n",
       "17                            Iran        AS    76923300\n",
       "18                          Jordan        AS     6407085\n",
       "19                           Japan        AS   127288000\n",
       "20                      Kyrgyzstan        AS     5776500\n",
       "21                        Cambodia        AS    14453680\n",
       "22                     North Korea        AS    22912177\n",
       "23                     South Korea        AS    48422644\n",
       "24                          Kuwait        AS     2789132\n",
       "25                      Kazakhstan        AS    15340000\n",
       "26                            Laos        AS     6368162\n",
       "27                         Lebanon        AS     4125247\n",
       "28                       Sri Lanka        AS    21513990\n",
       "29                         Myanmar        AS    53414374\n",
       "30                        Mongolia        AS     3086918\n",
       "31                           Macao        AS      449198\n",
       "32                        Maldives        AS      395650\n",
       "33                        Malaysia        AS    28274729\n",
       "34                           Nepal        AS    28951852\n",
       "35                            Oman        AS     2967717\n",
       "36                     Philippines        AS    99900177\n",
       "37                        Pakistan        AS   184404791\n",
       "38                       Palestine        AS     3800000\n",
       "39                           Qatar        AS      840926\n",
       "40                    Saudi Arabia        AS    25731776\n",
       "41                       Singapore        AS     4701069\n",
       "42                           Syria        AS    22198110\n",
       "43                        Thailand        AS    67089500\n",
       "44                      Tajikistan        AS     7487489\n",
       "45                    Turkmenistan        AS     4940916\n",
       "46                          Turkey        AS    77804122\n",
       "47                          Taiwan        AS    22894384\n",
       "48                      Uzbekistan        AS    27865738\n",
       "49                         Vietnam        AS    89571130\n",
       "50                           Yemen        AS    23495361"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "asian_countries = countries['name', 'continent', 'population'].filter(countries['continent'] == 'AS')\n",
    "asian_countries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check how many countries exist in Asia (based on the information in the database) by using the `.count()` method we've already seen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "51"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "asian_countries.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we want to find the most populated countries in Asia. To obtain them, we are going to sort the countries by the column `population`, and just fetch the first 10. To sort by a column in Ibis, we can use the `.sort_by()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                             name continent  population\n",
       "0         Cocos [Keeling] Islands        AS         628\n",
       "1  British Indian Ocean Territory        AS        4000\n",
       "2                          Brunei        AS      395027\n",
       "3                        Maldives        AS      395650\n",
       "4                           Macao        AS      449198\n",
       "5                          Bhutan        AS      699847\n",
       "6                         Bahrain        AS      738004\n",
       "7                           Qatar        AS      840926\n",
       "8                          Kuwait        AS     2789132\n",
       "9                            Oman        AS     2967717"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "asian_countries.sort_by('population').limit(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will return the least populated countries, since `.sort_by` will by default order in ascending order (ascending order like in `1, 2, 3, 4`). This behavior is consistent with SQL `ORDER BY`.\n",
    "\n",
    "To order in descending order we can use `ibis.desc()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "          name continent  population\n",
       "0        China        AS  1330044000\n",
       "1        India        AS  1173108018\n",
       "2    Indonesia        AS   242968342\n",
       "3     Pakistan        AS   184404791\n",
       "4   Bangladesh        AS   156118464\n",
       "5        Japan        AS   127288000\n",
       "6  Philippines        AS    99900177\n",
       "7      Vietnam        AS    89571130\n",
       "8       Turkey        AS    77804122\n",
       "9         Iran        AS    76923300"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "asian_countries.sort_by(ibis.desc('population')).limit(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the list of the 10 most populated countries based on the data from [GeoNames](https://www.geonames.org/).\n",
    "\n",
    "To learn more about Ibis, continue to the next tutorial."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
