{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Allocation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The allocation module provides some utils to be used before running A/B test experiments. Groups allocation is the \n",
    "process that assigns (allocates) a list of users either to a group A (e.g. control) or to a group B (e.g. treatment). \n",
    "This module provides functionalities to randomly allocate users in two or more groups (A/B/C/...).\n",
    "\n",
    "Let's import first the tools needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from abexp.core.allocation import Allocator\n",
    "from abexp.core.analysis_frequentist import FrequentistAnalyzer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Complete randomization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we want to randomly assign users in *n* groups (where *n*=2) in order to run an A/B test experiment with 2 \n",
    "variants, so  called control and treatment groups. Complete randomization does not require any data on the user, and in \n",
    "practice, it yields balanced design for large-sample sizes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "# Generate random data\n",
    "user_id = np.arange(100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "# Run allocation\n",
    "df, stats = Allocator.complete_randomization(user_id=user_id, \n",
    "                                             ngroups=2,\n",
    "                                             prop=[0.4, 0.6],\n",
    "                                             seed=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id  group\n",
       "0        0      1\n",
       "1        1      1\n",
       "2        2      1\n",
       "3        3      1\n",
       "4        4      1"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Users list with group assigned\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>group</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>#users</th>\n",
       "      <td>40</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "group    0   1\n",
       "#users  40  60"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Statistics of the randomization: #users per group\n",
    "stats"
   ]
  },
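  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea concrete, here is a minimal NumPy-only sketch of complete randomization. It is not the implementation \n",
    "behind `Allocator.complete_randomization`; the helper name `complete_randomization_sketch` and the rounding rule used \n",
    "for the group sizes are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def complete_randomization_sketch(user_id, prop, seed=None):\n",
    "    # Hypothetical helper for illustration; not abexp's implementation.\n",
    "    rng = np.random.default_rng(seed)\n",
    "    user_id = np.asarray(user_id)\n",
    "    # Split points obtained by rounding the cumulative proportions.\n",
    "    cuts = np.round(np.cumsum(prop) * len(user_id)).astype(int)\n",
    "    sizes = np.diff(np.concatenate(([0], cuts)))\n",
    "    # Build the labels 0, 1, ... with the requested sizes, then shuffle them.\n",
    "    labels = np.repeat(np.arange(len(prop)), sizes)\n",
    "    return rng.permutation(labels)\n",
    "\n",
    "\n",
    "groups = complete_randomization_sketch(np.arange(100), prop=[0.4, 0.6], seed=42)\n",
    "print(np.bincount(groups))  # [40 60]\n",
    "```\n",
    "\n",
    "Shuffling a fixed vector of labels gives exact group sizes; drawing each label independently would match the \n",
    "proportions only on average."
   ]
  },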
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: Post-allocation checks can be made to ensure the groups homogeneity and in case of imbalance, a new randomization \n",
    "can be performed (see the [Homogeneity check](#homogeneity_check) section below for details)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Blocks randomization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In some case, one would like to consider one or more confounding factor(s) i.e. features which could unbalance the \n",
    "groups and bias the results if not taken into account during the randomization process. In this example we want to \n",
    "randomly assign users in n groups (where n=3, one control and two treatment groups) considering a confounding factor \n",
    "('level'). Users with similar characteristics (level) define a block, and randomization is conducted within a block. \n",
    "This enables balanced and homogeneous groups of similar sizes according to the confounding feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "# Generate random data\n",
    "np.random.seed(42)\n",
    "df = pd.DataFrame(data={'user_id': np.arange(1000),\n",
    "                        'level': np.random.randint(1, 6, size=1000)})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "# Run allocation\n",
    "df, stats = Allocator.blocks_randomization(df=df, \n",
    "                                           id_col='user_id', \n",
    "                                           stratum_cols='level',\n",
    "                                           ngroups=3, \n",
    "                                           seed=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>level</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id  level  group\n",
       "0        0      4      1\n",
       "1        1      5      2\n",
       "2        2      3      2\n",
       "3        3      5      1\n",
       "4        4      5      0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Users data with group assigned\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>group</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>level</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>70</td>\n",
       "      <td>70</td>\n",
       "      <td>70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>64</td>\n",
       "      <td>63</td>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>62</td>\n",
       "      <td>64</td>\n",
       "      <td>64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>69</td>\n",
       "      <td>69</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>68</td>\n",
       "      <td>68</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "group   0   1   2\n",
       "level            \n",
       "1      70  70  70\n",
       "2      64  63  63\n",
       "3      62  64  64\n",
       "4      69  69  68\n",
       "5      68  68  68"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Statistics of the randomization: #users per group in each level\n",
    "stats"
   ]
  },
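  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The within-block shuffling described above can be sketched with plain pandas and NumPy. This is only an illustration \n",
    "under stated assumptions, not the code behind `Allocator.blocks_randomization`: the helper name \n",
    "`blocks_randomization_sketch`, the round-robin dealing of labels and the toy `users` data are all hypothetical.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def blocks_randomization_sketch(df, stratum_cols, ngroups, seed=None):\n",
    "    # Hypothetical helper for illustration; not abexp's implementation.\n",
    "    rng = np.random.default_rng(seed)\n",
    "    out = df.copy()\n",
    "    out['group'] = -1\n",
    "    # Each distinct combination of stratum values is one block.\n",
    "    for idx in out.groupby(stratum_cols).groups.values():\n",
    "        # Deal labels round-robin from a random starting group, then shuffle\n",
    "        # them so that the assignment inside the block is random.\n",
    "        labels = (np.arange(len(idx)) + rng.integers(ngroups)) % ngroups\n",
    "        out.loc[idx, 'group'] = rng.permutation(labels)\n",
    "    return out\n",
    "\n",
    "\n",
    "toy_rng = np.random.default_rng(0)\n",
    "users = pd.DataFrame({'user_id': np.arange(1000),\n",
    "                      'level': toy_rng.integers(1, 6, size=1000)})\n",
    "allocated = blocks_randomization_sketch(users, 'level', ngroups=3, seed=42)\n",
    "\n",
    "# Number of users per group in each level\n",
    "print(pd.crosstab(allocated['level'], allocated['group']))\n",
    "```\n",
    "\n",
    "Because the labels are dealt round-robin inside each block, the group sizes within a block differ by at most one user."
   ]
  },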
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Multi-level block randomization__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can stratify randomization on two or more features. In the example below we want to randomly allocate users in *n* \n",
    "groups (where *n*=5) in order to run an A/B test experiment with 5 variants, one control and four treatment groups. The\n",
    "stratification will be based on the user level and paying status in order to create homogeneous groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "# Generate random data\n",
    "np.random.seed(42)\n",
    "df = pd.DataFrame(data={'user_id': np.arange(1000),\n",
    "                        'is_paying': np.random.randint(0, 2, size=1000),\n",
    "                        'level': np.random.randint(1, 7, size=1000)})\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "# Run allocation\n",
    "df, stats = Allocator.blocks_randomization(df=df, \n",
    "                                           id_col='user_id', \n",
    "                                           stratum_cols=['level', 'is_paying'], \n",
    "                                           ngroups=5,\n",
    "                                           seed=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>is_paying</th>\n",
       "      <th>level</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id  is_paying  level  group\n",
       "0        0          0      6      2\n",
       "1        1          1      1      1\n",
       "2        2          0      1      0\n",
       "3        3          0      1      3\n",
       "4        4          0      5      1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Users data with group assigned\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>level</th>\n",
       "      <th>is_paying</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>0</th>\n",
       "      <td>19</td>\n",
       "      <td>17</td>\n",
       "      <td>19</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">2</th>\n",
       "      <th>0</th>\n",
       "      <td>17</td>\n",
       "      <td>17</td>\n",
       "      <td>14</td>\n",
       "      <td>17</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>18</td>\n",
       "      <td>17</td>\n",
       "      <td>16</td>\n",
       "      <td>18</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">3</th>\n",
       "      <th>0</th>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "      <td>15</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">4</th>\n",
       "      <th>0</th>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "      <td>15</td>\n",
       "      <td>15</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">5</th>\n",
       "      <th>0</th>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "      <td>17</td>\n",
       "      <td>16</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">6</th>\n",
       "      <th>0</th>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>16</td>\n",
       "      <td>15</td>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "group             0   1   2   3   4\n",
       "level is_paying                    \n",
       "1     0          19  17  19  18  19\n",
       "      1          15  17  18  18  18\n",
       "2     0          17  17  14  17  17\n",
       "      1          18  17  16  18  17\n",
       "3     0          16  16  16  15  16\n",
       "      1          19  19  19  19  19\n",
       "4     0          12  12  12  12  11\n",
       "      1          15  15  15  14  15\n",
       "5     0          18  18  17  16  17\n",
       "      1          17  18  19  18  19\n",
       "6     0          18  19  19  18  18\n",
       "      1          16  15  16  16  15"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Statistics of the randomization: #users per group in each level and paying status\n",
    "stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homogeneity check\n",
    "<a id ='homogeneity_check'></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Complete randomization** does not guarantee homogeneous groups, but it yields balanced design for large-sample sizes. \n",
    "**Blocks randomization** guarantees homogeneous groups based on categorical variables (but not on continuous variable).\n",
    "\n",
    "Thus, we can perform post-allocation checks to ensure the groups homogeneity both for continuous or categorical \n",
    "variables. In case of imbalance, a new randomization can be performed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>points</th>\n",
       "      <th>collected_bonus</th>\n",
       "      <th>is_paying</th>\n",
       "      <th>level</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>202</td>\n",
       "      <td>6580</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>448</td>\n",
       "      <td>4075</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>370</td>\n",
       "      <td>2713</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>206</td>\n",
       "      <td>3062</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>171</td>\n",
       "      <td>3976</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id  points  collected_bonus  is_paying  level\n",
       "0        0     202             6580          1      4\n",
       "1        1     448             4075          0      5\n",
       "2        2     370             2713          1      6\n",
       "3        3     206             3062          0      3\n",
       "4        4     171             3976          0      5"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate random data\n",
    "np.random.seed(42)\n",
    "df = pd.DataFrame(data={'user_id': np.arange(1000),\n",
    "                        'points': np.random.randint(100, 500, size=1000),\n",
    "                        'collected_bonus': np.random.randint(2000, 7000, size=1000),\n",
    "                        'is_paying': np.random.randint(0, 2, size=1000),\n",
    "                        'level': np.random.randint(1, 7, size=1000)})\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Single iteration__\n",
    "\n",
    "In the cell below it is shown a single iteration of check homogeneity analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run allocation\n",
    "df, stats = Allocator.blocks_randomization(df=df, \n",
    "                                           id_col='user_id', \n",
    "                                           stratum_cols=['level', 'is_paying'], \n",
    "                                           ngroups=2,\n",
    "                                           seed=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>coef</th>\n",
       "      <th>std err</th>\n",
       "      <th>z</th>\n",
       "      <th>P&gt;|z|</th>\n",
       "      <th>[0.025</th>\n",
       "      <th>0.975]</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>user_id</th>\n",
       "      <td>-3.000000e-04</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-1.505000e+00</td>\n",
       "      <td>0.132</td>\n",
       "      <td>-0.001000</td>\n",
       "      <td>0.0001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>points</th>\n",
       "      <td>2.000000e-04</td>\n",
       "      <td>0.001000</td>\n",
       "      <td>3.660000e-01</td>\n",
       "      <td>0.714</td>\n",
       "      <td>-0.001000</td>\n",
       "      <td>0.0010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>collected_bonus</th>\n",
       "      <td>6.935000e-05</td>\n",
       "      <td>0.000044</td>\n",
       "      <td>1.559000e+00</td>\n",
       "      <td>0.119</td>\n",
       "      <td>-0.000018</td>\n",
       "      <td>0.0000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(is_paying, Treatment('1'))[T.0]</th>\n",
       "      <td>8.000000e-03</td>\n",
       "      <td>0.127000</td>\n",
       "      <td>6.300000e-02</td>\n",
       "      <td>0.950</td>\n",
       "      <td>-0.240000</td>\n",
       "      <td>0.2560</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.1]</th>\n",
       "      <td>-1.180000e-02</td>\n",
       "      <td>0.215000</td>\n",
       "      <td>-5.500000e-02</td>\n",
       "      <td>0.956</td>\n",
       "      <td>-0.433000</td>\n",
       "      <td>0.4090</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.2]</th>\n",
       "      <td>1.440000e-02</td>\n",
       "      <td>0.226000</td>\n",
       "      <td>6.400000e-02</td>\n",
       "      <td>0.949</td>\n",
       "      <td>-0.429000</td>\n",
       "      <td>0.4580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.4]</th>\n",
       "      <td>-1.646000e-16</td>\n",
       "      <td>0.213000</td>\n",
       "      <td>-7.740000e-16</td>\n",
       "      <td>1.000</td>\n",
       "      <td>-0.417000</td>\n",
       "      <td>0.4170</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.5]</th>\n",
       "      <td>-1.628000e-16</td>\n",
       "      <td>0.215000</td>\n",
       "      <td>-7.570000e-16</td>\n",
       "      <td>1.000</td>\n",
       "      <td>-0.422000</td>\n",
       "      <td>0.4220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.6]</th>\n",
       "      <td>-1.628000e-16</td>\n",
       "      <td>0.214000</td>\n",
       "      <td>-7.590000e-16</td>\n",
       "      <td>1.000</td>\n",
       "      <td>-0.420000</td>\n",
       "      <td>0.4200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           coef   std err             z  \\\n",
       "user_id                           -3.000000e-04  0.000000 -1.505000e+00   \n",
       "points                             2.000000e-04  0.001000  3.660000e-01   \n",
       "collected_bonus                    6.935000e-05  0.000044  1.559000e+00   \n",
       "C(is_paying, Treatment('1'))[T.0]  8.000000e-03  0.127000  6.300000e-02   \n",
       "C(level, Treatment('3'))[T.1]     -1.180000e-02  0.215000 -5.500000e-02   \n",
       "C(level, Treatment('3'))[T.2]      1.440000e-02  0.226000  6.400000e-02   \n",
       "C(level, Treatment('3'))[T.4]     -1.646000e-16  0.213000 -7.740000e-16   \n",
       "C(level, Treatment('3'))[T.5]     -1.628000e-16  0.215000 -7.570000e-16   \n",
       "C(level, Treatment('3'))[T.6]     -1.628000e-16  0.214000 -7.590000e-16   \n",
       "\n",
       "                                   P>|z|    [0.025  0.975]  \n",
       "user_id                            0.132 -0.001000  0.0001  \n",
       "points                             0.714 -0.001000  0.0010  \n",
       "collected_bonus                    0.119 -0.000018  0.0000  \n",
       "C(is_paying, Treatment('1'))[T.0]  0.950 -0.240000  0.2560  \n",
       "C(level, Treatment('3'))[T.1]      0.956 -0.433000  0.4090  \n",
       "C(level, Treatment('3'))[T.2]      0.949 -0.429000  0.4580  \n",
       "C(level, Treatment('3'))[T.4]      1.000 -0.417000  0.4170  \n",
       "C(level, Treatment('3'))[T.5]      1.000 -0.422000  0.4220  \n",
       "C(level, Treatment('3'))[T.6]      1.000 -0.420000  0.4200  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Run homogeneity check analysis\n",
    "X = df.drop(columns=['group'])\n",
    "y = df['group']\n",
    "\n",
    "analyzer = FrequentistAnalyzer()\n",
    "analysis = analyzer.check_homogeneity(X, y, cat_cols=['is_paying','level'])\n",
    "\n",
    "analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ``check_homogeneity`` function fits a univariate logistic regression for each feature of the input dataset. If the \n",
    "p-value (column ``P>|z|`` in the table above) of any variable is below a chosen threshold (e.g. ``threshold = 0.2``), \n",
    "the random allocation is considered non-homogeneous and must be repeated. For instance, in the table above the \n",
    "variable ``collected_bonus`` is not homogeneously split across groups (``p-value = 0.119``), and neither is ``user_id`` (``p-value = 0.132``)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Multiple iterations__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>points</th>\n",
       "      <th>collected_bonus</th>\n",
       "      <th>is_paying</th>\n",
       "      <th>level</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>202</td>\n",
       "      <td>6580</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>448</td>\n",
       "      <td>4075</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>370</td>\n",
       "      <td>2713</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>206</td>\n",
       "      <td>3062</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>171</td>\n",
       "      <td>3976</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id  points  collected_bonus  is_paying  level\n",
       "0        0     202             6580          1      4\n",
       "1        1     448             4075          0      5\n",
       "2        2     370             2713          1      6\n",
       "3        3     206             3062          0      3\n",
       "4        4     171             3976          0      5"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate random data\n",
    "np.random.seed(42)\n",
    "df = pd.DataFrame(data={'user_id': np.arange(1000),\n",
    "                        'points': np.random.randint(100, 500, size=1000),\n",
    "                        'collected_bonus': np.random.randint(2000, 7000, size=1000),\n",
    "                        'is_paying': np.random.randint(0, 2, size=1000),\n",
    "                        'level': np.random.randint(1, 7, size=1000)})\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the cell below we repeat the random allocation until it produces homogeneous groups (up to a maximum number \n",
    "of iterations). The groups are considered homogeneous when the p-values (column ``P>|z|``) of all variables are \n",
    "above a chosen threshold (e.g. ``p-value > 0.2``)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>coef</th>\n",
       "      <th>std err</th>\n",
       "      <th>z</th>\n",
       "      <th>P&gt;|z|</th>\n",
       "      <th>[0.025</th>\n",
       "      <th>0.975]</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>user_id</th>\n",
       "      <td>-1.000000e-04</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-5.640000e-01</td>\n",
       "      <td>0.573</td>\n",
       "      <td>-0.001000</td>\n",
       "      <td>0.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>points</th>\n",
       "      <td>2.000000e-04</td>\n",
       "      <td>0.001000</td>\n",
       "      <td>3.200000e-01</td>\n",
       "      <td>0.749</td>\n",
       "      <td>-0.001000</td>\n",
       "      <td>0.001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>collected_bonus</th>\n",
       "      <td>2.449000e-05</td>\n",
       "      <td>0.000044</td>\n",
       "      <td>5.520000e-01</td>\n",
       "      <td>0.581</td>\n",
       "      <td>-0.000063</td>\n",
       "      <td>0.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(is_paying, Treatment('1'))[T.0]</th>\n",
       "      <td>1.570000e-02</td>\n",
       "      <td>0.127000</td>\n",
       "      <td>1.240000e-01</td>\n",
       "      <td>0.901</td>\n",
       "      <td>-0.232000</td>\n",
       "      <td>0.264</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.1]</th>\n",
       "      <td>-1.180000e-02</td>\n",
       "      <td>0.215000</td>\n",
       "      <td>-5.500000e-02</td>\n",
       "      <td>0.956</td>\n",
       "      <td>-0.433000</td>\n",
       "      <td>0.409</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.2]</th>\n",
       "      <td>-1.440000e-02</td>\n",
       "      <td>0.226000</td>\n",
       "      <td>-6.400000e-02</td>\n",
       "      <td>0.949</td>\n",
       "      <td>-0.458000</td>\n",
       "      <td>0.429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.4]</th>\n",
       "      <td>-9.064000e-17</td>\n",
       "      <td>0.213000</td>\n",
       "      <td>-4.260000e-16</td>\n",
       "      <td>1.000</td>\n",
       "      <td>-0.417000</td>\n",
       "      <td>0.417</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.5]</th>\n",
       "      <td>-9.236000e-17</td>\n",
       "      <td>0.215000</td>\n",
       "      <td>-4.290000e-16</td>\n",
       "      <td>1.000</td>\n",
       "      <td>-0.422000</td>\n",
       "      <td>0.422</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C(level, Treatment('3'))[T.6]</th>\n",
       "      <td>-9.237000e-17</td>\n",
       "      <td>0.214000</td>\n",
       "      <td>-4.310000e-16</td>\n",
       "      <td>1.000</td>\n",
       "      <td>-0.420000</td>\n",
       "      <td>0.420</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           coef   std err             z  \\\n",
       "user_id                           -1.000000e-04  0.000000 -5.640000e-01   \n",
       "points                             2.000000e-04  0.001000  3.200000e-01   \n",
       "collected_bonus                    2.449000e-05  0.000044  5.520000e-01   \n",
       "C(is_paying, Treatment('1'))[T.0]  1.570000e-02  0.127000  1.240000e-01   \n",
       "C(level, Treatment('3'))[T.1]     -1.180000e-02  0.215000 -5.500000e-02   \n",
       "C(level, Treatment('3'))[T.2]     -1.440000e-02  0.226000 -6.400000e-02   \n",
       "C(level, Treatment('3'))[T.4]     -9.064000e-17  0.213000 -4.260000e-16   \n",
       "C(level, Treatment('3'))[T.5]     -9.236000e-17  0.215000 -4.290000e-16   \n",
       "C(level, Treatment('3'))[T.6]     -9.237000e-17  0.214000 -4.310000e-16   \n",
       "\n",
       "                                   P>|z|    [0.025  0.975]  \n",
       "user_id                            0.573 -0.001000   0.000  \n",
       "points                             0.749 -0.001000   0.001  \n",
       "collected_bonus                    0.581 -0.000063   0.000  \n",
       "C(is_paying, Treatment('1'))[T.0]  0.901 -0.232000   0.264  \n",
       "C(level, Treatment('3'))[T.1]      0.956 -0.433000   0.409  \n",
       "C(level, Treatment('3'))[T.2]      0.949 -0.458000   0.429  \n",
       "C(level, Treatment('3'))[T.4]      1.000 -0.417000   0.417  \n",
       "C(level, Treatment('3'))[T.5]      1.000 -0.422000   0.422  \n",
       "C(level, Treatment('3'))[T.6]      1.000 -0.420000   0.420  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define parameters\n",
    "rep = 100        # maximum number of allocation attempts\n",
    "threshold = 0.2  # every p-value must exceed this for the groups to count as homogeneous\n",
    "\n",
    "analyzer = FrequentistAnalyzer()\n",
    "\n",
    "for i in range(rep):\n",
    "    \n",
    "    # Run allocation with a different seed at each attempt\n",
    "    df, stats = Allocator.blocks_randomization(df=df, \n",
    "                                               id_col='user_id', \n",
    "                                               stratum_cols=['level', 'is_paying'], \n",
    "                                               ngroups=2,\n",
    "                                               seed=i + 45)\n",
    "    # Run homogeneity check analysis    \n",
    "    X = df.drop(columns=['group'])\n",
    "    y = df['group']\n",
    "\n",
    "    analysis = analyzer.check_homogeneity(X, y, cat_cols=['is_paying','level'])\n",
    "    \n",
    "    # Stop as soon as every p-value is above the threshold (homogeneous groups)\n",
    "    if all(analysis['P>|z|'] > threshold):\n",
    "        break\n",
    "        \n",
    "    # Otherwise discard this allocation before the next attempt\n",
    "    df = df.drop(columns=['group'])\n",
    "\n",
    "analysis"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
