# relatable

relatable is a Python package for converting a collection of documents, 
such as a MongoDB collection, into an interrelated set of tables, such as a 
schema in a relational database.

## Installation

```
pip3 install relatable
```

## Example of use

In this example we will walk through a use case of `relatable` for the sample dataset found in the repository of this 
package in the data folder.

To generate a relational schema for this dataset, let's instantiate a `RelationalSchema` with the list of documents as 
input:

```
from relatable import RelationalSchema

import json

with open("data/example_input.json", "r") as fp:
    docs = json.load(fp)

rs = RelationalSchema(docs)
```

Once the `RelationalSchema` is instantiated, we can check its metadata. This metadata is a list of flat dictionaries, so 
we can make use of Pandas to load it into a DataFrame:

```
import pandas as pd

pd.DataFrame(rs.generate_metadata())
```

|     | table                       | field                          | type    | nullable | unique |
|----:|:----------------------------|:-------------------------------|:--------|:---------|:-------|
|   0 | main                        | main_id                        | Integer | False    | True   |
|   1 | main                        | name                           | String  | False    | True   |
|   2 | main                        | age                            | Integer | False    | True   |
|   3 | experience                  | experience_id                  | Integer | False    | True   |
|   4 | experience                  | main_id                        | Integer | False    | False  |
|   5 | experience                  | experience.company             | String  | False    | True   |
|   6 | experience                  | experience.role                | String  | False    | True   |
|   7 | experience                  | experience.from                | Integer | False    | True   |
|   8 | experience                  | experience.to                  | Integer | False    | False  |
|   9 | experience.technologies     | experience.technologies_id     | Integer | False    | True   |
|  10 | experience.technologies     | experience_id                  | Integer | False    | False  |
|  11 | experience.technologies     | main_id                        | Integer | False    | False  |
|  12 | experience.technologies     | experience.technologies        | String  | False    | True   |
|  13 | experience.responsibilities | experience.responsibilities_id | Integer | False    | True   |
|  14 | experience.responsibilities | experience_id                  | Integer | False    | False  |
|  15 | experience.responsibilities | main_id                        | Integer | False    | False  |
|  16 | experience.responsibilities | experience.responsibilities    | String  | False    | True   | 

We can see that `RelationalSchema` has inferred a relational schema consisting of four tables with primary keys and 
foreign keys interrelating the tables.

It would be nice to rename these tables with a more descriptive name, and also rename some columns. We can do so with 
the `rename` and `rename_column` methods:

```
rs.rename_table("main", "person")
rs.rename_table("experience", "job")
rs.rename_table("experience.technologies", "technology")
rs.rename_table("experience.responsibilities", "responsibility")

for name in ["company", "role", "from", "to"]:
    rs.tables["job"].rename_column(f"experience.{name}", name)

rs.tables["technology"].rename_column("experience.technologies", "technology")
rs.tables["responsibility"].rename_column("experience.responsibilities", "responsibility")

pd.DataFrame(rs.generate_metadata())
```

|     | table          | field             | type    | nullable | unique |
|----:|:---------------|:------------------|:--------|:---------|:-------|
|   0 | person         | person_id         | Integer | False    | True   |
|   1 | person         | name              | String  | False    | True   |
|   2 | person         | age               | Integer | False    | True   |
|   3 | job            | job_id            | Integer | False    | True   |
|   4 | job            | person_id         | Integer | False    | False  |
|   5 | job            | company           | String  | False    | True   |
|   6 | job            | role              | String  | False    | True   |
|   7 | job            | from              | Integer | False    | True   |
|   8 | job            | to                | Integer | False    | False  |
|   9 | technology     | technology_id     | Integer | False    | True   |
|  10 | technology     | job_id            | Integer | False    | False  |
|  11 | technology     | person_id         | Integer | False    | False  |
|  12 | technology     | technology        | String  | False    | True   |
|  13 | responsibility | responsibility_id | Integer | False    | True   |
|  14 | responsibility | job_id            | Integer | False    | False  |
|  15 | responsibility | person_id         | Integer | False    | False  |
|  16 | responsibility | responsibility    | String  | False    | True   | 

The relationships between the tables are the following:

- The table `person` represents the main entity of the dataset, with a row for each person.
- The table `job` references the table `person`.
- The tables `technology` and `responsibility` reference the table `job`, and inherits the reference of `person` from 
`job`.

Finally, let's look at each of the tables:

```
dfs = [pd.DataFrame(t.data).set_index(t.primary_key) for t in rs.tables]
```

Table `person`:

| person_id | name  | age |
|----------:|:------|----:|
|         0 | Alice |  34 |
|         1 | Bob   |  27 | 

Table `job`:

| job_id | person_id | company  | role                  | from |   to |
|-------:|----------:|:---------|:----------------------|-----:|-----:|
|      0 |         0 | Google   | Software Engineer     | 2020 | 2022 |
|      1 |         0 | Facebook | Senior Data Scientist | 2017 | 2020 |
|      2 |         1 | OpenAI   | NLP Engineer          | 2019 | 2022 | 

Table `technology`:

| technology_id | job_id | person_id | technology |
|--------------:|-------:|----------:|:-----------|
|             0 |      0 |         0 | C++        |
|             1 |      0 |         0 | LolCode    |
|             2 |      1 |         0 | Python     |
|             3 |      1 |         0 | Excel      |
|             4 |      2 |         1 | Triton     |
|             5 |      2 |         1 | LaTeX      | 

Table `responsibility`:

| responsibility_id | job_id | person_id | responsibility                                             |
|------------------:|-------:|----------:|:-----------------------------------------------------------|
|                 0 |      0 |         0 | Google stuff                                               |
|                 1 |      0 |         0 | Mark TensorFlow issues as "Won't Do"                       |
|                 2 |      1 |         0 | Censor media                                               |
|                 3 |      1 |         0 | Learn the foundations of ML                                |
|                 4 |      1 |         0 | Do Kaggle competitions                                     |
|                 5 |      2 |         1 | Assert that GPT-2 is racist                                |
|                 6 |      2 |         1 | Assert that GPT-3 is racist                                |
|                 7 |      2 |         1 | Develop a prototype of a premium non-racist language model | 

# Example of use with the Airbnb MongoDB sample dataset

Another example of use with the Airbnb MongoDB sample dataset, downloadable 
[here](https://github.com/neelabalan/mongodb-sample-dataset/blob/main/sample_airbnb/listingsAndReviews.json) can be 
found in the repository of this package in the script `airbnb_example.py`
