# DBIS Relational Algebra

[![pypi](https://img.shields.io/pypi/pyversions/dbis-relational-algebra)](https://pypi.org/project/dbis-relational-algebra/)
[![PyPI Status](https://img.shields.io/pypi/v/dbis-relational-algebra)](https://pypi.org/project/dbis-relational-algebra/)

This library provides a Python implementation of the [relational algebra](https://en.wikipedia.org/wiki/Relational_algebra).

# Features
 - Create expressions of the relational algebra in python.
 - Load data from SQLite tables.
 - Evaluate expressions on the data.
 - Convert these expressions to text in LaTeX math mode.
 - Convert a relation / the result of an expression to a Markdown table.

# Installation
Install via pip:
```bash
pip install dbis-relational-algebra
```

# Usage
## Overview of supported operators
- [x] Cross Product / Cartesian Product (`*`)
- [x] Difference (`-`)
- [x] Division (`/`)
- [x] Intersection (`&`)
- [x] Left Semijoin
- [x] Natural Join
- [x] Projection
- [x] Rename
- [x] Right Semijoin
- [x] Selection
- [x] Theta Join
- [x] Union (`|`)

The set operators Union, Intersection, and Difference require the relations to be [union-compatible](https://en.wikipedia.org/wiki/Relational_algebra#Union-compatible_relations).

## Formulas
For the Theta Join and the Selection, a formula is used to specify the join or selection condition. These formulas can be created using the following operators:
- [x] And
- [x] Or
- [x] Not
- [x] Equals
- [x] GreaterEquals
- [x] GreaterThan
- [x] LessEquals
- [x] LessThan

In the comparators, two values have to be specified. At least one of these values must be a python `str`, which references a column of the relation.

## Loading data & Evaluating an expression
To load data, an [SQLite connection](https://docs.python.org/3/library/sqlite3.html) can be used (recommended). This connection must be passed to the relational algebra expression for the evaluation.

It is also possible to load a relation with data by hand (not recommended):
```python
relation = Relation(name="R")
relation.add_attributes(["a", "b", "c"])
relation.add_rows([
	[1, 2, 3],
	[4, 5, 6],
	[7, 8, 9],
])
```

An expression can be created by using the operators and formulas listed above. The expression can then be evaluated on the data:
```python
# Cross Product RxS, see above
expression = Relation("R") * Relation("S")
result = expression.evaluate(sql_con=connection)
```
```python
# Theta Join R.a = S.b, see above
expression = ThetaJoin("R", "S", Not(Equals("R.a", "S.b")))
result = expression.evaluate(sql_con=connection)
```

The rows and column names of a relation (result) are accessible using the following attributes:
```python
result.attributes # list of column names (str)
result.rows # set of rows (tuple)
```


## Best practices:
 - After joining two relations or the cross product of two relations, you should always give column names that appear in both relations a new distinct name.
 - After joining two relations, the cross product of two relations, or some set operation on two relations, you should always give the resulting relation a new distinct name.
 - When referencing a column in a comparator, it is recommended that this column should be referred to using a detailed description, i.e. refer to column `a` of relation `R` as `"R.a"` instead of `"a"`.

# Developer Notes
A few design choices were made:
 - Internally, the data is stored in a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). This accelerates the relational algebra operators greatly.
 - In relational algebra, a column `a` from a relation `R` can be referred to as `a` and `R.a`. Internally, the column name is always stored using the full name, i.e. `R.a`. This is done to avoid ambiguities when a column `a` is present in multiple relations.
 - When joining two relations (or also cross product), the relational algebra provides no guidelines on how the resulting relation should be named. Thus, if `a` is a column of relation `R`, joining relations `R` and `S` results in a relation, where `R.a` and `S.a` might refer to this column `a` (depending on if `a` also references a column in `S`). Thus, generally speaking, joining two relations `R` and `S` will internally result in a relation named `R+S`, and the column `R.a` will now be named `R+S.a` (if there is no column `S.a`).
