Metadata-Version: 2.1
Name: sqltrack
Version: 0.1.3
Summary: SQL-based experiment tracking
Author-email: Joachim Folz <joachim.folz@dfki.de>
Maintainer-email: Joachim Folz <joachim.folz@dfki.de>
License: MIT License
        
        Copyright (c) 2023 Joachim Folz
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://gitlab.com/jfolz/sqltrack
Project-URL: Documentation, https://gitlab.com/jfolz/sqltrack/-/blob/main/README.md
Project-URL: Source, https://gitlab.com/jfolz/sqltrack
Project-URL: Bug Reports, https://gitlab.com/jfolz/sqltrack/-/issues
Project-URL: Changelog, https://gitlab.com/jfolz/sqltrack/-/blob/main/CHANGELOG
Project-URL: License, https://gitlab.com/jfolz/sqltrack/-/blob/main/LICENSE
Keywords: SQL,postgres,PostgreSQL,machine learning,ML,experiment tracking,experiment,tracking,jupyter,jupyterlab,notebook
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: SQL
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Visualization
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: full
Provides-Extra: docs
License-File: LICENSE

# SQLTrack

SQLTrack is a set of tools to track your (machine learning)
experiments.
While using other tools like 
[Sacred](https://github.com/IDSIA/sacred)
or
[mlflow tracking](https://mlflow.org/docs/latest/tracking.html),
we found that they limited how we could track our
experiments and later what analyses we could perform.
What we realized is that it is ultimately futile for library
authors to guess how experiment data will be used.
If it was possible for a library to cater to every single
use case it would then become too bloated to use.

That is why our goal is to collect a wide variety of examples
for analyses and visualizations to empower our users,
instead of providing complex functionality in our package.

To that end, SQLTrack only provides a basic schema of
experiments, runs, and metrics that you can extend to suit
your needs, as well as some basic tools to set up the
database and store experiment data.



## Getting started

Currently SQLTrack supports PostgreSQL through the psycopg driver.
We don't plan on adding support any other databases, except SQLite
if there is demand for it.
We've tried using ORMs, but found that they made things way more
complicated than they needed to be and - most importantly - they
obfuscated the DB schema from users.
Ideally we would use standard SQL and let users bring their own
database Python DB-API 2.0 compatible driver, but that would mean
we lose access to advanced features like indexable JSONB columns.



### Installation

SQLTrack can be installed like any other Python package,
e.g., `pip install sqltrack`.
By default only core dependencies are installed,
which speeds up usage in containerized environments.
Core functionality located in the toplevel package
`sqltrack` allows tracking experiments and working
with the database.
To use some of the convenience functions for anaylsis later,
install the full package with `pip install sqltrack[full]`.

On Linux, your distribution repositories should include
a version of PostgreSQL you can use.
We develop against 13, but any currently supported
version should work.
There are also install instructions for
[MacOS](https://www.postgresql.org/download/macosx/)
and
[Windows](https://www.postgresql.org/download/windows/).



### Base schema

This is the basic schema SQLTrack defines (minus some details like indexes),
with tables `experiments`, `experiment_relationships`, `runs`,
`run_relationships`, and `metrics`.

`runs.status` has the custom enum type `runstatus`.
It behaves like text when used with the psycopg driver.
Possible values have been lifted from Slurm job status.

```SQL
BEGIN;

CREATE TABLE experiments (
	id BIGINT GENERATED BY DEFAULT AS IDENTITY,
	time_created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
	name TEXT NOT NULL,
    comment TEXT,
    tags JSONB,
	PRIMARY KEY (id),
    UNIQUE(name)
);

CREATE TABLE experiment_relationships (
	from_id BIGINT NOT NULL,
	kind TEXT NOT NULL,
    to_id BIGINT NOT NULL,
	PRIMARY KEY(from_id, kind, to_id),
	FOREIGN KEY(from_id) REFERENCES experiments(id),
	FOREIGN KEY(to_id) REFERENCES experiments(id)
);

CREATE TYPE runstatus AS ENUM (
    'BOOT_FAIL',
    'CANCELLED',
    'CONFIGURING',
    'COMPLETED',
    'COMPLETING',
    'DEADLINE',
    'FAILED',
    'NODE_FAIL',
    'OUT_OF_MEMORY',
    'PENDING',
    'PREEMPTED',
    'RESV_DEL_HOLD',
    'REQUEUE_FED',
    'REQUEUE_HOLD',
    'REQUEUED',
    'RESIZING',
    'REVOKED',
    'RUNNING',
    'SIGNALING',
    'SPECIAL_EXIT',
    'STAGE_OUT',
    'STOPPED',
    'SUSPENDED',
    'TIMEOUT'
);

CREATE TABLE runs (
	id BIGINT GENERATED BY DEFAULT AS IDENTITY,
	experiment_id BIGINT NOT NULL,
	status runstatus NOT NULL DEFAULT 'PENDING',
	time_created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
	time_started TIMESTAMP WITH TIME ZONE,
	time_updated TIMESTAMP WITH TIME ZONE,
    comment TEXT,
    tags JSONB,
    args JSONB,
    env JSONB,
	PRIMARY KEY(id),
	FOREIGN KEY(experiment_id) REFERENCES experiments(id) ON DELETE CASCADE
);

CREATE TABLE run_relationships (
	from_id BIGINT NOT NULL,
	kind TEXT NOT NULL,
    to_id BIGINT NOT NULL,
	PRIMARY KEY(from_id, kind, to_id),
	FOREIGN KEY(from_id) REFERENCES runs(id) ON DELETE CASCADE,
	FOREIGN KEY(to_id) REFERENCES runs(id) ON DELETE CASCADE
);

CREATE TABLE metrics (
	run_id INTEGER NOT NULL,
	step BIGINT NOT NULL DEFAULT 0,
	progress DOUBLE PRECISION NULL DEFAULT 0.0,
    PRIMARY KEY (run_id, step, progress),
	FOREIGN KEY(run_id) REFERENCES runs(id) ON DELETE CASCADE
);

END;
```

Note that the `metrics` table doesn't contain any columns to store metrics yet.
Users need to add these as required.
E.g., a script to add columns for timing, loss, and accuracy in 
train, validation, and test phases could look like this:

```SQL
BEGIN;

ALTER TABLE metrics
	ADD COLUMN train_start TIMESTAMP WITH TIME ZONE,
	ADD COLUMN train_end TIMESTAMP WITH TIME ZONE,
	ADD COLUMN train_loss FLOAT,
	ADD COLUMN train_top1 FLOAT,
	ADD COLUMN train_top5 FLOAT,
	ADD COLUMN val_start TIMESTAMP WITH TIME ZONE,
	ADD COLUMN val_end TIMESTAMP WITH TIME ZONE,
	ADD COLUMN val_loss FLOAT,
	ADD COLUMN val_top1 FLOAT,
	ADD COLUMN val_top5 FLOAT,
	ADD COLUMN test_start TIMESTAMP WITH TIME ZONE,
	ADD COLUMN test_end TIMESTAMP WITH TIME ZONE,
	ADD COLUMN test_loss FLOAT,
	ADD COLUMN test_top1 FLOAT,
	ADD COLUMN test_top5 FLOAT;

END;
```

Now you might ask why we make you add columns for your metrics,
because that might seem annoying and wasteful compared
to a normalized name+value approach like what `mlflow` uses.
But don't worry, because PostgreSQL is smart.
Any NULL values aren't actually stored.
It only stores values that are not NULL and uses a bitmap
to keep track of them.
Also, each row has a fixed size header of ~23 bytes and `mlflow`
uses one row per metric value.
Since we store many metric values in a row we can afford
really large bitmaps to track those NULL values before we
come out worse.

Put your instructions to add metrics columns etc. in a SQL
script file, e.g. `v001.sql`, for use later.
Add `v002.sql` etc. to update your schema.



### Setup the database

SQLTrack provides a simple tool to setup your database.

```
usage: sqltrack [-h] [-u USER] [-a HOST] [-d DATABASE] [-s SCHEMA] [-c CONFIG_PATH] {setup} ...

positional arguments:
  {setup}               Available commands.
    setup               Setup and migrate the database.

options:
  -h, --help            show this help message and exit
  -u USER, --user USER  username
  -a HOST, --host HOST  DB host (and port)
  -d DATABASE, --database DATABASE
                        database name
  -s SCHEMA, --schema SCHEMA
                        schema name
  -c CONFIG_PATH, --config-path CONFIG_PATH
                        path to config file
```

User, host, database, and schema as parameters given on the command line take priority,
but you can also define environment variables `SQLTRACK_DSN_<PARAM>` to set them.
More info on available parameters can be found
[here](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS).
Finally, most convenient is probably to store them in a config file.
The default path is `./sqltrack.conf`

```
user=<USER>
host=<HOST>
database=<DATABASE>
schema=<SCHEMA>
```

Those SQL script files you created earlier?
This is where you use them.
Run the setup command with them, e.g. `sqltrack setup v001.sql`.
This creates the base schema and updates it with your definitions.



### Track an experiment

```Python
from random import random
import sqltrack

def main():
    client = sqltrack.Client()
    experiment = sqltrack.Experiment(client, name="Very science, much data")
    run = experiment.get_run()
    with run.track():
        for epoch in range(90):
            metrics = {"train_loss": random(), "train_top1": random()}
            run.add_metrics(step=epoch, progress=epoch/epochs, **metrics)
```



### Analyzing results

This is where it's up to you.
We recommend Jupyter Lab to interact with the database,
but plain Jupyter or alternatives like
[Plotly Dash](https://dash.plotly.com/introduction)
work well too.
Look at the examples directory in our repository to get some ideas.
But really, you're the experimenter,
you know best what to do with your data.



### [Optional] Self-signed SSL certificate

You can create a SSL self-signed certificate to use with HTTPS:
```
openssl req -x509 -newkey rsa:4096 -keyout jupyter.key -out jupyter.crt -sha256 -days 365 -nodes
```

Start Jupyter Lab with your certificate:
```
jupyter-lab [options...] --certfile jupyter.crt --keyfile jupyter.key
```
