Metadata-Version: 2.4
Name: mostlyai-engine
Version: 1.0.0
Summary: Synthetic Data - Engine
Project-URL: Homepage, https://github.com/mostly-ai/mostlyai-engine
Project-URL: Documentation, https://mostly-ai.github.io/mostlyai-engine/
Project-URL: Source, https://github.com/mostly-ai/mostlyai-engine
Author-email: MOSTLY AI <dev@mostly.ai>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: <3.13,>=3.10
Requires-Dist: accelerate<0.32,>=0.31.0
Requires-Dist: datasets<3,>=2.20.0
Requires-Dist: formatron<0.5,>=0.4.11
Requires-Dist: joblib<2,>=1.3.0
Requires-Dist: json-repair<0.31,>=0.30.0
Requires-Dist: numpy<2,>=1.26.3
Requires-Dist: opacus<2,>=1.5.2
Requires-Dist: pandas~=2.2.0
Requires-Dist: peft<0.12,>=0.11.1
Requires-Dist: psutil<6,>=5.9.8
Requires-Dist: pyarrow>=16.0.0
Requires-Dist: rich<14,>=13.9.4
Requires-Dist: setuptools<76,>=75.0.0
Requires-Dist: tokenizers<0.21,>=0.20.1
Requires-Dist: torch==2.5.1; sys_platform == 'darwin'
Requires-Dist: transformers<5,>=4.45.2
Provides-Extra: cpu
Requires-Dist: torch==2.5.1+cpu; (sys_platform != 'darwin') and extra == 'cpu'
Requires-Dist: torch==2.5.1; (sys_platform == 'darwin') and extra == 'cpu'
Provides-Extra: gpu
Requires-Dist: bitsandbytes<0.44,>=0.43.3; extra == 'gpu'
Requires-Dist: torch==2.5.1; extra == 'gpu'
Requires-Dist: vllm<0.7,>=0.6.4; extra == 'gpu'
Description-Content-Type: text/markdown

# Synthetic Data Engine 💎
[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-engine/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-engine) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-engine) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai-engine) [![stats](https://pepy.tech/badge/mostlyai-engine)](https://pypi.org/project/mostlyai-engine/)

[Package Documentation](https://mostly-ai.github.io/mostlyai-engine/) | [Platform Documentation](https://mostly.ai/docs)

Create high-fidelity, privacy-safe synthetic data:

1. prepare, analyze, and encode original data
2. train a generative model on the encoded data
3. generate synthetic data samples tailored to your needs:
    * up-sample / down-sample
    * conditionally generate
    * rebalance categories
    * impute missing values
    * incorporate fairness
    * adjust sampling temperature

...all within your safe compute environment, all with a few lines of Python code 💥.

Note: This library is the underlying model engine of the [Synthetic Data SDK ✨](https://github.com/mostly-ai/mostlyai). Please refer to the latter for an easy-to-use, higher-level software toolkit.


## Installation

The latest release of `mostlyai-engine` can be installed via pip:

```bash
pip install -U mostlyai-engine
```

or alternatively for a GPU setup:
```bash
pip install -U 'mostlyai-engine[gpu]'
```
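The project's `Requires-Python` constraint is `>=3.10,<3.13`, so it is worth checking the interpreter version before installing. A minimal sketch of that check (the `engine_supports` helper is illustrative, not part of the package):

```python
import sys

def engine_supports(py_version: tuple[int, int]) -> bool:
    """Check a (major, minor) Python version against the
    declared constraint Requires-Python: >=3.10,<3.13."""
    return (3, 10) <= py_version < (3, 13)

# check the running interpreter
print(engine_supports(sys.version_info[:2]))
```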


## Quick start

### Tabular Model: flat data, without context

```python
from pathlib import Path
import pandas as pd
from mostlyai import engine

# set up workspace
ws = Path("ws-tabular-flat")

# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/census"
trn_df = pd.read_csv(f"{url}/census.csv.gz")

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
  workspace_dir=ws,
  tgt_data=trn_df,
  model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(workspace_dir=ws)        # train model and store to `{ws}/ModelStore/model-data`
engine.generate(workspace_dir=ws)     # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data
```
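The frame returned by `pd.read_parquet` is a regular pandas DataFrame, so post-hoc adjustments such as down-sampling (step 3 above) are plain pandas operations. A minimal sketch with a stand-in frame (the column names are illustrative, not taken from the census schema):

```python
import pandas as pd

# stand-in for pd.read_parquet(ws / "SyntheticData")
syn_df = pd.DataFrame({
    "age": [25, 40, 33, 58],
    "income": ["low", "high", "low", "high"],
})

# reproducibly down-sample the synthetic data to 2 rows
sample = syn_df.sample(n=2, random_state=0)
print(len(sample))
```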

### Tabular Model: sequential data, with context

```python
from pathlib import Path
import pandas as pd
from mostlyai import engine

# set up workspace
ws = Path("ws-tabular-sequential")

# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball"
trn_ctx_df = pd.read_csv(f"{url}/players.csv.gz")  # context data
trn_tgt_df = pd.read_csv(f"{url}/batting.csv.gz")  # target data

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/(tgt|ctx)-data`
  workspace_dir=ws,
  tgt_data=trn_tgt_df,
  ctx_data=trn_ctx_df,
  tgt_context_key="players_id",
  ctx_primary_key="id",
  model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/(tgt|ctx)-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=2,              # limit TRAIN to 2 minutes for demo purposes
)
engine.generate(workspace_dir=ws)     # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data
```
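For sequential data, a quick sanity check is to inspect how many records were generated per context key. A minimal pandas sketch with a stand-in frame (in practice the frame comes from `pd.read_parquet(ws / "SyntheticData")`; the `hits` column is illustrative):

```python
import pandas as pd

# stand-in for the generated batting table
syn_df = pd.DataFrame({
    "players_id": [1, 1, 1, 2, 2, 3],
    "hits": [10, 12, 9, 30, 28, 5],
})

# number of generated records per synthetic player
seq_lengths = syn_df.groupby("players_id").size()
print(seq_lengths.to_dict())  # {1: 3, 2: 2, 3: 1}
```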

### Language Model: flat data, without context

```python
from pathlib import Path
import pandas as pd
from mostlyai import engine

# set up workspace
ws = Path("ws-language-flat")

# load original data
trn_df = pd.read_parquet("https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/headlines/headlines.parquet")
trn_df = trn_df.sample(n=10_000, random_state=42)[['category', 'headline']]

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
    workspace_dir=ws,
    tgt_data=trn_df,
    model_type="LANGUAGE",
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=2,                   # limit TRAIN to 2 minutes for demo purposes
    model="MOSTLY_AI/LSTMFromScratch-3m",  # use a lightweight LSTM model, trained from scratch (GPU recommended)
    # model="microsoft/phi-1.5",           # alternatively use a pre-trained HF-hosted LLM model (GPU required)
)
engine.generate(                      # use model to generate synthetic samples to `{ws}/SyntheticData`
    workspace_dir=ws,
    sample_size=10,
)
pd.read_parquet(ws / "SyntheticData")  # load synthetic data
```
