Metadata-Version: 2.4
Name: eval-protocol
Version: 0.2.20
Summary: The official Python SDK for Eval Protocol (EP.) EP is an open protocol that standardizes how developers author evals for large language model (LLM) applications.
Author-email: Fireworks AI <info@fireworks.ai>
License-Expression: MIT
Project-URL: Homepage, https://github.com/fireworks-ai/eval-protocol
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: dataclasses-json>=0.5.7
Requires-Dist: uvicorn>=0.15.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: openai==1.78.1
Requires-Dist: aiosqlite
Requires-Dist: aiohttp
Requires-Dist: mcp>=1.9.2
Requires-Dist: PyYAML>=5.0
Requires-Dist: datasets>=3.0.0
Requires-Dist: fsspec
Requires-Dist: hydra-core>=1.3.2
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: gymnasium>=0.29.0
Requires-Dist: httpx>=0.24.0
Requires-Dist: anthropic>=0.59.0
Requires-Dist: ipykernel>=6.30.0
Requires-Dist: jupyter>=1.1.1
Requires-Dist: toml>=0.10.0
Requires-Dist: loguru>=0.6.0
Requires-Dist: docstring-parser>=0.15
Requires-Dist: rich>=12.0.0
Requires-Dist: psutil>=5.8.0
Requires-Dist: litellm>=1.0.0
Requires-Dist: addict>=2.4.0
Requires-Dist: deepdiff>=6.0.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: websockets>=15.0.1
Requires-Dist: fastapi>=0.116.1
Requires-Dist: pytest>=6.0.0
Requires-Dist: pytest-asyncio>=0.21.0
Requires-Dist: peewee>=3.18.2
Requires-Dist: backoff>=2.2.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest-httpserver; extra == "dev"
Requires-Dist: werkzeug>=2.0.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Requires-Dist: pyright>=1.1.365; extra == "dev"
Requires-Dist: transformers>=4.0.0; extra == "dev"
Requires-Dist: types-setuptools; extra == "dev"
Requires-Dist: types-requests; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Requires-Dist: types-docker; extra == "dev"
Requires-Dist: versioneer>=0.20; extra == "dev"
Requires-Dist: openai==1.78.1; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: e2b; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: docker==7.1.0; extra == "dev"
Requires-Dist: ipykernel>=6.30.0; extra == "dev"
Requires-Dist: jupyter>=1.1.1; extra == "dev"
Requires-Dist: pip>=25.1.1; extra == "dev"
Requires-Dist: haikus==0.3.8; extra == "dev"
Provides-Extra: trl
Requires-Dist: torch>=1.9; extra == "trl"
Requires-Dist: trl>=0.7.0; extra == "trl"
Requires-Dist: peft>=0.7.0; extra == "trl"
Requires-Dist: transformers>=4.0.0; extra == "trl"
Requires-Dist: accelerate>=0.28.0; extra == "trl"
Provides-Extra: openevals
Requires-Dist: openevals>=0.1.0; extra == "openevals"
Provides-Extra: fireworks
Requires-Dist: fireworks-ai>=0.19.12; extra == "fireworks"
Provides-Extra: box2d
Requires-Dist: swig; extra == "box2d"
Requires-Dist: gymnasium[box2d]>=0.29.0; extra == "box2d"
Requires-Dist: Pillow; extra == "box2d"
Provides-Extra: langfuse
Requires-Dist: langfuse>=2.0.0; extra == "langfuse"
Provides-Extra: huggingface
Requires-Dist: datasets>=3.0.0; extra == "huggingface"
Requires-Dist: transformers>=4.0.0; extra == "huggingface"
Provides-Extra: adapters
Requires-Dist: langfuse>=2.0.0; extra == "adapters"
Requires-Dist: datasets>=3.0.0; extra == "adapters"
Requires-Dist: transformers>=4.0.0; extra == "adapters"
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == "bigquery"
Requires-Dist: google-auth>=2.0.0; extra == "bigquery"
Provides-Extra: svgbench
Requires-Dist: selenium>=4.0.0; extra == "svgbench"
Dynamic: license-file

# Eval Protocol (EP)

[![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)

**Eval Protocol (EP) is the open-source standard and toolkit for practicing Eval-Driven Development.**

Building with AI is different. Traditional software is deterministic, but AI systems are probabilistic. How do you ship new features without causing silent regressions? How do you prove a new prompt is actually better?

The answer is a new engineering discipline: **Eval-Driven Development (EDD)**. It adapts the rigor of Test-Driven Development for the uncertain world of AI. With EDD, you define your AI's desired behavior as a suite of executable tests, creating a safety net that allows you to innovate with confidence.

EP provides a consistent way to write evals, store traces, and analyze results.

<p align="center">
	<img src="https://raw.githubusercontent.com/eval-protocol/python-sdk/refs/heads/main/assets/ui.png" alt="UI" />
	<br>
	<sub><b>Log Viewer: Monitor your evaluation rollouts in real time.</b></sub>
</p>

## Quick Example

Here's a simple test function that checks if a model's response contains **bold** text formatting:

```python test_bold_format.py
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test

@evaluation_test(
    input_messages=[
        [
            Message(role="system", content="You are a helpful assistant. Use bold text to highlight important information."),
            Message(role="user", content="Explain why **evaluations** matter for building AI agents. Make it dramatic!"),
        ],
    ],
    completion_params=[{"model": "accounts/fireworks/models/llama-v3p1-8b-instruct"}],
    rollout_processor=SingleTurnRolloutProcessor(),
    mode="pointwise",
)
def test_bold_format(row: EvaluationRow) -> EvaluationRow:
    """
    Simple evaluation that checks if the model's response contains bold text.
    """

    assistant_response = row.messages[-1].content

    # Check if response contains **bold** text
    has_bold = "**" in assistant_response

    if has_bold:
        result = EvaluateResult(score=1.0, reason="✅ Response contains bold text")
    else:
        result = EvaluateResult(score=0.0, reason="❌ No bold text found")

    row.evaluation_result = result
    return row
```

## Documentation

See our [documentation](https://evalprotocol.io) for more details.

## Installation

**This library requires Python >= 3.10.**

Install with pip:

```
pip install eval-protocol
```

## License

[MIT](LICENSE)
