Metadata-Version: 2.4
Name: clemcore
Version: 2.4.4
Summary: The cLLM (chat-optimized Large Language Model, 'clem') framework tests such models' ability to engage in games, that is, rule-constituted activities played using language.
Author-email: Philipp Sadler <first.last@uni-potsdam.de>, Jonathan Jordan <first.last@uni-potsdam.de>, Sherzod Hakimov <first.last@uni-potsdam.de>, Anne Beyer <first.last@uni-potsdam.de>, "L. Pfennigschmidt" <first.last@uni-potsdam.de>, Kushal Koshti <first.last@uni-potsdam.de>
License: MIT
Project-URL: Homepage, https://github.com/clp-research/clemcore
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Requires-Dist: numpy<2.0.0,>=1.24.3
Requires-Dist: retry>=0.9.2
Requires-Dist: tqdm>=4.65.0
Requires-Dist: nltk==3.8.1
Requires-Dist: aleph-alpha-client==7.0.1
Requires-Dist: openai==1.75.0
Requires-Dist: anthropic==0.47.1
Requires-Dist: cohere==4.48
Requires-Dist: google-generativeai==0.8.4
Requires-Dist: mistralai==1.8.0
Requires-Dist: matplotlib==3.7.1
Requires-Dist: pandas==2.0.1
Requires-Dist: seaborn==0.12.2
Provides-Extra: vllm
Requires-Dist: torch~=2.6.0; extra == "vllm"
Requires-Dist: transformers==4.51.1; extra == "vllm"
Requires-Dist: vllm==0.8.4; extra == "vllm"
Provides-Extra: huggingface
Requires-Dist: torch~=2.1.1; extra == "huggingface"
Requires-Dist: sentencepiece==0.1.99; extra == "huggingface"
Requires-Dist: accelerate==1.2.1; extra == "huggingface"
Requires-Dist: protobuf==4.21.6; extra == "huggingface"
Requires-Dist: einops==0.6.1; extra == "huggingface"
Requires-Dist: bitsandbytes==0.45.3; extra == "huggingface"
Requires-Dist: peft==0.15.2; extra == "huggingface"
Requires-Dist: transformers==4.51.1; extra == "huggingface"
Requires-Dist: torchvision==0.16.1; extra == "huggingface"
Requires-Dist: timm>=1.0.15; extra == "huggingface"
Provides-Extra: slurk
Requires-Dist: python-engineio==4.4.0; extra == "slurk"
Requires-Dist: python-socketio==5.7.2; extra == "slurk"
Requires-Dist: websocket-client; extra == "slurk"
Dynamic: license-file

### Updates
(March 2025): Version 2.0 of the benchmark has been [released](https://clembench.github.io/). And the framework is now pip installable. The games that make the benchmark got their own [repository](https://github.com/clp-research/clembench).

(February 2024): We have updated the framework code. If you have written games using the initial release version, see [this guide](docs/howto_update_to_v1.md) on how to update your game.

# clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents

The cLLM (chat-optimized Large Language Model, "clem") framework tests such models' ability to engage in games – rule-constituted activities played using language.
The framework is a systematic way of probing for the situated language understanding of language using agents.

This repository contains the code for setting up the framework and implements a number of games that are further discussed in 

> Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455

### Evaluation Results

On the [main project website](https://clembench.github.io) , under [leaderboard](https://clembench.github.io/leaderboard.html).

### Game details

see [clembench repository](https://github.com/clp-research/clembench)
- A Simple Word Game: [taboo](docs/taboo.md)
- A Word-Guessing Game Based on Clues: [wordle](docs/wordle.md)
- Drawing Instruction Giving and Following: [image](docs/image.md)
- An ASCII Picture Reference Game: [reference](docs/reference.md)
- Scorekeeping: [private and shared](docs/privateshared.md)

# Using the clemcore CLI

The project is now pip installable.
This means that there is no need to checkout the repository, but you can simply install the packaged project as usual:
```
(myclem) pip install clemcore
```
(However, clemcore developers should checkout this repository and install from within the directory `pip install -e .`) 

Note that we highly recommend to perform the installation in a distinct virtual python environment, because there might be a lot of dependencies necessary depending on your use case.
Additional install options are:
```
(myclem) pip install clemcore[huggingface] # dependencies for the local hf backend
(myclem) pip install clemcore[vllm]        # dependencies for the local vllm backend
(myclem) pip install clemcore[slurk]       # dependencies for the slurk backend 
```

After the installation you will have access to the `clem` CLI tool. The main functions are:

```
(myclem) clem list games               # list the games available for a run
(myclem) clem list backends            # list the backends available for a run
(myclem) clem list models              # list the models available for a run
(myclem) clem run -g <game> -m <model> # runs the game benchmark; also transcribes and scores
(myclem) clem transcribe               # translates interactions into html files
(myclem) clem score                    # computes individual performance measures
(myclem) clem eval                     # computes overall performances measures; requires scores
```

Note that `clem` operates relative to the current working directory, that is, the directory it is called from.
This directory is what we call the workspace.
A workspace may look like this.

```
(optional) key.json
(optional) game_registry.json 
(optional) model_registry.json  
(optional) custom_api.py 
clembench/
```

The files have the following functions:
- **key.json**: contains the secrets for the remote api calls; if this file does not exist, then `clem` looks into `~/.clemcore/`
- **game_registry.json**: allows to make additional game specifications useable for the runs. The game specifications must at least contain the `game_name`, `game_path` and `players` attribute. 
- **model_registry.json**: allows to add additional model specifications. This is specifically useful to run with models that have not been packaged yet. In addition, it allows to point model specification to custom backend names.
- **custom_api.py**: `clem` automatically discovers additional _api files placed into the cwd, so that users of the framework can run their own backends with the games.
- **clembench/**: contains the game directories (with the game code) available for the benchmark runs

Note that, `clem` does now automatically discovers game directories that are at most 3-levels away from the `cwd`. 
To be discoverable, directories have to carry a `clemgame.json` (here a game path is not required, because `clem` automatically determines it).

## Use Case: Benchmarker

As a benchmarker you want to run multiple models for all games that constitute the benchmark.
Therefore, you will checkout the [clembench](https://github.com/clp-research/clembench) repository into a new workspace directory.
You will add the `key.json` to the workspace to access the backends.
In addition, you might need to add additional model entries that are not yet packaged to a `model_registry.json`.
Then you will run via the cli `clem run -g all -m model1` etc. or potentially use a batch script.
When not otherwise specified, then the results files will be stored in the cwd under `results`. 
Hence, a benchmarkers workspace directory might look as follows:

```
myworkspace
- clembench/
- results/
- key.json 
- model_registry.json  
```

## Use Case: Game Developer

As a game developer you want to implement your own game to be run with `clem`.
You will use a typical clem game project structure.
The game directory will become your workspace.
To make the game visible to `clem` you need to add a `clemgame.json` to the directory.
This file should specify at least the following
```
{
"game_name": "mygame",
"description": "A brief description of mygame",
"player": "single" | "two" | "multi",
"image": "none" | "single" | "multi",
"languages": ["en"]
}
```

To test your game with some packaged models, you will add a `key.json` and run the command `clem run -g mygame -m model` from within the game directory.
The results will be written into `results`.
To also get html transcripts you can run `clem transcribe -g mygame`.
Overall, a game developers workspace directory will possibly look as follows:

```
mygame
- in/
- resources/
- results/
- __init__.py
- master.py
- instancegenerator.py
- clemgame.json
- key.json   
```

## Use Case: Model Developer

As a model developer you want to test the performance of your custom model on the benchmark.
For this you will checkout the [clembench](https://github.com/clp-research/clembench) repository into your workspace directory.
In addition, you want to make your custom model available via the `model_registry.json`.
The entry should at least specify a name and a backend, e.g., `{"model_name":"mymodel", "backend":"mybackend"}`. 
The important thing to consider is that `clem` will try to locate all additional backend files in the workspace.
Therefore, one of this should match the backend specified in the registry, meaning that you will create an `mybackend_api.py` in the workspace.
This files mainly implements the `generate_response` method for the model and might specify how it is loaded.
Finally, you will run `clem -g all -m mymodel` from the workspace directory to run your model on all games.
The results will be written into the `results` directory.
Hence, a model developers workspace might look as follows:

```
myworkspace
- clembench/
- results/
- model_registry.json
- mybackend_api.py  
```



We welcome you to contribute to or extend the benchmark with your own games and models. 
Please open a pull request in the respective repository. 
You can find more information on how to use the benchmark in the links below.

However, the following documentation needs still to be checked for up-to-dateness.

- [How to run the benchmark and evaluation locally](docs/howto_run_benchmark.md)
- [How to run the benchmark, update leaderboard workflow](docs/howto_benchmark_workflow.md)
- [How to add a new model](docs/howto_add_models.md)
- [How to add and run your own game](docs/howto_add_games.md)
- [How to integrate with Slurk](docs/howto_slurk.md)

This repository is tested on `Python 3.10`
