Metadata-Version: 2.4
Name: fair-matrix
Version: 0.2.0
Summary: matrix.
Author: Facebook AI Research
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering
Classifier: Development Status :: 4 - Beta
License-File: LICENSE
License-File: LICENSE.md
Requires-Dist: submitit>=1.5.2
Requires-Dist: psutil
Requires-Dist: torch>=2.5.1
Requires-Dist: transformers>=4.45.2
Requires-Dist: vllm>=v0.6.6.post1
Requires-Dist: ray[serve]>=2.40.0
Requires-Dist: grpcio==1.70.0
Requires-Dist: grpcio-tools==1.70.0
Requires-Dist: fire
Requires-Dist: jinja2
Requires-Dist: pyyaml
Requires-Dist: portalocker
Requires-Dist: boto3
Requires-Dist: google-genai>=1.13.0
Requires-Dist: datasketch
Requires-Dist: s3fs
Requires-Dist: datasets
Requires-Dist: iopath
Requires-Dist: jsonlines
Requires-Dist: pytest>=4.3.0 ; extra == "dev"
Requires-Dist: pytest-asyncio>=0.26.0 ; extra == "dev"
Requires-Dist: coverage[toml]>=5.1 ; extra == "dev"
Requires-Dist: black==24.10.0 ; extra == "dev"
Requires-Dist: isort>=5.12.0 ; extra == "dev"
Requires-Dist: pre-commit ; extra == "dev"
Requires-Dist: mypy>=1.13.0 ; extra == "dev"
Requires-Dist: pylint>=2.8.0 ; extra == "dev"
Requires-Dist: types-PyYAML ; extra == "dev"
Requires-Dist: types-requests ; extra == "dev"
Requires-Dist: flit>=3.5.1 ; extra == "dev"
Requires-Dist: sglang[all]==0.4.5.post1 ; extra == "sglang-045"
Requires-Dist: sglang-router ; extra == "sglang-045"
Requires-Dist: vllm==v0.6.6.post1 ; extra == "vllm-066"
Requires-Dist: ray[serve]==2.40.0 ; extra == "vllm-066"
Requires-Dist: vllm==v0.7.3 ; extra == "vllm-073"
Requires-Dist: ray[serve]==2.40.0 ; extra == "vllm-073"
Requires-Dist: vllm==v0.8.3 ; extra == "vllm-083"
Requires-Dist: ray[serve]==2.43.0 ; extra == "vllm-083"
Requires-Dist: torch>=2.6.0 ; extra == "vllm-083"
Project-URL: Source, https://github.com/facebookresearch/matrix
Project-URL: Tracker, https://github.com/facebookresearch/matrix/issues
Provides-Extra: dev
Provides-Extra: sglang-045
Provides-Extra: vllm-066
Provides-Extra: vllm-073
Provides-Extra: vllm-083

<h1 align="center">
Matrix: Multi-Agent daTa geneRation Infra and eXperimentation
</h1>

<h3 align="center">
Fast, scalable, and easy-to-use LLM-generation engine
</h3>

---

*Latest News*
* 04/2025: 🔥 We officially released Matrix with [Collaborative Reasoner](https://github.com/facebookresearch/collaborative-reasoner), showcasing the generation of multi-agent collaborative conversations with Matrix as the inference engine.

---

# About

Matrix is a fast, scalable, and easy-to-use LLM-generation engine for use cases including model benchmarking, data processing, and data generation.

Matrix runs on top of a [Ray](https://github.com/ray-project/ray) cluster for scalability. Cluster resources are acquired from [Slurm](https://slurm.schedmd.com/documentation.html) or a local machine through [submitit](https://github.com/facebookincubator/submitit). Matrix has the following main features:

**Large-scale inference** for mainstream open-source and proprietary LLMs
- Hugging Face LLMs via seamless integration with [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang). Native multi-node inference support.
- Azure OpenAI, SageMaker, Gemini models with Proxy server

**Data pipelines** for high-throughput data processing and quality checks
- Code execution service as a wrapper of [bubblewrap](https://github.com/containers/bubblewrap).
- Data curation, quality filtering, and augmentation with classifiers.

### Matrix vs. Existing Frameworks

Matrix is designed for scalable LLM inference on [Slurm](https://slurm.schedmd.com/documentation.html). Here is a feature comparison with other popular LLM inference solutions.


| Serving Frameworks | Slurm | vLLM | HTTP | gRPC | Auto-scaling | Open-source |
|-------------------|:-----:|:----:|:----:|:----:|:-----------:|:-----------:|
| vector-inference | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| litellm | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |
| ollama | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| SageMaker | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ |
| llm-swarm | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| Matrix | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

---

## Quick Links
  - [Getting Started](#getting-started)
  - [Advanced Deployment](#advanced-deployment)
  - [LLM Inference](#llm-inference)
  - [Job Manager](#job-manager)
  - [Data pipelines](#data-pipelines)
  - [Contributing](#contributing)
  - [Citation](#citation)

---

## Getting Started

- Conda Environment
```
conda create --name matrix python=3.10
conda activate matrix
pip install "fair-matrix[vllm-083]"
```

- Launch ray cluster
```
matrix start_cluster --add_workers 1 --slurm "{'account': $SLURM_ACCOUNT, 'qos': $SLURM_QOS}"
```

- Deploy Model
```
# login to access the Hugging Face Hub
huggingface-cli login

matrix deploy_applications --applications "[{'model_name': 'meta-llama/Llama-3.1-8B-Instruct', 'min_replica': 8, 'name': '8B'}]"
```

- LLM Inference
```
matrix check_health --app_name 8B
```

- Shut down ray cluster
```
matrix stop_cluster
```

---

## Advanced Deployment
### Enable Grafana Dashboard

- Install in conda
```
bash ./matrix/scripts/install_prometheus_and_grafana.sh
```
- Enable in Ray Dashboard
```
matrix start_cluster --enable_grafana
```

### Incremental Deployment

- Add More Workers
```
matrix start_cluster --add_workers 4 --slurm "{'account': $SLURM_ACCOUNT, 'qos': $SLURM_QOS}"
```

- Add/Remove Applications
```
matrix deploy_applications --action add --applications "[{'model_name': 'meta-llama/Llama-3.1-405B-Instruct', 'min_replica': 2, 'name': '405B'}]"
```

- Remove All Applications
```
matrix deploy_applications --applications ''
```
### Adjust Model Args
vLLM engine [arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html) can be specified in the `deploy_applications` arguments. Default values for popular models are in [llm_config.py](matrix/app_server/llm/llm_config.py). Other useful args:
* `model_name`: a huggingface model name or a directory containing checkpoints.
* `name`: the default app_name.
* `model_size`: template to apply when the model is loaded from a directory (e.g. 8B, 70B, 405B); templates come from the llm_config.py file.
* `max_ongoing_requests`: the maximum number of concurrent requests per replica.
* `min_replica` and `max_replica`: the replica count range; the actual count is auto-scaled based on the number of Ray workers.
* `use_grpc`: enable gRPC by adding `{'use_grpc': 'true'}`.
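The value passed to `--applications` is a Python-literal list of dicts, so it can also be built programmatically instead of hand-written. A minimal sketch (the keys are taken from the deployment examples in this README; `shlex.quote` is used only for safe shell embedding):

```python
import shlex

# Application spec for one model; keys mirror the deploy_applications
# examples in this README.
app = {
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",
    "name": "8B",
    "min_replica": 8,
    "max_replica": 16,
    "max_ongoing_requests": 64,
    "use_grpc": "true",
}

# repr() of a list of dicts yields the single-quoted literal style
# the CLI examples above use.
spec = repr([app])
cmd = f"matrix deploy_applications --applications {shlex.quote(spec)}"
print(cmd)
```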

### OpenAI Azure Model
- Note: no GPU is required; when starting workers you can add `--slurm "{'gpus_per_node': 0}"`

```
matrix deploy_applications --applications "[{'api_version': \"$AZURE_API_VERSION\", 'api_endpoint': \"$AZURE_ENDPOINT\", 'api_key': \"$AZURE_API_KEY\", 'app_type': 'openai', 'model_name': 'gpt-4o', 'name': 'openai'}]"
```

### Gemini
- Note: no GPU is required; when starting workers you can add `--slurm "{'gpus_per_node': 0}"`

```
matrix deploy_applications --applications "[{'app_type': 'gemini', 'name': 'gemini', 'api_key': \"$GOOGLE_API_KEY\", 'model_name': 'gemini-2.0-flash'}]"
```

### Deepseek R1
vLLM >= 0.8.3 supports DeepSeek R1. An alternative backend is SGLang.
```
# install sglang
pip install "fair-matrix[sglang-045]"

matrix deploy_applications --applications "[{'model_name': 'deepseek-ai/DeepSeek-R1', 'pipeline-parallel-size': 2, 'app_type': 'sglang_llm', 'name': 'r1'}]"
```
### Llama 4
```
matrix deploy_applications --applications "[{'model_name': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'name': 'scout'}]"

matrix deploy_applications --applications "[{'model_name': 'meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', 'name': 'maverick'}]"
```

---

## LLM Inference

### Batch Query
```
# download the math-500 dataset
python -m matrix.scripts.hf_dataset_to_jsonl HuggingFaceH4/MATH-500 test test.jsonl

# query math-500
matrix inference --app_name maverick --input_jsonls test.jsonl --output_jsonl response.jsonl --batch_size=64 \
  --system_prompt "Please reason step by step, and put your final answer within \boxed{}." --max_tokens 30000 --text_key problem --timeout_secs 1800
```

#### Input Format
There are three formats for the jsonl input files:
  - Message format with arg `--messages_key request.messages`
```json
{
    "request": {"messages": [{"role": "system","content": "You are ..."},{"role": "user","content": "Solve the following..."}]}
}
```
  - Instruct format with arg `--text_key text`
```json
{
    "text": "<|start_header_id|>system<|end_header_id|>You are ... <|eot_id|><|start_header_id|>user<|end_header_id|>Solve the following ...<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
}
```
  - Raw text with arg `--text_key text`
```json
{
    "text": "Solve the following ..."
}
```
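As a reference, the message format above can be produced with the standard library alone; this sketch (not part of the Matrix API) writes one record and reads it back:

```python
import json
import os
import tempfile

# One JSON object per line: the message format used with
# --messages_key request.messages.
record = {
    "request": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Solve the following problem..."},
        ]
    }
}

path = os.path.join(tempfile.gettempdir(), "matrix_input_example.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(record) + "\n")

# Each line of a jsonl file is an independent JSON object.
with open(path) as f:
    rows = [json.loads(line) for line in f]
print(rows[0]["request"]["messages"][0]["role"])  # prints "system"
```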
### Inference API
```python
from matrix import Cli
from matrix.client import query_llm

metadata = Cli().get_app_metadata(app_name="8B")

# async call
response = await query_llm.make_request(
  url=metadata["endpoints"]["head"],
  model=metadata["model_name"],
  app_name=metadata["name"],
  data={"messages": [{"role": "user", "content": "hi"}]},
)

# batch inference
query_llm.batch_requests(
  url=metadata["endpoints"]["head"],
  model=metadata["model_name"],
  app_name=metadata["name"],
  requests=[{"messages": [{"role": "user", "content": "hi"}]}],
)
```

---

## Job Manager

The job manager allows users to submit tasks for distributed execution on Ray. More details are in the [Job Manager README](matrix/job/README.md).

---

## Data pipelines

### Code Execution
- Install bubblewrap
```
conda install -c conda-forge bubblewrap
```
- Run example python code
```
matrix deploy_applications --applications "[{'name': 'code', 'app_type': code, 'min_replica': 5}]"
matrix check_health --app_name code

python -m matrix.scripts.hf_dataset_to_jsonl openai/openai_humaneval test humaneval/test.jsonl
matrix inference --app_name code --input_jsonls humaneval/test.jsonl --output_jsonl ~/tmp/he.jsonl --text_keys "[prompt, canonical_solution, test, entry_point]" --prompt_template "check({entry_point})"
```
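Conceptually, the code app assembles each HumanEval record into a runnable program: the prompt, a candidate solution, the unit tests, and finally the `check({entry_point})` call produced by the `--prompt_template` above. A minimal in-process sketch of that assembly (illustrative only; the real service isolates execution with bubblewrap, and `exec` must never be used on untrusted code):

```python
# A toy HumanEval-style record; real records come from humaneval/test.jsonl.
record = {
    "prompt": "def add(a, b):\n",
    "canonical_solution": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

# Concatenate prompt + solution + tests, then invoke the checker,
# mirroring the "check({entry_point})" prompt template above.
program = (
    record["prompt"]
    + record["canonical_solution"]
    + record["test"]
    + f"check({record['entry_point']})\n"
)

namespace: dict = {}
exec(program, namespace)  # raises AssertionError if the solution fails
print("passed")
```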

### Data filtering and augmentation
- minhash dedup
```
python -m matrix.data_pipeline.quality.dedup_minhash $ray_head:$client_server_port input.jsonl output_dir working_dir --text_key problem
```
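To illustrate what MinHash deduplication estimates, here is a stdlib sketch of the idea (not the implementation behind `dedup_minhash`, which builds on `datasketch`): a signature of per-hash minima approximates the Jaccard similarity between shingle sets, so near-duplicate records collide while unrelated ones do not.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Character n-grams of a whitespace-normalized, lowercased string."""
    t = " ".join(text.lower().split())
    return {t[i : i + n] for i in range(max(1, len(t) - n + 1))}

def minhash(items: set, num_perm: int = 64) -> list:
    """Keep the minimum hash under each of num_perm salted hash functions."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items
        )
        for seed in range(num_perm)
    ]

def similarity(a: list, b: list) -> float:
    """Fraction of matching slots estimates Jaccard similarity of the sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = minhash(shingles("Solve the following math problem step by step."))
s2 = minhash(shingles("Solve the following math  problem step by step!"))
s3 = minhash(shingles("A completely different sentence about geese."))

# Near-duplicates share most signature slots; unrelated text shares few.
print(round(similarity(s1, s2), 2), round(similarity(s1, s3), 2))
```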
- multilabel classification
```
python -m matrix.data_pipeline.classification.multi_label_classification $ray_head:$client_server_port  \
  cardiffnlp/twitter-roberta-base-emotion-multilabel-latest input.jsonl output_dir \
  --num_gpus 8 --text_key question --threshold_fname ""
```
- Offline batch inference
```
python -m matrix.data_pipeline.generate.vllm_generate $ray_head:$client_server_port ./math-500/test.jsonl math-500/response  \
  --prompt_template "<|start_header_id|>system<|end_header_id|>\n\nPlease reason step by step, and put your final answer within \boxed{}.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<user_message><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
  --model_args "{'model': 'meta-llama/Llama-3.3-70B-Instruct', 'seed': 42, 'max_model_len': 20480, 'tensor_parallel_size': 4}" \
  --sampling_params "{'n': 1, 'temperature': 0.6, 'top_p': 0.95, 'max_tokens': 16384}" \
  --min_concurrency 32 --output_key pred --batch_size=32
```
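In the example above, the `--prompt_template` appears to use `<user_message>` as the slot that receives each row's text (an inference from the template shown, not a documented contract). Under that assumption, the per-row expansion looks like:

```python
# Hypothetical expansion of a prompt template per input row; the
# "<user_message>" placeholder is inferred from the example above.
template = (
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "Please reason step by step, and put your final answer within \\boxed{}."
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "<user_message>"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

row = {"problem": "What is 2 + 2?"}  # one record from the input jsonl
prompt = template.replace("<user_message>", row["problem"])
print(prompt)
```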
---

## Contributing
We always welcome contributions to Matrix! Please refer to the
[Contribution Guidelines](CONTRIBUTING.md) to learn how to format, test, and
submit your work. If you have any questions about the code,
feel free to email Dong Wang (dongwang@meta.com) or Daniel Li (shangwel@meta.com).

## Citation
If you use Matrix in your research and wish to refer to it, please use the
following BibTeX entry.

```
@software{matrix2025,
  author = {Dong Wang and Yang Li and Ansong Ni and Youssef Emad and Xinjie Lei and Ruta Desai and Karthik Padthe and Xian Li and Asli Celikyilmaz and Ramya Raghavendra and Leo Huang and Daniel Li},
  title = {Matrix: Multi-Agent daTa geneRation Infra and eXperimentation},
  url = {http://github.com/facebookresearch/matrix},
  year = {2025},
}
```

## License
This project is MIT licensed, as found in the [LICENSE](LICENSE) file.


## Acknowledgement
We gratefully acknowledge the [Ray](https://github.com/ray-project/ray) and [vLLM](https://github.com/vllm-project/vllm) team for initial Ray Serve integration with vLLM.

