Metadata-Version: 2.1
Name: text-machina
Version: 0.2.10
Summary: Text Machina: Seamless Generation of Machine-Generated Text Datasets
Home-page: https://github.com/Genaios/TextMachina
Author: Genaios
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: scipy>=1.10.1
Requires-Dist: PyYAML>=6.0.1
Requires-Dist: datasets>=2.14.4
Requires-Dist: spacy>=3.6.1
Requires-Dist: typer>=0.9.0
Requires-Dist: pydantic>=2.3.0
Requires-Dist: petname>=2.6
Requires-Dist: pycountry>=22.3.5
Requires-Dist: ftfy>=6.1.3
Requires-Dist: fasttext-wheel
Requires-Dist: rich>=13.7.0
Requires-Dist: scikit-learn>=1.3.2
Requires-Dist: mauve-text>=0.3.0
Requires-Dist: matplotlib>=3.7.4
Requires-Dist: tabulate>=0.9.0
Requires-Dist: readchar>=4.0.5
Requires-Dist: evaluate>=0.4.1
Requires-Dist: textstat>=0.7.3
Requires-Dist: seqeval>=1.2.2
Provides-Extra: openai
Requires-Dist: openai>=1; extra == "openai"
Requires-Dist: tiktoken>=0.4.0; extra == "openai"
Provides-Extra: azure-openai
Requires-Dist: openai>=1; extra == "azure-openai"
Requires-Dist: tiktoken>=0.4.0; extra == "azure-openai"
Provides-Extra: bedrock
Requires-Dist: boto3; extra == "bedrock"
Requires-Dist: tiktoken>=0.4.0; extra == "bedrock"
Provides-Extra: ai21
Requires-Dist: ai21>=2.0.0; extra == "ai21"
Requires-Dist: ai21_tokenizer>=0.3.11; extra == "ai21"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.7.2; extra == "anthropic"
Provides-Extra: cohere
Requires-Dist: cohere>=4.36; extra == "cohere"
Provides-Extra: huggingface-local
Requires-Dist: torch==2.0.1; extra == "huggingface-local"
Requires-Dist: transformers>=4.32.0; extra == "huggingface-local"
Requires-Dist: accelerate>=0.22.0; extra == "huggingface-local"
Requires-Dist: bitsandbytes>=0.41.1; extra == "huggingface-local"
Provides-Extra: huggingface-remote
Requires-Dist: requests>=2.31.0; extra == "huggingface-remote"
Provides-Extra: vertex
Requires-Dist: google-auth; extra == "vertex"
Requires-Dist: google-cloud-aiplatform==1.25.0; extra == "vertex"
Requires-Dist: tiktoken>=0.4.0; extra == "vertex"
Provides-Extra: all
Requires-Dist: openai>=1; extra == "all"
Requires-Dist: tiktoken>=0.4.0; extra == "all"
Requires-Dist: openai>=1; extra == "all"
Requires-Dist: tiktoken>=0.4.0; extra == "all"
Requires-Dist: boto3; extra == "all"
Requires-Dist: tiktoken>=0.4.0; extra == "all"
Requires-Dist: ai21>=2.0.0; extra == "all"
Requires-Dist: ai21_tokenizer>=0.3.11; extra == "all"
Requires-Dist: anthropic>=0.7.2; extra == "all"
Requires-Dist: cohere>=4.36; extra == "all"
Requires-Dist: torch==2.0.1; extra == "all"
Requires-Dist: transformers>=4.32.0; extra == "all"
Requires-Dist: accelerate>=0.22.0; extra == "all"
Requires-Dist: bitsandbytes>=0.41.1; extra == "all"
Requires-Dist: requests>=2.31.0; extra == "all"
Requires-Dist: google-auth; extra == "all"
Requires-Dist: google-cloud-aiplatform==1.25.0; extra == "all"
Requires-Dist: tiktoken>=0.4.0; extra == "all"
Provides-Extra: dev
Requires-Dist: openai>=1; extra == "dev"
Requires-Dist: tiktoken>=0.4.0; extra == "dev"
Requires-Dist: openai>=1; extra == "dev"
Requires-Dist: tiktoken>=0.4.0; extra == "dev"
Requires-Dist: boto3; extra == "dev"
Requires-Dist: tiktoken>=0.4.0; extra == "dev"
Requires-Dist: ai21>=2.0.0; extra == "dev"
Requires-Dist: ai21_tokenizer>=0.3.11; extra == "dev"
Requires-Dist: anthropic>=0.7.2; extra == "dev"
Requires-Dist: cohere>=4.36; extra == "dev"
Requires-Dist: torch==2.0.1; extra == "dev"
Requires-Dist: transformers>=4.32.0; extra == "dev"
Requires-Dist: accelerate>=0.22.0; extra == "dev"
Requires-Dist: bitsandbytes>=0.41.1; extra == "dev"
Requires-Dist: requests>=2.31.0; extra == "dev"
Requires-Dist: google-auth; extra == "dev"
Requires-Dist: google-cloud-aiplatform==1.25.0; extra == "dev"
Requires-Dist: tiktoken>=0.4.0; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: types-requests; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: autoflake; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: pytest-sphinx; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: Sphinx<7.1.0,>=4.3.0; extra == "dev"
Requires-Dist: furo==2023.7.26; extra == "dev"
Requires-Dist: myst-parser<2.1,>=1.0; extra == "dev"
Requires-Dist: sphinx-copybutton==0.5.2; extra == "dev"
Requires-Dist: sphinx-autobuild==2021.3.14; extra == "dev"
Requires-Dist: sphinx-autodoc-typehints==1.23.3; extra == "dev"
Requires-Dist: packaging; extra == "dev"
Requires-Dist: setuptools; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: wheel; extra == "dev"

<!---
Copyright 2023 Genaios

Licensed under the CC BY-NC-ND 4.0 License

You must give appropriate credit, provide a link to the license, and indicate if changes were made.
You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
You may not use the material for commercial purposes.
If you remix, transform, or build upon the material, you may not distribute the modified material.
You are free to copy and redistribute this material as it is in any medium or format
You may obtain a copy of the License at

    https://creativecommons.org/licenses/by-nc-nd/4.0/
-->

<p align="center">
  <picture>
    <img alt="TextMachina" src="https://github.com/Genaios/TextMachina/blob/main/assets/title.png?raw=true" width="352" height="59" style="max-width: 100%;">
  </picture>
  <br/>
  <br/>
</p>

<p align="center">
    <a href="LICENSE">
        <img alt="license" src="https://img.shields.io/badge/license-CC_BY_NC_ND_4.0-green">
    </a>
    <a href="https://textmachina.readthedocs.io/en/latest/">
        <img alt="Documentation" src="https://img.shields.io/badge/Documentation-Readthedocs-green">
    </a>
    <a href="CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-v2.0-green">
    </a>
    <a href="https://pypi.org/project/text-machina/">
        <img alt="Pypi version" src="https://img.shields.io/pypi/v/text-machina">
    </a>
    <a href="https://pypi.org/project/text-machina/">
        <img alt="Downloads" src="https://img.shields.io/pypi/dm/text-machina">
    </a>
    

</p>

<h3 align="center">
    <p><b>Unifying strategies to build MGT datasets in a single framework</b></p>
</h3>

![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) TextMachina is a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as:

- 🔎 **Detection**: detect whether a text has been generated by an LLM.
- 🕵️‍♂️ **Attribution**: identify which LLM generated a text.
- 🚧 **Boundary detection**: find the boundary between human and generated text.
- 🎨 **Mixcase**: ascertain whether specific text spans are human-written or generated by LLMs.

![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) TextMachina provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets:

- 🦜 **LLM integrations**: easily integrate any LLM provider. Currently, ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) supports LLMs from Anthropic, Cohere, OpenAI, Google Vertex AI, Amazon Bedrock, AI21, Azure OpenAI, models deployed on vLLM and TRT inference servers, and any model from HuggingFace deployed either locally or remotely through the Inference API or Inference Endpoints. See [models](text_machina/src/models/) to implement your own LLM provider.

- ✍️ **Prompt templating**: just write your prompt template with placeholders and let ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) extractors fill the template and prepare a prompt for an LLM. See [extractors](text_machina/src/extractors) to implement your own extractors and learn more about the placeholders for each extractor.
- 🔒 **Constrained decoding**: automatically infer LLM decoding hyper-parameters from the human texts to improve the quality and reduce the biases of your MGT datasets. See [constrainers](text_machina/src/constrainers) to implement your own constrainers.
- 🛠️ **Post-processing**: post-processing functions that improve the quality of any MGT dataset and prevent common biases and artifacts. See [postprocessing](text_machina/src/postprocessing.py) to add new post-processing functions.
- 🌈 **Bias mitigation**: ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) is built with bias prevention in mind and helps you throughout the pipeline to avoid introducing spurious correlations in your datasets.
- 📊 **Dataset exploration**: explore the generated datasets and quantify their quality with a set of metrics. See [metrics](text_machina/metrics) and [interactive](text_machina/src/interactive.py) to implement your own metrics and visualizations.

The following diagram depicts ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true)'s pipeline.
<p align="center">
  <picture>
    <img alt="TextMachina Pipeline" src="https://github.com/Genaios/TextMachina/blob/main/assets/diagram.png?raw=true">
  </picture>
  <br/>
  <br/>
</p>

## 🔧 Installation
---

You can install all the dependencies with pip:

```
pip install text-machina[all]
```

or install only the dependencies for a specific LLM provider, optionally together with the development dependencies (see [setup.py](setup.py)):

```
pip install text-machina[anthropic,dev]
```

You can also install directly from source:

```
pip install .[all]
```

If you're planning to modify the code for specific use cases, you can install ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) in development mode:

```
pip install -e .[dev]
```
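
After installing, it can be handy to confirm that the packages behind a given extra are actually importable. The snippet below is a minimal sketch, not part of TextMachina: the `EXTRA_MODULES` mapping and the `missing_modules` helper are hypothetical names, and the mapping covers only a subset of the extras, with import names derived from the package's dependency list.

```python
from importlib.util import find_spec

# Hypothetical mapping from extra name to the modules it should make importable.
# Only a subset of the extras is listed here.
EXTRA_MODULES = {
    "openai": ["openai", "tiktoken"],
    "anthropic": ["anthropic"],
    "cohere": ["cohere"],
    "bedrock": ["boto3", "tiktoken"],
    "huggingface-local": ["torch", "transformers", "accelerate", "bitsandbytes"],
    "huggingface-remote": ["requests"],
}


def missing_modules(modules):
    """Return the names in `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for a missing parent package or a module without a spec.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for extra, modules in EXTRA_MODULES.items():
        absent = missing_modules(modules)
        status = "ok" if not absent else f"missing: {', '.join(absent)}"
        print(f"{extra}: {status}")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as `torch`.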

## 👀 Quick Tour
---

Once installed, you are ready to use ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) for building MGT datasets either using the [CLI](text_machina/src/cli.py) or programmatically.

### 📟 Using the CLI
The first step is to define a YAML configuration file or a directory tree containing YAML files. Read the [examples/learning](etc/examples/learning) files to learn how to define configurations using different providers and extractors for different tasks. Take a look at [examples/use_cases](etc/examples/use_cases) to see configurations for specific use cases.

Then, we can call the *explore* and *generate* endpoints of ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true)'s CLI. The *explore* endpoint lets you inspect a small dataset generated with a specific configuration through an interactive interface. For instance, suppose we want to see what an MGT detection dataset generated from *[XSum](https://huggingface.co/datasets/EdinburghNLP/xsum)* news articles with *gpt-3.5-turbo-instruct* looks like, and compute some metrics:

```bash
text-machina explore --config-path etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml \
--task-type detection \
--metrics-path etc/metrics.yaml \
--max-generations 10
```

<p align="center">
  <picture>
    <img alt="CLI interface showing generated and human text for detection" src="https://github.com/Genaios/TextMachina/blob/main/assets/explore.png?raw=true">
  </picture>
  <br/>
  <br/>
</p>

Great! With this configuration the dataset looks good: no artifacts, no obvious biases, and high-quality text. Let's now generate a whole dataset for MGT detection using that config file. The *generate* endpoint allows you to do that:

```bash
text-machina generate --config-path etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml \
--task-type detection
```

A run name will be assigned to your execution and ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) will cache results behind the scenes. If your run is interrupted at any point, you can use `--run-name <run-name>` to recover the progress and continue generating your dataset.
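
If you drive the CLI from a script (for example, to resume an interrupted run), the command can be assembled as an argument list. This is only a sketch: `build_command` is a hypothetical helper, and it uses just the flags that appear in this README. `--metrics-path` and `--max-generations` are shown here only for the *explore* endpoint, so passing them to *generate* is an assumption.

```python
import shlex


def build_command(endpoint, config_path, task_type, run_name=None,
                  metrics_path=None, max_generations=None):
    """Assemble a `text-machina` CLI call as an argument list."""
    if endpoint not in {"explore", "generate"}:
        raise ValueError(f"unknown endpoint: {endpoint}")
    args = ["text-machina", endpoint,
            "--config-path", config_path,
            "--task-type", task_type]
    if run_name is not None:
        # Reuse an earlier run name to continue from its cached results.
        args += ["--run-name", run_name]
    if metrics_path is not None:
        args += ["--metrics-path", metrics_path]
    if max_generations is not None:
        args += ["--max-generations", str(max_generations)]
    return args


resume = build_command("generate", "config.yaml", "detection", run_name="greedy-bear")
print(shlex.join(resume))
# To actually run it: subprocess.run(resume, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.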

### 👩‍💻 Programmatically

You can also use ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) programmatically. To do that, instantiate a dataset generator by calling *get_generator* with a *Config* object, and run its *generate* method. The *Config* object must contain the input, model, and generation configs, together with the task type for which the MGT dataset will be generated. Let's replicate the previous experiment programmatically:

```python
from text_machina import get_generator
from text_machina import Config, InputConfig, ModelConfig

input_config = InputConfig(
    domain="news",
    language="en",
    quantity=10,
    random_sample_human=True,
    dataset="xsum",
    dataset_text_column="document",
    dataset_params={"split": "test"},
    template=(
        "Write a news article whose summary is '{summary}' "
        "using the entities: {entities}\n\nArticle:"
    ),
    extractor="combined",
    extractors_list=["auxiliary.Auxiliary", "entity_list.EntityList"],
    max_input_tokens=256,
)

model_config = ModelConfig(
    provider="openai",
    model_name="gpt-3.5-turbo-instruct",
    api_type="COMPLETION",
    threads=8,
    max_retries=5,
    timeout=20,
)

generation_config = {"temperature": 0.7, "presence_penalty": 1.0}

config = Config(
    input=input_config,
    model=model_config,
    generation=generation_config,
    task_type="detection",
)
generator = get_generator(config)
dataset = generator.generate()
```
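
To make the role of the template placeholders concrete, here is a rough illustration in plain Python of how `{summary}` and `{entities}` could end up in a prompt. The extracted values below are made up, and `str.format` stands in for whatever the extractors do internally, which is not shown in this README.

```python
# Adjacent string literals are joined with no separator, so the space
# before "using" has to be part of one of the literals.
TEMPLATE = (
    "Write a news article whose summary is '{summary}' "
    "using the entities: {entities}\n\nArticle:"
)

# Hypothetical extractor outputs for one human-written sample.
extracted = {
    "summary": "Heavy rain closes two bridges in the city centre",
    "entities": "River Ouse, York, Environment Agency",
}

# Fill every placeholder with the matching extracted value.
prompt = TEMPLATE.format(**extracted)
print(prompt)
```

A placeholder that no extractor provides would raise a `KeyError` in this sketch, which is a cheap way to catch typos in a template before spending API calls.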

## 🛠️ Supported tasks
---
![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) can generate datasets for MGT detection, attribution, boundary detection, and mixcase detection:

<p align="center">
  <picture>
    <img alt="CLI interface showing generated and human text for detection" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/detection.png?raw=true">
    <figcaption>Example from a detection task.</figcaption>
  </picture>
  <br/>
  <br/>
</p>

<p align="center">
  <picture>
    <img alt="CLI interface showing generated and human text for attribution" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/attribution.png?raw=true">
    <figcaption>Example from an attribution task.</figcaption>
  </picture>
  <br/>
  <br/>
</p>

<p align="center">
  <picture>
    <img alt="CLI interface showing generated and human text for boundary" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/boundary.png?raw=true">
    <figcaption>Example from a boundary detection task.</figcaption>
  </picture>
  <br/>
  <br/>
</p>

<p align="center">
  <picture>
    <img alt="CLI interface showing generated and human text for sentence-based mixcase" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/mixcase_sentences.png?raw=true">
    <figcaption>Example from a mixcase task (tagging), interleaving generated sentences with human texts.</figcaption>
  </picture>
  <br/>
  <br/>
</p>

<p align="center">
  <picture>
    <img alt="CLI interface showing generated and human text for word-span-based mixcase" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/mixcase_wordspans.png?raw=true">
    <figcaption>Example from a mixcase task (tagging), interleaving generated word spans with human texts.</figcaption>
  </picture>
  <br/>
  <br/>
</p>

Users can also build datasets for other tasks not included in ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) just by leveraging the provided task types. For instance, datasets for mixcase classification can be built from datasets for mixcase tagging, and datasets for mixcase attribution can be built using the generation model name as the label.
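
As a rough sketch of the first idea, per-span tags can be collapsed into a single document-level label. The label names (`"human"`, `"generated"`, `"mixed"`) and the flat list-of-tags input are assumptions made for illustration; adapt them to the columns of the dataset you actually generate.

```python
def document_label(span_labels):
    """Collapse per-span labels into one document-level label."""
    kinds = set(span_labels)
    if not kinds:
        raise ValueError("a document needs at least one labeled span")
    unknown = kinds - {"human", "generated"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if kinds == {"human"}:
        return "human"
    if kinds == {"generated"}:
        return "generated"
    # Both kinds of span are present in the same document.
    return "mixed"


print(document_label(["human", "generated", "human"]))
```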

## 🔄 Common Use Cases
---
There is a set of common use cases with ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true). Here's how to carry them out using the *explore* and *generate* endpoints.

| Use case                                                                    | Command                                                                                                                       |
|-----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| Explore a dataset of 10 samples for MGT detection and show metrics          | <pre>text-machina explore \ <br>--config-path config.yaml \ <br>--task-type detection \ <br>--max-generations 10 \ <br>--metrics-path metrics.yaml</pre>  |
| Explore an existing dataset for MGT detection and show metrics              | <pre>text-machina explore \ <br>--config-path config.yaml \ <br>--run-name greedy-bear \ <br>--task-type detection \ <br>--metrics-path metrics.yaml</pre> |
| Generate a dataset for MGT detection                                        | <pre>text-machina generate \ <br>--config-path config.yaml \ <br>--task-type detection</pre>                                                       |
| Generate a dataset for MGT attribution                                      | <pre>text-machina generate \ <br>--config-path config.yaml \ <br>--task-type attribution</pre>                                                     |
| Generate a dataset for boundary detection                                   | <pre>text-machina generate \ <br>--config-path config.yaml \ <br>--task-type boundary</pre>                                                        |
| Generate a dataset for mixcase detection                                | <pre>text-machina generate \ <br>--config-path config.yaml \ <br>--task-type mixcase</pre>                                                        |
| Generate a dataset for MGT detection using config files in a directory tree | <pre>text-machina generate \ <br>--config-path configs/ \ <br>--task-type detection</pre>                                                          |

## 💾 Caching
---
![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) TextMachina caches each dataset it generates through the CLI endpoints under a run name. 
The specific run name is given as the last message in the logs, and can be used with `--run-name <run-name>` to continue from interrupted runs.
The default cache dir used by ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) TextMachina is `/tmp/text_machina_cache`. 
It can be modified by setting `TEXT_MACHINA_CACHE_DIR` to a different path.
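
The lookup described above can be mirrored in a few lines, for instance to locate cached runs from your own scripts. This is a sketch of the documented behaviour, not TextMachina's code: `resolve_cache_dir` is a hypothetical helper, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path

DEFAULT_CACHE_DIR = "/tmp/text_machina_cache"


def resolve_cache_dir(environ=os.environ):
    """Return the cache directory implied by the given environment mapping."""
    # Assumption: an empty TEXT_MACHINA_CACHE_DIR falls back to the default.
    return Path(environ.get("TEXT_MACHINA_CACHE_DIR") or DEFAULT_CACHE_DIR)


print(resolve_cache_dir({}))
print(resolve_cache_dir({"TEXT_MACHINA_CACHE_DIR": "/data/tm_cache"}))
```

Since `/tmp` is cleared on reboot on many systems, pointing `TEXT_MACHINA_CACHE_DIR` at a persistent location is worth considering for long runs.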


## ⚠️ Notes and Limitations
---

- Although you can use any kind of extractor to build boundary detection datasets, it is highly recommended to use the *sentence_prefix* or
*word_prefix* extractors with a random number of sentences/words to avoid biases that lead boundary detection models to just count sentences or words.

- ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) attempts to remove disclosure patterns (e.g., "*As an AI language model ...*") with a limited set of regular expressions, but these patterns depend on the LLM and the language. We strongly recommend that you first *explore* your dataset to look for these biases, and then modify the postprocessing or the prompt template accordingly to remove them.

- Generating multilingual datasets is not well supported yet. At this moment, we recommend generating an independent dataset for each language and combining them outside of ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true).

- Generating machine-generated code datasets is not well supported yet.
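
When checking a dataset for disclosure patterns, a quick scan of the generated texts can complement manual exploration. The patterns below are illustrative English examples only, not the regular expressions shipped with the library, and `find_disclosures` is a hypothetical helper.

```python
import re

# Illustrative patterns; extend them for your LLM and language.
DISCLOSURE_PATTERNS = [
    re.compile(r"\bas an ai (language )?model\b", re.IGNORECASE),
    re.compile(r"\bi(?: am|'m) (?:just )?an ai\b", re.IGNORECASE),
    re.compile(r"\bi (?:cannot|can't) (?:browse|access) the internet\b", re.IGNORECASE),
]


def find_disclosures(texts):
    """Return (index, matched_text) pairs for texts that hit any pattern."""
    hits = []
    for i, text in enumerate(texts):
        for pattern in DISCLOSURE_PATTERNS:
            match = pattern.search(text)
            if match:
                hits.append((i, match.group(0)))
                break  # one hit is enough to flag the text
    return hits


samples = [
    "The council approved the budget on Monday.",
    "As an AI language model, I cannot verify this claim.",
]
print(find_disclosures(samples))
```

Texts flagged this way are good candidates for an extra postprocessing rule or a tweak to the prompt template.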

## 📖 Citation
---
```
@misc{sarvazyan2024textmachina,
      title={TextMachina: Seamless Generation of Machine-Generated Text Datasets}, 
      author={Areg Mikael Sarvazyan and José Ángel González and Marc Franco-Salvador},
      year={2024},
      eprint={2401.03946},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
## 🤝 Contribute
---

Feel free to contribute to ![icon](https://github.com/Genaios/TextMachina/blob/main/assets/typewriter.png?raw=true) by raising an issue.

Please install and use the [dev-tools](dev-tools) to format the code correctly when contributing to this repo.

## 🏭 Commercial Purposes
---
Please contact stuart.winter-tear@genaios.ai and marc.franco@genaios.ai if you are interested in using TextMachina for commercial purposes.
