Metadata-Version: 2.4
Name: sieves
Version: 0.17.8
Summary: Plug-and-play, zero-shot document processing pipelines.
Home-page: https://github.com/MantisAI/sieves
Author: Matthew Upson, Nick Sorros, Raphael Mitsch, Matthew Maufe, Angelo Di Gianvito
Author-email: hi@mantisnlp.com
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: chonkie<2,>=1
Requires-Dist: datasets<4,>=3
Requires-Dist: jinja2<4,>=3
Requires-Dist: loguru<1,>=0.7
Requires-Dist: outlines<2,>=1
Requires-Dist: transformers[torch]<5,>=4
Requires-Dist: dspy-ai<3,>=2
Requires-Dist: dspy<3,>=2
Requires-Dist: pydantic<3,>=2
Requires-Dist: nest-asyncio<2,>=1
Provides-Extra: ingestion
Requires-Dist: docling<3,>=2; extra == "ingestion"
Requires-Dist: marker-pdf[full]<2,>=1.6.1; extra == "ingestion"
Requires-Dist: nltk>=3.9.1; extra == "ingestion"
Provides-Extra: engines
Requires-Dist: accelerate<2,>1.2; extra == "engines"
Requires-Dist: gliner<1; extra == "engines"
Requires-Dist: langchain-core<0.4,>=0.3; extra == "engines"
Requires-Dist: langchain<0.4,>=0.3; extra == "engines"
Requires-Dist: sentencepiece<1; extra == "engines"
Provides-Extra: distill
Requires-Dist: setfit<2,>=1.1; extra == "distill"
Requires-Dist: model2vec[train]<0.5,>0.4; extra == "distill"
Provides-Extra: test
Requires-Dist: pre-commit<5,>=4; extra == "test"
Requires-Dist: pytest<8,>=7; extra == "test"
Requires-Dist: mypy>=1; extra == "test"
Requires-Dist: mypy-extensions>=1; extra == "test"
Requires-Dist: pytest-cov>=6; extra == "test"
Requires-Dist: anthropic<1,>=0.45; extra == "test"
Requires-Dist: langchain-community<0.4,>=0.3.31; extra == "test"
Requires-Dist: langchain-openai>=0.3.35; extra == "test"
Requires-Dist: mkdocstrings[python]<1,>=0.27; extra == "test"
Requires-Dist: mkdocs-material<10,>=9.6; extra == "test"
Requires-Dist: pre-commit<5,>=4; extra == "test"
Dynamic: license-file

<img src="https://raw.githubusercontent.com/mantisai/sieves/main/docs/assets/sieve.png" width="230" align="left" style="margin-right:60px" />
<img src="https://raw.githubusercontent.com/mantisai/sieves/main/docs/assets/sieves_sieve_style.png" width="350" align="left" style="margin-right:60px" />

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/mantisai/sieves/test.yml)](https://github.com/mantisai/sieves/actions/workflows/test.yml)
![GitHub top language](https://img.shields.io/github/languages/top/mantisai/sieves)
[![PyPI - Version](https://img.shields.io/pypi/v/sieves)]((https://pypi.org/project/sieves/))
![PyPI - Status](https://img.shields.io/pypi/status/sieves)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
[![codecov](https://codecov.io/gh/mantisai/sieves/branch/main/graph/badge.svg)](https://codecov.io/gh/mantisai/sieves)
<a href="https://doi.org/10.5281/zenodo.17342348"><img src="https://zenodo.org/badge/902811666.svg" alt="DOI"></a>

## Zero-shot document processing made easy.

`sieves` is a library for zero- and few-shot NLP tasks with structured generation. Build production-ready NLP prototypes quickly, with guaranteed output formats and no training required.

Read our documentation [here](https://sieves.ai). An automatically generated version (courtesy of Devin via [DeepWiki](https://deepwiki.com/)) is available [here](https://deepwiki.com/MantisAI/sieves).

<details>
  <summary><b>Installation</b></summary>

Install `sieves` with `pip install sieves` (or `uv add sieves`).

The following extra groups exist:
- `ingestion` for ingestion libraries (for converting documents into text/markdown), e.g. `docling`
- `distill` for distillation utilities, e.g. training frameworks like `setfit`
- `engines` for structured generation utilities beyond the default `outlines`

If you want to install all dependencies by default
```
pip install "sieves[engines,distill,ingestion]"
```
You can also choose to install individual dependencies as you see fit.
</details>


> [!WARNING]
> `sieves` is in active development and currently in beta. Be advised that the API might change in between minor version
> updates.

### Why `sieves`?

Even in the era of generative AI, structured outputs and observability remain crucial.

Many real-world scenarios require rapid prototyping with minimal data. Generative language models excel here, but
producing clean, structured output can be challenging. Various tools address this need for structured/guided language
model output, including [`outlines`](https://github.com/dottxt-ai/outlines), [`dspy`](https://github.com/stanfordnlp/dspy),
[`langchain`](https://github.com/langchain-ai/langchain), and others. Each has different design patterns, pros and cons. `sieves` wraps these tools and provides
a unified interface for input, processing, and output.

Developing NLP prototypes often involves repetitive steps: parsing and chunking documents, exporting results for
model fine-tuning, and experimenting with different prompting techniques. All these needs are addressed by existing
libraries in the NLP ecosystem address (e.g. [`docling`](https://github.com/DS4SD/docling) for file parsing, or [`datasets`](https://github.com/huggingface/datasets) for transforming
data into a unified format for model training).

`sieves`  **simplifies NLP prototyping** by bundling these capabilities into a single library, allowing you to quickly
build modern NLP applications. It provides:
- Zero- and few-shot model support for immediate inference
- A bundle of utilities addressing common requirements in NLP applications
- A unified interface for structured generation across multiple libraries
- Built-in tasks for common NLP operations
- Easy extendability
- A document-based pipeline architecture for easy observability and debugging
- Caching - pipelines cache processed documents to prevent costly redundant model calls

`sieves` draws a lot of inspiration from [`spaCy`](https://spacy.io/) and particularly [`spacy-llm`](https://github.com/explosion/spacy-llm).

---

### Features

- :dart: **Zero Training Required:** Immediate inference using zero-/few-shot models
- :robot: **Unified Generation Interface:** Seamlessly use multiple libraries
  - [`dspy`](https://github.com/stanfordnlp/dspy)
  - [`gliner`](https://github.com/urchade/GLiNER)
  - [`langchain`](https://github.com/langchain-ai/langchain)
  - [`outlines`](https://github.com/dottxt-ai/outlines)
  - [`transformer`](https://github.com/huggingface/transformers)
- :arrow_forward: **Observable Pipelines:** Easy debugging and monitoring
- :hammer_and_wrench: **Integrated Tools:**
  - Document parsing (optional via `ingestion` extra): [`docling`](https://github.com/DS4SD/docling), [`marker`](https://github.com/VikParuchuri/marker)
  - Text chunking: [`chonkie`](https://github.com/chonkie-ai/chonkie)
- :label: **Ready-to-Use Tasks:**
  - Multi-label classification
  - Information extraction
  - Summarization
  - Translation
  - Multi-question answering
  - Aspect-based sentiment analysis
  - PII (personally identifiable information) anonymization
  - Named entity recognition
  - Coming soon: entity linking, knowledge graph creation, ...
- :floppy_disk: **Persistence:** Save and load pipelines with configurations
- :rocket: **Optimization:** Improve task performance by optimizing prompts and few-shot examples using [DSPy's MIPROv2](https://dspy-docs.vercel.app/api/optimizers/MIPROv2)
- :teacher: **Distillation:** Fine-tune smaller, specialized models using your zero-shot results with frameworks like SetFit and Model2Vec.
  Export results as HuggingFace [`Dataset`](https://github.com/huggingface/datasets) for custom training.
- :recycle: **Caching** to avoid unnecessary model calls

> [!IMPORTANT]
> `sieves` was built with and requires Python 3.12 or higher. Note however that some dependencies (such as `pyarrow`
> via `datasets`) don't have prebuilt wheels for Python versions newer than 3.12 yet, in which case you'll need to
> manually install those dependencies.
> For the time being we recommend using `sieves` with Python 3.12.

---

### Getting Started

Here's a simple classification example using [`outlines`](https://github.com/dottxt-ai/outlines):

```python
from sieves import Pipeline, tasks, Doc
import outlines

# 1. Define documents by text or URI.
docs = [Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")]

# 2. Choose a model (Outlines in this example).
model = outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")

# 3. Create pipeline with tasks (verbose init).
pipe = Pipeline(
  # Add classification task to pipeline.
  tasks.Classification(labels=["science", "politics"], model=model)
)

# 4. Run pipe and output results.
for doc in pipe(docs):
  print(doc.results)

# Tip: Pipelines can also be composed succinctly via chaining (+).
# For multi-step pipelines, you can write:
#   pipe = tasks.Ingestion(export_format="markdown") + tasks.Chunking(chunker) + tasks.Classification(labels=[...], model=model)
# Note: Ingestion libraries are optional and not installed by default.
#       Install with: pip install "sieves[ingestion]" or install the specific libraries directly (e.g., `docling`, `marker`).
# Note: additional Pipeline parameters (e.g., use_cache=False) are only available via the verbose init,
# e.g., Pipeline([t1, t2], use_cache=False).
```

<details>
  <summary><b>Advanced Example</b></summary>

This example demonstrates PDF parsing, text chunking, and classification.

Note: Ingestion libraries are optional and not installed by default. To run the ingestion step, install with the extra or install the libraries directly:

```
pip install "sieves[ingestion]"   # or install ingestion libraries directly
```

```python
import pickle

import gliner.multitask
import chonkie
import tokenizers
import docling.document_converter

from sieves import Pipeline, tasks, Doc

# 1. Define documents by text or URI.
docs = [Doc(uri="https://arxiv.org/pdf/2408.09869")]

# 2. Choose a model for structured generation.
model_name = 'knowledgator/gliner-multitask-v1.0'
model = gliner.GLiNER.from_pretrained(model_name)

# 3. Create chunker object.
chunker = chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained(model_name))

# 3. Create pipeline with tasks.
pipe = Pipeline(
  [
    # 4. Add document parsing task.
    tasks.Ingestion(export_format="markdown"),
    # 5. Add chunking task to ensure we don't exceed our model's context window.
    tasks.Chunking(chunker),
    # 6. Add classification task to pipeline.
    tasks.Classification(
        task_id="classifier",
        labels=["science", "politics"],
        model=model,
    ),
  ]
)
# Alternatively you can also construct a pipeline by using the + operators:
# pipe = tasks.Ingestion(export_format="markdown") + tasks.Chunking(chunker) + tasks.Classification(
#     task_id="classifier", labels=["science", "politics"], model=model
# )

# 7. Run pipe and output results.
docs = list(pipe(docs))
for doc in docs:
  print(doc.results["classifier"])

# 8. Serialize pipeline and docs.
pipe.dump("pipeline.yml")
with open("docs.pkl", "wb") as f:
  pickle.dump(docs, f)

# 9. Load pipeline and docs from disk. Note: we don't serialize complex third-party objects, so you'll have
#    to pass those in at load time.
loaded_pipe = Pipeline.load(
  "pipeline.yml",
  (
    {"converter": docling.document_converter.DocumentConverter(), "export_format": "markdown"},
    {"chunker": chunker},
    {"model": model},
  ),
)
with open("docs.pkl", "rb") as f:
  loaded_docs = pickle.load(f)
```
</details>

---

### Core Concepts

`sieves` is built on six key abstractions.

#### **`Pipeline`**
Orchestrates task execution with features for.
- Task configuration and sequencing
- Pipeline execution
- Configuration management and serialization

#### **`Doc`**
Represents a document in the pipeline.
- Contains text content and metadata
- Tracks document URI and processing results
- Passes information between pipeline tasks

#### **`Task`**
Encapsulates a single processing step in a pipeline.
- Defines input arguments
- Wraps and initializes `Bridge` instances handling task-engine-specific logic
- Implements task-specific dataset export

#### `GenerationSettings`
Controls behavior of structured generation across tasks.
- Batch size
- Strict mode (whether errors in parsing individual documents should terminate execution)
- Arbitrary arguments passed on to structured generation tool (which one that is depends on the model you specified - Outlines, DSPy, LangChain, ...).

#### **`Engine`** (internals only)
Provides a unified interface to structured generation libraries (internal). You pass a backend model into tasks;
`Engine` is used under the hood.
- Manages model interactions
- Handles prompt execution
- Standardizes output formats

#### **`Bridge`** (internals only)
Connects `Task` with `Engine`.
- Implements engine-specific prompt templates
- Manages output type specifications
- Ensures compatibility between tasks and engine

---

## Frequently Asked Questions

<details>
  <summary><b>Show FAQs</b></summary>

### Why "sieves"?

`sieves` was originally motivated by the want to use generative models for structured information extraction. Coming
from this angle, there are two ways to explain why we settled on this name (pick the one you like better):
- An analogy to [gold panning](https://en.wikipedia.org/wiki/Gold_panning): run your raw data through a sieve to obtain structured, refined "gold."
- An acronym - "sieves" can be read as "Structured Information Extraction and VErification System" (but that's a mouthful).

### Why not just prompt an LLM directly?

Asked differently: what are the benefits of using `sieves` over directly interacting with an LLM?
- Validated, structured data output - also for LLMs that don't offer structured outputs natively.  Zero-/few-shot language models can be finicky without guardrails or parsing.
- A step-by-step pipeline, making it easier to debug and track each stage.
- The flexibility to switch between different models and ways to ensure structured and validated output.

### How do I create models?

Below are minimal examples for creating model objects for each supported structured‑generation tool. Pass these `model` objects directly to tasks, optionally with `GenerationSettings`.

- DSPy

  ```python
  import os
  import dspy

  # Anthropic example (set ANTHROPIC_API_KEY in your environment)
  model = dspy.LM("claude-3-haiku-20240307", api_key=os.environ["ANTHROPIC_API_KEY"])

  # Tip: DSPy can integrate with Ollama and vLLM backends for local model serving.
  # For Ollama, configure api_base and blank api_key:
  # model = dspy.LM("smollm:135m-instruct-v0.2-q8_0", api_base="http://localhost:11434", api_key="")
  # For vLLM, use the OpenAI-compatible server:
  # model = dspy.LM("meta-llama/Llama-3.2-1B-Instruct", api_base="http://localhost:8000/v1", api_key="")
  ```

- GLiNER

  ```python
  import gliner
  model = gliner.GLiNER.from_pretrained("knowledgator/gliner-multitask-v1.0")
  ```

- LangChain

  ```python
  from langchain.chat_models import init_chat_model
  import os

  model = init_chat_model(
      model="claude-3-haiku-20240307",
      api_key=os.environ["ANTHROPIC_API_KEY"],
      model_provider="anthropic",
  )
  ```

- Hugging Face Transformers (zero‑shot classification)

  ```python
  from transformers import pipeline

  model = pipeline(
      "zero-shot-classification",
      model="MoritzLaurer/xtremedistil-l6-h256-zeroshot-v1.1-all-33",
  )
  ```

- Outlines

  ```python
  import outlines
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
  # Outlines supports different backends, also remote ones. We use a local `transformers` model here.
  model = outlines.models.from_transformers(
      AutoModelForCausalLM.from_pretrained(model_name),
      AutoTokenizer.from_pretrained(model_name),
  )
  ```

**Notes**
- Provide provider API keys via environment variables (e.g., `ANTHROPIC_API_KEY`).
- **Local model serving:** DSPy can integrate with Ollama and vLLM for local model serving (see DSPy examples above).
- After you have a `model`, use it in tasks like: `tasks.predictive.Classification(labels=[...], model=model)`.
- A bunch of useful utilities for pre- and post-processing you might need.
- An array of useful tasks you can right of the bat without having to roll your own.
- Look up the respective tool's documentation for more information.

### Why use `sieves` and not a structured generation library, like `outlines`, directly?

Which library makes the most sense to you depends strongly on your use-case. `outlines` provides structured generation
abilities, but not the pipeline system, utilities and pre-built tasks that `sieves` has to offer (and of course not the
flexibility to switch between different structured generation libraries). Then again, maybe you don't need all that -
in which case we recommend using `outlines` (or any other structured generation libray) directly.

Similarly, maybe you already have an existing tech stack in your project that uses exclusively `langchain` or
`dspy`? All of these libraries (and more) are supported by `sieves` - but they are not _just_ structured generation
libraries, they come with a plethora of features that are out of scope for `sieves`. If your application deeply
integrates with a framework like LangChain or DSPy, it may be reasonable to stick to those libraries directly.

As many things in engineering, this is a trade-off. The way we see it: the less tightly coupled your existing
application is with a particular language model framework, the more mileage you'll get out of `sieves`. This means that
it's ideal for prototyping (there's no reason you can't use it in production too, of course).

</details>

---

> Source for `sieves` icon:
> <a href="https://www.flaticon.com/free-icons/sieve" title="sieve icons">Sieve icons created by Freepik - Flaticon</a>.
