Metadata-Version: 2.3
Name: mineru-flow
Version: 1.0.0a3
Summary: A MinerU tool for enhancing your RAG workflow
License: Apache 2.0
Author: shenguanlin
Author-email: shenguanlin@pjlab.org.cn
Requires-Python: >=3.11,<3.13
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: aiofiles (>=24.1.0,<25.0.0)
Requires-Dist: aiohttp (>=3.12.15,<4.0.0)
Requires-Dist: alembic (>=1.15.2,<2.0.0)
Requires-Dist: appdirs (>=1.4.4,<2.0.0)
Requires-Dist: beautifulsoup4 (>=4.14.2,<5.0.0)
Requires-Dist: boto3 (>=1.38.14,<2.0.0)
Requires-Dist: cryptography (>=44.0.3,<45.0.0)
Requires-Dist: fastapi (>=0.115.12,<0.116.0)
Requires-Dist: httpx (>=0.28.1,<0.29.0)
Requires-Dist: langchain-text-splitters (>=0.3.11,<0.4.0)
Requires-Dist: loguru (>=0.6.0,<0.7.0)
Requires-Dist: orjson (>=3.10.18,<4.0.0)
Requires-Dist: psutil (>=6.1.0,<7.0.0)
Requires-Dist: pydantic (>=2.11.4,<3.0.0)
Requires-Dist: pydantic-settings (>=2.9.1,<3.0.0)
Requires-Dist: python-jose (>=3.4.0,<4.0.0)
Requires-Dist: python-magic (>=0.4.27,<0.5.0)
Requires-Dist: python-multipart (>=0.0.20,<0.0.21)
Requires-Dist: ragflow-sdk (>=0.21.0,<0.22.0)
Requires-Dist: ruff (>=0.13.2,<0.14.0)
Requires-Dist: setuptools (>=80.9.0,<81.0.0)
Requires-Dist: sqlalchemy (>=2.0.40,<3.0.0)
Requires-Dist: typer[all] (>=0.15.3,<0.16.0)
Requires-Dist: uvicorn (>=0.34.2,<0.35.0)
Project-URL: Repository, https://github.com/OpenDataLab/mineru-flow
Description-Content-Type: text/markdown

# MinerU Flow

<div align="center">
  <article style="display: flex; flex-direction: column; align-items: center; justify-content: center;">
      <p align="center">
        <img width="100" src="./frontend/public/favicon.png" />
      </p>
      <p align="center">
          English | <a href="./README_zh-CN.md">简体中文</a>
      </p>
  </article>
</div>

MinerU Flow is a document processing tool built around MinerU's document understanding capabilities. It helps you:

- Manage MinerU parsing configurations (SaaS or self-hosted deployments).
- Ingest documents from local directories, HTTP, or S3-compatible object storage.
- Run multi-phase jobs (parsing → chunking → knowledge base import) with retries and status monitoring.
- Inspect job progress, system information, and artifacts in a visual dashboard.

## Installation

```bash
pip install mineru-flow
mineru-flow
```

Using conda:

```bash
conda create -n mineru-flow python=3.11
conda activate mineru-flow
pip install mineru-flow
mineru-flow
```

## Local Development

The backend REST APIs are exposed under `/api/v1`, metadata is stored in SQLite by default, and job artifacts live under the user data directory (for example `~/Library/Application Support/mineru-flow` on macOS when no data directory is specified). The `mineru-flow` CLI launches both the HTTP service and the worker system in a single process.

- **Project structure**
  - `mineru_flow/`: FastAPI app, business logic, storage adapters, worker management.
  - `frontend/`: Vite + React single-page app (TanStack Router, Radix UI, Tailwind CSS).
  - `tests/`: Backend Pytest suites.
  - `mineru_flow/internal/processor/`: Phase implementations for parsing, chunking, and knowledge base import.

### Backend dependencies

- Python 3.11 or 3.12 (the package requires `>=3.11,<3.13`)
- Poetry ≥ 1.8 (creates a virtual environment automatically)
- GCC or Clang toolchain for packages with native components; `python-magic` also needs the `libmagic` system library at runtime
- Optional: Docker for containerized deployment support

### Frontend dependencies

- Node.js 20+ (or Bun 1.1+)
- Any package manager (`npm`, `pnpm`, `bun`; examples use `npm`)

### Optional external services

- S3-compatible object storage (MinIO, Amazon S3, etc.) for remote file ingestion.
- An existing MinerU deployment (SaaS API key or self-hosted service URL).

## Startup & Configuration

### Backend (FastAPI + worker)

```bash
poetry install
poetry run mineru-flow --host 0.0.0.0 --port 8001 --open
```

This command will:

1. Apply database migrations (SQLite file is created under the app data directory).
2. Start the HTTP API server on the configured host and port.
3. Launch the asynchronous worker manager that polls for jobs.
4. Optionally open the default browser when `--open` is provided.

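Once the backend is running, one way to confirm that the API and worker manager came up is to poll the worker status endpoint. The sketch below uses only the standard library. It assumes that `/api/v1/system/worker` answers an unauthenticated GET with a 2xx status, which this README does not confirm, and the helper names `worker_status_url` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def worker_status_url(host: str = "127.0.0.1", port: int = 8001) -> str:
    """Build the worker status URL under the /api/v1 prefix."""
    return f"http://{host}:{port}/api/v1/system/worker"


def wait_until_ready(url: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it returns a 2xx response or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or an HTTP error status: keep waiting.
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    url = worker_status_url()
    print("ready" if wait_until_ready(url) else f"no response from {url}")
```

If the endpoint requires authentication in your deployment, the loop keeps retrying until the timeout; in that case point it at any route you know to be public.
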
You can also start the application without the CLI by running:

```bash
poetry run python -m mineru_flow.main --host 127.0.0.1 --port 8001
```

### Frontend (Vite React dashboard)

```bash
cd frontend
npm install
npm run dev -- --port 3000
```

The Vite dev server proxies API requests (under `/api/v1` by default) to the backend. For production, build and serve the static assets:

```bash
npm run build
npm run serve
```

If you prefer Bun:

```bash
bun install
bun run dev
```

### Environment configuration

Set environment variables before starting the backend (e.g. in a `.env` file or via the shell). Key variables include:

| Variable | Default | Description |
| --- | --- | --- |
| `HOST` | `0.0.0.0` | HTTP bind address. |
| `PORT` | `8001` | HTTP port. |
| `DATABASE_URL` | `sqlite:///<data_dir>/mineru_flow.sqlite` | Override to use PostgreSQL/MySQL if desired. |
| `LOG_LEVEL` | `INFO` | Log level for backend and workers. |
| `LOG_JSON` | `False` | Enable JSON-structured logs. |
| `LOG_FILE` | `None` | Path to an additional log file. |
| `WORKER_CONCURRENCY` | `4` | Number of concurrent worker coroutines. |
| `WORKER_POLLING_INTERVAL_MS` | `5000` | Polling interval for new jobs. |
| `WORKER_MAX_RETRY_ATTEMPTS` | `3` | Automatic retry limit per job phase. |

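The table above can be mirrored in a few lines of Python, for example to print the effective values before launching. This is a standard-library sketch, not the project's actual settings code (the dependency list suggests `pydantic-settings` handles that). `BackendSettings` and `load_settings` are hypothetical names, the boolean parsing rule is an assumption, and `DATABASE_URL` is left out because its default depends on the data directory.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackendSettings:
    # Defaults copied from the table above.
    host: str = "0.0.0.0"
    port: int = 8001
    log_level: str = "INFO"
    log_json: bool = False
    log_file: Optional[str] = None
    worker_concurrency: int = 4
    worker_polling_interval_ms: int = 5000
    worker_max_retry_attempts: int = 3


def _as_bool(raw: str) -> bool:
    # Assumed convention: common truthy strings, case-insensitive.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> BackendSettings:
    """Read the documented variables from `env`, falling back to defaults."""
    d = BackendSettings()
    return BackendSettings(
        host=env.get("HOST", d.host),
        port=int(env.get("PORT", d.port)),
        log_level=env.get("LOG_LEVEL", d.log_level),
        log_json=_as_bool(env["LOG_JSON"]) if "LOG_JSON" in env else d.log_json,
        log_file=env.get("LOG_FILE") or None,
        worker_concurrency=int(env.get("WORKER_CONCURRENCY", d.worker_concurrency)),
        worker_polling_interval_ms=int(
            env.get("WORKER_POLLING_INTERVAL_MS", d.worker_polling_interval_ms)
        ),
        worker_max_retry_attempts=int(
            env.get("WORKER_MAX_RETRY_ATTEMPTS", d.worker_max_retry_attempts)
        ),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes it easy to check an override without touching the shell.
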
Frontend-specific values use the `VITE_` prefix (see `frontend/src/env.ts`). Create a `.env` or `.env.local` file under `frontend/` if you need to override defaults, for example:

```bash
VITE_APP_TITLE="MinerU Flow"
VITE_API_BASE_URL="http://localhost:8001/api/v1"
```

### Docker

Build and run the all-in-one container (serves both API and static UI):

```bash
docker build -t mineru-flow .
docker run --rm -p 8000:8000 \
  -e HOST=0.0.0.0 \
  -e PORT=8000 \
  -v "$(pwd)/media:/app/media" \
  mineru-flow
```

The image defaults `BASE_DATA_DIR` to `/app/media`, so mounting that path preserves the SQLite
database, uploaded files, and job artifacts across restarts. Override it by supplying a different
`BASE_DATA_DIR` (or `MINERU_FLOW_DATA_DIR`) if you prefer another mount point.

## Additional Notes

- **Common commands**
  - Backend tests: `poetry run pytest`
  - Backend static analysis: `poetry run ruff check`
  - Frontend tests: `cd frontend && npm run test`
  - Frontend formatting / linting (run from `frontend/`): `npm run format`, `npm run lint`, `npm run check`

- **Database migrations**
  - Migrations run automatically when the app starts. To trigger them manually, call `mineru_flow.alembic.run_migrate.run_db_migrations()`.

- **Processing pipeline extensions**
  - Each phase inherits from `BasePhaseProcessor` and is registered in `mineru_flow/internal/processor/registry.py`. Add new processors or replace existing ones as needed.
  - MinerU parsing strategies, chunking logic, and knowledge-base targets can be configured through `/api/v1/configs` or the frontend UI.
  - Artifacts are stored under `<data_dir>/media/artifacts/<task_id>/<phase>/` for debugging.

- **Debugging tips**
  1. Start the backend and worker with `poetry run mineru-flow --open`.
  2. Launch the frontend dev server in another terminal: `cd frontend && npm run dev`.
  3. Configure MinerU, S3, and knowledge base settings under **System Settings** before creating tasks.
  4. Track phase progress and logs in the task detail page; `/api/v1/system/worker` exposes worker status.
  5. Inspect logs (`LOG_FILE` if configured) and artifact directories for intermediate results when diagnosing failures.

- **Further development ideas**
  - Swap out the database for an alternative that suits your deployment.
  - Create custom processors to add new workflow stages or override defaults.
  - Reuse or extend frontend components under `frontend/src/components` to build additional UI.

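When diagnosing a failed phase, it can help to list what was written to the artifact directory described above. The following is a minimal sketch that follows the documented layout `<data_dir>/media/artifacts/<task_id>/<phase>/`. The helper names are hypothetical, and the phase directory names used in the example (`"parsing"`) and the task id are assumptions; check your own data directory for the actual values.

```python
from pathlib import Path
from typing import List, Union


def artifact_dir(data_dir: Union[str, Path], task_id: Union[str, int], phase: str) -> Path:
    """Return <data_dir>/media/artifacts/<task_id>/<phase>."""
    return Path(data_dir) / "media" / "artifacts" / str(task_id) / phase


def list_artifacts(data_dir: Union[str, Path], task_id: Union[str, int], phase: str) -> List[Path]:
    """List every file under the phase directory, or [] if it does not exist."""
    root = artifact_dir(data_dir, task_id, phase)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Example data directory from this README (macOS default); task id is made up.
    data_dir = Path("~/Library/Application Support/mineru-flow").expanduser()
    for path in list_artifacts(data_dir, 42, "parsing"):
        print(path)
```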
