# ESG Document Classification Pipeline

This project extracts context from sustainability-related PDFs, applies a rule-based filter, classifies the candidates with one or more LLM backends, and dispatches the documents into an organized folder structure. The pipeline can optionally persist rich per-document logs (including the extracted text, the complete LLM request/response payloads, and heuristic scores) for auditing.

## Quick Start

1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
2. Prepare a configuration file (see below) and place documents under `input_docs/`.
3. Run the pipeline:
   ```bash
   python main.py --config config.json --input input_docs --output classified --logs logs
   ```

The command above looks for PDFs under `input_docs/`, classifies them, and moves the files into the `classified` directory. Detected sustainability reports land in `classified/NFD/<Company>/<Year>/filename.pdf`, non-reports in `classified/trash/`, and low-confidence results in `classified/unknown/`. When logging is enabled, JSON logs are written to the directory specified by `--logs` (or `options.logs` in the config).
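
The routing rule described above can be sketched as a small function. This is only an illustration, not the project's actual dispatcher: the `destination` function, its parameters, and the `MIN_CERTAINTY` cutoff are hypothetical, since the real confidence threshold is not stated here.

```python
from pathlib import Path

# Hypothetical cutoff: the actual low-confidence threshold is not documented here.
MIN_CERTAINTY = 0.7


def destination(output_dir, filename, is_report, certainty, company=None, year=None):
    """Return the target path for one classified PDF (illustrative only)."""
    root = Path(output_dir)
    if certainty < MIN_CERTAINTY:
        # Low-confidence results are parked for manual review.
        return root / "unknown" / filename
    if not is_report:
        return root / "trash" / filename
    if not company or not year:
        # Assumption: a report without company/year cannot be filed under NFD.
        return root / "unknown" / filename
    return root / "NFD" / str(company) / str(year) / filename
```

For example, `destination("classified", "report.pdf", True, 0.9, "Acme", 2019)` yields `classified/NFD/Acme/2019/report.pdf`.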

## Configuration (`config.json`)

The `options` block configures pipeline-wide behaviour, while `configs` defines one or more LLM endpoints the classifier can cycle through (with automatic cooldowns on failures).

```json
{
  "options": {
    "write_log": true,
    "logs": "logs",
    "start_pages": 6,
    "end_pages": 3,
    "min_total_pages": 20,
    "min_keywords_hit": 3
  },
  "configs": [
    {
      "name": "primary",
      "api_base": "https://api.your-llm.com/v1",
      "api_key": "sk-primary",
      "model": "gpt-4o-mini",
      "activity": true
    },
    {
      "name": "backup",
      "api_base": "https://api.backup-llm.com/v1",
      "api_key": "sk-backup",
      "model": "gpt-4o-mini",
      "activity": false
    }
  ]
}
```

- `write_log`: toggles writing per-document JSON logs to disk. When `false`, the pipeline only prints high-level console logs.
- `logs`: directory for JSON logs (`--logs` CLI flag overrides this).
- `start_pages` / `end_pages`: number of pages extracted from the beginning and the end of each PDF.
- `min_total_pages`: minimum number of pages a document must have to be considered a candidate (default: 20).
- `min_keywords_hit`: minimum number of sustainability keywords/markers that must be found to be considered a candidate (default: 3).
- `configs[*].activity`: when `true`, files are physically moved; when `false`, the run behaves as a dry run for that endpoint (results are logged but files stay put).
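
To make the interplay of these options concrete, here is a minimal sketch of how they might be read and applied. The helper names (`load_options`, `page_indices`, `is_candidate`) are hypothetical and not taken from the code base. Only the defaults for `min_total_pages` and `min_keywords_hit` are documented above; the remaining defaults simply mirror the example config and are an assumption.

```python
import json

# Only the two thresholds have documented defaults; the rest mirror the example config.
DEFAULTS = {
    "write_log": True,
    "logs": "logs",
    "start_pages": 6,
    "end_pages": 3,
    "min_total_pages": 20,
    "min_keywords_hit": 3,
}


def load_options(config_text):
    """Merge the `options` block of a config JSON string over the defaults."""
    config = json.loads(config_text)
    return {**DEFAULTS, **config.get("options", {})}


def page_indices(total_pages, start_pages, end_pages):
    """Zero-based indices of the first and last pages to extract, without duplicates."""
    head = range(min(start_pages, total_pages))
    tail = range(max(total_pages - end_pages, 0), total_pages)
    return sorted(set(head) | set(tail))


def is_candidate(total_pages, keyword_hits, options):
    """Rule-based pre-filter: long enough and enough keyword hits."""
    return (
        total_pages >= options["min_total_pages"]
        and keyword_hits >= options["min_keywords_hit"]
    )
```

With the example values, a 30-page PDF would contribute pages 0 to 5 and 27 to 29, while a 5-page PDF would contribute all of its pages once and then fail the `min_total_pages` check.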

### Log Contents

When `write_log` is enabled, the pipeline writes one JSON file per document containing:
- Extracted first/last page text and the chunks sent to the model.
- The heuristic language/keyword scores.
- The full chat payload (`messages`) sent to the LLM and the entire raw API response returned (`raw_api_response`).
- Classification verdict (company, year, certainty) and the final destination path selected.

This information can be used to debug misclassifications, audit model usage, or replay tricky documents later.
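
As a starting point for such audits, the sketch below summarizes a directory of logs. The `messages` and `raw_api_response` keys are the ones named above; the `verdict` key, the assumption that each message carries a string `content`, and both function names are hypothetical, so adjust them to the actual log schema.

```python
import json
from pathlib import Path


def summarize_log(log):
    """Pull a few audit-friendly facts out of one per-document log dict."""
    messages = log.get("messages", [])
    return {
        "n_messages": len(messages),
        # Assumes chat-style messages with a string `content` field.
        "prompt_chars": sum(len(m.get("content") or "") for m in messages),
        "has_raw_response": "raw_api_response" in log,
        "verdict": log.get("verdict"),  # hypothetical key name
    }


def summarize_dir(log_dir):
    """Summarize every JSON log in a directory, keyed by file name."""
    return {
        path.name: summarize_log(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(log_dir).glob("*.json"))
    }
```

Calling `summarize_dir("logs")` after a run gives a quick overview of prompt sizes and of which documents are missing a raw response.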

## How It Works

1. **Document discovery:** recursively finds PDF files under the input directory.
2. **Parsing:** pulls configurable numbers of pages from the start and end of every PDF.
3. **Heuristics:** detects language, counts keywords, and flags obvious non-candidates.
4. **Chunking:** collapses the selected pages into up to three overlapping chunks (≈1,800 chars each).
5. **LLM classification:** sends the chunks to the first available LLM config; if a call fails, the pipeline retries with the next config after applying a cooldown to the failing one. The model returns strict JSON with the sustainability verdict, company, year, and certainty.
6. **Dispatch:** depending on the verdict and confidence, the file is moved to `NFD/<Company>/<Year>`, `trash`, or `unknown`. In dry runs (`activity=false`) the file is left in place but its log is still produced (if logging is on).
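
The failover behaviour in step 5 can be sketched as follows. This is one possible shape under stated assumptions, not the project's implementation: the `EndpointPool` class, the 60-second cooldown, and the caller-supplied `call(config, chunks)` function (which stands in for the real HTTP request and raises on failure) are all hypothetical.

```python
import time

COOLDOWN_SECONDS = 60.0  # hypothetical value; the real cooldown is not documented


class EndpointPool:
    """Try LLM configs in order, skipping those that failed recently."""

    def __init__(self, configs, cooldown=COOLDOWN_SECONDS, clock=time.monotonic):
        self.configs = list(configs)
        self.cooldown = cooldown
        self.clock = clock
        self.blocked_until = {}  # config name -> time at which it is usable again

    def available(self):
        """Configs whose cooldown (if any) has expired, in their original order."""
        now = self.clock()
        return [
            c for c in self.configs
            if self.blocked_until.get(c["name"], float("-inf")) <= now
        ]

    def classify(self, chunks, call):
        """Return (config name, result) from the first endpoint that succeeds."""
        errors = []
        for config in self.available():
            try:
                return config["name"], call(config, chunks)
            except Exception as exc:
                # Put the failing endpoint on cooldown and move on to the next one.
                self.blocked_until[config["name"]] = self.clock() + self.cooldown
                errors.append((config["name"], repr(exc)))
        raise RuntimeError(f"all endpoints failed or are cooling down: {errors}")
```

Injecting the clock keeps the cooldown logic testable without sleeping; a production version would likely also distinguish retryable errors (timeouts, rate limits) from permanent ones such as a bad API key.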

## Example Workflow

```bash
# 1. Ensure config.json is filled out (API keys, models, etc.)
# 2. Drop PDFs under input_docs/
python main.py \
  --config config.json \
  --input input_docs \
  --output classified \
  --logs logs \
  --log-level INFO
```

After the run you should see:
- Classified PDFs in `classified/NFD/...`, `classified/trash/`, or `classified/unknown/`.
- Per-document logs like `logs/nfd2019ita_1a2b3c4d.json` with the extracted text, heuristic scores, request payload, and raw LLM response.
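
To check the outcome of a run at a glance, a short script can count the PDFs in each top-level output folder. The `tally` helper is hypothetical and only assumes the folder layout described above.

```python
from collections import Counter
from pathlib import Path


def tally(output_dir):
    """Count PDFs under each top-level folder of the output directory."""
    root = Path(output_dir)
    counts = Counter()
    for pdf in root.rglob("*.pdf"):
        # The first path component below the root is NFD, trash, or unknown.
        counts[pdf.relative_to(root).parts[0]] += 1
    return dict(counts)
```

Running `tally("classified")` returns a dictionary mapping each folder name that contains PDFs (for example `NFD`, `trash`, `unknown`) to its file count.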

## Development Notes

- The project targets Python 3.10+.
- PDF parsing relies on `pypdf` (installable via `pip install -r requirements.txt`).
- The code base is organized under `src/esg_classification/` with clear modules for discovery, parsing, heuristics, LLM interaction, and dispatching.
- Feel free to extend `chunking.py` for alternative chunking strategies or add more sophisticated heuristics.
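
As a reference point for alternative strategies, the sketch below shows one way the "up to three overlapping chunks of about 1,800 characters" behaviour could be written. It is not the contents of `chunking.py`: the `chunk_text` name and the 200-character overlap are assumptions.

```python
def chunk_text(text, size=1800, overlap=200, max_chunks=3):
    """Split whitespace-normalized text into at most `max_chunks` overlapping chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    text = " ".join(text.split())  # collapse newlines and runs of whitespace
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text) and len(chunks) < max_chunks:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Note that this variant silently drops any text beyond the third chunk; a sentence-aware or token-based splitter would be a natural alternative to explore.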

Happy classifying!
