Metadata-Version: 2.1
Name: pdf-text-extractor
Version: 0.2.3
Summary: Extract text and images from PDF files
Home-page: https://github.com/baxromov/pdf_to_text
Author: Shahzod
Author-email: baxromov.shahzodbek@gmail.com
Project-URL: Documentation, https://pdf-to-text.readthedocs.io/en/latest/
Project-URL: Source Code, https://github.com/baxromov/pdf_to_text
Project-URL: Issue Tracker, https://github.com/baxromov/pdf_to_text/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: opencv-python
Requires-Dist: pillow
Requires-Dist: PyMuPDF
Requires-Dist: pytesseract
Requires-Dist: ollama

# PDFTextExtractor

A Python utility for extracting text and images from PDF files. The extracted text includes content from PDF pages and
OCR-processed text from images embedded in the PDF. Results are returned as a combined list of dictionaries, preserving
the order of appearance.

---

## Features

- Extract text directly from PDF pages.
- Extract and OCR-process images embedded in PDFs.
- Return results in a combined, ordered list of text and image content.
- Preprocess images to improve OCR accuracy.

---

## Requirements

### Python

- **Python Version**: 3.12 or higher

```shell
pip install pdf-text-extractor
```

### Tesseract OCR

- **Tesseract Installation**: Install Tesseract OCR and ensure it is accessible via the system’s PATH.
  Follow the Tesseract Installation Guide for details.

---

## Usage

### Import and Initialize

```python
from pdf_text_extractor import PDFTextExtractor

# Provide the PDF file path and image directory
pdf_path = "example.pdf"
image_dir = "output_images"

# Initialize the extractor
extractor = PDFTextExtractor(pdf_path, image_dir)

```

### Process PDF and Extract Content

```python

# Extract text and image content
results = extractor.process_and_extract_text()

# Display extracted content
for item in results:
    if "text" in item:
        print("PDF Text:", item["text"])
    elif "image_text" in item:
        print("Image Text:", item["image_text"])

```

### Text and Image Extraction with LLM

OCR-processed text can optionally be refined with a large language model (LLM) run through Ollama. This is intended to improve the accuracy and readability of text extracted from images embedded in the PDF.

```python
# Extract text and image content with LLM refinement for image-based text
results = extractor.process_and_extract_text(use_llm_for_image_text=True)
```

---

## Output Format

The method `process_and_extract_text()` returns a list of dictionaries. Each dictionary contains either a `text` key
(content from a PDF page) or an `image_text` key (OCR-processed text from an embedded image).

### Example Output

```json

[
  {
    "text": "This is text from the first page of the PDF."
  },
  {
    "image_text": "Text extracted from an image on the first page."
  },
  {
    "text": "Another page of the PDF with textual content."
  },
  {
    "image_text": "Additional image-based text extracted."
  }
]
```
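
A minimal sketch of how the returned list might be consumed, assuming only the format shown above. The helper names `split_results` and `to_plain_text` are hypothetical and are not part of the package.

```python
def split_results(results):
    """Separate page text from OCR text, keeping order within each group."""
    page_text = [item["text"] for item in results if "text" in item]
    image_text = [item["image_text"] for item in results if "image_text" in item]
    return page_text, image_text


def to_plain_text(results, separator="\n\n"):
    """Join all non-empty entries into one string, in order of appearance."""
    parts = []
    for item in results:
        # Per the format above, an entry holds either "text" or "image_text".
        value = item.get("text") or item.get("image_text") or ""
        if value.strip():
            parts.append(value.strip())
    return separator.join(parts)


# Stand-in for the list returned by process_and_extract_text()
sample = [
    {"text": "This is text from the first page of the PDF."},
    {"image_text": "Text extracted from an image on the first page."},
]

pages, images = split_results(sample)
document = to_plain_text(sample)
```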

---

## How It Works

### Text Extraction

- Text from PDF pages is extracted using **PyMuPDF**.

### Image Extraction

- Embedded images are extracted and saved to the specified directory.
- Images are preprocessed before OCR.

### Image Preprocessing

- **Convert to Grayscale**: Converts the image to grayscale.
- **Enhance Contrast**: Increases contrast to make text stand out.
- **Binarization**: Uses Otsu’s thresholding to create a binary image.
- **Denoising**: Applies Gaussian blur to reduce noise.

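To illustrate the binarization step, here is a dependency-free sketch of Otsu's method on a flat list of 8-bit grayscale values. It is not the package's code; the function names are hypothetical. In practice, OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag computes the same kind of threshold in one call.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for 8-bit grayscale values (0-255)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    background_count = 0
    background_sum = 0

    for level in range(256):
        background_count += histogram[level]
        background_sum += level * histogram[level]
        if background_count == 0:
            continue  # no pixels at or below this level yet
        foreground_count = total - background_count
        if foreground_count == 0:
            break  # every pixel is already in the background class
        mean_background = background_sum / background_count
        mean_foreground = (weighted_total - background_sum) / foreground_count
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        variance = (
            background_count
            * foreground_count
            * (mean_background - mean_foreground) ** 2
        )
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold):
    """Map values above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]


# Synthetic example: dark text pixels on a light background
gray = [10] * 50 + [200] * 50
threshold = otsu_threshold(gray)
binary = binarize(gray, threshold)
```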
### OCR

- Preprocessed images are processed with **Tesseract OCR** to extract text.

---

## Error Handling

- If an image fails to process, an empty `image_text` value is added to the results.
- **Example**:
```json
{
  "image_text": ""
}

```
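
Because failed images appear as empty `image_text` entries, a caller can locate them after extraction. The sketch below assumes only the behavior described above; `find_failed_images` is a hypothetical helper, and an empty value may not distinguish a processing failure from an image that contains no text.

```python
def find_failed_images(results):
    """Return the list positions of image entries with no OCR text."""
    return [
        index
        for index, item in enumerate(results)
        if "image_text" in item and not item["image_text"].strip()
    ]


# Stand-in for the list returned by process_and_extract_text()
sample = [
    {"text": "Page one."},
    {"image_text": ""},
    {"image_text": "Readable caption."},
]

failed = find_failed_images(sample)
```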
---
## Methods

### `__init__(pdf_path, image_dir)`

**Parameters**:
- `pdf_path` (str): Path to the input PDF file.
- `image_dir` (str): Directory to save extracted images.

---

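### `process_and_extract_text()`

**Description**: Processes the PDF to extract page text and OCR text from embedded images.

**Parameters**:
- `use_llm_for_image_text` (bool, optional): When `True`, refines OCR-processed image text with an LLM.

**Returns**:
- A list of dictionaries, in order of appearance, each containing either `text` or `image_text`.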

---
## Contribution

Contributions are welcome! If you have suggestions or improvements, please open an issue or submit a pull request.
