Metadata-Version: 2.2
Name: spiderforce4ai
Version: 0.1.6
Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
Home-page: https://petertam.pro
Author: Piotr Tamulewicz
Author-email: Piotr Tamulewicz <pt@petertam.pro>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: asyncio>=3.4.3
Requires-Dist: rich>=10.0.0
Requires-Dist: aiofiles>=0.8.0
Requires-Dist: httpx>=0.24.0
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# SpiderForce4AI Python Wrapper

A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.

## Installation

```bash
pip install spiderforce4ai
```

## Quick Start (Minimal Setup)

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

# Initialize with your SpiderForce4AI service URL
spider = SpiderForce4AI("http://localhost:3004")

# Use the default configuration (saves output to ./spiderforce_reports)
config = CrawlConfig()

# Crawl a single URL
result = spider.crawl_url("https://example.com", config)
```

## Crawling Methods

### 1. Single URL Crawling

```python
# Synchronous
result = spider.crawl_url("https://example.com", config)

# Asynchronous
async def crawl():
    result = await spider.crawl_url_async("https://example.com", config)
```

### 2. Multiple URLs Crawling

```python
# List of URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

# Synchronous
results = spider.crawl_urls(urls, config)

# Asynchronous
async def crawl():
    results = await spider.crawl_urls_async(urls, config)

# Parallel (using multiprocessing)
results = spider.crawl_urls_parallel(urls, config)
```

### 3. Sitemap Crawling

```python
# Synchronous
results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)

# Asynchronous
async def crawl():
    results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)

# Parallel (using multiprocessing)
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
```

## Configuration Options

All configuration options are optional with sensible defaults:

```python
config = CrawlConfig(
    # Content Selection (all optional)
    target_selector="article",              # Specific element to target
    remove_selectors=[                      # Elements to remove
        ".ads", 
        "#popup",
        ".navigation",
        ".footer"
    ],
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
    
    # Processing Settings
    max_concurrent_requests=1,              # Default: 1; increase for parallel crawling
    request_delay=0.5,                     # Delay between requests in seconds
    timeout=30,                            # Request timeout in seconds
    
    # Output Settings
    output_dir="custom_output",            # Default: "spiderforce_reports"
    report_file="custom_report.json",      # Default: "crawl_report.json"
    webhook_url="https://your-webhook.com", # Optional webhook endpoint
    webhook_timeout=10                      # Webhook timeout in seconds
)
```

## Real-World Examples

### 1. Basic Website Crawling

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path

spider = SpiderForce4AI("http://localhost:3004")
config = CrawlConfig(
    output_dir=Path("blog_content")
)

result = spider.crawl_url("https://example.com/blog", config)
print(f"Content saved to: {result.url}.md")
```

### 2. Advanced Parallel Sitemap Crawling

```python
config = CrawlConfig(
    max_concurrent_requests=5,
    output_dir=Path("website_content"),
    remove_selectors=[
        ".navigation",
        ".footer",
        ".ads",
        "#cookie-notice"
    ],
    webhook_url="https://your-webhook.com/endpoint"
)

results = spider.crawl_sitemap_parallel(
    "https://example.com/sitemap.xml",
    config
)
```

### 3. Async Crawling with Progress

```python
import asyncio

async def main():
    config = CrawlConfig(
        max_concurrent_requests=3,
        request_delay=1.0
    )
    
    async with spider:
        results = await spider.crawl_urls_async([
            "https://example.com/1",
            "https://example.com/2",
            "https://example.com/3"
        ], config)
    
    return results

results = asyncio.run(main())
```

## Output Structure

### 1. File Organization
```
output_dir/
├── example-com-page1.md
├── example-com-page2.md
└── crawl_report.json
```

### 2. Markdown Files
Each markdown file is named using a slugified version of the URL and contains the converted content.
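The exact slugification scheme isn't documented here; a rough sketch that reproduces filenames like `example-com-page1.md` (an illustration, not the package's actual implementation) might look like:

```python
import re
from urllib.parse import urlparse

def slugify_url(url: str) -> str:
    # Join host and path, then collapse every non-alphanumeric run into a dash
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    return re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-").lower()

print(slugify_url("https://example.com/page1"))  # example-com-page1
```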

### 3. Report JSON Structure
```json
{
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"],
    "remove_selectors_regex": ["modal-\\d+"]
  },
  "results": {
    "successful": [
      {
        "url": "https://example.com/page1",
        "status": "success",
        "markdown": "# Page Title\n\nContent...",
        "timestamp": "2025-02-15T10:30:00.123456"
      }
    ],
    "failed": [
      {
        "url": "https://example.com/page2",
        "status": "failed",
        "error": "HTTP 404: Not Found",
        "timestamp": "2025-02-15T10:30:01.123456"
      }
    ]
  },
  "summary": {
    "total": 2,
    "successful": 1,
    "failed": 1
  }
}
```
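Since the report is plain JSON, it is easy to post-process. A minimal sketch that reads the report from the default location and prints a summary (field names follow the structure shown above):

```python
import json
from pathlib import Path

def summarize(report: dict) -> str:
    # Uses the "summary" block of the crawl report
    s = report["summary"]
    return f"{s['successful']}/{s['total']} pages converted"

if __name__ == "__main__":
    # Default output location per the configuration section
    report_path = Path("spiderforce_reports") / "crawl_report.json"
    if report_path.exists():
        report = json.loads(report_path.read_text())
        print(summarize(report))
        for failure in report["results"]["failed"]:
            print(f"failed: {failure['url']} ({failure['error']})")
```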

### 4. Webhook Notifications
If a `webhook_url` is configured, the endpoint receives a real-time JSON notification for each processed URL:
```json
{
  "url": "https://example.com/page1",
  "status": "success",
  "markdown": "# Page Title\n\nContent...",
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"]
  }
}
```
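On the receiving side, any HTTP endpoint that accepts POSTed JSON will do. A minimal sketch using only the standard library (the port and the payload handling are assumptions; the field names mirror the payload above):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_notification(payload: dict) -> str:
    # Field names follow the webhook payload shown above
    return f"{payload['status']}: {payload['url']}"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body, then acknowledge with 200 OK
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(format_notification(payload))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), WebhookHandler).serve_forever()
```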

## Error Handling

The package handles various types of errors:
- Network errors
- Timeout errors
- Invalid URLs
- Missing content
- Service errors

All errors are:
1. Logged to the console
2. Included in the JSON report
3. Sent via webhook (if configured)
4. Available in the results list

## Requirements

- Python 3.11 or later
- Running SpiderForce4AI service
- Internet connection

## License

MIT License

## Credits

Created by [Peter Tam](https://petertam.pro)
