Metadata-Version: 2.4
Name: idealista-scraper
Version: 1.0.0
Summary: Production web scraper for Idealista real estate listings
Author-email: Antony Ngigge <antonyngigge@iworldafric.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/antonyngigge/idealistaScraper
Project-URL: Repository, https://github.com/antonyngigge/idealistaScraper
Keywords: scraper,idealista,real-estate,web-scraping,async
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: aiofiles>=23.2.1
Requires-Dist: aiohttp>=3.9.1
Requires-Dist: beautifulsoup4>=4.12.2
Requires-Dist: lxml>=5.3.0
Requires-Dist: loguru>=0.7.2
Requires-Dist: numpy>=1.26.2
Requires-Dist: opencv-python>=4.8.1.78
Requires-Dist: Pillow>=10.1.0
Requires-Dist: piexif>=1.1.3
Requires-Dist: psutil>=5.9.6
Requires-Dist: pymongo>=4.6.1
Requires-Dist: scrapfly-sdk>=0.8.5
Requires-Dist: typer[all]>=0.12.5
Requires-Dist: rich>=13.7.1
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: backoff>=2.2.1
Provides-Extra: dev
Requires-Dist: pytest>=8.3.3; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: license-file

# Idealista Scraper

Production-ready web scraper for Idealista real estate listings with async support, resumable sessions, and MongoDB-compatible output.

## Features

- Async scraping with configurable concurrency
- Multi-key Scrapfly rotation for high-volume scraping
- Resumable sessions with progress checkpoints
- S3 image upload with automatic retries
- MongoDB-compatible JSONL output
- Rich CLI with progress indicators

## Installation

```bash
# Clone the repository
git clone https://github.com/antonyngigge/idealistaScraper.git
cd idealistaScraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Verify installation
idealista-scraper --help
```

## Configuration

Create a `.env.local` file in the project root:

```bash
# Scrapfly API keys (supports multiple for rotation)
SCRAPFLY_KEY_1=your_key_here
SCRAPFLY_KEY_2=optional_second_key
# ... up to SCRAPFLY_KEY_15

# Or single key
SCRAPFLY_KEY=your_key_here

# AWS S3 (optional, for image upload)
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-north-1
S3_BUCKET_NAME=your-bucket-name

# MongoDB (optional, for direct import)
MONGODB_URI=mongodb://localhost:27017
```
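Assuming the scraper loads these variables with `python-dotenv` (a declared dependency), collecting the numbered keys for rotation might look like the sketch below. `load_scrapfly_keys` is a hypothetical helper for illustration, not part of the package's API:

```python
import os

def load_scrapfly_keys(max_keys: int = 15) -> list[str]:
    """Collect SCRAPFLY_KEY_1..SCRAPFLY_KEY_15, falling back to SCRAPFLY_KEY."""
    keys = [
        v for i in range(1, max_keys + 1)
        if (v := os.environ.get(f"SCRAPFLY_KEY_{i}"))
    ]
    if not keys and (single := os.environ.get("SCRAPFLY_KEY")):
        keys.append(single)
    return keys

# Example: with two numbered keys set, rotation has both available
os.environ["SCRAPFLY_KEY_1"] = "key-a"
os.environ["SCRAPFLY_KEY_2"] = "key-b"
print(load_scrapfly_keys())  # ['key-a', 'key-b']
```

Numbered keys take precedence over the single `SCRAPFLY_KEY` in this sketch; the package's actual precedence may differ.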

## Quick Start

```bash
# Scrape rental listings from Madrid (10 pages)
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape from Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Check scraping progress
idealista-scraper status

# Clean data for MongoDB import
idealista-scraper clean-mongo output/rental_properties.jsonl --output cleaned.jsonl

# Upload images to S3
idealista-scraper upload-images --bucket my-bucket
```

## Multi-Country Support

The scraper supports multiple Idealista country domains:

| Country  | Domain                    |
|----------|---------------------------|
| Spain    | https://www.idealista.com |
| Portugal | https://www.idealista.pt  |
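One plausible way to model the country switch is a domain mapping that mirrors the table above. This is illustrative only, not the package's internals, and the URL path segments are placeholders (real Idealista paths are locale-specific, e.g. `alquiler-viviendas` on the `.com` site):

```python
# Country -> Idealista domain, mirroring the supported-countries table
IDEALISTA_DOMAINS = {
    "spain": "https://www.idealista.com",
    "portugal": "https://www.idealista.pt",
}

def listings_url(country: str, location: str, listing_type: str) -> str:
    """Illustrative URL builder; real Idealista paths are locale-specific."""
    return f"{IDEALISTA_DOMAINS[country]}/{listing_type}/{location}/"

print(listings_url("spain", "madrid", "rental"))
# https://www.idealista.com/rental/madrid/
```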

### Discovering Available Regions

```bash
# List supported countries
idealista-scraper regions --list-countries

# List regions in Spain (uses API)
idealista-scraper regions --country spain

# List regions in Portugal for sale properties
idealista-scraper regions --country portugal --type sale

# List common regions without API call
idealista-scraper regions --country spain --common
```

### Scraping Different Countries

```bash
# Spain (default)
idealista-scraper scrape listings --location madrid --type rental

# Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Set default country via environment variable
export IDEALISTA_DEFAULT_COUNTRY=portugal
idealista-scraper scrape listings --location porto --type rental
```

## CLI Reference

### Scraping Commands

```bash
# Scrape listings
idealista-scraper scrape listings --location <city> --type <rental|sale> --pages <n>
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape property details from URL list
idealista-scraper scrape properties --input urls.txt --output properties.jsonl

# Scrape agent data
idealista-scraper scrape agents --limit 100
```
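For `scrape properties --input urls.txt`, the input is presumably a plain-text file with one property URL per line; the exact format and the listing IDs shown here are assumptions for illustration:

```
https://www.idealista.com/inmueble/12345678/
https://www.idealista.com/inmueble/87654321/
```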

### Data Processing

```bash
# Transform HTML to JSONL
idealista-scraper transform raw.html --output mongo.jsonl --agents agents.jsonl

# Clean for MongoDB (UUID to ObjectId, BSON fixes)
idealista-scraper clean-mongo properties.jsonl --output cleaned.jsonl
```
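A MongoDB ObjectId is 12 bytes, rendered as 24 hex characters, while a UUID is 16 bytes, so the conversion needs a deterministic bridge. One common approach is to hash the UUID and keep the first 12 bytes; the sketch below shows that idea using only the standard library. This illustrates the concept, not `MongoCleaner`'s actual algorithm:

```python
import hashlib
import uuid

def uuid_to_objectid_hex(u: str) -> str:
    """Derive a stable 24-hex-char (12-byte) ObjectId-style id from a UUID.

    Hashing keeps the mapping deterministic, so re-running the cleaner
    on the same record yields the same id.
    """
    canonical = str(uuid.UUID(u))  # validates and normalizes the UUID
    digest = hashlib.sha256(canonical.encode()).digest()
    return digest[:12].hex()

oid = uuid_to_objectid_hex("123e4567-e89b-12d3-a456-426614174000")
print(len(oid))  # 24 -> usable as bson.ObjectId(oid)
```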

### Pipeline Automation

```bash
# Interactive mode
idealista-scraper pipeline

# Full pipeline: Scrape -> Transform -> Clean -> Upload
idealista-scraper pipeline --preset full --location barcelona --pages 20

# Quick pipeline: Scrape -> Transform only
idealista-scraper pipeline --preset quick

# Export pipeline: Clean -> Upload (existing data)
idealista-scraper pipeline --preset export --bucket my-bucket
```

### Utilities

```bash
# Check progress and statistics
idealista-scraper status

# Resume interrupted session
idealista-scraper resume

# Estimate credit usage
idealista-scraper estimate --pages 100 --asp    # With ASP (25x credits)
idealista-scraper estimate --pages 100 --no-asp # Without ASP (1x credits)

# Test if ASP is required
idealista-scraper test-asp

# Show configuration info
idealista-scraper info

# Clean output files
idealista-scraper clean --cache     # Cache only
idealista-scraper clean --progress  # Progress files only
idealista-scraper clean --all       # Everything

# Upload images to S3
idealista-scraper upload-images --input image_urls.jsonl --bucket my-bucket
idealista-scraper upload-images --bucket my-bucket --resume  # Resume upload
```
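The credit math behind `estimate` can be worked through directly: the comments above state ASP costs 25x credits, so 100 pages cost 2,500 credits with ASP versus 100 without. The per-request base cost of 1 credit is an assumption here; real Scrapfly pricing may differ:

```python
ASP_MULTIPLIER = 25  # from the CLI comment: "With ASP (25x credits)"
BASE_COST = 1        # assumed credits per request without ASP

def estimate_credits(pages: int, asp: bool) -> int:
    """Estimated credit cost for a page count, with or without ASP."""
    return pages * BASE_COST * (ASP_MULTIPLIER if asp else 1)

print(estimate_credits(100, asp=True))   # 2500
print(estimate_credits(100, asp=False))  # 100
```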

## Python Library Usage

```python
from idealista_scraper import (
    PropertyDetailsScraper,
    AgentDetailsScraper,
    HTMLParser,
    S3ImageUploader,
    MongoCleaner,
    clean_html_content,
)

# Parse HTML content
parser = HTMLParser()
data = parser.parse(html_content)

# Clean data for MongoDB
cleaner = MongoCleaner()
cleaned = cleaner.clean_record(data)

# Upload images (await inside an async function, or via asyncio.run)
uploader = S3ImageUploader(bucket_name="my-bucket")
await uploader.upload_image(url, s3_key)
```

## Output Files

All output is saved to the `output/` directory:

| File | Description |
|------|-------------|
| `rental_properties.jsonl` | Scraped property data |
| `raw_listings.jsonl` | Raw listing data |
| `image_urls.jsonl` | Property image URLs for S3 upload |
| `agent_properties.jsonl` | Agent data |
| `properties_cleaned.jsonl` | MongoDB-ready cleaned data |
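JSONL stores one JSON object per line, so any of these output files can be streamed with the standard library alone. The field names in this sketch are illustrative, not the scraper's actual schema:

```python
import json
import tempfile
from pathlib import Path

# Tiny fixture standing in for output/rental_properties.jsonl
sample = [
    {"title": "Piso en Madrid", "price": 1200},
    {"title": "Atico en Lisboa", "price": 950},
]
path = Path(tempfile.mkdtemp()) / "rental_properties.jsonl"
path.write_text("\n".join(json.dumps(r) for r in sample), encoding="utf-8")

# Read back one record per non-empty line (the JSONL contract)
records = [
    json.loads(line)
    for line in path.read_text(encoding="utf-8").splitlines()
    if line.strip()
]
print(len(records), records[0]["title"])  # 2 Piso en Madrid
```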

## Project Structure

```
idealista_scraper/
├── cli/              # Typer CLI interface
├── scraping/         # Web scraping modules
├── parsing/          # HTML parsing
├── transform/        # Data transformation and MongoDB cleaning
├── client/           # Scrapfly client management
├── cache/            # Caching layers
├── session/          # Session management
├── output/           # File output writers
├── upload/           # S3 upload functionality
└── utils/            # Utilities (paths, config)
```

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/unit/

# Run with integration tests (requires API keys)
pytest tests/ --integration

# Lint code
ruff check idealista_scraper/

# Type check
mypy idealista_scraper/
```

## Alternative Entry Points

The package can be run in multiple ways:

```bash
# Console script (installed via pip)
idealista-scraper --help

# Short alias
idealista --help

# Module execution
python -m idealista_scraper --help
```

## License

MIT License - see LICENSE file for details.

## Author

Antony Ngigge
