# Idealista Scraper

Production-ready web scraper for Idealista real estate listings with async support, resumable sessions, and MongoDB-compatible output.

## Features

- Async scraping with configurable concurrency
- Multi-key Scrapfly rotation for high-volume scraping
- Resumable sessions with progress checkpoints
- S3 image upload with automatic retries
- MongoDB-compatible JSONL output
- Rich CLI with progress indicators
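
As a rough illustration of what "configurable concurrency" usually means in an async scraper, the sketch below caps the number of in-flight requests with an `asyncio.Semaphore`. It is not the package's actual implementation: `fetch_page`, `scrape_pages`, and `max_concurrency` are hypothetical names, and the network call is simulated with a short sleep.

```python
import asyncio


async def fetch_page(page: int) -> str:
    """Hypothetical stand-in for a real network request."""
    await asyncio.sleep(0.01)
    return f"page-{page}"


async def scrape_pages(pages: list[int], max_concurrency: int = 5) -> list[str]:
    """Fetch pages concurrently with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(page: int) -> str:
        # The semaphore blocks here once max_concurrency fetches are running
        async with semaphore:
            return await fetch_page(page)

    # gather returns results in the same order as the input pages
    return await asyncio.gather(*(bounded_fetch(p) for p in pages))


if __name__ == "__main__":
    print(asyncio.run(scrape_pages([1, 2, 3], max_concurrency=2)))
```

Raising the limit increases throughput until the upstream API starts rate limiting, so it is usually worth tuning per API plan.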

## Installation

```bash
# Clone the repository
git clone https://github.com/antonyngigge/idealistaScraper.git
cd idealistaScraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Verify installation
idealista-scraper --help
```

## Configuration

Create a `.env.local` file in the project root:

```bash
# Scrapfly API keys (supports multiple for rotation)
SCRAPFLY_KEY_1=your_key_here
SCRAPFLY_KEY_2=optional_second_key
# ... up to SCRAPFLY_KEY_15

# Or a single key
SCRAPFLY_KEY=your_key_here

# AWS S3 (optional, for image upload)
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-north-1
S3_BUCKET_NAME=your-bucket-name

# MongoDB (optional, for direct import)
MONGODB_URI=mongodb://localhost:27017
```
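
To make the key layout above concrete, here is a minimal sketch of how numbered keys could be collected and rotated. It assumes the variables are already loaded into the environment, that numbered keys take precedence over `SCRAPFLY_KEY`, and that rotation is simple round-robin. None of this is confirmed by the package; `load_scrapfly_keys` and `rotate` are hypothetical helpers.

```python
import itertools
import os
from collections.abc import Iterator, Mapping


def load_scrapfly_keys(env: Mapping[str, str] = os.environ, max_keys: int = 15) -> list[str]:
    """Collect SCRAPFLY_KEY_1..SCRAPFLY_KEY_<max_keys>, skipping unset or empty entries."""
    names = (f"SCRAPFLY_KEY_{i}" for i in range(1, max_keys + 1))
    keys = [env[name] for name in names if env.get(name)]
    # Assumption: the single-key variable is only a fallback
    if not keys and env.get("SCRAPFLY_KEY"):
        keys = [env["SCRAPFLY_KEY"]]
    return keys


def rotate(keys: list[str]) -> Iterator[str]:
    """Yield keys in round-robin order, forever."""
    return itertools.cycle(keys)


if __name__ == "__main__":
    available = load_scrapfly_keys()
    print(f"{len(available)} Scrapfly key(s) configured")
```

Calling `next()` on the iterator returned by `rotate` before each request spreads usage evenly across keys.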

## Quick Start

```bash
# Scrape rental listings from Madrid (10 pages)
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape from Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Check scraping progress
idealista-scraper status

# Clean data for MongoDB import
idealista-scraper clean-mongo output/rental_properties.jsonl --output cleaned.jsonl

# Upload images to S3
idealista-scraper upload-images --bucket my-bucket
```

## Multi-Country Support

The scraper supports multiple Idealista country domains:

| Country  | Domain                    |
|----------|---------------------------|
| Spain    | https://www.idealista.com |
| Portugal | https://www.idealista.pt  |

### Discovering Available Regions

```bash
# List supported countries
idealista-scraper regions --list-countries

# List regions in Spain (uses API)
idealista-scraper regions --country spain

# List regions in Portugal for sale properties
idealista-scraper regions --country portugal --type sale

# List common regions without API call
idealista-scraper regions --country spain --common
```

### Scraping Different Countries

```bash
# Spain (default)
idealista-scraper scrape listings --location madrid --type rental

# Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Set default country via environment variable
export IDEALISTA_DEFAULT_COUNTRY=portugal
idealista-scraper scrape listings --location porto --type rental
```
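
The country selection shown above can be pictured as a small lookup. This sketch assumes an explicit `--country` value wins over `IDEALISTA_DEFAULT_COUNTRY`, which in turn wins over the Spain default; the domains come from the table above, while `COUNTRY_DOMAINS` and `resolve_base_url` are hypothetical names rather than the package's API.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Domains taken from the table in this README
COUNTRY_DOMAINS = {
    "spain": "https://www.idealista.com",
    "portugal": "https://www.idealista.pt",
}


def resolve_base_url(country: Optional[str] = None, env: Mapping[str, str] = os.environ) -> str:
    """Pick the base URL: explicit country, then the environment default, then Spain."""
    name = (country or env.get("IDEALISTA_DEFAULT_COUNTRY") or "spain").lower()
    try:
        return COUNTRY_DOMAINS[name]
    except KeyError:
        raise ValueError(f"Unsupported country: {name!r}") from None


if __name__ == "__main__":
    print(resolve_base_url())
```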

## CLI Reference

### Scraping Commands

```bash
# Scrape listings
idealista-scraper scrape listings --location <city> --type <rental|sale> --pages <n>
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape property details from URL list
idealista-scraper scrape properties --input urls.txt --output properties.jsonl

# Scrape agent data
idealista-scraper scrape agents --limit 100
```

### Data Processing

```bash
# Transform HTML to JSONL
idealista-scraper transform raw.html --output mongo.jsonl --agents agents.jsonl

# Clean for MongoDB (UUID to ObjectId conversion, BSON fixes)
idealista-scraper clean-mongo properties.jsonl --output cleaned.jsonl
```
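
For readers wondering what a UUID to ObjectId conversion can look like, the following sketch shows one possible approach: a MongoDB ObjectId is 12 bytes, so the first 12 of the UUID's 16 bytes are kept and written in Extended JSON form (`{"$oid": ...}`). This truncation scheme and the `id`/`_id` field names are assumptions for illustration; the `clean-mongo` command may use a different mapping.

```python
import uuid


def uuid_to_oid_hex(value: str) -> str:
    """Map a UUID string to a 24-character hex string (12 bytes)."""
    # Assumption: keep the first 12 of the UUID's 16 bytes
    return uuid.UUID(value).bytes[:12].hex()


def clean_id(record: dict) -> dict:
    """Return a copy of record with a hypothetical 'id' UUID replaced by an '_id' ObjectId."""
    cleaned = {key: val for key, val in record.items() if key != "id"}
    cleaned["_id"] = {"$oid": uuid_to_oid_hex(record["id"])}
    return cleaned


if __name__ == "__main__":
    print(clean_id({"id": str(uuid.uuid4()), "price": 1200}))
```

The mapping is deterministic, so re-running it over the same input produces the same `_id` values, but dropping 4 bytes means it cannot be reversed.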

### Pipeline Automation

```bash
# Interactive mode
idealista-scraper pipeline

# Full pipeline: Scrape -> Transform -> Clean -> Upload
idealista-scraper pipeline --preset full --location barcelona --pages 20

# Quick pipeline: Scrape -> Transform only
idealista-scraper pipeline --preset quick

# Export pipeline: Clean -> Upload (existing data)
idealista-scraper pipeline --preset export --bucket my-bucket
```

### Utilities

```bash
# Check progress and statistics
idealista-scraper status

# Resume interrupted session
idealista-scraper resume

# Estimate credit usage
idealista-scraper estimate --pages 100 --asp    # With ASP (25x credits)
idealista-scraper estimate --pages 100 --no-asp # Without ASP (1x credits)

# Test if ASP is required
idealista-scraper test-asp

# Show configuration info
idealista-scraper info

# Clean output files
idealista-scraper clean --cache     # Cache only
idealista-scraper clean --progress  # Progress files only
idealista-scraper clean --all       # Everything

# Upload images to S3
idealista-scraper upload-images --input image_urls.jsonl --bucket my-bucket
idealista-scraper upload-images --bucket my-bucket --resume  # Resume upload
```
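
The arithmetic behind the `estimate` command can be approximated by hand. This sketch uses the 25x and 1x multipliers quoted above and assumes one request per page at one base credit each; the real command may account for additional requests, so `estimate_credits` is only a hypothetical back-of-the-envelope helper.

```python
ASP_MULTIPLIER = 25  # multiplier quoted in this README for ASP requests


def estimate_credits(pages: int, asp: bool = True) -> int:
    """Estimate credits, assuming one request per page and one base credit per request."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * (ASP_MULTIPLIER if asp else 1)


if __name__ == "__main__":
    print(estimate_credits(100, asp=True))
    print(estimate_credits(100, asp=False))
```

Under these assumptions, 100 pages cost 2500 credits with ASP and 100 without it.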

## Python Library Usage

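```python
import asyncio

from idealista_scraper import (
    PropertyDetailsScraper,
    AgentDetailsScraper,
    HTMLParser,
    S3ImageUploader,
    MongoCleaner,
    clean_html_content,
)


async def main() -> None:
    # Placeholder inputs; replace with your own values
    html_content = "<html>...</html>"
    url = "https://example.com/image.jpg"
    s3_key = "images/example.jpg"

    # Parse HTML content
    parser = HTMLParser()
    data = parser.parse(html_content)

    # Clean data for MongoDB
    cleaner = MongoCleaner()
    cleaned = cleaner.clean_record(data)

    # Upload images (upload_image is awaited, so it must run inside an async function)
    uploader = S3ImageUploader(bucket_name="my-bucket")
    await uploader.upload_image(url, s3_key)


asyncio.run(main())
```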

## Output Files

All output is saved to the `output/` directory:

| File | Description |
|------|-------------|
| `rental_properties.jsonl` | Scraped property data |
| `raw_listings.jsonl` | Raw listing data |
| `image_urls.jsonl` | Property image URLs for S3 upload |
| `agent_properties.jsonl` | Agent data |
| `properties_cleaned.jsonl` | MongoDB-ready cleaned data |
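
Each of these files is in JSONL format, meaning one JSON object per line, so they can be streamed without loading the whole file into memory. A minimal reader, assuming UTF-8 encoding and using a hypothetical `read_jsonl` helper that is not part of the package:

```python
import json
from collections.abc import Iterator
from pathlib import Path


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


if __name__ == "__main__":
    target = Path("output/rental_properties.jsonl")
    if target.exists():
        print(sum(1 for _ in read_jsonl(target)), "records")
```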

## Project Structure

```
idealista_scraper/
├── cli/              # Typer CLI interface
├── scraping/         # Web scraping modules
├── parsing/          # HTML parsing
├── transform/        # Data transformation and MongoDB cleaning
├── client/           # Scrapfly client management
├── cache/            # Caching layers
├── session/          # Session management
├── output/           # File output writers
├── upload/           # S3 upload functionality
└── utils/            # Utilities (paths, config)
```

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/unit/

# Run all tests, including integration tests (requires API keys)
pytest tests/ --integration

# Lint code
ruff check idealista_scraper/

# Type check
mypy idealista_scraper/
```

## Alternative Entry Points

The package can be run in multiple ways:

```bash
# Console script (installed via pip)
idealista-scraper --help

# Short alias
idealista --help

# Module execution
python -m idealista_scraper --help
```

## License

MIT License. See the LICENSE file for details.

## Author

Antony Ngigge
