Metadata-Version: 2.4
Name: scrapeMM
Version: 0.1.3
Summary: LLM-friendly scraper for media and text from social media and the open web.
Author-email: Mark Rothermel <mark.rothermel@tu-darmstadt.de>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/multimodal-ai-lab/scrapeMM
Project-URL: Issues, https://github.com/multimodal-ai-lab/scrapeMM
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ezmm
Requires-Dist: telethon
Requires-Dist: tweepy
Requires-Dist: markdownify
Requires-Dist: keyring
Requires-Dist: platformdirs
Requires-Dist: PyYAML
Dynamic: license-file

# scrapeMM: Multimodal Web Retrieval
A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library helps developers and researchers access multimodal data from the web and use it for LLM processing.

## Usage
```python
from scrapemm import retrieve

url = "https://example.com"
result = retrieve(url)
result.render()
```
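
Since retrieval is described as asynchronous, `retrieve()` may return an awaitable rather than the finished result, depending on the installed version. The following is a minimal sketch that handles both cases. The helpers `resolve` and `retrieve_blocking` are hypothetical names introduced here for illustration and are not part of scrapeMM.

```python
import asyncio
import inspect


def resolve(value):
    """Return `value` as-is, or run it to completion if it is awaitable."""
    if inspect.isawaitable(value):
        async def _wait():
            return await value
        # asyncio.run() starts a fresh event loop; it raises RuntimeError
        # when called from inside an already running loop (e.g. a notebook).
        return asyncio.run(_wait())
    return value


def retrieve_blocking(url):
    """Hypothetical wrapper: call scrapemm.retrieve() and wait for the result."""
    # Imported lazily so that resolve() can be used without scrapeMM installed.
    from scrapemm import retrieve
    return resolve(retrieve(url))
```

Inside code that already runs in an event loop, awaiting the returned object directly is the more appropriate choice than calling `asyncio.run()`.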

## How it works
```
Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence
```
`MultimodalSequence` is a class provided by the [ezMM](https://github.com/multimodal-ai-lab/ezmm) library. It represents a sequence of Markdown-formatted text interleaved with media.
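
To illustrate the idea of a sequence that interleaves Markdown text with media, here is a toy sketch. It does not use or reproduce ezMM's API: `MediaRef` and `to_markdown` are hypothetical stand-ins that only show the general shape of such a structure.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class MediaRef:
    """Hypothetical placeholder for an image or video (not an ezMM class)."""
    kind: str    # "image" or "video"
    source: str  # URL or file path


Item = Union[str, MediaRef]


def to_markdown(items: List[Item]) -> str:
    """Flatten interleaved text and media items into one Markdown string."""
    parts = []
    for item in items:
        if isinstance(item, MediaRef):
            if item.kind == "image":
                # Images use Markdown's inline image syntax.
                parts.append(f"![image]({item.source})")
            else:
                # Other media are emitted as plain links.
                parts.append(f"[{item.kind}]({item.source})")
        else:
            parts.append(item)
    return "\n\n".join(parts)


sequence = [
    "# Example post",
    MediaRef("image", "https://example.com/a.png"),
    "Caption text.",
]
print(to_markdown(sequence))
```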

Web scraping is done with [Firecrawl](https://github.com/mendableai/firecrawl).

## Supported Proprietary APIs
- ✅ X/Twitter
- ✅ Telegram
- ⏳ Facebook
- ⏳ Instagram
- ⏳ Threads
- ⏳ TikTok
