Metadata-Version: 2.1
Name: robotsparser
Version: 0.0.9
Summary: Parse robots.txt files and find indexed urls
Home-page: https://github.com/Dvelezs94/robotsparser
Author: Diego Velez
Author-email: diegovelezs94@gmail.com
Keywords: robots.txt,sitemap,crawler,data mining
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Description-Content-Type: text/markdown
License-File: LICENSE


# robotsparser
Python library that parses robots.txt files

## Functionalities

- Automatically discover all sitemap files
- Unzip gziped files
- Fetch all URLs from sitemaps

## Install
```
pip install robotsparser
```

## Usage

```python
from robotsparser.parser import Robotparser

robots_url = "https://www.example.com/robots.txt"
rb = Robotparser(url=robots_url, verbose=True)
# To initiate the crawl of sitemaps and indexed urls. sitemap_crawl_limit argument is optional
rb.read(fetch_sitemap_urls=True, sitemap_url_crawl_limit=5)

# Show information
rb.get_sitemap_indexes() # returns sitemap indexes
rb.get_sitemaps() # returns sitemaps
rb.get_urls() # returns a list of all urls
```

## Multiprocessing usage (crawl in the background)

Crawl in the background and output new entries to file

This is useful for sites where sitemaps are heavily nested and take a long
time to crawl

```python
from robotsparser.parser import Robotparser
import multiprocessing as mp
from time import sleep
from sh import tail

if __name__ == '__main__':
    mp.freeze_support()
    robots_url = "https://www.example.com/robots.txt"
    entries_lof_file="./entries.log"
    rb = Robotparser(url=robots_url, verbose=False, sitemap_entries_file=entries_lof_file)
    sitemap_crawl_proc = mp.Process(target = rb.read, kwargs = {'fetch_sitemap_urls': False})
    sitemap_crawl_proc.start()

    for line in tail("-f", entries_lof_file, _iter=True):
        print(line.replace("\n", ""))
        if not sitemap_crawl_proc.is_alive():
            break
```
