Metadata-Version: 2.1
Name: os-rq-scrapy
Version: 0.0.6
Summary: Scrapy for Request Queue
Home-page: https://github.com/cfhamlet/os-rq-scrapy
Author: Ozzy
Author-email: cfhamlet@gmail.com
License: MIT License
Description: # os-rq-scrapy
        
        [![Build Status](https://www.travis-ci.org/cfhamlet/os-rq-scrapy.svg?branch=master)](https://www.travis-ci.org/cfhamlet/os-rq-scrapy)
        [![codecov](https://codecov.io/gh/cfhamlet/os-rq-scrapy/branch/master/graph/badge.svg)](https://codecov.io/gh/cfhamlet/os-rq-scrapy)
        [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/os-rq-scrapy.svg)](https://pypi.python.org/pypi/os-rq-scrapy)
        [![PyPI](https://img.shields.io/pypi/v/os-rq-scrapy.svg)](https://pypi.python.org/pypi/os-rq-scrapy)
        
        
        A framework that lets [Scrapy](https://github.com/scrapy/scrapy) work with [os-rq-pod](https://github.com/cfhamlet/os-rq-pod) and [os-rq-hub](https://github.com/cfhamlet/os-rq-hub) to build a ["broad crawl"](https://docs.scrapy.org/en/latest/topics/broad-crawls.html) system.
        
        As you know, Scrapy is a very popular Python crawler framework. It is well suited to "focused crawls": start from several URLs of specific sites, fetch HTML, extract and save "structured data", and follow links that match given patterns to crawl recursively. But for large-scale, long-running crawling, especially "broad crawls", Scrapy alone falls short. Basically, you have to decouple the whole crawling system into several sub-systems: a high-performance, full-featured distributed fetcher, a task scheduler, an HTML extractor, a link database, data storage, proxies and many auxiliary components. It becomes even more complex when your system is multi-tenant.
        
        The os-rq-scrapy and [os-rq-pod](https://github.com/cfhamlet/os-rq-pod) projects are basic components for building a "broad crawl" system. The core concepts are very simple: os-rq-pod is a multi-site request queue with an HTTP API to receive requests; os-rq-scrapy is the fetcher, which gets requests from os-rq-pod and crawls multiple sites concurrently. [os-rq-hub](https://github.com/cfhamlet/os-rq-hub) can also be used to connect multiple pod and scrapy instances so that they work simultaneously.
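        
        The idea of a multi-site request queue can be illustrated with a toy, in-memory sketch. This is not how os-rq-pod is implemented and it does not use its HTTP API; the class `ToyRequestQueue` and its methods are hypothetical names chosen only to show one FIFO queue per site being drained in turn, so that no single site monopolizes the fetcher.
        
        ```python
        from collections import defaultdict, deque
        from urllib.parse import urlsplit
        
        
        class ToyRequestQueue:
            """In-memory stand-in for a multi-site request queue (hypothetical)."""
        
            def __init__(self):
                # One FIFO queue per site, keyed by the URL's host.
                self._queues = defaultdict(deque)
        
            def push(self, url):
                host = urlsplit(url).netloc
                self._queues[host].append(url)
        
            def pop_round_robin(self):
                """Yield queued URLs, taking one from each site in turn."""
                while self._queues:
                    for host in list(self._queues):
                        queue = self._queues[host]
                        yield queue.popleft()
                        if not queue:
                            # Drop exhausted sites so the outer loop can end.
                            del self._queues[host]
        
        
        rq = ToyRequestQueue()
        for url in ("http://a.example/1", "http://a.example/2", "http://b.example/1"):
            rq.push(url)
        
        # Requests from different sites are interleaved.
        order = list(rq.pop_round_robin())
        ```
        
        In a real deployment the queue would live in a separate service and the fetcher would pull requests over the network; the sketch only shows the per-site queueing idea.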
        
        
        ## Requirements
        
        * Python 3.6+ (pypy3.6+)
        * [Scrapy](https://github.com/scrapy/scrapy) 2.0
        
        Extra requirements:
        
        * [ujson](https://github.com/ultrajson/ultrajson), for JSON performance
        
        ## Install
        
        ```
        pip install os-rq-scrapy
        ```
        
        ## Usage
        
        ### Command line
        
        The ``rq-scrapy`` command enhances the basic ``scrapy`` command. When ``RQ_API`` is configured, the ``crawl`` subcommand runs in rq mode and gets its requests from rq.
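        
        A minimal configuration sketch, assuming ``RQ_API`` is read from the project's Scrapy settings like any other setting and holds the base URL of the request queue's HTTP API. The address below is a placeholder for illustration, not a documented default.
        
        ```python
        # settings.py (fragment)
        # Assumption: RQ_API is the base URL of the request queue's HTTP API.
        # The host, port and path are placeholders, not documented values.
        RQ_API = "http://localhost:6789/api/"
        ```
        
        With such a setting in place, running the ``crawl`` subcommand through ``rq-scrapy`` would be expected to use rq mode; without it, the command should behave like plain ``scrapy``.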
        
        ## Unit Tests
        
        ```
        tox
        ```
        
        ## License
        
        MIT licensed.
        
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Provides-Extra: ujson
