Metadata-Version: 2.1
Name: bhfutils
Version: 0.0.71
Summary: Utilities used by the spiders of the Behoof project
Home-page: https://behoof.app/
Author: Teplygin Vladimir
Author-email: vvteplygin@gmail.com
License: MIT
Description: ******************************
        Behoof Scrapy Cluster Template
        ******************************
        
        Overview
        --------
        
        The ``bhfutils`` package is a collection of utilities used by the spiders of the Behoof project.
        
        Requirements
        ------------
        
        - Unix based machine (Linux or OS X)
        - Python 2.7 or 3.6
        
        Installation
        ------------
        
        Inside a virtualenv, run ``pip install -U bhfutils``.  This installs the latest version of the Behoof Scrapy Cluster spider utilities.  You can then use a ``settings.py`` compatible with Scrapy Cluster (a template is provided in ``crawler/setting_template.py``).
        
        Documentation
        -------------
        
        Full documentation for the ``bhfutils`` package does not yet exist; the sections below summarize each module.
        
        custom_cookies.py
        ==================
        
        The ``custom_cookies`` module is a custom cookies middleware that passes the required cookies along with each request but does not persist them between calls.
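        The idea can be sketched as follows (the class and attribute names here are illustrative assumptions, not the actual ``bhfutils`` API): cookies supplied with the request are attached to the outgoing header, and ``Set-Cookie`` responses are deliberately ignored so no state carries over.
        
        ```python
        # Illustrative sketch of a non-persisting cookies middleware.
        # Class name and the use of request.meta are assumptions, not
        # the actual bhfutils implementation.
        class StatelessCookiesMiddleware:
            """Attach per-request cookies to the Cookie header without
            storing anything between calls."""
        
            def process_request(self, request, spider):
                cookies = request.meta.get("cookies")
                if cookies:
                    header = "; ".join(f"{k}={v}" for k, v in cookies.items())
                    request.headers["Cookie"] = header
                return None  # let processing continue
        
            def process_response(self, request, response, spider):
                # Ignore any Set-Cookie in the response: nothing persists.
                return response
        ```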
        
        distributed_scheduler.py
        ========================
        
        The ``distributed_scheduler`` module is a Scrapy request scheduler that uses Redis throttled priority queues to moderate scrape requests to different domains across a distributed Scrapy cluster.
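        A minimal in-memory sketch of the throttled priority-queue idea (the real module keeps this state in Redis so the whole cluster shares it; all names below are illustrative): each domain gets its own priority heap, and a per-domain timestamp enforces a minimum delay between pops.
        
        ```python
        import heapq
        
        # In-memory sketch of a throttled per-domain priority queue.
        # The real distributed_scheduler stores queues in Redis; this
        # only illustrates the throttling logic.
        class ThrottledPriorityQueue:
            def __init__(self, delay_seconds):
                self.delay = delay_seconds
                self.queues = {}        # domain -> heap of (-priority, seq, request)
                self.next_allowed = {}  # domain -> earliest time it may be popped
                self.seq = 0            # tie-breaker preserving insertion order
        
            def push(self, domain, priority, request):
                heap = self.queues.setdefault(domain, [])
                heapq.heappush(heap, (-priority, self.seq, request))
                self.seq += 1
        
            def pop(self, now):
                """Return the highest-priority request from any domain whose
                throttle window has elapsed, or None if all are throttled."""
                for domain, heap in self.queues.items():
                    if heap and now >= self.next_allowed.get(domain, 0.0):
                        self.next_allowed[domain] = now + self.delay
                        return heapq.heappop(heap)[2]
                return None
        ```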
        
        redis_domain_max_page_filter.py
        ===============================
        
        The ``redis_domain_max_page_filter`` module is a Redis-based max-page filter applied per domain: it bounds the maximum number of pages crawled for any particular domain.
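        Conceptually this is a counter per domain checked against a budget, as in the sketch below (the real module keeps the counters in Redis, typically via an atomic ``INCR``; the class and method names are assumptions):
        
        ```python
        # In-memory sketch of a per-domain max-page filter.
        # The real module uses Redis so the count is shared cluster-wide.
        class DomainMaxPageFilter:
            def __init__(self, max_pages):
                self.max_pages = max_pages
                self.counts = {}
        
            def request_allowed(self, domain):
                """Count one page against the domain's budget and report
                whether the domain may still be crawled."""
                count = self.counts.get(domain, 0) + 1
                self.counts[domain] = count
                return count <= self.max_pages
        ```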
        
        redis_dupefilter.py
        ===================
        
        The ``redis_dupefilter`` module is a Redis-based request duplication filter.
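        The usual pattern is to fingerprint each request and reject any fingerprint already seen, roughly as sketched here (the real module stores fingerprints in a Redis set shared by the cluster; the fingerprint scheme below is illustrative, not the one ``bhfutils`` uses):
        
        ```python
        import hashlib
        
        # In-memory sketch of a request dupefilter.
        class DupeFilter:
            def __init__(self):
                self.seen = set()
        
            @staticmethod
            def fingerprint(method, url):
                # Hash method and URL into a stable, compact key.
                return hashlib.sha1(f"{method} {url}".encode()).hexdigest()
        
            def request_seen(self, method, url):
                """Return True if this request was already scheduled,
                recording it otherwise."""
                fp = self.fingerprint(method, url)
                if fp in self.seen:
                    return True
                self.seen.add(fp)
                return False
        ```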
        
        redis_global_page_per_domain_filter.py
        ======================================
        
        The ``redis_global_page_per_domain_filter`` module is a Redis-based request-count filter. When this filter is enabled, every crawl job is subject to ``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` as a hard limit on the number of pages it may crawl for each individual spiderid+domain+crawlid combination.
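        The distinctive part is the composite key: the counter is kept per spiderid+domain+crawlid rather than per domain alone, as in this sketch (again an in-memory stand-in for the Redis counters, with assumed names):
        
        ```python
        # In-memory sketch of the global per (spiderid, domain, crawlid)
        # page limit; the real module keeps these counters in Redis.
        class GlobalPagePerDomainFilter:
            def __init__(self, limit):
                self.limit = limit  # plays the role of GLOBAL_PAGE_PER_DOMAIN_LIMIT
                self.counts = {}
        
            def request_allowed(self, spiderid, domain, crawlid):
                key = (spiderid, domain, crawlid)
                count = self.counts.get(key, 0) + 1
                self.counts[key] = count
                return count <= self.limit
        ```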
        
Keywords: behoof,scrapy-cluster,utilities
Platform: UNKNOWN
Description-Content-Type: text/x-rst
Provides-Extra: test
Provides-Extra: all
Provides-Extra: docs
Provides-Extra: lint
