Metadata-Version: 2.1
Name: Scrapy-Redis-BloomFilter
Version: 0.8.1
Summary: Bloom Filter Support for Scrapy-Redis
Home-page: https://github.com/Python3WebSpider/ScrapyRedisBloomFilter
Author: Germey
Author-email: cqc@cuiqingcai.com
License: MIT
Description: 
        # Scrapy-Redis-BloomFilter
        
        This package adds Bloom filter support to Scrapy-Redis for request deduplication.
        
        ## Installation
        
        You can easily install this package with pip:
        
        ```
        pip install scrapy-redis-bloomfilter
        ```
        
        Dependency:
        
        * Scrapy-Redis >= 0.6.8
        
        ## Usage
        
        Add these settings to `settings.py`:
        
        ```python
        # Use the scheduler provided by this package
        SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"
        
        # Make all spiders share the same duplicate filter through Redis
        DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"
        
        # Redis URL
        REDIS_URL = 'redis://localhost:6379'
        
        # Number of hash functions to use, defaults to 6
        BLOOMFILTER_HASH_NUMBER = 6
        
        # Size of the Bloom filter as a power of two, in bits: 30 means 2^30 bits = 128 MB
        # of Redis memory. Defaults to 30; the value 10 set here means 2^10 = 1024 bits.
        BLOOMFILTER_BIT = 10
        
        # Keep the Redis queues after the spider closes, so crawls can be paused and resumed
        SCHEDULER_PERSIST = True
        ```
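        
        To get a feel for how `BLOOMFILTER_BIT` and `BLOOMFILTER_HASH_NUMBER` interact, the sketch below estimates the memory use and the false-positive rate of a Bloom filter. It relies on the standard textbook approximation, not on code from this package, and the helper names `bloom_memory_bytes` and `false_positive_rate` are hypothetical.
        
        ```python
        import math
        
        
        def bloom_memory_bytes(bit):
            """Bytes needed for a bit array of 2**bit bits (hypothetical helper)."""
            return (1 << bit) // 8
        
        
        def false_positive_rate(bit, hash_number, items):
            """Textbook approximation p = (1 - e^(-k*n/m))^k for a Bloom filter.
        
            bit         -- filter size as a power of two (m = 2**bit bits)
            hash_number -- number of hash functions k
            items       -- number of distinct items n already inserted
            """
            m = 1 << bit
            return (1.0 - math.exp(-hash_number * items / m)) ** hash_number
        
        
        # Default settings: 2^30 bits and 6 hash functions.
        print(bloom_memory_bytes(30) // (1024 * 1024), "MB")
        # Estimated false-positive rate after 100 million distinct requests.
        print(false_positive_rate(30, 6, 10 ** 8))
        ```
        
        A smaller `BLOOMFILTER_BIT` saves Redis memory but raises the chance that a new request is wrongly treated as a duplicate, so the value should be chosen with the expected number of requests in mind.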
        
        ## Test
        
        The repository includes a test project. To run it:
        
        ```
        git clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git
        cd ScrapyRedisBloomFilter/test
        scrapy crawl test
        ```
        
        Note: change `REDIS_URL` in `settings.py` to point to your own Redis server before running the test.
        
        The spider looks like this:
        
        ```python
        from scrapy import Request, Spider
        
        class TestSpider(Spider):
            name = 'test'
            base_url = 'https://www.baidu.com/s?wd='
            
            def start_requests(self):
                for i in range(10):
                    url = self.base_url + str(i)
                    yield Request(url, callback=self.parse)
                    
                # The first 10 of these 100 requests duplicate the ones above
                for i in range(100):
                    url = self.base_url + str(i)
                    yield Request(url, callback=self.parse)
            
            def parse(self, response):
                self.logger.debug('Response of ' + response.url)
        ```
        
        The crawl stats look like this:
        
        ```python
        {'bloomfilter/filtered': 10,  # Number of requests filtered by the Bloom filter
         'downloader/request_bytes': 34021,
         'downloader/request_count': 100,
         'downloader/request_method_count/GET': 100,
         'downloader/response_bytes': 72943,
         'downloader/response_count': 100,
         'downloader/response_status_count/200': 100,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),
         'log_count/DEBUG': 202,
         'log_count/INFO': 7,
         'memusage/max': 54153216,
         'memusage/startup': 54153216,
         'response_received_count': 100,
         'scheduler/dequeued/redis': 100,
         'scheduler/enqueued/redis': 100,
         'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}
        ```
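        
        The `bloomfilter/filtered` count above can be reproduced with a small in-memory sketch. This is an illustration only, not the package's Redis-backed implementation: the class `SimpleBloomFilter`, the function `count_filtered`, and the SHA-1 based hashing are assumptions made for the example, and it deduplicates on the raw URL rather than on a Scrapy request fingerprint.
        
        ```python
        import hashlib
        
        
        class SimpleBloomFilter:
            """In-memory Bloom filter (hypothetical; the package stores its bits in Redis)."""
        
            def __init__(self, bit=20, hash_number=6):
                self.m = 1 << bit                   # total number of bits
                self.k = hash_number                # number of hash functions
                self.bits = bytearray(self.m // 8)  # bit array packed into bytes
        
            def _offsets(self, value):
                # Derive k bit offsets by salting SHA-1 with a seed (an assumption).
                for seed in range(self.k):
                    data = "{}:{}".format(seed, value).encode("utf-8")
                    digest = hashlib.sha1(data).digest()
                    yield int.from_bytes(digest[:8], "big") % self.m
        
            def exists(self, value):
                # The value is "probably seen" only if every one of its bits is set.
                return all(self.bits[o >> 3] & (1 << (o & 7)) for o in self._offsets(value))
        
            def insert(self, value):
                for o in self._offsets(value):
                    self.bits[o >> 3] |= 1 << (o & 7)
        
        
        def count_filtered(urls, bloom):
            """Count URLs rejected as duplicates, inserting the unseen ones."""
            filtered = 0
            for url in urls:
                if bloom.exists(url):
                    filtered += 1
                else:
                    bloom.insert(url)
            return filtered
        
        
        # Same request pattern as the test spider: 10 URLs, then 100 URLs that repeat them.
        base_url = "https://www.baidu.com/s?wd="
        urls = [base_url + str(i) for i in range(10)] + [base_url + str(i) for i in range(100)]
        filtered = count_filtered(urls, SimpleBloomFilter())
        print(filtered, len(urls) - filtered)
        ```
        
        Of the 110 requests the spider yields, the 10 repeated URLs are filtered and the remaining 100 are scheduled, which matches `'scheduler/enqueued/redis': 100` in the stats.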
        
        
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.5.0
Description-Content-Type: text/markdown
