Metadata-Version: 2.1
Name: firstscrap
Version: 0.2.0
Summary: Scraping sites with multithreading, random proxies and user-agents
Home-page: https://github.com/theodor85/first_scrap
Author: Teddy Coder
Author-email: fedor_coder@mail.ru
License: MIT
Description: # First_scrap
        
        https://theodor85.github.io/first_scrap/
        
        - - -
        [English](README.md), [Русский](README-ru.md)
        - - -
        
        First_scrap is a library for multithreaded web scraping with random proxies and user agents.
        
        ## Installation
        
        To get started with the first_scrap library, activate (or create, if necessary) your virtual environment, for example:
        
            python3 -m venv env
            source ./env/bin/activate
        
        To install First_scrap, use the pip package manager:
        
            pip install firstscrap
        
        Alternatively, get the source code from GitHub by running these commands in your console:
        
            git clone http://github.com/theodor85/first_scrap
            cd first_scrap
            python setup.py develop
        
        ## How to use
        
        An example of extracting data from a single web page:
        
        
        ```python
        from firstscrap import pagehandler
        
        @pagehandler(parser="BeautifulSoup")
        def get_data(url, soup=None):
            # only your BeautifulSoup code goes here, without any requests, proxies, etc.
            span = soup.find(name="span", attrs={"class": "p-nickname vcard-username d-block"})
            text = span.get_text().strip()
            return text
        
        if __name__ == '__main__':
            print(get_data('https://github.com/theodor85'))
        
            # output:
            # theodor85
        ```
        
        ## What's under the hood
        
        When extracting data from a single page:
        
        1. A random proxy server and a random user agent are selected from lists stored in a file.
        2. This proxy and user agent are used to request the target page.
        3. The data is extracted from the page with BeautifulSoup.
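        
        The first two steps can be sketched as follows. This is only an illustration and not the library's actual code: the function name `pick_request_settings` and the sample proxy and user-agent values are hypothetical, and the lists are kept in memory here instead of being read from a file.
        
        ```python
        import random
        
        # Hypothetical sample lists; the library reads its own lists from a file.
        PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]
        USER_AGENTS = [
            "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
        ]
        
        
        def pick_request_settings(proxies, user_agents, rng=random):
            """Choose one proxy and one user agent at random (steps 1 and 2)."""
            proxy = rng.choice(proxies)
            user_agent = rng.choice(user_agents)
            return {
                "proxies": {"http": f"http://{proxy}", "https": f"http://{proxy}"},
                "headers": {"User-Agent": user_agent},
            }
        
        
        settings = pick_request_settings(PROXIES, USER_AGENTS)
        # With requests and BeautifulSoup installed, step 3 would be roughly:
        #   response = requests.get(url, timeout=10, **settings)
        #   soup = BeautifulSoup(response.text, "html.parser")
        ```
        
        The returned dictionary uses the `proxies` and `headers` keyword names that `requests.get` accepts, so it can be unpacked straight into the call.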
        
        ## The most interesting part: processing many pages with the same structure
        
        Here is the example:
        
        ```python
        from firstscrap import listhandler
        
        TEST_URLLIST_OLX = [
            'https://www.olx.ua/obyavlenie/spetsialist-po-podklyucheniyu-interneta-IDGnCkB.html',
            'https://www.olx.ua/obyavlenie/menedzher-po-robot-s-klentami-IDGkGK6.html',
        ]
        
        @listhandler(threads_limit=5, parser='BeautifulSoup')
        def get_date_time_from_olx(urllist, soup=None):
            ''' Beautifulsoup code for one page '''
            em = soup.find('em')
            raw_text = em.get_text().strip()
            return raw_text
        
        if __name__ == '__main__':
            data = get_date_time_from_olx(TEST_URLLIST_OLX)
            for item in data:
                print(item)
        # output:
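        # Добавлено: в 16:49, 26 декабря 2019, Номер объявления: 626235005
        #   (English: "Added: at 16:49, 26 December 2019, Ad number: 626235005")
        # Добавлено: в 16:18, 29 декабря 2019, Номер объявления: 625536978
        #   (English: "Added: at 16:18, 29 December 2019, Ad number: 625536978")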
        
        ```
        
        ## What's under the hood
        
        The program processes each page in a separate thread, and the number of threads running at the same time does not exceed `threads_limit`.
        
        Every thread makes its request using a random proxy and user agent.
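        
        One common way to cap the number of simultaneous threads is a thread pool. The sketch below uses the standard `concurrent.futures.ThreadPoolExecutor`. It is an assumption for illustration, not necessarily how first_scrap implements `threads_limit`: a pool reuses a fixed set of worker threads rather than starting one thread per page. The names `process_all` and `fake_handler` are hypothetical.
        
        ```python
        from concurrent.futures import ThreadPoolExecutor
        
        
        def process_all(urls, handle_page, threads_limit=5):
            """Apply handle_page to every URL using at most threads_limit threads."""
            with ThreadPoolExecutor(max_workers=threads_limit) as pool:
                # map() yields results in the same order as the input URLs.
                return list(pool.map(handle_page, urls))
        
        
        def fake_handler(url):
            # Hypothetical stand-in for "fetch the page and parse it".
            return url.rsplit("/", 1)[-1]
        
        
        results = process_all(
            ["https://example.com/a.html", "https://example.com/b.html"],
            fake_handler,
            threads_limit=2,
        )
        print(results)  # ['a.html', 'b.html']
        ```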
        
        ## Running the tests
        
        To run the tests, type in your console:
        
            python -m unittest -v tests/tests.py
        
        Before running the tests, ensure that your internet connection is active.
        
        ## Contributing
        
        To contribute, please merge your code into the 'develop' branch.
        
        Forks and pull requests are welcome! If you like first_scrap, do not forget to put a star!
        
        ## Bug reports
        
        To report a bug, please email fedor_coder@mail.ru with the tag "first_scrap bug reporting".
        
        ## License
        
        This project is licensed under the MIT License - see the [LICENSE.txt](LICENSE.txt) file for details.
        
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
