Metadata-Version: 1.2
Name: courlan
Version: 0.5.0
Summary: Clean, filter, normalize, and sample URLs
Home-page: http://github.com/adbar/courlan
Author: Adrien Barbaresi
Author-email: barbaresi@bbaw.de
License: GPLv3+
Project-URL: Source, https://github.com/adbar/courlan
Project-URL: Coverage, https://codecov.io/github/adbar/courlan
Project-URL: Tracker, https://github.com/adbar/courlan/issues
Description: coURLan: Clean, filter, normalize, and sample URLs
        ==================================================
        
        
        .. image:: https://img.shields.io/pypi/v/courlan.svg
            :target: https://pypi.python.org/pypi/courlan
            :alt: Python package
        
        .. image:: https://img.shields.io/pypi/pyversions/courlan.svg
            :target: https://pypi.python.org/pypi/courlan
            :alt: Python versions
        
        .. image:: https://img.shields.io/codecov/c/github/adbar/courlan.svg
            :target: https://codecov.io/gh/adbar/courlan
            :alt: Code Coverage
        
        
        
        Why coURLan?
        ------------
        
        Avoid wasting bandwidth and processing time on webpages which are probably not worth the effort. This library provides an additional brain for web crawling, scraping and management of Internet archives. Specific functionality for crawlers: stay away from pages with little text content or target synoptic pages explicitly to gather links.
        
        This navigation help targets text-based documents (i.e. currently web pages expected to be in HTML format) and tries to guess the language of pages to allow for language-focused collection. Additional functions include straightforward domain name extraction and URL sampling.
        
        
        Features
        --------
        
        Separate `the wheat from the chaff <https://en.wiktionary.org/wiki/separate_the_wheat_from_the_chaff>`_ and optimize crawls by focusing on non-spam HTML pages containing primarily text. Most helpers revolve around the ``strict`` and ``language`` arguments:
        
        - Heuristics for triage of links
           - Targeting spam and unsuitable content-types
           - Language-aware filtering
           - Crawl management
        - URL handling
           - Validation
           - Canonicalization/Normalization
           - Sampling
        - Command-line interface (CLI) and Python tool
        
        
        **Let the coURLan fish out juicy bits for you!**
        
        .. image:: courlan_harns-march.jpg
            :alt: Courlan 
            :align: center
            :width: 65%
            :target: https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg
        
        Here is a `courlan <https://en.wiktionary.org/wiki/courlan>`_ (source: `Limpkin at Harn's Marsh by Russ <https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg>`_, CC BY 2.0).
        
        
        
        Installation
        ------------
        
        This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.5 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can be installed with the Python package managers ``pip`` and ``pipenv``:
        
        .. code-block:: bash
        
            $ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
            $ pip install --upgrade courlan # to make sure you have the latest version
            $ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
        
        
        Python
        ------
        
        check_url()
        ~~~~~~~~~~~
        
        All useful operations chained in ``check_url(url)``:
        
        .. code-block:: python
        
            >>> from courlan import check_url
            # returns url and domain name
            >>> check_url('https://github.com/adbar/courlan')
            ('https://github.com/adbar/courlan', 'github.com')
            # noisy query parameters can be removed
            >>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
            ('https://httpbin.org/redirect-to', 'httpbin.org')
            # Check for redirects (HEAD request)
            >>> url, domain_name = check_url(my_url, with_redirects=True)
        
        
        Language-aware heuristics, notably internationalization in URLs, are available in ``lang_filter(url, language)`` and through the ``language`` argument of ``check_url()``:
        
        .. code-block:: python
        
            # optional argument targeting webpages in English or German
            >>> url = 'https://www.un.org/en/about-us'
            # success: returns clean URL and domain name
            >>> check_url(url, language='en')
            ('https://www.un.org/en/about-us', 'un.org')
            # failure: doesn't return anything
            >>> check_url(url, language='de')
            >>>
            # optional argument: strict
            >>> url = 'https://en.wikipedia.org/'
            >>> check_url(url, language='de', strict=False)
            ('https://en.wikipedia.org', 'wikipedia.org')
            >>> check_url(url, language='de', strict=True)
            >>>
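        
        To give an idea of what a URL-based language heuristic can look like, here is a minimal sketch using only the standard library. It is not courlan's implementation: ``path_language_hint`` and ``naive_lang_filter`` are hypothetical helpers, and treating any two-letter path segment as a language code is a simplifying assumption.
        
        .. code-block:: python
        
            import re
            from urllib.parse import urlsplit
        
            # Assumption: a path segment made of exactly two letters is a language code.
            LANG_SEGMENT = re.compile(r"^[a-z]{2}$")
        
        
            def path_language_hint(url):
                """Return the first two-letter path segment, or None (hypothetical helper)."""
                for segment in urlsplit(url).path.lower().split("/"):
                    if LANG_SEGMENT.match(segment):
                        return segment
                return None
        
        
            def naive_lang_filter(url, language):
                """Accept URLs without a hint or whose hint matches the target language."""
                hint = path_language_hint(url)
                return hint is None or hint == language
        
        
            url = "https://www.un.org/en/about-us"
            accepted = naive_lang_filter(url, "en")
            rejected = naive_lang_filter(url, "de")
        
        Such a check only looks at the path, so a URL like ``https://en.wikipedia.org/`` passes it for any target language; a stricter mode would also have to inspect the host name.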
        
        
        Define stricter restrictions on the expected content type with ``strict=True``. This also blocks certain platforms and page types that crawlers should stay away from unless they target them explicitly, as well as other black holes where machines get lost.
        
        .. code-block:: python
        
            # strict filtering
            >>> check_url('https://www.twitch.com/', strict=True)
            # blocked as it is a major platform
        
        
        
        Sampling by domain name
        ~~~~~~~~~~~~~~~~~~~~~~~
        
        
        .. code-block:: python
        
            >>> from courlan import sample_urls
            >>> my_sample = sample_urls(my_urls, 100)
            # optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
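        
        The idea behind sampling by domain name can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not the code behind ``sample_urls``: the helper ``sample_by_host`` and its ``seed`` parameter are hypothetical, and URLs are grouped by full host name rather than by registered domain.
        
        .. code-block:: python
        
            import random
            from collections import defaultdict
            from urllib.parse import urlsplit
        
        
            def sample_by_host(urls, samplesize, seed=None):
                """Keep at most `samplesize` URLs per host (hypothetical helper)."""
                rng = random.Random(seed)
                by_host = defaultdict(list)
                for url in urls:
                    by_host[urlsplit(url).netloc.lower()].append(url)
                sample = []
                for host in sorted(by_host):
                    group = by_host[host]
                    # Only hosts exceeding the limit are actually sampled.
                    if len(group) > samplesize:
                        group = rng.sample(group, samplesize)
                    sample.extend(sorted(group))
                return sample
        
        
            urls = ["https://example.org/%d" % i for i in range(10)]
            urls.append("https://example.net/a")
            picked = sample_by_host(urls, 3, seed=1)
        
        Here ``picked`` holds three URLs from ``example.org`` and the single URL from ``example.net``, so that large hosts do not dominate the resulting list.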
        
        
        Web crawling and URL handling
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        
        Determine if a link leads to another host:
        
        .. code-block:: python
        
            >>> from courlan import is_external
            >>> is_external('https://github.com/', 'https://www.microsoft.com/')
            True
            # default (ignore_suffix=True): domain suffixes are disregarded
            >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
            False
            # taking suffixes into account
            >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
            True
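        
        For comparison, a naive host check can be written with ``urllib.parse`` only. This is a sketch, not how ``is_external`` works internally: ``naive_is_external`` is a hypothetical helper which merely drops a leading ``www.`` and, unlike the ``ignore_suffix`` option, knows nothing about domain suffixes.
        
        .. code-block:: python
        
            from urllib.parse import urlsplit
        
        
            def naive_is_external(url, reference):
                """Compare host names after dropping a leading 'www.' (hypothetical helper)."""
                def host(link):
                    name = (urlsplit(link).hostname or "").lower()
                    return name[4:] if name.startswith("www.") else name
                return host(url) != host(reference)
        
        
            same_site = naive_is_external("https://www.un.org/en/about-us", "https://un.org/")
            other_site = naive_is_external("https://github.com/", "https://www.microsoft.com/")
        
        With this simplification ``google.com`` and ``google.co.uk`` would always count as different hosts, which is precisely the case the suffix handling above is meant to cover.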
        
        
        Other useful functions dedicated to URL handling:
        
        - ``get_base_url(url)``: strip the URL of some of its parts
        - ``get_host_and_path(url)``: decompose URLs into two parts: protocol + host/domain and path
        - ``get_hostinfo(url)``: extract domain and host info (protocol + host/domain)
        - ``fix_relative_urls(baseurl, url)``: prepend necessary information to relative links
        
        
        .. code-block:: python
        
            >>> from courlan import *
            >>> url = 'https://www.un.org/en/about-us'
            >>> get_base_url(url)
            'https://www.un.org'
            >>> get_host_and_path(url)
            ('https://www.un.org', '/en/about-us')
            >>> get_hostinfo(url)
            ('un.org', 'https://www.un.org')
            >>> fix_relative_urls('https://www.un.org', 'en/about-us')
            'https://www.un.org/en/about-us'
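        
        Similar decompositions can be obtained with the standard library, which may help to understand what these helpers return. The sketch below relies on ``urllib.parse`` only and does not reproduce the additional cleaning performed by the functions above.
        
        .. code-block:: python
        
            from urllib.parse import urljoin, urlsplit
        
            url = "https://www.un.org/en/about-us"
            parts = urlsplit(url)
        
            # protocol + host, comparable to the base URL shown above
            base = parts.scheme + "://" + parts.netloc
            # remaining path component
            path = parts.path
            # resolve a relative link against the base URL
            absolute = urljoin(base + "/", "en/about-us")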
        
        
        Other filters dedicated to crawl frontier management:
        
        - ``is_not_crawlable(url)``: check for deep web or pages generally not usable in a crawling context
        - ``is_navigation_page(url)``: check for navigation and overview pages
        
        
        .. code-block:: python
        
            >>> from courlan import is_navigation_page, is_not_crawlable
            >>> is_navigation_page('https://www.randomblog.net/category/myposts')
            True
            >>> is_not_crawlable('https://www.randomblog.net/login')
            True
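        
        Filters of this kind typically rely on patterns in the URL path. The following sketch shows the general idea with two regular expressions; the patterns and the helper names ``looks_like_navigation`` and ``looks_not_crawlable`` are illustrative assumptions and do not correspond to the rules shipped with the library.
        
        .. code-block:: python
        
            import re
            from urllib.parse import urlsplit
        
            # Illustrative patterns only, far less complete than real-world rules.
            NAVIGATION_PATH = re.compile(r"/(category|tag|archive|page)s?/", re.IGNORECASE)
            LOGIN_PATH = re.compile(r"/(login|signin|register)/?$", re.IGNORECASE)
        
        
            def looks_like_navigation(url):
                """True if the path contains a typical overview segment (hypothetical helper)."""
                return bool(NAVIGATION_PATH.search(urlsplit(url).path))
        
        
            def looks_not_crawlable(url):
                """True if the path ends with a typical login segment (hypothetical helper)."""
                return bool(LOGIN_PATH.search(urlsplit(url).path))
        
        
            is_overview = looks_like_navigation("https://www.randomblog.net/category/myposts")
            is_login = looks_not_crawlable("https://www.randomblog.net/login")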
        
        
        Python helpers
        ~~~~~~~~~~~~~~
        
        Helper function, scrub and normalize:
        
        .. code-block:: python
        
            >>> from courlan import clean_url
            >>> clean_url('HTTPS://WWW.DWDS.DE:80/')
            'https://www.dwds.de'
        
        
        Basic scrubbing only:
        
        .. code-block:: python
        
            >>> from courlan import scrub_url
        
        
        Basic canonicalization/normalization only:
        
        .. code-block:: python
        
            >>> from urllib.parse import urlparse
            >>> from courlan import normalize_url
            >>> my_url = normalize_url(urlparse(my_url))
            # passing URL strings directly also works
            >>> my_url = normalize_url(my_url)
            # remove unnecessary components and re-order query elements
            >>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
            'http://test.net/foo.html?page=2&post=abc'
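        
        The steps named in the last comment (dropping unnecessary components and re-ordering query elements) can be sketched with ``urllib.parse``. This is an approximation for illustration, not the code of ``normalize_url``: ``naive_normalize`` is a hypothetical helper, and filtering on the ``utm_`` prefix is an assumption about which parameters count as noise.
        
        .. code-block:: python
        
            from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
        
        
            def naive_normalize(url, tracking_prefixes=("utm_",)):
                """Lowercase scheme and host, sort the query, drop trackers and fragment."""
                parts = urlsplit(url)
                query = sorted(
                    (key, value)
                    for key, value in parse_qsl(parts.query, keep_blank_values=True)
                    if not key.startswith(tracking_prefixes)
                )
                # An empty last element removes the fragment.
                return urlunsplit(
                    (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
                )
        
        
            result = naive_normalize(
                "http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment"
            )
        
        Sorting the query elements makes URLs which differ only in parameter order compare as equal, which helps with deduplication.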
        
        
        Basic URL validation only:
        
        .. code-block:: python
        
            >>> from courlan import validate_url
            >>> validate_url('http://1234')
            (False, None)
            >>> validate_url('http://www.example.org/')
            (True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
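        
        A comparable check can be sketched on top of ``urlparse``. The helper ``naive_validate`` below is hypothetical and much simpler than ``validate_url``: it only assumes that a valid URL uses ``http`` or ``https`` and has a host name containing a dot.
        
        .. code-block:: python
        
            from urllib.parse import urlparse
        
        
            def naive_validate(url):
                """Return (is_valid, parsed_url_or_None); hypothetical helper."""
                try:
                    parsed = urlparse(url)
                except ValueError:
                    return False, None
                host = parsed.hostname or ""
                if parsed.scheme not in ("http", "https") or "." not in host:
                    return False, None
                return True, parsed
        
        
            rejected = naive_validate("http://1234")
            accepted = naive_validate("http://www.example.org/")
        
        Returning the parsed result along with the boolean avoids parsing the same URL twice in later processing steps.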
        
        
        
        Command-line
        ------------
        
        The main functions are also available through a command-line utility.
        
        .. code-block:: bash
        
            $ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
            $ courlan --help
            usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
                           [--strict] [-l LANGUAGE] [-r] [--sample]
                           [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
                           [--exclude-min EXCLUDE_MIN]
        
        
            optional arguments:
              -h, --help            show this help message and exit
        
            I/O:
              Manage input and output
        
              -i INPUTFILE, --inputfile INPUTFILE
                                    name of input file (required)
              -o OUTPUTFILE, --outputfile OUTPUTFILE
                                    name of output file (required)
              -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                                    name of file to store discarded URLs (optional)
              -v, --verbose         increase output verbosity
        
            Filtering:
              Configure URL filters
        
              --strict              perform more restrictive tests
              -l LANGUAGE, --language LANGUAGE
                                    use language filter (ISO 639-1 code)
              -r, --redirects       check redirects
        
            Sampling:
              Use sampling by host, configure sample size
        
              --sample              use sampling
              --samplesize SAMPLESIZE
                                    size of sample per domain
              --exclude-max EXCLUDE_MAX
                                    exclude domains with more than n URLs
              --exclude-min EXCLUDE_MIN
                                    exclude domains with less than n URLs
        
        
        License
        -------
        
        *coURLan* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/courlan/blob/master/LICENSE>`_. If you wish to redistribute this library but feel bound by the license conditions please try interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, `multi-licensing <https://en.wikipedia.org/wiki/Multi-licensing>`_ with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting me <https://github.com/adbar/courlan#author>`_.
        
        See also `GPL and free software licensing: What's in it for business? <https://www.techrepublic.com/blog/cio-insights/gpl-and-free-software-licensing-whats-in-it-for-business/>`_
        
        
        
        Settings
        --------
        
        ``courlan`` is optimized for English and German but its generic approach is also usable in other contexts.
        
        To review details of strict URL filtering, see ``settings.py``. These settings can be overridden by `cloning the repository <https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository-from-github>`_ and `installing the package locally from the modified source <https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree>`_.
        
        
        
        Contributing
        ------------
        
        `Contributions <https://github.com/adbar/courlan/blob/master/CONTRIBUTING.md>`_ are welcome!
        
        Feel free to file issues on the `dedicated page <https://github.com/adbar/courlan/issues>`_.
        
        
        Author
        ------
        
        This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.
        
        - Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
        - Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://konvens.org/proceedings/2019/papers/kaleidoskop/camera_ready_barbaresi.pdf>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
        
        Contact: see `homepage <https://adrien.barbaresi.eu/>`_ or `GitHub <https://github.com/adbar>`_.
        
        Software ecosystem: see `this graphic <https://github.com/adbar/trafilatura/blob/master/docs/software-ecosystem.png>`_.
        
        
        
        Similar work
        ------------
        
        These Python libraries perform similar normalization tasks but don't entail language or content filters. They also don't necessarily focus on crawl optimization:
        
        - `furl <https://github.com/gruns/furl>`_
        - `ural <https://github.com/medialab/ural>`_
        - `urlnorm <https://github.com/kurtmckee/urlnorm>`_ (outdated)
        - `yarl <https://github.com/aio-libs/yarl>`_
        
Keywords: urls,url-parsing,url-manipulation,preprocessing,validation,webcrawling
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Filters
Requires-Python: >=3.5
