Metadata-Version: 1.2
Name: courlan
Version: 0.3.1
Summary: Clean, filter, normalize, and sample URLs
Home-page: http://github.com/adbar/courlan
Author: Adrien Barbaresi
Author-email: barbaresi@bbaw.de
License: GPLv3+
Project-URL: Source, https://github.com/adbar/courlan
Project-URL: Coverage, https://codecov.io/github/adbar/courlan
Project-URL: Tracker, https://github.com/adbar/courlan/issues
Description: coURLan: Clean, filter, normalize, and sample URLs
        ==================================================
        
        
        .. image:: https://img.shields.io/pypi/v/courlan.svg
            :target: https://pypi.python.org/pypi/courlan
            :alt: Python package
        
        .. image:: https://img.shields.io/pypi/pyversions/courlan.svg
            :target: https://pypi.python.org/pypi/courlan
            :alt: Python versions
        
        .. image:: https://img.shields.io/travis/adbar/courlan.svg
            :target: https://travis-ci.org/adbar/courlan
            :alt: Travis build status
        
        .. image:: https://img.shields.io/codecov/c/github/adbar/courlan.svg
            :target: https://codecov.io/gh/adbar/courlan
            :alt: Code Coverage
        
        
        
        Features
        --------
        
        Separate `the wheat from the chaff <https://en.wiktionary.org/wiki/separate_the_wheat_from_the_chaff>`_ and optimize crawls by focusing on non-spam HTML pages containing primarily text.
        
        - URL validation and (basic) normalization
        - Filters targeting spam and unsuitable content-types
        - Sampling by domain name
        - Command-line interface (CLI) and Python tool
        
        
        **Let the coURLan fish out juicy bits for you!**
        
        .. image:: courlan_harns-march.jpg
            :alt: Courlan 
            :align: center
            :width: 65%
            :target: https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg
        
        Here is a `courlan <https://en.wiktionary.org/wiki/courlan>`_ (source: `Limpkin at Harn's Marsh by Russ <https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg>`_, CC BY 2.0).
        
        
        
        Installation
        ------------
        
        This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.4 and above. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with the Python package managers ``pip`` and ``pipenv``:
        
        .. code-block:: bash
        
            $ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
            $ pip install --upgrade courlan # to make sure you have the latest version
            $ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
        
        
        
        Usage
        -----
        
        ``courlan`` is designed to work best on English, German and the most frequent European languages.
        
        The current logic of detailed/strict URL filtering is focused on English and German; for details see ``settings.py``. This can be overridden by `cloning the repository <https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository-from-github>`_ and `recompiling the package locally <https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree>`_.
        
        
        Python
        ~~~~~~
        
        All operations chained:
        
        .. code-block:: python
        
            >>> from courlan import check_url
            # returns url and domain name
            >>> check_url('https://github.com/adbar/courlan')
            ('https://github.com/adbar/courlan', 'github.com')
            # noisy query parameters can be removed
            >>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
            ('https://httpbin.org/redirect-to', 'httpbin.org')
            # Check for redirects (HEAD request)
            >>> url, domain_name = check_url(my_url, with_redirects=True)
            # optional argument targeting webpages in English or German
            >>> url, domain_name = check_url(my_url, with_redirects=True, language='en')
            >>> url, domain_name = check_url(my_url, with_redirects=True, language='de')
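

        The same checks can be applied to a whole list of candidate links, e.g. a crawl frontier. The following is a minimal sketch, assuming ``check_url`` returns ``None`` for URLs that do not pass the filters (the list of candidates is hypothetical):

        .. code-block:: python

            from courlan import check_url

            # hypothetical list of candidate links
            candidates = [
                'https://github.com/adbar/courlan',
                'https://www.example.org/cart/checkout',
                'http://1234',
            ]

            # keep only URLs that pass validation and filtering
            kept = []
            for candidate in candidates:
                result = check_url(candidate, strict=True)
                if result is not None:
                    url, domain = result
                    kept.append(url)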
        
        
        Helper function to scrub and normalize:
        
        .. code-block:: python
        
            >>> from courlan import clean_url
            >>> clean_url('HTTPS://WWW.DWDS.DE:80/')
            'https://www.dwds.de'
        
        
        Basic scrubbing only:
        
        .. code-block:: python
        
            >>> from courlan import scrub_url
            >>> my_url = scrub_url(my_url)
        
        
        Basic normalization only:
        
        .. code-block:: python
        
            >>> from urllib.parse import urlparse
            >>> from courlan import normalize_url
            >>> my_url = normalize_url(urlparse(my_url))
            # passing URL strings directly also works
            >>> my_url = normalize_url(my_url)
            # remove unnecessary components and re-order query elements
            >>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
            'http://test.net/foo.html?page=2&post=abc'
        
        
        Basic URL validation only:
        
        .. code-block:: python
        
            >>> from courlan import validate_url
            >>> validate_url('http://1234')
            (False, None)
            >>> validate_url('http://www.example.org/')
            (True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
        
        
        Sampling by domain name:
        
        .. code-block:: python
        
            >>> from courlan import sample_urls
            >>> my_sample = sample_urls(my_urls, 100)
            # optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
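

        In a script, the sample can be drawn from a list read from a file. Here is a minimal sketch, assuming ``sample_urls`` returns an iterable of URLs (file names are hypothetical):

        .. code-block:: python

            from courlan import sample_urls

            # hypothetical input file with one URL per line
            with open('url-list.txt', 'r', encoding='utf-8') as inputfile:
                my_urls = [line.strip() for line in inputfile]

            # draw up to 100 URLs per domain and write them back to disk
            with open('url-sample.txt', 'w', encoding='utf-8') as outputfile:
                for url in sample_urls(my_urls, 100):
                    outputfile.write(url + '\n')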
        
        
        Determine if a link leads to another host:
        
        .. code-block:: python
        
            >>> from courlan import is_external
            >>> is_external('https://github.com/', 'https://www.microsoft.com/')
            True
            # default
            >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
            False
            # taking suffixes into account
            >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
            True
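

        This check can be used to separate internal from external links found in a page. The following is a minimal sketch, assuming the first argument is the link to test and the second one the reference URL (the list of links is hypothetical):

        .. code-block:: python

            from courlan import is_external

            base_url = 'https://github.com/adbar/courlan'
            # hypothetical links extracted from a page
            links = [
                'https://github.com/adbar/trafilatura',
                'https://www.dwds.de/',
            ]

            # split links into internal and external ones
            internal = [link for link in links if not is_external(link, base_url)]
            external = [link for link in links if is_external(link, base_url)]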
        
        
        Command-line
        ~~~~~~~~~~~~
        
        .. code-block:: bash
        
            $ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
            $ courlan --help
        
        
        .. code-block:: text

            usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
                           [--strict] [-l {de,en}] [-r] [--sample]
                           [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
                           [--exclude-min EXCLUDE_MIN]

            optional arguments:
              -h, --help            show this help message and exit

            I/O:
              Manage input and output

              -i INPUTFILE, --inputfile INPUTFILE
                                    name of input file (required)
              -o OUTPUTFILE, --outputfile OUTPUTFILE
                                    name of output file (required)
              -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                                    name of file to store discarded URLs (optional)
              -v, --verbose         increase output verbosity

            Filtering:
              Configure URL filters

              --strict              perform more restrictive tests
              -l, --language        use language filter {de,en}
              -r, --redirects       check redirects

            Sampling:
              Use sampling by host, configure sample size

              --sample              use sampling
              --samplesize SAMPLESIZE
                                    size of sample per domain
              --exclude-max EXCLUDE_MAX
                                    exclude domains with more than n URLs
              --exclude-min EXCLUDE_MIN
                                    exclude domains with less than n URLs
        
        
        
        Additional scripts
        ~~~~~~~~~~~~~~~~~~
        
        Scripts designed to handle URL lists are found under ``helpers``.
        
        
        License
        -------
        
        *coURLan* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/courlan/blob/master/LICENSE>`_. If you wish to redistribute this library but feel bound by the license conditions, please try interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, `multi-licensing <https://en.wikipedia.org/wiki/Multi-licensing>`_ with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting me <https://github.com/adbar/courlan#author>`_.

        See also `GPL and free software licensing: What's in it for business? <https://www.techrepublic.com/blog/cio-insights/gpl-and-free-software-licensing-whats-in-it-for-business/>`_.
        
        
        Contributing
        ------------
        
        `Contributions <https://github.com/adbar/courlan/blob/master/CONTRIBUTING.md>`_ are welcome!
        
        Feel free to file issues on the `dedicated page <https://github.com/adbar/courlan/issues>`_.
        
        
        Author
        ------
        
        This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). A significant challenge lies in extracting and pre-processing web texts so that they meet scientific expectations: web corpus construction involves numerous design decisions, and this software package can facilitate collection and enhance corpus quality.
        
        -  Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://konvens.org/proceedings/2019/papers/kaleidoskop/camera_ready_barbaresi.pdf>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
        -  Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.
        
        Contact: see `homepage <https://adrien.barbaresi.eu/>`_ or `GitHub <https://github.com/adbar>`_.
        
        
        Similar work
        ------------
        
        These Python libraries perform similar normalization tasks but do not include language or content filters:
        
        - `ural <https://github.com/medialab/ural/>`_
        - `urlnorm <https://github.com/kurtmckee/urlnorm>`_
        - `yarl <https://github.com/aio-libs/yarl/>`_
        
Keywords: urls,url-parsing,url-manipulation,preprocessing,validation,webcrawling
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Filters
Requires-Python: >=3.4
