Metadata-Version: 2.1
Name: sitecrawl
Version: 1.0.3
Summary: Simple Python3 module to crawl a website and extract URLs
Home-page: https://github.com/gabfl/sitecrawl
Author: Gabriel Bordeaux
Author-email: pypi@gab.lc
License: MIT
Platform: UNKNOWN
Classifier: Topic :: Internet
Classifier: Topic :: Internet :: WWW/HTTP :: Site Management :: Link Checking
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python
Classifier: Development Status :: 4 - Beta
License-File: LICENSE

sitecrawl
=========

|Pypi| |Build Status| |codecov| |MIT licensed|

Simple Python module to crawl a website and extract URLs.

Installation
------------

Using pip:

.. code:: bash

   pip3 install sitecrawl

   sitecrawl --help

Or build from source:

.. code:: bash

   # Clone project
   git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

   # Installation
   pip3 install .

Usage
-----

CLI
~~~

.. code:: bash

   sitecrawl --url http://www.gab.lc --depth 3

   # Add --verbose for verbose mode

Example output:

::

   * Found 4 internal URLs
     http://www.gab.lc
     http://www.gab.lc/articles
     http://www.gab.lc/contact
     http://www.gab.lc/about

   * Found 8 external URLs
     https://gpgtools.org/
     http://en.wikipedia.org/wiki/GNU_General_Public_License
     http://en.wikipedia.org/wiki/Pretty_Good_Privacy
     http://en.wikipedia.org/wiki/GNU_Privacy_Guard
     https://www.gpgtools.org
     https://www.google.com/#hl=en&q=install+gpg+windows
     http://www.gnupg.org/gph/en/manual/x135.html
     http://keys.gnupg.net

   * Skipped 0 URLs
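
The report separates URLs on the crawled site from URLs pointing elsewhere.
As a rough illustration of that distinction, links can be classified by
comparing host names. This is a minimal sketch, not sitecrawl's actual
implementation, and ``is_internal`` is a hypothetical helper introduced here:

.. code:: py

   from urllib.parse import urlparse


   def is_internal(url, base_url):
       """Return True when url has the same host as base_url.

       Hypothetical helper for illustration; not part of sitecrawl.
       """
       return urlparse(url).netloc.lower() == urlparse(base_url).netloc.lower()


   base = 'http://www.gab.lc'
   links = ['http://www.gab.lc/articles', 'https://gpgtools.org/']

   # Split the links into same-host and other-host lists
   internal = [u for u in links if is_internal(u, base)]
   external = [u for u in links if not is_internal(u, base)]

A strict host comparison like this one treats subdomains and different
ports as external; a real crawler may apply looser rules.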

As a module
~~~~~~~~~~~

Basic example:

.. code:: py

   from sitecrawl import crawl

   crawl.base_url = 'https://www.github.com'
   crawl.deep_crawl(depth=2)

   print('Internal URLs:', crawl.get_internal_urls())
   print('External URLs:', crawl.get_external_urls())
   print('Skipped URLs:', crawl.get_skipped_urls())

A more detailed example is available in
`example.py <https://github.com/gabfl/sitecrawl/blob/main/example.py>`__.
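
The collected URLs can be post-processed with the standard library. The
sketch below groups URLs by host name; ``group_by_host`` is a hypothetical
helper, and it assumes ``crawl.get_external_urls()`` returns an iterable of
URL strings, which this README does not state explicitly:

.. code:: py

   from collections import defaultdict
   from urllib.parse import urlparse


   def group_by_host(urls):
       """Group URL strings by host name.

       Hypothetical helper for illustration; not part of sitecrawl.
       """
       groups = defaultdict(list)
       for url in urls:
           groups[urlparse(url).netloc].append(url)
       return dict(groups)


   # Sample data copied from the CLI output above. In practice this could be
   # crawl.get_external_urls(), assuming it yields URL strings.
   external_urls = [
       'http://en.wikipedia.org/wiki/GNU_General_Public_License',
       'http://en.wikipedia.org/wiki/Pretty_Good_Privacy',
       'http://keys.gnupg.net',
   ]

   # Print each host with the number of URLs found for it
   for host, urls in sorted(group_by_host(external_urls).items()):
       print(host, len(urls))

Grouping by host gives a quick view of which external sites a website
links to most often.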

.. |Pypi| image:: https://img.shields.io/pypi/v/sitecrawl.svg
   :target: https://pypi.org/project/sitecrawl
.. |Build Status| image:: https://github.com/gabfl/sitecrawl/actions/workflows/ci.yml/badge.svg?branch=main
   :target: https://github.com/gabfl/sitecrawl/actions
.. |codecov| image:: https://codecov.io/gh/gabfl/sitecrawl/branch/main/graph/badge.svg
   :target: https://codecov.io/gh/gabfl/sitecrawl
.. |MIT licensed| image:: https://img.shields.io/badge/license-MIT-green.svg
   :target: https://raw.githubusercontent.com/gabfl/sitecrawl/main/LICENSE


