Metadata-Version: 2.1
Name: scrapy-zyte-api
Version: 0.5.1
Summary: Client library to process URLs through Zyte API
Home-page: https://github.com/scrapy-plugins/scrapy-zyte-api
Author: Zyte Group Ltd
Author-email: info@zyte.com
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Description-Content-Type: text/x-rst

===============
scrapy-zyte-api
===============

.. image:: https://img.shields.io/pypi/v/scrapy-zyte-api.svg
   :target: https://pypi.python.org/pypi/scrapy-zyte-api
   :alt: PyPI Version

.. image:: https://img.shields.io/pypi/pyversions/scrapy-zyte-api.svg
   :target: https://pypi.python.org/pypi/scrapy-zyte-api
   :alt: Supported Python Versions

.. image:: https://github.com/scrapy-plugins/scrapy-zyte-api/actions/workflows/test.yml/badge.svg
   :target: https://github.com/scrapy-plugins/scrapy-zyte-api/actions/workflows/test.yml
   :alt: Automated tests

.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-zyte-api/branch/main/graph/badge.svg?token=iNYIk4nfyd
   :target: https://codecov.io/gh/scrapy-plugins/scrapy-zyte-api
   :alt: Coverage report

Requirements
------------

* Python 3.7+
* Scrapy 2.0.1+

Installation
------------

.. code-block::

    pip install scrapy-zyte-api

This package requires Python 3.7+.

Configuration
-------------

Replace the default ``http`` and ``https`` download handlers in Scrapy's
`DOWNLOAD_HANDLERS <https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DOWNLOAD_HANDLERS>`_
setting, in the ``settings.py`` file of your Scrapy project.

You also need to set ``ZYTE_API_KEY``, either as a setting or as an
environment variable.

Lastly, make sure to `install the asyncio-based Twisted reactor
<https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor>`_
in the ``settings.py`` file as well.

For example, a Scrapy project's ``settings.py`` file would contain the following:

.. code-block:: python

    DOWNLOAD_HANDLERS = {
        "http": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
        "https": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler"
    }

    # Alternatively, set the ZYTE_API_KEY environment variable.
    ZYTE_API_KEY = "<your API key>"

    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Usage
-----

To enable a ``scrapy.Request`` to go through Zyte API, the ``zyte_api`` key in
`Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
must be present and contain a dict with Zyte API parameters:

.. code-block:: python

    import scrapy


    class SampleQuotesSpider(scrapy.Spider):
        name = "sample_quotes"

        def start_requests(self):
            yield scrapy.Request(
                url="http://quotes.toscrape.com/",
                callback=self.parse,
                meta={
                    "zyte_api": {
                        "browserHtml": True,
                    }
                },
            )

        def parse(self, response):
            yield {"URL": response.url, "HTML": response.body}

            print(response.raw_api_response)
            # {
            #     'url': 'https://quotes.toscrape.com/',
            #     'statusCode': 200,
            #     'browserHtml': '<html> ... </html>',
            # }

You can see the full list of parameters in the `Zyte API Specification
<https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
The ``url`` parameter is filled automatically from ``request.url``; all other
parameters must be set explicitly.

The raw Zyte API response can be accessed via the ``raw_api_response``
attribute of the response object.

When you use the Zyte API parameters ``browserHtml``, ``httpResponseBody``, or
``httpResponseHeaders``, the response body and headers are set accordingly.
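
As a rough illustration of that mapping, the sketch below rebuilds a body and
a header dict from a raw API response using only the standard library. It
assumes, following the Zyte API specification, that ``httpResponseBody`` is
Base64-encoded and that ``httpResponseHeaders`` is a list of objects with
``name`` and ``value`` keys. ``body_and_headers`` is a hypothetical helper and
not part of this package, which may build its responses differently:

.. code-block:: python

    import base64


    def body_and_headers(raw_api_response):
        """Return (body_bytes, headers_dict) derived from a raw API response."""
        if "browserHtml" in raw_api_response:
            # Browser-rendered HTML arrives as text.
            body = raw_api_response["browserHtml"].encode("utf-8")
        elif "httpResponseBody" in raw_api_response:
            # Assumption: the HTTP body is Base64-encoded in the API response.
            body = base64.b64decode(raw_api_response["httpResponseBody"])
        else:
            body = b""
        # Assumption: headers are a list of {"name": ..., "value": ...} objects.
        # Repeated header names are collapsed here; the last value wins.
        headers = {
            header["name"]: header["value"]
            for header in raw_api_response.get("httpResponseHeaders", [])
        }
        return body, headers


    raw = {
        "url": "https://quotes.toscrape.com/",
        "httpResponseBody": base64.b64encode(b"<html></html>").decode("ascii"),
        "httpResponseHeaders": [{"name": "Content-Type", "value": "text/html"}],
    }
    body, headers = body_and_headers(raw)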

Note that, for Zyte API requests, the spider gets responses of
``ZyteAPIResponse`` and ``ZyteAPITextResponse`` types,
which are respectively subclasses of ``scrapy.http.Response``
and ``scrapy.http.TextResponse``.

If multiple requests target the same URL with different Zyte API parameters,
pass ``dont_filter=True`` to ``Request``.
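
A minimal sketch of that case, requesting the same page once with
``browserHtml`` and once with ``httpResponseBody``. The spider name is a
placeholder. ``dont_filter=True`` matters here because Scrapy's default
duplicate filter does not look at request meta, so it would otherwise treat
the second request as a repeat of the first:

.. code-block:: python

    import scrapy


    class SamePageTwiceSpider(scrapy.Spider):
        name = "same_page_twice"  # placeholder name for this sketch

        def start_requests(self):
            # Same URL, different Zyte API parameters for each request.
            for params in ({"browserHtml": True}, {"httpResponseBody": True}):
                yield scrapy.Request(
                    url="http://quotes.toscrape.com/",
                    callback=self.parse,
                    meta={"zyte_api": params},
                    dont_filter=True,  # keep the duplicate filter from dropping it
                )

        def parse(self, response):
            # List which fields the raw API response carries for this request.
            yield {"URL": response.url, "keys": sorted(response.raw_api_response)}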

Setting default parameters
--------------------------

Often the same configuration needs to be used for all Zyte API requests.
For example, all requests may need to set the same geolocation, or
the spider only uses ``browserHtml`` requests.

To set the default parameters for Zyte API enabled requests, you can set the
following in the ``settings.py`` file or `any other settings within Scrapy
<https://docs.scrapy.org/en/latest/topics/settings.html#populating-the-settings>`_:

.. code-block:: python

    ZYTE_API_DEFAULT_PARAMS = {
        "browserHtml": True,
        "geolocation": "US",
    }


``ZYTE_API_DEFAULT_PARAMS`` only applies to requests where the ``zyte_api``
key in `Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
is set; defining ``ZYTE_API_DEFAULT_PARAMS`` does not by itself send every
request through Zyte API. Parameters in ``ZYTE_API_DEFAULT_PARAMS`` are merged
with parameters set via the ``zyte_api`` meta key, with the values in meta
taking priority.

.. code-block:: python

    import scrapy


    class SampleQuotesSpider(scrapy.Spider):
        name = "sample_quotes"

        custom_settings = {
            "ZYTE_API_DEFAULT_PARAMS": {
                "geolocation": "US",  # You can set any Geolocation region you want.
            }
        }

        def start_requests(self):
            yield scrapy.Request(
                url="http://quotes.toscrape.com/",
                callback=self.parse,
                meta={
                    "zyte_api": {
                        "browserHtml": True,
                        "javascript": True,
                        "echoData": {"some_value_I_could_track": 123},
                    }
                },
            )

        def parse(self, response):
            yield {"URL": response.url, "HTML": response.body}

            print(response.raw_api_response)
            # {
            #     'url': 'https://quotes.toscrape.com/',
            #     'statusCode': 200,
            #     'browserHtml': '<html> ... </html>',
            #     'echoData': {'some_value_I_could_track': 123},
            # }

            print(response.request.meta)
            # {
            #     'zyte_api': {
            #         'browserHtml': True,
            #         'geolocation': 'US',
            #         'javascript': True,
            #         'echoData': {'some_value_I_could_track': 123}
            #     },
            #     'download_timeout': 180.0,
            #     'download_slot': 'quotes.toscrape.com'
            # }

As a shortcut, when a request uses exactly the parameters defined in the
``ZYTE_API_DEFAULT_PARAMS`` setting, with no further customization, the
``zyte_api`` meta key can be set to ``True`` or ``{}``:

.. code-block:: python

    import scrapy


    class SampleQuotesSpider(scrapy.Spider):
        name = "sample_quotes"

        custom_settings = {
            "ZYTE_API_DEFAULT_PARAMS": {
                "browserHtml": True,
            }
        }

        def start_requests(self):
            yield scrapy.Request(
                url="http://quotes.toscrape.com/",
                callback=self.parse,
                meta={"zyte_api": True},
            )

        def parse(self, response):
            yield {"URL": response.url, "HTML": response.body}

            print(response.raw_api_response)
            # {
            #     'url': 'https://quotes.toscrape.com/',
            #     'statusCode': 200,
            #     'browserHtml': '<html> ... </html>',
            # }

            print(response.request.meta)
            # {
            #     'zyte_api': {
            #         'browserHtml': True,
            #     },
            #     'download_timeout': 180.0,
            #     'download_slot': 'quotes.toscrape.com'
            # }
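
The merging rules described in this section can be summarized with a small
standalone sketch. ``effective_params`` is a hypothetical helper written only
for illustration, and it assumes a shallow, top-level merge. The download
handler's actual implementation may differ in its details:

.. code-block:: python

    def effective_params(default_params, zyte_api_meta):
        """Return the parameters a request would use.

        Returns None when the ``zyte_api`` meta key is absent, i.e. when the
        request does not go through Zyte API at all.
        """
        if zyte_api_meta is None:
            return None
        if zyte_api_meta is True:
            # Shortcut: use the defaults without any customization.
            zyte_api_meta = {}
        # Values from request meta override the defaults (assumed shallow merge).
        return {**default_params, **zyte_api_meta}


    defaults = {"browserHtml": True, "geolocation": "US"}
    merged = effective_params(defaults, {"geolocation": "GB", "javascript": True})
    shortcut = effective_params(defaults, True)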

Customizing the retry policy
----------------------------

API requests are retried automatically using the default retry policy of
`python-zyte-api`_.

API requests that exceed retries are dropped. You cannot manage API request
retries through Scrapy downloader middlewares.

Use the ``ZYTE_API_RETRY_POLICY`` setting or the ``zyte_api_retry_policy``
request meta key to override the default `python-zyte-api`_ retry policy with a
custom retry policy.

A custom retry policy must be an instance of `tenacity.AsyncRetrying`_.

Scrapy settings must be picklable, which `retry policies are not
<https://github.com/jd/tenacity/issues/147>`_, so you cannot assign retry
policy objects directly to the ``ZYTE_API_RETRY_POLICY`` setting, and must use
their import path string instead.

When setting a retry policy through request metadata, you can assign the
``zyte_api_retry_policy`` request meta key either the retry policy object
itself or its import path string. If your requests need to be serializable,
however, use the import path string there as well, because retry policy
objects cannot be pickled.

For example, to retry HTTP 521 errors in the same way as HTTP 520 errors, you
can subclass RetryFactory_ as follows:

.. code-block:: python

    # project/retry_policies.py
    from tenacity import retry_if_exception, RetryCallState
    from zyte_api.aio.errors import RequestError
    from zyte_api.aio.retry import RetryFactory

    def is_http_521(exc: BaseException) -> bool:
        return isinstance(exc, RequestError) and exc.status == 521

    class CustomRetryFactory(RetryFactory):

        retry_condition = (
            RetryFactory.retry_condition
            | retry_if_exception(is_http_521)
        )

        def wait(self, retry_state: RetryCallState) -> float:
            if is_http_521(retry_state.outcome.exception()):
                return self.temporary_download_error_wait(retry_state=retry_state)
            return super().wait(retry_state)

        def stop(self, retry_state: RetryCallState) -> bool:
            if is_http_521(retry_state.outcome.exception()):
                return self.temporary_download_error_stop(retry_state)
            return super().stop(retry_state)

    CUSTOM_RETRY_POLICY = CustomRetryFactory().build()

    # project/settings.py
    ZYTE_API_RETRY_POLICY = "project.retry_policies.CUSTOM_RETRY_POLICY"
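
The same policy can also be selected for a single request through the
``zyte_api_retry_policy`` meta key. The sketch below uses the import path
string so that the request stays serializable. It assumes the
``project/retry_policies.py`` module from the example above exists, and the
spider name is a placeholder:

.. code-block:: python

    import scrapy


    class RetryPolicySpider(scrapy.Spider):
        name = "retry_policy_sample"  # placeholder name for this sketch

        def start_requests(self):
            yield scrapy.Request(
                url="http://quotes.toscrape.com/",
                callback=self.parse,
                meta={
                    "zyte_api": {"browserHtml": True},
                    # Import path string rather than the policy object itself.
                    "zyte_api_retry_policy": "project.retry_policies.CUSTOM_RETRY_POLICY",
                },
            )

        def parse(self, response):
            yield {"URL": response.url}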

.. _python-zyte-api: https://github.com/zytedata/python-zyte-api
.. _RetryFactory: https://github.com/zytedata/python-zyte-api/blob/main/zyte_api/aio/retry.py
.. _tenacity.AsyncRetrying: https://tenacity.readthedocs.io/en/latest/api.html#tenacity.AsyncRetrying


Stats
-----

Stats from python-zyte-api_ are exposed as Scrapy stats with the
``scrapy-zyte-api`` prefix.
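
One way to look at only those stats is to filter the stats dict by prefix.
The helper below is a hypothetical sketch, and the stat names in the sample
data are made up for illustration; the actual names come from
python-zyte-api_:

.. code-block:: python

    PREFIX = "scrapy-zyte-api"


    def zyte_api_stats(all_stats):
        """Return only the stats whose name starts with the plugin prefix."""
        return {
            name: value
            for name, value in all_stats.items()
            if str(name).startswith(PREFIX)
        }


    # Made-up sample data; real stat names may differ.
    sample = {
        "downloader/request_count": 3,
        "scrapy-zyte-api/example_counter": 2,
    }
    filtered = zyte_api_stats(sample)

Inside a spider, ``self.crawler.stats.get_stats()`` returns the dict to pass
in, for instance from the spider's ``closed()`` method.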
