Disable Referer by default for Zyte API requests #239

Merged
merged 3 commits into from
Jan 13, 2025
1 change: 1 addition & 0 deletions docs/index.rst
@@ -30,6 +30,7 @@ either :ref:`globally <transparent>` or :ref:`per request <automap>`, or
usage/stats
usage/fingerprint
usage/proxy
usage/referer

.. toctree::
:caption: Reference
16 changes: 16 additions & 0 deletions docs/reference/settings.rst
@@ -286,6 +286,22 @@ For example:
}


.. setting:: ZYTE_API_REFERRER_POLICY

ZYTE_API_REFERRER_POLICY
========================

Default: ``"no-referrer"``

:setting:`REFERRER_POLICY` to apply to Zyte API requests when using
:ref:`transparent mode <transparent>` or :ref:`automatic request parameters
<automap>`.

The :reqmeta:`referrer_policy` request metadata key takes precedence.

See :ref:`referer`.


.. setting:: ZYTE_API_RETRY_POLICY

ZYTE_API_RETRY_POLICY
1 change: 1 addition & 0 deletions docs/setup.rst
@@ -108,6 +108,7 @@ scrapy-zyte-api integration as follows:
}
SPIDER_MIDDLEWARES = {
"scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware": 100,
"scrapy_zyte_api.ScrapyZyteAPIRefererSpiderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
52 changes: 52 additions & 0 deletions docs/usage/referer.rst
@@ -0,0 +1,52 @@
.. _referer:

==================
The Referer header
==================

By default, Scrapy automatically sets a `Referer header`_ on every request
yielded from a callback (see
:class:`~scrapy.spidermiddlewares.referer.RefererMiddleware`).

However, when using :ref:`transparent mode <transparent>` or :ref:`automatic
request parameters <automap>`, this behavior is disabled by default for Zyte
API requests.

When using :ref:`manual request parameters <manual>`, request headers are
always ignored for Zyte API requests.

.. _Referer header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer

Why is it disabled by default?
==============================

Misusing the ``Referer`` header can increase the risk of :ref:`bans <bans>`.

By *not* setting the header, you let Zyte API choose which value to use, if
any, to minimize bans.

If you *do* set the header, Zyte API may still ignore your value to avoid
bans, but it may also keep your value regardless of its impact on bans.

How to override?
================

To set the header anyway when using :ref:`transparent mode <transparent>` or
:ref:`automatic request parameters <automap>`, do any of the following:

- Set the :setting:`ZYTE_API_REFERRER_POLICY` setting or the
  :reqmeta:`referrer_policy` request metadata key to ``"scrapy-default"`` or
  to some other value supported by the :setting:`REFERRER_POLICY` setting.

- Set the header through the :setting:`DEFAULT_REQUEST_HEADERS` setting or
  the :attr:`Request.headers <scrapy.http.Request.headers>` attribute.

- Set the header through the :http:`request:customHttpRequestHeaders` field
  (for :ref:`HTTP requests <zapi-http>`) or the :http:`request:requestHeaders`
  field (for :ref:`browser requests <zapi-browser>`) through the
  :setting:`ZYTE_API_AUTOMAP_PARAMS` setting or the
  :reqmeta:`zyte_api_automap` request metadata key.
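
The first option in the list above could look roughly like the following. This is a minimal sketch, not a definitive configuration: it assumes a project ``settings.py`` that already enables transparent mode, and the helper name ``request_meta_with_policy`` is hypothetical and exists only for illustration. The setting name, the meta key, and the ``"scrapy-default"`` value come from the text above; ``"same-origin"`` is one of the values that Scrapy's ``REFERRER_POLICY`` setting accepts.

```python
# settings.py fragment: re-enable Scrapy's default Referer behavior for all
# Zyte API requests sent through transparent mode or automatic parameters.
ZYTE_API_TRANSPARENT_MODE = True
ZYTE_API_REFERRER_POLICY = "scrapy-default"


def request_meta_with_policy(policy):
    """Hypothetical helper: build request metadata that overrides the
    referrer policy for a single request.

    The referrer_policy meta key takes precedence over the
    ZYTE_API_REFERRER_POLICY setting.
    """
    return {"referrer_policy": policy}


# In a spider callback (assuming `from scrapy import Request`):
#     yield Request(url, meta=request_meta_with_policy("same-origin"))
meta = request_meta_with_policy("same-origin")
```

The setting changes the default for every Zyte API request, while the meta key only affects the request that carries it, so the two can be combined.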

When using :ref:`manual request parameters <manual>`, you always need to set
the header through the :http:`request:customHttpRequestHeaders` or
:http:`request:requestHeaders` field through the
:setting:`ZYTE_API_DEFAULT_PARAMS` setting or the :reqmeta:`zyte_api` request
metadata key.
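
For manual request parameters, a sketch of the two request shapes might look like this. The helper names ``manual_http_meta`` and ``manual_browser_meta`` are hypothetical, and the ``httpResponseBody`` and ``browserHtml`` output fields are assumptions about which output a given request asks for. The header field shapes (a list of name/value pairs for ``customHttpRequestHeaders``, an object with a ``referer`` key for ``requestHeaders``) match what the mock server in this change reads.

```python
def manual_http_meta(referer):
    """Hypothetical helper: meta for an HTTP request with manual parameters."""
    return {
        "zyte_api": {
            "httpResponseBody": True,
            # HTTP requests take a list of {"name", "value"} header objects.
            "customHttpRequestHeaders": [
                {"name": "Referer", "value": referer},
            ],
        }
    }


def manual_browser_meta(referer):
    """Hypothetical helper: meta for a browser request with manual parameters."""
    return {
        "zyte_api": {
            "browserHtml": True,
            # Browser requests take an object keyed by header name.
            "requestHeaders": {"referer": referer},
        }
    }


# In a spider callback (assuming `from scrapy import Request`):
#     yield Request(url, meta=manual_http_meta("https://example.com/"))
http_meta = manual_http_meta("https://example.com/")
browser_meta = manual_browser_meta("https://example.com/")
```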
1 change: 1 addition & 0 deletions scrapy_zyte_api/__init__.py
@@ -8,6 +8,7 @@
from ._annotations import ExtractFrom, actions, custom_attrs
from ._middlewares import (
ScrapyZyteAPIDownloaderMiddleware,
ScrapyZyteAPIRefererSpiderMiddleware,
ScrapyZyteAPISpiderMiddleware,
)
from ._page_inputs import Actions, Geolocation, Screenshot
35 changes: 35 additions & 0 deletions scrapy_zyte_api/_middlewares.py
@@ -201,3 +201,38 @@
        async for item_or_request in result:
            self._process_output_item_or_request(item_or_request, spider)
            yield item_or_request


class ScrapyZyteAPIRefererSpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        self._default_policy = crawler.settings.get(
            "ZYTE_API_REFERRER_POLICY", "no-referrer"
        )
        self._param_parser = _ParamParser(crawler, cookies_enabled=False)

    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            self._process_output_item_or_request(item_or_request, spider)
            yield item_or_request

    async def process_spider_output_async(self, response, result, spider):
        async for item_or_request in result:
            self._process_output_item_or_request(item_or_request, spider)
            yield item_or_request

    def _process_output_item_or_request(self, item_or_request, spider):
        if not isinstance(item_or_request, Request):
            return
        self._process_output_request(item_or_request, spider)

    def _process_output_request(self, request, spider):
        if self._is_zyte_api_request(request):
            request.meta.setdefault("referrer_policy", self._default_policy)

    def _is_zyte_api_request(self, request):
        return self._param_parser.parse(request) is not None
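
The precedence rule that this middleware relies on can be illustrated in isolation. The following is a simplified sketch that uses plain dictionaries in place of `Request.meta`; the function name `apply_default_policy` is hypothetical, and the real middleware applies the default only to requests that `_ParamParser` resolves to Zyte API requests.

```python
def apply_default_policy(meta, default_policy="no-referrer"):
    """Hypothetical stand-in for the middleware step: set the default
    referrer policy only when the request does not already define one."""
    # dict.setdefault leaves an existing value untouched, which is why a
    # per-request referrer_policy wins over the project-wide default.
    meta.setdefault("referrer_policy", default_policy)
    return meta


# No per-request policy: the default is applied.
plain = apply_default_policy({})

# Per-request policy present: it is kept as-is.
overridden = apply_default_policy({"referrer_policy": "scrapy-default"})
```

Because the default is written with `setdefault`, no explicit comparison between the setting and the meta key is needed to give the meta key priority.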
4 changes: 4 additions & 0 deletions scrapy_zyte_api/addon.py
@@ -6,6 +6,7 @@

from scrapy_zyte_api import (
ScrapyZyteAPIDownloaderMiddleware,
ScrapyZyteAPIRefererSpiderMiddleware,
ScrapyZyteAPISessionDownloaderMiddleware,
ScrapyZyteAPISpiderMiddleware,
)
@@ -101,6 +102,9 @@ def update_settings(self, settings: BaseSettings) -> None:
667,
)
_setdefault(settings, "SPIDER_MIDDLEWARES", ScrapyZyteAPISpiderMiddleware, 100)
_setdefault(
settings, "SPIDER_MIDDLEWARES", ScrapyZyteAPIRefererSpiderMiddleware, 1000
)
settings.set(
"TWISTED_REACTOR",
"twisted.internet.asyncioreactor.AsyncioSelectorReactor",
1 change: 1 addition & 0 deletions tests/__init__.py
@@ -29,6 +29,7 @@
"REQUEST_FINGERPRINTER_CLASS": "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter",
"SPIDER_MIDDLEWARES": {
"scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware": 100,
"scrapy_zyte_api.ScrapyZyteAPIRefererSpiderMiddleware": 1000,
},
"ZYTE_API_KEY": _API_KEY,
"TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
22 changes: 22 additions & 0 deletions tests/mockserver.py
@@ -72,6 +72,12 @@ def getChild(self, path, request):
return RequestCountResource()
return self

def render_GET(self, request):
referer = request.getHeader(b"Referer")
if referer:
request.responseHeaders.setRawHeaders(b"Referer", [referer])
return b""

def render_POST(self, request):
DefaultResource.request_count += 1
request_data = json.loads(request.content.read())
@@ -184,6 +190,22 @@ def render_POST(self, request):
response_data["httpResponseHeaders"] = [
{"name": "test_header", "value": "test_value"}
]
headers = request_data.get("customHttpRequestHeaders", [])
for header in headers:
if header["name"].strip().lower() == "referer":
referer = header["value"]
break
else:
headers = request_data.get("requestHeaders", {})
if "referer" in headers:
referer = headers["referer"]
else:
referer = None
if referer is not None:
assert isinstance(response_data["httpResponseHeaders"], list)
response_data["httpResponseHeaders"].append(
{"name": "Referer", "value": referer}
)

actions = request_data.get("actions")
if actions:
2 changes: 2 additions & 0 deletions tests/test_addon.py
@@ -8,6 +8,7 @@

from scrapy_zyte_api import (
ScrapyZyteAPIDownloaderMiddleware,
ScrapyZyteAPIRefererSpiderMiddleware,
ScrapyZyteAPISessionDownloaderMiddleware,
ScrapyZyteAPISpiderMiddleware,
)
@@ -148,6 +149,7 @@ def _test_setting_changes(initial_settings, expected_settings):
"REQUEST_FINGERPRINTER_CLASS": "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter",
"SPIDER_MIDDLEWARES": {
ScrapyZyteAPISpiderMiddleware: 100,
ScrapyZyteAPIRefererSpiderMiddleware: 1000,
},
"TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
"ZYTE_API_FALLBACK_HTTPS_HANDLER": "scrapy.core.downloader.handlers.http.HTTPDownloadHandler",
24 changes: 2 additions & 22 deletions tests/test_api_requests.py
@@ -3291,22 +3291,6 @@ async def test_middleware_headers_start_requests():
assert "customHttpRequestHeaders" not in api_params


@ensureDeferred
async def test_middleware_headers_cb_requests():
"""Callback requests will include the Referer parameter if the Referer
middleware is not disabled."""
crawler = await get_crawler({"ZYTE_API_TRANSPARENT_MODE": True})
request = Request(url="https://example.com")
await _process_request(crawler, request)

handler = get_download_handler(crawler, "https")
param_parser = handler._param_parser
api_params = param_parser.parse(request)
assert api_params["customHttpRequestHeaders"] == [
{"name": "Referer", "value": request.url},
]


@ensureDeferred
async def test_middleware_headers_cb_requests_disable():
"""Callback requests will not include the Referer parameter if the Referer
@@ -3370,7 +3354,6 @@ async def test_middleware_headers_default():
param_parser = handler._param_parser
api_params = param_parser.parse(request)
assert api_params["customHttpRequestHeaders"] == [
{"name": "Referer", "value": request.url},
{
"name": "Accept",
"value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
@@ -3482,7 +3465,6 @@ async def test_middleware_headers_request_headers():
"value": DEFAULT_ACCEPT_ENCODING,
},
{"name": "User-Agent", "value": DEFAULT_USER_AGENT},
{"name": "Referer", "value": request.url},
]


@@ -3581,9 +3563,7 @@ async def test_middleware_headers_custom_middleware_before():
handler = get_download_handler(crawler, "https")
param_parser = handler._param_parser
api_params = param_parser.parse(request)
assert api_params["customHttpRequestHeaders"] == [
{"name": "Referer", "value": request.url},
]
assert "customHttpRequestHeaders" not in api_params


class CustomValuesDownloaderMiddleware:
@@ -3620,7 +3600,6 @@ async def test_middleware_headers_custom_middleware_before_custom():
param_parser = handler._param_parser
api_params = param_parser.parse(request)
assert api_params["customHttpRequestHeaders"] == [
{"name": "Referer", "value": "https://referrer.example"},
{
"name": "Accept",
"value": "text/html",
@@ -3631,6 +3610,7 @@ async def test_middleware_headers_custom_middleware_before_custom():
"name": "Accept-Encoding",
"value": "br",
},
{"name": "Referer", "value": "https://referrer.example"},
]

