
Fingerprint for initial request is not saved on redirects #9

Open

mrueegg opened this issue Oct 16, 2016 · 5 comments


mrueegg commented Oct 16, 2016

Hi,

I have a spider that makes use of FormRequest, item loaders, and Request.

Here's an example for a FormRequest:

yield FormRequest(url, callback=callback, formdata=formdata)

Here's one for an item loader:

il = ItemLoader(item=MyResult())
il.add_value('date', response.meta['date'])
yield il.load_item()

And here's one for a request:

page_request = Request(url, callback=self.parse_run_page)
yield page_request

DeltaFetch is enabled and creates a .db file, but on every spider run Scrapy performs all page requests again, so no delta processing happens.

Any ideas? Thanks.


mrueegg commented Nov 23, 2016

The reason for this issue was that the URL I yielded a FormRequest for started with http://, while the server redirected me to the https:// version of the site (same URL, just with HTTPS). DeltaFetch saw these as two different requests and therefore processed the page again on the next run.

Maybe this should be documented in the wiki, and/or there could be an option to treat the http:// and https:// versions of the same page as equivalent.
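
To illustrate: the URL scheme is part of what request_fingerprint hashes, so the http:// and https:// versions of the same page get different fingerprints. A quick sketch (example.com is just a stand-in):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Same page, different scheme -> two different fingerprints,
# so DeltaFetch cannot match them across runs.
print(request_fingerprint(Request('http://example.com/page')))
print(request_fingerprint(Request('https://example.com/page')))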

mrueegg closed this as completed Nov 23, 2016
redapple (Contributor) commented

I don't understand the issue or the behavior you want documented.
Can you explain what's happening with a timeline?


mrueegg commented Nov 23, 2016

I think this could be added to a FAQ or the wiki to help users avoid tedious debugging sessions. When the URL scraped from a page differs only because the server redirects to the HTTPS version, DeltaFetch will process the page again on the next run, which is not obvious.

Maybe the reason why a page is requested again (i.e. why its fingerprint was not found in the DeltaFetch database) could also be logged in debug mode. What do you think?
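
For illustration, a rough sketch of where such a debug message could live in a DeltaFetch-style spider middleware (simplified, not the actual scrapy-deltafetch source; self.db and self._fingerprint are assumed stand-ins for the middleware's store and key function):

from scrapy import Request

class DeltaFetchSketch(object):
    # self.db is assumed to be a dict-like store of already-seen keys;
    # self._fingerprint(request) is an assumed stand-in for however
    # the middleware computes a request's key.
    def process_spider_output(self, response, result, spider):
        for r in result:
            if isinstance(r, Request):
                key = r.meta.get('deltafetch_key') or self._fingerprint(r)
                if key in self.db:
                    # The suggestion: say *why* a request is dropped,
                    # so mismatched fingerprints are easy to spot.
                    spider.logger.debug(
                        "Ignoring already visited %r (key %r)", r, key)
                    continue
            yield r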


redapple commented Dec 9, 2016

Hello @mrueegg,
Sorry it took so long, but I had a look at this again this morning and I think I understand the issue now. I'm a bit slow sometimes ;-)

You are right: when a request is redirected, the DeltaFetch middleware stores the fingerprint of the final request after redirects, not the fingerprint of the initial request.
Here's an example spider showing that:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.utils.request import request_fingerprint


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    start_urls = ['http://httpbin.org/']

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

    def parse(self, response):
        # Log the fingerprint of the request as the spider issues it
        # (before any redirects).
        r = scrapy.Request('http://docs.scrapy.org',
                           callback=self.parse_page)
        self.logger.info("requesting %r (fingerprint: %r)" % (r, request_fingerprint(r)))
        yield r

    def parse_page(self, response):
        # Log the fingerprint of the request that produced the response,
        # i.e. the final request after redirects.
        self.logger.info("parse_page(%r); request %r (fingerprint: %r)" % (
            response, response.request, request_fingerprint(response.request)))
        yield {'url': response.url}

And the logs showing that the saved fingerprint is the one for the last hop of redirects:

$ scrapy crawl httpbin
2016-12-09 14:49:49 [scrapy] INFO: Scrapy 1.2.2 started (bot: deltafetchredirect)
(...)
2016-12-09 14:49:49 [scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2016-12-09 14:49:49 [httpbin] INFO: requesting <GET http://docs.scrapy.org> (fingerprint: 'c96b7ce72fabf56ccbee0cc80e8eaba2f38e5051')
2016-12-09 14:49:49 [scrapy] DEBUG: Redirecting (301) to <GET https://docs.scrapy.org/> from <GET http://docs.scrapy.org>
2016-12-09 14:49:50 [scrapy] DEBUG: Redirecting (302) to <GET https://docs.scrapy.org/en/latest/> from <GET https://docs.scrapy.org/>
2016-12-09 14:49:50 [scrapy] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/> (referer: http://httpbin.org/)
2016-12-09 14:49:50 [httpbin] INFO: parse_page(<200 https://docs.scrapy.org/en/latest/>); request <GET https://docs.scrapy.org/en/latest/> (fingerprint: '04eee400963f6f786a539be3e465ad0f8054e4e7')
2016-12-09 14:49:50 [scrapy] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/>
{'url': 'https://docs.scrapy.org/en/latest/'}
2016-12-09 14:49:50 [scrapy] INFO: Closing spider (finished)
2016-12-09 14:49:50 [scrapy] INFO: Dumping Scrapy stats:
{'deltafetch/stored': 1,
 'downloader/request_bytes': 948,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 37817,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 9, 13, 49, 50, 781514),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2016, 12, 9, 13, 49, 49, 215001)}
2016-12-09 14:49:50 [scrapy] INFO: Spider closed (finished)

$ cd .scrapy/deltafetch/
$ ls
httpbin.db
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb3
>>> db = bsddb3.db.DB()
>>> db.open('httpbin.db')
>>> db
<DB object at 0x7f18f068d880>
>>> db.keys()
['04eee400963f6f786a539be3e465ad0f8054e4e7']
>>> 

The original fingerprint for http://docs.scrapy.org, c96b7ce72fabf56ccbee0cc80e8eaba2f38e5051, does not get saved; instead the one for https://docs.scrapy.org/en/latest/, 04eee400963f6f786a539be3e465ad0f8054e4e7, is. On a subsequent crawl the spider again requests http://docs.scrapy.org (it never requests https://docs.scrapy.org/en/latest/ directly), so DeltaFetch does not see it as a duplicate.

So the issue is confirmed.
The thing is, I don't see an easy way to solve it at the moment.
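
One direction that comes to mind (just a sketch, not a tested patch): Scrapy's RedirectMiddleware records the redirect chain in response.meta['redirect_urls'], with the URL the spider originally requested first. A key derived from that URL would match what the spider issues on the next run:

# Inside the middleware (or a callback): recover the URL that was
# originally requested; falls back to response.url if there was
# no redirect.
original_url = response.meta.get('redirect_urls', [response.url])[0]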

redapple reopened this Dec 9, 2016
redapple changed the title from "Support for FormRequest, Request and page loaders" to "Fingerprint for initial request is not saved on redirects" Dec 9, 2016

KrumBoychev commented Oct 24, 2017

The case can be handled with a custom 'deltafetch_key' in the request meta:

import hashlib

request = scrapy.Request(
    original_url,
    callback=self.parse_item,
    meta={'deltafetch_key': hashlib.sha1(original_url).hexdigest()},
)
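
This works because DeltaFetch prefers a 'deltafetch_key' found in the request meta over the computed fingerprint, and request meta is carried across redirects, so the same key is used both when checking the request and when storing it after the redirected response is scraped. (Note for Python 3: hashlib.sha1 needs bytes, so the key would be hashlib.sha1(original_url.encode('utf-8')).hexdigest().)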
