
Fingerprint for initial request is not saved on redirects #9

Open

mrueegg opened this issue Oct 16, 2016 · 5 comments


mrueegg commented Oct 16, 2016

Hi,

I have a spider that makes use of FormRequest, item loaders, and Request.

Here's an example for a FormRequest:

yield FormRequest(url, callback=callback, formdata=formdata)

Here's one for an item loader:

il = ItemLoader(item=MyResult())
il.add_value('date', response.meta['date'])
yield il.load_item()

And here's one for a request:

page_request = Request(url, callback=self.parse_run_page)
yield page_request

DeltaFetch is enabled and creates a .db file, but on every spider run Scrapy performs all page requests again, so no delta processing happens.

Any ideas? Thanks.


mrueegg commented Nov 23, 2016

The reason for this issue was that the URL I yielded a FormRequest for started with http://, while the server redirected me to the https:// version of the site (same URL, just with HTTPS). DeltaFetch saw these as two different requests and therefore processed the page again on the next run.

Maybe this should be documented in the wiki, and/or there could be an option to treat the http:// and https:// versions of the same page as equivalent.
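
To illustrate: the URL scheme is part of what request_fingerprint hashes, so the http:// and https:// versions of the same page get different fingerprints. A quick sketch (example.com is just a stand-in):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Same page, different scheme -> two different fingerprints,
# so DeltaFetch cannot match them across runs.
print(request_fingerprint(Request('http://example.com/page')))
print(request_fingerprint(Request('https://example.com/page')))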

mrueegg closed this as completed Nov 23, 2016
redapple (Contributor) commented

I don't understand the issue or the behavior you want documented.
Can you explain what's happening with a timeline?


mrueegg commented Nov 23, 2016

I think this could be added to a FAQ or the wiki to help users avoid tedious debugging sessions. When the URL scraped from a page differs only because the server redirects to the HTTPS version, DeltaFetch will process the page again on the next run, which is not obvious.

Maybe the reason why a page is requested again (i.e. why its fingerprint was not found in the DeltaFetch database) could also be logged in debug mode. What do you think?
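
For illustration, a rough sketch of where such a debug message could live in a DeltaFetch-style spider middleware (simplified, not the actual scrapy-deltafetch source; self.db and self._fingerprint are assumed stand-ins for the middleware's store and key function):

from scrapy import Request

class DeltaFetchSketch(object):
    # self.db is assumed to be a dict-like store of already-seen keys;
    # self._fingerprint(request) is an assumed stand-in for however
    # the middleware computes a request's key.
    def process_spider_output(self, response, result, spider):
        for r in result:
            if isinstance(r, Request):
                key = r.meta.get('deltafetch_key') or self._fingerprint(r)
                if key in self.db:
                    # The suggestion: say *why* a request is dropped,
                    # so mismatched fingerprints are easy to spot.
                    spider.logger.debug(
                        "Ignoring already visited %r (key %r)", r, key)
                    continue
            yield r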


redapple commented Dec 9, 2016

Hello @mrueegg,
Sorry it took so long, but I had a look at this again this morning and I think I understand the issue now. I'm a bit slow sometimes ;-)

You are right: when a request is redirected, the DeltaFetch middleware stores the fingerprint of the final request after redirects, not the fingerprint of the initial request.
Here's an example spider showing that:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.utils.request import request_fingerprint


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    start_urls = ['http://httpbin.org/']

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

    def parse(self, response):
        # Log the fingerprint of the request as the spider issues it
        # (before any redirects).
        r = scrapy.Request('http://docs.scrapy.org',
                           callback=self.parse_page)
        self.logger.info("requesting %r (fingerprint: %r)" % (r, request_fingerprint(r)))
        yield r

    def parse_page(self, response):
        # Log the fingerprint of the request that produced the response,
        # i.e. the final request after redirects.
        self.logger.info("parse_page(%r); request %r (fingerprint: %r)" % (
            response, response.request, request_fingerprint(response.request)))
        yield {'url': response.url}

And the logs showing that the saved fingerprint is the one for the last hop of redirects:

$ scrapy crawl httpbin
2016-12-09 14:49:49 [scrapy] INFO: Scrapy 1.2.2 started (bot: deltafetchredirect)
(...)
2016-12-09 14:49:49 [scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2016-12-09 14:49:49 [httpbin] INFO: requesting <GET http://docs.scrapy.org> (fingerprint: 'c96b7ce72fabf56ccbee0cc80e8eaba2f38e5051')
2016-12-09 14:49:49 [scrapy] DEBUG: Redirecting (301) to <GET https://docs.scrapy.org/> from <GET http://docs.scrapy.org>
2016-12-09 14:49:50 [scrapy] DEBUG: Redirecting (302) to <GET https://docs.scrapy.org/en/latest/> from <GET https://docs.scrapy.org/>
2016-12-09 14:49:50 [scrapy] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/> (referer: http://httpbin.org/)
2016-12-09 14:49:50 [httpbin] INFO: parse_page(<200 https://docs.scrapy.org/en/latest/>); request <GET https://docs.scrapy.org/en/latest/> (fingerprint: '04eee400963f6f786a539be3e465ad0f8054e4e7')
2016-12-09 14:49:50 [scrapy] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/>
{'url': 'https://docs.scrapy.org/en/latest/'}
2016-12-09 14:49:50 [scrapy] INFO: Closing spider (finished)
2016-12-09 14:49:50 [scrapy] INFO: Dumping Scrapy stats:
{'deltafetch/stored': 1,
 'downloader/request_bytes': 948,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 37817,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 9, 13, 49, 50, 781514),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2016, 12, 9, 13, 49, 49, 215001)}
2016-12-09 14:49:50 [scrapy] INFO: Spider closed (finished)

$ cd .scrapy/deltafetch/
$ ls
httpbin.db
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb3
>>> db = bsddb3.db.DB()
>>> db.open('httpbin.db')
>>> db
<DB object at 0x7f18f068d880>
>>> db.keys()
['04eee400963f6f786a539be3e465ad0f8054e4e7']
>>> 

The original fingerprint for http://docs.scrapy.org, c96b7ce72fabf56ccbee0cc80e8eaba2f38e5051, does not get saved; instead the one for https://docs.scrapy.org/en/latest/, 04eee400963f6f786a539be3e465ad0f8054e4e7, is. On a subsequent crawl the spider again requests http://docs.scrapy.org (it never requests https://docs.scrapy.org/en/latest/ directly), so DeltaFetch does not see it as a duplicate.

So the issue is confirmed.
The thing is, I don't see an easy way to solve it at the moment.
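
One direction that comes to mind (just a sketch, not a tested patch): Scrapy's RedirectMiddleware records the redirect chain in response.meta['redirect_urls'], with the URL the spider originally requested first. A key derived from that URL would match what the spider issues on the next run:

# Inside the middleware (or a callback): recover the URL that was
# originally requested; falls back to response.url if there was
# no redirect.
original_url = response.meta.get('redirect_urls', [response.url])[0]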

redapple reopened this Dec 9, 2016
redapple changed the title from "Support for FormRequest, Request and page loaders" to "Fingerprint for initial request is not saved on redirects" Dec 9, 2016

KrumBoychev commented Oct 24, 2017

The case can be handled with a custom 'deltafetch_key' in the request meta:

import hashlib

request = scrapy.Request(
    original_url,
    callback=self.parse_item,
    meta={'deltafetch_key': hashlib.sha1(original_url).hexdigest()},
)
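
This works because DeltaFetch prefers a 'deltafetch_key' found in the request meta over the computed fingerprint, and request meta is carried across redirects, so the same key is used both when checking the request and when storing it after the redirected response is scraped. (Note for Python 3: hashlib.sha1 needs bytes, so the key would be hashlib.sha1(original_url.encode('utf-8')).hexdigest().)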
