3D grphique: Scrapy follow 302 redirect immediately

Friday, 20 February 2015

I'm using Scrapy on an ASP.NET site where all of the links follow the same pattern:


javascript:__doPostBack('gridID','Select$0')
javascript:__doPostBack('gridID','Select$1')
....

I am able to use a FormRequest to follow the link to the detail page for any record:


    # First grab all of the Details links; everything we need can be reached from them.
    for sel in response.xpath("//table[@id='gridID']/tr[td]")[0:20]:
        # The href looks like javascript:__doPostBack('gridID','Select$0'),
        # so splitting on "'" puts the target at index 1 and the argument at index 3.
        hrefParts = sel.xpath("td")[0].xpath("a/@href").extract()[0].split("'")
        thisTarget = hrefParts[1]
        thisArg = hrefParts[3]
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                '__EVENTTARGET': thisTarget,
                '__EVENTARGUMENT': thisArg,
                '__EVENTVALIDATION': response.xpath("//input[@id='__EVENTVALIDATION']/@value").extract()[0],
                '__VIEWSTATE': response.xpath("//input[@id='__VIEWSTATE']/@value").extract()[0],
            },
            dont_click=True,
            callback=self.parseDetail,
            dont_filter=True,
        )
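
For reference, the target and argument can also be pulled out of the href with a regular expression instead of positional splitting. This is only a minimal sketch: the helper name `parse_postback` and the pattern are assumptions for illustration, not part of the spider above, and the pattern only covers hrefs shaped like the two examples shown earlier.

```python
import re

# Matches hrefs such as: javascript:__doPostBack('gridID','Select$0')
# Assumption: both arguments are single-quoted and contain no quote characters.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) for a __doPostBack href, or None."""
    match = POSTBACK_RE.search(href)
    if match is None:
        # Not a postback link (for example a plain URL); let the caller decide what to do.
        return None
    return match.group(1), match.group(2)
```

Calling `parse_postback("javascript:__doPostBack('gridID','Select$0')")` would give the two values to use for `__EVENTTARGET` and `__EVENTARGUMENT`; a `None` result makes it easy to skip rows whose first cell holds a different kind of link.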

But when Scrapy processes several rows at a time, it sends the requests in batches: all of the POSTs go out first, and only then are the redirects followed. Five rows at once result in:


2015-02-20 22:41:19-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:20-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:20-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:21-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:22-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:22-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:23-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:23-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:24-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:24-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed

As a result, all five responses appear to be identical. Every POST redirects to the same URL, so I suppose this comes from some server-side ASP state that only remembers the most recent selection.

I have tried setting REDIRECT_PRIORITY_ADJUST = 100 to give the redirects a higher priority, with limited success. The best this achieved was that Scrapy sent 16 initial requests, then followed those 16 redirects, then sent another batch of initial requests, and so on.
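
The batch size of 16 matches Scrapy's default for `CONCURRENT_REQUESTS`, so one thing that may be worth trying is lowering that setting to 1 alongside the priority adjustment. The fragment below is only a sketch of that idea, not a confirmed fix: both setting names are real Scrapy settings, but the values and the dictionary name `SERIAL_POSTBACK_SETTINGS` are assumptions, and serialising the crawl this way will make it slower.

```python
# Hypothetical settings fragment, e.g. assigned to a spider's custom_settings
# attribute or copied into settings.py. The values are assumptions to experiment with.
SERIAL_POSTBACK_SETTINGS = {
    # Allow only one request in flight, with the aim that each POST's
    # redirect is fetched before the next POST is sent.
    "CONCURRENT_REQUESTS": 1,
    # A positive adjustment gives redirected requests a higher priority
    # than the requests still waiting in the scheduler.
    "REDIRECT_PRIORITY_ADJUST": 100,
}
```

Whether this is enough depends on how the site ties the selected record to the session, so the responses would still need to be checked against the rows that were requested.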

When I do this manually in the Scrapy shell, fetching each FormRequest in turn, the redirect is followed immediately and I get the expected response, even when fetching several requests in a row.

Thus, my question:

Is there any way to get Scrapy to process a request all the way to an HTTP 200 response, following any redirects along the way immediately?
