I'm using Scrapy on an ASP site where all of the links are similar:
javascript:__doPostBack('gridID','Select$0')
javascript:__doPostBack('gridID','Select$1')
....
I am able to use a FormRequest to follow the link to the detail page for any record:
# Let's first grab all of the Details links -- we can get everything from them that we want
for sel in response.xpath("//table[@id='gridID']/tr[td]")[0:20]:
    # The href looks like javascript:__doPostBack('gridID','Select$0')
    href = sel.xpath("td")[0].xpath("a/@href").extract()[0]
    thisTarget = href.split("'")[1]
    thisArg = href.split("'")[3]
    yield scrapy.FormRequest.from_response(
        response,
        formdata={
            '__EVENTTARGET': thisTarget,
            '__EVENTARGUMENT': thisArg,
            '__EVENTVALIDATION': response.xpath("//input[@id='__EVENTVALIDATION']/@value").extract()[0],
            '__VIEWSTATE': response.xpath("//input[@id='__VIEWSTATE']/@value").extract()[0],
        },
        dont_click=True,
        callback=self.parseDetail,
        dont_filter=True,
    )
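For reference, the split("'") parsing above can also be written as a small standalone helper. This is only a minimal sketch: parse_postback is a hypothetical name of my own, not part of Scrapy, and it assumes every link has exactly the javascript:__doPostBack('target','argument') shape shown earlier.

```python
import re

# Hypothetical helper (not part of Scrapy): pull the two quoted arguments
# out of a javascript:__doPostBack('target','argument') href.
_POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback link."""
    match = _POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_postback("javascript:__doPostBack('gridID','Select$0')"))
# prints ('gridID', 'Select$0')
```

Returning None for anything that does not match makes it easy to skip rows whose first cell is not a postback link, instead of failing with an IndexError.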
But when Scrapy processes multiple items at a time, it makes the requests in batches. Five rows at once would result in:
2015-02-20 22:41:19-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:20-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:20-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:21-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:22-0500 [spider] DEBUG: Redirecting (302) to <GET http://ift.tt/1AX0QD4> from <POST http://ift.tt/1AX0QTi>
2015-02-20 22:41:22-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:23-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:23-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:24-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
2015-02-20 22:41:24-0500 [spider] DEBUG: Crawled (200) <GET http://ift.tt/1AX0QD4> (referer: http://ift.tt/17COi67)
### Callback executed
This seems to lead to all 5 of the responses being identical, a result of some of the ASP magic, I suppose.
I have tried setting REDIRECT_PRIORITY_ADJUST = 100 to give the redirects more priority, with limited success. The best this has done is to stop after 16 initial requests, do 16 of the redirects, then start another batch of initial requests, and so on.
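The batch size of 16 matches Scrapy's default CONCURRENT_REQUESTS value, so one experiment that might be worth trying is to combine the priority adjustment with a concurrency of 1. The fragment below is only a sketch of that idea. It assumes the identical responses are caused by several POSTs being in flight before their redirects are followed, and I have not confirmed that it fixes the problem.

```python
# settings.py (sketch, not a verified fix)

# Already tried above: schedule redirected requests ahead of new ones.
REDIRECT_PRIORITY_ADJUST = 100

# Assumption: with a single request in flight, each POST's 302 redirect
# should be the next request processed before another POST is sent.
CONCURRENT_REQUESTS = 1
```

The obvious cost is speed, since the crawl would no longer run requests in parallel.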
When I do things manually in scrapy shell, by fetching each of the FormRequests, the redirect is processed immediately and I get the expected response, even when fetching multiple requests in a row.
Thus, my question:
Is there any way to get Scrapy to process a request all the way to an HTTP 200 response, executing any redirects along the way immediately?
Or ... any other solution to my problem that may not be obvious?
Scrapy follow 302 redirect immediately