I'm using CrawlSpider to crawl a website. I have multiple start URLs, and each page has a "next" link pointing to another, similar page. I use rules to follow the next page:
    rules = (
        Rule(SgmlLinkExtractor(allow=('/',),
                               restrict_xpaths=('//span[@class="next"]')),
             callback='parse_item',
             follow=True),
    )
When there is only one URL in start_urls, everything works. However, when there are many URLs in start_urls, I get "Ignoring response <404 a url> : HTTP status code is not handled or not allowed".
How can I start with the first URL in start_urls, follow all of its "next" links, and only then move on to the second URL in start_urls?
Here is my code:

    import codecs

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector


    class DoubanSpider(CrawlSpider):
        name = "doubanBook"
        allowed_domains = ["book.douban.com"]

        # One start URL per line in category.txt
        start_urls = []
        with codecs.open("category.txt", "r", encoding="utf-8") as category:
            for line in category:
                start_urls.append(line.strip())

        # Follow the link inside <span class="next"> on every page
        rules = (
            Rule(SgmlLinkExtractor(allow=('/',),
                                   restrict_xpaths=('//span[@class="next"]')),
                 callback='parse_item',
                 follow=True),
        )

        def parse_item(self, response):
            sel = Selector(response)
            sites = sel.xpath('//ul/li/div[@class="info"]/h2')
            with open("alllink.txt", "a") as out:
                for site in sites:
                    href = site.xpath('a/@href').extract()[0]
                    title = site.xpath('a/@title').extract()[0]
                    out.write("***")
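To make the ordering I'm asking for concrete, here is a minimal sketch in plain Python, without Scrapy. It is only an illustration of the desired visit order, not of how Scrapy schedules requests (Scrapy issues requests concurrently and does not guarantee this order by default). The names `crawl_in_order`, `get_next`, and `NEXT_LINKS`, as well as the short URLs, are hypothetical and stand in for the real pages.

```python
from typing import Callable, Iterable, Iterator, Optional


def crawl_in_order(
    start_urls: Iterable[str],
    get_next: Callable[[str], Optional[str]],
) -> Iterator[str]:
    """Yield pages so that each start URL's chain of "next" links is
    exhausted before the following start URL is visited."""
    for url in start_urls:
        current: Optional[str] = url
        seen = set()  # guard against a "next" link that loops back
        while current is not None and current not in seen:
            seen.add(current)
            yield current
            current = get_next(current)


# Hypothetical link graph: maps a page to the page its "next" link points at.
NEXT_LINKS = {
    "cat-a?page=1": "cat-a?page=2",
    "cat-a?page=2": "cat-a?page=3",
    "cat-b?page=1": "cat-b?page=2",
}

# dict.get returns None for the last page of a chain, which ends the loop.
order = list(crawl_in_order(["cat-a?page=1", "cat-b?page=1"], NEXT_LINKS.get))
print(order)
```

All pages of the first category come out before any page of the second one, which is the behaviour I would like the spider to have.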
Using multiple start_urls in CrawlSpider