Friday, March 7, 2014

Python Recursive Scraping with Scrapy


Vote count: 0

I'm trying to make a scraper that will pull links, titles, prices, and the body of posts on craigslist. I can get the prices, but the spider returns the price for every listing on the page, not just the one for the specific row. I also can't get it to move on to the next page and keep scraping.


This is the tutorial I am using - http://ift.tt/NGsrkC


I've tried suggestions from this thread, but still can't make it work - Scrapy Python Craigslist Scraper


The page I'm trying to scrape is - http://ift.tt/1hHcpSC


On the line that sets item['price'], if I remove the // before span[@class="l2"] it returns no prices, but if I leave it there it pulls in every price on the page.
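From what I've read, // searches the whole document while .// searches relative to the current node, so I'm guessing I need to select each whole row first and then use row-relative paths, something like this (the row class name is just a guess, and I haven't gotten this working):

rows = hxs.select('//p[@class="row"]')  # guessing at the row container class
for row in rows:
    item = CraigslistSampleItem()
    item['title'] = row.select('.//span[@class="pl"]/a/text()').extract()
    item['link'] = row.select('.//span[@class="pl"]/a/@href').extract()
    # the leading .// keeps the search inside this row instead of the whole page
    item['price'] = row.select('.//span[@class="price"]/text()').extract()

Is that the right idea, or am I misunderstanding how the selector context works?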


For the rules, I've tried playing with the class tags, but the spider seems to stay stuck on the first page. I'm thinking I might need separate spider classes?


Here is my code:



#-------------------------------------------------------------------------------
# Name: module1
# Purpose:
#
# Author: CD
#
# Created: 02/03/2014
# Copyright: (c) CD 2014
# Licence: <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *
import sys

class PageSpider(BaseSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://ift.tt/1hHcpSC"]

    rules = (Rule(SgmlLinkExtractor(allow=("index\d00\.html", ),
                                    restrict_xpaths=('//span[@class="button next"]',)),
                  callback="parse", follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"] | //span[@class="l2"]')

        for title in titles:
            item = CraigslistSampleItem()
            item['title'] = title.select("a/text()").extract()
            item['link'] = title.select("a/@href").extract()
            item['price'] = title.select('//span[@class="l2"]//span[@class="price"]/text()').extract()

            url = 'http://ift.tt/1oxOQfe{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item
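
Based on the Scrapy docs, here is roughly what I think the pagination part ought to look like, though I haven't gotten it working; the callback name, the parse_start_url override, and the row XPath are just my guesses:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class PageSpider(CrawlSpider):  # rules are only processed by CrawlSpider, not BaseSpider
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://ift.tt/1hHcpSC"]

    rules = (
        # follow the "next page" links and hand each listing page to parse_page;
        # the callback apparently can't be named parse, since CrawlSpider uses parse internally
        Rule(SgmlLinkExtractor(allow=("index\d00\.html", ),
                               restrict_xpaths=('//span[@class="button next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # so the first page also gets scraped, not only the pages the rule follows
        return self.parse_page(response)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//p[@class="row"]'):  # guessing the row container class
            # build the item here with row-relative selects (.//) as above
            pass

Does that look like the right structure, or do I really need separate spider classes?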


asked 1 min ago





