Friday, March 7, 2014

Python Recursive Scraping with Scrapy


Vote count: 0

I'm trying to make a scraper that will pull links, titles, prices, and the body of posts on craigslist. I can get the prices, but the spider returns the price for every listing on the page, not just the one for the specific row. I also can't get it to move on to the next page and keep scraping.


This is the tutorial I am using - http://ift.tt/NGsrkC


I've tried suggestions from this thread, but still can't make it work - Scrapy Python Craigslist Scraper


The page I'm trying to scrape is - http://ift.tt/1hHcpSC


On the line that sets item['price'], if I remove the // before span[@class="l2"] it returns no prices, but if I leave it there it pulls in every price on the page.
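From what I've read, // searches the whole document while .// searches relative to the current node, so I'm guessing I need to select each whole row first and then use row-relative paths, something like this (the row class name is just a guess, and I haven't gotten this working):

rows = hxs.select('//p[@class="row"]')  # guessing at the row container class
for row in rows:
    item = CraigslistSampleItem()
    item['title'] = row.select('.//span[@class="pl"]/a/text()').extract()
    item['link'] = row.select('.//span[@class="pl"]/a/@href').extract()
    # the leading .// keeps the search inside this row instead of the whole page
    item['price'] = row.select('.//span[@class="price"]/text()').extract()

Is that the right idea, or am I misunderstanding how the selector context works?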


For the rules, I've tried playing with the class tags, but the spider seems to stay stuck on the first page. I'm thinking I might need separate spider classes?


Here is my code:



#-------------------------------------------------------------------------------
# Name: module1
# Purpose:
#
# Author: CD
#
# Created: 02/03/2014
# Copyright: (c) CD 2014
# Licence: <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *
import sys

class PageSpider(BaseSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://ift.tt/1hHcpSC"]

    rules = (Rule(SgmlLinkExtractor(allow=("index\d00\.html", ),
                                    restrict_xpaths=('//span[@class="button next"]',)),
                  callback="parse", follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"] | //span[@class="l2"]')

        for title in titles:
            item = CraigslistSampleItem()
            item['title'] = title.select("a/text()").extract()
            item['link'] = title.select("a/@href").extract()
            item['price'] = title.select('//span[@class="l2"]//span[@class="price"]/text()').extract()

            url = 'http://ift.tt/1oxOQfe{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item
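
Based on the Scrapy docs, here is roughly what I think the pagination part ought to look like, though I haven't gotten it working; the callback name, the parse_start_url override, and the row XPath are just my guesses:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class PageSpider(CrawlSpider):  # rules are only processed by CrawlSpider, not BaseSpider
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://ift.tt/1hHcpSC"]

    rules = (
        # follow the "next page" links and hand each listing page to parse_page;
        # the callback apparently can't be named parse, since CrawlSpider uses parse internally
        Rule(SgmlLinkExtractor(allow=("index\d00\.html", ),
                               restrict_xpaths=('//span[@class="button next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # so the first page also gets scraped, not only the pages the rule follows
        return self.parse_page(response)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//p[@class="row"]'):  # guessing the row container class
            # build the item here with row-relative selects (.//) as above
            pass

Does that look like the right structure, or do I really need separate spider classes?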


asked 1 min ago





