The Python Oracle

Scrapy CrawlSpider for AJAX content

Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn

--

Music by Eric Matyas
https://www.soundimage.org
Track title: Popsicle Puzzles

--

Chapters
00:00 Question
02:54 Accepted answer (Score 13)
04:29 Thank you

--

Full question
https://stackoverflow.com/questions/2370...

Question links:
http://example.com/symbol/TSLA
[http://example.com/account/ajax_headline...]: http://example.com/account/ajax_headline...
[Scrapy Crawl URLs in Order]: https://stackoverflow.com/questions/6566...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #webscraping #scrapy

#avk47



ACCEPTED ANSWER

Score 13


Crawl spider may be too limited for your purposes here. If you need a lot of logic you are usually better off inheriting from Spider.

Scrapy provides CloseSpider exception that can be raised when you need to stop parsing under certain conditions. The page you are crawling returns a message "There are no Focus articles on your stocks", when you exceed maximum page, you can check for this message and stop iteration when this message occurs.

In your case you can go with something like this:

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.exceptions import CloseSpider

class ExampleSite(Spider):
    name = "so"
    download_delay = 0.1

    more_pages = True
    next_page = 1

    start_urls = ['http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0'+
                      '&slugs=tsla&is_symbol_page=true']

    allowed_domains = ['example.com']

    def create_ajax_request(self, page_number):
        """
        Helper function to create ajax request for next page.
        """
        ajax_template = 'http://example.com/account/ajax_headlines_content?type=in_focus_articles&page={pagenum}&slugs=tsla&is_symbol_page=true'

        url = ajax_template.format(pagenum=page_number)
        return Request(url, callback=self.parse)

    def parse(self, response):
        """
        Parsing of each page.
        """
        if "There are no Focus articles on your stocks." in response.body:
            self.log("About to close spider", log.WARNING)
            raise CloseSpider(reason="no more pages to parse")


        # there is some content extract links to articles
        sel = Selector(response)
        links_xpath = "//div[@class='symbol_article']/a/@href"
        links = sel.xpath(links_xpath).extract()
        for link in links:
            url = urljoin(response.url, link)
            # follow link to article
            # commented out to see how pagination works
            #yield Request(url, callback=self.parse_item)

        # generate request for next page
        self.next_page += 1
        yield self.create_ajax_request(self.next_page)

    def parse_item(self, response):
        """
        Parsing of each article page.
        """
        self.log("Scraping: %s" % response.url, level=log.INFO)

        hxs = Selector(response)

        item = NewsItem()

        item['url'] = response.url
        item['source'] = 'example'
        item['title'] = hxs.xpath('//title/text()')
        item['date'] = hxs.xpath('//div[@class="article_info_pos"]/span/text()')

        yield item