Scrapy CrawlSpider for AJAX content
--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Puzzle Game 5
--
Chapters
00:00 Scrapy Crawlspider For Ajax Content
02:17 Accepted Answer Score 13
03:34 Thank you
--
Full question
https://stackoverflow.com/questions/2370...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #webscraping #scrapy
#avk47
ACCEPTED ANSWER
Score 13
CrawlSpider may be too limited for your purposes here. If you need a lot of custom logic, you are usually better off inheriting from Spider.
Scrapy provides the CloseSpider exception, which can be raised when you need to stop parsing under certain conditions. The page you are crawling returns the message "There are no Focus articles on your stocks" when you exceed the maximum page, so you can check for that message and stop iterating when it appears.
In your case you can go with something like this:
from urlparse import urljoin

from scrapy import log
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider


class ExampleSite(Spider):
    name = "so"
    download_delay = 0.1
    more_pages = True
    next_page = 1
    start_urls = ['http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0'+
                  '&slugs=tsla&is_symbol_page=true']
    allowed_domains = ['example.com']

    def create_ajax_request(self, page_number):
        """
        Helper function to create an ajax request for the next page.
        """
        ajax_template = 'http://example.com/account/ajax_headlines_content?type=in_focus_articles&page={pagenum}&slugs=tsla&is_symbol_page=true'
        url = ajax_template.format(pagenum=page_number)
        return Request(url, callback=self.parse)

    def parse(self, response):
        """
        Parsing of each listing page.
        """
        # the site returns this message once we page past the last article
        if "There are no Focus articles on your stocks." in response.body:
            self.log("About to close spider", level=log.WARNING)
            raise CloseSpider(reason="no more pages to parse")

        # there is some content; extract links to articles
        sel = Selector(response)
        links_xpath = "//div[@class='symbol_article']/a/@href"
        links = sel.xpath(links_xpath).extract()
        for link in links:
            url = urljoin(response.url, link)
            # follow link to article
            # commented out to see how pagination works
            #yield Request(url, callback=self.parse_item)

        # generate request for next page
        self.next_page += 1
        yield self.create_ajax_request(self.next_page)

    def parse_item(self, response):
        """
        Parsing of each article page.
        """
        self.log("Scraping: %s" % response.url, level=log.INFO)
        hxs = Selector(response)
        item = NewsItem()  # your item class, defined in items.py
        item['url'] = response.url
        item['source'] = 'example'
        item['title'] = hxs.xpath('//title/text()').extract()
        item['date'] = hxs.xpath('//div[@class="article_info_pos"]/span/text()').extract()
        yield item