Scrapy CrawlSpider for AJAX content
Become part of the top 3% of developers by applying to Toptal https://topt.al/25cXVn
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Popsicle Puzzles
--
Chapters
00:00 Question
02:54 Accepted answer (Score 13)
04:29 Thank you
--
Full question
https://stackoverflow.com/questions/2370...
Question links:
http://example.com/symbol/TSLA
[http://example.com/account/ajax_headline...]: http://example.com/account/ajax_headline...
[Scrapy Crawl URLs in Order]: https://stackoverflow.com/questions/6566...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #webscraping #scrapy
#avk47
--
ACCEPTED ANSWER
Score 13
CrawlSpider may be too limited for your purposes here. If you need a lot of logic, you are usually better off inheriting from Spider.
Scrapy provides the CloseSpider exception, which can be raised when you need to stop parsing under certain conditions. The page you are crawling returns the message "There are no Focus articles on your stocks" when you exceed the maximum page, so you can check for this message and stop iterating when it occurs.
In your case you can go with something like this:
from urlparse import urljoin

from scrapy import log
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.exceptions import CloseSpider


class ExampleSite(Spider):
    name = "so"
    download_delay = 0.1
    next_page = 0  # page 0 is fetched via start_urls
    start_urls = ['http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0' +
                  '&slugs=tsla&is_symbol_page=true']
    allowed_domains = ['example.com']

    def create_ajax_request(self, page_number):
        """
        Helper function to create the ajax request for the next page.
        """
        ajax_template = ('http://example.com/account/ajax_headlines_content'
                         '?type=in_focus_articles&page={pagenum}&slugs=tsla&is_symbol_page=true')
        url = ajax_template.format(pagenum=page_number)
        return Request(url, callback=self.parse)

    def parse(self, response):
        """
        Parsing of each listing page.
        """
        # the site returns this message once the last page has been exceeded
        if "There are no Focus articles on your stocks." in response.body:
            self.log("About to close spider", level=log.WARNING)
            raise CloseSpider(reason="no more pages to parse")

        # there is some content - extract links to articles
        sel = Selector(response)
        links_xpath = "//div[@class='symbol_article']/a/@href"
        links = sel.xpath(links_xpath).extract()
        for link in links:
            url = urljoin(response.url, link)
            # follow link to article
            # commented out to see how pagination works
            # yield Request(url, callback=self.parse_item)

        # generate request for next page
        self.next_page += 1
        yield self.create_ajax_request(self.next_page)

    def parse_item(self, response):
        """
        Parsing of each article page.
        """
        self.log("Scraping: %s" % response.url, level=log.INFO)
        hxs = Selector(response)
        item = NewsItem()  # item class defined in your project's items.py
        item['url'] = response.url
        item['source'] = 'example'
        item['title'] = hxs.xpath('//title/text()').extract()
        item['date'] = hxs.xpath('//div[@class="article_info_pos"]/span/text()').extract()
        yield item
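The spider above references a NewsItem class that is not shown in the answer; it is presumably defined in the project's items.py. A minimal sketch of what it might look like, with the field names taken from the code above:

from scrapy.item import Item, Field


class NewsItem(Item):
    url = Field()
    source = Field()
    title = Field()
    date = Field()

With that in place the spider can be run as usual, e.g. with scrapy crawl so; once you are happy with the pagination behaviour, re-enable the commented-out Request in parse so that article pages are actually followed to parse_item.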