Python - Scrapy: get date picker values [Selenium or Scrapy-Splash]
Disclaimer: I have searched and tried the working examples I found on SO, but I have been unable to achieve the result I seek.
I am trying to scrape some values from newspaperarchive.com. Among these values are the dates (year, month and day) on which a paper was published. NewspaperArchive uses a date picker UI and loads its content through JavaScript/AJAX calls (I am not entirely sure).
I am trying to get those dates: NewspaperArchive provides a date picker that loads and marks each date on which a paper was published.
What I want to find out, and possibly understand, is:
- Whether this can be achieved with Scrapy-Splash.
- How I can achieve it with Selenium if Scrapy-Splash wouldn't work for my use case.
- Sample code that I can learn from for future cases would be even more helpful.
Here is an example page on newspaperarchive.com: http://newspaperarchive.com/us/hawaii/honolulu/hawaiian-gazette/
The values are: year = 1895, month = February, days = 1, 5, 8, 12, 15, 19, 22, 26. The scraper should then continue to loop through the remaining dates for that year and for the other years available in the date picker for the newspaper.
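To make the expected result concrete, here is a minimal sketch of the output format I have in mind. It only shows how the picker values listed above would be combined into dates, not how to read them from the page. `build_dates` and `MONTHS` are hypothetical names of my own, not part of Scrapy, Scrapy-Splash, or Selenium.

```python
from datetime import date

# English month names, listed explicitly so the lookup does not depend on locale.
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def build_dates(year, month_name, days):
    """Combine a year, an English month name, and day numbers into sorted dates."""
    month = MONTHS.index(month_name.strip().lower()) + 1
    return [date(int(year), month, day) for day in sorted(int(d) for d in days)]


# Values taken from the example above (Hawaiian Gazette, February 1895).
published = build_dates(1895, "February", [1, 5, 8, 12, 15, 19, 22, 26])
print([d.isoformat() for d in published])
```

Each marked day in the picker would become one `datetime.date`, so a joined string of ISO dates could then be stored in `item['paper_dates']`.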
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# NewspaperArchiveItem is defined in the project's items module
# (its import path is not shown in this post).


class NewspaperArchiveSpider(CrawlSpider):
    name = "newspaperarchive"
    allowed_domains = ["newspaperarchive.com"]
    paper_link = [
        "http://newspaperarchive.com/us/alabama/rainsville/"
    ]
    start_urls = [paper for paper in paper_link]

    rules = (
        # Follow each newspaper link and parse the page to grab the data.
        Rule(LinkExtractor(restrict_xpaths=(
            '//li[@class="blurlink"]/a[@href]')),
            callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        self.log('Parsing data page %s' % response.url, logging.INFO)
        item = NewspaperArchiveItem()
        # State, city, and paper name come from the breadcrumb list.
        item['paper_name'] = response.xpath(
            '//div[@class="newbrc"]//li[6]/text()').extract()
        item['paper_state'] = response.xpath(
            '//div[@class="newbrc"]//li[4]/a/text()').extract()
        item['paper_city'] = response.xpath(
            '//div[@class="newbrc"]//li[5]/a/text()').extract()
        # Currently only the banner heading text, not the date picker values.
        item['paper_dates'] = ' '.join(response.xpath(
            '//div[@class="span7 banner-img-txt"]//h1/text()'
        ).extract()).strip()
        return item
Thanks for taking the time to read this. It is appreciated. Note: I am open to other methods I can use to achieve this task.