Scrapy is skipping the first page in pagination

The following Scrapy CrawlSpider code scrapes links by following pagination on the data.ok.gov page.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor

class OklahomaFinanceSpider(CrawlSpider):
    name = "OklahomaFinanceSpider"
    allowed_domains = ["data.ok.gov"]
    start_urls = [
        "http://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        for href in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "),"search-results apachesolr_search-results")]/h3/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
However, the first page is not being scraped. What mistake am I making with the Rules?

Posting @paul trmbrth's comment as an answer here.
To parse the pages fetched from start_urls, set parse_start_url = parse_page after the parse_page(self, response) definition.
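A minimal sketch of how that looks in the spider above (only the final alias line is new; everything else is unchanged):
class OklahomaFinanceSpider(CrawlSpider):
    # ... name, allowed_domains, start_urls and rules exactly as above ...

    def parse_page(self, response):
        for href in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "),"search-results apachesolr_search-results")]/h3/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # CrawlSpider routes start_urls responses through parse_start_url,
    # which is a no-op by default; aliasing it to parse_page makes the
    # first page go through the same extraction as the paginated pages.
    parse_start_url = parse_page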

Related

Scrapy request callback doesn't enter next function

Here is my code:
import scrapy

class BookingSpider(scrapy.Spider):
    name = 'booking-spider'
    allowed_domains = ['booking.com']
    start_urls = [
        'https://www.booking.com/country.de.html?aid=356980;label=gog235jc-1DCAIoLDgcSAdYA2gsiAEBmAEHuAEHyAEP2AED6AEB'
        '-AECiAIBqAIDuAK7q7DyBcACAQ;sid=8de61678ac61d10a89c13a3941fd3dcd'
    ]

    # get country page
    def parse(self, response):
        for countryurl in response.xpath('//a[contains(text(),"Schweiz")]/@href'):
            url = response.urljoin(countryurl.extract())
            print("COUNTRYURL", url)
            yield scrapy.Request(url, callback=self.parse_country)

    # get page of all hotels in a country
    def parse_country(self, response):
        for hotelsurl in response.xpath('//a[@class="bui-button bui-button--secondary"]/@href'):
            url = response.urljoin(hotelsurl.extract())
            print("HOTELURL", url)
            yield scrapy.Request(url, callback=self.parse_hotel)

    def parse_hotel(self, response):
        print("entering parse_hotel")
        hotelurl = response.xpath('//*[(@id = "hp_hotel_name")]')
        print("URL", hotelurl)
It never enters the parse_hotel function, and I can't understand why.
Where is my mistake? Thank you in advance for your suggestions!
The problem is on this line:
response.xpath('//a[@class="bui-button bui-button--secondary"]/@href')
Here your XPath extracts URLs like this one:
https://www.booking.com/searchresults.de.html?dest_id=204;dest_type=country&
But they should be something like this:
https://www.booking.com/searchresults.de.html?label=gen173nr-1DCAIoLDgcSAdYBGhSiAEBmAEHuAEHyAEM2AED6AEB-AECiAIBqAIDuAKz_uDyBcACAQ;sid=a3807e20e99c61282850cfdf02041c07;dest_id=204;dest_type=country&
Because of this, your spider tries to open the same webpage again, and the request is dropped by Scrapy's dupefilter. That is the reason why the callback is never called.
I think the missing part of the URL is generated by JavaScript.
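If the dupefilter is indeed dropping those requests, one quick way to confirm is Scrapy's dont_filter flag, which bypasses deduplication for a given request (a diagnostic sketch only; it does not recover the JavaScript-generated URL parameters):
def parse_country(self, response):
    for hotelsurl in response.xpath('//a[@class="bui-button bui-button--secondary"]/@href'):
        url = response.urljoin(hotelsurl.extract())
        # dont_filter=True skips the dupefilter, so parse_hotel is called
        # even when Scrapy considers the URL a duplicate of an earlier one.
        yield scrapy.Request(url, callback=self.parse_hotel, dont_filter=True)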

Scrape multiple articles from one page, each article with a separate href

I am new to Scrapy and am writing my first spider, for a website similar to https://blogs.webmd.com/diabetes/default.htm.
I want to scrape the headlines, then navigate to each article and scrape its text content.
I have tried using rules and LinkExtractor, but the spider is not able to navigate to the next page and extract; I get ERROR: Spider error processing <GET https://blogs.webmd.com/diabetes/default.htm> (referer: None).
Below is my code:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['https://blogs.webmd.com/diabetes/default.htm']
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

    Rules = (Rule(LinkExtractor(allow=(), restrict_css=('.posts-list-post-content a ::attr(href)')),
                  callback="parse", follow=True),)

    def parse(self, response):
        headline = response.css('.posts-list-post-content::text').extract()
        body = response.css('.posts-list-post-desc::text').extract()
        print("%s : %s" % (headline, body))
        next_page = response.css('.posts-list-post-content a ::attr(href)').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = next_href
            request = scrapy.Request(url=next_page_url)
            yield request
Please guide a Scrapy newbie in getting this spider right for multiple articles on each page.
Usually when using Scrapy, each response is parsed by a callback. The main parse method is the callback for the initial responses obtained from each of the start_urls.
The goal of that parse function should then be to "identify article links" and issue requests for each of them. Those responses are then parsed by another callback, say parse_article, which extracts all the contents from that particular article.
You don't even need that LinkExtractor. Consider:
import scrapy

class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['blogs.webmd.com']  # Only the domain, not the URL
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

    def parse(self, response):
        article_links = response.css('.posts-list-post-content a ::attr(href)')
        for link in article_links:
            url = link.get()
            if url:
                yield response.follow(url=url, callback=self.parse_article)

    def parse_article(self, response):
        headline = 'some-css-selector-to-get-the-headline-from-the-article-page'
        # The body is trickier, since it's spread through several tags on this particular site
        body = 'loop-over-some-selector-to-get-the-article-text'
        yield {
            'headline': headline,
            'body': body
        }
I've not pasted the full code because I believe you still want some of the excitement of learning how to do this, but you can find what I came up with in this gist.
Note that the parse_article method is returning dictionaries. These go through Scrapy's item pipelines. You can get a neat JSON output by running the spider with: scrapy runspider headlines/spiders/medical.py -o out.json

Where are the mistakes in my Scrapy code?

I want to crawl a website with Scrapy, but my code comes up with an error.
I have tried to use XPath, but it seems I cannot select the div class on the website.
The following code raises an error on ("h2 ::text").extract().
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem

class MySpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["baltictriennial13.org"]
    start_urls = ["https://www.baltictriennial13.org/artist/caroline-achaintre/"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='artist']")
        items = []
        for titles in titles:
            item = ArtistlistItem()
            item["artist"] = titles.select("h2 ::text").extract()
            item["biograpy"] = titles.select("p::text").extract()
            items.append(item)
        return items
I want to crawl the website and store the data in a .csv file.
The main issue with your code is the use of .select instead of .css. Here is what you need, though I'm not sure about the titles part (maybe you need it on other pages):
def parse(self, response):
    titles = response.xpath("//div[@class='artist']")
    # items = []
    for title in titles:
        item = ArtistlistItem()
        item["artist"] = title.css("h2::text").get()
        item["biograpy"] = title.css("p::text").get()
        # items.append(item)
        yield item
Try removing the space in h2 ::text --> h2::text. If that doesn't work, try h2/text().
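For reference, a quick way to compare the variants is in scrapy shell against the same page (a hypothetical session; the selectors assume the //div[@class='artist'] structure from the question):
# scrapy shell "https://www.baltictriennial13.org/artist/caroline-achaintre/"
artist = response.xpath("//div[@class='artist']")[0]
artist.css("h2::text").get()      # text nodes directly inside <h2>
artist.xpath("h2/text()").get()   # XPath equivalent of the line above
artist.css("h2 ::text").getall()  # with the space: text of all descendants of <h2>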

Why does scrapy's yield scrapy.Request() not recurse?

Here is my code:
import scrapy
from ..items import QuoteItem  # assuming QuoteItem is defined in the project's items.py

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com/']
    start_urls = ['http://quotes.toscrape.com//']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
I am new to Scrapy. I thought this would recurse forever, but it actually doesn't. Why?
The problem here is that Scrapy uses allowed_domains to build a regex for determining whether the links passing through belong to the specified domain.
Just change the string quotes.toscrape.com/ to quotes.toscrape.com if you only want to allow requests from that specific subdomain.
You can also remove the variable entirely if you want to allow requests from every domain.
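Applied to the spider above, that means (the double slash in start_urls is harmless but worth cleaning up as well):
allowed_domains = ['quotes.toscrape.com']
start_urls = ['http://quotes.toscrape.com/']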

yield scrapy.Request does not return the title

I am new to Scrapy and am trying to use it to practice crawling websites. However, even though I followed the code provided by the tutorial, it does not return any results. It looks like yield scrapy.Request does not work. My code is as follows:
import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        appleitem = AppleItem()
        appleitem['title'] = res.select('h1')[0].text
        appleitem['content'] = res.select('.trans')[0].text
        appleitem['time'] = res.select('.gggs time')[0].text
        return appleitem
It shows that the spider was opened and closed, but it returns nothing. The Python version is 3.6. Can anyone please help? Thanks.
EDIT I
The crawl log can be reached here.
EDIT II
Maybe changing the code as below will make the issue clearer:
import scrapy
from bs4 import BeautifulSoup

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        print(res.select('#h1')[0].text)
The code should print out the URL and the title separately, but it does not return anything.
Your log states:
2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.appledaily.com.tw': <GET http://www.appledaily.com.tw/realtimenews/article/life/20170710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>
Your spider is set to:
allowed_domains = ['appledaily.com']
So it should probably be:
allowed_domains = ['appledaily.com.tw']
It seems like the content your parse method is interested in (i.e. the list items with class rtddt) is generated dynamically: it can be inspected, for example, using Chrome, but is not present in the HTML source (which is what Scrapy obtains as a response).
You will have to use something to render the page for Scrapy first. I would recommend Splash together with the scrapy-splash package.
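A minimal sketch of the wiring, following the scrapy-splash README (this assumes a Splash instance running locally on port 8050; settings are abbreviated to the essentials):
# settings.py (abbreviated)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}

# in the spider: render the page in Splash before parsing, waiting
# briefly for the JavaScript-generated .rtddt items to appear
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, callback=self.parse, args={'wait': 1})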
