I have written a scraper using Scrapy and I have a weird yet simple problem.
I log in with a username and password to scrape the data, but sometimes the site redirects me to the /login.asp page with a redir= query parameter containing the URL I was about to scrape. So I added a re_login_if_needed() function and call it as the first statement of the parse() callback. The idea is to check whether response.url contains a redirect URL so that I can log in to the site again and continue scraping with parse() as before.
The problem is that re_login_if_needed() is somehow never executed. Any debug print statement I put in there is never printed.
How could that be?
In my class I have:
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            self.re_login_if_needed(response)
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

        def re_login_if_needed(self, response):
            # check if response.url contains redirect code, i.e: "/login.asp?redir="
            # and relogin ...
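One possible explanation, offered as a guess because the real body of re_login_if_needed() is not shown: if that method contains a yield (for example, to yield a login request), Python treats it as a generator function, and calling it without iterating over the result runs none of its body, including any print statements. The sketch below shows this in plain Python with no Scrapy involved. The helper login_redirect_target, the example.com URL, and the body of re_login_if_needed are assumptions made for illustration.

```python
from urllib.parse import urlparse, parse_qs


def login_redirect_target(url):
    """Return the redir= target if url points at /login.asp, else None."""
    parsed = urlparse(url)
    if not parsed.path.endswith("/login.asp"):
        return None
    values = parse_qs(parsed.query).get("redir")
    return values[0] if values else None


calls = []  # records every time the function body actually runs


def re_login_if_needed(url):
    # Hypothetical body: the yield below makes this a generator function.
    calls.append(url)
    target = login_redirect_target(url)
    if target is not None:
        yield target


login_url = "http://example.com/login.asp?redir=%2Fpage%2F1%2F"

# Calling the function only creates a generator object; the body has not run.
gen = re_login_if_needed(login_url)
calls_before_iteration = list(calls)

# Iterating over the generator is what finally runs the body.
targets = list(gen)
```

If this is the cause, iterating over the call inside parse(), for example with `yield from self.re_login_if_needed(response)`, would make the body run and pass any yielded requests on to Scrapy.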
Related
Here is my code:
    import scrapy

    class BookingSpider(scrapy.Spider):
        name = 'booking-spider'
        allowed_domains = ['booking.com']
        start_urls = [
            'https://www.booking.com/country.de.html?aid=356980;label=gog235jc-1DCAIoLDgcSAdYA2gsiAEBmAEHuAEHyAEP2AED6AEB'
            '-AECiAIBqAIDuAK7q7DyBcACAQ;sid=8de61678ac61d10a89c13a3941fd3dcd'
        ]

        # get country page
        def parse(self, response):
            for countryurl in response.xpath('//a[contains(text(),"Schweiz")]/@href'):
                url = response.urljoin(countryurl.extract())
                print("COUNTRYURL", url)
                yield scrapy.Request(url, callback=self.parse_country)

        # get page of all hotels in a country
        def parse_country(self, response):
            for hotelsurl in response.xpath('//a[@class="bui-button bui-button--secondary"]/@href'):
                url = response.urljoin(hotelsurl.extract())
                print("HOTELURL", url)
                yield scrapy.Request(url, callback=self.parse_hotel)

        def parse_hotel(self, response):
            print("entering parse_hotel")
            hotelurl = response.xpath('//*[(@id = "hp_hotel_name")]')
            print("URL", hotelurl)
The spider never enters the parse_hotel function, and I can't understand why.
Where is my mistake? Thank you in advance for your suggestions!
The problem is on this line:
    response.xpath('//a[@class="bui-button bui-button--secondary"]/@href')
Here your XPath extracts URLs like this:
    https://www.booking.com/searchresults.de.html?dest_id=204;dest_type=country&
But they should look something like this:
    https://www.booking.com/searchresults.de.html?label=gen173nr-1DCAIoLDgcSAdYBGhSiAEBmAEHuAEHyAEM2AED6AEB-AECiAIBqAIDuAKz_uDyBcACAQ;sid=a3807e20e99c61282850cfdf02041c07;dest_id=204;dest_type=country&
Because of this, your spider tries to open the same web page again, and the request gets blocked by Scrapy's dupefilter. That is the reason the callback is not called.
I think the missing part of the URL is generated by JavaScript.
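To illustrate why a duplicate request never reaches its callback, here is a toy model of a duplicate filter. It is a simplified sketch and not Scrapy's actual implementation: ToyDupeFilter, should_drop, and the shortened URL are hypothetical names chosen for this example.

```python
import hashlib


class ToyDupeFilter:
    """Simplified stand-in for a duplicate filter (not Scrapy's real code)."""

    def __init__(self):
        self.seen = set()

    def should_drop(self, method, url, dont_filter=False):
        """Return True if this request would be silently dropped."""
        if dont_filter:
            # The caller explicitly asked to skip duplicate filtering.
            return False
        fingerprint = hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            return True
        self.seen.add(fingerprint)
        return False


dupefilter = ToyDupeFilter()
country_page = "https://www.booking.com/country.de.html"  # shortened, hypothetical

first = dupefilter.should_drop("GET", country_page)   # new request, kept
second = dupefilter.should_drop("GET", country_page)  # same page again, dropped
forced = dupefilter.should_drop("GET", country_page, dont_filter=True)  # kept
```

In Scrapy itself, passing dont_filter=True to scrapy.Request bypasses the filter for that request. That can help confirm the diagnosis, but it does not repair the incomplete URL.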
I just need Scrapy to request the last page of the website.
I can't create a Scrapy request that goes to the last page. I have tried the code below.
    last_page = response.css('li.next a::attr(href)').get()
    if next_page is None:
        yield scrapy.Request(last_page, callback=self.parse)
I expect the crawler to go straight to the last page; from there I would do some manipulations.
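I believe the way to go would be to inspect the source code to find the "Next" page link and use this logic in parse:
    def parse(self, response):
        current_page = response.url
        # Link to the next page; None when there is no "Next" link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is None:
            # Last page reached: request it again with the other callback.
            # dont_filter=True keeps the dupefilter from dropping the repeat.
            yield response.follow(current_page, callback=self.manipulation, dont_filter=True)
        else:
            yield response.follow(next_page, callback=self.parse)

    def manipulation(self, response):
        # your code here
        pass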
I am new to Scrapy and writing my first spider, for a website similar to https://blogs.webmd.com/diabetes/default.htm
I want to scrape the headlines, then navigate to each article and scrape its text content.
I have tried using rules and a LinkExtractor, but the spider is not able to navigate to the next page and extract anything. I get this error: ERROR: Spider error processing <GET https://blogs.webmd.com/diabetes/default.htm> (referer: None)
Below is my code:
    import scrapy
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor

    class MedicalSpider(scrapy.Spider):
        name = 'medical'
        allowed_domains = ['https://blogs.webmd.com/diabetes/default.htm']
        start_urls = ['https://blogs.webmd.com/diabetes/default.htm']
        Rules = (Rule(LinkExtractor(allow=(), restrict_css=('.posts-list-post-content a ::attr(href)')), callback="parse", follow=True),)

        def parse(self, response):
            headline = response.css('.posts-list-post-content::text').extract()
            body = response.css('.posts-list-post-desc::text').extract()
            print("%s : %s" % (headline, body))
            next_page = response.css('.posts-list-post-content a ::attr(href)').extract()
            if next_page:
                next_href = next_page[0]
                next_page_url = next_href
                request = scrapy.Request(url=next_page_url)
                yield request
Please guide a Scrapy newbie to get this spider right for multiple articles on each page.
Usually, when using Scrapy, each response is parsed by a callback. The main parse method is the callback for the initial response obtained for each of the start_urls.
The goal of that parse function should then be to identify the article links and issue a request for each of them. Those responses would then be parsed by another callback, say parse_article, which would extract all the contents from that particular article.
You don't even need that LinkExtractor. Consider:
    import scrapy

    class MedicalSpider(scrapy.Spider):
        name = 'medical'
        allowed_domains = ['blogs.webmd.com']  # Only the domain, not the URL
        start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

        def parse(self, response):
            article_links = response.css('.posts-list-post-content a ::attr(href)')
            for link in article_links:
                url = link.get()
                if url:
                    yield response.follow(url=url, callback=self.parse_article)

        def parse_article(self, response):
            headline = 'some-css-selector-to-get-the-headline-from-the-article-page'
            # The body is trickier, since it's spread through several tags on this particular site
            body = 'loop-over-some-selector-to-get-the-article-text'
            yield {
                'headline': headline,
                'body': body
            }
I've not pasted the full code because I believe you still want some excitement learning how to do this, but you can find what I came up with on this gist
Note that the parse_article method yields dictionaries. Scrapy treats these as items and passes them through its item pipelines. You can get a neat JSON output by running your code with: scrapy runspider headlines/spiders/medical.py -o out.json
I am writing a spider with Scrapy in Python 3, and I only started using Scrapy recently. I was collecting data from a website, and after a few minutes the site may respond with a 302 status and redirect me to another URL to verify me. So I want to save the URL to a file.
For example, https://www.test.com/article?id=123 is what I want to request, and the site responds with a 302 and redirects to https://www.test.com/vrcode
I want to save https://www.test.com/article?id=123 to a file. How should I do that?
    class CatchData(scrapy.Spider):
        name = 'test'
        allowed_domains = ['test.com']
        start_urls = ['test.com/article?id=1',
                      'test.com/article?id=2',
                      # ...
                      ]

        def parse(self, response):
            item = LocationItem()
            item['article'] = response.xpath('...')
            yield item
I found an answer in "How to get the scrapy failure URLs?", but that answer is six years old, and I want to know whether there is a simpler way to do this.
    with open(file_name, 'w', encoding="utf-8") as f:
        f.write(str(item))
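A possible approach, offered as a sketch rather than a definitive answer: when Scrapy's built-in RedirectMiddleware follows a redirect, it records the URLs the request went through in response.meta['redirect_urls'], so the first entry is the URL that was originally requested. The helper names first_requested_url and append_url, the file name, and the simulated meta dict below are assumptions for illustration; inside a spider, meta would be response.meta in parse().

```python
import os
import tempfile


def first_requested_url(meta):
    """Return the originally requested URL if a redirect happened, else None."""
    redirect_urls = meta.get("redirect_urls") or []
    if not redirect_urls:
        return None
    return redirect_urls[0]


def append_url(path, url):
    """Append one URL per line so earlier entries are not overwritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")


# Simulated response.meta for a request that was redirected to /vrcode.
meta = {"redirect_urls": ["https://www.test.com/article?id=123"]}
log_path = os.path.join(tempfile.mkdtemp(), "redirected_urls.txt")

original = first_requested_url(meta)
if original is not None:
    append_url(log_path, original)

with open(log_path, encoding="utf-8") as f:
    saved = f.read().splitlines()
```

Opening the file in append mode ('a') matters here: with 'w', as in the snippet above, each write would replace the previously saved URLs.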
I am crawling a website with property listings, and the "Buy/Rent" offering type is only found on the listing page. I have extracted the other data from the detail page by passing each URL from the parse method to the parse_property method, but I am not able to get the offering type from the main listing page.
I have tried to do it the same way I handled the individual URLs (see the commented code).
    def parse(self, response):
        properties = response.xpath('//div[@class="property-information-address"]/a')
        for property in properties:
            url = property.xpath('./@href').extract_first()
            yield Request(url, callback=self.parse_property, meta={'URL': url})
        # TODO: offering
        # offering=response.xpath('//div[@class="property-status"]')
        # for of in offerings:
        #     offering=of.xpath('./a/text()').extract_first()
        #     yield Request(offering, callback=self.parse_property, meta={'Offering':offering})
        next_page = response.xpath('//div[@class="pagination"]/a/@href')[-2].extract()
        yield Request(next_page, callback=self.parse)

    def parse_property(self, response):
        l = ItemLoader(item=NPMItem(), response=response)
        url = response.meta.get('URL')
        #offer=response.meta.get('Offering')
        l.add_value('URL', response.url)
        #l.add_value('Offering', response.offer)
You can rely on an element that is higher in the DOM tree and scrape both the property type and the link from there. Check this code example; it works:
    def parse(self, response):
        properties = response.xpath('//div[@class="property-listing"]')
        for property in properties:
            url = property.xpath('.//div[@class="property-information-address"]/a/@href').get()
            ptype = property.xpath('.//div[@class="property-status"]/a/text()').get()
            yield response.follow(url, self.parse_property, meta={'ptype': ptype})
        next_page = response.xpath('//link[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_property(self, response):
        print('======')
        print(response.meta['ptype'])
        print('======')
        # build your item here; the printing only shows the content of `ptype`