How to handle DNSLookupError in Scrapy? - python-3.x

I am checking the response status of a number of websites and exporting the results to a CSV file. For a few of them the request fails with DNSLookupError (no website found), and nothing is written to the CSV file for those URLs. How can I also store the DNSLookupError message in the CSV, along with the URL?
def parse(self, response):
    yield {
        'URL': response.url,
        'Status': response.status
    }

You can pass an errback to the request to catch DNS errors or any other type of error. Sample usage:
import scrapy
from twisted.internet.error import DNSLookupError


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request(url="http://example.com/error", errback=self.parse_error)

    def parse_error(self, failure):
        if failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            yield {
                'URL': request.url,
                'Status': failure.value
            }

    def parse(self, response):
        yield {
            'URL': response.url,
            'Status': response.status
        }
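To get the error message into the CSV, the exception carried by the failure (failure.value in the errback) has to end up as plain text. The following is a minimal sketch of that step using only the standard library. The helper name failure_row and the Detail column are hypothetical, and a built-in OSError stands in for Twisted's DNSLookupError so the snippet runs without Scrapy installed.

```python
import csv
import io


def failure_row(url, exc):
    """Build a CSV-friendly row from a URL and the exception that was raised.

    Hypothetical helper: inside an errback it could be called as
    failure_row(failure.request.url, failure.value).
    """
    return {
        "URL": url,
        "Status": type(exc).__name__,  # exception class name, e.g. "DNSLookupError"
        "Detail": str(exc),            # human-readable error message
    }


# Stand-in exception so the sketch runs without Twisted installed.
row = failure_row("http://example.com/error", OSError("DNS lookup failed"))

# Write the row to an in-memory CSV to show the resulting layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["URL", "Status", "Detail"])
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

Keeping the class name and the message in separate string columns means the CSV stays readable even when the exception text varies from site to site.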

Related

Is it correct in Scrapy to have multiple parse methods in one spider?

Is it correct in Scrapy to have multiple parse methods in one spider?
Something like this:
import scrapy


class FooSpider(scrapy.Spider):
    name = 'foo'
    start_urls = ['https://example.com']

    def parse(self, response):
        ...
        yield {'foo': foo}

    def parse(self, response):
        ...
        yield {'bar': bar}
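No. If a class defines two methods with the same name, the second definition replaces the first, so only the last parse would ever run. You can, however, create methods with different names and use them as callbacks, for example from start_requests: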
import scrapy


class FooSpider(scrapy.Spider):
    name = 'POC'
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def start_requests(self):
        for url in self.start_urls:
            # dont_filter=True lets the same URL be requested twice
            yield scrapy.Request(url=url, callback=self.get_title, dont_filter=True)
            yield scrapy.Request(url=url, callback=self.get_price, dont_filter=True)

    def get_title(self, response):
        yield {'title': response.xpath('//h3/text()').get()}

    def get_price(self, response):
        yield {'price': response.xpath('//div[@class="card-body"]/h4/text()').get()}
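The reason two parse methods cannot coexist is plain Python behavior rather than anything specific to Scrapy: a later def with the same name inside a class body rebinds that name. A small sketch with a hypothetical Demo class:

```python
class Demo:
    def parse(self):
        return "first"

    def parse(self):  # same name: this definition replaces the one above
        return "second"


result = Demo().parse()
print(result)  # prints "second"
```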

Scrapy: How to use init_request and start_requests together?

I need to make an initial call to a service before I start my scraper (the initial call gives me some cookies and headers), so I decided to use InitSpider and override its init_request method. However, I also need start_requests to build my links and to add some meta values, such as proxies, for this specific spider. The problem is that whenever I override start_requests, the crawler no longer calls init_request, so the initialization never happens. The only way to get init_request working is to not override start_requests, which is not an option in my case. Any suggestions or possible solutions? My code:
import random

import scrapy
from scrapy.spiders.init import InitSpider


class SomethingSpider(InitSpider):
    name = 'something'
    allowed_domains = ['something.something']
    aod_url = "https://something?="
    start_urls = ["id1", "id2", "id3"]
    custom_settings = {
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
        'CONCURRENT_ITEMS': 20,
        'DOWNLOAD_TIMEOUT': 10,
        'CONCURRENT_REQUESTS': 3,
        'COOKIES_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20
    }

    def init_request(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            return self.initialized()
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def start_requests(self):
        print("H4")
        proxies = ["xyz:0000", "abc:1111"]
        for url in self.start_urls:
            yield scrapy.Request(url=self.aod_url + url, callback=self.parse, meta={'proxy': random.choice(proxies)})

    def parse(self, response):
        try:
            # some processing happens
            yield {
                # some data
            }
        except Exception as err:
            print("Connecting to...")
The Spiders page of the official Scrapy docs (generic spiders section) does not mention the InitSpider you are trying to use.
The InitSpider class at https://github.com/scrapy/scrapy/blob/2.5.0/scrapy/spiders/init.py was written about 10 years ago; in those ancient versions of Scrapy, the start_requests method worked completely differently.
For that reason I recommend not using the undocumented and probably outdated InitSpider.
In current versions of Scrapy, the required functionality can be implemented with the regular Spider class:
import scrapy


class SomethingSpider(scrapy.Spider):
    ...

    def start_requests(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            ...
            # Schedule next requests here:
            for url in self.start_urls:
                yield scrapy.Request(url=self.aod_url + url, callback=self.parse, ...)
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        ...
If you are specifically looking at incorporating a login step, then I would recommend you look at "Using FormRequest.from_response() to simulate a user login" in the Scrapy docs.
Here is the spider example they give:
import scrapy


def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass


class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...
Finally, you can have a look at how to add proxies in a Scrapy middleware, as per this example (Zyte are the people who wrote Scrapy): "How to set up a custom proxy in Scrapy?"

Scrapy Crawler Python3

I'm working on a crawler and I have to save the output in a csv file.
Here is my code:
import scrapy


class ArticleSpider(scrapy.Spider):
    name = "article"

    def start_requests(self):
        urls = [
            'https://www.topart-online.com/de/Ahorn-japan.%2C-70cm%2C--36-Blaetter----Herbst/c-KAT282/a-150001HE'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename = 'article-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

    def parse(self, response):
        yield {
            'title': response.xpath('//h1[@class="text-center text-md-left mt-0"]/text()').get(),
            'quantity': response.xpath('//div[@class="col-6"]/text()')[0].get().strip(),
            'delivery_status': response.xpath('//div[@class="availabilitydeliverytime"]/text()').get().replace('/', '').strip(),
            'itemattr': response.xpath('//div[@class="productcustomattrdesc word-break col-6"]/text()').getall(),
            'itemvalues': response.xpath('//div[@class="col-6"]/text()').getall()
        }
My question is:
How can I output itemattr and itemvalues in the correct order, so that each attribute appears next to its value? For example: Umkarton (itemattr) 20/20/20 (the dimensions of an Umkarton).
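One possible approach, sketched here under the assumption that both XPath queries return lists of equal length in matching document order, is to strip the two lists and pair them with zip. The helper name pair_attributes is hypothetical, and the sample values are made up for illustration (only the Umkarton and 20/20/20 pair comes from the question).

```python
def pair_attributes(names, values):
    """Pair each attribute name with the value at the same position.

    Whitespace-only text nodes are dropped before pairing.
    """
    cleaned_names = [n.strip() for n in names if n.strip()]
    cleaned_values = [v.strip() for v in values if v.strip()]
    return dict(zip(cleaned_names, cleaned_values))


# Made-up sample data standing in for the .getall() results.
itemattr = [" Umkarton ", "\n", "Gewicht"]
itemvalues = ["20/20/20", " 1 kg ", "  "]
paired = pair_attributes(itemattr, itemvalues)
```

If the values query also matches unrelated nodes, or a value is genuinely empty, the two lists will not line up and the selector would need to be narrowed first.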

how to pass scrapy.Response to the dictionary?

I understand how to do this using the requests library
import json

import requests


def start_requests(self):
    token = requests.get('https://support.hpe.com/hpesc/public/km/api/coveo/search/token').text
    headers = {
        ...
        'Authorization': f'Bearer {json.loads(token)["persistentSearchToken"]}',
        ...
    }
How can I do this using Scrapy?
At first I thought of doing this:
def start_requests(self):
    token = scrapy.Request(
        url='https://support.hpe.com/hpesc/public/km/api/coveo/search/token',
        callback=self.get_token
    )
    headers = {
        ...
        'Authorization': f'Bearer {json.loads(token)["persistentSearchToken"]}',
        ...
    }

def get_token(self, response):
    return response.text
But as it turned out, the "token" variable is not an object of the "Response" class. It is an object of the "Request" class.
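Try this: yield the token request, then read the token and build the headers in its callback, where the response is actually available.
def start_requests(self):
    yield scrapy.Request(
        url='https://support.hpe.com/hpesc/public/km/api/coveo/search/token',
        callback=self.get_token
    )

def get_token(self, response):
    # The Response only exists here, once Scrapy has downloaded the page.
    token = json.loads(response.text)["persistentSearchToken"]
    headers = {
        ...
        'Authorization': f'Bearer {token}',
        ...
    }
    # Pass `headers` to the requests you yield from this callback.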
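The header-building step can be isolated into a small function that a callback such as get_token could call with response.text. This is only a sketch: build_auth_headers is a hypothetical name, and the sample JSON body is invented to mirror the persistentSearchToken key used in the question.

```python
import json


def build_auth_headers(token_body):
    """Parse the token endpoint's JSON body and return an Authorization header."""
    token = json.loads(token_body)["persistentSearchToken"]
    return {"Authorization": f"Bearer {token}"}


# Invented sample body; the real endpoint's response may contain more fields.
sample_body = '{"persistentSearchToken": "abc123"}'
headers = build_auth_headers(sample_body)
```

Keeping this logic outside the spider makes it possible to check the parsing without issuing any network request.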

how to get status code other than 200 from scrapy-splash

I am trying to get the response status code with Scrapy and scrapy-splash. Below is the spider code.
import scrapy
from scrapy_splash import SplashRequest


class Exp10itSpider(scrapy.Spider):
    name = "exp10it"

    def start_requests(self):
        urls = [
            'http://192.168.8.240:8000/xxxx'
        ]
        for url in urls:
            #yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True},meta={'handle_httpstatus_all': True})
            #yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
            yield scrapy.Request(url, self.parse, meta={
                'handle_httpstatus_all': True,
                'splash': {
                    'args': {
                        'html': 1,
                        'png': 1,
                    }
                }
            })

    def parse(self, response):
        input("start .........")
        print("status code is:\n")
        input(response.status)
My start URL http://192.168.8.240:8000/xxxx returns a 404 status code. There are three ways of making the request in the code above.
The first is:
yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, meta={'handle_httpstatus_all': True})
The second is:
yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
The third is:
yield scrapy.Request(url, self.parse, meta={
    'handle_httpstatus_all': True,
    'splash': {
        'args': {
            'html': 1,
            'png': 1,
        }
    }
})
Only the second way, yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True}), gets the right status code, 404. The first and the third both get status code 200. In other words, once I use scrapy-splash I cannot get the right status code 404. Can you help me?
As the scrapy-splash documentation suggests, you have to pass magic_response=True to SplashRequest to achieve this:
meta['splash']['http_status_from_error_code'] - set response.status to HTTP error code when assert(splash:go(..)) fails; it requires meta['splash']['magic_response']=True. http_status_from_error_code option is False by default if you use raw meta API; SplashRequest sets it to True by default.
EDIT:
I was able to get it to work only with the execute endpoint, though. Here is a sample spider that tests the HTTP status code using httpbin.org:
# -*- coding: utf-8 -*-
import scrapy
import scrapy_splash


class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    lua_script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {
        html = splash:html(),
        png = splash:png(),
      }
    end
    """

    def start_requests(self):
        yield scrapy_splash.SplashRequest(
            'https://httpbin.org/status/402', self.parse,
            endpoint='execute',
            magic_response=True,
            meta={'handle_httpstatus_all': True},
            args={'lua_source': self.lua_script})

    def parse(self, response):
        pass
It passes the HTTP 402 status code to Scrapy, as can be seen from the output:
...
2017-10-23 08:41:31 [scrapy.core.engine] DEBUG: Crawled (402) <GET https://httpbin.org/status/402 via http://localhost:8050/execute> (referer: None)
...
You can experiment with other HTTP status codes as well.
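With handle_httpstatus_all set, the callback receives every status code, so it usually has to branch on response.status itself. Below is a minimal sketch of such a helper built on the standard library's http.HTTPStatus; describe_status is a hypothetical name and not part of Scrapy or scrapy-splash.

```python
from http import HTTPStatus


def describe_status(status):
    """Return the status code with its standard reason phrase and a success flag."""
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown"
    return {"status": status, "reason": phrase, "ok": 200 <= status < 300}


info = describe_status(402)
```

In a spider, a callback could call describe_status(response.status) and yield the resulting dictionary together with response.url.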
