I am trying to get request status code with scrapy and scrapy-splash,below is spider code.
class Exp10itSpider(scrapy.Spider):
name = "exp10it"
def start_requests(self):
urls = [
'http://192.168.8.240:8000/xxxx'
]
for url in urls:
#yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True},meta={'handle_httpstatus_all': True})
#yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True,'splash': {
'args': {
'html': 1,
'png': 1,
}
}
}
)
def parse(self, response):
input("start .........")
print("status code is:\n")
input(response.status)
My start url http://192.168.8.240:8000/xxxx is a 404 status code url,there are threee kinds of request way upon:
the first is:
yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True},meta={'handle_httpstatus_all': True})
the second is:
yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
the third is:
yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True,'splash': {
'args': {
'html': 1,
'png': 1,
}
}
}
)
Only the second request way yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True}) can get the right status code 404,the first and the third both get status code 200,that's to say,after I try to use scrapy-splash,I can not get the right status code 404,can you help me?
As the documentation to scrapy-splash suggests, you have to pass magic_response=True to SplashRequest to achieve this:
meta['splash']['http_status_from_error_code'] - set response.status to HTTP error code when assert(splash:go(..)) fails; it requires meta['splash']['magic_response']=True. http_status_from_error_code option is False by default if you use raw meta API; SplashRequest sets it to True by default.
EDIT:
I was able to get it to work only with execute endpoint, though. Here is sample spider that tests HTTP status code using httpbin.org:
# -*- coding: utf-8 -*-
import scrapy
import scrapy_splash
class HttpStatusSpider(scrapy.Spider):
name = 'httpstatus'
lua_script = """
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(0.5))
return {
html = splash:html(),
png = splash:png(),
}
end
"""
def start_requests(self):
yield scrapy_splash.SplashRequest(
'https://httpbin.org/status/402', self.parse,
endpoint='execute',
magic_response=True,
meta={'handle_httpstatus_all': True},
args={'lua_source': self.lua_script})
def parse(self, response):
pass
It passes the HTTP 402 status code to Scrapy, as can be seen from the output:
...
2017-10-23 08:41:31 [scrapy.core.engine] DEBUG: Crawled (402) <GET https://httpbin.org/status/402 via http://localhost:8050/execute> (referer: None)
...
You can experiment with other HTTP status codes as well.
Related
I am checking a bunch of website response statuses and exporting them to a CSV file. There are a couple of websites having DNSLookupError or NO WEBSITE FOUND and not storing anything in the CSV file. How can I also store the DNSLookupError message to the CSV along with the URL?
def parse(self, response):
yield {
'URL': response.url,
'Status': response.status
}
You can use the errback function to catch DNS errors or any other types of errors. See below sample usage.
import scrapy
from twisted.internet.error import DNSLookupError
class TestSpider(scrapy.Spider):
name = 'test'
allowed_domains = ['example.com']
def start_requests(self):
yield scrapy.Request(url="http://example.com/error", errback=self.parse_error)
def parse_error(self, failure):
if failure.check(DNSLookupError):
# this is the original request
request = failure.request
yield {
'URL': request.url,
'Status': failure.value
}
def parse(self, response):
yield {
'URL': response.url,
'Status': response.status
}
I need to make an initial call to a service before I start my scraper (the initial call, gives me some cookies and headers), I decided to use InitSpider and override the init_request method to achieve this. however I also need to use start_requests to build my links and add some meta values like proxies and whatnot to that specific spider, but I'm facing a problem. whenever I override start_requests, my crawler doesn't call init_request anymore and I can not do the initialization and in order to get init_request working is to not override the start_requests method which is impossible in my case. any suggestions or possible solutions to my code:
class SomethingSpider(InitSpider):
name = 'something'
allowed_domains = ['something.something']
aod_url = "https://something?="
start_urls = ["id1","id2","id3"]
custom_settings = {
'DOWNLOAD_FAIL_ON_DATALOSS' : False,
'CONCURRENT_ITEMS': 20,
'DOWNLOAD_TIMEOUT': 10,
'CONCURRENT_REQUESTS': 3,
'COOKIES_ENABLED': True,
'CONCURRENT_REQUESTS_PER_DOMAIN': 20
}
def init_request(self):
yield scrapy.Request(url="https://something",callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})
def check_temp_cookie(self, response):
"""Check the response returned by a login request to see if we are
successfully logged in.
"""
if response.status == 200:
print("H2")
# Now the crawling can begin..
return self.initialized()
else:
print("H3")
# Something went wrong, we couldn't log in, so nothing happens.
def start_requests(self):
print("H4")
proxies = ["xyz:0000","abc:1111"]
for url in self.start_urls:
yield scrapy.Request(url=self.aod_url+url, callback=self.parse, meta={'proxy': random.choice(proxies)})
def parse(self, response):
#some processing happens
yield {
#some data
}
except Exception as err:
print("Connecting to...")
Spiders page (generic spiders section) on official scrapy docs doesn't have any mention of InitSpider You are trying to use.
InitSpider class from https://github.com/scrapy/scrapy/blob/2.5.0/scrapy/spiders/init.py written ~10 years ago (at that... ancient versions of scrapy start_requests method worked completely differently).
From this perspective I recommend You to not use undocumented and probably outdated InitSpider.
On current versions of scrapy required functionality can be implemented using regular Spider class:
import scrapy
class SomethingSpider(scrapy.Spider):
...
def start_requests(self):
yield scrapy.Request(url="https://something",callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})
def check_temp_cookie(self, response):
"""Check the response returned by a login request to see if we are
successfully logged in.
"""
if response.status == 200:
print("H2")
# Now the crawling can begin..
...
#Schedule next requests here:
for url in self.start_urls:
yield scrapy.Request(url=self.aod_url+url, callback=self.parse, ....})
else:
print("H3")
# Something went wrong, we couldn't log in, so nothing happens.
def parse(self, response):
...
If you are looking speicfically at incorporating logging in then I would reccomend you look at Using FormRequest.from_response() to simulate a user login in the scrapy docs.
Here is the spider example they give:
import scrapy
def authentication_failed(response):
# TODO: Check the contents of the response and return True if it failed
# or False if it succeeded.
pass
class LoginSpider(scrapy.Spider):
name = 'example.com'
start_urls = ['http://www.example.com/users/login.php']
def parse(self, response):
return scrapy.FormRequest.from_response(
response,
formdata={'username': 'john', 'password': 'secret'},
callback=self.after_login
)
def after_login(self, response):
if authentication_failed(response):
self.logger.error("Login failed")
return
# continue scraping with authenticated session...
finally, you can have a look at how too add proxies to your scrapy middleware as per this example (zyte are the guys who wrote scrapy) "How to set up a custom proxy in Scrapy?"
I am using python3.8.5, scrapy2.4.0 I am also using scrapy-proxy-pool and scrapy-user-agents I am getting "AttributeError: Response content isn't text". I am running this code on python3-venv. Would you like to help me explaining and solving the problem ?
Here is my code:
import scrapy
import json
class BasisMembersSpider(scrapy.Spider):
name = 'basis'
allowed_domains = ['www.basis.org.bd']
def start_requests(self):
start_url = 'https://basis.org.bd/get-member-list?page=1&team='
yield scrapy.Request(url=start_url, callback=self.get_membership_no)
def get_membership_no(self, response):
data_array = json.loads(response.body)['data']
next_page = json.loads(response.body)['links']['next']
for data in data_array:
next_url = 'https://basis.org.bd/get-company-profile/{0}'.format(data['membership_no'])
yield scrapy.Request(url=next_url, callback=self.parse)
if next_page:
yield scrapy.Request(url=next_page, callback=self.get_membership_no)
def parse(self, response):
print("Printing informations....................................................")
Here is my settings.py file:
BOT_NAME = 'web_scraping'
SPIDER_MODULES = ['web_scraping.spiders']
NEWSPIDER_MODULE = 'web_scraping.spiders'
AUTOTHROTTLE_ENABLED = True
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'web_scraping (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
PROXY_POOL_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 800,
'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
}
And are error messages from console output:
Thank you...
I want to crawl a web page which shows the results of a search in google's webstore and the link is static for that particular keyword.
I want to find the ranking of an extension periodically.
Here is the URL
Problem is that I can't render the dynamic data generated by Javascript code in response from server.
I tried using Scrapy and Scrapy-Splash to render the desired page but I was still getting the same response. I used Docker to run an instance of scrapinghub/splash container on port 8050. I even visited the webpage http://localhost:8050 and entered my URL manually but it couldn't render the data although the message showed success.
Here's the code I wrote for the crawler. It actually does nothing and its only job is to fetch the HTML contents of the desired page.
import scrapy
from scrapy_splash import SplashRequest
class WebstoreSpider(scrapy.Spider):
name = 'webstore'
def start_requests(self):
yield SplashRequest(
url='https://chrome.google.com/webstore/search/netflix%20vpn?utm_source=chrome-ntp-icon&_category=extensions',
callback=self.parse,
args={
"wait": 3,
},
)
def parse(self, response):
print(response.text)
and the contents of the settings.py of my Scrapy project:
BOT_NAME = 'webstore_cralwer'
SPIDER_MODULES = ['webstore_cralwer.spiders']
NEWSPIDER_MODULE = 'webstore_cralwer.spiders'
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
And for the result I always get nothing.
Any help is appreciated.
Works for me with a small custom lua script:
lua_source = """
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(5.0))
return {
html = splash:html(),
}
end
"""
You can then change your start_requests as follows:
def start_requests(self):
yield SplashRequest(
url='https://chrome.google.com/webstore/search/netflix%20vpn?utm_source=chrome-ntp-icon&_category=extensions',
callback=self.parse,
args={'lua_source': self.lua_source},
)
I am trying to get the content from iframe for this reason I changed my splash request endpoint from execute to render.json. Howerver, splash.wait doesn't work at all. Here's the spider code.
import scrapy
from scrapy_splash import SplashRequest
from scrapy.http import HtmlResponse
src="""
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(10))
return {
html = splash:html()
}
end
"""
class Lafarge (scrapy.Spider):
name = "lafargespider"
def __init__(self, *args, **kwargs):
self.root_url = "https://cacareers-lafarge-na.icims.com/jobs/search?pr=0&searchRelation=keyword_all&schemaId=&o="
def start_requests(self):
yield SplashRequest(self.root_url, self.parse_detail,
endpoint='render.json',
args={
'iframes': 1,
'html' : 1,
'lua_source': src,
'timeout': 90
}
)
def parse_detail(self, response):
#response decoded
rs = response.data['childFrames'][0]['html']
response = HtmlResponse(url="my HTML string", body=rs, encoding='utf-8')
print("next page ===>",response.xpath('//a[#class="glyph "]/#href').extract_first())
passing wait time in the Splash.request arguments solved the issue for me.
def start_requests(self):
yield SplashRequest(self.root_url, self.parse_detail,
endpoint='render.json',
args={
'wait': 5,
'iframes': 1,
'html' : 1,
'lua_source': src,
}
)
def parse_detail(self, response):
rs = response.data['childFrames'][0]['html']
Pass the wait param in args. It should be -
args ={
'wait': 5,
'iframes': 1,
'html' : 1,
'lua_source': src,
'timeout': 90
}
lua_source isn't supported with endpoint of type "render.json" but supported with type "execute" so no need for lua_source in your code.
What solved the problem is using wait see explanation for wait usage here page 11:
https://media.readthedocs.org/pdf/splash/latest/splash.pdf