Scrapy Crawl ValueError - python-3.x

I am new to Python and to Scrapy. I followed a tutorial to have Scrapy crawl quotes.toscrape.com.
I entered the code exactly as it appears in the tutorial, but I keep getting ValueError: invalid hostname: when I run scrapy crawl quotes. I am doing this in PyCharm on a Mac.
I tried both single and double quotes around the URL in the start_urls = [] section, but that did not fix the error.
This is what the code looks like:
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http: // quotes.toscrape.com /'
    ]

    def parse(self, response):
        title = response.css('title').extract()
        yield {'titletext': title}
It is supposed to be scraping the site for the title.
This is what the error looks like:
2019-11-08 12:52:42 [scrapy.core.engine] INFO: Spider opened
2019-11-08 12:52:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-08 12:52:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-08 12:52:42 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http:///robots.txt>: invalid hostname:
Traceback (most recent call last):
File "/Users/newuser/PycharmProjects/ScrapyTutorial/venv/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
ValueError: invalid hostname:
2019-11-08 12:52:42 [scrapy.core.scraper] ERROR: Error downloading <GET http:///%20//%20quotes.toscrape.com%20/>
Traceback (most recent call last):
File "/Users/newuser/PycharmProjects/ScrapyTutorial/venv/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
ValueError: invalid hostname:
2019-11-08 12:52:42 [scrapy.core.engine] INFO: Closing spider (finished)

Don't use spaces in URLs!
start_urls = [
    'http://quotes.toscrape.com/'
]
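For reference, a minimal sketch of the whole spider with the spaces removed; apart from the URL, nothing from the question's code needs to change:
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    # scheme, host and path must be contiguous: no spaces anywhere
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        # extract() returns a list of matching nodes, here the single <title> element
        title = response.css('title').extract()
        yield {'titletext': title}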

Related

Scrapy 1.6 : DNS lookup failed

I am new to Scrapy and I'm trying to crawl this website, https://www.timeanddate.com/weather/india, and it's throwing a DNS lookup error. The code I wrote for scraping works perfectly in the shell, so my guess is that the DNS error happens before any scraping takes place.
This is what I get:
2019-05-02 11:59:03 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: IndiaWeather)
2019-05-02 11:59:03 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-05-02 11:59:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'IndiaWeather', 'NEWSPIDER_MODULE': 'IndiaWeather.spiders', 'SPIDER_MODULES': ['IndiaWeather.spiders']}
2019-05-02 11:59:03 [scrapy.extensions.telnet] INFO: Telnet Password: 688b4fe759cb3ed5
2019-05-02 11:59:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-05-02 11:59:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-02 11:59:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-02 11:59:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-05-02 11:59:03 [scrapy.core.engine] INFO: Spider opened
2019-05-02 11:59:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-02 11:59:03 [py.warnings] WARNING: C:\Users\Abrar\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.timeanddate.com/weather/india in allowed_domains.
warnings.warn(message, URLWarning)
2019-05-02 11:59:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-05-02 11:59:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//www.timeanddate.com/weather/india/> (failed 1 times): DNS lookup failed: no results for hostname lookup: https.
2019-05-02 11:59:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//www.timeanddate.com/weather/india/> (failed 2 times): DNS lookup failed: no results for hostname lookup: https.
2019-05-02 11:59:10 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://https//www.timeanddate.com/weather/india/> (failed 3 times): DNS lookup failed: no results for hostname lookup: https.
2019-05-02 11:59:10 [scrapy.core.scraper] ERROR: Error downloading <GET http://https//www.timeanddate.com/weather/india/>
Traceback (most recent call last):
File "C:\Users\Abrar\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "C:\Users\Abrar\Anaconda3\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "C:\Users\Abrar\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "C:\Users\Abrar\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\Abrar\Anaconda3\lib\site-packages\twisted\internet\endpoints.py", line 975, in startConnectionAttempts
"no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https.
2019-05-02 11:59:10 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-02 11:59:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
'downloader/request_bytes': 717,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 5, 2, 6, 29, 10, 505262),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.DNSLookupError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2019, 5, 2, 6, 29, 3, 412894)}
2019-05-02 11:59:10 [scrapy.core.engine] INFO: Spider closed (finished)
I have posted everything above.
Look at this error message you get:
2019-05-02 11:59:10 [scrapy.core.scraper] ERROR: Error downloading <GET http://https//www.timeanddate.com/weather/india/>
You have doubled the scheme in the URL (both http and https), and the second one is also invalid (there is no : after https). This usually happens if you use the scrapy genspider command-line command and specify the domain with the scheme already included.
So, remove one of the schemes from the start_urls URLs.
Please check the values of
allowed_domains = ['abc.xyz.domain_name/']
start_urls = ['http://abc.xyz.domain_name//']
in your program. You may have written http twice, or you may have added the scheme to allowed_domains.
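For concreteness, a minimal sketch of how the spider header should look for this site; the class name, spider name and empty parse body are placeholders for whatever your project already has, and only the allowed_domains and start_urls lines matter here:
import scrapy

class WeatherSpider(scrapy.Spider):    # placeholder class name
    name = 'india_weather'             # placeholder spider name
    # allowed_domains takes bare domain names: no scheme, no path
    allowed_domains = ['timeanddate.com']
    # each start_urls entry carries the scheme exactly once
    start_urls = ['https://www.timeanddate.com/weather/india']

    def parse(self, response):
        pass  # your existing parsing code goes here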

How to fix 'PROXIES is empty' error for scrapy spider

I am trying to run a Scrapy spider through a proxy and am getting errors whenever I run the code.
This is on macOS, Python 3.7, Scrapy 1.5.1.
I have tried playing around with the settings and middlewares, but to no avail.
import scrapy

class superSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        print('request')
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print('parse')
The errors I get are:
2019-02-15 08:32:27 [scrapy.utils.log] INFO: Scrapy 1.5.1 started
(bot: superScraper)
2019-02-15 08:32:27 [scrapy.utils.log] INFO: Versions: lxml
4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0,
Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018,
03:13:28) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 18.0.0 (OpenSSL
1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Darwin-17.7.0-
x86_64-i386-64bit
2019-02-15 08:32:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'superScraper', 'CONCURRENT_REQUESTS': 25,
'NEWSPIDER_MODULE': 'superScraper.spiders', 'RETRY_HTTP_CODES':
[500, 503, 504, 400, 403, 404, 408], 'RETRY_TIMES': 10,
'SPIDER_MODULES': ['superScraper.spiders'], 'USER_AGENT':
'Mozilla/5.0 (compatible; bingbot/2.0;
+http://www.bing.com/bingbot.htm)'}
2019-02-15 08:32:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2019-02-15 08:32:27 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 171, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 175, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy_proxies/randomproxy.py", line 99, in from_crawler
return cls(crawler.settings)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy_proxies/randomproxy.py", line 74, in __init__
raise KeyError('PROXIES is empty')
builtins.KeyError: 'PROXIES is empty'
These URLs are from the Scrapy documentation, and the spider works when I don't use a proxy.
For anyone else having a similar problem, this was an issue with my local copy of the scrapy_proxies RandomProxy code.
Using the code here made it work:
https://github.com/aivarsk/scrapy-proxies
Go into the scrapy_proxies folder and replace the randomproxy.py code with the one found on GitHub.
Mine was found here:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy_proxies/randomproxy.py
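The KeyError itself means the middleware could not find any proxies to work with, so besides replacing randomproxy.py it is worth double-checking the proxy settings. A sketch of the settings.py entries roughly as the scrapy-proxies README describes them (the list path is an example; check the repository for the exact, current option names):
# settings.py (sketch)
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Plain-text file with one proxy per line, e.g. http://host:port
# or http://user:password@host:port
PROXY_LIST = '/path/to/proxy/list.txt'

# 0 = pick a different random proxy for every request
PROXY_MODE = 0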

ValueError: Missing scheme in request url Scrapy

I am trying to scrape https://www.skynewsarabia.com/ using Scrapy and I am getting this error: ValueError: Missing scheme in request url.
I tried every single solution I found on Stack Overflow and none of them worked for me.
Here is my spider:
name = 'skynews'
allowed_domains = ['www.skynewsarabia.com']
start_urls = ['https://www.skynewsarabia.com/sport/latest-news-%D8%A2%D8%AE%D8%B1-%D8%A7%D9%84%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1']

def parse(self, response):
    link = "https://www.skynewsarabia.com"
    # get the urls of each article
    urls = response.css("a.item-wrapper::attr(href)").extract()
    # for each article make a request to get the text of that article
    for url in urls:
        # get the info of that article using the parse_details function
        yield scrapy.Request(url=link + url, callback=self.parse_details)
    # go and get the link for the next article
    next_article = response.css("a.item-wrapper::attr(href)").extract_first()
    if next_article:
        # keep repeating the process until the bot visits all the links in the website!
        yield scrapy.Request(url=next_article, callback=self.parse)  # keep calling yourself!
Here is the whole error:
2019-01-30 11:49:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-30 11:49:34 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-01-30 11:49:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.skynewsarabia.com/robots.txt> (referer: None)
2019-01-30 11:49:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.skynewsarabia.com/sport/latest-news-%D8%A2%D8%AE%D8%B1-%D8%A7%D9%84%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1> (referer: None)
2019-01-30 11:49:35 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.skynewsarabia.com/sport/latest-news-%D8%A2%D8%AE%D8%B1-%D8%A7%D9%84%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1> (referer: None)
Traceback (most recent call last):
File "c:\users\hozrifai\desktop\scraping\venv\lib\site-
packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "c:\users\hozrifai\desktop\scraping\venv\lib\site-
packages\scrapy\spidermiddlewares\offsite.py", line 30, in
process_spider_output
for x in result:
File "c:\users\hozrifai\desktop\scraping\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "c:\users\hozrifai\desktop\scraping\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:\users\hozrifai\desktop\scraping\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\HozRifai\Desktop\scraping\articles\articles\spiders\skynews.py", line 28, in parse
yield scrapy.Request(url=next_article, callback=self.parse) # keep calling yourself!
File "c:\users\hozrifai\desktop\scraping\venv\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
self._set_url(url)
File "c:\users\hozrifai\desktop\scraping\venv\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /sport/1222754-%D8%A8%D9%8A%D8%B1%D9%86%D9%84%D9%8A-%D9%8A%D8%B6%D8%B9-%D8%AD%D8%AF%D8%A7-%D9%84%D8%B3%D9%84%D8%B3%D9%84%D8%A9-%D8%A7%D9%86%D8%AA%D8%B5%D8%A7%D8%B1%D8%A7%D8%AA-%D8%B3%D9%88%D9%84%D8%B4%D8%A7%D8%B1
2019-01-30 11:49:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.skynewsarabia.com/sport/1222754-%D8%A8%D9%8A%D8%B1%D9%86%D9%84%D9%8A-%D9%8A%D8%B6%D8%B9-%D8%AD%D8%AF%D8%A7-%D9%84%D8%B3%D9%84%D8%B3%D9%84%D8%A9-%D8%A7%D9%86%D8%AA%D8%B5%D8%A7%D8%B1%D8%A7%D8%AA-%D8%B3%D9%88%D9%84%D8%B4%D8%A7%D8%B1> (referer: https://www.skynewsarabia.com/sport/latest-news-%D8%A2%D8%AE%D8%B1-%D8%A7%D9%84%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1)
Thanks in advance.
Your next_article URL has no scheme. Try:
next_article = response.css("a.item-wrapper::attr(href)").get()
if next_article:
    yield scrapy.Request(response.urljoin(next_article))
In your next-article retrieval:
next_article = response.css("a.item-wrapper::attr(href)").extract_first()
are you sure you are getting a full link starting with http/https?
As a safer approach, when you are not sure about the URL you are receiving, always use urljoin:
url = response.urljoin(next_article)  # you can also use this in your above logic
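On recent Scrapy versions (1.4 and later), response.follow accepts relative URLs directly, so the same logic can be written without manual joining. A minimal sketch, keeping the selectors and callback names from the question:
def parse(self, response):
    # follow every article link; response.follow resolves relative hrefs itself
    for href in response.css("a.item-wrapper::attr(href)").extract():
        yield response.follow(href, callback=self.parse_details)

    # follow the first link again as the "next" page, mirroring the original logic
    next_article = response.css("a.item-wrapper::attr(href)").extract_first()
    if next_article:
        yield response.follow(next_article, callback=self.parse)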

error following first steps of scrapy tutorial

I am following this tutorial. After writing the first spider, it directs me to run the command scrapy crawl quotes, but I get an error.
Here is my code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Here is the error that I encounter:
PS C:\Users\BB\desktop\scrapy\tutorial\spiders> scrapy crawl quotes
2018-09-12 13:55:06 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: tutorial)
2018-09-12 13:55:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
Traceback (most recent call last):
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'quotes'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\BB\Anaconda3\Scripts\scrapy-script.py", line 5, in <module>
sys.exit(scrapy.cmdline.execute())
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 157, in _run_command
cmd.run(args, opts)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\commands\crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\crawler.py", line 170, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\crawler.py", line 198, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\crawler.py", line 202, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: quotes'
OK, I had created a folder called spiders myself, but the tutorial had already created one for me containing the __pycache__ and __init__.py files required for the scrapy crawl quotes command to work. In short, I was running it from the wrong folder.
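For anyone hitting the same thing: scrapy crawl has to be run from inside the project generated by scrapy startproject, i.e. somewhere under the directory that contains scrapy.cfg, not from a spiders folder created elsewhere. A sketch of the layout the tutorial's scrapy startproject tutorial command produces (file names per the standard template):
tutorial/
    scrapy.cfg             # run "scrapy crawl quotes" from this directory
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes_spider.py   # the spider from the tutorial goes here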

Scrapy Crawler gets terminated at random pages

I'm new to Scrapy. I'm crawling the r/india subreddit using a recursive parser to store the title, upvotes and URL of each thread. It all works fine, but the scraper ends unexpectedly with a weird error:
2018-04-29 00:01:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.reddit.com/r/india/?count=50&after=t3_8fh5nv> (referer: https://www.reddit.com/r/india/?count=25&after=t3_8fiqd5)
Traceback (most recent call last):
File "Z:\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\jayes\myredditscraper\myredditscraper\spiders\scrapereddit.py", line 28, in parse
yield Request(url=(next_page),callback=self.parse)
File "Z:\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
self._set_url(url)
File "Z:\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2018-04-29 00:01:12 [scrapy.core.engine] INFO: Closing spider (finished)
The error comes at a random page each time the spider is run, which makes it impossible for me to work out what's causing the problem. Here's my redditscraper.py file, which contains the code (I've also used a pipeline and items.py, but I don't think those contain the problem):
import scrapy
import time
from scrapy.http.request import Request
from myredditscraper.items import MyredditscraperItem

class ScraperedditSpider(scrapy.Spider):
    name = 'scrapereddit'
    allowed_domains = ['www.reddit.com']
    start_urls = ['http://www.reddit.com/r/india/']

    def parse(self, response):
        next_page = ''
        titles = response.css("a.title::text").extract()
        links = response.css("a.title::attr(href)").extract()
        votes = response.css("div.score.unvoted::attr(title)").extract()
        for item in zip(titles, links, votes):
            new_item = MyredditscraperItem()
            new_item['title'] = item[0]
            new_item['link'] = item[1]
            new_item['vote'] = item[2]
            yield new_item

        next_page = response.css("span.next-button").css('a::attr(href)').extract()[0]
        if next_page is not None:
            yield Request(url=(next_page), callback=self.parse)
As your exception says:
ValueError: Missing scheme in request url:
that means you are trying to scrape an invalid URL, one that is missing http:// or https://.
I guess the problem is not in start_urls, because otherwise the parse function would not be called at all. The problem is in the parse function.
When yield Request is called, you need to check whether next_page contains a scheme. It looks like the URLs you parse are relative links, so you have two options to keep scraping those links without hitting this exception (see the sketch after this list):
Pass an absolute URL, e.g. by joining the relative link with the page URL.
Skip relative URLs.
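A minimal sketch of how the end of parse could guard against this, keeping the selectors and the Request import from the question; response.urljoin resolves a relative href against the current page URL, and extract_first() avoids an IndexError when there is no next button:
next_page = response.css("span.next-button a::attr(href)").extract_first()
if next_page:
    # urljoin gives an absolute URL, so the request always carries an http(s) scheme
    yield Request(url=response.urljoin(next_page), callback=self.parse)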
