I am following this tutorial. After writing the first spider, the tutorial directs me to use the command scrapy crawl quotes, but I get an error.
Here is my code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Here is the error that I encounter:
PS C:\Users\BB\desktop\scrapy\tutorial\spiders> scrapy crawl quotes
2018-09-12 13:55:06 [scrapy.utils.log] INFO: Scrapy 1.5.0 started
(bot: tutorial)
2018-09-12 13:55:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0,
libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted
17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar
2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
Traceback (most recent call last):
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\spiderloader.py",
line 69, in load
return self._spiders[spider_name]
KeyError: 'quotes'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\BB\Anaconda3\Scripts\scrapy-script.py", line 5, in
<module>
sys.exit(scrapy.cmdline.execute())
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\cmdline.py", line
150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\cmdline.py", line
90, in _run_print_help
func(*a, **kw)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\cmdline.py", line
157, in _run_command
cmd.run(args, opts)
File
"C:\Users\BB\Anaconda3\lib\site-packages\scrapy\commands\crawl.py",
line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\crawler.py", line
170, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\crawler.py", line
198, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\crawler.py", line
202, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "C:\Users\BB\Anaconda3\lib\site-packages\scrapy\spiderloader.py",
line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: quotes'
OK, I had created my own folder called spiders, but the tutorial had already created one for me, containing the __pycache__ and __init__.py files needed for the command 'scrapy crawl quotes' to work. In short, I was running it from the wrong folder.
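For anyone hitting the same thing: scrapy crawl needs to be run from inside the Scrapy project (anywhere under the directory that contains scrapy.cfg), not from a separate spiders folder created elsewhere. The layout generated by scrapy startproject tutorial looks roughly like this (the exact file list may differ slightly between Scrapy versions):

tutorial/                  <- run "scrapy crawl quotes" from here (or below)
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes_spider.py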
I was trying to run clustalw from the Biopython library with Python 3 on Google Cloud Platform, then generate a phylogenetic tree from the .dnd file using the Phylo library.
The code runs perfectly with no errors on my local system. However, when it runs on Google Cloud Platform it produces the following error:
python3 clustal.py
Traceback (most recent call last):
File "clustal.py", line 9, in <module>
align = AlignIO.read("opuntia.aln", "clustal")
File "/home/lhcy3w/.local/lib/python3.5/site-packages/Bio/AlignIO/__init__.py", line 435, in read
first = next(iterator)
File "/home/lhcy3w/.local/lib/python3.5/site-packages/Bio/AlignIO/__init__.py", line 357, in parse
with as_handle(handle, 'rU') as fp:
File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
return next(self.gen)
File "/home/lhcy3w/.local/lib/python3.5/site-packages/Bio/File.py", line 113, in as_handle
with open(handleish, mode, **kwargs) as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'opuntia.aln'
If I run sudo python3 clustal.py, the error is:
File "clustal.py", line 1, in <module>
from Bio import AlignIO
ImportError: No module named 'Bio'
If I run it interactively in the Python shell, the following happens:
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Align.Applications import ClustalwCommandline
>>> in_file = "opuntia.fasta"
>>> clustalw_cline = ClustalwCommandline("/usr/bin/clustalw", infile=in_file)
>>> clustalw_cline()
('\n\n\n CLUSTAL 2.1 Multiple Sequence Alignments\n\n\nSequence format is Pearson\nSequence 1: gi|6273291|gb|AF191665.1|AF191665 902 bp\nSequence 2: gi|6273290|gb
|AF191664.1|AF191664 899 bp\nSequence 3: gi|6273289|gb|AF191663.1|AF191663 899 bp\nSequence 4: gi|6273287|gb|AF191661.1|AF191661 895 bp\nSequence 5: gi|627328
6|gb|AF191660.1|AF191660 893 bp\nSequence 6: gi|6273285|gb|AF191659.1|AF191659 894 bp\nSequence 7: gi|6273284|gb|AF191658.1|AF191658 896 bp\n\n', '\n\nERROR:
Cannot open output file [opuntia.aln]\n\n\n')
Here is my clustal.py file:
from Bio import AlignIO
from Bio import Phylo
import matplotlib
from Bio.Align.Applications import ClustalwCommandline
in_file = "opuntia.fasta"
clustalw_cline = ClustalwCommandline("/usr/bin/clustalw", infile=in_file)
clustalw_cline()
tree = Phylo.read("opuntia.dnd", "newick")
tree = tree.as_phyloxml()
Phylo.draw(tree)
I just want to know how to create the .aln and .dnd files on Google Cloud Platform the way I can in my local environment. I guess it is probably because I don't have permission to create a new file on the server with Python. I tried f = open('test.txt', 'w') on Google Cloud and it didn't work until I added sudo before the terminal command, as in sudo python3 text.py. However, as you can see, for clustalw, adding sudo only makes the whole Biopython library go missing.
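The 'Cannot open output file [opuntia.aln]' message points at clustalw not being able to write into the current directory, rather than at a missing permission for Python itself. A minimal sketch of a workaround, assuming your home directory on the VM is writable (the paths are illustrative, not a confirmed fix):

import os
import shutil
from Bio import AlignIO, Phylo
from Bio.Align.Applications import ClustalwCommandline

# Hypothetical writable working directory; adjust as needed.
workdir = os.path.expanduser("~/clustal_out")
os.makedirs(workdir, exist_ok=True)
shutil.copy("opuntia.fasta", workdir)
os.chdir(workdir)  # clustalw writes opuntia.aln and opuntia.dnd next to the input

clustalw_cline = ClustalwCommandline("/usr/bin/clustalw", infile="opuntia.fasta")
stdout, stderr = clustalw_cline()

align = AlignIO.read("opuntia.aln", "clustal")
tree = Phylo.read("opuntia.dnd", "newick")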
I am trying to run a scrapy spider through the use of a proxy and am getting errors whenever I run the code.
This is for Mac OSX, python 3.7, scrapy 1.5.1.
I have tried playing around with the settings and middlewares but to no effect.
import scrapy


class superSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        print('request')
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print('parse')
The errors I get are:
2019-02-15 08:32:27 [scrapy.utils.log] INFO: Scrapy 1.5.1 started
(bot: superScraper)
2019-02-15 08:32:27 [scrapy.utils.log] INFO: Versions: lxml
4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0,
Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018,
03:13:28) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 18.0.0 (OpenSSL
1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Darwin-17.7.0-
x86_64-i386-64bit
2019-02-15 08:32:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'superScraper', 'CONCURRENT_REQUESTS': 25,
'NEWSPIDER_MODULE': 'superScraper.spiders', 'RETRY_HTTP_CODES':
[500, 503, 504, 400, 403, 404, 408], 'RETRY_TIMES': 10,
'SPIDER_MODULES': ['superScraper.spiders'], 'USER_AGENT':
'Mozilla/5.0 (compatible; bingbot/2.0;
+http://www.bing.com/bingbot.htm)'}
2019-02-15 08:32:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2019-02-15 08:32:27 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File
"/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 171, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 175, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy_proxies/randomproxy.py", line 99, in from_crawler
return cls(crawler.settings)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy_proxies/randomproxy.py", line 74, in __init__
raise KeyError('PROXIES is empty')
builtins.KeyError: 'PROXIES is empty'
These URLs are from the Scrapy documentation, and the spider works without using a proxy.
For anyone else having a similar problem: this was an issue with my local copy of the scrapy_proxies RandomProxy code.
Replacing it with the code from the repository here made it work:
https://github.com/aivarsk/scrapy-proxies
Go into the scrapy_proxies folder and replace the randomproxy.py code with the one found on GitHub.
Mine was located here:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy_proxies/randomproxy.py
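For context, scrapy_proxies reads its proxy configuration from settings.py; the project README documents roughly the settings below (treat the exact keys and values as an assumption and check the repository):

# settings.py (sketch based on the scrapy-proxies README)

# Retry many times, since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes, since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Path to a file with one proxy per line, e.g. http://host1:port
PROXY_LIST = '/path/to/proxy/list.txt'

# 0 = different proxy per request, 1 = one proxy for all requests,
# 2 = use a custom proxy from the settings
PROXY_MODE = 0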
I was trying to scrape the titles from this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products). I am using a crawl spider because I intend to get more information from every URL on the page later on,
but I got a TypeError: 'Rule' object is not iterable.
This is the code that I used:
import scrapy
import datetime
import socket
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from usgs.items import MineralItem
from scrapy.loader import ItemLoader


class MineralSpider(CrawlSpider):
    name = 'mineral'
    allowed_domains = ['web']
    start_urls = 'https://minerals.usgs.gov/science/mineral-deposit-database/#products'

    rules = (
        Rule(LinkExtractor(
            restrict_xpaths='//*[@id="products"][1]/p/a'),
            callback='parse')
    )

    def parse(self, response):
        it = ItemLoader(item=MineralItem(), response=response)
        it.add_xpath('name', '//*[@class="container"]/header/h1/text()')
        it.add_value('url', response.url)
        it.add_value('project', self.settings.get('BOT_NAME'))
        it.add_value('spider', self.name)
        it.add_value('server', socket.gethostname())
        it.add_value('date', datetime.datetime.now())
        return it.load_item()
LOG MESSAGE:
(base) C:\Users\User\Documents\Python WebCrawling Learing
Projects\usgs\usgs\spiders>scrapy crawl mineral
2018-11-16 17:43:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot:
usgs)
2018-11-16 17:43:03 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2
2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python
3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)],
pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform
Windows-10-10.0.17134-SP0
2018-11-16 17:43:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME':
'usgs', 'NEWSPIDER_MODULE': 'usgs.spiders', 'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['usgs.spiders']}
2018-11-16 17:43:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2018-11-16 17:43:03 [twisted] CRITICAL: Unhandled error in Deferred:
2018-11-16 17:43:03 [twisted] CRITICAL:
Traceback (most recent call last):
File "C:\Users\User\Anaconda3\lib\site-
packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\crawler.py", line
79, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\crawler.py", line
102, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py",
line 100, in from_crawler
spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
File "C:\Users\User\Anaconda3\lib\site-
packages\scrapy\spiders\__init__.py", line 51, in from_crawler
spider = cls(*args, **kwargs)
File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py",
line 40, in __init__
self._compile_rules()
File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py",
line 92, in _compile_rules
self._rules = [copy.copy(r) for r in self.rules]
TypeError: 'Rule' object is not iterable
Any ideas?
Add a comma after your Rule object, so that Python treats it as a one-element tuple:
rules = (
    Rule(LinkExtractor(
        restrict_xpaths='//*[@id="products"][1]/p/a'),
        callback='parse'),
)
You may want to take a look at this answer as well: Why does adding a trailing comma after a variable name make it a tuple?
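A minimal illustration of the difference, outside of Scrapy:

single = (1)   # just the int 1; parentheses alone do not create a tuple
tup = (1,)     # a one-element tuple
print(type(single), type(tup))   # <class 'int'> <class 'tuple'>

# A one-element list works for CrawlSpider rules as well:
# rules = [Rule(LinkExtractor(...), callback='parse')]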
Trying to initialize a client with a local cluster in JupyterLab, but it hangs. This behaviour happens with Python 3.5 and JupyterLab 0.35.
import dask.dataframe as dd
from dask import delayed
from distributed import Client
from distributed import LocalCluster
import pandas as pd
import numpy as np
import json
cluster = LocalCluster()
client = Client(cluster)
client
Versions of the tools:
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
And the jupyter lab version is
pip list |grep jupyter
jupyter 1.0.0
jupyter-client 5.2.3
jupyter-console 6.0.0
jupyter-core 4.4.0
jupyterlab 0.35.3
jupyterlab-server 0.2.0
Dask version
pip list |grep dask
dask 0.20.0
What could be wrong?
EDIT: No exception on the terminal while trying to initialize client
Interrupting the Kernel gives the following exceptions:
Inside jupyter:
distributed.nanny - ERROR - Failed to restart worker after its process exited
Traceback (most recent call last):
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/distributed/nanny.py", line 291, in _on_exit
yield self.instantiate()
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/distributed/nanny.py", line 226, in instantiate
self.process.start()
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/distributed/nanny.py", line 370, in start
yield self.process.start()
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/distributed/process.py", line 35, in _call_and_set_future
res = func(*args, **kwargs)
File "/home/avlach/virtualenvs/staistics3/lib/python3.5/site-packages/distributed/process.py", line 184, in _start
process.start()
File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.5/multiprocessing/context.py", line 281, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.5/multiprocessing/popen_forkserver.py", line 36, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/usr/lib/python3.5/multiprocessing/popen_forkserver.py", line 52, in _launch
self.sentinel, w = forkserver.connect_to_new_process(self._fds)
File "/usr/lib/python3.5/multiprocessing/forkserver.py", line 66, in connect_to_new_process
client.connect(self._forkserver_address)
ConnectionRefusedError: [Errno 111] Connection refused
and on the terminal
Kernel interrupted: 1e7da6ed-2137-41ce-818e-3484c0a659cc
fd_event_list = self._epoll.poll(timeout, max_ev)
KeyboardInterrupt
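One thing worth trying here (an assumption based on the forkserver ConnectionRefusedError above, not a confirmed fix): start the local cluster with threads instead of separate worker processes, which keeps the multiprocessing forkserver out of the picture:

from distributed import Client, LocalCluster

# processes=False keeps the workers in-process (threads), so no forkserver
# or child worker processes are launched; whether that resolves this
# particular hang is an assumption, not something verified here.
cluster = LocalCluster(processes=False)
client = Client(cluster)
client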
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 07:18:10) [MSC v.1900 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
from selenium import webdriver
browser = webdriver.Chrome(executable_path = "\\Users\\WorkStation\\AppData\\Local\\Programs\\Python\\Python36-32\\selenium\\webdriver\\chrome\\chromedriver.exe")
Traceback (most recent call last):
File "C:\Users\WorkStation\AppData\Local\Programs\Python\Python36-32\selenium\webdriver\common\service.py", line 74, in start
stdout=self.log_file, stderr=self.log_file)
File "C:\Users\WorkStation\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 665, in __init__
errread, errwrite) = self._get_handles(stdin, stdout, stderr)
File "C:\Users\WorkStation\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 910, in _get_handles
c2pwrite = msvcrt.get_osfhandle(self._get_devnull())
File "C:\Users\WorkStation\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 770, in _get_devnull
self._devnull = os.open(os.devnull, os.O_RDWR)
FileNotFoundError: [Errno 2] No such file or directory: 'nul'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
browser = webdriver.Chrome(executable_path = "\\Users\\WorkStation\\AppData\\Local\\Programs\\Python\\Python36-32\\selenium\\webdriver\\chrome\\chromedriver.exe")
File "C:\Users\WorkStation\AppData\Local\Programs\Python\Python36-32\selenium\webdriver\chrome\webdriver.py",
line 62, in __init__self.service.start()
File "C:\Users\WorkStation\AppData\Local\Programs\Python\Python36-32\selenium\webdriver\common\service.py",
line 81, in start os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver.exe' executable needs to be in PATH.
Please see https://sites.google.com/a/chromium.org/chromedriver/home
I use this syntax. Try it out:
browser = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
I tried this. Hope this works for you.
driver = webdriver.Chrome('C:\\Users\\Admin\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe')
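Whichever location you choose, it can help to verify that the path really points at chromedriver.exe before constructing the driver. A small sketch (the path below is hypothetical; executable_path is the keyword Selenium 3.x accepts):

import os
from selenium import webdriver

# Hypothetical location -- point this at wherever chromedriver.exe actually lives.
driver_path = r"C:\Users\WorkStation\Downloads\chromedriver.exe"
if not os.path.isfile(driver_path):
    raise FileNotFoundError("chromedriver.exe not found at %s" % driver_path)

browser = webdriver.Chrome(executable_path=driver_path)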