I'm having some trouble running multiple spiders in a row and I couldn't find an answer that fixed my issue.
In my project I have multiple spiders. The first one can run on its own, but the following ones depend on the first having finished for the program to work correctly.
How can I make one spider run after the other? I tried doing something like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import excelMerger
process = CrawlerProcess(get_project_settings())
process.crawl('urlClothes_spider')
process.start()
process.crawl('clothes_spider')
process.start()
process.crawl('finalClothes_spider')
process.start()
But after the first one finishes I get a ReactorNotRestartable error.
I have also tried just putting the .crawl calls one after the other, but that way the order does not seem to be followed, so the program does not work. Something like this:
process.crawl('urlClothes_spider')
process.crawl('clothes_spider')
process.crawl('finalClothes_spider')
Any ideas on how to fix the issue?
You need to follow the sequential execution example in the documentation:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl('urlClothes_spider')
    yield runner.crawl('clothes_spider')
    yield runner.crawl('finalClothes_spider')
    reactor.stop()
crawl()
reactor.run()
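If the spiders do not have to share one Python process, another way to get strict ordering is to launch each crawl as its own process, so no reactor ever has to restart. The sketch below is only an illustration under stated assumptions: it relies on the `scrapy crawl <name>` command being on the PATH and on the script being run from the Scrapy project directory. The function name `run_spiders_in_order` and its `run` parameter are hypothetical names introduced here.

```python
import subprocess


def run_spiders_in_order(spider_names, run=subprocess.run):
    """Run `scrapy crawl <name>` for each spider, one process at a time.

    `run` is a hypothetical hook (defaults to subprocess.run) so the
    function can be exercised without Scrapy installed.
    """
    for name in spider_names:
        # check=True raises CalledProcessError when a crawl exits non-zero,
        # so later spiders never start after a failed one.
        run(["scrapy", "crawl", name], check=True)


# Usage (from the project directory):
# run_spiders_in_order(
#     ["urlClothes_spider", "clothes_spider", "finalClothes_spider"]
# )
```

Each crawl pays the cost of a fresh interpreter start, which is usually negligible next to the crawl itself.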
I have a script that I need to run after my spider closes. I see that Scrapy has a handler called spider_closed(), but what I don't understand is how to incorporate this into my script. Once the scraper is done crawling, I want to combine all my CSV files and then load them into Sheets. If anyone has any examples of how this can be done, that would be great.
As per the example in the documentation, you add the following to your Spider:
from scrapy import signals

# Both methods below belong inside your Spider class.

# This function remains as-is.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
    return spider

# This is where you do your CSV combination.
def spider_closed(self, spider):
    # Whatever is here will run when the spider is done.
    combine_csv_to_sheet()
As per the comments on my other answer about a signal-based solution, here is a way to run some code after multiple spiders are done. This does not involve using the spider_closed signal.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('spider1')
process.crawl('spider2')
process.crawl('spider3')
process.crawl('spider4')
process.start()
# CSV combination code goes here. It will only run when all the spiders are done.
# ...
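The CSV combination step itself could look roughly like the sketch below. It uses only the standard library and assumes every CSV file in the directory has the same header row; the function name `combine_csv_files` and its parameters are hypothetical, and uploading the result to Sheets is deliberately left out.

```python
import csv
from pathlib import Path


def combine_csv_files(input_dir, output_path):
    """Concatenate every *.csv in input_dir into output_path, keeping one header row.

    Returns the number of data rows written.
    """
    output_path = Path(output_path)
    # Collect the inputs up front and skip the output file if it already
    # lives in the same directory.
    csv_paths = sorted(
        p for p in Path(input_dir).glob("*.csv")
        if p.resolve() != output_path.resolve()
    )
    rows_written = 0
    header = None
    with output_path.open("w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for csv_path in csv_paths:
            with csv_path.open(newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Called after process.start() returns, or from spider_closed(), this produces a single file that a separate upload step can then send to Sheets.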
Strange title I know, but it is exactly what I see. I am trying to run a requests (2.13.0) command from within a forked process (Mac OSX) using the multiprocessing module. I also happen to use numpy in my code (1.15.1) running on python 3.7. Here are my observations (see code below):
1) Without importing numpy: All works fine
2) Once I import numpy: Code crashes on starting of the forked process. Message given is:
objc[45539]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[45539]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
3) I could make it work again by making a requests call from within the main process once before starting the new process (see the commented section in the code).
4) On python 2.7, all seems to work fine in all cases above.
Sample minimal code to reproduce:
from multiprocessing import Process
import requests
import numpy # remove this import and it works fine on 3.7
def _worker():
    full_url = "http://www.google.com"
    result = requests.get(full_url)
    print(result.text)
    return 0
def run():
    p = Process(target=_worker)
    p.start()
    p.join()
# Add these lines and the code works in 3.7 even with numpy imported
# try:
#     requests.get('http://www.google.com')
# except:
#     pass
run()
print('I am done')
I am currently running a unittest script which successfully passes the various specified tests, but with a nagging ImportWarning message in the console:
...../lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
return f(*args, **kwds)
....
----------------------------------------------------------------------
Ran 7 tests in 1.950s
OK
The script is run with this main function:
if __name__ == '__main__':
    unittest.main()
I have read that warnings can be suppressed when the script is called like this:
python -W ignore:ImportWarning -m unittest testscript.py
However, is there a way of specifying this ignore warning in the script itself so that I don't have to call -W ignore:ImportWarning every time that the testscript is run?
Thanks in advance.
To programmatically prevent such warnings from showing up, adjust your code as follows:
import warnings
if __name__ == '__main__':
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=ImportWarning)
        unittest.main()
Source: https://stackoverflow.com/a/40994600/328469
Update:
@billjoie is certainly correct. If the OP chooses to make answer 52463661 the accepted answer, I am OK with that. I can confirm that the following is effective at suppressing such warning messages at run-time using python versions 2.7.11, 3.4.3, 3.5.4, 3.6.5, and 3.7.1:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import unittest
import warnings
class TestPandasImport(unittest.TestCase):
    def setUp(self):
        warnings.simplefilter('ignore', category=ImportWarning)
    def test_01(self):
        import pandas  # noqa: E402
        self.assertTrue(True)
    def test_02(self):
        import pandas  # noqa: E402
        self.assertFalse(False)
if __name__ == '__main__':
    unittest.main()
However, I think that the OP should consider doing some deeper investigation into the application code targets of the unit tests, and try to identify the specific package import or operation which is causing the actual warning, and then suppress the warning as closely as possible to the location in code where the violation takes place. This will obviate the suppression of warnings throughout the entirety of one's unit test class, which may be inadvertently obscuring warnings from other parts of the program.
Outside the unit test, somewhere in the application code:
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=ImportWarning)
    # import pandas
    # or_ideally_the_application_code_unit_that_imports_pandas()
It could take a bit of work to isolate the specific spot in the code that is either causing the warning or leveraging third-party software which causes the warning, but the developer will obtain a clearer understanding of the reason for the warning, and this will only improve the overall maintainability of the program.
I had the same problem, and starting my unittest script with a warnings.simplefilter() statement, as described by Nels, did not work for me. According to this source, this is because:
[...] as of Python 3.2, the unittest module was updated to use the warnings module default filter when running tests, and [...] resets to the default filter before each test, meaning that any change you may think you are making scriptwide by using warnings.simplefilter("ignore") at the beginning of your script gets overridden in between every test.
This same source recommends renewing the filter inside each test function, either directly or with an elegant decorator. A simpler solution is to define the warnings filter inside unittest's setUp() method, which runs right before each test.
import unittest
import warnings
class TestSomething(unittest.TestCase):
    def setUp(self):
        warnings.simplefilter('ignore', category=ImportWarning)
        # Other initialization stuff here
    def test_a(self):
        # Test assertion here.
        pass
if __name__ == '__main__':
    unittest.main()
I had the same warning in Pycharm for one test when using unittest. This warning disappeared when I stopped trying to import a library during the test (I moved the import to the top where it's supposed to be). I know the request was for suppression, but this would also make it disappear if it's only happening in a select number of tests.
Solutions that use setUp suppress warnings for all methods within the class. If you don't want to suppress them for every test, you can use a decorator.
From Neural Dump:
def ignore_warnings(test_func):
    def do_test(self, *args, **kwargs):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            test_func(self, *args, **kwargs)
    return do_test
Then you can use it to decorate a single test method in your test class:
class TestClass(unittest.TestCase):
    @ignore_warnings
    def test_do_something_without_warning(self):
        self.assertEqual(1 + 1, 2)  # placeholder assertion
    def test_something_else_with_warning(self):
        self.assertEqual(1 + 1, 2)  # placeholder assertion
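A variation on the decorator above, offered as a sketch rather than as the Neural Dump author's code: it takes the warning category as an argument, so only ImportWarning is silenced instead of everything, and it uses functools.wraps so the decorated test keeps its own name in test reports. The functions `noisy` and `quiet` are hypothetical stand-ins for real test code.

```python
import functools
import warnings


def ignore_warnings(category=Warning):
    """Decorator factory: silence warnings of `category` while the wrapped function runs."""
    def decorator(func):
        @functools.wraps(func)  # keep the original function name and docstring
        def wrapper(*args, **kwargs):
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", category=category)
                return func(*args, **kwargs)
        return wrapper
    return decorator


def noisy():
    # Hypothetical stand-in for code that triggers an ImportWarning.
    warnings.warn("simulated import problem", ImportWarning)
    return "done"


@ignore_warnings(ImportWarning)
def quiet():
    # Same call, but the ImportWarning is filtered inside the decorator.
    return noisy()
```

Because catch_warnings() restores the previous filters on exit, other tests in the same class still see their warnings.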
I want to import asyncore from a different directory, because I need to make some changes to how asyncore works, and don't want to modify the base file.
I could include it in the folder with my script, but after putting all the modules I need there it ends up getting rather cluttered.
I'm well aware of making a subdirectory and putting a blank __init__.py file in it. This doesn't work. I'm not exactly sure what happens, but when I import asyncore from a subdirectory, asyncore just plain stops working. Specifically, the connect method doesn't get run at all, even though I'm calling it. Moving asyncore to the main directory and importing it normally removes this problem.
I trimmed down my code significantly, but this still has the same problem:
from Modules import asyncore
from Modules import asynchat
from Modules import socket
class runBot(asynchat.async_chat, object):
    def __init__(self):
        asynchat.async_chat.__init__(self)
        self.connect_to_twitch()
    def connect_to_twitch(self):
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect(('irc.chat.twitch.tv', 6667))
        self.set_terminator('\n')
        self.buffer = []
    def collect_incoming_data(self, data):
        self.buffer.append(data)
    def found_terminator(self):
        msg = ''.join(self.buffer)
        print(msg)
if __name__ == '__main__':
    # Assign bots to channels
    bot = runBot()
    # Start bots
    asyncore.loop(0.001)
I'm sure this is something really simple I'm overlooking, but I'm just not able to figure this out.
Use sys.path.append -- see https://docs.python.org/3/tutorial/modules.html for the details.
Update: Try putting a debug print at the beginning and end of your modules' source files to see whether they are imported as expected. You can also print the __file__ attribute of the module you want to use, to check that you imported the one you expected:
import re
#...
print(re.__file__)
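The two suggestions above can be combined into one small check. One detail worth knowing: sys.path.append puts the directory at the end of the search path, so a module of the same name found earlier on the path is imported first, whereas sys.path.insert(0, ...) makes the directory win. The sketch below assumes a hypothetical module called patched_lib and creates it in a temporary Modules directory only so that the example runs on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical directory and module, created here only so the sketch is runnable.
modules_dir = Path(tempfile.mkdtemp()) / "Modules"
modules_dir.mkdir()
(modules_dir / "patched_lib.py").write_text("VALUE = 42\n")

# insert(0, ...) makes this directory the first place Python looks;
# append() would make it the last.
sys.path.insert(0, str(modules_dir))
importlib.invalidate_caches()

import patched_lib

# Shows which file was actually imported.
print(patched_lib.__file__)
```

If the printed path is not inside the directory you expected, a different copy of the module is being picked up.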
I have a (python3) package that has completely different behaviour depending on how it's init()ed (perhaps not the best design, but rewriting is not an option). The module can only be init()ed once, a second time gives an error. I want to test this package (both behaviours) using py.test.
Note: the nature of the package makes the two behaviours mutually exclusive, there is no possible reason to ever want both in a singular program.
I have several test_xxx.py modules in my test directory. Each module will init the package in the way it needs (using fixtures). Since py.test starts the Python interpreter once, running all test modules in one py.test run fails.
Monkey-patching the package to allow a second init() is not something I want to do, since there is internal caching etc that might result in unexplained behaviour.
Is it possible to tell py.test to run each test module in a separate Python process (thereby not being influenced by inits in another test module)?
Is there a way to reliably reload a package (including all sub-dependencies, etc)?
Is there another solution (I'm thinking of importing and then unimporting the package in a fixture, but this seems excessive)?
To reload a module, try using reload() from the importlib library.
Example:
from importlib import reload
import some_lib
#do something
reload(some_lib)
Also, launching each test in a new process is viable, but multiprocessed code is kind of painful to debug.
Example
import some_test
from multiprocessing import Manager, Process
#create new return value holder, in this case a list
manager = Manager()
return_value = manager.list()
#create new process
process = Process(target=some_test.some_function, args=(arg, return_value))
#execute process
process.start()
#finish and return process
process.join()
#you can now use your return value as if it were a normal list,
#as long as it was assigned in your subprocess
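Another way to get one process per test module, without sharing any state through a Manager, is to start a fresh interpreter for each run. The sketch below is an assumption-laden illustration, not a pytest feature: `run_in_fresh_interpreter` is a hypothetical helper, and a one-line snippet that prints its process ID stands in for a real test module.

```python
import os
import subprocess
import sys


def run_in_fresh_interpreter(args):
    """Run `python <args...>` in a new process; return (exit code, stdout)."""
    completed = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stdout


# Stand-in for a test module: the child reports its own process ID,
# which shows it does not run inside the current interpreter.
exit_code, output = run_in_fresh_interpreter(["-c", "import os; print(os.getpid())"])
child_pid = int(output)
print(child_pid != os.getpid())
```

With pytest installed, the same helper could be called once per test file with arguments such as ["-m", "pytest", "tests/test_mode_a.py"] (a hypothetical path), treating a non-zero exit code as a failed module.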
Delete your package's modules from sys.modules, along with any test modules that import them:
import sys
for key in list(sys.modules.keys()):
    if key.startswith("your_package_name") or key.startswith("test"):
        del sys.modules[key]
You can use this as a fixture by defining one in your conftest.py file with the @pytest.fixture decorator.
I once had a similar problem (quite a bad design, though).
import importlib
import sys
import pytest
@pytest.fixture()
def module_type1():
    mod = importlib.import_module('example')
    mod._init(10)
    yield mod
    # Forget the module so the next fixture gets a fresh import.
    del sys.modules['example']
@pytest.fixture()
def module_type2():
    mod = importlib.import_module('example')
    mod._init(20)
    yield mod
    del sys.modules['example']
def test1(module_type1):
    pass
def test2(module_type2):
    pass
The example/__init__.py had something like this:
def _init(val):
    if 'sample' in globals():
        logger.info(f'example already imported, val{sample}')
    else:
        globals()['sample'] = val
        logger.info(f'importing example with val : {val}')
output:
importing example with val : 10
importing example with val : 20
No clue as to how complex your package is, but if it's just global variables, then this probably helps.
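The idea of dropping a package from sys.modules can be wrapped in a small helper. The sketch below is a self-contained illustration: `fresh_import` and the single_init_demo module are hypothetical, and the module is written to a temporary directory only so the example runs. It mimics a package whose init() may be called only once.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def fresh_import(package_name):
    """Drop `package_name` and its submodules from sys.modules, then import it again."""
    for key in list(sys.modules):
        if key == package_name or key.startswith(package_name + "."):
            del sys.modules[key]
    return importlib.import_module(package_name)


# Hypothetical single-init module, written to a temp dir so the sketch is runnable.
source = '''
_mode = None

def init(mode):
    global _mode
    if _mode is not None:
        raise RuntimeError("already initialised")
    _mode = mode

def mode():
    return _mode
'''
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "single_init_demo.py").write_text(source)
sys.path.insert(0, tmp_dir)

first = fresh_import("single_init_demo")
first.init("a")
second = fresh_import("single_init_demo")
second.init("b")  # no error: a new module object with fresh globals
print(first.mode(), second.mode())  # prints: a b
```

Anything that still holds a reference to the old module object keeps seeing the old state, which is why this works best for packages whose state lives only in module-level globals.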
I have the same problem, and found three solutions:
1) Call reload(some_lib).
2) Patch the SUT. Because the imported function is bound as a name in the SUT's own namespace, you can patch it there. For example, if m1 uses f2 from m2, patch m1.f2 instead of m2.f2.
3) Import the module and call module.function, rather than importing the function directly.