Scrapy Spider Close - python-3.x

I have a script that I need to run after my spider closes. I see that Scrapy has a handler called spider_closed(), but what I don't understand is how to incorporate it into my script. What I am looking to do is: once the scraper is done crawling, I want to combine all my CSV files and then load them to sheets. If anyone has any examples of how this can be done, that would be great.

As per the example in the documentation, you add the following to your Spider:
from scrapy import signals

# This function remains as-is.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
    return spider

# This is where you do your CSV combination.
def spider_closed(self, spider):
    # Whatever is here will run when the spider is done.
    combine_csv_to_sheet()
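Since the question also asks about combining the CSV files, here is a minimal sketch of what the combination step might look like. The function name, the glob pattern, and the assumption that every file shares the same header row are mine, not from the question:

```python
import csv
import glob
import os
import tempfile

def combine_csv_files(pattern, out_path):
    """Concatenate rows from every CSV matching pattern, keeping a single header."""
    header_written = False
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, newline='') as src:
                rows = list(csv.reader(src))
            if not rows:
                continue  # skip empty files
            if not header_written:
                writer.writerow(rows[0])
                header_written = True
            writer.writerows(rows[1:])  # skip the per-file header

# Small self-contained demo with two throwaway CSV files.
workdir = tempfile.mkdtemp()
for name, row in [('a_prices.csv', ['x', '1']), ('b_prices.csv', ['y', '2'])]:
    with open(os.path.join(workdir, name), 'w', newline='') as f:
        csv.writer(f).writerows([['name', 'price'], row])

combined = os.path.join(workdir, 'combined.csv')
combine_csv_files(os.path.join(workdir, '*_prices.csv'), combined)

with open(combined, newline='') as f:
    combined_rows = list(csv.reader(f))
```

The upload-to-sheets step would go after this (e.g. via the Google Sheets API), which is out of scope here.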

As per the comments on my other answer about a signal-based solution, here is a way to run some code after multiple spiders are done. This does not involve using the spider_closed signal.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('spider1')
process.crawl('spider2')
process.crawl('spider3')
process.crawl('spider4')
process.start()
# CSV combination code goes here. It will only run when all the spiders are done.
# ...

Related

How to combine multiple custom management commands in Django?

I wrote a set of commands for my Django project in the usual way. Is it possible to combine multiple commands into one command?
class Command(BaseCommand):
    """ import all files in the media dir and add to db;
    afterwards call process to generate thumbnails """
    def handle(self, *args, **options):
        ...
To keep the steps simple I have commands like: import files, read metadata from files, create thumbnails, etc. The goal now is to create a "do all" command that somehow imports these commands and executes them one after another.
How to do that?
You can define a DoAll command and use django.core.management.call_command to run all your subcommands. Try something like below:
from django.core.management import call_command
from django.core.management.base import BaseCommand

class FoobarCommand(BaseCommand):
    """ call foo and bar commands """
    def handle(self, *args, **options):
        call_command("foo_command", *args, **options)
        call_command("bar_command", *args, **options)

Python unittest framework, unfortunately no built-in timeout possibility

I'm using the Python unittest framework to perform unit tests, with Python 3.6 on the Windows OS.
My production code is currently a little bit unstable (as I added some new functionality) and tends to hang in some internal asynchronous loops. I'm currently working on fixing those hangups.
However, tracking down those bugs is hindered by the corresponding test cases hanging too.
I would like the corresponding test cases to stop after e.g. 500 ms if they don't run through, and be marked as FAILED, so that all the other test cases can continue.
Unfortunately, the unittest framework does not support timeouts (if only I had known in advance ...). Therefore I'm searching for a workaround. If some package adds that missing functionality to the unittest framework, I would be willing to try it. I don't want my production code to rely on too many non-standard packages; for the unit tests this would be OK.
I'm a little lost on how to add such functionality to the unit tests.
Therefore I just tried out some code from here: How to limit execution time of a function call?. As it was said somewhere that threading should not be used to implement timeouts, I tried using multiprocessing, too.
Please note that the solutions proposed here: How to specify test timeout for python unittest? also do not work. They are designed for Linux (using SIGALRM).
import multiprocessing
# import threading
import time
import unittest

class InterruptableProcess(multiprocessing.Process):
# class InterruptableThread(threading.Thread):
    def __init__(self, func, *args, **kwargs):
        super().__init__()
        self._func = func
        self._args = args
        self._kwargs = kwargs
        self._result = None

    def run(self):
        self._result = self._func(*self._args, **self._kwargs)

    @property
    def result(self):
        return self._result

class timeout:
    def __init__(self, sec):
        self._sec = sec

    def __call__(self, f):
        def wrapped_f(*args, **kwargs):
            it = InterruptableProcess(f, *args, **kwargs)
            # it = InterruptableThread(f, *args, **kwargs)
            it.start()
            it.join(self._sec)
            if not it.is_alive():
                return it.result
            # it.terminate()
            raise TimeoutError('execution expired')
        return wrapped_f

class MyTests(unittest.TestCase):
    def setUp(self):
        # some initialization
        pass

    def tearDown(self):
        # some cleanup
        pass

    @timeout(0.5)
    def test_XYZ(self):
        # looong running
        self.assertEqual(...)
The code behaves very differently using threads vs. processes.
In the first case it runs through, but execution of the function continues despite the timeout (which is unwanted). In the second case it complains about unpicklable objects.
In both cases I would like to know how to do proper cleanup, e.g. call the unittest.TestCase.tearDown method on timeout from within the decorator class.
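One possible direction, though it does not answer the tearDown part of the question: a multiprocessing.Process cannot return a result through an instance attribute, because run() executes in a separate address space, so self._result stays None in the parent. The sketch below (the names are mine) instead passes the result back through a multiprocessing.Queue and terminates the hung child on timeout. Caveat: on Windows, the spawn start method still requires the wrapped function and its arguments to be picklable, which is likely the source of the "unpicklable objects" error above:

```python
import multiprocessing
import time

def _run_in_child(queue, func, args, kwargs):
    # Child-side helper: push the function's result back through the queue.
    queue.put(func(*args, **kwargs))

def timeout(sec):
    """Decorator: run the function in a subprocess, kill it after `sec` seconds."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            queue = multiprocessing.Queue()
            proc = multiprocessing.Process(
                target=_run_in_child, args=(queue, func, args, kwargs))
            proc.start()
            proc.join(sec)
            if proc.is_alive():
                proc.terminate()  # hard-kill the hung child
                proc.join()
                raise TimeoutError('execution expired')
            return queue.get()
        return wrapper
    return decorator

@timeout(5.0)
def quick():
    return 42

@timeout(0.2)
def hangs():
    time.sleep(60)  # simulates a hung asynchronous loop
```

Note that terminating the child skips any cleanup code inside it, so resources held by the hung test (files, sockets) are released only by the OS.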

Scrapy runs all spiders at once. I want to only run one spider at a time. Scrapy crawl <spider>

I am new to Scrapy and am trying to play around with the framework. What is really frustrating is that when I run "scrapy crawl (name of spider)" it runs every single spider in my "spiders" folder. So I either have to wait out all of the spiders running or comment out all the spiders except for the one I am working with. It is very annoying. How can I make it so that scrapy only runs one spider at a time?
You can run scrapy from your script (https://scrapy.readthedocs.io/en/latest/topics/practices.html#run-from-script),
for example:
import scrapy
from scrapy.crawler import CrawlerProcess

class YourSpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess()
process.crawl(YourSpider)
process.start()
It shouldn't be running every spider in full; Scrapy does import each module in the spiders folder and run through some code, as that's how it pulls the spider names (I assume there are other reasons too, otherwise it seems like an odd way to set things up), so any module-level code in those files will execute. If you post your spider we can see what might be running vs. not.
I had the same issue, as my spiders modified CSV files, including renaming/deleting them, which was screwing things up when I only wanted to run a specific spider. My solution was to have the spiders do certain tasks only when they were actually run or closed. Documentation is here: https://docs.scrapy.org/en/latest/topics/signals.html though I found it lacking.
Here is the code I used. The from_crawler section can be left alone aside from changing the spider name. Put whatever you'd like in the spider_closed portion.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(SixPMSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
    return spider

def spider_closed(self, spider):
    os.remove(self.name + '_price_list.csv')
    os.rename(self.name + '_price_list2.csv', self.name + '_price_list.csv')

Reactor not restartable while running multiple spiders

I'm having some trouble running multiple spiders in a row and I couldn't find an answer that fixed my issue.
In my project I have multiple spiders; one of them can run on its own, but the following ones depend on the first having finished for the program to work correctly.
How can I make one spider run after the other? I tried doing something like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import excelMerger
process = CrawlerProcess(get_project_settings())
process.crawl('urlClothes_spider')
process.start()
process.crawl('clothes_spider')
process.start()
process.crawl('finalClothes_spider')
process.start()
But after the first one finishes I get a ReactorNotRestartable error.
I have also tried just putting the .crawl() calls one after the other, but it seems the order is not followed that way, so the program does not work; something like this:
process.crawl('urlClothes_spider')
process.crawl('clothes_spider')
process.crawl('finalClothes_spider')
Any ideas on how to fix the issue?
You need to follow the sequential execution example in the documentation:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl('urlClothes_spider')
    yield runner.crawl('clothes_spider')
    yield runner.crawl('finalClothes_spider')
    reactor.stop()

crawl()
reactor.run()

How to suppress ImportWarning in a python unittest script

I am currently running a unittest script which successfully passes the various specified tests, with a nagging ImportWarning message in the console:
...../lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
return f(*args, **kwds)
....
----------------------------------------------------------------------
Ran 7 tests in 1.950s
OK
The script is run with this main function:
if __name__ == '__main__':
    unittest.main()
I have read that warnings can be suppressed when the script is called like this:
python -W ignore::ImportWarning -m unittest testscript.py
However, is there a way of specifying this ignore-warning filter in the script itself, so that I don't have to pass -W ignore::ImportWarning every time the test script is run?
Thanks in advance.
To programmatically prevent such warnings from showing up, adjust your code so that:
import unittest
import warnings

if __name__ == '__main__':
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=ImportWarning)
        unittest.main()
Source: https://stackoverflow.com/a/40994600/328469
Update:
@billjoie is certainly correct. If the OP chooses to make answer 52463661 the accepted answer, I am OK with that. I can confirm that the following is effective at suppressing such warning messages at run-time using Python versions 2.7.11, 3.4.3, 3.5.4, 3.6.5, and 3.7.1:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import unittest
import warnings

class TestPandasImport(unittest.TestCase):
    def setUp(self):
        warnings.simplefilter('ignore', category=ImportWarning)

    def test_01(self):
        import pandas  # noqa: E402
        self.assertTrue(True)

    def test_02(self):
        import pandas  # noqa: E402
        self.assertFalse(False)

if __name__ == '__main__':
    unittest.main()
However, I think that the OP should consider doing some deeper investigation into the application code targets of the unit tests, and try to identify the specific package import or operation which is causing the actual warning, and then suppress the warning as closely as possible to the location in code where the violation takes place. This will obviate the suppression of warnings throughout the entirety of one's unit test class, which may be inadvertently obscuring warnings from other parts of the program.
Outside the unit test, somewhere in the application code:
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=ImportWarning)
    # import pandas
    # or_ideally_the_application_code_unit_that_imports_pandas()
It could take a bit of work to isolate the specific spot in the code that is either causing the warning or leveraging third-party software which causes the warning, but the developer will obtain a clearer understanding of the reason for the warning, and this will only improve the overall maintainability of the program.
I had the same problem, and starting my unittest script with a warnings.simplefilter() statement, as described by Nels, did not work for me. According to this source, this is because:
[...] as of Python 3.2, the unittest module was updated to use the warnings module default filter when running tests, and [...] resets to the default filter before each test, meaning that any change you may think you are making scriptwide by using warnings.simplefilter("ignore") at the beginning of your script gets overridden in between every test.
This same source recommends to renew the filter inside of each test function, either directly or with an elegant decorator. A simpler solution is to define the warnings filter inside unittest's setUp() method, which is run right before each test.
import unittest
import warnings

class TestSomething(unittest.TestCase):
    def setUp(self):
        warnings.simplefilter('ignore', category=ImportWarning)
        # Other initialization stuff here

    def test_a(self):
        # Test assertion here.
        pass

if __name__ == '__main__':
    unittest.main()
I had the same warning in Pycharm for one test when using unittest. This warning disappeared when I stopped trying to import a library during the test (I moved the import to the top where it's supposed to be). I know the request was for suppression, but this would also make it disappear if it's only happening in a select number of tests.
Solutions with def setUp suppress warnings for all test methods within the class. If you don't want to suppress warnings for all of them, you can use a decorator.
From Neural Dump:
import warnings

def ignore_warnings(test_func):
    def do_test(self, *args, **kwargs):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            test_func(self, *args, **kwargs)
    return do_test
Then you can use it to decorate single test method in your test class:
class TestClass(unittest.TestCase):
    @ignore_warnings
    def test_do_something_without_warning(self):
        self.assertEqual(whatever)

    def test_something_else_with_warning(self):
        self.assertEqual(whatever)
