Running a scrapy program from another python script - python-3.x

This question has been more or less answered before, but the answers are years old.
In my "project" I have 4 spiders, and each one deals with a different kind of product (I am scraping Amazon at the moment). Each product has a category: for example, if I want to scrape "laptops" I use one spider, but if the objective is to scrape clothes, I use another.
So, is there a way to run a Python script that calls a different spider depending on the product I have to scrape (products are read from a txt file)?
The code would look something like this:
# Imports
def scrapyProject():
    # Get the products I want to scrape
    if productIsClothes:
        runClothesSpider
    elif productIsGeneric:
        runGenericSpider
I know the previous code is rough; it's a sketch of the final code.
It would also help to know which imports I need for the program to work.

You could simply choose the spider class with an if statement:
import sys
import scrapy
from scrapy.crawler import CrawlerProcess
from project.spiders import Spider1, Spider2
def main():
    process = CrawlerProcess({})
    if sys.argv[1] == '1':
        spider_cls = Spider1
    elif sys.argv[1] == '2':
        spider_cls = Spider2
    else:
        print('1st argument must be either 1 or 2')
        return
    process.crawl(spider_cls)
    process.start()  # the script will block here until the crawling is finished
if __name__ == '__main__':
    main()
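Since the question reads its products from a text file, the same selection can be driven by a lookup table instead of `sys.argv`. The following is only a sketch under stated assumptions: the category names, the spider names, the fallback, and the `category,product` line format are all hypothetical and not part of the original project. It groups products by the spider that should handle them, so each spider only needs to be scheduled once.

```python
from collections import defaultdict

# Hypothetical mapping from product category to spider name.
SPIDER_FOR_CATEGORY = {
    "clothes": "clothes_spider",
    "laptops": "laptops_spider",
}
DEFAULT_SPIDER = "generic_spider"  # assumed fallback for unknown categories


def group_products_by_spider(lines):
    """Group products by the spider that should scrape them.

    Each line is assumed to look like "category,product name".
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        category, _, product = line.partition(",")
        spider = SPIDER_FOR_CATEGORY.get(category.strip().lower(), DEFAULT_SPIDER)
        groups[spider].append(product.strip())
    return dict(groups)
```

In a real script the mapping values would probably be the spider classes themselves, and `lines` would come from the opened products file; each key of the result could then be handed to `process.crawl(...)` before the single `process.start()` call shown in the answer.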

Related

Executing python script with various modes

I am developing a data pipeline in Python with around 7 or 8 modes. Basically, the data pipeline will be calling many classes. For simplicity, I have created a simple test script (pseudocode) below, but in the real code every function is imported from a class.
Some modes/tasks are independent steps, and a few can be combined into a data pipeline.
For example, test_flow is an independent workflow. create_flow and monitor_flow can be called as independent tasks or sometimes together.
Is there a better way to design the pipeline? There are about 8 modes, and I feel the design (calling --modes as below) is a bit clumsy. Please let me know if there are more elegant ways. Thanks.
def test_flow:
    print(test_flow)
def create_flow:
    print(create_flow)
def monitor_flow:
    print(monitor_flow)
if __name__ == "__main__":
    if args.mode == "test_flow":
        test_flow
    if args.mode == "create_flow":
        create_flow
    if args.mode == "monitor_flow":
        monitor_flow
Your example code is full of syntax errors!
I would suggest something like this, but you probably would want to ensure further that only certain functions are reachable via the command line:
import sys
def test_flow():
    print("called test_flow")
def create_flow():
    print("called create_flow")
def monitor_flow():
    print("called monitor_flow")
def main(argv):
    if len(argv) > 1:
        # look the function up by name; .get() returns None for unknown names
        specifiedCall = globals().get(argv[1])
        if callable(specifiedCall):
            specifiedCall()
if __name__ == "__main__":
    main(sys.argv)
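One way to make sure that only certain functions are reachable from the command line is an explicit dictionary of modes combined with `argparse` and its `choices` option. This is a minimal sketch, not the answerer's code: the `MODES` table and the `run` helper are names introduced here for illustration, and the functions return their message instead of printing it so the result is easy to check.

```python
import argparse


def test_flow():
    return "called test_flow"


def create_flow():
    return "called create_flow"


def monitor_flow():
    return "called monitor_flow"


# Only the functions listed here can be reached from the command line.
MODES = {
    "test_flow": test_flow,
    "create_flow": create_flow,
    "monitor_flow": monitor_flow,
}


def run(argv=None):
    """Parse --modes and call the selected functions in the given order."""
    parser = argparse.ArgumentParser(description="Run one or more pipeline modes.")
    parser.add_argument(
        "--modes",
        nargs="+",
        choices=sorted(MODES),
        required=True,
        help="modes to run, in the given order",
    )
    args = parser.parse_args(argv)
    return [MODES[name]() for name in args.modes]
```

Calling `run(["--modes", "create_flow", "monitor_flow"])` runs both steps together, which covers the case where modes are sometimes combined; an unknown mode name makes `argparse` exit with a usage error instead of calling an arbitrary global.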

Extract item for each spider in scrapy project

I have over a dozen spiders in a Scrapy project, with a variety of items being extracted from different sources. Among other elements, I mostly have to copy the same regex code over and over again in each spider, for example:
item['element'] = re.findall('my_regex', response.text)
I use this regex to get the same element, which is defined in the Scrapy items. Is there a way to avoid copying it? Where do I put this in the project so that I don't have to copy it into each spider and only add the parts that are different?
My project structure is the default one.
Any help is appreciated. Thanks in advance.
So if I understand your question correctly, you want to use the same regular expression across multiple spiders.
You can do this:
Create a Python module called something like regex_to_use.
Inside that module, place your regular expression.
Example:
# regex_to_use.py
regex_one = 'test'
You can then access this expression in your spiders.
# spider.py
import regex_to_use
import re as regex
find_string = regex.search(regex_to_use.regex_one, ' this is a test')
print(find_string)
# output
<re.Match object; span=(11, 15), match='test'>
You could also do something like this in your regex_to_use module
# regex_to_use.py
import re as regex
class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text
    def search_text(self):
        find_xyx = regex.search('test', self._text)
        return find_xyx
and you would call it this way in your spiders:
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test').search_text()
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
If you have multiple regular expressions you could do something like this:
# regex_to_use.py
import re as regex
class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text
    def search_text(self, regex_to_use):
        regular_expressions = {"regex_one": 'test_1', "regex_two": 'test_2'}
        # look up the pattern by name; an unknown name raises KeyError
        expression = regular_expressions[regex_to_use]
        find_xyx = regex.search(expression, self._text)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test_1').search_text('regex_one')
print(find_word)
# output
<re.Match object; span=(10, 16), match='test_1'>
You can also use a staticmethod in the class CustomRegularExpressions
# regex_to_use.py
import re as regex
class CustomRegularExpressions:
    @staticmethod
    def search_text(regex_to_use, text_to_search):
        regular_expressions = {"regex_one": 'test_1', "regex_two": 'test_2'}
        # look up the pattern by name; an unknown name raises KeyError
        expression = regular_expressions[regex_to_use]
        find_xyx = regex.search(expression, text_to_search)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
# find_word would be replaced with item['element']
# 'this is a test_1' would be replaced with response.text
find_word = CustomRegularExpressions.search_text('regex_one', 'this is a test_1')
print(find_word)
# output
<re.Match object; span=(10, 16), match='test_1'>
If you use docstrings in the function search_text() you can see the regular expressions in the Python dictionary.
Showing how all this works...
This is a python project that I wrote and published. Take a look at the folder utilities. In this folder I have functions that I can use throughout my code without having to copy and paste the same code over and over.
A lot of data, such as regexes or even XPath expressions, is commonly shared across multiple spiders.
It's a good idea to isolate it.
You can use a layout like this:
/project
    /site_data
        handle_responses.py
        ...
    /spiders
        your_spider.py
        ...
Isolate functionalities with a common purpose.
# handle_responses.py
# imports ...
from re import search
def get_specific_commom_data(text: str):
    # it is probably a good idea to handle predictable errors here (`try`/`except`)
    return search('your_regex', text)
Then use that function wherever it is needed.
# your_spider.py
# imports ...
import scrapy
from site_data.handle_responses import get_specific_commom_data
class YourSpider(scrapy.Spider):
    # ... previous code
    def your_method(self, response):
        # ... previous code
        item['element'] = get_specific_commom_data(response.text)
Try to keep it simple and do what you need to solve your problem.
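When several shared expressions are involved, the helper module can keep them precompiled in one table. The sketch below only illustrates that idea: the `PATTERNS` table, the `extract` function, and both example regexes are hypothetical placeholders, not taken from the question's project.

```python
import re

# Hypothetical shared patterns; the names and regexes are placeholders.
PATTERNS = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "sku": re.compile(r"SKU-\d+"),
}


def extract(name, text):
    """Return every match of the named shared pattern found in text."""
    return PATTERNS[name].findall(text)
```

A spider would then import `extract` and write something like `item['element'] = extract('price', response.text)`, so a changed regex only needs to be edited in one place.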
I could just as well copy the regex into multiple spiders instead of importing an object from other .py files. I understand those approaches have their use case, but here I don't want to add anything to any of the spiders and still want the element in the result.
There are some good answers here, but they don't really solve the problem, so after searching for days I have come to this solution. I hope it's useful for others looking for a similar answer.
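# middlewares.py
import re
from yourproject.items import YourItem
# find this method in your spider middleware class and add your element
def process_spider_output(self, response, result, spider):
    element = re.findall('my_regex', response.text)
    for i in result:
        # attach the shared element to every YourItem the spider yields
        if isinstance(i, YourItem):
            i['element'] = element
        yield i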
Now uncomment the middleware in settings.py:
# settings.py
SPIDER_MIDDLEWARES = {
    'yourproject.middlewares.YoursprojectMiddleware': 543,
}
For each spider you will get the element in the result data. I am still searching for a better solution because this slows the spider down, and I will update the answer if I find one.

Calling a function from another file within an if-Python 3.x

I've found resources on here but they're pertaining to locally embedded functions. I have one file called "test" and another called "main". I want test to contain all of my logic while main will contain a complete list of functions which each correlate with a health insurance policy. There are hundreds of policies so it would become quite tedious to write an if statement in "test" for each one each time. I'd like to write as few lines as possible to call a function based off of what a value states. Something like:
insurance = input()
The end result would not be an input, but for testing/learning purposes it is. The input would always correspond exactly to an insurance policy, if one exists. So in "test" I currently have:
from inspolicy.main import bcbs, uhc, medicare
print('What is the insurance?(bcbs, uhc, medicare)')
insurance = input()
if insurance.lower() == 'bcbs':
    bcbs()
elif insurance.lower() == 'uhc':
    uhc()
elif insurance.lower() == 'medicare':
    medicare()
else:
    print('This policy can not be found in the database. Please set aside.')
With "main" including:
def bcbs():
    print('This is BCBS')
def uhc():
    print('This is UHC')
def medicare():
    print('This is Medicare')
So is there a way to have the input (i.e. insurance) be what is referenced against to call the function from "main"?
Thank you in advance for your time!
The best approach to this is to use a dictionary to map between the name of your insurance policies and the function that deals with them. This could be a hand-built dict in one of your modules, or you could simply use the namespace of the main module (which is implemented using a dictionary):
# in test
import types
import inspolicy.main  # import the module itself, rather than just the functions
insurance = input('What is the insurance?(bcbs, uhc, medicare)')
func = getattr(inspolicy.main, insurance, None)
if isinstance(func, types.FunctionType):
    func()
else:
    print('This policy can not be found in the database. Please set aside.')
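The hand-built dictionary mentioned in this answer could look roughly like the sketch below. It is an illustration only: the `policy` decorator, the `POLICIES` table, and the `handle_policy` helper are names invented here, and the functions return their message rather than printing it. The decorator registers each function under its own name, so adding one of the hundreds of policies needs no extra `if` branch.

```python
# Registry of policy handlers, filled in by the decorator below.
POLICIES = {}

NOT_FOUND = "This policy can not be found in the database. Please set aside."


def policy(func):
    """Register func in POLICIES under its own name and return it unchanged."""
    POLICIES[func.__name__] = func
    return func


@policy
def bcbs():
    return "This is BCBS"


@policy
def uhc():
    return "This is UHC"


@policy
def medicare():
    return "This is Medicare"


def handle_policy(name):
    """Call the handler registered for name, ignoring case and outer spaces."""
    handler = POLICIES.get(name.strip().lower())
    return handler() if handler else NOT_FOUND
```

Compared with `getattr` on the module, an explicit registry exposes only the functions that were deliberately marked as policies, so helper functions living in the same module cannot be triggered by user input.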
Suppose this is your main.py:
def uhc():
    print("This is UHC")
You can then do something like this in test.py:
import main
def unknown_function():
    print('This policy can not be found in the database. Please set aside.')
insurance = input()
try:
    insurance_function = getattr(main, insurance.lower())
except AttributeError:
    insurance_function = unknown_function
insurance_function()
Then, if you type "uhc" as your input, you will get the uhc function from main.py and call it.

How to get cProfile results from a callback function in Python

I am trying to determine which parts of my Python code are running the slowest, so that I have a better understanding of what I need to fix. I recently discovered cProfile and gprof2dot, which have been extremely helpful. My problem is that I'm not seeing any information about functions that I'm using as callbacks, which I believe might be running very slowly. From what I understand from this answer, cProfile only works by default in the main thread, and I'm guessing that callbacks use a separate thread. That answer showed a way to get things working if you are using the threading library, but I couldn't get it to work for my case.
Here is roughly what my code looks like:
import rospy
import cv
import cProfile
from numpy import *
from collections import deque
class Bla():
    def __init__( self ):
        self.image_data = deque()
        self.do_some_stuff()
    def vis_callback( self, data ):
        cv_im = self.bridge.imgmsg_to_cv( data, "mono8" )
        im = asarray( cv_im )
        self.play_with_data( im )
        self.image_data.append( im )
    def run( self ):
        rospy.init_node( 'bla', anonymous=True )
        sub_vis = rospy.Subscriber('navbot/camera/image',Image,self.vis_callback)
        while not rospy.is_shutdown():
            if len( self.image_data ) > 0:
                im = self.image_data.popleft()
                self.do_some_things( im )
def main():
    bla = Bla()
    bla.run()
if __name__ == "__main__":
    cProfile.run( 'main()', 'pstats.out' ) # this could go here, or just run python with -m cProfile
    #main()
Any ideas on how to get cProfile info on the vis_callback function? Either by modifying the script itself, or even better using python -m cProfile or some other command line method?
I'm guessing it is either the reading of images or appending them to the queue which is slow. My gut feeling is that storing them in a queue is a horrible thing to do, but I need to display them with matplotlib, which refuses to work if it isn't in the main thread, so I want to see exactly how bad the damage is with this workaround.

How to run web.py server inside of another application

I have a little desktop game written in Python and would like to be able to access its internals while the game is running. I was thinking of doing this by having web.py running on a separate thread and serving pages. So when I access http://localhost:8080/map, it would display a map of the current level for debugging purposes.
I got web.py installed and running, but I don't really know where to go from here. I tried starting web.application in a separate thread, but for some reason I cannot share data between threads (I think).
Below is a simple example that I was using to test this idea. I thought that http://localhost:8080/ would return a different number every time, but it keeps showing the same one (5). If I print common_value inside the while loop, it is being incremented, but it starts from 5.
What am I missing here, and is the approach anywhere close to sensible? I would really like to avoid using a database if possible.
import web
import thread
urls = (
    '/(.*)', 'hello'
)
app = web.application(urls, globals())
common_value = 5
class hello:
    def GET(self):
        return str(common_value)
if __name__ == "__main__":
    thread.start_new_thread(app.run, ())
    while 1:
        common_value = common_value + 1
After searching around a bit, I found a solution that works:
If common_value is defined in a separate module and imported from there, the above code works. So in essence (excuse the naming):
thingy.py
common_value = 0
server.py
import web
import thread
import thingy
import sys; sys.path.insert(0, ".")
urls = (
    '/(.*)', 'hello'
)
app = web.application(urls, globals())
thingy.common_value = 5
class hello:
    def GET(self):
        return str(thingy.common_value)
if __name__ == "__main__":
    thread.start_new_thread(app.run, ())
    while 1:
        thingy.common_value = thingy.common_value + 1
I got errors about arguments, but after changing:
def GET(self):
to:
def GET(self, *args):
it works now.
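The code above uses the Python 2 `thread` module, which was renamed `_thread` in Python 3, where the higher-level `threading` module is the usual choice. As a rough Python 3 sketch of the same shared-value idea, independent of web.py: both threads hold a reference to one state object instead of rebinding a module-level name. The `SharedState` class and the `game_loop` function are hypothetical stand-ins for the game and are not part of the original code.

```python
import threading


class SharedState:
    """Holds a value that several threads can read and update safely."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    def get(self):
        with self._lock:
            return self._value


def game_loop(state, steps):
    """Stand-in for the game's main loop (hypothetical)."""
    for _ in range(steps):
        state.increment()


state = SharedState(5)
worker = threading.Thread(target=game_loop, args=(state, 1000))
worker.start()
worker.join()  # wait only so that the final value is deterministic here
print(state.get())  # prints 1005
```

A request handler running in the server thread would call `state.get()` in the same way the `GET` method above reads `thingy.common_value`.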
