I have two Python crawlers that can run independently:
crawler1.py
crawler2.py
They are part of an analysis that I want to run, and I would like to import both into a common script.
from crawler1 import *
from crawler2 import *
A bit lower in my script I have something like this:
if <condition1>:
    # running crawler1
    runCrawler('crawlerName', '/dir1/dir2/')
if <condition2>:
    # running crawler2
    runCrawler('crawlerName', '/dir1/dir2/')
runCrawler is:
def runCrawler(crawlerName, crawlerFileName):
    print('Running crawler for ' + crawlerName)
    process = CP(  # CP: presumably scrapy.crawler.CrawlerProcess
        settings={
            'FEED_URI': crawlerFileName,
            'FEED_FORMAT': 'csv'
        }
    )
    process.crawl(globals()[crawlerName])
    process.start()
I get the following error:
Exception has occurred: ReactorAlreadyInstalledError
reactor already installed
The first crawler runs OK; the second one has problems.
Any ideas?
I run the above through the Visual Studio debugger.
The best way to do it is this way; your code should be:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
# your code
settings = {
    'FEED_FORMAT': 'csv'
}

process = CrawlerRunner(settings)

if condition1:
    process.crawl(spider1, crawlerFileName=crawlerFileName)
if condition2:
    process.crawl(spider2, crawlerFileName=crawlerFileName)

d = process.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # runs both crawlers and blocks until they finish
Your spiders should look like this:
class spider1(scrapy.Spider):
    name = "spider1"
    # spider1 cannot be referenced inside its own class body; instead, let the
    # feed exporter interpolate the crawlerFileName argument passed to crawl()
    custom_settings = {'FEED_URI': '%(crawlerFileName)s'}

    def start_requests(self):
        yield scrapy.Request('https://scrapy.org/')

    def parse(self, response):
        pass
Related
So I'm making a script to test my spiders, but I don't know how to capture the returned data in the script that is running the spider.
At the end of the spider I have this: return [self.p_name, self.price, self.currency].
In the spider tester I have this script:
#!/usr/bin/python3
#-*- coding: utf-8 -*-
# Import external libraries
import scrapy
from scrapy.crawler import CrawlerProcess
from Ruby.spiders.furla import Furla
# Import internal libraries
# Variables
t = CrawlerProcess()

def test_furla():
    x = t.crawl(Furla, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
    return x

test_furla()
t.start()
It runs properly; the only problem is that I don't know how to catch that return value on the tester side. The output from the spider is ['FURLA SLEEK', '250.00', '€'].
If you need to access items yielded from the spider, I would probably use signals for the job, specifically the item_scraped signal. Adapting your code, it would look something like this:
from scrapy import signals
# other imports and stuff
t = CrawlerProcess()

def item_scraped(item, response, spider):
    # do something with the item, e.g.:
    print(item)

def test_furla():
    # we need a Crawler instance to access signals
    crawler = t.create_crawler(Furla)
    crawler.signals.connect(item_scraped, signal=signals.item_scraped)
    x = t.crawl(crawler, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
    return x

test_furla()
t.start()
Additional info can be found in the CrawlerProcess documentation. If, on the other hand, you need to work with the whole output of the spider (all the items), you would need to accumulate the items using the above mechanism and work with them once the crawl finishes.
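A minimal sketch of that accumulation idea, reusing the Furla spider and URL from the question (the items list is my addition):
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from Ruby.spiders.furla import Furla

items = []  # collects every item the spider yields

def item_scraped(item, response, spider):
    items.append(item)

t = CrawlerProcess()
crawler = t.create_crawler(Furla)
crawler.signals.connect(item_scraped, signal=signals.item_scraped)
t.crawl(crawler, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
t.start()  # blocks until the crawl is finished

# at this point `items` holds the full output of the spider
print(items)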
I am trying to run a script following these requirements:
After running the demo10.py script, the AmazonfeedSpider should crawl the product information using the generated URLs saved in Purl and save the output into the dataset2.json file.
After successfully crawling and saving data into the dataset2.json file, the ProductfeedSpider should run and grab the 5 URLs returned by the Final_Product() method of the CompareString class.
Finally, after grabbing the final product_url list from the Comparestring4 class, the ProductfeedSpider should scrape data from the returned URL list and save the result into the Fproduct.json file.
Here is the demo10.py file:
import scrapy
from scrapy.crawler import CrawlerProcess
from AmazonScrap.spiders.Amazonfeed2 import AmazonfeedSpider
from scrapy.utils.project import get_project_settings
from AmazonScrap.spiders.Productfeed import ProductfeedSpider
import time
# from multiprocessing import Process
# def CrawlAmazon():
def main():
    process1 = CrawlerProcess(settings=get_project_settings())
    process1.crawl(AmazonfeedSpider)
    process1.start()
    process1.join()

    # time.sleep(20)
    process2 = CrawlerProcess(settings=get_project_settings())
    process2.crawl(ProductfeedSpider)
    process2.start()
    process2.join()

if __name__ == "__main__":
    main()
After running the file, it raises an exception saying that the dataset.json file doesn't exist. Do I need to use multiprocessing in order to create a delay between the spiders? If so, how can I implement it?
I am looking forward to hearing from experts.
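For reference, a rough sketch of running the two spiders sequentially in a single process, based on the CrawlerRunner pattern from the Scrapy documentation for running spiders sequentially (imports assumed to match demo10.py), could look like this:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from AmazonScrap.spiders.Amazonfeed2 import AmazonfeedSpider
from AmazonScrap.spiders.Productfeed import ProductfeedSpider

runner = CrawlerRunner(settings=get_project_settings())

@defer.inlineCallbacks
def crawl():
    # the second crawl starts only after the first one has finished,
    # so dataset2.json exists before ProductfeedSpider runs
    yield runner.crawl(AmazonfeedSpider)
    yield runner.crawl(ProductfeedSpider)
    reactor.stop()

crawl()
reactor.run()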
I am new to Python and I think I broke my Python :(
I was trying Sentdex's PyQt4 YouTube tutorial right here.
I made the changes from PyQt4 to PyQt5. This is the code I was playing around with. I think I messed up by printing the whole page to the console.
Now the output is:
Load finished
Look at you shinin!
Press any key to continue . . .
This is shown for any code executed; Python prints this output even if I just try print("hello") in VS Code. I even tried to restart. Now, like a virus, it is not clearing.
import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl


class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('https://pythonprogramming.net/parsememcparseface/')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('p', class_='jstest')
    print(js_test.text)
    print(soup)
    #js_test = soup.find('div', class_='aqi-meter-panel')
    #display.popen.terminate()


if __name__ == '__main__':
    main()
OK, so I finally got the problem fixed. I went manually into the temp files in C:\Users\xxx\AppData\Local and started on a deletion rampage, removing many files and folders remotely related to Python, VS Code and conda. This gave an error warning the first time I executed my program again; on subsequent runs there was no issue, and Python was back to its normal self. I'm surprised that I was not able to find any solution on the net for this.
I have a spider for crawling a site and I want to run it every 10 minutes, so I put it in a Python schedule and ran it. After the first run I got
ReactorNotRestartable
I tried this solution and got an
AttributeError: Can't pickle local object 'run_spider.<locals>.f'
error.
Edit:
I tried how-to-schedule-scrapy-crawl-execution-programmatically; the program runs without errors and the crawl function runs every 30 seconds, but the spider doesn't run and I don't get any data.
from multiprocessing import Process, Queue
from scrapy import crawler
# plus the DivarSpider import from the project

def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(DivarSpider)
            #deferred.addBoth(lambda _: reactor.stop())
            #reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    runner = crawler.CrawlerRunner()
    deferred = runner.crawl(DivarSpider)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
The multiprocessing solution is a gross hack to work around a lack of understanding of how Scrapy and reactor management work. You can get rid of it and everything becomes much simpler.
from twisted.internet.task import LoopingCall
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from yourlib import YourSpider

configure_logging()
runner = CrawlerRunner()
task = LoopingCall(lambda: runner.crawl(YourSpider))
task.start(60 * 10)
reactor.run()
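A small addition of my own, continuing the snippet above: runner.crawl returns a Deferred, so if you want a failed run in one iteration to be logged rather than passing silently, you can attach an errback:
def crawl_once():
    d = runner.crawl(YourSpider)
    # log a failed run instead of letting it pass silently; the next
    # LoopingCall iteration still fires on schedule
    d.addErrback(lambda failure: failure.printTraceback())
    return d

task = LoopingCall(crawl_once)
task.start(60 * 10)
reactor.run()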
The easiest way I know of is to use a separate script to call the script containing your Twisted reactor, like this:
import subprocess

cmd = ['python3', 'auto_crawl.py']
subprocess.Popen(cmd).wait()
To run your CrawlerRunner every 10 minutes, you could use a loop or crontab on this script.
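A minimal sketch of the loop variant, assuming auto_crawl.py is the script that starts the reactor, as above:
import subprocess
import time

# re-launch the crawler script in a fresh process every 10 minutes,
# so each run gets its own reactor
while True:
    subprocess.Popen(['python3', 'auto_crawl.py']).wait()
    time.sleep(60 * 10)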
I have generated a new project and have a single Python file containing my spider.
The layout is:
import scrapy
from scrapy.http import *
import json
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
import unicodedata
from scrapy import signals
from pydispatch import dispatcher
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field


class TrainerItem(Item):
    name = Field()
    brand = Field()
    link = Field()
    type = Field()
    price = Field()
    previous_price = Field()
    stock_img = Field()
    alt_img = Field()
    size = Field()


class SchuhSpider(scrapy.Spider):
    name = "SchuhSpider"
    payload = {"hash": "g=3|Mens,&c2=340|Mens Trainers&page=1&imp=1&o=new&",
               "url": "/imperfects/", "type": "pageLoad", "NonSecureUrl": "http://www.schuh.co.uk"}
    url = "http://schuhservice.schuh.co.uk/SearchService/GetResults"
    headers = {'Content-Type': 'application/json; charset=UTF-8'}
    finalLinks = []

    def start_requests(self):
        dispatcher.connect(self.quit, signals.spider_closed)
        yield scrapy.Request(url=self.url, callback=self.parse, method="POST", body=json.dumps(self.payload), headers=self.headers)

    def parse(self, response):
        ...  # do stuff

    def quit(self, spider):
        print(spider.name + " is about to die, here are your trainers..")


process = CrawlerProcess()
process.crawl(SchuhSpider)
process.start()
print("We Are Done.")
I run this spider using:
scrapy crawl SchuhSpider
The problem is I'm getting:
twisted.internet.error.ReactorNotRestartable
This is because the spider is actually running twice. It runs once at the start (I get all my POST requests), then it says "SchuhSpider is about to die, here are your trainers..".
Then it opens the spider a second time, presumably when it reaches the process code.
My question is: how do I stop the spider from running automatically whenever the script is loaded?
Even when I run:
scrapy list
It runs the entire spider (all my POST requests come through). I fear I'm missing something obvious but I can't see what.
You are mixing two ways of running a spider. One way is what you do now, i.e. using the
scrapy crawl SchuhSpider
command. In this case, don't (or rather, you don't have to) include the code
process = CrawlerProcess()
process.crawl(SchuhSpider)
process.start()
print("We Are Done.")
as it's intended only for when you want to run the spider from a script (see the documentation).
If you want to retain the possibility of running it either way, just wrap the above code like this:
if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(SchuhSpider)
    process.start()
    print("We Are Done.")
so that it doesn't run when the module is merely imported (which is what happens when you run it using scrapy crawl or scrapy list).