Scrapy - run at time interval - python-3.x

I have a spider that crawls a site, and I want to run it every 10 minutes. I put it in a Python schedule and ran it. After the first run I got
ReactorNotRestartable
I tried this solution and got
AttributeError: Can't pickle local object 'run_spider.<locals>.f'
error.
Edit:
I tried how-to-schedule-scrapy-crawl-execution-programmatically. The program runs without errors and the crawl function runs every 30 seconds, but the spider doesn't run and I don't get any data.
def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(DivarSpider)
            #deferred.addBoth(lambda _: reactor.stop())
            #reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    runner = crawler.CrawlerRunner()
    deferred = runner.crawl(DivarSpider)
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

The multiprocessing solution is a gross hack to work around a lack of understanding of how Scrapy and reactor management work. You can get rid of it, and everything becomes much simpler:
from twisted.internet.task import LoopingCall
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from yourlib import YourSpider

configure_logging()
runner = CrawlerRunner()
task = LoopingCall(lambda: runner.crawl(YourSpider))  # crawl() takes the spider class
task.start(60 * 10)  # seconds, i.e. every 10 minutes
reactor.run()

The easiest way I know is to use a separate script to call the script containing your Twisted reactor, like this:
cmd = ['python3', 'auto_crawl.py']
subprocess.Popen(cmd).wait()
To run your CrawlerRunner every 10 minutes, you could put a loop around this script or schedule it with crontab.
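A hedged sketch of that wrapper: a generic loop that re-runs a command to completion and then sleeps for the interval. `auto_crawl.py` is the reactor script from above; the runner itself works with any command.

```python
import subprocess
import sys
import time

def run_every(cmd, interval_s, iterations):
    """Run cmd to completion, sleep interval_s seconds, repeat."""
    exit_codes = []
    for _ in range(iterations):
        # each run gets a fresh interpreter, so a fresh reactor every time
        exit_codes.append(subprocess.Popen(cmd).wait())
        time.sleep(interval_s)
    return exit_codes

if __name__ == "__main__":
    # For the Scrapy case this would be:
    # run_every(['python3', 'auto_crawl.py'], 60 * 10, iterations=...)
    exit_codes = run_every([sys.executable, '-c', 'pass'], 0, 2)
```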

Related

how to run 2 crawlers from the same python script

I have two Python crawlers that can run independently:
crawler1.py
crawler2.py
They are part of an analysis that I want to run, and I would like to import both into a common script:
from crawler1 import *
from crawler2 import *
A bit lower in my script I have something like this:
if <condition1>:
    # running crawler1
    runCrawler('crawlerName', '/dir1/dir2/')
if <condition2>:
    # running crawler2
    runCrawler('crawlerName', '/dir1/dir2/')
runCrawler is:
def runCrawler(crawlerName, crawlerFileName):
    print('Running crawler for ' + crawlerName)
    process = CP(  # CP is presumably an alias for scrapy.crawler.CrawlerProcess
        settings={
            'FEED_URI': crawlerFileName,
            'FEED_FORMAT': 'csv'
        }
    )
    process.crawl(globals()[crawlerName])
    process.start()
I get the following error:
Exception has occurred: ReactorAlreadyInstalledError
reactor already installed
The first crawler runs ok. The second one has problems.
Any ideas?
I run the above through the Visual Studio debugger.
The best way to do it is to run both crawlers under a single reactor. Your code should be:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# your code
configure_logging()
settings = {
    'FEED_FORMAT': 'csv'
}
process = CrawlerRunner(settings)
if condition1:
    process.crawl(spider1, crawlerFileName=crawlerFileName)
if condition2:
    process.crawl(spider2, crawlerFileName=crawlerFileName)
d = process.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # it will run both crawlers and the code after this line
Your spiders should look like this (note that custom_settings is class-level, so it cannot read a per-instance crawlerFileName; store the argument on the instance instead):
class spider1(scrapy.Spider):
    name = "spider1"

    def __init__(self, crawlerFileName=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.crawlerFileName = crawlerFileName

    def start_requests(self):
        yield scrapy.Request('https://scrapy.org/')

    def parse(self, response):
        pass

tqdm skips line when one bar finishes with multithreading

When using tqdm with multithreading, tqdm seems to jump down a line and overwrite what was there when one thread finishes. It seems to snap back once all threads have finished, but I have some long running threads and the progress bars look pretty bad as it is.
I created an example program to be able to replicate the issue. I basically just stripped out all of the business logic and replaced it with sleeps.
from concurrent.futures import ThreadPoolExecutor
from tqdm.auto import tqdm
from time import sleep
from random import randrange


def myf(instance: int, name: str):
    rand_size = randrange(75, 150)
    total_lines = 0
    # Simulate getting file size
    # Yes there's probably a better way to get the line count, but this
    # was quick and dirty and works well enough. The sleep is just there
    # to slow it down for the example
    for _ in tqdm(
        iterable=range(rand_size),
        position=instance,
        desc=f'GETTING LINE COUNT: {name}',
        leave=False
    ):
        sleep(0.1)
        total_lines += 1
    # Simulate the processing
    for record in tqdm(
        iterable=range(rand_size),
        total=total_lines,
        position=instance,
        desc=name
    ):
        sleep(0.2)


def main():
    myf_args = []
    for i in range(10):
        myf_args.append({
            'instance': i,
            'name': f'Thread-{i}'
        })
    with ThreadPoolExecutor(max_workers=len(myf_args)) as executor:
        executor.map(lambda f: myf(**f), myf_args)


if __name__ == "__main__":
    main()
I'm looking for a way to keep the progress bars in place and looking neat as it's running so I can get a good idea of the progress of each thread. When googling the issue, all I can find are people having an issue where it prints a new line every iteration, which isn't really applicable here.

How to transfer data between two separate scripts in Multiprocessing?

I am using multiprocessing to run two Python scripts in parallel. p1.py continually updates a certain variable, and the latest value of that variable should be displayed by p2.py every 2 seconds. The code for running the two scripts with multiprocessing is given below:
import os
from multiprocessing import Process


def script1():
    os.system("p1.py")


def script2():
    os.system("p2.py")


if __name__ == '__main__':
    p = Process(target=script1)
    q = Process(target=script2)
    p.start()
    q.start()
    p.join()
    q.join()
I am unable to transfer the value of the variable being updated by p1.py to p2.py. How should I approach the problem in a very simple way?
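One simple approach (a hedged sketch, not the original p1.py/p2.py): os.system gives each script its own interpreter with no shared state, so instead turn each script's work into a function and share the variable through multiprocessing.Value. The `updater` name and the loop counts below are illustrative assumptions.

```python
import time
from multiprocessing import Process, Value

def updater(shared):
    # plays the role of p1.py: continually refreshes the shared variable
    for i in range(1, 21):
        with shared.get_lock():  # lock guards against torn writes
            shared.value = i
        time.sleep(0.01)

def main():
    latest = Value('i', 0)  # shared integer visible to both processes
    p = Process(target=updater, args=(latest,))
    p.start()
    # plays the role of p2.py: periodically display the latest value
    # (the original polls every 2 seconds; shortened here for the sketch)
    for _ in range(3):
        time.sleep(0.05)
        print("latest value:", latest.value)
    p.join()
    return latest.value

if __name__ == "__main__":
    main()
```

A Queue or a Manager would work too; Value is the lightest option when only one scalar needs to be shared.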

ipyparallel parallel function calls example in Jupyter Lab

I'm finding it difficult to figure out how to use ipyparallel from jupyter lab to execute two functions in parallel. Could someone please give me an example of how this should be done? For example, running these two functions at the same time:
import time
def foo():
print('foo')
time.sleep(5)
def bar():
print('bar')
time.sleep(10)
So first you will need to ensure that ipyparallel is installed and an IPython cluster is running - instructions here.
Once you have done that, here is some adapted code that will run your two functions in parallel:
from ipyparallel import Client

rc = Client()


def foo():
    import time
    time.sleep(5)
    return 'foo'


def bar():
    import time
    time.sleep(10)
    return 'bar'


res1 = rc[0].apply(foo)
res2 = rc[1].apply(bar)
results = [res1, res2]
while not all(map(lambda ar: ar.ready(), results)):
    pass
print(res1.get(), res2.get())
N.B. I removed the print statements because you can't call back from the child process into the parent Jupyter session to print, but we can of course return a result. I block here until both results are complete, but you could instead print the results as they become available.
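As an aside (an assumption about the goal, not part of the ipyparallel answer above): if you only need the two functions to run concurrently within one process, the stdlib ThreadPoolExecutor gives the same submit-then-collect pattern without a running cluster. Sleep times are shortened here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def foo():
    time.sleep(0.2)
    return 'foo'

def bar():
    time.sleep(0.4)
    return 'bar'

# submit both functions, then block until each result is ready
with ThreadPoolExecutor(max_workers=2) as ex:
    futures = [ex.submit(foo), ex.submit(bar)]
    results = [f.result() for f in futures]

print(results)  # ['foo', 'bar']
```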

Python multiprocessing script partial output

I am following the principles laid down in this post to safely output the results, which will eventually be written to a file. Unfortunately, the code only prints 1 and 2, not 3 through 6.
import os
import argparse
import pandas as pd
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep


def feed(queue, parlist):
    for par in parlist:
        queue.put(par)
    print("Queue size", queue.qsize())


def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block=False)
            res = doCalculation(par)
            queueOut.put((res))
            queueIn.task_done()
        except:
            break


def doCalculation(par):
    return par


def write(queue):
    while True:
        try:
            par = queue.get(block=False)
            print("response:", par)
        except:
            break


if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
On running the code it prints,
$ python3 tst.py
Queue size 6
response: 1
response: 2
Also, is it possible to ensure that the write function always outputs 1,2,3,4,5,6 i.e. in the same order in which the data is fed into the feed queue?
The problem is the task_done() call: a plain multiprocessing.Queue has no task_done() method (only JoinableQueue does), so it raises AttributeError, which the bare except swallows. Each worker therefore processes exactly one item and then breaks, which is why only 1 and 2 are printed. If you remove that call it works, but the loop then relies on queueIn.get(block=False) throwing an exception once the queue is empty. That might be enough for your use case, but a better way is to use sentinels (as suggested in the multiprocessing docs, see the last example there). Here's a little rewrite so your program uses sentinels:
import os
import argparse
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep


def feed(queue, parlist, nthreads):
    for par in parlist:
        queue.put(par)
    for i in range(nthreads):
        queue.put(None)
    print("Queue size", queue.qsize())


def calc(queueIn, queueOut):
    while True:
        par = queueIn.get()
        if par is None:
            break
        res = doCalculation(par)
        queueOut.put((res))


def doCalculation(par):
    return par


def write(queue):
    while not queue.empty():
        par = queue.get()
        print("response:", par)


if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod, nthreads))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
A few things to note:
The sentinel is a None put into the queue. Note that you need one sentinel for every worker process.
For the write function you don't need sentinel handling, as there's only one consumer and no concurrency to worry about. (If you did the empty()-then-get() dance in your calc function, you would run into a problem: with only one item left in the queue, both workers could see empty() return False at the same time, both call get(), and one of them would block forever.)
You don't need to put feed and write into separate processes; just call them from your main function, since you don't want to run them in parallel anyway.
how can I have the same order in output as in input? [...] I guess multiprocessing.map can do this
Yes, map keeps the order. Here is your program rewritten into something simpler (you don't need the workerQueue and writerQueue), with random sleeps added to prove that the output is still in order:
from multiprocessing import Pool
import time
import random


def calc(val):
    time.sleep(random.random())
    return val


if __name__ == "__main__":
    considerperiod = [1, 2, 3, 4, 5, 6]
    with Pool(processes=2) as pool:
        print(pool.map(calc, considerperiod))
