I've been programming in Python for a while, but this is my first attempt at multiprocessing.
I made a program that scrapes a local weather station for the ambient temperature with beautifulsoup4 every minute. The program also reads temperatures from several sensors and uploads everything to a MySQL database. This all works fine, but occasionally (about once a day) retrieving the webpage from the local weather station fails, which sends the BeautifulSoup part into an infinite loop and effectively stops all functionality of the program. To work around this I decided to try my hand at multiprocessing.
I've coded a check that kills the extra process if it is still running after 10 seconds. Here is where things go wrong: normally the BeautifulSoup process finishes and closes after 2-4 seconds. However, in the case where BeautifulSoup gets stuck in its loop, not only is that process terminated, but the entire program stops doing anything at all.
I've copied the relevant snippets of code. Please note that some variables are declared outside of the snippets; the code works with the exception of the problem described above. By the way, I am very much aware that there is a plethora of ways to make my code more efficient. Refining the code is something I'll do once it's running stably :) Thanks in advance for your help!
Imports:
...
from multiprocessing import Process, Queue
import multiprocessing
from bs4 import BeautifulSoup #sudo apt-get install python3-bs4
BeautifulSoup section:
def get_ZWS_temp_out(temp):
    try:
        if 1==1:
            response = requests.get(url)
            responsestr = str(response)
            if "200" in responsestr:
                soup = BeautifulSoup(response.content, 'html.parser')
                tb = soup.findAll("div", {"class": "elementor-element elementor-element-8245410 elementor-widget__width-inherit elementor-widget elementor-widget-wp-widget-live_weather_station_widget_outdoor"})
                tb2 = tb[0].findAll("div", {"class": "lws-widget-big-value"})
                string = str(tb2[0])[-10:][:4]
                stringt = string[:1]
                if stringt.isdigit() == True:
                    #print("getal ok")
                    string = string
                elif stringt == '-':
                    #print("minteken")
                    string = string
                elif stringt == '>':
                    #print("temp < 10")
                    string = string[-3:]
                temp = float(string)
    except Exception as error:
        print(error)
    Q.put(temp)
    return(temp)
Main program:
Q = Queue()
while 1 == 1:
    strings = time.strftime("%Y,%m,%d,%H,%M,%S")
    t = strings.split(',')
    time_numbers = [ int(x) for x in t ]
    if last_min != time_numbers[4]:
        targettemp = get_temp_target(targettemp)
        p = Process(target=get_ZWS_temp_out, name="get_ZWS_temp_out", args=(ZWS_temp_out,))
        p.start()
        i = 0
        join = True
        while i < 10:
            i = i + 1
            time.sleep(1)
            if p.is_alive() and i == 10: #checks to quit early otherwise another iteration
                print(datetime.datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d %H:%M:%S"),": ZWS getter is running for too long... let's kill it...")
                # Terminate ZWS query
                p.terminate()
                i = 10
                join = False
        if join == True:
            p.join()
Thanks in advance for your time :)
I have to stop the program manually, which gives the following output:
pi@Jacuzzi-pi:~ $ python3 /home/pi/Jacuzzi/thermometer.py
temperature sensors observer and saving program, updates every 3,5 seconds
2019-10-28 03:50:11 : ZWS getter is running for too long... let's kill it...
^CTraceback (most recent call last):
  File "/home/pi/Jacuzzi/thermometer.py", line 283, in <module>
    ZWS_temp_out = Q.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 94, in get
    res = self._recv_bytes()
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
I believe your program is waiting forever to pull an item from the queue you've created. I can't see the line in the code you've posted, but it appears in the error message:
ZWS_temp_out = Q.get()
Since the get_ZWS_temp_out process is the one that adds items to the queue, you need to make sure that process is running (and has finished) before you call Q.get(). When you terminate a timed-out worker, nothing was ever put on the queue, so Q.get() blocks forever. I suspect this line of code gets executed between terminating the timed-out process and starting a new one, where instead it should only be called after the new process has been created and completed.
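As an aside (not part of the fix below): if you ever need to call Q.get() without being certain that the worker put anything on the queue, a multiprocessing queue accepts a timeout, so the main loop can give up instead of blocking forever. A minimal sketch, with an arbitrary 15-second limit:

import queue  # provides the Empty exception that Q.get raises on timeout

try:
    ZWS_temp_out = Q.get(timeout=15)   # wait at most 15 seconds for the worker's value
except queue.Empty:
    print("no temperature received from the worker, keeping the previous value")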
Based on what Rob found out, this is the updated (working) code for the main program; the other parts are unchanged:
Q = Queue()
while 1 == 1:
    strings = time.strftime("%Y,%m,%d,%H,%M,%S")
    t = strings.split(',')
    time_numbers = [ int(x) for x in t ]
    if last_min != time_numbers[4]:
        targettemp = get_temp_target(targettemp)
        p = Process(target=get_ZWS_temp_out, name="get_ZWS_temp_out", args=(ZWS_temp_out,))
        p.start()
        i = 0
        completion = True
        while i < 10:
            i = i + 1
            time.sleep(1)
            if p.is_alive() and i == 10: #checks to quit early otherwise another iteration
                print(datetime.datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d %H:%M:%S"),": ZWS getter is running for too long... let's kill it...")
                # Terminate ZWS query
                p.terminate()
                i = 10
                completion = False
        if completion == True:
            p.join()
            ZWS_temp_out = Q.get()
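A shorter variant of the same watchdog (just a sketch, not the code I am actually running) would let join do the waiting instead of the manual one-second loop:

p = Process(target=get_ZWS_temp_out, name="get_ZWS_temp_out", args=(ZWS_temp_out,))
p.start()
p.join(10)                      # wait at most 10 seconds for the worker
if p.is_alive():                # still running, so assume it is stuck and kill it
    p.terminate()
    p.join()
else:
    ZWS_temp_out = Q.get()      # the worker finished, so a value is waiting on the queue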
Related
So I have a little app that searches my PC for all XML files and copies the ones whose filename is 44 digits to the "output" folder.
The problem is that the final user needs an indication of the progress and remaining time of the task.
This is the module to copy files:
xml_search.py
import os
import re
from threading import Thread
from datetime import datetime
import time
import shutil
import winsound

os.system('cls')

def get_drives():
    response = os.popen("wmic logicaldisk get caption")
    list1 = []
    t1 = datetime.now()
    for line in response.readlines():
        line = line.strip("\n")
        line = line.strip("\r")
        line = line.strip(" ")
        if (line == "Caption" or line == ""):
            continue
        list1.append(line + '\\')
    return list1

def search1(drive):
    for root, dir, files in os.walk(drive):
        for file in files:
            if re.match(r"\d{44}.xml", file):
                filename = os.path.join(root, file)
                try:
                    shutil.copy(filename, os.path.join('output', file))
                except Exception as e:
                    pass

def exec_(callback):
    t1 = datetime.now()
    list2 = []  # empty list is created
    list1 = get_drives()
    for each in list1:
        process1 = Thread(target=search1, args=(each,))
        process1.start()
        list2.append(process1)
    for t in list2:
        t.join()  # wait for the threads to finish
    t2 = datetime.now()
    total = str(t2-t1)
    print(total, file=open('times.txt', 'a'), end="\n")
    for x in range(3):
        winsound.Beep(2000,100)
        time.sleep(.1)
    callback()

if __name__ == "__main__":
    exec_()
The code below uses the progressbar library and shows the progress and estimated remaining time of a task:
import progressbar
from time import sleep

bar = progressbar.ProgressBar(maxval=1120, \
    widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.ETA()])
bar.start()
for i in range(1120):
    bar.update(i+1)
    sleep(0.1)
bar.finish()
You would need to adapt the above code and add it to yours. In your case you would count the number of files first, pass that count as the maxval argument of the ProgressBar constructor, and remove the sleep call. The suggested solution with the progress bar works with one thread; if you insist on working with multiple threads, you will have to figure out where to initialise the progress bar and where to put the updates.
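For the single-threaded case, a rough sketch of how that could be wired into your search (the helper names and the extra counting pass are my own assumptions, and the copy step is elided):

import os
import re
import progressbar

def count_xml_files(drives):
    # first pass: count matching files so maxval is known up front (hypothetical helper)
    total = 0
    for drive in drives:
        for root, dirs, files in os.walk(drive):
            total += sum(1 for f in files if re.match(r"\d{44}\.xml", f))
    return total

def search_with_progress(drives):
    # hypothetical single-threaded variant of search1() with a progress bar
    total = count_xml_files(drives)
    bar = progressbar.ProgressBar(maxval=max(total, 1),
                                  widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.ETA()])
    bar.start()
    done = 0
    for drive in drives:
        for root, dirs, files in os.walk(drive):
            for file in files:
                if re.match(r"\d{44}\.xml", file):
                    # shutil.copy(...) would go here, as in search1()
                    done += 1
                    bar.update(done)
    bar.finish()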
Try to implement a timer decorator like the following:
import time

def mytimer(func):
    def wrapper():
        t1 = time.time()
        result = func()
        t2 = time.time()
        print(f"The function {func.__name__} was run {t2 - t1} seconds")
        return result
    return wrapper

@mytimer
def TimeConsumingFunction():
    time.sleep(3)
    print("Hello timers")

TimeConsumingFunction()
Output:
/usr/bin/python3.7 /home/user/Documents/python-workspace/timers/example.py
Hello timers
The function TimeConsumingFunction was run 3.002610206604004 seconds
Process finished with exit code 0
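The wrapper above only works for functions that take no arguments. A slightly more general sketch (my own variation, not part of the original answer) forwards any arguments and keeps the wrapped function's metadata:

import functools
import time

def mytimer(func):
    @functools.wraps(func)              # preserve the original __name__ used in the print
    def wrapper(*args, **kwargs):
        t1 = time.time()
        result = func(*args, **kwargs)
        t2 = time.time()
        print(f"The function {func.__name__} was run {t2 - t1} seconds")
        return result
    return wrapper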
I have a Python 3 script that needs to make thousands of requests to multiple different websites and check whether their source code passes some predefined rules.
I am using selenium to make the requests because I need the source code after the JS finishes executing, but due to the high number of URLs I need to check, I am trying to run multiple threads simultaneously. Each thread creates and maintains an instance of webdriver to make the requests. The problem is that after a while all threads go silent and simply stop executing, leaving just a single thread doing all the work. Here is the relevant part of my code:
def get_browser(use_firefox = True):
    if use_firefox:
        options = FirefoxOptions()
        options.headless = True
        browser = webdriver.Firefox(options = options)
        browser.implicitly_wait(4)
        return browser
    else:
        chrome_options = ChromeOptions()
        chrome_options.add_argument("--headless")
        browser = webdriver.Chrome(chrome_options=chrome_options)
        browser.implicitly_wait(4)
        return browser
def start_validations(urls, rules, results, thread_id):
    try:
        log("thread %s started" % thread_id, thread_id)
        browser = get_browser(thread_id % 2 == 1)
        while not urls.empty():
            url = "http://%s" % urls.get()
            try:
                log("starting %s" % url, thread_id)
                browser.get(url)
                time.sleep(0.5)
                WebDriverWait(browser, 6).until(selenium_wait_reload(4))
                html = browser.page_source
                result = check_url(html, rules)
                original_domain = url.split("://")[1].split("/")[0].replace("www.","")
                tested_domain = browser.current_url.split("://")[1].split("/")[0].replace("www.","")
                redirected_url = "" if tested_domain == original_domain else browser.current_url
                results.append({"Category":result, "URL":url, "Redirected":redirected_url})
                log("finished %s" % url, thread_id)
            except Exception as e:
                log("couldn't test url %s" % url, thread_id )
                log(str(e), thread_id)
                results.append({"Category":"Connection Error", "URL":url, "Redirected":""})
                browser.quit()
                time.sleep(2)
                browser = get_browser(thread_id % 2 == 1)
    except Exception as e:
        log(str(e), thread_id)
    finally:
        log("closing thread", thread_id)
        browser.quit()
def calculate_progress(urls):
    progress_folder ="%sprogress/" % WEBROOT
    if not os.path.exists(progress_folder):
        os.makedirs(progress_folder)
    initial_size = urls.qsize()
    while not urls.empty():
        current_size = urls.qsize()
        on_queue = initial_size - current_size
        progress = '{0:.0f}'.format((on_queue / initial_size * 100))
        for progress_file in os.listdir(progress_folder):
            file_path = os.path.join(progress_folder, progress_file)
            if os.path.isfile(file_path) and not file_path.endswith(".csv"):
                os.unlink(file_path)
        os.mknod("%s%s" % (progress_folder, progress))
        time.sleep(1)
if __name__ == '__main__':
    while True:
        try:
            log("scraper started")
            if os.path.isfile(OUTPUT_FILE):
                os.unlink(OUTPUT_FILE)
            manager = Manager()
            rules = fetch_rules()
            urls = manager.Queue()
            fetch_urls()
            results = manager.list()
            jobs = []
            p = Process(target=calculate_progress, args=(urls,))
            jobs.append(p)
            p.start()
            for i in range(THREAD_POOL_SIZE):
                log("spawning thread with id %s" % i)
                p = Process(target=start_validations, args=(urls, rules, results, i))
                jobs.append(p)
                p.start()
                time.sleep(2)
            for j in jobs:
                j.join()
            save_results(results, OUTPUT_FILE)
            log("scraper finished")
        except Exception as e:
            log(str(e))
As you can see, at first I thought I could only have one instance of each browser, so I tried to run at least Firefox and Chrome in parallel, but this still leaves only one thread doing all the work.
Sometimes the driver crashed and the thread stopped working even though it is inside a try/except block, so I started creating a new instance of the browser every time this happens, but it still didn't work. I also tried waiting a few seconds between creating each driver instance, still with no results.
Here is a pastebin of one of the log files:
https://pastebin.com/TsjZdRYf
A strange thing I noticed is that almost every time, the only thread that keeps running is the last one spawned (with id 3).
Thanks for your time and your help!
EDIT:
[1] Here is the full code: https://pastebin.com/fvVPwPVb
[2] custom selenium wait condition: https://pastebin.com/Zi7nbNFk
Am I allowed to curse on SO? I solved the problem, and I don't think this answer should exist on SO because nobody else will benefit from it. The problem was a custom wait condition that I had created. This class is in the pastebin added in edit 2, but I'll also add it here for convenience:
import time

class selenium_wait_reload:
    def __init__(self, desired_repeating_sources):
        self.desired_repeating_sources = desired_repeating_sources
        self.repeated_pages = 0
        self.previous_source = None

    def __call__(self, driver):
        while True:
            current_source = driver.page_source
            if current_source == self.previous_source:
                self.repeated_pages = self.repeated_pages + 1
                if self.repeated_pages >= self.desired_repeating_sources:
                    return True
            else:
                self.previous_source = current_source
                self.repeated_pages = 0
            time.sleep(0.3)
The goal of this class was to make selenium wait because the JS could be loading additional DOM.
So, this class makes selenium wait a short time and check the code, wait a little again and check the code again. The class repeats this until the source code is the same 3 times in a row.
The problem is that some pages have a JS carousel, so the source code is never the same. I thought that in cases like this the second parameter of WebDriverWait would make it crash with a TimeoutException. I was wrong.
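For anyone who hits the same thing: the timeout never fired because __call__ loops and sleeps internally, so WebDriverWait never gets control back to enforce its limit. A sketch of a version that cooperates with WebDriverWait (my rewrite here, not necessarily the exact code I ended up with) just returns True or False and lets WebDriverWait do the polling and raise TimeoutException when time runs out:

class selenium_wait_reload:
    def __init__(self, desired_repeating_sources):
        self.desired_repeating_sources = desired_repeating_sources
        self.repeated_pages = 0
        self.previous_source = None

    def __call__(self, driver):
        # called repeatedly by WebDriverWait; never loop or sleep in here
        current_source = driver.page_source
        if current_source == self.previous_source:
            self.repeated_pages += 1
        else:
            self.previous_source = current_source
            self.repeated_pages = 0
        return self.repeated_pages >= self.desired_repeating_sources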
I use feedparser to get RSS feeds from some sites; my core code is like this:
def parseworker(procnum, result_queue, return_dict, source_link):
    try:
        data = feedparser.parse(source_link)
        return_dict[procnum] = data
    except Exception as e:
        print(str(e))
    result_queue.put(source_link + 'grabbed')
def infoworker(procnum, timeout, result_queue, source_name, source_link):
    text = 'recheck ' + source_name + ': ' + '...'
    progress = ''
    for x in range(timeout):
        progress += '.'
        sys.stdout.write('\r' + text + progress)
        sys.stdout.flush()
        time.sleep(1)
    result_queue.put('time out')
def parsecaller(link, timeout, timestocheck):
    return_dict = multiprocessing.Manager().dict()
    result_queue = multiprocessing.Queue()
    counter = 1
    jobs = []
    result = []
    while not (counter > timestocheck):
        p1 = multiprocessing.Process(target=infoworker, args=(11, timeout, result_queue, source_name, link))
        p2 = multiprocessing.Process(target=parseworker, args=(22, result_queue, return_dict, link))
        jobs.append(p1)
        jobs.append(p2)
        p1.start()
        p2.start()
        result_queue.get()
        p1.terminate()
        p2.terminate()
        p1.join()
        p2.join()
        result = return_dict.values()
        if not result or result[0].bozo:
            print(' bad - no data', flush=True)
            result = -1
        else:
            print(' ok ', flush=True)
            result = result[0]
            break
        counter += 1
    if result == -1:
        raise bot_exceptions.ParserExceptionData()
    elif result == -2:
        raise bot_exceptions.ParserExceptionConnection()
    else:
        return result
if __name__ == '__main__':
    multiprocessing.freeze_support()
    multiprocessing.set_start_method('spawn')
    try:
        data = parsecaller(source_link, timeout=wait_time, timestocheck=check_times)
    except Exception as e:
        print(str(e))
        continue
It works well, but after some random amount of time it goes into a suspended state and does nothing, like an infinite boot loop. It may suspend after 4 hours or after 3 days; it's random.
I tried to solve the problem with multiprocessing: the main process uses infoworker as a timer. When infoworker finishes, it puts a value on the result queue, which unblocks result_queue.get() in parsecaller, which then terminates both processes and continues. But it does not work. Today, after 11 hours, my code got stuck in a suspended state inside multiprocessing's managers.py:
def serve_forever(self):
    '''
    Run the server forever
    '''
    self.stop_event = threading.Event()
    process.current_process()._manager_server = self
    try:
        accepter = threading.Thread(target=self.accepter)
        accepter.daemon = True
        accepter.start()
        try:
            while not self.stop_event.is_set():
                self.stop_event.wait(1)
        except (KeyboardInterrupt, SystemExit):
            pass
    finally:
        if sys.stdout != sys.__stdout__: # what about stderr?
            util.debug('resetting stdout, stderr')
            sys.stdout = sys.__stdout__
            sys.stderr = sys.__stderr__
        sys.exit(0)
The whole time it was stuck in:
while not self.stop_event.is_set():
    self.stop_event.wait(1)
I think that either the GIL somewhere does not allow any other threads in the processes to run, or feedparser goes into a loop. And of course it gets suspended with random RSS sources.
My 'environment':
Mac OS 10.12.6 (the same happened on Windows 7 and Windows 10)
Python 3.7.0 (the same happened on 3.6.2 and 3.6.5)
PyCharm 2017.2.2
My questions:
How can I understand why it gets suspended (what to do, any recipe)?
How can I bypass that state (what to do, any recipe)?
I need to read a large (10GB+) file line by line and process each line. The processing is fairly simple, so it seemed that multiprocessing was the way to go. However, when I set it up, it is much, much slower than running things linearly. My CPU usage never goes above 50%, so it's not a processing power issue.
I'm running Python 3.6 in Jupyter Notebook on a Mac.
This is what I have, working from the accepted answer posted here:
def do_work(in_queue, out_list):
    while True:
        line = in_queue.get()
        # exit signal
        if line == None:
            return
        # fake work for testing
        elements = line.split("\t")
        out_list.append(elements)

if __name__ == "__main__":
    num_workers = 4
    manager = Manager()
    results = manager.list()
    work = manager.Queue(num_workers)

    # start for workers
    pool = []
    for i in range(num_workers):
        p = Process(target=do_work, args=(work, results))
        p.start()
        pool.append(p)

    # produce data
    with open(file_on_my_machine, 'rt',newline="\n") as f:
        for line in f:
            work.put(line)

    for p in pool:
        p.join()

    # get the results
    print(sorted(results))
I have a script that downloads images from URLs, but I would like to parallelise it, otherwise it will take hours. With this code:
import requests
from math import floor, log10
import urllib
import time
import multiprocessing

with open('images.csv', 'r') as f:
    images = f.readlines()

num_position = floor(log10(len(images)) + 1)

a = time.time()
for i, image in enumerate(images[1:10]):
    if (i+1) % 1000 == 0:
        print('Downloading {} image'.format(i+1) )
    # a = time.time()
    with open(str(i).zfill(num_position)+'a.jpg', 'wb') as file:
        try:
            writing = file.write(requests.get(image.split(',')[2]).content)
            p = multiprocessing.Process(target=writing, args=(image,))
            p.start()
            p.join()
        except:
            print('Skipping an image!')
            pass
b = time.time()
print('multiple process -- {}'.format(b-a))
I get an error:
Process Process-9:
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: 'int' object is not callable
Why am I getting an error even though the task still completes and the code doesn't break? (By that I mean the piece inside try:.)
What would be the easiest way to add some kind of parallelism here?
You get the error because this line
writing = file.write(requests.get(image.split(',')[2]).content)
produces an integer: write returns the number of bytes written, which is the length of your image's content. You then assign that to the variable writing, so writing becomes a number.
p = multiprocessing.Process(target=writing, args=(image,))
calls writing as the target function, which raises the error since you are not passing a function but an integer, and an integer is not callable. The download still works because the file has already been written by the time the worker process starts; the worker itself has nothing to do and exits immediately.
To get things working, you would have to define a function that takes your image (and maybe the file name) as arguments, and pass that function to your worker processes. Something like this:
def write_file(image, filename):
    filestream = open(filename, mode="wb")   # binary mode, since .content is bytes
    filestream.write(requests.get(image.split(',')[2]).content)
    filestream.close()
And in your application:
p = multiprocessing.Process(target=write_file, args=(image, filename,))
However, that is just the writing part. If you want to do the downloads in separate tasks as well, you have to put that code into your worker function too.
def download_write(urls):
    for url in iter(urls.get, 'STOP'):
        #download code here#
        filename = url.split(',')[0] + '.jpg'    # derive a file name from the csv line; adjust to your layout
        filestream = open(filename, mode="wb")   # binary mode, since .content is bytes
        filestream.write(requests.get(url.split(',')[2]).content)
        filestream.close()
And your main application:
list_urls = [] # your list of urls to download
urls = multiprocessing.Queue()  # must be a multiprocessing queue so the worker process can read it
for element in list_urls:
    urls.put(element)
p = multiprocessing.Process(target=download_write, args=(urls,))
urls.put("STOP") #signals end of tasks for your workers
p.start() #start worker
p.join() #wait for worker to finish
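A single worker still downloads one image at a time. For real parallelism you can point several processes at the same queue instead of the single Process above; a rough sketch (the worker count is an arbitrary assumption, and there must be one STOP sentinel per worker):

num_workers = 4                       # assumption: tune to your connection and CPU
for _ in range(num_workers):
    urls.put("STOP")                  # one sentinel per worker so each one exits
workers = []
for _ in range(num_workers):
    p = multiprocessing.Process(target=download_write, args=(urls,))
    p.start()
    workers.append(p)
for p in workers:
    p.join()                          # wait for all downloads to finish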