I have a Python 3 script that needs to make thousands of requests to multiple different websites and check whether their source code passes some predefined rules.
I am using Selenium to make the requests because I need the source code after the JS finishes executing, but due to the high number of URLs I need to check, I am trying to run multiple threads simultaneously. Each thread creates and maintains a webdriver instance to make the requests. The problem is that after a while all the threads go silent and simply stop executing, leaving just a single thread doing all the work. Here is the relevant part of my code:
def get_browser(use_firefox=True):
    if use_firefox:
        options = FirefoxOptions()
        options.headless = True
        browser = webdriver.Firefox(options=options)
        browser.implicitly_wait(4)
        return browser
    else:
        chrome_options = ChromeOptions()
        chrome_options.add_argument("--headless")
        browser = webdriver.Chrome(chrome_options=chrome_options)
        browser.implicitly_wait(4)
        return browser
def start_validations(urls, rules, results, thread_id):
    try:
        log("thread %s started" % thread_id, thread_id)
        browser = get_browser(thread_id % 2 == 1)
        while not urls.empty():
            url = "http://%s" % urls.get()
            try:
                log("starting %s" % url, thread_id)
                browser.get(url)
                time.sleep(0.5)
                WebDriverWait(browser, 6).until(selenium_wait_reload(4))
                html = browser.page_source
                result = check_url(html, rules)
                original_domain = url.split("://")[1].split("/")[0].replace("www.", "")
                tested_domain = browser.current_url.split("://")[1].split("/")[0].replace("www.", "")
                redirected_url = "" if tested_domain == original_domain else browser.current_url
                results.append({"Category": result, "URL": url, "Redirected": redirected_url})
                log("finished %s" % url, thread_id)
            except Exception as e:
                log("couldn't test url %s" % url, thread_id)
                log(str(e), thread_id)
                results.append({"Category": "Connection Error", "URL": url, "Redirected": ""})
                browser.quit()
                time.sleep(2)
                browser = get_browser(thread_id % 2 == 1)
    except Exception as e:
        log(str(e), thread_id)
    finally:
        log("closing thread", thread_id)
        browser.quit()
def calculate_progress(urls):
    progress_folder = "%sprogress/" % WEBROOT
    if not os.path.exists(progress_folder):
        os.makedirs(progress_folder)
    initial_size = urls.qsize()
    while not urls.empty():
        current_size = urls.qsize()
        on_queue = initial_size - current_size
        progress = '{0:.0f}'.format((on_queue / initial_size * 100))
        for progress_file in os.listdir(progress_folder):
            file_path = os.path.join(progress_folder, progress_file)
            if os.path.isfile(file_path) and not file_path.endswith(".csv"):
                os.unlink(file_path)
        os.mknod("%s%s" % (progress_folder, progress))
        time.sleep(1)
if __name__ == '__main__':
    while True:
        try:
            log("scraper started")
            if os.path.isfile(OUTPUT_FILE):
                os.unlink(OUTPUT_FILE)
            manager = Manager()
            rules = fetch_rules()
            urls = manager.Queue()
            fetch_urls()
            results = manager.list()
            jobs = []
            p = Process(target=calculate_progress, args=(urls,))
            jobs.append(p)
            p.start()
            for i in range(THREAD_POOL_SIZE):
                log("spawning thread with id %s" % i)
                p = Process(target=start_validations, args=(urls, rules, results, i))
                jobs.append(p)
                p.start()
                time.sleep(2)
            for j in jobs:
                j.join()
            save_results(results, OUTPUT_FILE)
            log("scraper finished")
        except Exception as e:
            log(str(e))
As you can see, at first I thought I could only have one instance of each browser, so I tried to run at least Firefox and Chrome in parallel, but this still leaves only one thread doing all the work.
Sometimes the driver crashed and the thread stopped working even though it was inside a try/except block, so I started getting a new instance of the browser every time this happens, but it still didn't work. I also tried waiting a few seconds between creating each instance of the driver, still with no results.
here is a pastebin of one of the log files:
https://pastebin.com/TsjZdRYf
A strange thing I noticed is that almost every time, the only thread that keeps running is the last one spawned (with id 3).
Thanks for your time and your help!
EDIT:
[1] Here is the full code: https://pastebin.com/fvVPwPVb
[2] custom selenium wait condition: https://pastebin.com/Zi7nbNFk
Am I allowed to curse on SO? I solved the problem, and I don't think this answer should exist on SO because nobody else will benefit from it. The problem was a custom wait condition that I had created. This class is in the pastebin that was added in edit 2, but I'll also add it here for convenience:
import time

class selenium_wait_reload:
    def __init__(self, desired_repeating_sources):
        self.desired_repeating_sources = desired_repeating_sources
        self.repeated_pages = 0
        self.previous_source = None

    def __call__(self, driver):
        while True:
            current_source = driver.page_source
            if current_source == self.previous_source:
                self.repeated_pages = self.repeated_pages + 1
                if self.repeated_pages >= self.desired_repeating_sources:
                    return True
            else:
                self.previous_source = current_source
                self.repeated_pages = 0
            time.sleep(0.3)
The goal of this class was to make Selenium wait, because the JS could still be loading additional DOM.
So this class makes Selenium wait a short time and check the source, wait a little again and check the source again. The class repeats this until the source code is the same 3 times in a row.
The problem is that some pages have a JS carousel, so the source code is never the same. I thought that in cases like this the second parameter of WebDriverWait would make it crash with a TimeoutException. I was wrong.
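A minimal sketch of one way around that trap (an illustration only, assuming the same class name and constructor as above): drop the internal while True loop and return a boolean each call, so WebDriverWait keeps control of the polling and its timeout can actually raise a TimeoutException on pages that never settle.

class selenium_wait_reload:
    def __init__(self, desired_repeating_sources):
        self.desired_repeating_sources = desired_repeating_sources
        self.repeated_pages = 0
        self.previous_source = None

    def __call__(self, driver):
        # Called repeatedly by WebDriverWait; no internal loop or sleep here.
        current_source = driver.page_source
        if current_source == self.previous_source:
            self.repeated_pages = self.repeated_pages + 1
        else:
            self.previous_source = current_source
            self.repeated_pages = 0
        return self.repeated_pages >= self.desired_repeating_sources

With this version, a page whose source keeps changing simply makes the condition keep returning False until the WebDriverWait timeout expires.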
I am currently working on a server/client chat program which works fine in the terminal, but in trying to migrate it into a windowed program I've come across a few errors, all of which I have managed to solve except this one: for whatever reason, the GUI class running on the main thread seems to lag considerably whenever the user focuses on it or tries to interact with it. I've tried to use threading, but I'm not sure what the problem is.
...
class GUI():
    def __init__(self):
        self.txtval = []
        self.nochngtxtval = []
        self.root = tk.Tk()
        self.root.title("Chatroom")
        self.root.geometry("400x200")
        self.labl = tk.Label(self.root, text="", justify=tk.LEFT, anchor='nw', width=45, relief=tk.RIDGE, height=8)
        self.labl.pack()
        count = 1
        while events["connected"] == False:
            if events["FError"] == True:
                quit()
            self.labl["text"] = "Loading" + "." * count
            count += 1
            if count > 3:
                count = 1
        self.labl["text"] = ""
        self.inputtxt = tk.Text(self.root, height=1, width=40, undo=True)
        self.inputtxt.bind("<Return>", self.sendinput)
        self.sendButton = tk.Button(self.root, text="Send")
        self.sendButton.bind("<Button-1>", self.sendinput)
        self.sendButton.pack(padx=5, pady=5, side=tk.BOTTOM)
        self.inputtxt.pack(padx=5, pady=5, side=tk.BOTTOM)
        uc = Thread(target=self.updatechat, daemon=True)
        uc.start()
        self.root.protocol("WM_DELETE_WINDOW", cleanAndClose)
        self.root.mainloop()

    def updatechat(self):
        global events
        while True:
            if events["recv"] != None:
                try:
                    current = events["recv"]
                    events["recv"] = None
                    ...  # shorten to reasonable length (45 characters)
                    self.labl['text'] = ''
                    self.txtval = []
                    self.nochngtxtval.append(new + "\n")
                    for i in range(len(self.nochngtxtval)):
                        self.txtval.append(self.nochngtxtval[i])
                    self.txtval.pop(len(self.txtval) - 1)  # probably not necessary
                    self.txtval.append(new + "\n")  # probably not necessary
                    self.txtval[len(self.txtval) - 1] = self.txtval[len(self.txtval) - 1][:-1]
                    for i in range(len(self.txtval)):
                        self.labl['text'] = self.labl['text'] + self.txtval[i]
                except Exception as e:
                    events["Error"] = str(e)
                    pass

    def sendinput(self, event):
        global events
        inp = self.inputtxt.get("1.0", "end").strip()
        events["send"] = inp
        self.inputtxt.delete('1.0', "end")

...  # start the logger and server threads
GUI()
I'm using my custom encryption, which is linked on GitHub: https://github.com/Nidhogg-Wyrmborn/BBR
My server is named server.py in the same git repo as my encryption.
EDIT:
I added a test file called WindowClasses.py to my git repo; it has no socket interactions and works fine. However, the problem seems to be introduced the moment I use sockets, as displayed in the client GUI. I hope this can help to ascertain the issue.
I've been programming in Python for a while, but this is my first attempt at multiprocessing.
I made a program that scrapes a local weather station for the ambient temperature using beautifulsoup4 every minute. The program also reads temperatures from several sensors and uploads everything to a MySQL database. This all works fine, but on occasion (about once a day) getting the data from the local weather station fails to retrieve the webpage. This causes BeautifulSoup to start an infinite loop, which effectively stops all functionality of the program. To combat this I decided to try my hand at multiprocessing.
I've coded a check that kills the extra process if it is still running after 10 seconds. Here is where things go wrong: normally the BeautifulSoup process finishes and closes after 2-4 seconds. However, in the case where BeautifulSoup gets stuck in its loop, not only is the process terminated, but the entire program stops doing anything altogether.
I've copied the relevant snippets of code. Please note that some vars are declared outside of the snippets; the code works except for the problem described above. By the way, I am very much aware that there is a plethora of ways to make my code more efficient. Refining the code is something that I'll do once it is working stably :) Thanks in advance for your help!
Imports:
...
from multiprocessing import Process, Queue
import multiprocessing
from bs4 import BeautifulSoup #sudo apt-get install python3-bs4
BeautifulSoup section:
def get_ZWS_temp_out(temp):
    try:
        if 1 == 1:
            response = requests.get(url)
            responsestr = str(response)
            if "200" in responsestr:
                soup = BeautifulSoup(response.content, 'html.parser')
                tb = soup.findAll("div", {"class": "elementor-element elementor-element-8245410 elementor-widget__width-inherit elementor-widget elementor-widget-wp-widget-live_weather_station_widget_outdoor"})
                tb2 = tb[0].findAll("div", {"class": "lws-widget-big-value"})
                string = str(tb2[0])[-10:][:4]
                stringt = string[:1]
                if stringt.isdigit() == True:
                    #print("getal ok")
                    string = string
                elif stringt == '-':
                    #print("minteken")
                    string = string
                elif stringt == '>':
                    #print("temp < 10")
                    string = string[-3:]
                temp = float(string)
    except Exception as error:
        print(error)
    Q.put(temp)
    return(temp)
Main program:
Q = Queue()
while 1 == 1:
    strings = time.strftime("%Y,%m,%d,%H,%M,%S")
    t = strings.split(',')
    time_numbers = [int(x) for x in t]
    if last_min != time_numbers[4]:
        targettemp = get_temp_target(targettemp)
        p = Process(target=get_ZWS_temp_out, name="get_ZWS_temp_out", args=(ZWS_temp_out,))
        p.start()
        i = 0
        join = True
        while i < 10:
            i = i + 1
            time.sleep(1)
            if p.is_alive() and i == 10:  # checks to quit early otherwise another iteration
                print(datetime.datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d %H:%M:%S"), ": ZWS getter is running for too long... let's kill it...")
                # Terminate ZWS query
                p.terminate()
                i = 10
                join = False
        if join == True:
            p.join()
Thanks in advance for your time :)
I have to manually stop the program, which gives the following output:
pi#Jacuzzi-pi:~ $ python3 /home/pi/Jacuzzi/thermometer.py
temperature sensors observer and saving program, updates every 3,5 seconds
2019-10-28 03:50:11 : ZWS getter is running for too long... let's kill it...
^CTraceback (most recent call last):
File "/home/pi/Jacuzzi/thermometer.py", line 283, in <module>
ZWS_temp_out = Q.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 94, in get
res = self._recv_bytes()
File "/usr/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
I believe your program is waiting indefinitely to pull items from the queue you've created. I can't see the line in the code you've posted, but it appears in the error message:
ZWS_temp_out = Q.get()
Since the get_ZWS_temp_out process is the one that adds items to the queue, you need to make sure that the process is running before you call Q.get(). I suspect this line of code gets executed between the act of terminating the timed-out process and restarting a new process, where instead it should be called after the new process is created.
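As an illustration only (not part of the original answer; Q and ZWS_temp_out are the names from the question), a minimal sketch of reading the result without blocking forever is to give the get a timeout and keep the previous value when the worker was killed before it could put anything on the queue:

import queue  # only needed for the Empty exception

try:
    ZWS_temp_out = Q.get(timeout=5)
except queue.Empty:
    # the worker was terminated before it produced a value,
    # so keep the previous ZWS_temp_out
    pass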
Based on what Rob found out, this is the updated (working) code for the main program; the other parts are unchanged:
Q = Queue()
while 1 == 1:
    strings = time.strftime("%Y,%m,%d,%H,%M,%S")
    t = strings.split(',')
    time_numbers = [int(x) for x in t]
    if last_min != time_numbers[4]:
        targettemp = get_temp_target(targettemp)
        p = Process(target=get_ZWS_temp_out, name="get_ZWS_temp_out", args=(ZWS_temp_out,))
        p.start()
        i = 0
        completion = True
        while i < 10:
            i = i + 1
            time.sleep(1)
            if p.is_alive() and i == 10:  # checks to quit early otherwise another iteration
                print(datetime.datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d %H:%M:%S"), ": ZWS getter is running for too long... let's kill it...")
                # Terminate ZWS query
                p.terminate()
                i = 10
                completion = False
        if completion == True:
            p.join()
            ZWS_temp_out = Q.get()
I use feedparser to get RSS feeds from some sites; my core code is like this:
def parseworker(procnum, result_queue, return_dict, source_link):
    try:
        data = feedparser.parse(source_link)
        return_dict[procnum] = data
    except Exception as e:
        print(str(e))
    result_queue.put(source_link + 'grabbed')

def infoworker(procnum, timeout, result_queue, source_name, source_link):
    text = 'recheck ' + source_name + ': ' + '...'
    progress = ''
    for x in range(timeout):
        progress += '.'
        sys.stdout.write('\r' + text + progress)
        sys.stdout.flush()
        time.sleep(1)
    result_queue.put('time out')

def parsecaller(link, timeout, timestocheck):
    return_dict = multiprocessing.Manager().dict()
    result_queue = multiprocessing.Queue()
    counter = 1
    jobs = []
    result = []
    while not (counter > timestocheck):
        p1 = multiprocessing.Process(target=infoworker, args=(11, timeout, result_queue, source_name, link))
        p2 = multiprocessing.Process(target=parseworker, args=(22, result_queue, return_dict, link))
        jobs.append(p1)
        jobs.append(p2)
        p1.start()
        p2.start()
        result_queue.get()
        p1.terminate()
        p2.terminate()
        p1.join()
        p2.join()
        result = return_dict.values()
        if not result or result[0].bozo:
            print(' bad - no data', flush=True)
            result = -1
        else:
            print(' ok ', flush=True)
            result = result[0]
            break
        counter += 1
    if result == -1:
        raise bot_exceptions.ParserExceptionData()
    elif result == -2:
        raise bot_exceptions.ParserExceptionConnection()
    else:
        return result

if __name__ == '__main__':
    multiprocessing.freeze_support()
    multiprocessing.set_start_method('spawn')
    try:
        data = parsecaller(source_link, timeout=wait_time, timestocheck=check_times)
    except Exception as e:
        print(str(e))
        continue
It works well, but after some random amount of time it goes into a suspended state and does nothing, like an endless boot loop. It may suspend after 4 hours or 3 days; it's random.
I tried to solve that problem with multiprocessing: the main process runs a timer, like infoworker. When infoworker stops, it puts a 'result' on the queue, which unblocks result_queue.get() in parsecaller, which then continues and terminates both processes. But it does not work. Today, after 11 hours, my code got stuck in a suspended state in multiprocessing's managers.py:
def serve_forever(self):
    '''
    Run the server forever
    '''
    self.stop_event = threading.Event()
    process.current_process()._manager_server = self
    try:
        accepter = threading.Thread(target=self.accepter)
        accepter.daemon = True
        accepter.start()
        try:
            while not self.stop_event.is_set():
                self.stop_event.wait(1)
        except (KeyboardInterrupt, SystemExit):
            pass
    finally:
        if sys.stdout != sys.__stdout__:  # what about stderr?
            util.debug('resetting stdout, stderr')
            sys.stdout = sys.__stdout__
            sys.stderr = sys.__stderr__
        sys.exit(0)
For all that time it was stuck in:
while not self.stop_event.is_set():
    self.stop_event.wait(1)
I think that either the GIL is not letting any other threads work in the processes, or feedparser goes into a loop. And of course it gets suspended on random RSS sources.
My 'environment':
Mac OS 10.12.6 (also was that situation on win7 and win 10)
Python 3.7.0 (also was that situation on 3.6.2, 3.6.5)
Pycharm 2017.2.2
My questions:
How to understand why it gets suspended (what to do, any recipe)?
How to bypass that state (what to do, any recipe)?
In Python 2.7 I am successfully using the following code to listen to a direct message stream on an account:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy import API
from tweepy.streaming import StreamListener

# These values are appropriately filled in the code
consumer_key = '######'
consumer_secret = '######'
access_token = '######'
access_token_secret = '######'

class StdOutListener(StreamListener):
    def __init__(self):
        self.tweetCount = 0

    def on_connect(self):
        print("Connection established!!")

    def on_disconnect(self, notice):
        print("Connection lost!! : ", notice)

    def on_data(self, status):
        print("Entered on_data()")
        print(status, flush=True)
        return True
        # I can add code here to execute when a message is received, such as slicing the message and activating something else

    def on_direct_message(self, status):
        print("Entered on_direct_message()")
        try:
            print(status, flush=True)
            return True
        except BaseException as e:
            print("Failed on_direct_message()", str(e))

    def on_error(self, status):
        print(status)

def main():
    try:
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.secure = True
        auth.set_access_token(access_token, access_token_secret)
        api = API(auth)
        # If the authentication was successful, you should
        # see the name of the account print out
        print(api.me().name)
        stream = Stream(auth, StdOutListener())
        stream.userstream()
    except BaseException as e:
        print("Error in main()", e)

if __name__ == '__main__':
    main()
This is great, and I can also execute code when I receive a message, but the jobs I'm adding to a work queue need to be able to stop after a certain amount of time. I'm using the popular start = time.time() approach, subtracting the current time to determine the elapsed time, but this streaming code does not loop to check the time. It just waits for a new message, so the clock is never checked, so to speak.
My question is this: How can I get streaming to occur and still track time elapsed? Do I need to use multithreading as described in this article? http://www.tutorialspoint.com/python/python_multithreading.htm
I am new to Python and having fun playing around with hardware attached to a Raspberry Pi. I have learned so much from Stackoverflow, thank you all :)
I'm not sure exactly how you want to decide when to stop, but you can pass a timeout argument to the stream to give up after a certain delay.
stream = Stream(auth, StdOutListener(), timeout=30)
That will call your listener's on_timeout() method. If you return True, it will continue streaming. Otherwise, it will stop.
Between the stream's timeout argument and your listener's on_timeout(), you should be able to decide when to stop streaming.
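For illustration only (TimedListener, the expired helper, and the cutoff values are hypothetical; the sketch builds on the StdOutListener from the question and on the on_timeout behaviour described above, and auth is the OAuthHandler from main()):

import time

class TimedListener(StdOutListener):
    def __init__(self, max_seconds=120):
        StdOutListener.__init__(self)
        self.start = time.time()
        self.max_seconds = max_seconds

    def expired(self):
        return (time.time() - self.start) > self.max_seconds

    def on_data(self, status):
        print(status)
        return not self.expired()  # returning False disconnects the stream

    def on_timeout(self):
        return not self.expired()  # keep streaming only while time remains

stream = Stream(auth, TimedListener(max_seconds=120), timeout=30)
stream.userstream()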
I found I was able to get some multithreading code working the way I wanted. Unlike this tutorial from Tutorialspoint, which gives an example of launching multiple instances of the same code with varying timing parameters, I was able to get two different blocks of code to run in their own threads:
One block of code constantly adds 10 to a global variable (var).
Another block checks when 5 seconds have elapsed and then prints var's value.
This demonstrates 2 different tasks executing and sharing data using Python multithreading.
See code below
import threading
import time

exitFlag = 0
var = 10

class myThread1 (threading.Thread):
    def __init__(self, threadID, name, counter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter

    def run(self):
        #var counting block begins here
        print "addemup starting"
        global var
        while (var < 100000):
            if var > 90000:
                var = 0
            var = var + 10

class myThread2 (threading.Thread):
    def __init__(self, threadID, name, counter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter

    def run(self):
        #time checking block begins here and prints var every 5 secs
        print "checkem starting"
        global var
        start = time.time()
        elapsed = time.time() - start
        while (elapsed < 10):
            elapsed = time.time() - start
            if elapsed > 5:
                print "var = ", var
                start = time.time()
                elapsed = time.time() - start

# Create new threads
thread1 = myThread1(1, "Thread-1", 1)
thread2 = myThread2(2, "Thread-2", 2)

# Start new Threads
thread1.start()
thread2.start()

print "Exiting Main Thread"
My next task will be breaking up my twitter streaming in to its own thread, and passing direct messages received as variables to a task queueing program, while hopefully the first thread continues to listen for more direct messages.
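Roughly, that hand-off could look like this (a sketch only; QueueingListener and dm_worker are made-up names, StdOutListener is the class from the earlier snippet, and Queue is the Python 2 module):

import Queue  # named queue in Python 3
import threading

work_queue = Queue.Queue()

class QueueingListener(StdOutListener):
    def on_direct_message(self, status):
        # hand the raw message to the worker thread and keep listening
        work_queue.put(status)
        return True

def dm_worker():
    while True:
        message = work_queue.get()
        # slice the message and trigger the hardware action here
        work_queue.task_done()

worker_thread = threading.Thread(target=dm_worker)
worker_thread.setDaemon(True)
worker_thread.start()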
I am working on a Python script to check if a URL is working. The script will write the URL and response code to a log file.
To speed up the check, I am using threading and queue.
The script works well if the number of URLs to check is small, but when the number of URLs increases to hundreds, some URLs just go missing from the log file.
Is there anything I need to fix?
My script is
#!/usr/bin/env python
import Queue
import threading
import urllib2, urllib, sys, cx_Oracle, os
import time
from urllib2 import HTTPError, URLError

queue = Queue.Queue()
##print_queue = Queue.Queue()

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        infourl = urllib.addinfourl(fp, headers, req.get_full_url())
        infourl.status = code
        infourl.code = code
        return infourl
    http_error_300 = http_error_302
    http_error_301 = http_error_302
    http_error_303 = http_error_302
    http_error_307 = http_error_302

class ThreadUrl(threading.Thread):
    #Threaded Url Grab
##    def __init__(self, queue, print_queue):
    def __init__(self, queue, error_log):
        threading.Thread.__init__(self)
        self.queue = queue
##        self.print_queue = print_queue
        self.error_log = error_log

    def do_something_with_exception(self, idx, url, error_log):
        exc_type, exc_value = sys.exc_info()[:2]
##        self.print_queue.put([idx, url, exc_type.__name__])
        with open(error_log, 'a') as err_log_f:
            err_log_f.write("{0},{1},{2}\n".format(idx, url, exc_type.__name__))

    def openUrl(self, pair):
        try:
            idx = pair[1]
            url = 'http://' + pair[2]
            opener = urllib2.build_opener(NoRedirectHandler())
            urllib2.install_opener(opener)
            request = urllib2.Request(url)
            request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0.1')
            #open urls of hosts
            resp = urllib2.urlopen(request, timeout=10)
##            self.print_queue.put([idx, url, resp.code])
            with open(self.error_log, 'a') as err_log_f:
                err_log_f.write("{0},{1},{2}\n".format(idx, url, resp.code))
        except:
            self.do_something_with_exception(idx, url, self.error_log)

    def run(self):
        while True:
            #grabs host from queue
            pair = self.queue.get()
            self.openUrl(pair)
            #signals to queue job is done
            self.queue.task_done()

def readUrlFromDB(queue, connect_string, column_name, table_name):
    try:
        connection = cx_Oracle.Connection(connect_string)
        cursor = cx_Oracle.Cursor(connection)
        query = 'select ' + column_name + ' from ' + table_name
        cursor.execute(query)
        #Count lines in the file
        rows = cursor.fetchall()
        total = cursor.rowcount
        #Loop through returned urls
        for row in rows:
            #print row[1],row[2]
##            url = 'http://'+row[2]
            queue.put(row)
        cursor.close()
        connection.close()
        return total
    except cx_Oracle.DatabaseError, e:
        print e[0].context
        raise

def main():
    start = time.time()
    error_log = "D:\\chkWebsite_Error_Log.txt"
    #Check if error_log file exists
    #If exists then deletes it
    if os.path.isfile(error_log):
        os.remove(error_log)
    #spawn a pool of threads, and pass them queue instance
    for i in range(10):
        t = ThreadUrl(queue, error_log)
        t.setDaemon(True)
        t.start()
    connect_string, column_name, table_name = "user/pass#db", "*", "T_URL_TEST"
    tn = readUrlFromDB(queue, connect_string, column_name, table_name)
    #wait on the queue until everything has been processed
    queue.join()
##    print_queue.join()
    print "Total retrived: {0}".format(tn)
    print "Elapsed Time: %s" % (time.time() - start)

main()
Python's threading module isn't really parallel because of the global interpreter lock (http://wiki.python.org/moin/GlobalInterpreterLock), so you should use multiprocessing (http://docs.python.org/library/multiprocessing.html) if you really want to take advantage of multiple cores.
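As a rough illustration only (check_one and the host list are made up, not taken from the question), the same fan-out with a process pool could look like this:

from multiprocessing import Pool

def check_one(pair):
    idx, host = pair
    # open http://<host> here and return the real status code;
    # 200 is just a placeholder
    return (idx, host, 200)

if __name__ == '__main__':
    pool = Pool(processes=10)
    results = pool.map(check_one, [(1, 'example.com'), (2, 'example.org')])
    pool.close()
    pool.join()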
Also, you seem to be accessing a file simultaneously from several threads:
with open(self.error_log, 'a') as err_log_f:
    err_log_f.write("{0},{1},{2}\n".format(idx, url, resp.code))
This is really bad, AFAIK. If two threads are trying to write to the same file at the same time, or almost at the same time (keep in mind, they're not really multithreaded), the behaviour tends to be undefined; imagine one thread writing while another has just closed it...
Anyway, you would need a third queue to handle writing to the file, along these lines:
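This is only a sketch of the idea (log_queue and log_writer are made-up names; error_log is the path from the question):

import Queue  # named queue in Python 3
import threading

error_log = "D:\\chkWebsite_Error_Log.txt"  # same path as in the question
log_queue = Queue.Queue()

def log_writer(error_log):
    # the single thread that owns the file, draining lines from the queue
    while True:
        line = log_queue.get()
        with open(error_log, 'a') as err_log_f:
            err_log_f.write(line)
        log_queue.task_done()

writer = threading.Thread(target=log_writer, args=(error_log,))
writer.setDaemon(True)
writer.start()

# worker threads then call
#     log_queue.put("{0},{1},{2}\n".format(idx, url, resp.code))
# instead of opening the file themselves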
At first glance this looks like a race condition, since many threads are trying to write to the log file at the same time. See this question for some pointers on how to lock a file for writing (so only one thread can access it at a time).
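A minimal sketch of the locking idea (threading.Lock and the write_log helper are not in the original script; they are just for illustration):

import threading

log_lock = threading.Lock()

def write_log(error_log, idx, url, status):
    # only one thread at a time is allowed to append to the log file
    with log_lock:
        with open(error_log, 'a') as err_log_f:
            err_log_f.write("{0},{1},{2}\n".format(idx, url, status))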