Python pool.apply_async() doesn't call target function?

I'm writing an optimization routine that brute-force searches a solution space for optimal hyperparameters, and apply_async does not appear to be doing anything at all. Ubuntu Server 16.04, Python 3.5, PyCharm CE 2018. Also, I'm doing this on an Azure virtual machine. My code looks like this:
class optimizer(object):
    def __init__(self,n_proc,frame):
        # Set Class Variables
    def prep(self):
        # Get Data and prepare for optimization
    def ret_func(self,retval):
        self.results = self.results.append(retval)
        print('Something')
    def search(self):
        p = multiprocessing.Pool(processes=self.n_proc)
        for x, y in zip(repeat(self.data),self.grid):
            job = p.apply_async(self.bot.backtest,(x,y),callback=self.ret_func)
        p.close()
        p.join()
        self.results.to_csv('OptimizationResults.csv')
        print('***************************')
        print('Exiting, Optimization Complete')
if __name__ == '__main__':
    multiprocessing.freeze_support()
    opt = optimizer(n_proc=4,frame='ytd')
    opt.prep()
    print('Data Prepped, beginning search')
    opt.search()
I was running this exact setup on a Windows Server VM, and I switched over due to issues with multiprocessing not utilizing all cores. Today, I configured my machine and was able to run the optimization one time only. After that, it mysteriously stopped working with no change from me. Also, I should mention that it produces output roughly 1 in 10 times I run it. Very odd behavior. I expect to see:
Something
Something
Something
.....
Which would typically be the best "to-date" results of the optimization (omitted for clarity). Instead I get:
Data Prepped, beginning search
***************************
Exiting, Optimization Complete
If I call get() on the async object, the results are printed as expected, but only one core is utilized because the results are being gathered in the for loop. Why isn't apply_async doing anything at all? I should mention that I use the "stop" button on Pycharm to terminate the process, not sure if this has something to do with it?
Let me know if you need more details about prep(), or bot.backtest()

I found the error! Basically I was converting a dict() to a list() and passing the values from the list into my function. The order of the list was different every time I ran the program, and one of the parameters needed to be an integer, not a float.
For some reason, on Windows, the order of the dict was preserved when converting to a list; that was not the case on Ubuntu. Very interesting.
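One reason a failure like this stays invisible: apply_async stores any exception raised by the target inside the returned AsyncResult, and it only surfaces when get() is called or an error_callback is supplied. The following is a minimal sketch, not the asker's code. It assumes a hypothetical backtest stand-in and uses ThreadPool so that it runs in any environment; multiprocessing.Pool.apply_async accepts the same callback and error_callback arguments. Reading parameters by name from a dict also removes any dependence on dict ordering, which Python only guarantees from version 3.7 onward.

```python
from multiprocessing.pool import ThreadPool


def backtest(data, params):
    """Hypothetical stand-in for bot.backtest; reads parameters by name."""
    window = params["window"]
    if not isinstance(window, int):
        raise TypeError("window must be an int, got %r" % (window,))
    return sum(data[:window]) * params["scale"]


def run_search(data, grid, n_workers=4):
    """Run backtest over every parameter set, collecting results and errors."""
    results, errors = [], []
    with ThreadPool(n_workers) as pool:
        for params in grid:
            pool.apply_async(
                backtest,
                (data, params),
                callback=results.append,      # called with the return value
                error_callback=errors.append, # called with the exception
            )
        pool.close()
        pool.join()
    return results, errors


if __name__ == "__main__":
    grid = [{"window": 2, "scale": 1.0}, {"window": 2.0, "scale": 1.0}]
    results, errors = run_search([1, 2, 3, 4], grid)
    print("results:", results)
    print("errors:", errors)
```

Without the error_callback, the second parameter set would fail silently and the program would still exit normally, which matches the behaviour described in the question.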

Related

Multiprocessing with Multiple Functions: Need to add a function to the pool from within another function

I am measuring the metrics of an encryption algorithm that I designed. I have declared 2 functions and a brief sample is as follows:
import sys, random, timeit, psutil, os, time
from multiprocessing import Process
from subprocess import check_output
pid=0
def cpuUsage():
    global running
    while pid == 0:
        time.sleep(1)
    running=True
    p = psutil.Process(pid)
    while running:
        print(f'PID: {pid}\t|\tCPU Usage: {p.memory_info().rss/(1024*1024)} MB')
        time.sleep(1)
def Encryption():
    global pid, running
    pid = os.getpid()
    myList=[]
    for i in range(1000):
        myList.append(random.randint(-sys.maxsize,sys.maxsize)+random.random())
    print('Now running timeit function for speed metrics.')
    p1 = Process(target=metric_collector())
    p1.start()
    p1.join()
    number=1000
    unit='msec'
    setup = '''
import homomorphic,random,sys,time,os,timeit
myList={myList}
'''
    enc_code='''
for x in range(len(myList)):
    myList[x] = encryptMethod(a, b, myList[x], d)
'''
    dec_code='''
\nfor x in range(len(myList)):
    myList[x] = decryptMethod(myList[x])
'''
    time=timeit.timeit(setup=setup,
                       stmt=(enc_code+dec_code),
                       number=number)
    running=False
    print(f'''Average Time:\t\t\t {time/number*.0001} seconds
Total time for {number} Iters:\t\t\t {time} {unit}s
Total Encrypted/Decrypted Values:\t {number*len(myList)}''')
    sys.exit()
if __name__ == '__main__':
    print('Beginning Metric Evaluation\n...\n')
    p2 = Process(target=Encryption())
    p2.start()
    p2.join()
I am sure there's an implementation error in my code, I'm just having trouble grabbing the PID for the encryption method and I am trying to make the overhead from other calls as minimal as possible so I can get an accurate reading of just the functionality of the methods being called by timeit. If you know a simpler implementation, please let me know. Trying to figure out how to measure all of the metrics has been killing me softly.
I've tried acquiring the pid a few different ways, but I only want to measure performance when timeit is run. Good chance I'll have to break this out separately and run it that way (instead of multiprocessing) to evaluate the function properly, I'm guessing.
There are at least three major problems with your code. The net result is that you are not actually doing any multiprocessing.
The first problem is here, and in a couple of other similar places:
p2 = Process(target=Encryption())
What this code passes to Process is not the function Encryption but the returned value from Encryption(). It is exactly the same as if you had written:
x = Encryption()
p2 = Process(target=x)
What you want is this:
p2 = Process(target=Encryption)
This code tells Python to create a new Process and execute the function Encryption() in that Process.
The second problem has to do with the way Python handles memory for Processes. Each Process lives in its own memory space. Each Process has its own local copy of global variables, so you cannot set a global variable in one Process and have another Process be aware of this change. There are mechanisms to handle this important situation, documented in the multiprocessing module. See the section titled "Sharing state between processes." The bottom line here is that you cannot simply set a global variable inside a Process and expect other Processes to see the change, as you are trying to do with pid. You have to use one of the approaches described in the documentation.
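As one illustration of the mechanisms in that documentation section, the sketch below uses multiprocessing.Value to pass a PID from a child back to its parent. The record_pid helper is hypothetical and only stands in for whatever the child needs to report.

```python
import os
from multiprocessing import Process, Value


def record_pid(shared_pid):
    """Hypothetical helper: store this process's PID in shared memory."""
    with shared_pid.get_lock():
        shared_pid.value = os.getpid()


if __name__ == "__main__":
    # 'i' means a C int; the initial value 0 plays the role of "not set yet".
    shared_pid = Value("i", 0)
    child = Process(target=record_pid, args=(shared_pid,))
    child.start()
    child.join()
    # Unlike a plain global, the parent can see what the child wrote.
    print("PID written by child:", shared_pid.value)
    print("matches child.pid:", shared_pid.value == child.pid)
```

A plain module-level variable assigned in record_pid would only change the child's copy; the Value object is what makes the write visible to the parent.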
The third problem is this code pattern, which occurs for both p1 and p2.
p2 = Process(target=Encryption)
p2.start()
p2.join()
This tells Python to create a Process and to start it. Then you immediately wait for it to finish, which means that your current Process must stop at that point until the new Process is finished. You never allow two Processes to run at once, so there is no performance benefit. The only reason to use multiprocessing is to run two things at the same time, which you never do. You might as well not bother with multiprocessing at all since it is only making your life more difficult.
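For contrast, a pattern that does let two processes overlap is to start all of them before joining any. This is a small sketch with a hypothetical work function that only sleeps; the run_together helper is an illustration, not part of the multiprocessing API.

```python
import time
from multiprocessing import Process


def work(seconds):
    """Hypothetical task that just sleeps for the given time."""
    time.sleep(seconds)


def run_together(workers):
    """Start every worker first, then wait for all of them."""
    for w in workers:
        w.start()
    for w in workers:
        w.join()


if __name__ == "__main__":
    started = time.perf_counter()
    run_together([Process(target=work, args=(1,)) for _ in range(2)])
    elapsed = time.perf_counter() - started
    print(f"two 1-second tasks took {elapsed:.1f}s")
```

Because both processes are running before the first join() is reached, the total time should be close to the duration of one task rather than the sum of both.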
Finally I am not sure why you have decided to try to use multiprocessing in the first place. The functions that measure memory usage and execution time are almost certainly very fast, and I would expect them to be much faster than any method of synchronizing one Process to another. If you're worried about errors due to the time used by the diagnostic functions themselves, I doubt that you can make things better by multiprocessing. Why not just start with a simple program and see what results you get?
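A simple single-process starting point could look like the sketch below. The encrypt and decrypt functions are hypothetical placeholders, since the question's encryptMethod and decryptMethod are not shown. Memory is measured with the standard-library tracemalloc, which reports memory allocated by Python rather than the process RSS that psutil reports.

```python
import random
import sys
import timeit
import tracemalloc


def encrypt(value):
    """Hypothetical placeholder for the asker's encryptMethod."""
    return value + 1.0


def decrypt(value):
    """Hypothetical placeholder for the asker's decryptMethod."""
    return value - 1.0


def round_trip(values):
    """Encrypt and then decrypt every value."""
    return [decrypt(encrypt(v)) for v in values]


def measure(values, number=100):
    """Return (average seconds per round trip, peak traced bytes)."""
    # timeit accepts a callable, so no code strings or setup text are needed.
    total_seconds = timeit.timeit(lambda: round_trip(values), number=number)
    # Trace memory in a separate pass so tracing overhead does not skew timing.
    tracemalloc.start()
    round_trip(values)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total_seconds / number, peak_bytes


if __name__ == "__main__":
    data = [random.randint(-sys.maxsize, sys.maxsize) + random.random()
            for _ in range(1000)]
    avg_seconds, peak_bytes = measure(data)
    print(f"average time per round trip: {avg_seconds:.6f} s")
    print(f"peak traced memory: {peak_bytes / 1024:.1f} KiB")
```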

How to run a threaded function that returns a variable?

Working with Python 3.6, what I’m looking to accomplish is to create a function that continuously scrapes dynamic/changing data from a webpage, while the rest of the script executes, and is able to reference the data returned from the continuous function.
I know this is likely a threading task, however I'm not super knowledgeable in it yet. Pseudo-code for what I have in mind looks something like this:
def continuous_scraper():
    # Pull data from webpage
    scraped_table = pd.read_html(url)
    return scraped_table
# start the continuous scraper function here, to run either indefinitely, or preferably stop after a predefined amount of time
scraped_table = thread(continuous_scraper)
# the rest of the script is run here, making use of the updating "scraped_table"
while True:
    print(scraped_table["Col_1"].iloc[0])
Here is a fairly simple example using some stock market page that seems to update every couple of seconds.
import threading, time
import pandas as pd
# A lock is used to ensure only one thread reads or writes the variable at any one time
scraped_table_lock = threading.Lock()
# Initially set to None so we know when its value has changed
scraped_table = None
# This bad-boy will be called only once in a separate thread
def continuous_scraper():
    # Tell Python this is a global variable, so it rebinds scraped_table
    # instead of creating a local variable that is also named scraped_table
    global scraped_table
    url = r"https://tradingeconomics.com/australia/stock-market"
    while True:
        # Pull data from webpage
        result = pd.read_html(url, match="Dow Jones")[0]
        # Acquire the lock to ensure thread-safety, then assign the new result
        # This is done after read_html returns so it doesn't hold the lock for so long
        with scraped_table_lock:
            scraped_table = result
        # You don't wanna flog the server, so wait 2 seconds after each
        # response before sending another request
        time.sleep(2)
# Make the thread daemonic, so the thread doesn't continue to run once the
# main script and any other non-daemonic threads have ended
scraper_thread = threading.Thread(target=continuous_scraper, daemon=True)
# start the continuous scraper function here, to run either indefinitely, or
# preferably stop after a predefined amount of time
scraper_thread.start()
# the rest of the script is run here, making use of the updating "scraped_table"
for _ in range(100):
    print("Time:", time.time())
    # Acquire the lock to ensure thread-safety
    with scraped_table_lock:
        # Check if it has been changed from the default value of None
        if scraped_table is not None:
            print(" ", scraped_table)
        else:
            print("scraped_table is None")
    # You probably don't wanna flog your stdout, either, dawg!
    time.sleep(0.5)
Be sure to read about multithreaded programming and thread safety. It's easy to make mistakes. If there is a bug, it often only manifests in rare and seemingly random occasions, making it difficult to debug.
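The question also asked for a way to stop the scraper after a set time. One possible way to do that, sketched here under assumptions, is to wrap the loop in a small class with a threading.Event. The Poller class and its fetch argument are hypothetical; fetch stands in for the pd.read_html call.

```python
import threading
import time


class Poller:
    """Call fetch() repeatedly in a background thread and keep the latest result."""

    def __init__(self, fetch, interval):
        self._fetch = fetch
        self._interval = interval
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._latest = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            result = self._fetch()
            with self._lock:
                self._latest = result
            # Unlike time.sleep(), wait() returns early once stop() is called.
            self._stop.wait(self._interval)

    def start(self):
        self._thread.start()

    def stop(self):
        """Ask the loop to finish and wait for the thread to exit."""
        self._stop.set()
        self._thread.join()

    @property
    def latest(self):
        with self._lock:
            return self._latest


if __name__ == "__main__":
    poller = Poller(fetch=time.time, interval=0.5)  # time.time is a dummy fetch
    poller.start()
    time.sleep(2)          # the rest of the script would run here
    print("latest value:", poller.latest)
    poller.stop()
```

Reading through the latest property keeps the lock handling in one place, so the rest of the script does not need to touch the lock directly.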
I recommend looking into the multiprocessing library and its Pool class.
The docs have multiple examples of how to use it.
The question itself is too general for a simple answer.

Threads will not close off after program completion

I have a script that retrieves temperature data using requests. Since I have to make many requests (around 13000), I decided to explore multi-threading, which I am new to.
The program works by grabbing longitude/latitude data from a csv file and then making a request to retrieve the temperature data.
The problem that I am facing is that the script does not finish fully when the last temperature value is retrieved.
Here is the code. I have shortened so it is easy to see what I am doing:
num_threads = 16
q = Queue(maxsize=0)
def get_temp(q):
    while not q.empty():
        work = q.get()
        if work is None:
            break
        ## rest of my code here
        q.task_done()
At main:
def main():
    for o in range(num_threads):
        logging.debug('Starting Thread %s', o)
        worker = threading.Thread(target=get_temp, args=(q,))
        worker.setDaemon(True)
        worker.start()
    logging.info("Main Thread Waiting")
    q.join()
    logging.info("Job complete!")
I do not see any errors on the console, and the temperature is being successfully written to another file. I have tried running a test csv file with only a few longitude/latitude references, and the script seems to finish executing fine.
So is there a way of shedding light as to what might be happening in the background? I am using Python 3.7.3 on PyCharm 2019.1 on Linux Mint 19.1.
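q.join() blocks until task_done() has been called once for every item that was put on the queue; it does not wait for the threads themselves. If a worker takes an item and exits without calling task_done() for it (for example through the break, or because of an exception in the request code), q.join() never returns.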
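One way to make the worker robust is sketched below. It calls task_done() in a finally block so that q.join() can always return, and it uses one None sentinel per thread instead of polling q.empty(). The fetch callable is a hypothetical stand-in for the temperature request, and the sketch assumes None never appears as a real work item.

```python
import queue
import threading


def worker(q, fetch, results, errors):
    while True:
        item = q.get()              # blocks until something is available
        try:
            if item is None:        # sentinel: this worker is done
                return
            results.append(fetch(item))
        except Exception as exc:    # keep the thread alive on a failed request
            errors.append((item, exc))
        finally:
            q.task_done()           # runs on success, failure, and sentinel


def run(items, fetch, num_threads=4):
    """Process items with fetch() on num_threads threads."""
    q = queue.Queue()
    results, errors = [], []
    threads = [
        threading.Thread(target=worker, args=(q, fetch, results, errors))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    for _ in threads:
        q.put(None)                 # one sentinel per worker
    q.join()                        # returns once every item is marked done
    for t in threads:
        t.join()
    return results, errors


if __name__ == "__main__":
    results, errors = run([1, 0, 2], fetch=lambda x: 10 // x)
    print("results:", sorted(results))
    print("failed items:", [item for item, _ in errors])
```

Collecting failures in a list instead of letting them kill the thread also shows which coordinates failed, which addresses the question of what is happening in the background.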

Why do some widgets not update in Qt5?

I am trying to create a PyQt5 application in which I use certain labels to display status variables. To update them, I have implemented a custom pyqtSignal manually. However, on debugging I find that the value of the QLabel has changed, but the change is not reflected in the main window.
Some answers suggested calling QApplication().processEvents() occasionally. However, this either crashes the application immediately or freezes it.
Here's a sample code (all required libraries are imported, it's just the part creating problem, the actual code is huge):
from multiprocessing import Process
def sub(signal):
    i = 0
    while (True):
        if (i % 5 == 0):
            signal.update(i)
class CustomSignal(QObject):
    signal = pyqtSignal(int)
    def update(self, value):
        self.signal.emit(value)
class MainApp(QWidget):
    def __init__(self):
        super().__init__()
        self.label = QLabel("0")
        self.customSignal = CustomSignal()
        self.subp = Process(target=sub, args=(self.customSignal,))
        self.subp.start()
        self.customSignal.signal.connect(self.updateValue)
    def updateValue(self, value):
        print("old value", self.label.text())
        self.label.setText(str(value))
        print("new value", self.label.text())
The output of the print statements is as expected. However, the text in label does not change.
The update function in CustomSignal is called by some thread.
I've applied the same method to update progress bar which works fine.
Is there any other fix for this, other than processEvents()?
The OS is Ubuntu 16.04.
The key problem lies in the very concept behind the code.
Processes have their own address space and don't share data with other processes unless some inter-process communication mechanism is used. Perhaps the multiprocessing module was used instead of the threading module in order to get around Python's GIL and speed up the program. However, a subprocess cannot access the data of its parent process.
I have tested two solutions to this case, and they seem to work.
threading module: Threading in Python is limited by the GIL, but it is still sufficient for basic concurrency demands. Note the difference between concurrency and speedup.
QThread: Since you are using PyQt, there isn't any issue in using QThread, which is arguably the better option because it integrates with Qt's event loop and signal/slot mechanism. Python code running in a QThread is still subject to the GIL, so this too is about concurrency rather than multi-core speedup.
Try adding
self.label.repaint()
immediately after updating the text, like this:
self.label.setText(str(value))
self.label.repaint()

Should I use coroutines or another scheduling object here?

I currently have code in the form of a generator which calls an IO-bound task. The generator actually calls sub-generators as well, so a more general solution would be appreciated.
Something like the following:
def processed_values(list_of_io_tasks):
    for task in list_of_io_tasks:
        value = slow_io_call(task)
        yield postprocess(value)  # in real version, would iterate over
                                  # processed_values2(value) here
I have complete control over slow_io_call, and I don't care in which order I get the items from processed_values. Is there something like coroutines I can use to get the yielded results in the fastest order by turning slow_io_call into an asynchronous function and using whichever call returns fastest? I expect list_of_io_tasks to be at least thousands of entries long. I've never done any parallel work other than with explicit threading, and in particular I've never used the various forms of lightweight threading which are available.
I need to use the standard CPython implementation, and I'm running on Linux.
Sounds like you are in search of multiprocessing.Pool(), specifically the Pool.imap_unordered() method.
Here is a port of your function to use imap_unordered() to parallelize calls to slow_io_call().
def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # num workers
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        try:
            yield results.next(9999999)  # large time-out
        except StopIteration:
            return  # no more results
Note that you could also iterate over results directly (i.e. for item in results: yield item) without a while True loop. However, calling results.next() with a time-out value works around a known multiprocessing keyboard-interrupt bug and allows you to kill the main process and all subprocesses with Ctrl-C. Also note that results.next() raises StopIteration when it has no more items to return. Since Python 3.7 (PEP 479), a StopIteration that escapes a generator body is converted into a RuntimeError, so the function catches it and returns, which ends the generator cleanly.
To use threads in place of processes, replace
import multiprocessing
with
import multiprocessing.dummy as multiprocessing
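Since the task in the question is IO-bound, a thread-backed pool may be enough. The sketch below uses multiprocessing.dummy.Pool and keeps the postprocess step from the question. Both slow_io_call and postprocess are hypothetical stand-ins here; the fake IO call just sleeps for the given number of milliseconds.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed Pool with the same API


def slow_io_call(task):
    """Hypothetical IO-bound call: sleep for `task` milliseconds, return it."""
    time.sleep(task / 1000)
    return task


def postprocess(value):
    """Hypothetical post-processing step."""
    return value * 2


def processed_values(list_of_io_tasks, workers=4):
    """Yield post-processed results in completion order, not input order."""
    with Pool(workers) as pool:
        for value in pool.imap_unordered(slow_io_call, list_of_io_tasks):
            yield postprocess(value)


if __name__ == "__main__":
    for item in processed_values([200, 0, 50]):
        print(item)
```

With thousands of tasks, the number of workers bounds how many calls are in flight at once, so it is the main knob to tune.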
