Python multiprocessing taking the brakes off OSX - python-3.x

I have a program that randomly selects 13 cards from a full pack and analyses the hands for shape, point count and some other features important to the game of bridge. The program will select and analyse 10**7 hands in about 5 minutes. Checking the Activity Monitor shows that during execution the CPU (which s a 6 Core processor) is devoting about 9% of its time to the program and ~90% of its time it is idle. So it looks like a prime candidate for multiprocessing and I created a multiprocessing version using a Queue to pass information from each process back to the main program. Having navigated the problems of IDLE not working will multiprocessing (I now run it using PyCharm) and that doing a join on a process before it has finished freezes the program, I got it to work.
However, it doesn’t matter how many processes I use 5,10, 25 or 50 the result is always the same. The CPU devotes about 18% of its time to the program and has ~75% of its time idle and the execution time is slightly more than double at a bit over 10 minutes.
Can anyone explain how I can get the processes to take up more of the CPU time and how I can get the execution time to reflect this? Below are the relevant sections fo the program:
import random
import collections
import datetime
import time
from math import log10
from multiprocessing import Process, Queue
NUM_OF_HANDS = 10**6
NUM_OF_PROCESSES = 25
def analyse_hands(numofhands, q):
#code remove as not relevant to the problem
q.put((distribution, points, notrumps))
if __name__ == '__main__':
processlist = []
q = Queue()
handsperprocess = NUM_OF_HANDS // NUM_OF_PROCESSES
print(handsperprocess)
# Set up the processes and get them to do their stuff
start_time = time.time()
for _ in range(NUM_OF_PROCESSES):
p = Process(target=analyse_hands, args=((handsperprocess, q)))
processlist.append(p)
p.start()
# Allow q to get a few items
time.sleep(.05)
while not q.empty():
while not q.empty():
#code remove as not relevant to the problem
# Allow q to be refreshed so allowing all processes to finish before
# doing a join. It seems that doing a join before a process is
# finished will cause the program to lock
time.sleep(.05)
counter['empty'] += 1
for p in processlist:
p.join()
while not q.empty():
# This is never executed as all the processes have finished and q
# emptied before the join command above.
#code remove as not relevant to the problem
finish_time = time.time()

I have no answer to the reason why IDLE will not run a multiprocessor start instruction correctly but I believe the answer to the doubling of the execution times lies in the type of problem I am dealing with. Perhaps others can comment but it seems to me that the overhead involved with adding and removing items to and from the Queue is quite high so that performance improvements will be best achieved when the amount of data being passed via the Queue is small compared with the amount of processing required to obtain that data.
In my program I am creating and passing 10**7 items of data and I suppose it is the overhead of passing this number of items via the Queue that kills any performance improvement from getting the data via separate Processes. By using a map it seems all 10^7 items of data will need to be stored in the map before any further processing can be done. This might improve performance depending on the overhead of using the map and dealing with that amount of data but for the time being I will stick with my original vanilla, single processed code.

Related

Multiprocessing with Multiple Functions: Need to add a function to the pool from within another function

I am measuring the metrics of an encryption algorithm that I designed. I have declared 2 functions and a brief sample is as follows:
import sys, random, timeit, psutil, os, time
from multiprocessing import Process
from subprocess import check_output
pid=0
def cpuUsage():
global running
while pid == 0:
time.sleep(1)
running=true
p = psutil.Process(pid)
while running:
print(f'PID: {pid}\t|\tCPU Usage: {p.memory_info().rss/(1024*1024)} MB')
time.sleep(1)
def Encryption()
global pid, running
pid = os.getpid()
myList=[]
for i in range(1000):
myList.append(random.randint(-sys.maxsize,sys.maxsize)+random.random())
print('Now running timeit function for speed metrics.')
p1 = Process(target=metric_collector())
p1.start()
p1.join()
number=1000
unit='msec'
setup = '''
import homomorphic,random,sys,time,os,timeit
myList={myList}
'''
enc_code='''
for x in range(len(myList)):
myList[x] = encryptMethod(a, b, myList[x], d)
'''
dec_code='''
\nfor x in range(len(myList)):
myList[x] = decryptMethod(myList[x])
'''
time=timeit.timeit(setup=setup,
stmt=(enc_code+dec_code),
number=number)
running=False
print(f'''Average Time:\t\t\t {time/number*.0001} seconds
Total time for {number} Iters:\t\t\t {time} {unit}s
Total Encrypted/Decrypted Values:\t {number*len(myList)}''')
sys.exit()
if __name__ == '__main__':
print('Beginning Metric Evaluation\n...\n')
p2 = Process(target=Encryption())
p2.start()
p2.join()
I am sure there's an implementation error in my code, I'm just having trouble grabbing the PID for the encryption method and I am trying to make the overhead from other calls as minimal as possible so I can get an accurate reading of just the functionality of the methods being called by timeit. If you know a simpler implementation, please let me know. Trying to figure out how to measure all of the metrics has been killing me softly.
I've tried acquiring the pid a few different ways, but I only want to measure performance when timeit is run. Good chance I'll have to break this out separately and run it that way (instead of multiprocessing) to evaluate the function properly, I'm guessing.
There are at least three major problems with your code. The net result is that you are not actually doing any multiprocessing.
The first problem is here, and in a couple of other similar places:
p2 = Process(target=Encryption())
What this code passes to Process is not the function Encryption but the returned value from Encryption(). It is exactly the same as if you had written:
x = Encryption()
p2 = Process(target=x)
What you want is this:
p2 = Process(target=Encryption)
This code tells Python to create a new Process and execute the function Encryption() in that Process.
The second problem has to do with the way Python handles memory for Processes. Each Process lives in its own memory space. Each Process has its own local copy of global variables, so you cannot set a global variable in one Process and have another Process be aware of this change. There are mechanisms to handle this important situation, documented in the multiprocessing module. See the section titled "Sharing state between processes." The bottom line here is that you cannot simply set a global variable inside a Process and expect other Processes to see the change, as you are trying to do with pid. You have to use one of the approaches described in the documentation.
The third problem is this code pattern, which occurs for both p1 and p2.
p2 = Process(target=Encryption)
p2.start()
p2.join()
This tells Python to create a Process and to start it. Then you immediately wait for it to finish, which means that your current Process must stop at that point until the new Process is finished. You never allow two Processes to run at once, so there is no performance benefit. The only reason to use multiprocessing is to run two things at the same time, which you never do. You might as well not bother with multiprocessing at all since it is only making your life more difficult.
Finally I am not sure why you have decided to try to use multiprocessing in the first place. The functions that measure memory usage and execution time are almost certainly very fast, and I would expect them to be much faster than any method of synchronizing one Process to another. If you're worried about errors due to the time used by the diagnostic functions themselves, I doubt that you can make things better by multiprocessing. Why not just start with a simple program and see what results you get?

Why does implementing multiprocessing makes my program slower

I'm trying to implement multiprocessing in my code to make it faster.
To make it easier to understand I will just say the program fits an observed curve using a linear combination of a library of curves and from that measures properties of the observed curve.
I have to do this for over 400 curves and in order to estimate the errors of these properties I perform a Monte Carlo simulation, which means I have to iterate a number of times each calculation.
This takes a lot of time and work, and granted I believe it is a CPU-bound task I figured I'd use multiprocessing in the error estimation step. Here's a simplification of my code:
Without multiprocessing
import numpy as np
import fitting_package
import multiprocessing
def estimate_errors(best_fit_curve, signal_to_noise, fit_kwargs, iterations=100)
results = defaultdict(list)
def fit(best_fit_curve, signal_to_noise, fit_kwargs, results):
# Here noise is added to simulate a new curve (Monte Carlo simulation)
noise = best_fit/signal_to_noise
simulated_curve = np.random.normal(best_fit_curve, noise)
# The arguments from the original fit (outside the error estimation) are passed to the fitting
fit_kwargs.update({'curve' : simulated_curve})
# The fit is performed and it returns the properties packed together
solutions = fitting_package(**fit_kwargs)
# There are more properties so this is a simplification
property_1, property_2 = solutions
aux_dict = {'property_1' : property_1, 'property_2' : property_2}
for key, value in aux_dict.items():
results[key].append(values)
for _ in range(iterations):
fit(best_fit_curve, signal_to_noise, fit_kwargs, results)
return results
With multiprocessing
def estimate_errors(best_fit_curve, signal_to_noise, fit_kwargs, iterations=100)
def fit(best_fit_curve, signal_to_noise, fit_kwargs, queue):
results = queue.get()
noise = best_fit/signal_to_noise
simulated_curve = np.random.normal(best_fit_curve, noise)
fit_kwargs.update({'curve' : simulated_curve})
solutions = fitting_package(**fit_kwargs)
property_1, property_2 = solutions
aux_dict = {'property_1' : property_1, 'property_2' : property_2}
for key, value in aux_dict.items():
results[key].append(values)
queue.put(results)
process_list = []
queue = multiprocessing.Queue()
queue.put(defaultdict(list))
for _ in range(iterations):
process = multiprocessing.Process(target=fit, args=(best_fit_curve, signal_to_noise, fit_kwargs, queue))
process.start()
process_list.append(process)
for p in process_list:
p.join()
results = queue.get()
return results
I thought using multiprocessing would save time, but it actually takes more than double than the other way to do it. Why is this? Is there anyway I can make it faster with multiprocessing?
I thought using multiprocessing would save time, but it actually takes more than double than the other way to do it. Why is this?
Starting a process takes a long time (at least in computer terms). It also uses a lot of memory.
In your code, you are starting 100 separate Python interpreters in 100 separate OS processes. That takes a really long time, so unless each process runs a very long time, the time it takes to start the process is going to dominate the time it actually does useful work.
In addition to that, unless you actually have 100 un-used CPU cores, those 100 processes will just spend most of their time waiting for each other to finish. Even worse, since they all have the same priority, the OS will try to give each of them a fair amount of time, so it will run them for a bit of time, then suspend them, run others for a bit of time, suspend them, etc. All this scheduling also takes time.
Having more parallel workloads than parallel resources cannot speed up your program, since they have to wait to be executed one-after-another anyway.
Parallelism will only speed up your program if the time for the parallel tasks is not dominated by the time of creating, managing, scheduling, and re-joining the parallel tasks.

When running Numba code on a CUDA GPU, I notice one of my CPU cores stays at 100%. Is this limiting performance?

I have a test code that is computationally intense and I run that on the GPU using Numba. I noticed that while that is running, one of my CPU cores goes to 100% and stays there the whole time. The GPU seems to be at 100% as well. You can see both in the screenshot below.
My benchmark code is as follows:
from numba import *
import numpy as np
from numba import cuda
import time
def benchmark():
input_list = np.random.randint(10, size=3200000).astype(np.intp)
output_list = np.zeros(input_list.shape).astype(np.intp)
d_input_array = cuda.to_device(input_list)
d_output_array = cuda.to_device(output_list)
run_test[32, 512](d_input_array, d_output_array)
out = d_output_array.copy_to_host()
print('Result: ' + str(out))
#cuda.jit("void(intp[::1], intp[::1])", fastmath=True)
def run_test(d_input_array, d_output_array):
array_slice_len = len(d_input_array) / (cuda.blockDim.x * cuda.gridDim.x)
thread_coverage = cuda.threadIdx.x * array_slice_len
slice_start = thread_coverage + (cuda.blockDim.x * cuda.blockIdx.x * array_slice_len)
for step in range(slice_start, slice_start + array_slice_len, 1):
if step > len(d_input_array) - 1:
return
count = 0
for item2 in d_input_array:
if d_input_array[step] == item2:
count = count + 1
d_output_array[step] = count
if __name__ == '__main__':
import timeit
# make_multithread(benchmark, 64)
print(timeit.timeit("benchmark()", setup="from __main__ import benchmark", number=1))
You can run the code above to repro if you got python 3.7, Numba and codatoolkit installed. I'm on Linux Mint 20.
I got 32 cores - doesn't seem right to have one 100% while everyone else seats idle.
I'm wondering why that is, if there is a way to have other cores help with whatever is being done to increase performance?
How can I investigate what is taking 100% of a single core and know what is going on?
CUDA kernel launches (and some other operations) are asynchronous from the point of view of the host thread. And as you say, you're running the computationally intense portion of the work on the GPU.
So the host thread has nothing to do, other than launch some work and wait for it to be finished. The waiting process here is a spin-wait which means the CPU thread is in a tight loop, waiting for a status condition to change.
The CPU thread will hit that spin-wait here:
out = d_output_array.copy_to_host()
which is the line of code after your kernel launch, and it expects to copy (valid) results back from the GPU to the CPU. In order for this to work, the CPU thread must wait there until the results are ready. Numba implements this with a blocking sync operation, between GPU and CPU activity. Therefore, for most of the duration of your program, the CPU thread is actually waiting at that line of code.
This waiting takes up 100% of that thread's activity, and thus one core is seen as fully utilized.
There wouldn't be any sense or reason to try to "distribute" this "work" to multiple threads/cores, so this is not a "performance" issue in the way you are suggesting.
Any CPU profiler that shows hotspots or uses PC sampling should be able to give you a picture of this. That line of code should show up near the top of the list of lines of code most heavily visited by your CPU/core/thread.

Multiprocessing code does not work when trying to initialize dataframe columns

I am trying to use multiprocessing module to initialize each column of a dataframe using a separate CPU core in Python 3.6 but my code doesn't work. Does anybody know the issue with this code? I appreciate your help.
My laptop has Windows 10 and its CPU is Core i7 8th Gen:
import time
import pandas as pd
import numpy as np
import multiprocessing
df=pd.DataFrame(index=range(10),columns=["A","B","C","D"])
def multiprocessing_func(col):
for i in range(0,df.shape[0]):
df.iloc[i,col]=np.random(4)
print("column "+str(col)+ " is completed" )
if __name__ == '__main__':
starttime = time.time()
processes = []
for i in range(0,df.shape[1]):
p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
processes.append(p)
p.start()
for process in processes:
process.join()
print('That took {} seconds'.format(time.time() - starttime))
When you start a Process, it is basically a copy of the parent process. (I'm skipping over some details here, but they shouldn't matter for the explanation).
Unlike threads, processes don't share data. (Processes can use shared memory, but this is not automatic. To the best of my knowledge, the mechanisms in multiprocessing for sharing data cannot handle a dataframe.)
So what happens is that each of the worker processes is modifying its own copy of the dataframe, not the dataframe in the parent process.
For this to work, you'd have to send the new data back to the parent process. You could do that by e.g. return-ing it from the worker function, and then putting the returned data into the original dataframe.
It only makes sense to use multiprocessing like this if the work of generating the data takes significantly longer then launching a new worker process, sending the data back to the parent process and putting it into the dataframe. Since you are basically filling the columns with random data, I don't think that is the case here.
So I don't see why you would use multiprocessing here.
Edit: Based on your comment that it takes days to calculate each column, I would propose the following.
Use Proces like you have been doing, but have each of the worker processes save the numbers they produce in a file where the filename includes the value of i. Have the workers return a status code so you can determine that thay have succeeded or failed. In case of failure, also return some kind of index of the amount of data successfully completed, so you don't have to re-calculate that again.
The file format should be simple and preferable readable. E.g. one number per line.
Wait for all processes to finish, read the files and fill the dataframe.

Python 3: create new process when another one finishes

I have an array of data to handle and handler that executing long (1-2 minutes) and takes a lot of memory for its calculations.
raw = ['a', 'b', 'c']
def handler():
# do something long
Since handler requires a lot of memory, I want to execute it in separate subprocess and kill it after execution to release memory. Something like the following snippet:
from multiprocessing import Process
for r in raw:
process = Process(target=handler, args=(r))
process.start()
The problem is that such approach leads to immediate running len(raw) processes. And it's not good.
Also, it's not needed to interchange any kind of data between subprocesses. Just run them consequently.
Therefore it would be great to run a few processes at the same time and add a new one once existing finishes.
How could it be implemented (if it's even possible)?
to run your processes sequentially, just join each process within the loop:
from multiprocessing import Process
for r in raw:
process = Process(target=handler, args=(r))
process.start()
process.join()
that way you're sure that only one process is running at the same time (no concurrency)
That's the simplest way. To run more than one process but limit the number of processes running at the same time, you can use a multiprocessing.Pool object and apply_async
I've built a simple example which computes the square of the argument, and simulates an heavy processing:
from multiprocessing import Pool
import time
def target(r):
time.sleep(5)
return(r*r)
raw = [1,2,3,4,5]
if __name__ == '__main__':
with Pool(3) as p: # 3 processes at a time
reslist = [p.apply_async(target, (r,)) for r in raw]
for result in reslist:
print(result.get())
Running this I get:
<5 seconds wait, time to compute the results>
1
4
9
<5 seconds wait, 3 processes max can run at the same time>
16
25

Resources