I'm trying to implement multiprocessing in my code to make it faster.
To make it easier to understand, I will just say that the program fits an observed curve using a linear combination of a library of curves, and from that fit it measures properties of the observed curve.
I have to do this for over 400 curves, and in order to estimate the errors of these properties I perform a Monte Carlo simulation, which means I have to repeat each calculation a number of times.
This takes a lot of time and work, and since I believe it is a CPU-bound task, I figured I'd use multiprocessing in the error-estimation step. Here's a simplification of my code:
Without multiprocessing
import numpy as np
from collections import defaultdict
import fitting_package
import multiprocessing

def estimate_errors(best_fit_curve, signal_to_noise, fit_kwargs, iterations=100):
    results = defaultdict(list)

    def fit(best_fit_curve, signal_to_noise, fit_kwargs, results):
        # Here noise is added to simulate a new curve (Monte Carlo simulation)
        noise = best_fit_curve / signal_to_noise
        simulated_curve = np.random.normal(best_fit_curve, noise)
        # The arguments from the original fit (outside the error estimation) are passed to the fitting
        fit_kwargs.update({'curve': simulated_curve})
        # The fit is performed and it returns the properties packed together
        solutions = fitting_package(**fit_kwargs)
        # There are more properties, so this is a simplification
        property_1, property_2 = solutions
        aux_dict = {'property_1': property_1, 'property_2': property_2}
        for key, value in aux_dict.items():
            results[key].append(value)

    for _ in range(iterations):
        fit(best_fit_curve, signal_to_noise, fit_kwargs, results)
    return results
With multiprocessing
def estimate_errors(best_fit_curve, signal_to_noise, fit_kwargs, iterations=100):

    def fit(best_fit_curve, signal_to_noise, fit_kwargs, queue):
        results = queue.get()
        noise = best_fit_curve / signal_to_noise
        simulated_curve = np.random.normal(best_fit_curve, noise)
        fit_kwargs.update({'curve': simulated_curve})
        solutions = fitting_package(**fit_kwargs)
        property_1, property_2 = solutions
        aux_dict = {'property_1': property_1, 'property_2': property_2}
        for key, value in aux_dict.items():
            results[key].append(value)
        queue.put(results)

    process_list = []
    queue = multiprocessing.Queue()
    queue.put(defaultdict(list))
    for _ in range(iterations):
        process = multiprocessing.Process(target=fit, args=(best_fit_curve, signal_to_noise, fit_kwargs, queue))
        process.start()
        process_list.append(process)
    for p in process_list:
        p.join()
    results = queue.get()
    return results
I thought using multiprocessing would save time, but it actually takes more than twice as long as the plain version. Why is this? Is there any way I can make it faster with multiprocessing?
I thought using multiprocessing would save time, but it actually takes more than twice as long as the plain version. Why is this?
Starting a process takes a long time (at least in computer terms). It also uses a lot of memory.
In your code, you are starting 100 separate Python interpreters in 100 separate OS processes. That takes a really long time, so unless each process runs a very long time, the time it takes to start the process is going to dominate the time it actually does useful work.
In addition to that, unless you actually have 100 un-used CPU cores, those 100 processes will just spend most of their time waiting for each other to finish. Even worse, since they all have the same priority, the OS will try to give each of them a fair amount of time, so it will run them for a bit of time, then suspend them, run others for a bit of time, suspend them, etc. All this scheduling also takes time.
Having more parallel workloads than parallel resources cannot speed up your program, since they have to wait to be executed one-after-another anyway.
Parallelism will only speed up your program if the time for the parallel tasks is not dominated by the time of creating, managing, scheduling, and re-joining the parallel tasks.
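In practice this usually means using a fixed-size worker pool (one worker per core) instead of one process per iteration. The sketch below is a minimal illustration under assumptions: `fit_once` and its two "properties" are hypothetical stand-ins for the question's `fitting_package`, which isn't available here; the point is only that the pool is created once and reused for all iterations:

```python
import multiprocessing

import numpy as np

def fit_once(args):
    # Stand-in for one Monte Carlo iteration. The question's fitting_package
    # is replaced by a dummy computation: the "properties" here are just
    # simple statistics of the simulated curve.
    best_fit_curve, signal_to_noise = args
    noise = np.abs(best_fit_curve / signal_to_noise)
    simulated_curve = np.random.normal(best_fit_curve, noise)
    return simulated_curve.mean(), simulated_curve.std()

def estimate_errors(best_fit_curve, signal_to_noise, iterations=100):
    # One pool, sized to the CPU count by default, reused for every iteration.
    with multiprocessing.Pool() as pool:
        solutions = pool.map(fit_once,
                             [(best_fit_curve, signal_to_noise)] * iterations)
    return {'property_1': [s[0] for s in solutions],
            'property_2': [s[1] for s in solutions]}

if __name__ == '__main__':
    curve = np.linspace(1.0, 2.0, 50)
    res = estimate_errors(curve, signal_to_noise=10.0, iterations=20)
    print(len(res['property_1']))  # 20
```

The process-startup cost is now paid only once per core rather than once per iteration, and the OS is never asked to schedule more workers than it has cores.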
Related
I am measuring the metrics of an encryption algorithm that I designed. I have declared 2 functions and a brief sample is as follows:
import sys, random, timeit, psutil, os, time
from multiprocessing import Process
from subprocess import check_output

pid = 0

def cpuUsage():
    global running
    while pid == 0:
        time.sleep(1)
    running = True
    p = psutil.Process(pid)
    while running:
        print(f'PID: {pid}\t|\tCPU Usage: {p.memory_info().rss/(1024*1024)} MB')
        time.sleep(1)

def Encryption():
    global pid, running
    pid = os.getpid()
    myList = []
    for i in range(1000):
        myList.append(random.randint(-sys.maxsize, sys.maxsize) + random.random())
    print('Now running timeit function for speed metrics.')
    p1 = Process(target=metric_collector())
    p1.start()
    p1.join()
    number = 1000
    unit = 'msec'
    setup = '''
import homomorphic,random,sys,time,os,timeit
myList={myList}
'''
    enc_code = '''
for x in range(len(myList)):
    myList[x] = encryptMethod(a, b, myList[x], d)
'''
    dec_code = '''
\nfor x in range(len(myList)):
    myList[x] = decryptMethod(myList[x])
'''
    time = timeit.timeit(setup=setup,
                         stmt=(enc_code + dec_code),
                         number=number)
    running = False
    print(f'''Average Time:\t\t\t {time/number*.0001} seconds
Total time for {number} Iters:\t\t\t {time} {unit}s
Total Encrypted/Decrypted Values:\t {number*len(myList)}''')
    sys.exit()

if __name__ == '__main__':
    print('Beginning Metric Evaluation\n...\n')
    p2 = Process(target=Encryption())
    p2.start()
    p2.join()
if __name__ == '__main__':
print('Beginning Metric Evaluation\n...\n')
p2 = Process(target=Encryption())
p2.start()
p2.join()
I am sure there's an implementation error in my code. I'm having trouble grabbing the PID for the encryption method, and I am trying to keep the overhead from other calls as minimal as possible so I can get an accurate reading of just the functionality of the methods being called by timeit. If you know a simpler implementation, please let me know. Trying to figure out how to measure all of the metrics has been killing me softly.
I've tried acquiring the PID a few different ways, but I only want to measure performance while timeit is running. There's a good chance I'll have to break this out separately and run it that way (instead of with multiprocessing) to evaluate the function properly, I'm guessing.
There are at least three major problems with your code. The net result is that you are not actually doing any multiprocessing.
The first problem is here, and in a couple of other similar places:
p2 = Process(target=Encryption())
What this code passes to Process is not the function Encryption but the returned value from Encryption(). It is exactly the same as if you had written:
x = Encryption()
p2 = Process(target=x)
What you want is this:
p2 = Process(target=Encryption)
This code tells Python to create a new Process and execute the function Encryption() in that Process.
The second problem has to do with the way Python handles memory for Processes. Each Process lives in its own memory space. Each Process has its own local copy of global variables, so you cannot set a global variable in one Process and have another Process be aware of this change. There are mechanisms to handle this important situation, documented in the multiprocessing module. See the section titled "Sharing state between processes." The bottom line here is that you cannot simply set a global variable inside a Process and expect other Processes to see the change, as you are trying to do with pid. You have to use one of the approaches described in the documentation.
The third problem is this code pattern, which occurs for both p1 and p2.
p2 = Process(target=Encryption)
p2.start()
p2.join()
This tells Python to create a Process and to start it. Then you immediately wait for it to finish, which means that your current Process must stop at that point until the new Process is finished. You never allow two Processes to run at once, so there is no performance benefit. The only reason to use multiprocessing is to run two things at the same time, which you never do. You might as well not bother with multiprocessing at all since it is only making your life more difficult.
Finally I am not sure why you have decided to try to use multiprocessing in the first place. The functions that measure memory usage and execution time are almost certainly very fast, and I would expect them to be much faster than any method of synchronizing one Process to another. If you're worried about errors due to the time used by the diagnostic functions themselves, I doubt that you can make things better by multiprocessing. Why not just start with a simple program and see what results you get?
I have a test code that is computationally intense and I run that on the GPU using Numba. I noticed that while that is running, one of my CPU cores goes to 100% and stays there the whole time. The GPU seems to be at 100% as well. You can see both in the screenshot below.
My benchmark code is as follows:
from numba import *
import numpy as np
from numba import cuda
import time

def benchmark():
    input_list = np.random.randint(10, size=3200000).astype(np.intp)
    output_list = np.zeros(input_list.shape).astype(np.intp)
    d_input_array = cuda.to_device(input_list)
    d_output_array = cuda.to_device(output_list)
    run_test[32, 512](d_input_array, d_output_array)
    out = d_output_array.copy_to_host()
    print('Result: ' + str(out))

@cuda.jit("void(intp[::1], intp[::1])", fastmath=True)
def run_test(d_input_array, d_output_array):
    array_slice_len = len(d_input_array) // (cuda.blockDim.x * cuda.gridDim.x)
    thread_coverage = cuda.threadIdx.x * array_slice_len
    slice_start = thread_coverage + (cuda.blockDim.x * cuda.blockIdx.x * array_slice_len)
    for step in range(slice_start, slice_start + array_slice_len, 1):
        if step > len(d_input_array) - 1:
            return
        count = 0
        for item2 in d_input_array:
            if d_input_array[step] == item2:
                count = count + 1
        d_output_array[step] = count

if __name__ == '__main__':
    import timeit
    # make_multithread(benchmark, 64)
    print(timeit.timeit("benchmark()", setup="from __main__ import benchmark", number=1))
You can run the code above to repro if you have Python 3.7, Numba, and cudatoolkit installed. I'm on Linux Mint 20.
I've got 32 cores; it doesn't seem right to have one at 100% while all the others sit idle.
I'm wondering why that is, and whether there is a way to have the other cores help with whatever is being done, to increase performance.
How can I investigate what is taking 100% of a single core and know what is going on?
CUDA kernel launches (and some other operations) are asynchronous from the point of view of the host thread. And as you say, you're running the computationally intense portion of the work on the GPU.
So the host thread has nothing to do, other than launch some work and wait for it to be finished. The waiting process here is a spin-wait which means the CPU thread is in a tight loop, waiting for a status condition to change.
The CPU thread will hit that spin-wait here:
out = d_output_array.copy_to_host()
which is the line of code after your kernel launch, and it expects to copy (valid) results back from the GPU to the CPU. In order for this to work, the CPU thread must wait there until the results are ready. Numba implements this with a blocking sync operation, between GPU and CPU activity. Therefore, for most of the duration of your program, the CPU thread is actually waiting at that line of code.
This waiting takes up 100% of that thread's activity, and thus one core is seen as fully utilized.
There wouldn't be any sense or reason to try to "distribute" this "work" to multiple threads/cores, so this is not a "performance" issue in the way you are suggesting.
Any CPU profiler that shows hotspots or uses PC sampling should be able to give you a picture of this. That line of code should show up near the top of the list of lines of code most heavily visited by your CPU/core/thread.
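A quick way to see what such a profile looks like, without needing a GPU: the self-contained sketch below profiles a tight polling loop that stands in for the driver's spin-wait, and the loop function dominates the report exactly as the sync call would in the real program:

```python
import cProfile
import io
import pstats
import time

def spin(deadline):
    # Tight polling loop, standing in for the driver's spin-wait:
    # it burns one core doing nothing until the deadline passes.
    while time.perf_counter() < deadline:
        pass

pr = cProfile.Profile()
pr.enable()
spin(time.perf_counter() + 0.2)  # peg one core for ~200 ms
pr.disable()

s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('tottime').print_stats(3)
report = s.getvalue()
print('spin' in report)  # the busy loop tops the tottime column
```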
Recently, I started learning about threading and I wanted to implement it in the following code.
import timeit

start = timeit.default_timer()

def func(num):
    s = [(i, j, k) for i in range(num) for j in range(num) for k in range(num)]
    return s

z = 150
a, b = func(z), func(z)
print(a[:5], b[:5])
stop = timeit.default_timer()
print("time: ", stop - start)
the time it took was:
time: 3.7628489000000003
So I tried to use the Threading module and modified the code as:
import timeit
from threading import Thread

start = timeit.default_timer()

def func(num):
    s = [(i, j, k) for i in range(num) for j in range(num) for k in range(num)]
    print(s[:5])

a = Thread(target=func, args=(150,))
b = Thread(target=func, args=(150,))
a.start()
b.start()
a.join()
b.join()
stop = timeit.default_timer()
print("time: ", stop - start)
the time it took was:
time: 4.2522736
But the time was supposed to be halved; instead it increased. Is there anything wrong with my implementation?
Please explain what went wrong, or whether there is a better way to achieve this.
You have encountered what is known as the Global Interpreter Lock, or GIL for short.
Threads in Python are not "real" threads, in the sense that they do not execute simultaneously; their atomic operations are interleaved in some sequence (an order that is often hard to predict).
This means that threads from the threading library are useful when you need to wait on many blocking things simultaneously. A typical case is listening on a network connection, where one thread sits in a blocking receive() call until something arrives.
The other threads can keep doing other things and don't have to constantly check the connection.
Real performance gains for CPU-bound work, however, cannot be achieved with threading.
There is another library, called multiprocessing, which achieves real parallelism by running separate processes that actually execute simultaneously. Using multiprocessing is in many ways similar to the threading library, but it requires a little more work and care. I've come to realise that this divide between threading and multiprocessing is a good and useful thing. Threads in threading all have access to the same complete namespace, and as long as race conditions are taken care of, they operate in the same universe.
Processes in multiprocessing, on the other hand, are separated by the chasm of different namespaces once the child process is started. One has to use specialized communication queues and shared-namespace objects to transmit information between them, which can require a fair amount of boilerplate code.
I have a program that randomly selects 13 cards from a full pack and analyses the hands for shape, point count, and some other features important to the game of bridge. The program will select and analyse 10**7 hands in about 5 minutes. Checking the Activity Monitor shows that during execution the CPU (which is a 6-core processor) is devoting about 9% of its time to the program and ~90% of its time is idle. So it looks like a prime candidate for multiprocessing, and I created a multiprocessing version using a Queue to pass information from each process back to the main program. Having navigated the problems of IDLE not working with multiprocessing (I now run it using PyCharm) and of a join on an unfinished process freezing the program, I got it to work.
However, it doesn't matter how many processes I use (5, 10, 25, or 50), the result is always the same. The CPU devotes about 18% of its time to the program, ~75% of its time is idle, and the execution time slightly more than doubles, to a bit over 10 minutes.
Can anyone explain how I can get the processes to take up more of the CPU time, and how I can get the execution time to reflect this? Below are the relevant sections of the program:
import random
import collections
import datetime
import time
from math import log10
from multiprocessing import Process, Queue

NUM_OF_HANDS = 10**6
NUM_OF_PROCESSES = 25

def analyse_hands(numofhands, q):
    # code removed as not relevant to the problem
    q.put((distribution, points, notrumps))

if __name__ == '__main__':
    processlist = []
    q = Queue()
    handsperprocess = NUM_OF_HANDS // NUM_OF_PROCESSES
    print(handsperprocess)
    # Set up the processes and get them to do their stuff
    start_time = time.time()
    for _ in range(NUM_OF_PROCESSES):
        p = Process(target=analyse_hands, args=(handsperprocess, q))
        processlist.append(p)
        p.start()
    # Allow q to get a few items
    time.sleep(.05)
    while not q.empty():
        while not q.empty():
            # code removed as not relevant to the problem
        # Allow q to be refreshed so that all processes finish before
        # doing a join. It seems that doing a join before a process is
        # finished will cause the program to lock
        time.sleep(.05)
        counter['empty'] += 1
    for p in processlist:
        p.join()
    while not q.empty():
        # This is never executed, as all the processes have finished and q
        # was emptied before the join command above.
        # code removed as not relevant to the problem
    finish_time = time.time()
I have no answer as to why IDLE will not run a multiprocessing start instruction correctly, but I believe the answer to the doubling of the execution time lies in the type of problem I am dealing with. Perhaps others can comment, but it seems to me that the overhead involved in adding and removing items to and from the Queue is quite high, so that performance improvements are best achieved when the amount of data passed via the Queue is small compared with the amount of processing required to obtain that data.
In my program I am creating and passing 10**7 items of data, and I suppose it is the overhead of passing this number of items via the Queue that kills any performance improvement from getting the data via separate processes. Using a map seems to require all 10**7 items of data to be stored in the map before any further processing can be done. This might improve performance, depending on the overhead of using the map and of dealing with that amount of data, but for the time being I will stick with my original vanilla, single-process code.
I'm trying to understand Python's multiprocessing, and have devised the following code to test it:
import multiprocessing

def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1) + F(n-2)

def G(n):
    print(f'Fibonacci of {n}: {F(n)}')

processes = []
for i in range(25, 35):
    processes.append(multiprocessing.Process(target=G, args=(i,)))
for pro in processes:
    pro.start()
When I run it, it tells me that the computing time was roughly 6.65 s.
I then wrote the following code, which I thought to be functionally equivalent to the first version:
from multiprocessing.dummy import Pool as ThreadPool

def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1) + F(n-2)

def G(n):
    print(f'Fibonacci of {n}: {F(n)}')

in_data = [i for i in range(25, 35)]
pool = ThreadPool(10)
results = pool.map(G, in_data)
pool.close()
pool.join()
and its running time was almost 12s.
Why is it that the second takes almost twice as the first one? Aren't they supposed to be equivalent?
(NB: I'm running Python 3.6, but also tested similar code on 3.5.2 with the same results.)
The reason the second takes twice as long as the first is likely due to the CPython Global Interpreter Lock.
From http://python-notes.curiousefficiency.org/en/latest/python3/multicore_python.html:
[...] the GIL effectively restricts bytecode execution to a single core, thus rendering pure Python threads an ineffective tool for distributing CPU bound work across multiple cores.
As you know, multiprocessing.dummy is a wrapper around the threading module, so you're creating threads, not processes. With a CPU-bound task such as this, the Global Interpreter Lock makes it not much different from simply executing your Fibonacci calculations sequentially in a single thread (except that you've added some thread-management/context-switching overhead).
With the "true multiprocessing" version, you only have a single thread in each process, each of which is using its own GIL. Hence, you can actually make use of multiple processors to improve the speed.
For this particular processing task, there is no significant advantage to using multiple threads over multiple processes. If you only have a single processor, there is no advantage to using either multiple processes or multiple threads over a single thread/process (in fact, both merely add context-switching overhead to your task).
(FWIW: A join in the true multiprocessing version is apparently done automatically by the Python runtime, so adding an explicit join doesn't seem to make any difference in my tests using time(1). And by the way, if you did want to add join, you should add a second loop for the join processing; adding join to the existing loop would simply serialize your processes.)