I use a multiprocessing.Pool().imap_unordered(...) to perform some tasks in parallel and measure the time it takes by calculating the difference of time.time() before and after starting the pool tasks.
However, it returns wrong results! When I watch my wall clock while the program runs, it tells me a run time of around 5 seconds. But the program itself outputs a run time of only 0.1 seconds.
I also have a variant of this code without any multiprocessing which takes double the time, but outputs the correct run times.
Here is my code:
import time
from multiprocessing import Pool, cpu_count

# create_sudoku is defined elsewhere in the program.

if __name__ == "__main__":
    n = int(input("How many grids to create? "))
    use_multiprocessing = None
    while use_multiprocessing is None:
        answer = input("Use multiprocessing to speed things up? (Y/n) ").strip().lower()
        if len(answer) == 1 and answer in "yn":
            use_multiprocessing = True if answer == "y" else False
    t0 = time.time()
    if use_multiprocessing:
        processes = cpu_count()
        worker_pool = Pool(processes)
        print("Creating {} sudokus using {} processes. Please wait...".format(n, processes))
        sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
    else:
        progress_bar, progress_bar_length = 0, 10
        sudokus = []
        print("Creating {} sudokus".format(n), end="", flush=True)
        for i in range(n):
            p = int((i / n) * progress_bar_length)
            if p > progress_bar:
                print("." * (p-progress_bar), end="", flush=True)
                progress_bar = p
            new_sudoku = create_sudoku()
            sudokus.append(new_sudoku)
    t = time.time() - t0
    l = len(list(sudokus))
    print("\nSuccessfully created {} grids in {:.6f}s (average {:.3f}ms per grid)!".format(
        l, t, 1000*t/l
    ))
And here an example run, which took around 5-6 seconds in reality (after entering the number of grids to create and whether to use multiprocessing, of course):
How many grids to create? 100000
Use multiprocessing to speed things up? (Y/n) y
Creating 100000 sudokus using 4 processes. Please wait...
Successfully created 100000 grids in 0.122141s (average 0.001ms per grid)!
Process finished with exit code 0
Are multiprocessing and time.time() incompatible? I've heard that time.clock() can cause problems under these circumstances, but I thought time.time() should be safe. Or is there some other problem?

I figured it out.
Pool.imap_unordered(...) returns a lazy iterator, not a list. The call returns right away without waiting for the workers, and the results only become available as I iterate over them.
I did this in the line l = len(list(sudokus)), where I converted the iterator into a list to get the length. The finish time was measured one line before that, so it only reported the time it took to set up the iterator. That was not what I wanted; swapping those two lines gives correct times.
I know I should not convert an iterator into a list just to find out its length and then discard the list again. I must either rely on the saved requested length if I want a lazy iterator, or use a blocking method instead, one that produces a list and only returns once it is ready.
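The effect described above can be reproduced with a small sketch. It uses multiprocessing.pool.ThreadPool, which offers the same imap_unordered interface as Pool, purely to keep the example self-contained; slow_square and time_imap are made-up names, and slow_square is only a stand-in for create_sudoku.

```python
import time
from multiprocessing.pool import ThreadPool  # same interface as multiprocessing.Pool


def slow_square(x):
    """Hypothetical stand-in for create_sudoku: sleeps briefly, then returns x*x."""
    time.sleep(0.01)
    return x * x


def time_imap(n, workers=4):
    """Return (time before consuming, time after consuming, results)."""
    with ThreadPool(workers) as pool:
        t0 = time.perf_counter()
        lazy_results = pool.imap_unordered(slow_square, range(n))
        early = time.perf_counter() - t0  # only covers submitting the tasks
        results = list(lazy_results)      # blocks until every task has finished
        late = time.perf_counter() - t0   # covers the real run time
    return early, late, results


if __name__ == "__main__":
    early, late, results = time_imap(40)
    print(f"before consuming: {early:.4f}s, after consuming: {late:.4f}s, "
          f"{len(results)} results")
```

The first measurement corresponds to the misleading 0.1 seconds in the question; the second one is taken after the iterator has been exhausted and matches the wall clock.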


Numpy.linalg.norm performance apparently doesn't scale with the number of dimensions

I have the following snippet of code, which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to the closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
    distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
    membership[i] = np.argmin(distances)
The running time here should be O(NKD) where D is the dimension of the data points, so naturally I expect when D increases or decreases, the running time would change proportionally as well. To my surprise, I see very little time being changed when changing D, for example when testing on my local machine:
D = 1
python3 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per @Warren's suggestion, I modified the code to use np.linalg.norm with the axis parameter directly; the performance is as follows:
D = 1
python3 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.
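The modified code is not shown in the question, so the following is only a guess at what the edit looks like; the function name assign_with_axis is made up. Passing axis=1 lets a single np.linalg.norm call compute the norms of all K rows at once, which removes the per-row Python call made by np.apply_along_axis.

```python
import numpy as np


def assign_with_axis(points, centroids):
    """Assign each point to its nearest centroid, one norm call per point."""
    n = points.shape[0]
    membership = np.zeros(shape=n, dtype=int)
    for i in range(n):
        # axis=1 computes the Euclidean norm of each of the K rows in one call
        distances = np.linalg.norm(centroids - points[i], ord=2, axis=1)
        membership[i] = np.argmin(distances)
    return membership


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(assign_with_axis(rng.random((5, 3)), rng.random((4, 3))))
```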
This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20_000 times, and each call internally runs a loop that calls the target Python function 250 times (i.e. it is not vectorized), and therefore np.linalg.norm as well. In the end, np.linalg.norm is called 20_000 * 250 = 5_000_000 times. Each call to a Numpy function typically takes about 1 µs. On my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (types and values), allocations, function calls, conversions, etc.
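The arithmetic above can be turned into a back-of-the-envelope estimate. This sketch only multiplies the number of calls by the per-call cost quoted above; the helper name is made up, and the figure ignores the actual work done on the D coordinates.

```python
def estimated_overhead_seconds(n_points, n_centroids, per_call_us):
    """Rough call-overhead estimate: number of norm calls times cost per call."""
    calls = n_points * n_centroids
    return calls * per_call_us / 1_000_000


if __name__ == "__main__":
    for per_call_us in (4, 5):
        seconds = estimated_overhead_seconds(20_000, 250, per_call_us)
        print(f"{per_call_us} µs per call -> about {seconds:.0f} s of overhead")
```

With 4-5 µs per call this gives roughly 20-25 s regardless of D, which is why small values of D barely change the total.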
There are two simple ways to reduce this overhead: vectorization and using a JIT compiler like Numba. The latter is often more efficient as it avoids creating expensive big temporary arrays.
Here is a much faster implementation:
import numpy as np
import numba as nb
@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
        membership[i] = np.argmin(distances)
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
compute(points, centroids, membership)
In fact, while this code is much faster, it still has a similar issue: the cost of allocating the temporary arrays centroids[j] - points[i] is significant compared to the actual time required to compute the norm. Each allocation takes only a few hundred nanoseconds, but the number of loop iterations is huge. One solution is simply to compute the norm manually:
from math import sqrt
@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            s = 0.0
            for k in range(D):
                tmp = centroids[j,k] - points[i,k]
                s += tmp * tmp
            distances[j] = sqrt(s)
        membership[i] = np.argmin(distances)
Here are results on my i5-9600KF processor:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D, since the Numpy overhead is the main bottleneck in this case and the implementation removes it almost completely (thanks to JIT compilation).
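The vectorization route mentioned above is not shown in code. A minimal sketch of one way to do it, assuming the n x K matrix of squared distances fits in memory, uses the identity ||p - c||^2 = ||p||^2 + ||c||^2 - 2 p·c so that the whole assignment becomes one matrix product. The function name assign_vectorized is made up, and rounding can change the result when two centroids are almost equally close.

```python
import numpy as np


def assign_vectorized(points, centroids):
    """Nearest-centroid index for every point, without Python-level loops."""
    p_sq = (points ** 2).sum(axis=1)[:, None]         # shape (n, 1)
    c_sq = (centroids ** 2).sum(axis=1)[None, :]      # shape (1, K)
    sq_dist = p_sq + c_sq - 2.0 * (points @ centroids.T)  # shape (n, K)
    # sqrt is monotonic, so the argmin of the squared distances is enough
    return np.argmin(sq_dist, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(assign_vectorized(rng.random((5, 3)), rng.random((4, 3))))
```

This trades the per-call overhead for one temporary array of n x K floats, which is the memory cost that the Numba versions avoid.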
It is probably O(NKD).
But the thing is you are iterating 3 loops here. One explicitly. One semi-explicitly. And the last one implicitly, inside numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which applies on the K rows of centroids-points[i] (btw, there is another one here, with some broadcasting. But we don't need to count all of them for big-O consideration)
And the inner one is the one on the D columns that occur inside norm.
The inner one is obviously the most important to optimize, and that's good, because it is the only one that is vectorized here.
But that means that for small enough values of D, what we mostly see is constant overhead (times N×K, since it is inside a double for loop). Your inefficient outer for loops drive most of the cost, which then looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not quite as bad, but almost. It still calls Python code once per row. It is not vectorization.
But, well, I bet that with D big enough, you'll see that it is O(NKD)
Here is what I get when I increase D (with smaller n, so that it remains computable in realistic time)
You can see that it looks really linear (affine, to be accurate, since it doesn't pass through 0). That offset is the reason why it doesn't look very linear to you, and it is explained by my previous comment: when D is small, most of the cost inside the for/apply_along_axis double loop is the constant overhead of those loops. The "proportional to D" part begins to show when the overhead becomes negligible.

"Time Limit Exceeded" error for python file

I have a question about how to improve my simple Python program so that it does not exceed the time limit. My code should run in less than 2 seconds, but it takes much longer. I would be glad for any advice. The code receives an integer n from the user and then processes n lines of tasks. If the input is "Add", I have to add the given number and keep the numbers arranged from smallest to largest. If the input is "Ask", I have to return the number at the asked index among the added numbers.
This is an example of inputs and outputs.
I guess the code works well for other examples, but the only problem is time ...
n = int(input())
def arrange(x):
    for j in range(len(x)):
        for i in range(len(x) - 1):
            if x[i] > x[i + 1]:
                x[i], x[i + 1] = x[i + 1], x[i]
tasks = []
for i in range(n):
    tasks.append(list(input().split()))
ref = []
for i in range(n):
    if tasks[i][0] == 'Add':
        ref.append(int(tasks[i][1]))
        arrange(ref)
    elif tasks[i][0] == 'Ask':
        print(ref[int(tasks[i][1]) - 1])
For the given example, I get a "Time Limit Exceeded" Error.
First-off: Reimplementing list.sort will always be slower than just using it directly. If nothing else, getting rid of the arrange function and replacing the call to it with ref.sort() would improve performance (especially because Python's sorting algorithm is roughly O(n) when the input is largely sorted already, so you'll be reducing the work from the O(n**2) of your bubble-sorting arrange to roughly O(n), not just the O(n log n) of an optimized general purpose sort).
If that's not enough, note that list.sort is still theoretically O(n log n); if the list is getting large enough, that may cost more than it should. If so, take a look at the bisect module, to let you do the insertions with O(log n) lookup time (plus O(n) insertion time, but with very low constant factors) which might improve performance further.
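A minimal sketch of the bisect idea, assuming the tasks have already been read into (action, value) pairs as in the question; the function name process is made up. bisect.insort finds the insertion point by binary search and inserts there, so the list stays sorted without ever calling sort.

```python
from bisect import insort


def process(tasks):
    """Run (action, value) tasks; return the answers to the 'Ask' tasks."""
    ref = []      # kept sorted at all times
    answers = []
    for action, value in tasks:
        if action == "Add":
            insort(ref, int(value))  # O(log n) search plus O(n) shift
        elif action == "Ask":
            answers.append(ref[int(value) - 1])  # indices are 1-based
    return answers


if __name__ == "__main__":
    sample = [("Add", "10"), ("Add", "2"), ("Ask", "1"), ("Ask", "2"),
              ("Add", "5"), ("Ask", "2"), ("Ask", "3")]
    print(process(sample))
```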
Alternatively, if Ask operations are going to be infrequent, you might not sort at all when Adding, and only sort on demand when Ask occurs (possibly using a flag to indicate if it's already sorted so you don't call sort unnecessarily). That could make a meaningful difference, especially if the inputs typically don't interleave Adds and Asks.
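The sort-on-demand variant could look like the sketch below; the names process_lazy and is_sorted are made up. The flag is cleared by every Add and set again by the first Ask that follows, so a run of consecutive Asks sorts only once.

```python
def process_lazy(tasks):
    """Run (action, value) tasks, sorting only when an 'Ask' needs it."""
    ref = []
    is_sorted = True  # an empty list counts as sorted
    answers = []
    for action, value in tasks:
        if action == "Add":
            ref.append(int(value))
            is_sorted = False
        elif action == "Ask":
            if not is_sorted:
                ref.sort()
                is_sorted = True
            answers.append(ref[int(value) - 1])  # indices are 1-based
    return answers


if __name__ == "__main__":
    sample = [("Add", "10"), ("Add", "2"), ("Ask", "1"), ("Ask", "2"),
              ("Add", "5"), ("Ask", "2"), ("Ask", "3")]
    print(process_lazy(sample))
```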
Lastly, in the realm of microoptimizations, you're needlessly wasting time on list copying and indexing you don't need to do, so stop doing it:
tasks = []
for i in range(n):
    tasks.append(input().split())  # Removed list() call; str.split already returns a list
ref = []
for action, value in tasks:  # Don't iterate by index, iterate the raw list and unpack to useful
                             # names; it's meaningfully faster
    if action == 'Add':
        ref.append(int(value))
        ref.sort()  # list.sort instead of arrange, as suggested above
    elif action == 'Ask':
        print(ref[int(value) - 1])
For me it runs in less than 0.005 seconds. Are you sure that you are measuring the right thing and are not counting, for example, the time spent typing the input?
Add 10
Add 2
Ask 1
Ask 2
Add 5
Ask 2
Ask 3
Elapsed time: 0.0033 seconds
My code:
import time
n = int(input('Input:\n'))
def arrange(x):
    for j in range(len(x)):
        for i in range(len(x) - 1):
            if x[i] > x[i + 1]:
                x[i], x[i + 1] = x[i + 1], x[i]
tasks = []
for i in range(n):
    tasks.append(list(input().split()))
tic = time.perf_counter()  # start timing only after all input has been read
ref = []
for i in range(n):
    if tasks[i][0] == 'Add':
        ref.append(int(tasks[i][1]))
        arrange(ref)
    elif tasks[i][0] == 'Ask':
        print(ref[int(tasks[i][1]) - 1])
toc = time.perf_counter()
print(f"Elapsed time: {toc - tic:0.4f} seconds")

Two level multiprocessing in Python

Let's consider a function whose run time depends on one of its parameters, size: depending on its value, a call can take from hours to a few days. In a script I launch multiple instances of this function in parallel to use all the cores and save time.
Let's say that the total run time is limited by the longest-running function instance, the one with the biggest size, and that I can also parallelize some parts of the function. This would probably not be beneficial while all the function instances are running, but it could be beneficial when only one remains (or if I have tons of cores). How would you do this in Python, that is, organize a two-level multiprocessing (one at the script level, the other inside the function)? Or would you just parallelize the function and launch multiple scripts with different configurations?
I provide a MWE, but I understand that the answer on the (absolute) fastest execution is probably problem dependent. Here the function consists of nested loops, but you can adapt it as long as it satisfies the preceding assumptions, or change the parameters. I am not interested in this particular MWE, but in how to set up an inner level of parallelization.
import numpy as np
import time
import timeit
import multiprocessing as mp
import copy
def function_wrapper(matrix):
    """Puts to zero the multiples of 3."""
    side = matrix.shape[0]
    # consider to parallelize the following
    # NOTE that I am not interested in parallelizing this particular example
    # (you can change this part as long as by parallelizing it you better use the cores)
    # with list comprehensions or built-in functions, but on how to setup a second level of multiprocessing
    for x in range(side):
        for y in range(side):
            for z in range(side):
                matrix[x, y, z] = timeit.timeit("numpy.linalg.eig(numpy.random.randint(0, 10, (L, L)))", setup='import numpy; L='+str(matrix[x, y, z]), number=10)
    return matrix
num_cores = mp.cpu_count()
if __name__ == "__main__":
    matrices_number = 20  # depending on the values of matrices_number and max_side the
    max_side = 10  # parallelization setup is more or less important
    matrices = [np.random.randint(1, 101, (side, side, side))
                for side in np.random.randint(2, max_side, matrices_number)]
    # sorts matrices by decreasing shape
    args_generator = sorted([(m,) for m in matrices], key=lambda x: x[0].shape[0], reverse=True)
    iterations = 10  # iterations to reduce caching effects
    start = time.time()
    for k in range(iterations):
        results = [function_wrapper(*args) for args in copy.deepcopy(args_generator)]
    elapsed = time.time() - start
    print(f"loop: elapsed={elapsed} sec.")
    start = time.time()
    for k in range(iterations):
        with mp.Pool(num_cores) as pool:
            results = pool.starmap_async(function_wrapper, copy.deepcopy(args_generator)).get()
    elapsed = time.time() - start
    print(f"mp.Pool: elapsed={elapsed} sec.")
You're not taking advantage of the built-in functions in the numpy library. You are iterating over the array instead of broadcasting your logic across the whole matrix at once. When you take advantage of the built-in numpy functions, you take advantage of the underlying code written in C. The wrapper function should be the following.
def broadcaster(matrix):
    return np.where(matrix % 3 != 0, matrix, 0)
start = time.time()
for k in range(iterations):
    results = [broadcaster(*args) for args in copy.deepcopy(args_generator)]
elapsed = time.time() - start
print(f"Broadcast: elapsed={elapsed} sec.")
When I add that snippet to your code I get the following.
loop: elapsed=15.675757884979248 sec.
mp.Pool: elapsed=14.439897060394287 sec.
Broadcast: elapsed=0.6325647830963135 sec.
As you can see, in terms of performance, it is not even close.
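One caveat when reading these numbers: broadcaster implements what the docstring of function_wrapper describes (zeroing the multiples of 3), not the timeit call in its body, so the two are not doing the same work. The sketch below only checks that np.where agrees with an explicit triple loop for the docstring behaviour; zero_multiples_loop is a made-up helper.

```python
import numpy as np


def zero_multiples_loop(matrix):
    """Explicit triple loop: set every multiple of 3 to zero (on a copy)."""
    out = matrix.copy()
    side = out.shape[0]
    for x in range(side):
        for y in range(side):
            for z in range(side):
                if out[x, y, z] % 3 == 0:
                    out[x, y, z] = 0
    return out


def broadcaster(matrix):
    """Same result computed on the whole array at once."""
    return np.where(matrix % 3 != 0, matrix, 0)


if __name__ == "__main__":
    cube = np.random.default_rng(0).integers(1, 101, (4, 4, 4))
    print(np.array_equal(zero_multiples_loop(cube), broadcaster(cube)))
```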

Python - Iterate over while loop to compile an average runtime of program

So I want to preface this by saying one of the biggest problems with this I'm assuming is the return section of this code. With that being said, exactly what I'm trying to do is based off of my previous question for this code which was answered in two different ways, one said to be faster than the other. I wanted to see just how much faster myself by comparing the numbers. The problem I am now having though is that I would like to iterate over this function X amount of times, take the runtimes for each of those executions of the code, compile them, and create an average so I can then do the same with the other proposed solution, and compare the two. The main answer or help I'm currently looking for is getting this to iterate so I can have X different runtimes available to be seen. After that I will try to figure out how to compile them on my own, unless someone would be kind enough to help me through this entire process in one go.
import time
start_time = time.time()
def fibonacci():
    previous_num, result = 0, 1
    user = 1000
    iterations = 10
    while len(str(result)) < user:
        while iterations != 0:
            iterations -= 1
            previous_num, result = result, previous_num + result
        return result
print("--- %s seconds ---" % (time.time() - start_time))
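One way to collect X runtimes and average them is to keep the timing out of the function and wrap the call in a small helper. The sketch below assumes the goal of fibonacci is to find the first Fibonacci number with 1000 digits, so the iterations counter is dropped from the function; average_runtime is a made-up helper name.

```python
import time


def fibonacci(digits=1000):
    """Return the first Fibonacci number with at least `digits` digits."""
    previous_num, result = 0, 1
    while len(str(result)) < digits:
        previous_num, result = result, previous_num + result
    return result


def average_runtime(func, runs=10):
    """Call func `runs` times; return (list of runtimes, their average)."""
    runtimes = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        runtimes.append(time.perf_counter() - start)
    return runtimes, sum(runtimes) / len(runtimes)


if __name__ == "__main__":
    runtimes, average = average_runtime(fibonacci, runs=10)
    for seconds in runtimes:
        print("--- %s seconds ---" % seconds)
    print("average: %s seconds" % average)
```

The same helper can then be pointed at the other proposed solution to compare the two averages. The standard library's timeit module offers similar functionality if you prefer not to write the loop yourself.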

How to make multiple objects work at same time?

I have been using Python to control some instruments, which I created a Class for. I have multiple instruments of the same kind, so my script has multiple instances of the same class.
Let's say the class is Arm, and it has methods move_left, move_right and reset. Right now I have script like this:
arm1 = Arm()
arm2 = Arm()
It's completely in serial. I have to wait for arm1 to finish move_left, then start arm2 to move_left. This is very inefficient. I would like arm1 and arm2 to move at the same time. They don't have to be exact same time, because arm1 and arm2 are quite independent and there's not much synchronization requirement. I just don't want to waste time in the serialization in the code.
I've done some searching and learned a little about threading, but what I found is all about putting a function in a Thread target, which doesn't really apply to my situation here.
One way to approach the problem would be to implement a state machine. That is, instead of defining the problem through commands like move_left() and move_right(), you can have some variables that represent the final position that you want each arm to end up at, and a second set of variables that represent the current position of the arm. Then at each time-step, you simply move the arms by a small amount towards their target destination.
Here's a very simple toy program to demonstrate the idea. Note that it moves each "arm" by no more than 0.1 units every 100 ms time-step (you can of course use any time-step and maximum-movement values you want instead):
import time
class Robot:
    def __init__(self):
        self._leftArmCurrentPos = 0.0
        self._leftArmTargetPos = 0.0
        self._rightArmCurrentPos = 0.0
        self._rightArmTargetPos = 0.0
    def setLeftArmTargetPos(self, newPos):
        self._leftArmTargetPos = newPos
    def setRightArmTargetPos(self, newPos):
        self._rightArmTargetPos = newPos
    # Returns the closest value to (deltaVal) in the range [-0.1, +0.1]
    def clamp(self, deltaVal):
        aLittleBit = 0.1  # or however much you want
        if (deltaVal > aLittleBit):
            return aLittleBit
        elif (deltaVal < -aLittleBit):
            return -aLittleBit
        return deltaVal
    def moveArmsTowardsTargetPositions(self):
        leftArmDelta = self.clamp(self._leftArmTargetPos - self._leftArmCurrentPos)
        if (leftArmDelta != 0.0):
            self._leftArmCurrentPos += leftArmDelta
            print("Moved left arm by %f towards %f, new left arm pos is %f" % (leftArmDelta, self._leftArmTargetPos, self._leftArmCurrentPos))
        rightArmDelta = self.clamp(self._rightArmTargetPos - self._rightArmCurrentPos)
        if (rightArmDelta != 0.0):
            self._rightArmCurrentPos += rightArmDelta
            print("Moved right arm by %f towards %f, new right arm pos is %f" % (rightArmDelta, self._rightArmTargetPos, self._rightArmCurrentPos))
if __name__ == "__main__":
    r = Robot()
    r.setLeftArmTargetPos(1.0)    # illustrative target value (assumed)
    r.setRightArmTargetPos(-0.5)  # illustrative target value (assumed)
    while True:
        r.moveArmsTowardsTargetPositions()
        time.sleep(0.1)  # 100 ms time-step
A nice side-effect of this approach is that if you change your mind at any time about where you want the arms to be, you can simply call setLeftArmTargetPos() or setRightArmTargetPos() to give the arms new/different destination values, and they will immediately start moving from wherever they currently are towards the new target positions -- there's no need to wait for them to arrive at the old destinations first.
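If the existing blocking methods have to stay as they are, threads are a different option from the state machine above: a bound method such as arm1.move_left is an ordinary callable and can be passed as a Thread target. The sketch below uses a made-up Arm class whose move_left just sleeps, and it assumes that the real instrument driver tolerates being called from several threads as long as each thread uses its own instance.

```python
import threading
import time


class Arm:
    """Hypothetical stand-in for the instrument class from the question."""

    def __init__(self, name):
        self.name = name
        self.position = 0

    def move_left(self):
        time.sleep(0.2)  # pretend the hardware takes a while to move
        self.position -= 1


def move_all_left(arms):
    """Start move_left on every arm at once and wait for all to finish."""
    threads = [threading.Thread(target=arm.move_left) for arm in arms]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    arm1, arm2 = Arm("arm1"), Arm("arm2")
    start = time.perf_counter()
    move_all_left([arm1, arm2])
    print(f"both arms moved in {time.perf_counter() - start:.2f}s")
```

Because the waiting happens inside time.sleep here (and inside device I/O in a real driver), the two moves overlap instead of running one after the other.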
