I have the following snippet of code, which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to its closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
    distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
    membership[i] = np.argmin(distances)
The running time here should be O(NKD), where D is the dimension of the data points, so naturally I expect the running time to change proportionally when D increases or decreases. To my surprise, I see very little change in the running time when changing D, for example when testing on my local machine:
D = 1
python3 benchmark.py 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 benchmark.py 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 benchmark.py 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 benchmark.py 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per @Warren's suggestion, I modified the code to use np.linalg.norm with the axis parameter directly; the performance is the following:
D = 1
python3 benchmark.py 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 benchmark.py 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 benchmark.py 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 benchmark.py 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.
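For reference, the modified assignment loop looks roughly like this (a sketch assuming the axis parameter is applied to the whole difference array, per the suggestion):
for i in range(n):
    # one vectorized call computes all K distances for point i
    distances = np.linalg.norm(centroids - points[i], ord=2, axis=1)
    membership[i] = np.argmin(distances)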
This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20_000 times, and each call internally loops over the 250 rows, calling the target Python function once per row (i.e. it is not vectorized), which in turn calls np.linalg.norm. In the end, np.linalg.norm is called 20_000 * 250 = 5_000_000 times. The thing is, each call to a Numpy function typically takes about 1 µs. On my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (types and values), allocations, function calls, conversions, etc.
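You can get a feel for this per-call overhead with a quick micro-benchmark (a rough sketch; the exact numbers depend on the machine):
import timeit
import numpy as np

x = np.ones(1)
# time a single np.linalg.norm call on a size-1 array, averaged over many runs
per_call = timeit.timeit(lambda: np.linalg.norm(x), number=100_000) / 100_000
print(f"{per_call * 1e6:.2f} µs per call")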
There are two simple ways to reduce this overhead: vectorization, or using a JIT compiler like Numba. The latter is often more efficient as it avoids creating expensive big temporary arrays.
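For comparison, a minimal sketch of the fully vectorized option (this is not the Numba code below); note the large (n, K, D) temporary array that broadcasting creates, which is exactly the cost Numba avoids:
diff = points[:, None, :] - centroids[None, :, :]   # temporary of shape (n, K, D)
distances = np.linalg.norm(diff, axis=2)            # shape (n, K)
membership = np.argmin(distances, axis=1)           # shape (n,)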
Here is a much faster implementation:
import numpy as np
import numba as nb
@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
        membership[i] = np.argmin(distances)
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
compute(points, centroids, membership)
In fact, while this code is much faster, it still has a similar issue: the cost of allocating the temporary arrays centroids[j] - points[i] is significant compared to the actual time required to compute the norm. Each allocation takes only a few hundred nanoseconds, but the number of loop iterations is huge. One solution is simply to compute the norm manually:
from math import sqrt
@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            s = 0.0
            for k in range(D):
                tmp = centroids[j,k] - points[i,k]
                s += tmp * tmp
            distances[j] = sqrt(s)
        membership[i] = np.argmin(distances)
Here are results on my i5-9600KF processor:
D=1:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
D=30:
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
D=1000:
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D since the Numpy overheads are the main bottleneck in this case, and the implementation can almost completely remove such overheads (thanks to the JIT compilation).
It is probably O(NKD).
But the thing is, you are iterating over 3 loops here: one explicitly, one semi-explicitly, and the last one implicitly, inside Numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which is applied over the K rows of centroids - points[i] (btw, there is another loop here, in the broadcasting, but we don't need to count all of them for big-O considerations).
And the inner one is the one on the D columns that occur inside norm.
The inner one is obviously the most important to optimize, and that's good, because it is the only one that is vectorized here.
But that means that for small enough values of D, what we really see is mostly constant overhead (times N×K, since it is inside a double for loop). Your inefficient outer loops drive most of the cost, which then looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not quite as bad, but almost: it still calls some Python code many times. It is not vectorization.
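Concretely, for one point i the call from the question behaves roughly like this explicit Python-level loop (a sketch for illustration):
diff = centroids - points[i]
distances = np.empty(K)
for j in range(K):
    # one Python-level call to np.linalg.norm per centroid row
    distances[j] = np.linalg.norm(diff[j], ord=2)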
But, well, I bet that with D big enough, you'll see that it is O(NKD).
Edit:
Here is what I get when I increase D (with a smaller n, so that it remains computable in a realistic time):
You can see that it is really linear (affine, to be accurate, since it doesn't pass through 0, which is why it doesn't look linear to you, and which is explained by my previous comment: when D is small, most of the cost inside the for/along_axis double loop is the constant overhead of those loops. The "proportional to D" part only begins to show when that overhead becomes negligible).
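A sweep along these lines can be reproduced with something like the following sketch (the reduced n and the list of D values are illustrative, not the exact ones behind my numbers):
import time
import numpy as np

n, K = 1000, 250  # smaller n than in the question so each run finishes quickly
for D in (1, 10, 100, 1000, 10000):
    points = np.random.rand(n, D)
    centroids = np.random.rand(K, D)
    t0 = time.perf_counter()
    for i in range(n):
        distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1,
                                        centroids - points[i])
        np.argmin(distances)
    print(D, time.perf_counter() - t0)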
Multiplying larger and larger arrays of fp64 values takes the same time up until some point, after which the time increases. However, it isn't what I expect. First the results and then the code. The first number is the size of the array of numbers and the last number is the time in seconds. The time is for 1,000,000 executions. It is on a 4090.
While looking, consider a few questions. If there are only 16,384 CUDA cores, then why does the time stay the same from before 16,384 fp64 values until well after that? Only at 262,144 multiplies does it take significantly longer. Then after that the time doesn't quite double (1.8x) for reasons I don't understand; once you've saturated the device, doubling the work should be at least 2X slower. Finally, when going from 2,097,152 to 4,194,304 multiplies it takes 4.5 times as long. ???
8192: t1 * t2 took 2.081
16384: t1 * t2 took 2.095
32768: t1 * t2 took 2.066
65536: t1 * t2 took 2.057
131072: t1 * t2 took 2.209 Q1: Why still about 2 seconds with far more values than CUDA cores?
262144: t1 * t2 took 2.991
524288: t1 * t2 took 5.989 2X slower which makes sense
1048576: t1 * t2 took 10.388 Only 1.7X slower, which is a surprise given it is twice the work
2097152: t1 * t2 took 18.95 Q2: 1.8X slower but why ONLY 1.8X
4194304: t1 * t2 took 86.161 Q3: 4.5X slower for twice the work. What is going on here?
import torch
import time
from datetime import datetime
from datetime import timedelta
with torch.cuda.device(0):
    dim1 = 256
    dim2 = 16
    while dim2 <= 16384:
        t1 = 1 + torch.rand((dim1, dim2), device='cuda', dtype=torch.float64)/10000
        t2 = 1 + torch.rand((dim1, dim2), device='cuda', dtype=torch.float64)/10000
        i = 0
        tm0 = datetime.now()
        while i < 1000000:
            t1 = t1 * t2
            #torch.cuda.synchronize() # MULT is dependent on previous result
            i += 1
        torch.cuda.synchronize()
        print(f"{dim1*dim2}: t1 * t2 took {round(timedelta.total_seconds(datetime.now()-tm0)+.0001, 3)}")
        dim2 *= 2
I tried the code above and I was expecting 32,768 multiplies to take twice as long as 16,384, given the actual number of CUDA cores.
I am trying to apply a function to a large range of numbers - and the version where I use a pool from multiprocessing takes much longer to finish than what I estimate for a "single process" version -
Is this a problem with my code? Or Python? Or Linux?
The function that I am using is is_solution defined below-
as_ten_digit_string = lambda x: f"0000000000{x}"[-10:]

def sum_of_digits(nstr):
    return sum([int(_) for _ in list(nstr)])

def is_solution(x):
    return sum_of_digits(as_ten_digit_string(x)) == 10
When I run is_solution on a million numbers - it takes about 2 seconds
In [13]: %timeit [is_solution(x) for x in range(1_000_000)]
1.9 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Based on this - for ~10 Billion numbers - it should take about 20,000 seconds or around 6 hours. But the multiprocessing version doesn't end even after 9 hours.
I am using the multiprocessing module like this -
from multiprocessing import Pool

with Pool(processes=24) as p:
    for solution in p.imap_unordered(is_solution, range(1_000_000_000, 9_999_999_999)):
        if solution:
            print(solution)
The python version I am using is 3.8 on linux.
I don't know if this is relevant - when I run the top command in linux - I see that when my main program has run for ~200 minutes - each of my worker processes has a CPU Time of about 20 minutes.
Multiprocessing is not free. If you have X CPU cores, then spawning more than X processes will eventually lead to performance degradation. If your processes do I/O, then spawning even 10*X processes may be fine, because they don't strain the CPU. However, if your processes do calculations and memory manipulation, any process above X may only degrade performance. In the comments you've said that you have 4 cores, so you should set Pool(processes=4). You may experiment with different values as well; multiprocessing is hard, and it may be that 5 or even 8 processes will still increase performance. But it is extremely likely that 24 processes over 4 CPU cores only hurts performance.
The other thing that you can do is send data to the subprocesses in batches. At the moment you send data one item at a time, and since your calculation is fast (for a single data point) it may be that the interprocess communication dominates the total execution time. This is a price that you do not pay in the single-process scenario but always pay when multiprocessing. To minimize its effect, use the chunksize parameter of imap_unordered, as sketched below.
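A minimal sketch of both suggestions together (the process count and chunksize are illustrative values, not tuned; is_solution is the function from the question):
from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(processes=4) as p:
        # hand out work in large chunks so inter-process communication is amortized
        results = p.imap_unordered(is_solution, range(1_000_000_000, 9_999_999_999),
                                   chunksize=100_000)
        print(sum(results))  # count of numbers in the range whose digits sum to 10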
Finally, try to reimplement your algorithm to avoid brute force, as suggested by @Alex:
def solution(n, sum):
    """Generates numbers of n digits with the given total sum"""
    if n == 1 and sum < 10:
        yield str(sum)
        return
    if n < 1 or (sum > 9 and n < 2):
        return
    if sum == 0:
        yield "0" * n
        return
    for digit in range(min(sum + 1, 10)):
        for s in solution(n - 1, sum - digit):
            yield str(digit) + s

# Print all 4-digit numbers with total sum 10
for s in solution(4, 10):
    print(s)

# Print all 4-digit numbers with total sum 10, not starting with zero
for digit in range(1, 10):
    for s in solution(3, 10 - digit):
        print(str(digit) + s)
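Applied to the original problem, the same idea enumerates the 10-digit numbers (i.e. the range 1_000_000_000 to 9_999_999_999) whose digits sum to 10 directly, instead of testing billions of candidates; a short sketch:
# 10-digit numbers not starting with zero whose digits sum to 10:
# pick the leading digit, then generate the remaining 9 digits with the rest of the sum.
count = 0
for digit in range(1, 10):
    for s in solution(9, 10 - digit):
        count += 1          # or print(str(digit) + s) to list them
print(count)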
I need to run a very time-consuming program. This program uses multiple threads, but when I run it on my server (Windows Server 2008, CPU E5-2680), it has about the same performance as on my PC, with only a little speed improvement.
So I made a Fib function for testing, and I want to use all the cores to run this Fib function at the same time with close to 100% CPU usage.
let rec fib n =
    if n > 450 then 10
    else fib (n+1) + fib (n+2)

let intial (x:BackgroundWorker) =
    x.DoWork.Add(fun e ->
        ignore(fib 1)
    )
    x.RunWorkerCompleted.Add(fun e -> ())
    x

let arr = [| for i = 0 to 7 do yield new BackgroundWorker() |]
let _ = arr |> Array.map intial

while true do
    let res = arr |> Array.map (fun e ->
        if e.IsBusy then
            ()
        else
            e.RunWorkerAsync())
    ()
When I set only one thread, one core reaches 100% usage, which makes sense. But when I try to increase the number of threads, the CPU usage decreases as the thread count increases. For example, when I use 8 threads, there are 8 cores working at about 50% usage each, plus another one that seems to be running the 'while' part. The others seem to be doing no work.
So what's wrong here? I really need a demo that can do some parallel computing with a high level of CPU usage.
It is likely that your program is limited by the cost of switching threads - it is doing too little useful work. A call just to start a job on a thread costs tens of thousands of cycles in the best case. Calculating the Fibonacci of 20 is trivial by comparison.
Remember that the Fibonacci sequence gets very large, very quickly. You will exceed integer, float and double range by the time you reach 450 (I believe the number is about 4.9E93; 64-bit integers run out around 2E19, and doubles lose integer resolution well before that).
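As a quick sanity check on that figure (a back-of-the-envelope estimate via Binet's formula, not from the original answer):

$$F_{450} \approx \frac{\varphi^{450}}{\sqrt{5}} \approx \frac{1.618^{450}}{2.236} \approx 4.9 \times 10^{93},$$

which is indeed far beyond the roughly $1.8 \times 10^{19}$ range of unsigned 64-bit integers.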
I have two algorithm implementations:
average(List) -> sum(List) / len(List).
sum([]) -> 0;
sum([Head | Tail]) -> Head + sum(Tail).
len([]) -> 0;
len([_ | Tail]) -> 1 + len(Tail).
average1(List) -> average_acc(List, 0,0).
average_acc([], Sum, Length) -> Sum / Length;
average_acc([H | T], Sum, Length) -> average_acc(T, Sum + H, Length + 1).
and the output when tracing the GC events gc_start and gc_end (GC started and stopped); here each successive value for a process is the sum of the preceding value and the last GC time:
average: 5189
average: 14480
average: 15118
average1: 594
Why such a big difference?
PS. I use wall clock time.
You should not use wall clock time (the timestamp flag) to measure what GC takes, because even though GC is not rescheduled within the Erlang scheduler thread, the thread itself can be rescheduled by the underlying OS. You should use cpu_timestamp instead.
Your average/1 uses the sum/1 and len/1 implementations, neither of which is tail recursive, so you allocate 2*N stack frames, which makes a big difference in performance compared to average1/1, which uses the tail-recursive average_acc/3. So average1/1 performs exactly like a loop in other languages.
Edit:
Edited after Yola stated that he is using the timestamp flag for trace messages.