perf experiments to understand what the hardware can do - pytorch

Multiplying larger and larger arrays of fp64 values takes the same time up to some point, after which the time increases. However, it isn't what I expect. First the results and then the code. The first number is the size of the array of numbers and the last number is the time in seconds for 1,000,000 executions, on a 4090. While looking, consider a few questions. If there are only 16,384 CUDA cores, then why does the time stay the same from well below 16,384 fp64 values until well beyond that? Only at 262,144 multiplies does it take significantly longer. Then after that, the time doesn't quite double (1.8x) for reasons I don't understand; once you've saturated the device, doubling the work should be at least 2X slower. Finally, going from 2,097,152 to 4,194,304 multiplies takes 4.5 times as long. ???
8192: t1 * t2 took 2.081
16384: t1 * t2 took 2.095
32768: t1 * t2 took 2.066
65536: t1 * t2 took 2.057
131072: t1 * t2 took 2.209 Q1: Why still about 2 seconds with far more values than CUDA cores?
262144: t1 * t2 took 2.991
524288: t1 * t2 took 5.989 2X slower, which makes sense
1048576: t1 * t2 took 10.388 Only 1.7X slower, which is a surprise given it is twice the work
2097152: t1 * t2 took 18.95 Q2: 1.8X slower, but why ONLY 1.8X?
4194304: t1 * t2 took 86.161 Q3: 4.5X slower for twice the work. What is going on here?
import torch
import time
from datetime import datetime
from datetime import timedelta

with torch.cuda.device(0):
    dim1 = 256
    dim2 = 16
    while dim2 <= 16384:
        t1 = 1 + torch.rand((dim1, dim2), device='cuda', dtype=torch.float64) / 10000
        t2 = 1 + torch.rand((dim1, dim2), device='cuda', dtype=torch.float64) / 10000
        i = 0
        tm0 = datetime.now()
        while i < 1000000:
            t1 = t1 * t2
            # torch.cuda.synchronize()  # MULT is dependent on previous result
            i += 1
        torch.cuda.synchronize()
        print(f"{dim1*dim2}: t1 * t2 took {round(timedelta.total_seconds(datetime.now()-tm0)+.0001, 3)}")
        dim2 *= 2
I tried the code above and I was expecting 32,768 multiplies to take twice as long as 16,384, given the actual number of CUDA cores.
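(For anyone reproducing this: below is a variant of the timing loop that brackets the work with CUDA events instead of the host clock, so the measured window is delimited on the GPU itself. This is a sketch of my own, not part of the original post; the helper name, iteration count, and size list are illustrative.)

import torch

def time_mult(dim1, dim2, iters=100000):
    # Same tensors as in the post: values close to 1 so repeated multiplies stay finite.
    t1 = 1 + torch.rand((dim1, dim2), device='cuda', dtype=torch.float64) / 10000
    t2 = 1 + torch.rand((dim1, dim2), device='cuda', dtype=torch.float64) / 10000
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        t1 = t1 * t2
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0  # elapsed_time() reports milliseconds

for dim2 in (16, 64, 256, 1024, 4096, 16384):
    print(256 * dim2, time_mult(256, dim2))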

Related

Numpy.linalg.norm performance apparently doesn't scale with the number of dimensions

I have the following snippet of code, which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to the closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
    distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
    membership[i] = np.argmin(distances)
The running time here should be O(NKD) where D is the dimension of the data points, so naturally I expected that when D increases or decreases, the running time would change proportionally as well. To my surprise, I see very little change in running time when changing D, for example when testing on my local machine:
D = 1
python3 benchmark.py 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 benchmark.py 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 benchmark.py 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 benchmark.py 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per @Warren's suggestion, I modified the code to use np.linalg.norm with the axis parameter directly; the performance is the following:
D = 1
python3 benchmark.py 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 benchmark.py 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 benchmark.py 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 benchmark.py 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.
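(The modified loop itself is not shown in the post; a plausible form of it, replacing np.apply_along_axis with one vectorized call per point, would be:)

# n, points, centroids, membership as defined in the question above
for i in range(n):
    # One vectorized call computes all K distances for point i at once.
    distances = np.linalg.norm(centroids - points[i], ord=2, axis=1)
    membership[i] = np.argmin(distances)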
This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20_000 times, and each call to this function internally does a loop calling the target Python function 250 times (i.e. it is not vectorized), and so does np.linalg.norm. In the end, np.linalg.norm is called 20_000 * 250 = 5_000_000 times. The thing is, each call to a Numpy function typically takes about 1 µs. On my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (on types and values), allocations, function calls, conversions, etc.
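(A quick way to see that per-call overhead for yourself; the exact number will of course depend on the machine:)

import numpy as np
import timeit

x = np.zeros(1)
calls = 100_000
per_call = timeit.timeit(lambda: np.linalg.norm(x, ord=2), number=calls) / calls
print(f"{per_call * 1e6:.2f} µs per np.linalg.norm call on a size-1 array")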
There are two simple ways to reduce this overhead: vectorization and using a JIT compiler like Numba. The latter is often more efficient as it avoids creating expensive big temporary arrays.
Here is a much faster implementation:
import numpy as np
import numba as nb

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
        membership[i] = np.argmin(distances)

n = 20000
D = 30
K = 250

points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)

compute(points, centroids, membership)
In fact, while this code is much faster, it still has a similar issue: the cost of allocating the temporary arrays centroids[j] - points[i] is significant compared to the actual time required to compute the norm. Each allocation takes only a few hundred nanoseconds, but the number of loop iterations is huge. One solution is simply to compute the norm manually:
from math import sqrt

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            s = 0.0
            for k in range(D):
                tmp = centroids[j, k] - points[i, k]
                s += tmp * tmp
            distances[j] = sqrt(s)
        membership[i] = np.argmin(distances)
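It is called the same way as compute above:

compute_fast(points, centroids, membership)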
Here are results on my i5-9600KF processor:
D=1:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
D=30:
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
D=1000:
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D since the Numpy overheads are the main bottleneck in this case and the implementation can almost completely remove such overheads (thanks to the JIT compilation).
It is probably O(NKD).
But the thing is you are iterating 3 loops here. One explicitly. One semi-explicitly. And the last one implicitly, inside numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which applies on the K rows of centroids-points[i] (btw, there is another one here, with some broadcasting. But we don't need to count all of them for big-O consideration)
And the inner one is the one on the D columns that occur inside norm.
The inner one is obviously the most important one to optimize, and that's good, because it is the only one that is vectorized here.
But that means that for small enough values of D, what we really see is mostly constant overhead (times N×K, since it is inside a double for loop). Your inefficient outer loops drive most of the cost, which then looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not quite as bad, but almost: it still calls some Python code many times. It is not vectorization.
But, well, I bet that with D big enough, you'll see that it is O(NKD)
Edit:
Here is what I get when I increase D (with a smaller n, so that it remains computable in a realistic time):
[plot of running time vs. D]
You can see that it looks really linear (affine, to be accurate, since it doesn't pass through 0, which is why it doesn't look proportional to you; as explained in my previous comment, most of the cost inside the for/along_axis double loop is constant loop overhead when D is small. The "proportional to D" part begins to show when that overhead becomes negligible).
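For comparison, a fully vectorized NumPy version with no Python-level loop at all (my own sketch, not from the answers above; note it materializes an (n, K, D) array, so it trades memory for speed):

import numpy as np

n, D, K = 20000, 30, 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)

diff = points[:, None, :] - centroids[None, :, :]   # shape (n, K, D)
distances = np.linalg.norm(diff, axis=2)            # shape (n, K)
membership = np.argmin(distances, axis=1)           # shape (n,)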

Why does a `multiprocessing` version take longer than a single process version in python 3 on Linux?

I am trying to apply a function to a large range of numbers, and the version where I use a pool from multiprocessing takes much longer to finish than what I estimate for a "single process" version.
Is this a problem with my code? Or Python? Or Linux?
The function that I am using is is_solution, defined below:
as_ten_digit_string = lambda x: f"0000000000{x}"[-10:]

def sum_of_digits(nstr):
    return sum([int(_) for _ in list(nstr)])

def is_solution(x):
    return sum_of_digits(as_ten_digit_string(x)) == 10
When I run is_solution on a million numbers - it takes about 2 seconds
In [13]: %timeit [is_solution(x) for x in range(1_000_000)]
1.9 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Based on this - for ~10 Billion numbers - it should take about 20,000 seconds or around 6 hours. But the multiprocessing version doesn't end even after 9 hours.
I am using the multiprocessing module like this -
from multiprocessing import Pool

with Pool(processes=24) as p:
    for solution in p.imap_unordered(is_solution, range(1_000_000_000, 9_999_999_999)):
        if solution:
            print(solution)
The Python version I am using is 3.8, on Linux.
I don't know if this is relevant, but when I run the top command in Linux, I see that when my main program has run for ~200 minutes, each of my worker processes has a CPU time of about 20 minutes.
Multiprocessing is not free. If you have X CPU cores, then spawning more than X processes will eventually lead to performance degradation. If your processes do I/O, then spawning even 10*X processes may be OK, because they don't strain the CPU. However, if your processes do calculations and memory manipulation, then it might be that any process above X only degrades performance. In comments you've said that you have 4 cores, so you should set Pool(processes=4). You may experiment with different values as well; multiprocessing is hard, and it may be that 5 or even 8 processes will still increase performance. But it is extremely likely that 24 processes over 4 CPU cores only hurts performance.
The other thing that you can do is to send data to subprocesses in batches. At the moment you send values one by one, and since your calculation is fast (for a single datapoint) it may be that the interprocess communication dominates the total execution time. This is a price that you do not pay in the single-process scenario but always pay when multiprocessing. To minimize its effect, use the chunksize parameter of imap_unordered.
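A sketch of how the two suggestions combine (a Pool sized to the 4 cores mentioned in the comments, plus a chunksize; the value 100_000 is an illustrative guess, not a tuned number), reusing is_solution from the question:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(processes=4) as p:
        # Each worker now receives 100_000 numbers per task instead of one at a time.
        results = p.imap_unordered(is_solution, range(1_000_000_000, 9_999_999_999),
                                   chunksize=100_000)
        for solution in results:
            if solution:
                print(solution)  # mirrors the original loop; is_solution returns a bool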
Finally, try to reimplement your algorithm to avoid brute force, as suggested by @Alex:
def solution(n, sum):
    """Generates numbers of n digits with the given total sum"""
    if n == 1 and sum < 10:
        yield str(sum)
        return
    if n < 1 or (sum > 9 and n < 2):
        return
    if sum == 0:
        yield "0" * n
        return
    for digit in range(min(sum + 1, 10)):
        for s in solution(n - 1, sum - digit):
            yield str(digit) + s

# Print all 4-digit numbers with total sum 10
for s in solution(4, 10):
    print(s)

# Print all 4-digit numbers with total sum 10, not starting with zero
for digit in range(1, 10):
    for s in solution(3, 10 - digit):
        print(str(digit) + s)
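Applied to the question's actual case (10-digit numbers, i.e. no leading zero, with digit sum 10), the same generator can be used like this; the sketch counts the candidates rather than printing them, since there are a lot of them:

# 10-digit numbers with digit sum 10: fix a non-zero leading digit, generate the other 9.
count = 0
for digit in range(1, 10):
    for s in solution(9, 10 - digit):
        count += 1
print(count)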

Linear programming conditional constraint

There are 22 drivers. Each driver has to work a minimum of 7.6 hrs and can work a maximum of 10 hrs. Each driver's cost and productivity are different.
If a driver works overtime (more than 7.6 hrs), we need to pay 1.5 times the rate for the first 2 hrs and 2 times the rate for the remaining 0.4 hrs.
195 hrs of work have to be completed by the 22 drivers. We need to schedule them in such a way that cost is minimized.
Driver,Cost,Productivity
A,70,0.8
B,22,0.8
C,24,0.8
D,26,0.8
E,28,0.8
F,30,0.8
G,32,0.8
H,34,0.8
I,36,0.8
J,38,0.8
K,40,0.8
L,42,0.9
M,44,0.9
N,46,0.9
O,48,0.9
P,50,0.9
Q,52,0.9
R,54,0.9
S,56,0.9
T,58,0.9
U,60,0.9
V,62,0.5
Decision Variables:
X1, X2, ..., X22 represent the total number of hours allocated to each driver
Objective Function:
Min Z = 20*X1 + 22*X2 + ... + 62*X22
Constraints:
X1 >= 7.6, X2 >= 7.6, ..., X22 >= 7.6
X1 <= 10, X2 <= 10, ..., X22 <= 10
X1 + X2 + ... + X22 <= 195
I have tried following python program so far.
import pulp
import pandas as pd

def main():
    model = pulp.LpProblem("Cost minimising scheduling problem", pulp.LpMinimize)
    totalHours = 192
    minHourEachDriver = 7.6
    maxHourEachDriver = 10

    # importing data from CSV
    drivers = pd.DataFrame.from_csv('csv/drivers.csv', index_col=['Driver', 'Cost', 'Productivity'])

    # Decision Variables
    drv = pulp.LpVariable.dicts("driverName", indexs=((i) for i, j, k in drivers.index), lowBound=0,
                                cat='Continuous')

    # Objective
    model += pulp.lpSum([j * (1 / k) * drv[i] for i, j, k in drivers.index]), "Cost"

    # Constraints
    # total no of hours of work to be done
    model += pulp.lpSum([drv[i] for i, j, k in drivers.index]) == totalHours
    for i, j, k in drivers.index:
        # minimum hours driver has to work
        model += drv[i] >= minHourEachDriver
        # maximum hours driver can work
        model += drv[i] <= maxHourEachDriver

    model.solve()

    # model status
    print(pulp.LpStatus[model.status])
    # Total Cost
    print(pulp.value(model.objective))
    # No of hrs allocated to each driver
    for i, j, k in drivers.index:
        var_value = drv[i].varValue
        # print(var_value)
        print("The number of hours for driver {0} is {1}".format(i, var_value))

if __name__ == '__main__':
    main()
But I am not able to figure out how to put in the following constraint:
If a driver works overtime (more than 7.6 hrs), for the first 2 hrs we
need to pay 1.5 times, and for the remaining 0.4 hrs we need to pay 2 times.
If it is mandatory for each driver to work 7.6 h, there is no need to put that into the conditions. It is just static time (and cost) that can be subtracted from the total hours (and costs), because it always happens:
195 - (NumDrivers * 7.6) is the remaining time that needs to be flexibly distributed between drivers as their overtime in order to reach the 195 hours (when total hours > NumDrivers * 7.6).
I would represent each driver with two variables (one for time worked at the 1.5x rate and a second for time worked at the 2x rate) and build the following LP:
Xij represents the hours allocated to driver i in working mode j (say j=1 for the 1.5x rate and j=2 for the 2x rate)
Based on the provided input file:
Min Z = 70*1.5*X11 + 70*2*X12 + 22*1.5*X21 + 22*2*X22 + ... + 62*1.5*X221 + 62*2*X222
Constraints:
X11 + X12 + X21 + X22 + ... + X221 + X222 = 27.8 (195 - (22*7.6))
X11 + X12 <= 2.4
X21 + X22 <= 2.4
...
X221 + X222 <= 2.4
X11 <= 2
X21 <= 2
...
X221 <= 2
For completeness there should also be a set of conditions representing that each driver can start mode j=2 (the 2x rate) only after completing 2 hours at the 1.5x rate, but in this case the objective function should take care of that automatically (the cheaper 1.5x hours will always be used first).
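A sketch of that formulation in PuLP (the variable names, the pd.read_csv usage, and the choice to keep the base cost out of the objective are my own here; productivity is ignored, as in the formulation above):

import pulp
import pandas as pd

drivers = pd.read_csv('csv/drivers.csv')          # columns: Driver, Cost, Productivity
names = list(drivers['Driver'])
cost = {d: float(c) for d, c in zip(drivers['Driver'], drivers['Cost'])}

model = pulp.LpProblem("overtime_scheduling", pulp.LpMinimize)

# x1[d]: overtime hours paid at 1.5x (at most 2 h); x2[d]: overtime hours paid at 2x (at most 0.4 h)
x1 = pulp.LpVariable.dicts("ot_1_5x", names, lowBound=0, upBound=2)
x2 = pulp.LpVariable.dicts("ot_2x", names, lowBound=0, upBound=0.4)

# Objective: overtime cost only; the mandatory 7.6 h per driver is a constant cost and can be left out.
model += pulp.lpSum(cost[d] * (1.5 * x1[d] + 2.0 * x2[d]) for d in names)

# Base hours plus overtime must cover 195 h, i.e. overtime totals 195 - 22*7.6 = 27.8 h.
model += pulp.lpSum(x1[d] + x2[d] for d in names) == 195 - 7.6 * len(names)

model.solve()
for d in names:
    print(d, 7.6 + x1[d].varValue + x2[d].varValue)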

fsolve cannot find a solution even when given one

I reproduced the issue that I am experiencing with this simple code.
To validate that fsolve is working with the functions below, I pre-calculated the function values with the t values below, so that I am sure that t1 to t4 are a solution.
But even when given the solution as the starting point, fsolve always returns the same solution:
[ 1.50000000e+02 7.00000000e-01 2.00000000e+02 1.00000000e-01]
What am I doing wrong?
Is there a way to set constraints on the solution, for example all the t variables are between 0 and 1000?
import numpy as np
from scipy.optimize import fsolve

t1 = 150.0
t2 = 0.7
t3 = 200.00
t4 = 0.1

def FS(z):
    x1 = z[0]
    x2 = z[1]
    x3 = z[2]
    x4 = z[3]
    f = np.zeros(4)
    f[0] = x1*x2 + x3*x4 - 125.0
    f[1] = (x1**2/500)*x2 + (x3**2/500)*x4 - 39.5
    f[2] = (x1**3/500**2)*x2 + (x3**3/500**2)*x4 - 12.649999999999999
    f[3] = (x1**4/500**3)*x2 + (x3**4/500**3)*x4 - 4.115
    return f

res = fsolve(FS, [t1, t2, t3, t3])
print(res)
[ 1.50000000e+02 7.00000000e-01 2.00000000e+02 1.00000000e-01]
First of all, I don't see your problem. The algorithm converges as expected on (one of the) solutions. This even happens when you deviate a lot from the solution in your starting values, e.g.
t1 = 10
t2 = 10
t3 = 190
t4 = 10
Which gives rise to the solution [150, 0.7, 200, 0.1]. But part of the problem is probably that you have more than one solution. Try for instance
t1 = 190
t2 = 10
t3 = 10
t4 = 10
If [t1, t2, t3, t4] is a solution, then [t3, t4, t1, t2] is a solution as well. Which probably makes the algorithm non-convergent for certain starting values like [1, 1, 1, 1]. See some discussion about the underlying algorithm for instance here
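Regarding the bounds part of the question: fsolve itself does not accept bounds, but the same system can be handed to scipy.optimize.least_squares, which does support box bounds (a root corresponds to a zero-residual minimum). A minimal sketch reusing FS from above, with illustrative starting values:

from scipy.optimize import least_squares

# bounds=(0, 1000) constrains every variable to [0, 1000]
res = least_squares(FS, x0=[10.0, 10.0, 190.0, 10.0], bounds=(0, 1000))
print(res.x, res.cost)  # cost near 0 indicates an actual root was found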

cpu efficiency formula

I have a question like this :
Measurements of a certain system have shown that the average process runs for a time T before blocking on IO. A process switch requires a time S, which is effectively wasted (overhead). For round robin scheduling with quantum Q, give a formula for the CPU efficiency for each of the following:
( a ) Q = INFINITY
( b ) Q > T
( c ) S < Q < T
( d ) Q = S
( e ) Q -> 0
I know how to do (a), (b), (d) and (e), but for (c) the answer is T/(T + S*T/Q) = Q/(Q + S). That means the total number of context switches is T/Q, which confuses me. Say T = 3 and Q = 2: the process runs for 2 units and switches to another process, then later it is switched back to execute and finish, and then switches to another process again. That is 2 switches, which is ceil(T/Q); but based on the answer there is only 1 switch, so is there no difference between running in 1 round and 2 rounds? Could anyone explain this to me, and also what exactly CPU efficiency is?
Your problem doesn't say anything about the scheduler switching when a process is blocked by IO, so I don't think the answer you provided is correct. It doesn't take into account the fact that CPU time is wasted while the process is blocked by IO. Let's look at an example with 2 processes:
repeat floor(T/Q) times:
    Process 1 runs (Q units of time)
    Context switch to process 2 (S units of time)
    Process 2 runs (Q units of time)
    Context switch to process 1 (S units of time)
if T mod Q > 0:
    Process 1 runs (T mod Q units of time) then blocks on IO
    CPU is idle (Q - T mod Q units of time)
    Context switch to process 2 (S units of time)
    Process 2 runs (T mod Q units of time) then blocks on IO
    CPU is idle (Q - T mod Q units of time)
    Context switch to process 1 (S units of time)
Total time elapsed = 2(Q+S)*ceiling(T/Q)
Total time processes were running = 2T
Efficiency = T/((Q+S)*ceiling(T/Q))
If the scheduler switches once a process is blocked, then:
repeat floor(T/Q) times:
    Process 1 runs (Q units of time)
    Context switch to process 2 (S units of time)
    Process 2 runs (Q units of time)
    Context switch to process 1 (S units of time)
if T mod Q > 0:
    Process 1 runs (T mod Q units of time) then blocks on IO
    Context switch to process 2 (S units of time)
    Process 2 runs (T mod Q units of time) then blocks on IO
    Context switch to process 1 (S units of time)
Total time elapsed = 2T + 2*S*ceiling(T/Q)
Total time processes were running = 2T
Efficiency = T/(T+S*ceiling(T/Q))
So if we assume that the scheduler switches when blocked, the answer you have is just missing the ceiling() part. If we assume that T is always a multiple of Q, then you don't even need it. Not sure what your problem says about that though.
On a side note, I think you were double counting context switches because you were looking at it from the perspective of a single process. The fact that there should be one context switch for every quantum that ran becomes more clear when you consider multiple processes being scheduled.
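A quick numeric check of the two cases above (the T, Q, S values here are made up purely for illustration):

import math

T, Q, S = 3.0, 2.0, 0.5

# Scheduler lets the rest of the quantum go idle after the process blocks:
eff_idle = T / ((Q + S) * math.ceil(T / Q))           # 0.6
# Scheduler switches as soon as the process blocks:
eff_switch_on_block = T / (T + S * math.ceil(T / Q))  # 0.75

print(eff_idle, eff_switch_on_block)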
CPU efficiency is the percentage of time the CPU is doing something useful (i.e. not switching). Your formula doesn't suggest anything about how many switches are done, just what fraction of the time is being spent NOT switching.
I end up getting [(T^2)/Q] / [(T^2)/Q + S*((T/Q) - (1/P))], using P as the number of processes. This is the total time executing divided by the total time (including switches):
Total executing time: P*[(T/Q) * T]; T/Q is how many times each process must run, then multiply by T again to get the total processing time.
Switching time: P*[(T/Q) * S - S/P]; (T/Q) * S gives the total switching time, but that counts a switch after the last process has finished (an extra count), so we subtract S/P.
Total time: executing time + switching time, or P*[(T^2)/Q] + P*[(T/Q) * S - (S/P)].
Efficiency: [(T^2)/Q] / [(T^2)/Q + S*((T/Q) - (1/P))]. Notice that the leading P's drop out.
Wolfram Alpha displays properly: Wolfram Eval
