I need some help with the scheduling of a parallel region. My setup is as follows: I have a parallel region over a few (say, fewer than 10) expensive, independent functions:
for (j=0; j < 1000; j++) {
    // Parallel region
    #pragma omp parallel for
    for (i=0; i < number_of_functions; i++) {
        fcn(j, i); // Expensive
    }
    // Serial region (must be so)
    ...
}
The time it takes to evaluate fcn is highly dependent on i and not so much on j.
Consider the example with three expensive functions and two threads, where the functions take approximately:
First iteration on j:
fcn(j=0, i=1) ~ 10s
fcn(j=0, i=2) ~ 10s
fcn(j=0, i=3) ~ 100s
Second iteration on j:
fcn(j=1, i=1) ~ 10s
fcn(j=1, i=2) ~ 10s
fcn(j=1, i=3) ~ 100s
So here I'd like to schedule i=3 first and then the rest. In other words, I'd like the scheduling to be based on which iterations took the longest in the previous j iteration.
I know about the scheduling options for the for loop (static, dynamic), and the closest fit would be dynamic with a chunk size of one. In my example, however, that wouldn't help, since i=3 would always be evaluated last (with two threads). So my question is: is there an automatic way of scheduling based on the previous execution of the parallel region, or do I have to time the different evaluations myself and do the scheduling on my own?
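To make it concrete, by doing it "on my own" I mean something like the following sketch. Everything here is my own bookkeeping, nothing OpenMP provides: sort_indices_by_time_desc is a helper I'd have to write myself, and order/last_time are scratch arrays. Combined with schedule(dynamic, 1), handing out the slowest index first would act like a longest-task-first heuristic.
#include <omp.h>

double last_time[number_of_functions]; // set all entries to 0.0 before the first j iteration (not shown)
int    order[number_of_functions];     // filled by the helper with a permutation of 0 .. number_of_functions-1

for (j = 0; j < 1000; j++) {
    // hypothetical helper: sort indices so the i that took longest in the
    // previous j iteration comes first (any order is fine for j == 0)
    sort_indices_by_time_desc(order, last_time, number_of_functions);

    #pragma omp parallel for schedule(dynamic, 1)
    for (i = 0; i < number_of_functions; i++) {
        int k = order[i];                 // slowest-from-last-time first
        double t0 = omp_get_wtime();
        fcn(j, k);                        // Expensive
        last_time[k] = omp_get_wtime() - t0;
    }

    // Serial region (must be so)
    ...
}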
Consider a second example
First iteration on j:
fcn(j=0, i=1) ~ 10s
fcn(j=0, i=2) ~ 10s
fcn(j=0, i=3) ~ 100s
fcn(j=0, i=4) ~ 10s
fcn(j=0, i=5) ~ 50s
Second iteration on j:
fcn(j=1, i=1) ~ 10s
fcn(j=1, i=2) ~ 10s
fcn(j=1, i=3) ~ 100s
fcn(j=1, i=4) ~ 10s
fcn(j=1, i=5) ~ 50s
Here I'd like to schedule i=3 first and i=5 second. I know that in the first j iteration there isn't much that can be done, but from the second j iteration on I'd like the previous iteration's timings to be taken into account for the scheduling.
I hope I made myself clear and thanks in advance!
Related
I am trying to apply a function to a large range of numbers, and the version where I use a pool from multiprocessing takes much longer to finish than what I estimate for a "single process" version.
Is this a problem with my code? Or Python? Or Linux?
The function that I am using is is_solution, defined below:
as_ten_digit_string = lambda x: f"0000000000{x}"[-10:]

def sum_of_digits(nstr):
    return sum([int(_) for _ in list(nstr)])

def is_solution(x):
    return sum_of_digits(as_ten_digit_string(x)) == 10
When I run is_solution on a million numbers - it takes about 2 seconds
In [13]: %timeit [is_solution(x) for x in range(1_000_000)]
1.9 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Based on this, for ~10 billion numbers it should take about 20,000 seconds, or around 6 hours. But the multiprocessing version doesn't finish even after 9 hours.
I am using the multiprocessing module like this:
from multiprocessing import Pool

with Pool(processes=24) as p:
    for solution in p.imap_unordered(is_solution, range(1_000_000_000, 9_999_999_999)):
        if solution:
            print(solution)
The Python version I am using is 3.8, on Linux.
I don't know if this is relevant, but when I run the top command in Linux I see that, when my main program has run for ~200 minutes, each of my worker processes has a CPU time of about 20 minutes.
Multiprocessing is not free. If you have X CPU cores, then spawning more than X processes will eventually lead to performance degradation. If your processes do I/O, spawning even 10*X processes may be fine, because they don't strain the CPU. However, if your processes do computation and memory manipulation, any process beyond X may only degrade performance. In the comments you've said that you have 4 cores, so you should set Pool(processes=4). You may experiment with different values as well; multiprocessing is hard, and it may be that 5 or even 8 processes still improve performance. But it is extremely likely that 24 processes over 4 CPU cores only hurts performance.
The other thing you can do is send data to the subprocesses in batches. At the moment you send the numbers one by one, and since your calculation is fast for a single data point, the interprocess communication may dominate the total execution time. This is a price that you do not pay in the single-process scenario but always pay when multiprocessing. To minimize its effect, use the chunksize parameter of imap_unordered.
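For example, combining both suggestions on your original loop (processes=4 comes from your core count; chunksize=10_000 is just an illustrative value to tune, not a recommendation):
from multiprocessing import Pool
# is_solution as defined above in the question

if __name__ == "__main__":
    # 4 workers to match the 4 physical cores; chunksize batches the numbers so
    # each worker receives thousands of them per round trip instead of one,
    # which keeps inter-process communication from dominating.
    with Pool(processes=4) as p:
        for solution in p.imap_unordered(is_solution,
                                         range(1_000_000_000, 9_999_999_999),
                                         chunksize=10_000):
            if solution:
                print(solution)  # prints True for each hit, since is_solution returns a bool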
Finally, try to reimplement your algorithm to avoid brute force, as suggested by @Alex:
def solution(n, sum):
    """Generates numbers of n digits with the given total sum"""
    if n == 1 and sum < 10:
        yield str(sum)
        return
    if n < 1 or (sum > 9 and n < 2):
        return
    if sum == 0:
        yield "0" * n
        return
    for digit in range(min(sum + 1, 10)):
        for s in solution(n - 1, sum - digit):
            yield str(digit) + s

# Print all 4-digit numbers with total sum 10
for s in solution(4, 10):
    print(s)

# Print all 4-digit numbers with total sum 10, not starting with zero
for digit in range(1, 10):
    for s in solution(3, 10 - digit):
        print(str(digit) + s)
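Applied to the original problem (10-digit numbers, no leading zero, whose digits sum to 10), the same pattern enumerates the answers directly instead of testing roughly nine billion candidates. A sketch:
# Sketch: every 10-digit number (no leading zero) whose digit sum is 10,
# built from the solution() generator above.
count = 0
for digit in range(1, 10):
    for s in solution(9, 10 - digit):
        count += 1
        # print(str(digit) + s)  # uncomment to print each number
print(count)  # number of such 10-digit numbers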
I'm trying out the multithreading functionality of Julia 1.3 with the following hardware:
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2.8 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
When running the following script:
function F(n)
    if n < 2
        return n
    else
        return F(n-1)+F(n-2)
    end
end

@time F(43)
it gives me the following output
2.229305 seconds (2.00 k allocations: 103.924 KiB)
433494437
However, when running the following code, copied from the Julia page about multithreading,
import Base.Threads.@spawn

function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end
fib(43)
what happens is that the RAM/CPU utilisation jumps from 3.2GB/6% to 15GB/25%, without any output (for at least 1 minute, after which I decided to kill the Julia session).
What am I doing wrong?
Great question.
This multithreaded implementation of the Fibonacci function is not faster than the single-threaded version. That function was only shown in the blog post as a toy example of how the new threading capabilities work, highlighting that they allow spawning many, many threads in different functions and letting the scheduler figure out an optimal workload.
The problem is that @spawn has a non-trivial overhead of around 1µs, so if you spawn a thread to do a task that takes less than 1µs, you've probably hurt your performance. The recursive definition of fib(n) has exponential time complexity of order 1.6180^n [1], so when you call fib(43), you spawn something on the order of 1.6180^43 threads. If each one takes 1µs to spawn, it'll take around 16 minutes just to spawn and schedule the needed threads, and that doesn't even account for the time it takes to do the actual computations and to re-merge / sync the threads, which takes even more time.
Things like this, where you spawn a thread for each step of a computation, only make sense if each step of the computation takes a long time compared to the @spawn overhead.
Note that there is work going into lessening the overhead of @spawn, but by the very physics of multicore silicon chips I doubt it can ever be fast enough for the above fib implementation.
If you're curious about how we could modify the threaded fib function to actually be beneficial, the easiest thing to do would be to only spawn a fib thread if we think it will take significantly longer than 1µs to run. On my machine (running on 16 physical cores), I get
function F(n)
    if n < 2
        return n
    else
        return F(n-1)+F(n-2)
    end
end

julia> @btime F(23);
  122.920 μs (0 allocations: 0 bytes)
That's a good two orders of magnitude above the cost of spawning a thread, so it seems like a good cutoff to use:
function fib(n::Int)
    if n < 2
        return n
    elseif n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return fib(n-1) + fib(n-2)
    end
end
Now, if I follow proper benchmarking methodology with BenchmarkTools.jl [2], I find
julia> using BenchmarkTools

julia> @btime fib(43)
  971.842 ms (1496518 allocations: 33.64 MiB)
433494437

julia> @btime F(43)
  1.866 s (0 allocations: 0 bytes)
433494437
@Anush asks in the comments: "This is a factor of 2 speed up using 16 cores it seems. Is it possible to get something closer to a factor of 16 speed up?"
Yes it is. The problem with the above function is that its body is larger than that of F, with lots of conditionals, function/thread spawning and so on. I invite you to compare @code_llvm F(10) and @code_llvm fib(10). This means that fib is much harder for Julia to optimize, and that extra overhead makes a world of difference for the small-n cases.
julia> @btime F(20);
  28.844 μs (0 allocations: 0 bytes)

julia> @btime fib(20);
  242.208 μs (20 allocations: 320 bytes)
Oh no! All that extra code that never gets touched for n < 23 is slowing us down by an order of magnitude! There's an easy fix though: when n < 23, don't recurse down into fib; instead, call the single-threaded F.
function fib(n::Int)
    if n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return F(n)
    end
end

julia> @btime fib(43)
  138.876 ms (185594 allocations: 13.64 MiB)
433494437
which gives a result closer to what we'd expect for so many threads.
[1] https://www.geeksforgeeks.org/time-complexity-recursive-fibonacci-program/
[2] The @btime macro from BenchmarkTools.jl will run the function multiple times, skipping the compilation time, and average the results.
@Anush
As an example of using memoization and multithreading manually:
_fib(::Val{1}, _, _) = 1
_fib(::Val{2}, _, _) = 1

import Base.Threads.@spawn

_fib(x::Val{n}, d = zeros(Int, n), channel = Channel{Bool}(1)) where n = begin
    # lock the channel
    put!(channel, true)
    if d[n] != 0
        res = d[n]
        take!(channel)
    else
        take!(channel) # unlock channel so I can compute stuff
        # t = @spawn _fib(Val(n-2), d, channel)
        t1 = _fib(Val(n-2), d, channel)
        t2 = _fib(Val(n-1), d, channel)
        res = fetch(t1) + fetch(t2)
        put!(channel, true) # lock channel
        d[n] = res
        take!(channel) # unlock channel
    end
    return res
end
fib(n) = _fib(Val(n), zeros(Int, n), Channel{Bool}(1))
fib(1)
fib(2)
fib(3)
fib(4)
@time fib(43)

using BenchmarkTools
@benchmark fib(43)
But the speedup came from memoization and not so much from multithreading. The lesson here is that we should think about better algorithms before reaching for multithreading.
I'm writing a JAGS script (a hierarchical Bayesian model) where the times of events are modelled as a race between two processes.
Observations: time is the measured times of events.
Model: two processes with gaussian rates - whichever process finishes first triggers the event.
Goal: Estimate the rates of the two processes.
model{
    # Priors
    mu1 ~ dnorm( 0,1 )   # rate of one process
    mu2 ~ dnorm( 0,1 )   # rate of other process
    sigma1 <- 1          # variability in rate
    sigma2 <- 0.1        # variability in rate

    # Observations
    for (i in 1:N) {
        rate1[i] ~ dnorm( mu1, sigma1 )       # Sample the two
        rate2[i] ~ dnorm( mu2, sigma2 )       # racing processes.
        rmax[i] <- max( rate1[i], rate2[i] )  # which was faster?
        time[i] ~ 1/rmax[i]                   #### This is wrong!
    }
}
Question: How can I indicate that the times are sampled from the larger of two rates, each of which is sampled from a different distribution?
[Example histogram of simulated time data, using mu1 = 3 and mu2 = 3 with different standard deviations for the two processes (fixed at 1 and 0.1).]
I have a question like this:
Measurements of a certain system have shown that the average process runs for a time T before blocking on I/O. A process switch requires a time S, which is effectively wasted (overhead). For round-robin scheduling with quantum Q, give a formula for the CPU efficiency for each of the following:
( a ) Q = INFINITY
( b ) Q > T
( c ) S < Q < T
( d ) Q = S
( e ) Q -> 0
I know how to do (a), (b), (d) and (e), but for (c) the given answer is T/(T + S*T/Q) = Q/(Q + S). That implies the total number of context switches is T/Q, which confuses me. Say T = 3 and Q = 2: the process runs for 2 units and switches to another process, then later it is switched back in to execute and finish, and then it switches away again, so that is 2 switches, i.e. ceiling(T/Q). But based on the answer there is only 1 switch, so is there no difference between finishing in 1 round and in 2 rounds? Could anyone explain this to me, and what exactly CPU efficiency is?
Your problem doesn't say anything about the scheduler switching when a process blocks on I/O, so I don't think the answer you were given is correct: it doesn't take into account that CPU time is wasted while a process is blocked on I/O. Let's look at an example with 2 processes:
repeat floor(T/Q) times:
    Process 1 runs (Q units of time)
    Context switch to process 2 (S units of time)
    Process 2 runs (Q units of time)
    Context switch to process 1 (S units of time)
if T mod Q > 0:
    Process 1 runs (T mod Q units of time) then blocks on IO
    CPU is idle (Q - T mod Q units of time)
    Context switch to process 2 (S units of time)
    Process 2 runs (T mod Q units of time) then blocks on IO
    CPU is idle (Q - T mod Q units of time)
    Context switch to process 1 (S units of time)

Total time elapsed = 2(Q+S)*ceiling(T/Q)
Total time processes were running = 2T
Efficiency = T/((Q+S)*ceiling(T/Q))
If the scheduler switches once a process is blocked, then:
repeat floor(T/Q) times:
    Process 1 runs (Q units of time)
    Context switch to process 2 (S units of time)
    Process 2 runs (Q units of time)
    Context switch to process 1 (S units of time)
if T mod Q > 0:
    Process 1 runs (T mod Q units of time) then blocks on IO
    Context switch to process 2 (S units of time)
    Process 2 runs (T mod Q units of time) then blocks on IO
    Context switch to process 1 (S units of time)

Total time elapsed = 2T + 2*S*ceiling(T/Q)
Total time processes were running = 2T
Efficiency = T/(T + S*ceiling(T/Q))
So if we assume that the scheduler switches when blocked, the answer you have is just missing the ceiling() part. If we assume that T is always a multiple of Q, then you don't even need it. Not sure what your problem says about that though.
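For instance, plugging in your own numbers T = 3 and Q = 2, and picking S = 1 purely for illustration:

ceiling(T/Q) = ceiling(3/2) = 2                        switches charged per process
Efficiency   = T/(T + S*ceiling(T/Q)) = 3/(3 + 1*2) = 0.6

whereas the given answer T/(T + S*T/Q) = 3/(3 + 1*1.5) = 2/3 effectively charges the non-integer T/Q = 1.5 switches.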
On a side note, I think you were double counting context switches because you were looking at it from the perspective of a single process. The fact that there should be one context switch for every quantum that ran becomes clearer when you consider multiple processes being scheduled.
CPU efficiency is the percentage of time the CPU is doing something useful (i.e. not switching). Your formula doesn't suggest anything about how many switches are done, just what fraction of the time is being spent NOT switching.
I end up getting [(T^2)/Q] / [(T^2)/Q + S*(T/Q) - (S/P)], using P as the number of processes. This is the total executing time divided by the total time (including switches):
Total executing time: P[(T/Q) * T], where T/Q is how many times each process must run; multiplying by T again gives the total time spent processing.
Switching time: P[((T/Q) * S) - S/P]; it is T/Q * S because we need the total switching time, but that counts a switch after the last process has finished (an extra count), so we subtract S/P.
Total time: executing time + switching time, or P[(T^2)/Q] + P[((T/Q) * S) - (S/P)].
Efficiency: [(T^2)/Q] / [(T^2)/Q + S*(T/Q) - (S/P)]. Notice that the P's drop out.
Wolfram Alpha displays properly: Wolfram Eval
I have two algorithm implementations:
average(List) -> sum(List) / len(List).
sum([]) -> 0;
sum([Head | Tail]) -> Head + sum(Tail).
len([]) -> 0;
len([_ | Tail]) -> 1 + len(Tail).
average1(List) -> average_acc(List, 0,0).
average_acc([], Sum, Length) -> Sum / Length;
average_acc([H | T], Sum, Length) -> average_acc(T, Sum + H, Length + 1).
and the output from tracing the GC events gc_start and gc_end (GC started and stopped). Here each successive value for a process is the sum of the preceding value and the last GC time:
average: 5189
average: 14480
average: 15118
average1: 594
Why is there such a big difference?
PS: I use wall clock time.
You should not use wall clock time (the timestamp flag) to measure how long GC takes: even though GC is not rescheduled within the Erlang scheduler thread, the thread itself can be rescheduled by the underlying OS. You should use cpu_timestamp instead.
Your average/1 uses the sum/1 and len/1 implementations, neither of which is tail recursive, so you allocate about 2*N stack frames, which makes a big difference in performance compared to average1/1, which uses the tail-recursive average_acc/3. average1/1 therefore performs just like a loop in other languages.
Edit:
Edited after Yola stated that he is using the timestamp flag for the trace messages.