Is there a search algorithm for minimizing number of threads? - multithreading

I am using the Intel Xeon Phi coprocessor, which has up to 240 threads, and I am working on minimizing the number of threads used for a particular application (or maximize performance) while being within a percentage of the best execution time. So for example if I have the following measurements:
Threads | Execution time
240 | 100 s
200 | 105 s
150 | 107 s
120 | 109 s
100 | 120 s
I would like to select a number of threads between 120 and 150, since the "performance curve" there seems to stabilize and the reduction in execution time is not that significant (in this case around 15% of the best measured time). I did this using an exhaustive search algorithm (measuring from 1 to 240 threads), but my problem is that it takes too long for smaller numbers of threads (obviously depending on the size of the problem).
To try to reduce the number of measurements, I developed a sort of "binary search" algorithm. Basically I have an upper and lower limit (beginning at 0 and 240 threads), I take the value in the middle and measure it and at 240. I get the percent difference between both values and if it is within 15% (this value was selected after analyzing the results for the exhaustive search) I assign a new lower or upper bound. If the difference is larger than 15% then this is a new lower bound (120-240) and if it is smaller then it is a new upper bound (0-120), and if I get a better execution time I store it as the best execution time.
The problem with this algorithm is that, first of all, the execution times are not necessarily a sorted array, and for some problem sizes the exhaustive search results show two different minima. For example, in one case I get the best performance at both 80 and 170 threads, and I would like the search to return 80, not 170. However, for the other cases, where there is only one minimum, the algorithm found a value very close to the expected one.
If anyone has a better idea or knows of an existing search algorithm or heuristic that could help me I would be really grateful.

I'm taking it that your goal is to get the best relative performance for the least number of threads, while still maintaining some limit on performance based on a coefficient (<=1) of the best possible performance. I.e., if the coefficient is 0.85, then the performance should be no less than 85% of the performance using all threads.
It seems like what you should be trying to do is simply find the minimum number of threads required to obtain the performance bound. Rather than looking at 1-240 threads, start at 240 threads and reduce the number of threads until you can place a lower bound on the performance limit. You can then work up from the lower bound in such a way that you can find the minimum without passing over it. If you don't have a predefined performance bound, then you can calculate one on the fly based on diminishing returns.
1. As long as the performance limit has not been exceeded, halve the number of threads (start with the maximum number of threads). The number that exceeds the performance limit is a lower bound on the number of threads required.
2. Starting at the lower bound on the number of threads, Z, add m threads if they can be added without getting within the performance limit. Repeatedly double the number of threads added until within the performance limit. If adding the threads gets within the performance limit, subtract the last addition and reset the number of threads to be added to m. If even just adding m gets within the limit, then add the last m threads and return the number of threads.
It might be clearer to give an example of what the process looks like step by step. Here, Passed means that the number of threads is still outside the performance limit, and Failed means that it is either on the performance limit or inside it.
Try adding 1m (Z + 1m). Passed. Threads = Z + m.
Try adding 2m (Z + 3m). Passed. Threads = Z + 3m.
Try adding 4m (Z + 7m). Failed. Threads = Z + 3m. Reset.
Try adding 1m. Passed. Threads = Z + 4m.
Try adding 2m. Passed. Threads = Z + 6m.
Z + 7m failed earlier so reset.
Comparisons/lookups are cheap, use them to prevent duplication of work.
Try adding 1m. Failed. Threads = Z + 6m. Reset.
Cannot add less than 1m and still be outside the performance limit.
The solution is Z + 7m threads.
Since Z + 6m is m threads short of the performance limit.
It's a bit inefficient, but it does find the minimum number of threads (>= Z) required to obtain the performance bound to within an error of m-1 threads, requiring only O(log (N-Z)) tests. This should be enough in most cases, but if it isn't, just skip step 1 and use Z=m. The exception is when the run time drops rapidly as threads are added, so that runs with a very small Z are very slow. In that case, doing step 1 and using interpolation gives an idea of how quickly the run time increases as the number of threads decreases, which is also useful for determining a good performance limit if none is given.
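The halving and doubling phases described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `measure` is a hypothetical callback that runs the application with a given thread count and returns its run time, the run with all threads is taken as the best one, and run time is assumed not to increase as threads are added (so the two-minima case from the question is not handled).

```python
def find_min_threads(measure, max_threads, coeff=0.85, m=1):
    """Search for a small thread count whose run time stays within the limit.

    measure(n) is a hypothetical callback returning the run time with n threads.
    """
    cache = {max_threads: measure(max_threads)}
    # Run-time limit derived from the run with all threads (assumed to be the best).
    limit = cache[max_threads] / coeff

    def within_limit(n):
        # Look up earlier measurements instead of repeating them.
        if n not in cache:
            cache[n] = measure(n)
        return cache[n] <= limit

    # Phase 1: halve the thread count until the limit is violated.
    z = max_threads
    while z > 1 and within_limit(z):
        z //= 2
    if within_limit(z):
        return z  # even a single thread satisfies the limit

    # Phase 2: grow from the lower bound z, doubling the step after each
    # addition that is still too slow and resetting it after an overshoot.
    threads, step = z, m
    while True:
        candidate = min(threads + step, max_threads)
        if within_limit(candidate):
            if step == m:
                return candidate  # the smallest step already reaches the limit
            step = m              # overshoot: discard it and restart with m
        else:
            threads = candidate
            step *= 2


def fake_measure(n):
    # Synthetic, monotonically decreasing run time used only for this demo.
    return 1000 / n + 100


print(find_min_threads(fake_measure, 240))
```

With m=1 the result is exact under the monotonicity assumption; a larger m trades up to m-1 extra threads for fewer measurements.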

Related

how to measure the time of selection sort?

When I measure the time of selection sort
on a random array of size 10000 with a random number range of 1000,
it takes a long time, about 14 sec. When the size is 1000000 it takes 1 min. I think it is supposed to take less than 5 sec.
Can you help me with the algorithm to lower the time?
import random
import time

def selection_sort(selection_array):
    # Sort in place: move the minimum of the unsorted tail to position i.
    for i in range(len(selection_array) - 1):
        minimum_index = i
        for j in range(i + 1, len(selection_array)):
            if selection_array[j] < selection_array[minimum_index]:
                minimum_index = j
        selection_array[i], selection_array[minimum_index] = selection_array[minimum_index], selection_array[i]
    return selection_array

# Input reconstructed from the description above (size 10000, values up to 1000).
selection_random_array = [random.randint(0, 1000) for _ in range(10000)]

print("--------selection_sort----------")
start1 = time.time()
selection_sort(selection_random_array)
end1 = time.time()
print(f"random array: {end1 - start1}")
You seem to be asking two questions: how to improve selection sort and how to time it exactly.
The short answer for both is: you can't. If you modify the sorting algorithm it is no longer selection sort. If that's okay, a standard faster choice is quicksort, so take a look at that algorithm (it's much more complicated, but runs in O(n log n) time on average instead of selection sort's O(n^2) time).
As for your other question, "how do I time it exactly", you also can't. Computers don't handle only one thing anymore. Your operating system is constantly threading tasks in between each other. There is a 0% chance that your CPU is dedicated entirely to this program while it runs. What does that mean? It means that the time it takes for the program to finish will change each time you run it. Beyond that, the time it takes to call time.time() will need to be taken into account.
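Although an exact timing is not possible, the noise can be reduced by repeating the measurement and keeping the fastest run. The sketch below uses the standard `timeit` module for that; the helper name `best_time` and the array size of 1000 are choices made for this illustration, not part of the question. Each run sorts a fresh copy, so the copy time is included in the result.

```python
import random
import timeit


def selection_sort(values):
    """In-place selection sort, as in the question."""
    for i in range(len(values) - 1):
        minimum_index = i
        for j in range(i + 1, len(values)):
            if values[j] < values[minimum_index]:
                minimum_index = j
        values[i], values[minimum_index] = values[minimum_index], values[i]
    return values


def best_time(sort_fn, data, repeats=5):
    # Sort a fresh copy on every run; the minimum is the run that was
    # least disturbed by other processes on the machine.
    timings = timeit.repeat(lambda: sort_fn(list(data)), repeat=repeats, number=1)
    return min(timings)


# Small input so the demo finishes quickly.
data = [random.randint(0, 1000) for _ in range(1000)]
print(f"selection sort:  {best_time(selection_sort, data):.4f} s")
print(f"built-in sorted: {best_time(sorted, data):.4f} s")
```

Comparing against the built-in `sorted` on the same data gives a feel for how much of the measured time comes from the O(n^2) algorithm itself.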

list.count() vs Counter() performance

While trying to find the frequency of a bunch of characters in a string, why does running string.count(character) 4 times for 4 different characters yield faster execution time (using time.time()) than using a collections.Counter(string)?
Background:
Given a sequence of moves represented by a string. Valid moves are R (right), L (left), U (up), and D (down). Return True if the sequence of moves takes me back to the origin. Otherwise, return false.
# approach - 1 : iterate 4 times (3.9*10^-6 seconds)
def foo1(moves):
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')

# approach - 2 iterate once (3.9*10^-5 seconds)
def foo2(moves):
    from collections import Counter
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']

import time
start = time.time()
moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"
foo1(moves)
# foo2(moves)
end = time.time()
print("--- %s seconds ---" % (end - start))
These results are the opposite of what I had expected. My reasoning is that first approach should take longer because the string is iterated over 4 times whereas in the second approach, we iterate only once. Could it be due to the library call overhead?
Counter is faster in theory, but has higher fixed overhead, especially compared to str.count, which can scan the underlying C array with direct memory comparisons, where list.count has to do rich comparisons for each element; converting moves to a list of single characters nearly triples the time for foo1 in local tests, from 448 ns to 1.3 μs (while foo2 actually gets a tiny bit faster, dropping from 5.6 μs to 5.48 μs).
Other problems:
1. Importing an already imported module uses the cached import, but there is a surprising amount of overhead involved in even a cached import (the loading machinery has a lot of stuff to check to make sure it's okay to do so); in local tests, moving from collections import Counter to the top level reduced the runtime of foo2 by 1.6 μs (5.6 μs with single global import, 7.2 μs with local per-call import). This will vary a lot by environment; on another machine (with less stuff installed in both user and system site-packages), the overhead was only 0.75 μs. Regardless, it's a significant, avoidable disadvantage for foo2.
2. Counter on modern Python uses a C accelerator to speed up counting, but the accelerator only provides a benefit when the iterable is long enough. If you use the list form of moves, but multiply it by 100 to make a longer sequence, the difference drops, relatively speaking (to 106 µs for foo1 vs. 140 µs for foo2).
3. You're just not counting very many things; when there are only four things you care about, paying O(n) four times can easily beat paying O(n) once if the former case has lower constant multipliers (which aren't included in big-O notation) than the latter. Counter remains O(n) for any number of unique things being counted; calling .count is O(n) per call, but if you need to know the count of every unique thing in the input, for inputs that are mostly unique, individual .count calls for each will be asymptotically O(n²).
4. The .count approach is short-circuiting in your specific case, so it isn't even doing O(n) work four times, just twice; the U and D counts don't match, so it never counts L and R at all. Counter doesn't get meaningfully slower if it can't short-circuit (all the cost is paid in the single counting pass), but your foo1, in the same benchmark I used from point #2 (longer input, in list form), goes from 106 µs to 185 µs if I just add a single D to the end of the (pre-multiplication) moves (making the U and D counts the same, and requiring two more count calls); foo2 only goes up to 143 µs (from 140 µs), presumably because moves actually got longer (adding the D before multiplying by 100 meant it went from 2900 elements to count to 3000).
Basically, you had some minor implementation weaknesses, but mostly, you happened to choose a use case that gave all the advantage to .count, none to Counter. If your inputs are always str, and you're only counting them a small, fixed number of times, then sure, repeated calls to count are generally going to win. But for arbitrary input types (especially iterators, where count is impossible, both because it doesn't exist, and because you can only iterate it once), especially larger ones, with more unique things to count, where consistent performance counts (so relying on short-circuiting to reduce the number of count calls isn't acceptable), Counter will win.
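A single time.time() pair around one call is too coarse for timings in the microsecond range, so a comparison like this is better done with `timeit`. The following is a minimal sketch of such a benchmark; the function names are made up for the illustration, and the absolute numbers will differ from machine to machine.

```python
import timeit
from collections import Counter  # imported once, at module level


def judge_with_count(moves):
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')


def judge_with_counter(moves):
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']


moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"
runs = 10_000

for fn in (judge_with_count, judge_with_counter):
    # Best of 5 batches of `runs` calls, reported per call.
    total = min(timeit.repeat(lambda: fn(moves), repeat=5, number=runs))
    print(f"{fn.__name__}: {total / runs * 1e6:.2f} µs per call")
```

Replacing `moves` with `list(moves) * 100` in the same loop is one way to try the longer-input case discussed above.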

How to speed up parallel loading (& unloading) of matrices onto multiple GPUs in Matlab

I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.
At each iteration, I have to work with two large M by n matrices (of doubles), A and B, where M = 1600000 and n = 500. Due to the size of the matrices and the memory capacity of each GPU (~8 GB memory each), I decompose the problem by partitioning the matrices row-wise into K chunks of smaller matrices that have the same number of columns n but fewer rows (m = M/K).
In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.
To achieve this, I tried using the parfor loop in Matlab (with the parallel computing toolbox), utilizing best practices such as slicing, loading only relevant data for each worker. For posterity, here is a complete code snippet. I have provided small, decomposed problems deeper down in this post.
M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);
%delete(gcp('nocreate'));
%p = parpool('local',K);
tic
toc_load = zeros(K,1);
toc_compute = zeros(K,1);
toc_unload = zeros(K,1);
parfor j = 1:K
    tic
    A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    tic
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    tic
    B(j,:,:) = gather(B_blk);
    toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);
Using the tic-toc time checker (I run the code only after starting the parpool to ensure that time-tracker is accurate), I found that each worker takes on average:
6.33 seconds to load the data onto the GPU
0.18 seconds to run the computations on the GPU
4.91 seconds to unload the data from the GPU.
However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.
I then tried a simple for loop without parallelization, see snippet below.
%% for loop
tic
for j = 1:K
    A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);
This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.
From my experiments above, I have observed that the actual computational cost of the GPU working on the linear algebra tasks appears low. The key bottleneck appears to be the time taken in loading the data in parallel from CPU onto the multiple GPUs, and gathering the data from the multiple GPUs back to CPU, though it is also possible that there is some other factor in play.
In light of this, I have the following questions:
What exactly is underlying the slowness of parfor? Why is the communication overhead (or whatever the underlying reason) so expensive?
How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and then back in Matlab? Are there tricks involving parfor, spmd (or other things such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup) ?
If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?
Thank you for any assistance!
Sending data to/from AWS instances to use with parfor is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there's additional overhead because all communication with AWS workers uses secure communication.
You can use ticBytes and tocBytes to see how much data is being transferred.
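As a rough cross-check before measuring, the raw array sizes from the question can be added up by hand. The short Python calculation below is only an estimate: it counts 8 bytes per double, assumes the sliced arrays A, B and D are each sent to the workers once and that C is copied to each of the K workers, and ignores any serialization overhead, so the figures reported by ticBytes and tocBytes may differ.

```python
BYTES_PER_DOUBLE = 8
M, n, K = 1_600_000, 500, 4  # sizes taken from the question

bytes_A = M * n * BYTES_PER_DOUBLE   # sliced input
bytes_B = M * n * BYTES_PER_DOUBLE   # sliced input and output
bytes_D = M * 2 * BYTES_PER_DOUBLE   # sliced input
bytes_C = n * 2 * BYTES_PER_DOUBLE   # assumed to be copied to every worker

sent_to_workers = bytes_A + bytes_B + bytes_D + K * bytes_C
returned_to_client = bytes_B

print(f"A or B alone:       {bytes_A / 1e9:.2f} GB")
print(f"sent to workers:    {sent_to_workers / 1e9:.2f} GB")
print(f"returned to client: {returned_to_client / 1e9:.2f} GB")
```

Under these assumptions each iteration moves on the order of 19 GB between the client and the workers, which helps explain why transfer time dominates the 0.18 seconds of GPU computation.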
To improve the performance, I would suggest doing everything possible to avoid transferring large amounts of data between your client and the workers. It can often be more efficient to build data directly on the workers, even if this means building arrays redundantly multiple times.
Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using rand - which is easy to run on the cluster, but presumably not really representative.
Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with parallel.pool.Constant, or just do everything inside a single spmd block or something. (Your parfor loop as written could equally use spmd since you're arranging things to have one iteration per worker).

Performance loss with OpenMP in Fortran

First of all, sorry if I make grammatical mistakes. I'm not a native English speaker.
I'm trying to reduce the performance loss that occurs when increasing the number of threads using OpenMP in Fortran.
I'm using two Intel Xeon X5650 (12 physical cores) with 96 GB of RAM.
The best results that I've obtained are the following:
1 proc -> 15.50 sec; 2 proc -> 8.10 sec; 4 proc -> 4.42 sec; 8 proc -> 2.81 sec; 12 proc -> 2.43 sec
As you can see, the improvement decreases the more threads I run.
Here's the code:
allocate(SUM(PUNTOST,PUNTOSP,NUM_DATA,2))
allocate(SUMATORIO(1))
allocate(SUMATORIO(1)%REGION(REGIONS))
DO i=1,REGIONS
   allocate(SUMATORIO(1)%REGION(i)%VALOR(2,PUNTOSP,PUNTOST))
   SUMATORIO(1)%REGION(i)%VALOR= cmplx(0.0,0.0)
END DO
allocate(valor_aux(2,PUNTOSP,PUNTOST))
!...
call SYSTEM_CLOCK(counti,count_rate)
!$OMP PARALLEL NUM_THREADS(THREADS) DEFAULT(PRIVATE) FIRSTPRIVATE(REGIONS) &
!$OMP SHARED(SUMATORIO,SUM,PP,VEC_1,VEC_2,IDENT,TIPO,MUESTRA,PUNTOST,PUNTOSP)
!$OMP DO SCHEDULE(DYNAMIC,8)
DO i=1,REGIONS
   INDICE=VEC_1(i)
   valor_aux = cmplx(0.0,0.0)
   DO j=1,VEC_2(i)
      ii=IDENT(INDICE+1)
      INDICE=INDICE+1
      IF(TIPO(ii).ne.4) THEN
         j1=MUESTRA(ii)
         DO I1=1,PUNTOST
            DO I2=1,PUNTOSP
               valor_aux(1,I2,I1)=valor_aux(1,I2,I1)+SUM(I1,I2,J1,1)*PP(ii)
               valor_aux(2,I2,I1)=valor_aux(2,I2,I1)+SUM(I1,I2,J1,2)*PP(ii)
            END DO
         END DO
      END IF
   END DO
   SUMATORIO(1)%REGION(i)%VALOR= valor_aux
END DO
!$OMP END DO
!$OMP END PARALLEL
call SYSTEM_CLOCK(countf)
dt=REAL(countf-counti)/REAL(count_rate)
write(*,*)'FASE_1: Time: ',dt,'seconds'
Some points to know:
All data types are COMPLEX, except for loop vectors
NUM_DATA = 14000000
REGIONS = 1000000
Values contained in VEC_2 are between 10 and 20
PUNTOST = 21
PUNTOSP = 20
All allocated memory consumes about 60 GB of RAM
I've tried to change the dimensions of the matrices to avoid excessive memory caching (SUM(2,PUNTOSP,PUNTOST,NUM_DATA) for example), but the current layout is the one that gave me the best performance. I don't know the reason, because most of the documents I've read say that memory access should be "sequential" so that the CPU brings the least amount of memory into the cache.
Also, I've changed the memory alignment to 32, 64 and 128 bytes, but it didn't improve anything.
Also I've changed the SCHEDULE option to STATIC with different chunk sizes and DYNAMIC with different chunk sizes but the results are the same or worse.
Do you have some ideas that I could use to improve the performance when using 8 or more cores?
Thank you so much for your attention and help.
Dividing the CPU time by 6 using 12 processors is rather good. In my applications I rarely get more than 4 or 5 (but some sequential parts always remain, which is possibly the reason for that).
You could try the collapse option, which merges two loops together... But I don't know whether this is possible in your case, because there are conditions to fulfill (for instance, no instruction between the two loops).
While working on multidimensional arrays in Fortran, the leftmost index should change the fastest. You could try to change the order of the indices of valor_aux and SUM to
valor_aux(PUNTOSP, PUNTOST, 2)
SUM(PUNTOSP, PUNTOST, NUM_DATA, 2)
Additionally, you should always keep Amdahl's law in mind. There is always some overhead, which makes further speedup impossible.
Also: In your two innermost loops, the factor PP(ii) doesn't change. You should try to apply it after these loops (unless you know that you are using FMA). And these loops are only a sum of many values. You should try the intrinsic function SUM to remove these loops. Both things could require a massive redesign of your loops.
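To put Amdahl's law in numbers, the timings from the question can be turned into speedups and an implied non-parallel fraction. The sketch below solves Amdahl's formula S = 1 / (f + (1 - f) / p) for f; it is only a back-of-the-envelope estimate, since it lumps every source of overhead (memory bandwidth, scheduling, truly serial code) into that single fraction.

```python
# Timings from the question: thread count -> seconds
timings = {1: 15.50, 2: 8.10, 4: 4.42, 8: 2.81, 12: 2.43}


def serial_fraction(speedup, procs):
    # Amdahl's law, S = 1 / (f + (1 - f) / p), solved for f.
    return (procs / speedup - 1) / (procs - 1)


for procs, seconds in timings.items():
    speedup = timings[1] / seconds
    line = f"{procs:2d} threads: speedup {speedup:.2f}"
    if procs > 1:
        line += f", implied serial fraction {serial_fraction(speedup, procs):.3f}"
    print(line)
```

For 12 threads this gives a speedup of about 6.4 and an implied fraction of roughly 8%. If that fraction grows with the thread count, the limit is more likely a shared resource than a fixed serial part.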

Joblib parallel increases time by n jobs

While trying to get multiprocessing to work (and understand it) in python 3.3 I quickly reverted to joblib to make my life easier. But I experience something very strange (in my point of view). When running this code (just to test if it works):
Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(200000))
It takes about 9 seconds but by increasing n_jobs it actually takes longer... for n_jobs=2 it takes 25 seconds and n_jobs=4 it takes 27 seconds.
Correct me if I'm wrong... but shouldn't it instead be much faster if n_jobs increases? I have an Intel I7 3770K so I guess it's not the problem of my CPU.
Perhaps giving my original problem can increase the possibility of an answer or solution.
I have a list of 30k+ strings, data, and I need to do something with each string (independent of the other strings), it takes about 14 seconds. This is only the test case to see if my code works. In real applications it will probably be 100k+ entries so multiprocessing is needed since this is only a small part of the entire calculation.
This is what needs to be done in this part of the calculation:
data_syno = []
for entry in data:
    w = wordnet.synsets(entry)
    if len(w)>0: data_syno.append(w[0].lemma_names[0])
    else: data_syno.append(entry)
The n_jobs parameter is counterintuitive: the maximum number of cores is used at -1, while at 1 it uses only one core. At -2 it uses max-1 cores, at -3 it uses max-2 cores, etc. That's how I read it:
from the docs:
n_jobs: int :
The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
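The rule quoted from the docs can be written out as a small function. This is a plain-Python illustration of the formula (n_cpus + 1 + n_jobs), not joblib's own code; the name `effective_workers`, the clamping to at least one worker, and the rejection of n_jobs == 0 are assumptions of this sketch.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Number of workers implied by n_jobs, following the quoted rule."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        # Treated as invalid in this sketch.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs > 0:
        return n_jobs
    # Negative values count down from the number of CPUs;
    # clamping to at least 1 is an assumption of this sketch.
    return max(n_cpus + 1 + n_jobs, 1)


# Example with 8 CPUs (an illustrative value).
for n_jobs in (1, 2, -1, -2, -3):
    print(f"n_jobs={n_jobs:2d} -> {effective_workers(n_jobs, n_cpus=8)} workers")
```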
