Minimal time to compute the minimal value - multithreading

I was asked this question: what is the minimal time needed to compute the minimum of an unsorted array of 32 integers, given that you have 8 cores and each comparison takes 1 minute? My solution is 6 minutes, assuming that each core operates independently. Divide the array into 8 portions of 4 integers each; the 8 cores concurrently compute the local min of each portion, which takes 3 minutes (3 comparisons in each portion). Then 4 cores reduce those 8 local mins to 4, 1 minute. Then 2 cores reduce the 4 local mins to 2, 1 minute. Then 1 core computes the global min of the remaining 2, 1 minute. Therefore, the total is 6 minutes. However, it didn't seem to be the answer that the interviewer was looking for. So what do you guys think about it? Thank you

If you assume that the program is CPU-bound, which is fairly ridiculous, but seems to be where you were going with your analysis, then you need to decide how to divide the work to gain something by multithreading.
8 pieces of 4 integers each seems arbitrary. Interviewers usually like to see a thought process. Being mathematically general, let us compute total orderings over subsets of the problem. How hard is it to compute a total ordering, and what is the payoff?
Total ordering of N items, picking arbitrarily when two items are equal, requires N*(N-1)/2 comparisons and eliminates (N-1) items. Let's make a table.
N = 2: 1 comparison, 1 elimination.
N = 3: 3 comparisons, 2 eliminations.
N = 4: 6 comparisons, 3 eliminations.
Clearly it's most efficient to work with pairs (N = 2), but the other operations are useful if resources would otherwise be idle.
Minute 1-3: Eliminate 24 candidates using operations with N = 2, 8 at a time.
Minute 4: Now there are 8 candidates. Keeping N = 2 would leave 4 cores idle. Setting N = 3 uses 2 more cores per operation and yields 1 more elimination. So do two operations with N = 3 and one with N = 2, which uses 3+3+1 = 7 cores and eliminates 2+2+1 = 5 candidates. Or use 6 cores for one operation with N = 4 and the other two cores for two operations with N = 2, eliminating 3+1+1 = 5. The result is the same.
Minute 5: Only 3 candidates remain, so set N = 3 for the last round.
If you keep the CPUs busy, it takes 5 minutes using a mix of two higher-level abstractions. More energy is spent because this isn't the most efficient way to solve the problem, but it is faster.
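The schedule above can be checked with a small sketch. It assumes, as in the answer, that a group of N candidates is fully ordered in one minute by running all N*(N-1)/2 comparisons on separate cores, and that each minute the groups are chosen to maximize eliminations. The function names are made up for illustration.

```python
from functools import lru_cache


def best_eliminations(candidates, cores):
    """Most candidates that can be eliminated in one minute.

    A group of n candidates costs n*(n-1)/2 cores (one comparison each)
    and eliminates n-1 of them. Groups must be disjoint.
    """
    @lru_cache(maxsize=None)
    def best(items_left, cores_left):
        result = 0
        for n in range(2, items_left + 1):
            cost = n * (n - 1) // 2
            if cost > cores_left:
                break  # larger groups only cost more
            result = max(result, (n - 1) + best(items_left - n, cores_left - cost))
        return result

    return best(candidates, cores)


def minutes_needed(candidates, cores):
    """Minutes until one candidate remains, eliminating greedily each minute."""
    if cores < 1:
        raise ValueError("at least one core is required")
    minutes = 0
    while candidates > 1:
        candidates -= best_eliminations(candidates, cores)
        minutes += 1
    return minutes


print(minutes_needed(32, 8))  # 32 -> 24 -> 16 -> 8 -> 3 -> 1
```

With 32 candidates and 8 cores this reproduces the 5-minute schedule: three rounds of pairs, then a mixed round that eliminates 5, then a final group of 3.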

I'm going to assume that comparing two "integers" is a black box that takes 1 minute to complete, but we can cache those comparisons and only do any particular comparison once.
There's not much you can do until you're down to 8 candidates (3 minutes). But you don't want to leave cores sitting idle if you can help it. Let's say that the candidates are numbered 1 through 8. Then in minute 4 you can compare:
1v2 3v4 5v6 7v8 AND 1v5 2v6 3v7 4v8
If we're lucky, this eliminates 6 candidates, and we can use minute 5 to pick the winner.
If we're not lucky, this leaves 4 candidates (for example, 1, 3, 6, and 8), and that step didn't gain us anything over the original approach. In minute 5, we need to throw everything at it (to beat the original approach). But there are 8 cores, and C(4,2)=6 possible pairings. So we can make every possible comparison (and leave 2 cores idle), and get our winner in 5 minutes.

Those are really big integers, too big to fit into CPU cache, so multithreading doesn't really help you — this problem is I/O bound. (I suppose it depends on the specifics of the I/O bottleneck, but let's not pick nits.)
Since you need exactly N-1 comparisons, the answer is 31.

Related

how to measure the time of selection sort?

When I measure the time of selection sort
on a random array of size 10000, with random numbers in a range of 1000,
it gives me a long time, like 14 sec. When the size is 1000000 it gives me 1 min. I think it is supposed to be less than 5 sec.
Can you help me with the algorithm to lower the time?
import time

def selection_sort(selection_array):
    # Repeatedly move the smallest remaining element to position i.
    for i in range(len(selection_array) - 1):
        minimum_index = i
        for j in range(i + 1, len(selection_array)):
            if selection_array[j] < selection_array[minimum_index]:
                minimum_index = j
        selection_array[i], selection_array[minimum_index] = selection_array[minimum_index], selection_array[i]
    return selection_array

# selection_random_array is the random list described above.
print("--------selection_sort----------")
start1 = time.time()
selection_sort(selection_random_array)
end1 = time.time()
print(f"random array: {end1 - start1}")
You seem to be asking two questions: how to improve selection sort and how to time it exactly.
The short answer for both is: you can't. If you modify the sorting algorithm it is no longer selection sort. If that's okay, a widely used alternative is quicksort, so take a look at that algorithm (it's more complicated, but runs in O(n log n) time on average instead of selection sort's O(n^2) time).
As for your other question, "how do I time it exactly", you also can't. Computers don't handle only one thing at a time anymore. Your operating system is constantly interleaving tasks. There is essentially no chance that your CPU is dedicated entirely to this program while it runs. That means the time the program takes to finish will change each time you run it. Beyond that, the time it takes to call time.time() itself needs to be taken into account.
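Although exact timing is not possible, the noise can be reduced by repeating the measurement and keeping the fastest run. The following is only a sketch using the standard timeit module; the array size of 500 and the 5 repetitions are arbitrary choices for illustration, and the built-in sorted is included purely as a point of comparison.

```python
import random
import timeit


def selection_sort(values):
    """Return a sorted copy of values using selection sort."""
    values = list(values)  # copy, so every timed run sorts the same input
    for i in range(len(values) - 1):
        minimum_index = i
        for j in range(i + 1, len(values)):
            if values[j] < values[minimum_index]:
                minimum_index = j
        values[i], values[minimum_index] = values[minimum_index], values[i]
    return values


# Illustrative input: 500 random integers in the range 0..999.
data = [random.randrange(1000) for _ in range(500)]

# Run each sort 5 times, one call per run, and keep the fastest time.
selection_runs = timeit.repeat(lambda: selection_sort(data), number=1, repeat=5)
builtin_runs = timeit.repeat(lambda: sorted(data), number=1, repeat=5)

print(f"selection_sort, best of 5: {min(selection_runs):.6f} s")
print(f"built-in sorted, best of 5: {min(builtin_runs):.6f} s")
```

Taking the minimum rather than the mean is a common choice here, because background activity on the machine can only make a run slower, not faster.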

FIFO almost full and empty conditions Verilog

Suppose I have a FIFO with depth 32 and width 8 bits. There is a valid bit A in each of the 32 locations. If this bit is 1 in all locations we have the full condition, and if it is 0 in all locations we have the empty condition. My requirement: if bit A is 0 at one location and 1 at all other locations, then when the 30th location is reached it should generate an Almost_full condition.
Help me out please.
Thanks in Advance.
So you have a 32-bit vector and you want to check whether exactly one of the bits is 0. If speed is not much of a concern, I would use a for loop to do this.
If speed is a concern, I would get this done in 5 iterations, using a divide-and-check method. Check the two 16-bit halves in parallel. Then take the half that contains the zero, divide it into two 8-bit parts and check them in parallel. Depending on where the zero is, divide that particular 8-bit part into two 4-bit parts and check them, and so on.
If at any point you have zeros in both parts, you can stop checking and conclude that almost_full = 0;
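The divide-and-check idea can be modelled in software before writing the hardware. The sketch below is Python, not Verilog, and only models the logic; the function name and the representation of the valid bits as one integer are assumptions made for illustration.

```python
def exactly_one_zero(valid_bits, width=32):
    """Return True if exactly one of the low `width` bits is 0.

    Models the divide-and-check method: split the word into halves,
    give up if both halves contain a zero, otherwise keep the half
    that contains the zero. For width 32 the loop runs 5 times.
    """
    mask = (1 << width) - 1
    word = valid_bits & mask
    if word == mask:
        return False  # no zero at all (the "full" case)
    while width > 1:
        half = width // 2
        half_mask = (1 << half) - 1
        low = word & half_mask
        high = (word >> half) & half_mask
        low_has_zero = low != half_mask
        high_has_zero = high != half_mask
        if low_has_zero and high_has_zero:
            return False  # zeros in both parts: more than one zero
        word = low if low_has_zero else high
        width = half
    return True  # narrowed down to a single bit, and it is the zero


print(exactly_one_zero(0xFFFFFFFE))  # one zero, at bit 0
print(exactly_one_zero(0xFFFFFFFC))  # two zeros, at bits 0 and 1
```

In hardware the two half checks at each level would be evaluated in parallel, as the answer describes; the loop here just walks the same levels one after another.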

Is there a search algorithm for minimizing number of threads?

I am using the Intel Xeon Phi coprocessor, which has up to 240 threads, and I am working on minimizing the number of threads used for a particular application (or maximize performance) while being within a percentage of the best execution time. So for example if I have the following measurements:
Threads | Execution time
240     | 100 s
200     | 105 s
150     | 107 s
120     | 109 s
100     | 120 s
I would like to select a number of threads between 120 and 150, since the "performance curve" there seems to stabilize and the reduction in execution time is not that significant (in this case around 15% of the best measured time). I did this using an exhaustive search algorithm (measuring from 1 to 240 threads), but my problem is that it takes too long for smaller numbers of threads (obviously depending on the size of the problem).
To try to reduce the number of measurements, I developed a sort of "binary search" algorithm. Basically I have an upper and lower limit (beginning at 0 and 240 threads), I take the value in the middle and measure it and at 240. I get the percent difference between both values and if it is within 15% (this value was selected after analyzing the results for the exhaustive search) I assign a new lower or upper bound. If the difference is larger than 15% then this is a new lower bound (120-240) and if it is smaller then it is a new upper bound (0-120), and if I get a better execution time I store it as the best execution time.
The problem with this algorithm is that, first of all, this is not necessarily a sorted array of execution times, and for some problem sizes the exhaustive search results show two different minima. For example, in one case I get the best performance at both 80 and 170 threads, and I would like to be able to return 80, not 170, as the result of the search. However, for the other cases where there is only one minimum, the algorithm found a value very close to the one expected.
If anyone has a better idea or knows of an existing search algorithm or heuristic that could help me I would be really grateful.
I'm taking it that your goal is to get the best relative performance for the fewest threads, while still maintaining some limit on performance based on a coefficient (<= 1) of the best possible performance. I.e., if the coefficient is 0.85, then the performance should be no less than 85% of the performance using all threads.
It seems like what you should be trying to do is simply find the minimum number of threads required to obtain the performance bound. Rather than looking at 1-240 threads, start at 240 threads and reduce the number of threads until you can place a lower bound on the performance limit. You can then work up from the lower bound in such a way that you can find the minimum without passing over it. If you don't have a predefined performance bound, then you can calculate one on the fly based on diminishing returns.
Step 1: As long as the performance limit has not been exceeded, halve the number of threads (start with the maximum number of threads). The first number that exceeds the performance limit is a lower bound on the number of threads required.
Step 2: Starting at the lower bound on the number of threads, Z, add m threads if they can be added without getting within the performance limit. Repeatedly double the number of threads added until that would get within the performance limit. If adding the threads gets within the performance limit, undo the last addition and reset the number of threads to be added to m. If even just adding m gets within the limit, then add those last m threads and return the number of threads.
It might be clearer to give an example of what the process looks like step by step. Here "Passed" means that the number of threads is still outside the performance limit, and "Failed" means it is either on the performance limit or inside it.
Try adding 1m (Z + 1m). Passed. Threads = Z + m.
Try adding 2m (Z + 3m). Passed. Threads = Z + 3m.
Try adding 4m (Z + 7m). Failed. Threads = Z + 3m. Reset.
Try adding 1m. Passed. Threads = Z + 4m.
Try adding 2m. Passed. Threads = Z + 6m.
Z + 7m failed earlier so reset.
Comparisons/lookups are cheap, use them to prevent duplication of work.
Try adding 1m. Failed. Threads = Z + 6m. Reset.
Cannot add less than 1m and still be outside the performance limit.
The solution is Z + 7m threads, since Z + 6m is m threads short of the performance limit.
It's a bit inefficient, but it does find the minimum number of threads (>= Z) required to obtain the performance bound, to within an error of m-1 threads, and requires only O(log (N-Z)) tests. This should be enough in most cases, but if it isn't, just skip step 1 and use Z=m, unless increasing the number of threads rapidly decreases the run-time, which causes very slow run times when Z is very small. In that case, doing step 1 and using interpolation can give an idea of how quickly the run-time increases as the number of threads decreases, which is also useful for determining a good performance limit if none is given.
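A minimal sketch of the two steps, under some assumptions that are not in the original: run_time is a hypothetical callback that measures the execution time for a given thread count, the performance limit is expressed as the time at the maximum thread count plus a slack fraction, and execution time is assumed to get worse as threads are removed (the search cannot cope with two separate minima any better than the description above does).

```python
def min_threads(run_time, max_threads, slack=0.15, m=1):
    """Smallest thread count (to within m-1) whose time is within the limit.

    run_time(n) -> seconds for n threads (hypothetical measurement callback).
    slack       -> allowed slowdown relative to run_time(max_threads).
    m           -> smallest step, in threads, used when searching upward.
    """
    cache = {}  # lookups are cheap, measurements are not
    limit = run_time(max_threads) * (1 + slack)

    def fast_enough(n):
        if n not in cache:
            cache[n] = run_time(n)
        return cache[n] <= limit

    # Step 1: halve the thread count until the limit is exceeded.
    upper = max_threads        # smallest count known to be fast enough
    lower = max_threads // 2   # candidate lower bound Z
    while lower >= 1 and fast_enough(lower):
        upper = lower
        lower //= 2
    if lower < 1:
        return upper  # even the smallest count tried was fast enough

    # Step 2: walk up from Z with doubling steps, resetting to m on overshoot.
    threads, step = lower, m
    while True:
        candidate = threads + step
        if candidate >= upper or fast_enough(candidate):
            upper = min(upper, candidate)
            if step == m:
                return upper  # threads is too slow, threads + m is enough
            step = m          # overshoot: go back to the smallest step
        else:
            threads = candidate
            step *= 2


# Usage with a made-up timing model instead of real measurements.
def model(n):
    return 100 + 2400 / n


print(min_threads(model, 240, slack=0.15, m=1))
```

Any count at or above the smallest known good value is treated as good without being measured again, which is the "Z + 7m failed earlier so reset" shortcut from the walk-through.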

Overall download time calculation with max connection cap

Can somebody point me to an algorithm for calculating a download time estimate with a maximum connection cap? For instance, I have 7 PCs with different download speeds, and only X devices are allowed to download at once.
Speed (Kbps) | Size (Kb) | Estimate (s)
10           | 1000      | 100
50           | 1000      | 20
100          | 1000      | 10
200          | 1000      | 5
10           | 1000      | 100
20           | 1000      | 50
40           | 1000      | 25
*Estimate = Size/Speed
What comes to mind is Sum(Estimate)/MaxConnections, but it seems inaccurate.
If X=2, then the result using that logic will be 310/2=155, but in real life it will be 160:
1st iteration:
1 thread: 100s
2 thread: 100s
Total Elapsed: 100s
2nd iteration:
1 thread: 50s
2 thread: 25s + 20s + 5s
Total Elapsed: 150s
3rd iteration:
1 thread: 10s
Total Elapsed: 160s
It seems to be a variation of the k-partition problem, where you want to split the work as evenly as possible between the X devices you have. Unfortunately, this problem is NP-Complete, and there is no known efficient solution to it.
When X=2, and the estimates are relatively small integers, there is a pseudo-polynomial solution to the problem using Dynamic Programming.
However, for general X, there is no known pseudo-polynomial solution.
What you can do:
Use a heuristic solution, such as Genetic Algorithms, to split the work into groups. These solutions will usually be pretty good, but not optimal.
Use a brute-force approach to find the optimal solution. Note that it will only be feasible for a very low number of items you want to download; specifically, it is going to be O(X^n), with n being the number of elements you download.
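As one concrete heuristic, simpler than a genetic algorithm, here is a sketch of the greedy "longest processing time first" rule: sort the estimates in descending order and always give the next download to the connection slot that frees up first. This is a swapped-in technique rather than anything proposed above, it is not guaranteed to be optimal, and the function name is made up for illustration.

```python
import heapq


def lpt_schedule(estimates, max_connections):
    """Greedy longest-processing-time-first assignment.

    Returns (total elapsed time, list of estimates handled by each slot).
    """
    # Min-heap of (time at which the slot becomes free, slot index).
    loads = [(0.0, slot) for slot in range(max_connections)]
    heapq.heapify(loads)
    assignment = [[] for _ in range(max_connections)]
    for estimate in sorted(estimates, reverse=True):
        load, slot = heapq.heappop(loads)  # slot that frees up first
        assignment[slot].append(estimate)
        heapq.heappush(loads, (load + estimate, slot))
    makespan = max(load for load, _ in loads)
    return makespan, assignment


# Data from the question: speed in Kbps, size in Kb, Estimate = Size/Speed.
speeds = [10, 50, 100, 200, 10, 20, 40]
sizes = [1000] * 7
estimates = [size / speed for size, speed in zip(sizes, speeds)]

makespan, assignment = lpt_schedule(estimates, 2)
print(makespan)    # 155.0
print(assignment)  # [[100.0, 50.0, 5.0], [100.0, 25.0, 20.0, 10.0]]
```

On the question's data with X=2 this gives 155 s, which equals Sum(Estimate)/MaxConnections and so cannot be improved for this input; the 160 s in the question comes from one particular ordering of the downloads, not from the cap itself.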

Joblib parallel increases time by n jobs

While trying to get multiprocessing to work (and understand it) in python 3.3 I quickly reverted to joblib to make my life easier. But I experience something very strange (in my point of view). When running this code (just to test if it works):
Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(200000))
It takes about 9 seconds but by increasing n_jobs it actually takes longer... for n_jobs=2 it takes 25 seconds and n_jobs=4 it takes 27 seconds.
Correct me if I'm wrong... but shouldn't it instead be much faster if n_jobs increases? I have an Intel I7 3770K so I guess it's not the problem of my CPU.
Perhaps giving my original problem can increase the possibility of an answer or solution.
I have a list of 30k+ strings, data, and I need to do something with each string (independent of the other strings), it takes about 14 seconds. This is only the test case to see if my code works. In real applications it will probably be 100k+ entries so multiprocessing is needed since this is only a small part of the entire calculation.
This is what needs to be done in this part of the calculation:
data_syno = []
for entry in data:
    w = wordnet.synsets(entry)
    if len(w) > 0:
        data_syno.append(w[0].lemma_names[0])
    else:
        data_syno.append(entry)
The n_jobs parameter is counterintuitive: the maximum number of cores is used at -1. At 1 it uses only one core. At -2 it uses max-1 cores, at -3 it uses max-2 cores, etc. That's how I read it,
from the docs:
n_jobs: int :
The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
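The rule quoted above can be written out as a small helper. This is only an illustration of the quoted arithmetic, not joblib's own code: effective_workers is a hypothetical name, and the handling of n_jobs = 0 and the clamp to at least one worker are assumptions added here.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Number of workers implied by the quoted n_jobs rule (illustrative)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        # Assumption: the quoted rule does not say what 0 means.
        raise ValueError("n_jobs = 0 is not covered by the quoted rule")
    if n_jobs > 0:
        return n_jobs
    # For negative values: n_cpus + 1 + n_jobs, so -1 means all CPUs.
    # Assumption: never go below one worker.
    return max(n_cpus + 1 + n_jobs, 1)


for n_jobs in (1, 4, -1, -2, -3):
    print(n_jobs, effective_workers(n_jobs, n_cpus=8))
```

For an 8-CPU machine this gives 8 workers for n_jobs=-1, 7 for -2 and 6 for -3, matching the reading above.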

Resources