Download multiple files: best performance-wise solution - multithreading

I am writing an application which needs to read thousands (let's say 10000) of small (let's say 4KB) files from a remote location (e.g., an S3 bucket).
A single read from the remote takes around 0.1 seconds, and processing a file takes just a few milliseconds (10 to 30). What is the best solution in terms of performance?
The worst I can do is to download everything serially: 0.1 * 10000 = 1000 seconds (~17 minutes).
Compressing the files in a single big one would surely help, but it is not an option in my application.
I tried to spawn N threads (where N is the number of logical cores) that download and process 10000/N files each. This gave me 10000/N * 0.1 seconds. With N = 16, it does indeed take around 1 minute to complete.
I also tried to spawn K*N threads (this sometimes helps on high-latency systems, such as GPUs), but I did not get much speedup for any value of K (I tried 2, 3, 4), maybe due to a lot of thread context switching.
My general question is: how would you design such a system? Is 1 minute really the best I can achieve?
Thank you,
Giuseppe
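Since each read spends almost all of its 0.1 seconds waiting on the network rather than using a CPU, one option is a thread pool much larger than the core count. Below is a minimal Python sketch, assuming the workload is I/O-bound as described. `fetch` and `process` are hypothetical stand-ins (simulated here with a short sleep) for the real remote read and the per-file processing, and the pool size of 64 is an arbitrary starting point to tune, not a recommendation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(key):
    """Hypothetical stand-in for a remote read; sleeps to simulate latency."""
    time.sleep(0.01)
    return ("payload-for-" + key).encode()


def process(data):
    """Hypothetical stand-in for the per-file processing step."""
    return len(data)


def fetch_and_process(key):
    # Download one file and process it on the same worker thread.
    return key, process(fetch(key))


def download_all(keys, max_workers=64):
    """Run fetch_and_process over all keys on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and keeps the workers busy until done.
        return dict(pool.map(fetch_and_process, keys))


if __name__ == "__main__":
    keys = ["file-%05d" % i for i in range(200)]
    start = time.perf_counter()
    results = download_all(keys)
    print(len(results), "files in", round(time.perf_counter() - start, 2), "s")
```

In CPython, threads release the GIL while blocked on I/O, so the network waits overlap. The 10 to 30 ms of processing per file may still run one at a time if it is pure Python, so it is worth measuring several pool sizes.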

Related

How to speed up parallel loading (& unloading) of matrices onto multiple GPUs in Matlab

I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.
At each iteration, I have to work with two large M by n matrices (of doubles), A and B, where M = 1600000 and n = 500. Due to the size of the matrices and the memory capacity of each GPU (~8 GB each), I decompose the problem by partitioning the matrices row-wise into K chunks of smaller matrices that have the same number of columns n but fewer rows (m = M/K).
In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.
To achieve this, I tried using the parfor loop in Matlab (with the parallel computing toolbox), utilizing best practices such as slicing, loading only relevant data for each worker. For posterity, here is a complete code snippet. I have provided small, decomposed problems deeper down in this post.
M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);
%delete(gcp('nocreate'));
%p = parpool('local',K);
tic
toc_load = zeros(K,1);
toc_compute = zeros(K,1);
toc_unload = zeros(K,1);
parfor j = 1:K
    tic
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    tic
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    tic
    B(j,:,:) = gather(B_blk);
    toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);
Using the tic-toc time checker (I run the code only after starting the parpool to ensure that time-tracker is accurate), I found that each worker takes on average:
6.33 seconds to load the data onto the GPU
0.18 seconds to run the computations on the GPU
4.91 seconds to unload the data from the GPU.
However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.
I then tried a simple for loop without parallelization, see snippet below.
%% for loop
tic
for j = 1:K
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);
This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.
From my experiments above, I have observed that the actual computational cost of the GPU working on the linear algebra tasks appears low. The key bottleneck appears to be the time taken in loading the data in parallel from CPU onto the multiple GPUs, and gathering the data from the multiple GPUs back to CPU, though it is also possible that there is some other factor in play.
In light of this, I have the following questions:
What exactly is underlying the slowness of parfor? Why is the communication overhead (or whatever the underlying reason) so expensive?
How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and back in Matlab? Are there tricks involving parfor, spmd (or other things such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup)?
If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?
Thank you for any assistance!
Sending data to/from AWS instances to use with parfor is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there's additional overhead because all communication with AWS workers uses secure communication.
You can use ticBytes and tocBytes to see how much data is being transferred.
To improve the performance, I would suggest doing everything possible to avoid transferring large amounts of data between your client and the workers. It can often be more efficient to build data directly on the workers, even if this means building arrays redundantly multiple times.
Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using randn, which is easy to run on the cluster, but presumably not really representative.
Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with parallel.pool.Constant, or just do everything inside a single spmd block or something. (Your parfor loop as written could equally use spmd since you're arranging things to have one iteration per worker).
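To get a feel for how much data is involved, here is a back-of-the-envelope estimate written in Python (not Matlab) using the sizes from the question. It assumes that parfor ships each sliced input to a worker and ships the sliced output back, and that C is copied to each of the K workers. It is a rough estimate rather than a measurement; ticBytes and tocBytes report the actual figures.

```python
M, n, K = 1_600_000, 500, 4   # sizes taken from the question
BYTES_PER_DOUBLE = 8


def nbytes(rows, cols):
    """Size in bytes of a dense rows-by-cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


a_bytes = nbytes(M, n)       # A, sent to the workers in K slices
b_bytes = nbytes(M, n)       # B, sent in K slices and returned in K slices
d_bytes = nbytes(M, 2)       # D, sent in K slices
c_bytes = nbytes(n, 2) * K   # C, assumed to be copied to each of the K workers

to_workers = a_bytes + b_bytes + d_bytes + c_bytes
from_workers = b_bytes

print("to workers:   %.2f GB" % (to_workers / 1e9))
print("from workers: %.2f GB" % (from_workers / 1e9))
```

Under these assumptions roughly 12.8 GB travels to the workers and 6.4 GB comes back per iteration, which dwarfs the 0.18 seconds of actual GPU computation.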

Is there a search algorithm for minimizing number of threads?

I am using the Intel Xeon Phi coprocessor, which has up to 240 threads, and I am working on minimizing the number of threads used for a particular application (or maximize performance) while being within a percentage of the best execution time. So for example if I have the following measurements:
Threads | Execution time
240 | 100 s
200 | 105 s
150 | 107 s
120 | 109 s
100 | 120 s
I would like to select a number of threads between 120 and 150, since the "performance curve" there seems to stabilize and the reduction in execution time is not that significant (in this case around 15% of the best measured time). I did this using an exhaustive search (measuring from 1 to 240 threads), but my problem is that it takes too long for smaller numbers of threads (obviously depending on the size of the problem).
To try to reduce the number of measurements, I developed a sort of "binary search" algorithm. Basically I have a lower and an upper limit (beginning at 0 and 240 threads); I take the value in the middle and measure the execution time there and at 240. I compute the percent difference between both values and compare it against 15% (this value was selected after analyzing the results of the exhaustive search) to assign a new lower or upper bound. If the difference is larger than 15%, the midpoint becomes the new lower bound (120-240); if it is smaller, it becomes the new upper bound (0-120). If I get a better execution time, I store it as the best execution time.
The problem with this algorithm is, first of all, that the execution times are not necessarily a sorted array, and for some problem sizes the exhaustive search results show two different minima. For example, in one case I get the best performance at both 80 and 170 threads, and I would like the search to return 80, not 170. However, for the other cases where there is only one minimum, the algorithm found a value very close to the expected one.
If anyone has a better idea or knows of an existing search algorithm or heuristic that could help me I would be really grateful.
I'm taking it that your goal is to get the best relative performance for the least number of threads, while still maintaining some limit on performance based on a coefficient (<= 1) of the best possible performance. I.e., if the coefficient is 0.85, then the performance should be no less than 85% of the performance using all threads.
It seems like what you should be trying to do is simply find the minimum number of threads required to obtain the performance bound. Rather than looking at 1-240 threads, start at 240 threads and reduce the number of threads until you can place a lower bound on the number of threads that meets the performance limit. You can then work up from the lower bound in such a way that you can find the minimum without passing over it. If you don't have a predefined performance bound, then you can calculate one on the fly based on diminishing returns.
1. As long as the performance limit has not been exceeded, halve the number of threads (starting with the maximum number of threads). The first number that exceeds the performance limit is a lower bound on the number of threads required.
2. Starting at the lower bound on the number of threads, Z, add m threads if they can be added without getting within the performance limit. Repeatedly double the number of threads added until you get within the performance limit. If adding the threads gets within the performance limit, subtract the last addition and reset the number of threads to be added to m. If even just adding m gets within the limit, then add the last m threads and return the number of threads.
It might be clearer to give an example of what the process looks like step by step, where "Passed" means that the number of threads is outside the performance limit, and "Failed" means it is either on the performance limit or inside it.
Try adding 1m (Z + 1m). Passed. Threads = Z + m.
Try adding 2m (Z + 3m). Passed. Threads = Z + 3m.
Try adding 4m (Z + 7m). Failed. Threads = Z + 3m. Reset.
Try adding 1m. Passed. Threads = Z + 4m.
Try adding 2m. Passed. Threads = Z + 6m.
Z + 7m failed earlier so reset.
Comparisons/lookups are cheap, use them to prevent duplication of work.
Try adding 1m. Failed. Threads = Z + 6m. Reset.
Cannot add less than 1m and still be outside of the performance limit.
The solution is Z + 7m threads.
Since Z + 6m is m threads short of the performance limit.
It's a bit inefficient, but it does find the minimum number of threads (>= Z) required to obtain the performance bound to within an error of m-1 threads, and it requires only O(log (N-Z)) tests. This should be enough in most cases, but if it isn't, just skip step 1 and use Z=m, unless increasing the number of threads rapidly decreases the run time, which would cause very slow runs when Z is very small. In that case, doing step 1 and using interpolation can give an idea of how quickly the run time increases as the number of threads decreases, which is also useful for determining a good performance limit if none is given.
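The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the answerer's exact procedure: `measure` is a hypothetical callback that runs the application with a given thread count and returns its run time, the performance limit is taken to be the run time at `max_threads` plus a `tolerance` fraction, and run time is assumed to get worse monotonically as threads are removed (which, as the question notes, does not always hold).

```python
def find_min_threads(measure, max_threads, tolerance=0.15, m=1):
    """Return a thread count whose run time is within the limit, assuming
    run time does not improve when threads are removed."""
    cache = {}  # lookups are cheap; never measure the same count twice

    def run_time(threads):
        if threads not in cache:
            cache[threads] = measure(threads)
        return cache[threads]

    # Assumption: the limit is the run time with all threads plus a tolerance.
    limit = run_time(max_threads) * (1.0 + tolerance)

    def within(threads):
        return run_time(threads) <= limit

    # Step 1: halve the thread count until the limit is exceeded.
    upper = max_threads          # smallest count known to meet the limit
    lower = max_threads // 2     # candidate lower bound (Z in the text)
    while lower >= 1 and within(lower):
        upper = lower
        lower //= 2
    if lower < 1:
        return upper             # even a single thread meets the limit

    # Step 2: grow from the lower bound, doubling the step each time,
    # and reset the step to m whenever a candidate meets the limit.
    step = m
    while True:
        candidate = min(lower + step, upper)
        if within(candidate):
            upper = candidate
            if candidate - lower <= m:
                return candidate
            step = m
        else:
            lower = candidate
            step *= 2


if __name__ == "__main__":
    # Synthetic, monotonic run-time curve used only for illustration.
    print(find_min_threads(lambda t: 1000.0 / t + 95.0, 240))
```

With the synthetic curve in the example, the search settles on 53 threads, the smallest count whose run time stays within 15% of the 240-thread run time. Clamping each candidate to the smallest count already known to meet the limit is a small addition that avoids retesting above a known answer.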

Perl - creating text files with some data in lesser time - using threading

What's the best way to generate 1000K text files (with Perl on Windows 7)? I want to generate those text files in as little time as possible (ideally within 5 minutes). Right now I am using Perl threading with 50 threads, but it is still taking too long.
What would be the best solution? Do I need to increase the thread count? Is there any other way to write 1000K files in under five minutes? Here is my code:
$start = 0;
$end = 10000;
my $start_run = time();
my @thr;
for($t=0; $t < 50; $t++) {
$thr[$t] = threads->create(\&files_write, $start, $end);
#start again from 10000 to 20000 loop
.........
}
for($t=0; $t < 50; $t++) {
$thr[$t]->join();
}
my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";
I don't need the return values of those threads. I also tried detach(), but it didn't work for me.
Generating 500K files (each only 20 KB) took 1564 seconds (26 min). Can I get this under 5 minutes?
Edit: files_write only takes values from a predefined array structure and writes them to a file. That's it.
Any other solution?
The time needed depends on lots of factors, but heavy threading is probably not the solution:
Creating files in the same directory at the same time probably requires locking in the OS, so it is better not to do too much in parallel.
How the data gets laid out on disk depends on the amount of data and on how many writes you do in parallel. A bad layout can hurt performance a lot, especially on an HDD, but even an SSD cannot do lots of parallel writes. This all depends heavily on the disk you use, e.g. whether it is a desktop system optimized for sequential writes or a server system that can do more parallel writes, as required by databases.
... lots of other factors, often depending on the system
I would suggest using a thread pool with a fixed number of threads to benchmark what the optimal number of threads is for your specific hardware. E.g. start with a single thread and slowly increase the number. My guess is that the optimal number is somewhere between 0.5 and 4 times the number of processor cores you have, but like I said, it depends heavily on your actual hardware.
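The benchmarking idea can be sketched as below. This is written in Python rather than the asker's Perl, and it is only an illustration under stated assumptions: `write_file` and `time_writes` are hypothetical helpers, the 20 KB payload mirrors the file size mentioned in the question, and the files go into a temporary directory that is deleted afterwards.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_file(path, payload):
    """Write one file; stands in for the question's files_write routine."""
    with open(path, "wb") as handle:
        handle.write(payload)


def time_writes(num_files, num_threads, payload=b"x" * 20_000):
    """Write num_files files using num_threads workers.

    Returns (elapsed seconds, number of files found on disk)."""
    with tempfile.TemporaryDirectory() as directory:
        paths = [os.path.join(directory, "file_%d.txt" % i)
                 for i in range(num_files)]
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # list() forces completion and surfaces any write error.
            list(pool.map(lambda p: write_file(p, payload), paths))
        elapsed = time.perf_counter() - start
        return elapsed, len(os.listdir(directory))


if __name__ == "__main__":
    # Start with one thread and increase slowly, as suggested above.
    for threads in (1, 2, 4, 8, 16):
        seconds, written = time_writes(2000, threads)
        print("%2d threads: %d files in %.2f s" % (threads, written, seconds))
```

Whichever thread count gives the lowest time on the target disk is the one to use. The result from one machine should not be expected to carry over to another.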
The slow performance is probably due to Windows having to lock the filesystem down while creating files.
If it is only for testing - and not critical data - a RAMdisk may be ideal. Try Googling DataRam RAMdisk.

Scaling socket.io broadcast

I want to broadcast a 1 KB message with socket.io (a node.js framework) every 3 seconds to a large number of users. What is the best way to scale it (1 user = 1 'listener' with socket.on('periodicMessage', callback))?
There is no other CPU usage (just one read of an external database, which is filled by another external module every 3 seconds), so I am trying to find out whether a simple Heroku server can broadcast a message to 10,000, 100,000, 1 million or more users.
We have easily scaled to tens of thousands of 'listeners' on a single node.js process. I am not sure how many you actually can scale to, given that each socket is a file descriptor, and the plain vanilla kernel can have 65K fd's for each process, no more.
CPU would not be a problem. If anything, upload bandwidth would be (1 KB * 50K users / 3 s = 50 MB / 3 s ≈ 16.7 MB/s upstream; I never measured Heroku, so I don't know if they sustain this. I suppose they do, but maybe they limit you, since they are paying Amazon for this, after all).
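The bandwidth arithmetic above extends to the user counts mentioned in the question. The following Python sketch counts payload bytes only; it assumes 1 MB = 1000 KB and ignores protocol framing and TCP/TLS overhead, so real figures would be somewhat higher. The function name is made up for illustration.

```python
def upstream_mb_per_s(users, message_kb=1.0, period_s=3.0):
    """Average upstream bandwidth in MB/s for one broadcast per period.

    Payload only; assumes 1 MB = 1000 KB and no protocol overhead."""
    return users * message_kb / 1000.0 / period_s


if __name__ == "__main__":
    for users in (10_000, 50_000, 100_000, 1_000_000):
        print("%9d users: %7.1f MB/s" % (users, upstream_mb_per_s(users)))
```

At 1 million users this comes to roughly 333 MB/s of payload alone, so at that scale bandwidth (and the per-process file descriptor limit) is likely to matter well before CPU does.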

Joblib parallel increases time by n jobs

While trying to get multiprocessing to work (and understand it) in python 3.3 I quickly reverted to joblib to make my life easier. But I experience something very strange (in my point of view). When running this code (just to test if it works):
Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(200000))
It takes about 9 seconds, but increasing n_jobs actually makes it take longer: with n_jobs=2 it takes 25 seconds and with n_jobs=4 it takes 27 seconds.
Correct me if I'm wrong, but shouldn't it be much faster as n_jobs increases? I have an Intel i7 3770K, so I guess my CPU is not the problem.
Perhaps giving my original problem can increase the possibility of an answer or solution.
I have a list of 30k+ strings, data, and I need to do something with each string (independent of the other strings), it takes about 14 seconds. This is only the test case to see if my code works. In real applications it will probably be 100k+ entries so multiprocessing is needed since this is only a small part of the entire calculation.
This is what needs to be done in this part of the calculation:
data_syno = []
for entry in data:
    w = wordnet.synsets(entry)
    if len(w) > 0:
        data_syno.append(w[0].lemma_names[0])
    else:
        data_syno.append(entry)
The n_jobs parameter is counterintuitive: the maximum number of cores is used at -1. At 1 it uses only one core. At -2 it uses max-1 cores, at -3 it uses max-2 cores, etc. That's how I read it.
From the docs:
n_jobs: int :
The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
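The rule quoted from the docs can be written out as a small helper. This is a sketch of the documented formula only, not joblib's own code: `effective_workers` is a hypothetical name, and flooring the result at 1 and rejecting 0 are assumptions added here so the function always returns a usable count.

```python
def effective_workers(n_jobs, n_cpus):
    """Number of workers implied by n_jobs on a machine with n_cpus CPUs."""
    if n_jobs == 0:
        # Assumption: zero workers is treated as an error.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # Documented formula: n_cpus + 1 + n_jobs (so -1 means all CPUs).
        # Assumption: never go below one worker.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


if __name__ == "__main__":
    # 8 logical CPUs, as on a 4-core CPU with hyper-threading.
    for n_jobs in (1, 2, 4, -1, -2, -3):
        print("n_jobs=%2d -> %d workers" % (n_jobs, effective_workers(n_jobs, 8)))
```

So n_jobs=2 and n_jobs=4 in the question do request 2 and 4 workers; the sign convention only matters for negative values.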
