Perl - creating many text files with some data in less time - using threading / multithreading

What's the best way to generate 1000K text files with Perl on Windows 7? I want to generate those text files in as little time as possible (ideally within 5 minutes). Right now I am using Perl threading with 50 threads, but it is still taking much longer than that.
What would be the best solution? Do I need to increase the thread count? Is there any other way to write 1000K files in under five minutes? Here is my code:
use threads;

my $start     = 0;
my $end       = 10000;
my $start_run = time();
my @thr;
for (my $t = 0; $t < 50; $t++) {
    $thr[$t] = threads->create(\&files_write, $start, $end);
    # start again from 10000 to 20000 loop
    .........
}
for (my $t = 0; $t < 50; $t++) {
    $thr[$t]->join();
}
my $end_run  = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";
I don't want the return values of those threads. I also tried detach(), but it didn't work for me.
Generating 500K files (each only 20 KB) took 1564 seconds (~26 minutes). Can I achieve this within 5 minutes?
Edit: files_write only takes values from a predefined array structure and writes them to a file; that's it.
Any other solution?

The time needed depends on lots of factors, but heavy threading is probably not the solution:
creating files in the same directory at the same time probably requires locking inside the OS, so it is better not to do too much of this in parallel
how the data gets laid out on disk depends on the amount of data and on how many writes you do in parallel. A bad layout can hurt performance a lot, especially on an HDD, but even an SSD cannot handle an unlimited number of parallel writes. This depends heavily on the disk you use, e.g. whether it is a desktop drive optimized for sequential writes or a server drive that can handle more parallel writes, as required by databases.
... lots of other factors, often depending on the system
I would suggest using a thread pool with a fixed number of threads and benchmarking what the optimal thread count is for your specific hardware, e.g. start with a single thread and slowly increase the number. My guess is that the optimal number lies somewhere between 0.5 and 4 times the number of processor cores you have, but as I said, it depends heavily on your actual hardware.
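For illustration, here is a minimal sketch of such a fixed-size worker pool using Thread::Queue; the file names and the ~20 KB payload are placeholders, and the worker count is exactly the number you would benchmark:

use threads;
use Thread::Queue;

my $n_workers = 4;                     # benchmark: try 1, 2, 4, 8, ...
my $data      = 'x' x 20_480;          # placeholder for the ~20 KB payload
my $queue     = Thread::Queue->new();

sub worker {
    # pull file indices off the shared queue until an undef "stop" marker arrives
    while (defined(my $i = $queue->dequeue())) {
        open(my $fh, '>', "file_$i.txt") or die "file_$i.txt: $!";
        print $fh $data;
        close($fh);
    }
}

my @workers = map { threads->create(\&worker) } 1 .. $n_workers;
$queue->enqueue($_) for 0 .. 999_999;          # 1000K files
$queue->enqueue(undef) for 1 .. $n_workers;    # one stop marker per worker
$_->join() for @workers;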

The slow performance is probably due to Windows having to lock the filesystem down while creating files.
If it is only for testing - and not critical data - a RAMdisk may be ideal. Try Googling DataRam RAMdisk.

Related

How to speed up parallel loading (& unloading) of matrices onto multiple GPUs in Matlab

I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.
At each iteration, I have to work with two large M by n matrices (of doubles), A and B, where M = 1600000 and n = 500. Due to the size of the matrices and the memory capacity of each GPU (~8 GB each), I decompose the problem by partitioning the matrices row-wise into K chunks of smaller matrices that keep the same n columns but have only M/K rows each.
In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.
To achieve this, I tried using a parfor loop in Matlab (with the Parallel Computing Toolbox), following best practices such as slicing, so that only the relevant data is loaded for each worker. For posterity, here is a complete code snippet; I have provided smaller, decomposed problems further down in this post.
M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);
%delete(gcp('nocreate'));
%p = parpool('local',K);
tic
toc_load = zeros(K,1);
toc_compute = zeros(K,1);
toc_unload = zeros(K,1);
parfor j = 1:K
    tic
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    tic
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    tic
    B(j,:,:) = gather(B_blk);
    toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);
Using tic/toc timing (I run the code only after starting the parpool, to make sure the timing is accurate), I found that each worker takes on average:
6.33 seconds to load the data onto the GPU
0.18 seconds to run the computations on the GPU
4.91 seconds to unload the data from the GPU.
However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.
I then tried a simple for loop without parallelization, see snippet below.
%% for loop
tic
for j = 1:K
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);
This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.
From my experiments above, I have observed that the actual computational cost of the GPU working on the linear algebra tasks appears low. The key bottleneck appears to be the time taken in loading the data in parallel from CPU onto the multiple GPUs, and gathering the data from the multiple GPUs back to CPU, though it is also possible that there is some other factor in play.
In light of this, I have the following questions:
What exactly is underlying the slowness of parfor? Why is the communication overhead (or whatever the underlying reason) so expensive?
How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and back in Matlab? Are there tricks involving parfor, spmd (or other things such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup)?
If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?
Thank you for any assistance!
Sending data to/from AWS instances for use with parfor is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there is additional overhead because all communication with AWS workers uses secure communication.
You can use ticBytes and tocBytes to see how much data is being transferred.
To improve the performance, I would suggest doing everything possible to avoid transferring large amounts of data between your client and the workers. It can often be more efficient to build data directly on the workers, even if this means building arrays redundantly multiple times.
Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using rand - which is easy to run on the cluster, but presumably not really representative.
Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with parallel.pool.Constant, or just do everything inside a single spmd block or something. (Your parfor loop as written could equally use spmd since you're arranging things to have one iteration per worker).
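As a rough sketch of that last suggestion - assuming the chunks can be generated (or loaded) directly on each worker rather than shipped from the client - only the small matrix C crosses the client/worker boundary here:

C = randn(500, 2);                       % small client-side data
spmd
    % assumes a pool of K = 4 workers, so each block has M/K = 400000 rows
    m = 400000; n = 500;
    % build (or load) the block locally instead of transferring it from the client
    A_blk = gpuArray(randn(m, n));
    B_blk = gpuArray(randn(m, n));
    D_blk = gpuArray(zeros(m, 2));
    C_blk = gpuArray(C);                 % only C is broadcast from the client
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    B_part = gather(B_blk);              % B_part is a Composite: one block per worker
end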

Download multiple files: best performance-wise solution

I am writing an application which needs to read thousands (let's say 10000) of small (let's say 4KB) files from a remote location (e.g., an S3 bucket).
A single read from the remote location takes around 0.1 seconds and processing the file takes just a few milliseconds (10 to 30). Which is the best solution in terms of performance?
The worst I can do is download everything serially: 0.1 * 10000 = 1000 seconds (~16 minutes).
Compressing the files into a single big one would surely help, but it is not an option in my application.
I tried to spawn N threads (where N is the number of logical cores) that download and process 10000/N files each. This gave me 10000/N * 0.1 seconds; with N = 16 it indeed takes around 1 minute to complete.
I also tried to spawn K*N threads (this sometimes helps on high-latency systems, such as GPUs), but I did not get much of a speedup for any value of K (I tried 2, 3, 4), maybe due to the amount of thread context switching.
My general question is: how would you design such a similar system? Is 1 minute really the best I can achieve?
Thank you,
Giuseppe
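For reference, a minimal sketch of the thread-pool approach described above; Python with concurrent.futures is used purely for illustration, and fetch_file and process are hypothetical helpers standing in for the actual download and processing code:

import concurrent.futures

def fetch_file(key):
    # hypothetical helper: download one ~4 KB object, ~0.1 s of network latency
    ...

def process(data):
    # hypothetical helper: ~10-30 ms of CPU work per file
    ...

keys = [f"object-{i:05d}" for i in range(10000)]

# the workload is latency-bound, so many more threads than cores can pay off;
# the right pool size still has to be found empirically
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    for data in pool.map(fetch_file, keys):
        process(data)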

How to extract interval/range of rows from compressed file?

How do I return an interval of rows from a *.gz file with 100 million rows?
Let's say I need 5 million rows, starting from row 15 million up to row 20 million.
Is this the best-performing option?
zcat myfile.gz | head -20000000 | tail -5000000
real 0m43.106s
user 0m43.154s
sys 0m9.259s
That's a perfectly reasonable option; since you don't know how long a line will be, you basically have to decompress and iterate the lines to figure out where the line separators are. All three tools are fairly heavily optimized, so I/O and decompression time will likely dominate regardless.
In theory, rolling your own solution that combines all three tools in a single executable might save a little (by reducing the costs of IPC a bit), but the savings would likely be negligible.
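As a small step in that direction, the head/tail pair can at least be collapsed into a single filter; here is an awk sketch for illustration, which prints rows 15,000,001-20,000,000 and exits as soon as the range is done:

# one filter instead of head + tail; exits right after the last wanted row
zcat myfile.gz | awk 'NR > 20000000 { exit } NR > 15000000'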

DMA memcpy operation in Linux

I want to perform DMA using the dma_async_memcpy_buf_to_buf function from dmaengine.c (linux/drivers/dma). For this, I added a function to dmatest.c (linux/drivers/dma) as follows:
/* chan is assumed to be a struct dma_chan * set up elsewhere in dmatest.c
 * (e.g. via dma_request_channel()) before foo() is called. */
void foo(void)
{
    int index = 0;
    dma_cookie_t cookie;
    size_t len = 0x20000;
    ktime_t start, end;
    s64 actual_time;
    u16 *dest;
    u16 *src;

    dest = kmalloc(len, GFP_KERNEL);
    src = kmalloc(len, GFP_KERNEL);
    if (!dest || !src)
        goto out;

    /* fill both buffers with recognizable patterns */
    for (index = 0; index < len / 2; index++) {
        dest[index] = 0xAA55;
        src[index] = 0xDEAD;
    }

    /* timed DMA copy: issue the transfer and wait for completion */
    start = ktime_get();
    cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
    while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
        dma_sync_wait(chan, cookie);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution dma: %lld\n", (long long)actual_time);

    /* timed CPU copy of the same buffer for comparison */
    memset(dest, 0, len);
    start = ktime_get();
    memcpy(dest, src, len);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution non-dma: %lld\n", (long long)actual_time);

out:
    kfree(dest);
    kfree(src);
}
There are some issues I have with this DMA setup:
Interestingly, the memcpy execution time is less than that of dma_async_memcpy_buf_to_buf. Maybe this is related to a problem with ktime_get().
Is my approach in foo correct for performing a DMA operation? I'm not sure about this.
How can I measure the tick counts of the memcpy and dma_async_memcpy_buf_to_buf calls in terms of CPU usage?
Finally, is a DMA operation possible at the application level? Up to now I have used it at the kernel level, as you can see above (dmatest.c is loaded as a kernel module).
There are multiple issues in your question, which makes it kind of hard to answer exactly what you're asking:
Yes, your general DMA operation invocation algorithm is correct.
The fundamental difference between plain memcpy and DMA operations for copying memory is not a direct performance gain, but (a) a performance gain from preserving the CPU cache / prefetcher state when using a DMA operation (which would likely be trashed by a plain old memcpy executed on the CPU itself), and (b) a true background operation that leaves the CPU free to do other work.
Given (a), it's kind of pointless to use DMA operations on anything smaller than the CPU cache size, i.e. dozens of megabytes. Typically it's done for fast off-CPU stream processing, i.e. moving data that is produced/consumed by external devices anyway, such as fast networking cards, video streaming / capturing / encoding hardware, etc.
Comparing async and sync operations in terms of wall-clock elapsed time is wrong. There might be hundreds of threads / processes running, and nothing guarantees that you'll get scheduled on the next tick rather than several thousand ticks later.
Using ktime_get for benchmarking purposes is wrong - it's fairly imprecise, especially for such short jobs. Profiling kernel code is in fact a pretty hard and complex task that is well beyond the scope of this question. A quick recommendation would be to refrain from such micro-benchmarks altogether and profile a much bigger and more complete job, similar to what you're ultimately trying to achieve.
Measuring "ticks" for modern CPUs is also kind of pointless, although you can use CPU vendor-specific tools, such as Intel's VTune.
Using DMA copy operations at the application level is fairly pointless - at least I can't come up with a single viable scenario off the top of my head where it would be worth the trouble. It's not inherently faster, and, more importantly, I seriously doubt that your application's performance bottleneck is memory copying. For that to be the case, you would generally have to be doing everything else faster than regular memory copying, and I can't really think of anything at the application level that would be faster than memcpy. And if we're talking about communication with some other, off-CPU processing device, then it's automatically not the application level.
Generally, memory copy performance is limited by memory speed, i.e. clock frequency and timings. You aren't going to get any miracle boost over regular memcpy in direct performance, simply because memcpy executed on the CPU is already fast enough: the CPU usually runs at clock frequencies 3x-10x higher than the memory.

Is there a way to accelerate matrix plots?

ggpairs(), like its grandparent scatterplotMatrix(), is terribly slow as the number of pairs grows. That's fair; the number of pairs grows quadratically with the number of variables.
What isn't fair is that I have to watch the other cores on my machine sit idle while one cranks away at 100% load.
Is there a way to parallelize large matrix plots?
Here is some sample data for benchmarking.
num.vars <- 100
num.rows <- 50000
require(GGally)
require(data.table)
tmp <- data.table(replicate(num.vars, runif(num.rows)),
                  class = as.factor(sample(0:1, size = num.rows, replace = TRUE)))
system.time({
  tmp.plot <- ggpairs(data = tmp, diag = list(continuous = "density"),
                      columns = 1:num.vars, colour = "class", axisLabels = "show")
  print(tmp.plot)
})
Interestingly enough, my initial benchmarks excluding the print() statement ran at tolerable speeds (21 minutes for the above). The print statement, when added, caused what appear to be segfaults on my machine. (Hard to say at the moment because the R session is simply killed by the OS).
Is the problem in memory, or is this something that could be parallelized? (At least the plot generation part seems amenable to parallelization.)
Drawing ggpairs plots is single threaded because the bulk of the work inside GGally:::print.ggpairs happens inside two for loops (somewhere around line 50, depending upon how you count lines):
for (rowPos in 1:numCol) {
for (columnPos in 1:numCol) {
It may be possible to replace these with calls to plyr::l_ply (or similar), which has a .parallel argument. I have no idea whether the graphics devices will cope with several cores trying to draw on them simultaneously, though. My gut feeling is that getting parallel plotting to work robustly may be non-trivial, but it could also be a fun project.
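For reference, a tiny sketch of the .parallel mechanism itself - not a working patch for GGally:::print.ggpairs; whether the graphics devices tolerate parallel drawing remains the open question, and the sleep is just a placeholder for drawing one panel:

require(plyr)
require(doParallel)
registerDoParallel(cores = 4)   # backend that .parallel = TRUE will use

# stand-in for the body of the two nested loops: each "cell" handled independently
l_ply(1:8, function(cell) {
  Sys.sleep(0.5)                # placeholder for drawing one panel
}, .parallel = TRUE)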
