I have the following pseudocode (a loop with a variable step size) that I am trying to implement using the Matlab Parallel Computing Toolbox or Matlab Distributed Computing Server. I already have a Matlab implementation of this loop that works in ordinary Matlab R2013a.
Given: u0, t_0, T (initial and ending time value), the initial step size: h0
while t_0 < T
    % the first step is to compute U1, U2, which depend on t_0 and some known parameters
    U1(t_0, h0, u0, parameters)
    U2(t_0, h0, u0, parameters)
    % U1 and U2 are independent of each other, so they can be computed in parallel in Matlab
    % the next step is to compute U3, U4, U5, U6, which depend on t_0, U1, U2, and known parameters
    U3(t_0, h0, u0, U1, U2, parameters)
    U4(t_0, h0, u0, U1, U2, parameters)
    U5(t_0, h0, u0, U1, U2, parameters)
    U6(t_0, h0, u0, U1, U2, parameters)
    % U3, U4, U5, U6 are mutually independent, so they can also be computed in parallel in Matlab
    % finally, compute U7 and U8, which depend on U1, U2, ..., U6
    U7(t_0, h0, u0, U1, U2, U3, U4, U5, U6)
    U8(t_0, h0, u0, U1, U2, U3, U4, U5, U6)
    % U7 and U8 are also independent, so we can compute them in parallel as well
    % do step-size control here, then assign h0 := h_new
    t_0 = t_0 + h_new
end
Could you please suggest the best way to implement the above loop in parallel in Matlab?
By "best" I mean the approach that gives the largest speedup for the whole computation.
(I have access to the LEO III supercomputer, which has 162 compute nodes with 12 cores each, i.e. 1944 cores in total.)
My idea is to compute U1 and U2 at the same time on two separate workers (cores), each with its own memory. Using the obtained results for U1 and U2, one can do the same for computing U3, U4, U5, U6, and finally for U7 and U8. For that I think I need to use parfor within matlabpool? But I do not know how many loop indices (corresponding to the number of cores/processors) I need.
My questions are:
Since I have access to the supercomputer mentioned above, can I use Matlab Distributed Computing Server on it?
For this code, should I use the Parallel Computing Toolbox or Matlab Distributed Computing Server?
I mean, with the Parallel Computing Toolbox (local workers), I cannot specify which workers will compute U1 and U2 (likewise for U3, U4, ...), since they share memory and run interactively; is that right?
If I use the proposed idea, how many workers will I need? Probably 8 cores?
Is it better to use 1 compute node and ask for 9 cores (8 for the workers and one for the Matlab session), or to use 8 compute nodes?
I am a beginner with Matlab Parallel Computing. Please give your suggestions!
Thanks!
Peter
I suggest parallelizing the while-loop, since you want to distribute many iterations among the nodes. parfor is the easiest way to start with parallel computing, and it does a good job for straightforward problems like yours. Only go with the server if there are a lot of time steps that each take a significant amount of time, because any parallelization comes with a certain overhead.
Computing locally allows you to make use of 12 cores in recent versions of Matlab; make sure that you have enough RAM to keep 13 copies of your loop body in memory. With good processor architecture and with no other programs competing for resources, it is fine to run on all cores.
Thus:
timeSteps = t0:h:T;
parfor timeIdx = 1:length(timeSteps)
    t0 = timeSteps(timeIdx);
    %# calculate all your u's here
    %# collect the output
    result{timeIdx,1} = U7;
    result{timeIdx,2} = U8;
end
All the computations of U1, ..., U8 need to call a function that performs matrix-vector multiplications; let's not worry about how long those take for the moment (not much in my case). The problem is that, in the previous methods, U1, ..., U8 are not independent (they are dependent!): to compute U_{i+1} you need U_i, so you have to compute them sequentially, one after the other. Now I have constructed a method that allows U1 and U2 to be computed at the same time (they are independent), and the same holds for U3, ..., U6 and for U7, U8. So I want to save CPU time for the whole computation; that is why I think one could use Matlab parallel computing.
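For concreteness, here is a rough sketch of the within-step parallelism I have in mind, using parfeval from the Parallel Computing Toolbox (available from R2013b; in R2013a one could use spmd instead). U1, ..., U8, u0, parameters and the step-size control are placeholders for my actual functions and data:

pool = parpool('local', 4);              % 4 workers: at most 4 stages run at once
t_0 = 0; T = 1; h0 = 1e-3;               % example time interval and initial step
% u0, parameters: problem data from the pseudocode above (placeholders here)
while t_0 < T
    % stage 1: U1 and U2 are independent
    f1 = parfeval(pool, @U1, 1, t_0, h0, u0, parameters);
    f2 = parfeval(pool, @U2, 1, t_0, h0, u0, parameters);
    U1v = fetchOutputs(f1);  U2v = fetchOutputs(f2);
    % stage 2: U3..U6 depend only on t_0, U1, U2
    f3 = parfeval(pool, @U3, 1, t_0, h0, u0, U1v, U2v, parameters);
    f4 = parfeval(pool, @U4, 1, t_0, h0, u0, U1v, U2v, parameters);
    f5 = parfeval(pool, @U5, 1, t_0, h0, u0, U1v, U2v, parameters);
    f6 = parfeval(pool, @U6, 1, t_0, h0, u0, U1v, U2v, parameters);
    U3v = fetchOutputs(f3);  U4v = fetchOutputs(f4);
    U5v = fetchOutputs(f5);  U6v = fetchOutputs(f6);
    % stage 3: U7 and U8 depend on U1..U6
    f7 = parfeval(pool, @U7, 1, t_0, h0, u0, U1v, U2v, U3v, U4v, U5v, U6v);
    f8 = parfeval(pool, @U8, 1, t_0, h0, u0, U1v, U2v, U3v, U4v, U5v, U6v);
    U7v = fetchOutputs(f7);  U8v = fetchOutputs(f8);
    h_new = h0;                          % step-size control would go here
    t_0 = t_0 + h_new;
end
delete(pool);

Whether this pays off of course depends on how expensive each U_i is compared with the overhead of shipping its arguments to the workers.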
I'm trying to train Doc2Vec on multiple "documents" (here mostly in log format), and training takes longer when I specify more than one core (which I do have).
My data looks like this:
print(len(train_corpus))
7930196
print(train_corpus[:5])
[TaggedDocument(words=['port', 'ssh'], tags=[0]),
TaggedDocument(words=['session', 'initialize', 'by', 'client'], tags=[1]),
TaggedDocument(words=['dfs', 'fsnamesystem', 'block', 'namesystem', 'addstoredblock', 'blockmap', 'update', 'be', 'to', 'blk', 'size'], tags=[2]),
TaggedDocument(words=['appl', 'selfupdate', 'component', 'amd', 'microsoft', 'windows', 'kernel', 'none', 'elevation', 'lower', 'version', 'revision', 'holder'], tags=[3]),
TaggedDocument(words=['ramfs', 'tclass', 'blk', 'file'], tags=[4])]
I have 8 cores available:
print(os.cpu_count())
8
I am using gensim 4.1.2 on CentOS 7. Following this approach (stackoverflow.com/a/37190672/130288), it looks like my BLAS library is OpenBLAS, so I set OPENBLAS_NUM_THREADS=1 in my .bashrc (and it is visible from Jupyter, using !echo $OPENBLAS_NUM_THREADS).
This is my test code :
import time
from gensim.models.doc2vec import Doc2Vec

dict_time_workers = dict()
for workers in range(1, 9):
    model = Doc2Vec(vector_size=20,
                    min_count=1,
                    workers=workers,
                    epochs=1)
    model.build_vocab(train_corpus, update=False)
    t1 = time.time()
    model.train(train_corpus, epochs=1, total_examples=model.corpus_count)
    dict_time_workers[workers] = time.time() - t1
And the variable dict_time_workers is equal to:
{1: 224.23211407661438,
2: 273.408652305603,
3: 313.1667754650116,
4: 331.1840877532959,
5: 433.83785605430603,
6: 545.671571969986,
7: 551.6248495578766,
8: 548.430994272232}
As you can see, the time taken increases instead of decreasing. The results look the same with a larger epochs parameter.
Nothing else is running on this CentOS 7 machine.
If I look at what's happening on my threads using htop, I see that the right number of threads is used for each training run. But the more threads are used, the lower each one's utilization (for example, with only one thread about 95% is used, with 2 they each use around 65% of their maximum, with 6 threads only 20-25% ...). I suspected an IO issue, but iotop showed me that nothing bad is happening on the disk.
This question now seems to be related to this post:
Not efficiently to use multi-Core CPU for training Doc2vec with gensim.
When getting no benefit from extra cores like that, it's likely that the BLAS library you've got installed is already configured to try to use all cores for every bulk array operation. That means that other attempts to engage more cores, like Gensim's workers specification, just increase the overhead of contention, when each individual worker thread's individual BLAS callouts also try to use 8 threads.
Depending on the BLAS library in use, its own propensity to use more cores can typically be limited by environment variables named something like OPENBLAS_NUM_THREADS and/or MKL_NUM_THREADS.
If you set these to just 1 before your process launches, you may see different, and possibly better, multithreaded behavior.
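For example, a minimal sketch of pinning BLAS to a single thread from inside the process itself; this only works if the variables are set before numpy (and hence Gensim) is first imported, since BLAS reads them at load time:

import os

# must happen before numpy/gensim are imported
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

from gensim.models.doc2vec import Doc2Vec  # imported after the env vars on purpose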
Note, though: 1 just restores the assumption that every worker-thread only ever engages a single core. Some other mix of BLAS-cores & Gensim-worker-threads might actually achieve the best training throughput & non-contending core-utilization.
And, at least for Gensim workers, the actual thread count achieving the best throughput will vary based on other model parameters that influence the relative amount of calculation time spent in highly-parallelizable code blocks versus highly-contended blocks, especially window, vector_size, & negative. And there's not really a shortcut to finding the best workers value except via trial-and-error: observing reported training rates in logs over a few minutes of running. (Though: any rate observed in, say, minutes 2-4 of an abbreviated trial run should be representative of the training rate through the whole corpus over multiple epochs.)
(For any system with at least 4 cores, the optimal value with a classic iterable corpus of TaggedDocuments is usually at least 3, no more than the number of cores, but also rarely more than 8-12 threads, due to other inherent sources of contention from both Gensim's approach to fanning the work out among worker threads and the Python 'GIL'.)
Other thoughts:
the build_vocab() step is never multi-threaded, so benchmarking alternate workers values will give a truer readout of their effect by only timing the train() step
ensuring your iterable corpus does as little redundant work (like say IO & tokenization) on each pass can help limit any bottlenecks around the single manager thread doing each epoch's iteration & batching texts to the workers
the alternate corpus_file approach can achieve higher core utilization, up to any number of cores, by assigning each thread its own exclusive range of an input file. But it also means (a) your whole corpus must be in one uncompressed, space-tokenized plain-text file; (b) your documents only get a single integer tag (their line number); (c) you may be subject to some small not-yet-diagnosed-and-fixed bug(s). (See project issue #2747.)
OK, the best way to fully use the cores is to use the corpus_file parameter of Doc2Vec.
Doing the same benchmark, the results look like:
{1: 114.58889961242676, 2: 82.8250150680542, 3: 71.52109575271606, 4: 67.1010684967041, 5: 75.96869373321533, 6: 100.68377351760864, 7: 116.7901406288147, 8: 139.53436756134033}
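For reference, the corpus_file benchmark above was run along these lines (a sketch; the temporary file name and the way I dump the corpus are just examples):

import time
from gensim.models.doc2vec import Doc2Vec

# corpus_file expects one whitespace-tokenized document per line, uncompressed
with open('train_corpus.txt', 'w') as f:
    for doc in train_corpus:
        f.write(' '.join(doc.words) + '\n')

dict_time_workers = dict()
for workers in range(1, 9):
    model = Doc2Vec(vector_size=20, min_count=1, workers=workers, epochs=1)
    model.build_vocab(corpus_file='train_corpus.txt')
    t1 = time.time()
    model.train(corpus_file='train_corpus.txt',
                epochs=1,
                total_examples=model.corpus_count,
                total_words=model.corpus_total_words)
    dict_time_workers[workers] = time.time() - t1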
The threads now seem to be useful; in my case 4 is the best.
It is still strange that the "regular" Doc2Vec mode is not that great at parallelizing.
I have a function that uses values stored in one array to operate on another array, similar in behavior to numpy's histogram function. For example:
import numpy as np
from numba import jit
@jit(nopython=True)
def array_func(x, y, output_counts, output_weights):
    # bin each x value and accumulate both the count and the corresponding y weight
    for row in range(x.size):
        col = int(x[row] * 10)
        output_counts[col] += 1
        output_weights[col] += y[row]
    return (output_counts, output_weights)

# in the current code these arrays exist as pytorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for numpy arrays, it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current approach requires me to continually pass the data from the GPU to the CPU and then run the CPU function serially.
Can this function be converted to run in parallel on the GPU?
##EDIT##
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors, then there is no benefit to running it on the GPU (it actually kills the performance). I wonder if there is a way I can break up x so each value gets sent to a different thread (similar to how apply_async works within multiprocessing)?
I'm open to other options.
In its current form the function is fast, but the GPU-to-CPU data transfer is killing me.
Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU regarding the number of items to scan, the size of the histogram and the distribution of the values.
For example, one solution consists of building local histograms in separate kernel blocks and then performing a reduction. However, this solution is not well suited to your case, since len(x)/len(y) are relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solution only scales well if there are few atomic conflicts, which depends on the actual input data. Indeed, if all values of x are equal, then all updates are serialized, which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such a case is frequent for small histograms, but assuming the distribution is close to uniform, this can be fine.
This operation can be done with Numba using CUDA (targeting Nvidia GPUs). Here is an example of a kernel solving your problem:
from numba import cuda

@cuda.jit
def array_func(x, y, output_counts, output_weights):
    tx = cuda.threadIdx.x  # Thread id in a 1D block
    ty = cuda.blockIdx.x   # Block id in a 1D grid
    bw = cuda.blockDim.x   # Block width, i.e. number of threads per block
    pos = tx + ty * bw     # Compute flattened index inside the array
    if pos < x.size:
        col = int(x[pos] * 10)
        cuda.atomic.add(output_counts, col, 1)
        cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can be created directly on the GPU so as to avoid transfers. x and y should also be on the GPU for better performance (otherwise a CPU reduction will almost certainly be faster). Also note that the kernel should be pretty fast, so the overhead of launching it, waiting for it, and allocating/freeing temporary arrays may be significant and possibly even slower than the kernel itself (but certainly faster than doing a round trip to the CPU just to compute things there, assuming the data is already on the GPU). Note also that such atomic accesses are only fast on fairly recent Nvidia GPUs that have dedicated units for atomic operations.
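For illustration, here is a hypothetical way to launch this kernel with device-resident arrays (the 11-bin size follows from col = int(x * 10) with x in [0, 1]; the transfers at the top would disappear if the data already lives on the GPU, and Numba kernels should also accept PyTorch CUDA tensors directly via the __cuda_array_interface__ protocol):

import numpy as np
from numba import cuda

x = np.random.randint(0, 11, 1000) / 10
y = np.random.randint(0, 100, 1000).astype(np.float64)

d_x = cuda.to_device(x)
d_y = cuda.to_device(y)
d_counts = cuda.to_device(np.zeros(11))    # 11 bins for x in [0, 1]
d_weights = cuda.to_device(np.zeros(11))

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
array_func[blocks, threads_per_block](d_x, d_y, d_counts, d_weights)

output_counts = d_counts.copy_to_host()    # only copy back if needed on the CPU
output_weights = d_weights.copy_to_host()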
I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.
At each iteration, I have to work with two large M-by-n matrices (of doubles), A and B, where M = 1600000 and n = 500. Due to the size of the matrices and the memory capacity of each GPU (~8 GB each), I decompose the problem by partitioning the matrices row-wise into K chunks of smaller matrices that have the same number of columns, n, but fewer rows (M/K).
In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.
To achieve this, I tried using a parfor loop in Matlab (with the Parallel Computing Toolbox), following best practices such as slicing and loading only the relevant data on each worker. For posterity, here is a complete code snippet; I have provided smaller, decomposed problems further down in this post.
M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);
%delete(gcp('nocreate'));
%p = parpool('local',K);
tic
toc_load = zeros(K,1);
toc_compute = zeros(K,1);
toc_unload = zeros(K,1);
parfor j = 1:K
    tic
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    tic
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    tic
    B(j,:,:) = gather(B_blk);
    toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);
Using tic-toc timing (I run the code only after starting the parpool to make sure the timings are accurate), I found that each worker takes on average:
6.33 seconds to load the data onto the GPU
0.18 seconds to run the computations on the GPU
4.91 seconds to unload the data from the GPU.
However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.
I then tried a simple for loop without parallelization, see snippet below.
%% for loop
tic
for j = 1:K
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);
This time, running the entire code took only 27.96 seconds, so running it serially significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it is disappointing not to get a speedup from using all 4 at the same time.
From my experiments above, the actual computational cost of the GPU linear algebra appears low. The key bottleneck appears to be the time taken to load the data in parallel from the CPU onto the multiple GPUs and to gather it from the multiple GPUs back to the CPU, though it is also possible that some other factor is in play.
In light of this, I have the following questions:
What exactly is underlying the slowness of parfor? Why is the communication overhead (or whatever the underlying reason) so expensive?
How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and then back in Matlab? Are there tricks involving parfor, spmd (or other things such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup) ?
If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?
Thank you for any assistance!
Sending data to/from AWS instances for use with parfor is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there's additional overhead, since all communication with AWS workers uses secure communication.
You can use ticBytes and tocBytes to see how much data is being transferred.
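For example (a minimal usage sketch around the parfor loop from the question):

ticBytes(gcp);        % start measuring data transfer for the current pool
parfor j = 1:K
    % ... loop body as in the question ...
end
tocBytes(gcp)         % reports bytes sent to / received from each worker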
To improve the performance, I would suggest doing everything possible to avoid transferring large amounts of data between your client and the workers. It can often be more efficient to build data directly on the workers, even if this means building arrays redundantly multiple times.
Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using rand - which is easy to run on the cluster, but presumably not really representative.
Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with parallel.pool.Constant, or just do everything inside a single spmd block or something. (Your parfor loop as written could equally use spmd since you're arranging things to have one iteration per worker).
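For illustration, a minimal spmd sketch along those lines, assuming a 4-worker pool on a machine where all 4 GPUs are visible so each worker can select its own, and assuming each worker can generate or load its own chunk locally (randn stands in for the real data source, as in the question):

M = 1600000; K = 4; m = M/K; n = 500;
C = randn(n, 2);                       % small data: cheap to send from the client
% parpool('local', K);                 % assuming a 4-worker pool has been started
spmd
    gpuDevice(labindex);               % bind each worker to "its" GPU (1..K)
    A_blk = randn(m, n, 'gpuArray');   % in real code: load/build this worker's chunk
    B_blk = randn(m, n, 'gpuArray');   % here, not on the client
    D_blk = zeros(m, 2, 'gpuArray');
    C_blk = gpuArray(C);
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    B_part = gather(B_blk);            % only gather if the result is needed on the CPU
end
% B_part is a Composite: B_part{j} retrieves worker j's block on the client, and that
% retrieval is itself a transfer, so keep results on the workers when you can.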
As we know, a wavefront (AMD OpenCL) is very similar to a warp (CUDA): http://research.cs.wisc.edu/multifacet/papers/isca14-channels.pdf
GPGPU languages, like OpenCL™ and CUDA, are called SIMT because they
map the programmer’s view of a thread to a SIMD lane. Threads
executing on the same SIMD unit in lockstep are called a wavefront
(warp in CUDA).
It is also known that AMD suggests performing reduction (adding up numbers) using local memory, and suggests using vector types to accelerate the reduction: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/01/AMD_OpenCL_Tutorial_SAAHPC2010.pdf
But are there any optimized register-to-register data-exchange instructions between work-items (threads) in a wavefront:
such as int __shfl_down(int var, unsigned int delta, int width=warpSize); in WARP (CUDA): https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
or such as __m128i _mm_shuffle_epi8(__m128i a, __m128i b); SIMD-lanes on x86_64: https://software.intel.com/en-us/node/524215
This shuffle instruction can, for example, perform a reduction (add up the numbers) of 8 elements from 8 threads/lanes in 3 cycles, without any synchronization and without using any cache/local/shared memory (which has ~3 cycles of latency per access).
I.e. a thread sends its value directly to a register of another thread: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
In OpenCL we only have the instruction gentypen shuffle( gentypem x, ugentypen mask ), which works only on vector types such as float16/uint16 within each work-item (thread), not between work-items (threads) in a wavefront: https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/shuffle.html
Can we use something like shuffle() for register-to-register data exchange between work-items (threads) in a wavefront that is faster than data exchange via local memory?
Are there, in AMD OpenCL, instructions for intra-wavefront register-to-register data exchange, such as the intra-warp (CUDA) instructions __any(), __all(), __ballot(), __shfl()? http://on-demand.gputechconf.com/gtc/2015/presentation/S5151-Elmar-Westphal.pdf
Warp vote functions:
__any(predicate) returns non-zero if any of the predicates for the
threads in the warp returns non-zero
__all(predicate) returns non-zero if all of the predicates for the
threads in the warp returns non-zero
__ballot(predicate) returns a bit-mask with the respective bits
of threads set where predicate returns non-zero
__shfl(value, thread) returns value from the requested thread
(but only if this thread also performed a __shfl()-operation)
CONCLUSION:
As is known, OpenCL 2.0 has sub-groups with a SIMD execution model akin to wavefronts: Does the official OpenCL 2.2 standard support the WaveFront?
For sub-groups there are (page 160): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
int sub_group_all(int predicate) the same as CUDA-__all(predicate)
int sub_group_any(int predicate); the same as CUDA-__any(predicate)
But in OpenCL there are no functions similar to:
CUDA-__ballot(predicate)
CUDA-__shfl(value, thread)
There are only the Intel-specified built-in shuffle functions in Version 4, August 28, 2016 Final Draft OpenCL Extension #35: intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up, intel_sub_group_shuffle_xor: https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt
OpenCL also has functions that are usually implemented via shuffles, but they do not cover everything that can be implemented with shuffle functions (see the sketch after this list):
<gentype> sub_group_broadcast( <gentype> x, uint sub_group_local_id );
<gentype> sub_group_reduce_<op>( <gentype> x );
<gentype> sub_group_scan_exclusive_<op>( <gentype> x );
<gentype> sub_group_scan_inclusive_<op>( <gentype> x );
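For example, a minimal sketch of a per-sub-group (wavefront-like) reduction using these built-ins, assuming OpenCL C 2.0 with the cl_khr_subgroups extension available; whether it maps to registers or to local memory is left to the vendor's compiler, which is exactly the point:

#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void subgroup_sum(__global const float *in, __global float *out)
{
    const size_t gid = get_global_id(0);
    // every work-item of the sub-group must reach this call
    float partial = sub_group_reduce_add(in[gid]);
    if (get_sub_group_local_id() == 0)
        out[get_group_id(0) * get_num_sub_groups() + get_sub_group_id()] = partial;
}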
Summary:
Shuffle functions remain the more flexible option and ensure the fastest possible communication between threads via direct register-to-register data exchange.
But the functions sub_group_broadcast/_reduce/_scan do not guarantee direct register-to-register data exchange, and these sub-group functions are less flexible.
There is
gentype work_group_reduce<op> ( gentype x)
for version >=2.0
but its definition doesn't say anything about using local memory or registers. It just reduces each work-item's x value to a single sum. This function must be reached by all work-group items, so it is not a wavefront-level approach. Also, the order of floating-point operations is not guaranteed.
Maybe some vendors do it with registers while some use local memory; Nvidia does it with registers, I assume. But even an older mainstream AMD GPU has a local memory bandwidth of 3.7 TB/s, which is still a good amount (edit: it's not 22 TB/s). For 2k cores, this means nearly 1.5 bytes per cycle per core, or much faster per cache line.
For a 100% register version (if it does not spill to global memory), you can reduce the number of threads and do a vectorized reduction within each thread, without communicating with the others, if the number of elements is just 8 or 16. Such as
v.s0123 += v.s4567
v.s01 += v.s23
v.s0 += v.s1
which should be similar to __m128i _mm_shuffle_epi8 and its sum version when compiled on a CPU, and non-scalar implementations will use the same SIMD on a GPU to do these 3 operations.
Also, using these vector types tends to produce efficient memory transactions even for global and local memory, not just registers.
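As a sketch, the complete per-work-item version might look like this (assuming the input is laid out as one float8 per work-item; the kernel name is made up):

__kernel void per_item_vector_sum(__global const float8 *in, __global float *out)
{
    const size_t gid = get_global_id(0);
    float8 v = in[gid];        // one vectorized load
    v.s0123 += v.s4567;
    v.s01   += v.s23;
    v.s0    += v.s1;
    out[gid] = v.s0;           // per-item partial sum, no inter-item communication
}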
A SIMD unit works on only a single wavefront at a time, but a wavefront may be processed by multiple SIMD units, so this vector operation does not imply that a whole wavefront is being used. Or the whole wavefront may even be computing the 1st elements of all vectors in one cycle. For a CPU, however, the most logical option is for SIMD to compute work-items one by one (AVX, SSE) instead of computing them in parallel by their same-indexed elements.
If the main work-group doesn't fit one's requirements, there are child kernels to spawn, and dynamic-width kernels can be used for this kind of operation. A child kernel works concurrently on another group called a sub-group. This is done within a device-side queue and needs the OpenCL version to be at least 2.0.
Look for "device-side enqueue" in http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
The AMD APP SDK supports sub-groups.
There are 3 parts to my application:
A numerical simulator solving a 21-variable differential equation with the Runge-Kutta method, taken directly from Numerical Recipes in C; the step size is 0.0001 s
C code pinging a PIC-based microprocessor every 1 s and receiving data at about 3600 samples per second over the USB-COM port; it sends the relevant data to the front end over TCP/IP
A Java front end reading the data from the numerical simulator via SWIG (for the C code) and JNI, modifying the parameters with input from the microprocessor, and finally plotting it in the GUI.
I want to recode the JAVA front end in C++ now, with the option of using HTML/Javascript for plotting.
Would rewriting the front end in C++ so that the numerical simulator runs on a separate thread be a good approach?
I don't really understand threading, though I have used it for the listening and plotting functions in the Java code. It seems like having it all run on multiple threads instead of separate processes would slow down my simulations.
Can I combine 1, 2 and 3 into a single program, or should they remain separate to retain the 0.0001 s simulation step size and the ability to handle the large amount of microprocessor data?
Please help me pick a path forward!
Thanks in Advance!
On a multicore platform, multithreading will generally improve performance. However, general-purpose operating systems such as Linux and Windows are not deterministic, so there are no guarantees.
That said, the computational performance of a modern PC is such that it will hardly be stretched by this task and data rate, so it hardly matters perhaps?