Multithreading in Windows Forms

I wrote an app, a Caesar cipher in Windows Forms (C++/CLI), which calls dynamic-link libraries (one in C++ and one in ASM) containing my enciphering and deciphering algorithms for the model. That part of my app works.
The app also does multithreading from Windows Forms. The user can choose the number of threads (1-64). If they choose 2, the message to encipher (or decipher) is divided into two substrings, which are handed to two threads. I want these threads to execute in parallel and ultimately reduce the execution time.
When the user presses the encipher or decipher button, the enciphered or deciphered text is displayed along with the execution times of the C++ and ASM functions. Everything works, except that the times for more than one thread aren't smaller; they are bigger.
Here is some of the code:
/* Splits the message into substrings and hands each one to a thread */
array<String^>^ ThreadEncipherFuncCpp(int nThreads, string str2)
{
    // Array of per-thread results
    array<String^>^ arrayOfThreads = gcnew array<String^>(nThreads);
    // Holds the n-th part of the message to process
    string loopSubstring;
    // Length of each substring of the message
    int numberOfSubstring = str2.length() / nThreads;
    int isModulo = str2.length() % nThreads;
    array<Thread^>^ xThread = gcnew array<Thread^>(nThreads);
    for (int i = 0; i < nThreads; i++)
    {
        if (i == 0 && numberOfSubstring != 0)
            loopSubstring = str2.substr(0, numberOfSubstring);
        else if ((i == nThreads - 1) && numberOfSubstring != 0) {
            if (isModulo != 0)
                loopSubstring = str2.substr(numberOfSubstring * i, numberOfSubstring + isModulo);
            else
                loopSubstring = str2.substr(numberOfSubstring * i, numberOfSubstring);
        }
        else if (numberOfSubstring == 0) {
            loopSubstring = str2.substr(0, isModulo);
            i = nThreads - 1;
        }
        else
            loopSubstring = str2.substr(numberOfSubstring * i, numberOfSubstring);
        ThreadExample::inputString = gcnew String(loopSubstring.c_str());
        xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
        xThread[i]->Start();
        xThread[i]->Join();
        arrayOfThreads[i] = ThreadExample::outputString;
    }
    return arrayOfThreads;
}
Here is the fragment responsible for measuring the time of the C++ version:
/*****************C++***************/
auto start = chrono::high_resolution_clock::now();
array<String^>^ arrayOfThreads = ThreadEncipherFuncCpp(nThreads, str2);
auto elapsed = chrono::high_resolution_clock::now() - start;
// note: the duration is cast to microseconds, so name it accordingly
long long microseconds = chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
this->label4->Text = Convert::ToString(microseconds) + " microseconds";
String^ str4 = String::Concat(arrayOfThreads);
this->textBox2->Text = str4;
/**********************************/
And an example of it working:
For input data: "Some example text. Some example text2."
Program will display: "Vrph hadpsoh whaw. Vrph hadpsoh whaw2."
Times of execution for 1 thread:
C++ time: 31231us.
Asm time: 31212us.
Times of execution for 2 threads:
C++ time: 62488us.
Asm time: 62505us.
Times of execution for 4 threads:
C++ time: 140254us.
Asm time: 124587us.
Times of execution for 32 threads:
C++ time: 1002548us.
Asm time: 1000020us.
How can I solve this problem?
I need to keep this program structure; this is an academic project.
My CPU has 4 cores.

The reason it's not going any faster is that you aren't letting your threads run in parallel.
xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
xThread[i]->Start();
xThread[i]->Join();
These three lines create the thread, start it running, and then wait for it to finish. You're not getting any parallelism here, you're just adding the overhead of spawning & waiting for threads.
If you want to have a speedup from multithreading, the way to do it is to start all the threads at once, let them all run, and then collect up the results.
In this case, I'd make it so that ThreadEncipher (which you haven't shown us the source of, so I'm making assumptions) takes a parameter, which is used as an array index. Instead of having ThreadEncipher read from inputString and write to outputString, have it read from & write to one index of an array. That way, each thread can read & write at the same time. After you've spawned all these threads, then you can wait for all of them to finish, and you can either process the output array, or since array<String^>^ is already your return type, just return it as-is.
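For illustration, here is a minimal sketch of that restructuring (it assumes ThreadEncipher is changed to take an Object^ index parameter via ParameterizedThreadStart, and that ThreadExample keeps per-thread inputStrings/outputStrings arrays instead of a single shared string pair; those names, and the GetSubstringForThread helper, are my assumptions, not your actual code):
array<Thread^>^ xThread = gcnew array<Thread^>(nThreads);
for (int i = 0; i < nThreads; i++)
{
    ThreadExample::inputStrings[i] = GetSubstringForThread(i); // hypothetical helper
    xThread[i] = gcnew Thread(gcnew ParameterizedThreadStart(&ThreadExample::ThreadEncipher));
    xThread[i]->Start(i); // start every thread; do NOT join here
}
for (int i = 0; i < nThreads; i++)
    xThread[i]->Join();   // wait for all of them only after all have started
return ThreadExample::outputStrings; // one result slot per thread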
Other thoughts:
You've got a mix of unmanaged and managed objects here. It will be better if you pick one and stick with it. Since you're in C++/CLI, I'd recommend that you stick with the managed objects. I'd stop using std::string, and use System::String^ exclusively.
Since your CPU has 4 cores, you're not going to get any speedup by using more than 4 threads. Don't be surprised when 32 threads take longer than 4, because you're doing 8x the string manipulation calls, and you've got 32 threads fighting over 4 processor cores.
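If you want to cap the thread count at what the hardware can actually run, .NET exposes the logical processor count directly; a one-line sketch:
int maxUseful = Environment::ProcessorCount; // logical processors visible to .NET
if (nThreads > maxUseful)
    nThreads = maxUseful;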
Your string splitting code is more complex than it needs to be. You've got five different cases in there, I'd have to sit down and think about it for a while to be sure it's correct. Try this:
int totalLen = str2->Length;
for (int i = 0; i < nThreads; i++)
{
    int startIndex = totalLen * i / nThreads;
    int endIndex = totalLen * (i + 1) / nThreads;
    int substrLen = endIndex - startIndex;
    String^ substr = str2->Substring(startIndex, substrLen);
    ...
}

Related

Most efficient way to spawn n pthreads with the same parameters in C

I have 32 threads whose input parameters I know ahead of time; nothing changes inside the function (other than the memory buffer that each thread interacts with).
In pseudo C code this is my design pattern:
// declare 32 pthreads as global variables
void dispatch_32_threads() {
    for (int i = 0; i < 32; i++) {
        pthread_create(&thread_id[i], NULL, thread_function, (void*) thread_params[i]);
    }
    // wait until all 32 threads are finished
    for (int j = 0; j < 32; j++) {
        pthread_join(thread_id[j], NULL);
    }
}

int main(void) {
    // init 32 pthreads here
    for (int n = 0; n < 4000; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                dispatch_32_threads();
                // modify buffers here
            }
        }
    }
}
I am calling dispatch_32_threads 100*100*4000 = 40,000,000 times. thread_function and (void*) thread_params[i] do not change. I think pthread_create keeps creating and destroying threads. I have 32 cores; none of them is at 100% utilization, they hover around 12%. Moreover, when I reduce the number of threads to 10, all 32 cores remain at 5-7% utilization, and I see no slowdown in runtime. Running fewer than 10 slows things down.
Running 1 thread, however, is extremely slow, so multithreading is helping. I profiled my code, so I know it's thread_func that is slow, and thread_func is parallelizable. This leads me to believe that pthread_create keeps spawning and destroying threads on different cores, that after 10 threads I lose efficiency and it gets slower, and that thread_func is in essence "less complicated" than spawning more than 10 threads.
Is this assessment true? What is the best way to utilize 100% of all cores?
Thread creation is expensive. It depends on several parameters, but is rarely below 1000 cycles, and thread synchronization and destruction cost about the same. If the amount of work in your thread_function is not very high, this overhead will largely dominate the computation time.
It is rarely a good idea to create threads in the inner loops. Probably the best approach is to create threads to process iterations of the outer loop. Depending on your program and on what thread_function does, there may be dependencies between iterations, and this may require some rewriting, but a solution could be:
int outer = 4000;
int nthreads = 32;
int perthread = outer / nthreads;

// add an integer thread_id to the thread_params struct
void thread_func(whatisrequired *thread_params) {
    // runs perthread iterations of the outer loop, beginning at this
    // thread's own starting index (multiply by perthread so that the
    // threads' slices do not overlap)
    int start = thread_params->thread_id * perthread;
    for (int n = start; n < start + perthread; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                // do the work
            }
        }
    }
}

int main() {
    for (int i = 0; i < 32; i++) {
        thread_params[i]->thread_id = i;
        pthread_create(&thread_id[i], NULL, thread_func,
                       (void*) thread_params[i]);
    }
    // wait until all 32 threads are finished
    for (int j = 0; j < 32; j++) {
        pthread_join(thread_id[j], NULL);
    }
}
For this kind of parallelization, you can also consider using OpenMP. The parallel for clause will let you easily experiment to find the best parallelization scheme.
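As a rough sketch of that idea (assuming the iterations are independent; compile with -fopenmp on GCC):
#include <omp.h>

void process_all(void) {
    // the OpenMP runtime creates its thread pool once and splits the
    // outer loop across it; nothing is spawned inside the inner loops
    #pragma omp parallel for num_threads(32)
    for (int n = 0; n < 4000; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                // do the work
            }
        }
    }
}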
If there are dependencies and such an obvious parallelization is not possible, you can create threads at program start and give them work through a thread pool. Managing queues is less expensive than thread creation (but atomic accesses do have a cost).
Edit: Alternatively, you can (see the sketch after this list):
1. Put all your loops in the thread function.
2. At the start (or the end) of the inner loop, add a barrier to synchronize your threads. This will ensure that all threads have finished their job.
3. In main, create all the threads and wait for completion.
Barriers are less expensive than thread creation and the result will be identical.
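A minimal sketch of that barrier variant (assuming the 4000 outer iterations divide evenly among the threads and that each round must finish before the next begins):
#include <pthread.h>

#define NTHREADS 32
#define OUTER    4000

pthread_barrier_t barrier;

void *worker(void *arg) {
    int id = *(int *)arg;
    // each thread takes every NTHREADS-th outer iteration
    for (int n = id; n < OUTER; n += NTHREADS) {
        /* do the work for iteration n here */
        pthread_barrier_wait(&barrier); /* all threads sync each round */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int j = 0; j < NTHREADS; j++)
        pthread_join(tid[j], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}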

How to count branch mispredictions?

I've got a task to count the branch misprediction penalty (in ticks), so I wrote this code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    unsigned long long start, end;
    FILE *f;
    f = fopen("output", "w");
    long long int k = 0;
    unsigned long long min;
    int n = atoi(argv[1]); // n1 = atoi(argv[2]);
    for (int i = 1; i <= n + 40; i++) {
        min = 9999999999999ULL;
        for (int r = 0; r < 1000; r++) {
            start = rdtsc();
            for (long long int j = 0; j < 100000; j++) {
                if (j % i == 0) {
                    k++;
                }
            }
            end = rdtsc();
            if (min > end - start) min = end - start;
        }
        fprintf(f, "%d %llu \n", i, min);
    }
    fclose(f);
    return 0;
}
(rdtsc is a function that measures time in ticks)
The idea of this code is that it periodically (with period equal to i) takes the branch (if (j % i == 0)), so at some point it starts causing mispredictions. The rest of the code is mostly repeated measurement, which I need to get more precise results.
Tests show that branch mispredictions start to happen around i = 47, but I do not know how to count the exact number of mispredictions, and from that the exact number of ticks. Can anyone explain to me how to do this without using any side programs like VTune?
It depends on the processor you're using. In general, cpuid can be used to obtain a lot of information about the processor, and what cpuid does not provide is typically accessible via SMBIOS or other regions of memory.
Doing this in code at a general level, without the processor's support functions and manual, will not tell you as much as you want with a great degree of certainty, but it may be useful as an estimate depending on what you're looking for and how you have your code compiled, e.g. the flags you use during compilation.
In general, speculative execution is typically not observed by programs: logic that transitions through the pipeline and is determined not to be used is discarded.
Depending on how you use specific instructions in your program, you may be able to use such stale cache information for better or worse, but the logic therein would vary greatly depending on the CPU in use.
See also Spectre and RowHammer for interesting examples of using such techniques for privileged execution.
See the comments below for links which have code related to the use of cpuid as well as rdrand, rdseed, and a few others (rdtsc).
It's not completely clear what you're looking for, perhaps, but this will surely get you started and provide some useful examples.
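For instance, a minimal sketch of querying cpuid with GCC/Clang's <cpuid.h> helper (an illustration only, not a misprediction counter):
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    // leaf 0 returns the highest supported leaf and the vendor string
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        char vendor[13];
        // the vendor string is packed into ebx, edx, ecx, in that order
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        vendor[12] = '\0';
        printf("max cpuid leaf: %u, vendor: %s\n", eax, vendor);
    }
    return 0;
}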
See also Branch mispredictions

Optimize the Buddhabrot

I am currently working on my own implementation of the Buddhabrot. So far I am using the std::thread class from C++11 to work concurrently through the following iteration:
void iterate(float *res){
    // generate starting point
    std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(-1.5, 1.5);
    double ReC, ImC;
    double ReS, ImS, ReS_;
    unsigned int steps;
    unsigned int visitedPos[maxCalcIter];
    unsigned int succSamples(0);
    // iterate over it
    while (succSamples < samplesPerThread){
        steps = 0;
        ReC = distribution(generator) - 0.4;
        ImC = distribution(generator);
        double p(sqrt((ReC - 0.25)*(ReC - 0.25) + ImC*ImC));
        while ((((ReC + 1)*(ReC + 1) + ImC*ImC) < 0.0625) || (ReC < p - 2*p*p + 0.25)){
            ReC = distribution(generator) - 0.4;
            ImC = distribution(generator);
            p = sqrt((ReC - 0.25)*(ReC - 0.25) + ImC*ImC);
        }
        ReS = ReC;
        ImS = ImC;
        for (unsigned int j = maxCalcIter; (ReS*ReS + ImS*ImS < 4) && (j--); ){
            ReS_ = ReS;
            ReS *= ReS;
            ReS += ReC - ImS*ImS;
            ImS *= 2*ReS_;
            ImS += ImC;
            if ((ReS + 0.5)*(ReS + 0.5) + ImS*ImS < 4){
                visitedPos[steps] = int((ReS + 2.5)*0.25*outputSize)*outputSize + int((ImS + 2)*0.25*outputSize);
            }
            steps++;
        }
        if ((steps > minCalcIter) && (ReS*ReS + ImS*ImS > 4)){
            succSamples++;
            for (int j = steps; j--; ){
                //std::cout << visitedPos[j] << std::endl;
                res[visitedPos[j]]++;
            }
        }
    }
}
So basically each thread keeps working until it has generated enough trajectories of sufficient length, which in expectation takes the same time in every thread.
But I really have the feeling that this function might be very unoptimized, since its code is so very readable. Can anybody come up with some fancy optimizations? For compiling, I just use:
g++ -O4 -std=c++11 -I/usr/include/OpenEXR/ -L/usr/lib64/ -lHalf -lIlmImf -lm buddha_cpu.cpp -o buddha_cpu
So any hints on crunching some more numbers/sec would be really appreciated. Also any links to further literature are totally welcome.
Did you check that -O4 is faster than -O2? Above -O2, it's not a sure thing.
Also, if this compilation is only for you, try -march=native. This will take advantage of your specific CPU architecture, but the resulting binary might crash with SIGSEGV on older/different machines.
You did not show any threads, if I see correctly. Make sure your threads do not write memory locations of the same cache line. Writing memory locations in the same cache line from different threads forces the CPU cores to synchronize their caches, which is a huge performance degradation (false sharing).
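For example, a minimal sketch of giving each thread its own cache-line-sized slot (the names and the 64-byte line size are assumptions; check your target CPU):
constexpr int kNumThreads = 4;

// one accumulator per thread, padded to a full 64-byte cache line so
// that writes from different threads never land on the same line
struct alignas(64) PaddedAccum {
    unsigned long long value = 0;
};

PaddedAccum partial[kNumThreads]; // thread i writes only partial[i].value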

OpenCL float sum reduction

I would like to apply a reduction to this piece of my kernel code (1-dimensional data):
__local float sum = 0;
int i;
for (i = 0; i < length; i++)
    sum += // some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length), and at the end have 1 thread make the total sum.
In pseudo code, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
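For instance, the final host-side step could be as simple as this sketch (assuming the per-group results were read back into partial_host, one float per work-group; the names are illustrative):
// add up the per-work-group partial sums returned by the kernel
float total = 0.0f;
for (size_t g = 0; g < num_groups; g++)
    total += partial_host[g];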
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added an extra kernel argument for the output. Added a dot product to sum over the float4 vectors.
__kernel void reduction_vector(__global float4* data, __local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);

    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = group_size/2; i > 0; i >>= 1) {
        if (lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also that it's tricky, if not impossible, to debug on the GPU. Eventually, though, I found an answer which worked.
This is a combination of the answer by Bruce and the one from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU; alternatively, call it from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank conflict issues by reading from local memory essentially sequentially. Would love confirmation on that from anyone who knows.
Note: the global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code, where it's used as part of an array index.
__kernel void Sum(__global float* A, __global float* output, ulong AOffset, __local float* target) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId + AOffset];
    barrier(CLK_LOCAL_MEM_FENCE);

    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize > 0) {
        if (localId < halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize * 2) < blockSize) { // uneven block division
                if (localId == 0) { // when localId==0
                    target[localId] += target[localId + (blockSize - 1)];
                }
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId == 0) {
        output[get_group_id(0)] = target[0];
    }
}
https://pastebin.com/xN4yQ28N
You can use the new work_group_reduce_add() function for a sum reduction inside a single work-group, if you have support for OpenCL C 2.0 features.
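A minimal sketch of that approach (assuming the kernel is built with -cl-std=CL2.0 and the device supports it; the names are illustrative):
// each work-item contributes one element; work_group_reduce_add
// returns the group-wide sum to every work-item in the work-group
__kernel void sum_wg(__global const float* in, __global float* out) {
    float group_sum = work_group_reduce_add(in[get_global_id(0)]);
    if (get_local_id(0) == 0)
        out[get_group_id(0)] = group_sum; // one partial sum per group
}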
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid] + arVal[gid + offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddling with "m" and "n" is to handle non-power-of-two arrays; for example, t = 5 folds as 5 -> 3 -> 2 -> 1. The advantages:
- Easy for OpenCL to parallelize well on any GPU platform (i.e. fast).
- Low memory use, because it works in place.
- Works efficiently with non-power-of-two data sizes.
- Flexible: e.g., you can change the kernel to do "min" instead of "+".
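For instance, the same fold kernel switched to "min" is a one-line change (a sketch using OpenCL's built-in fmin):
__kernel void foldMinKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = fmin(arVal[gid], arVal[gid + offset]);
}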

how to generate a delay

I'm new to kernel programming and I'm trying to understand some basics of the OS. I am trying to generate a delay using a technique I've implemented successfully on a 20 MHz microcontroller.
I know this is a totally different environment, as I'm using Linux CentOS on my 2 GHz Core 2 Duo processor.
I've tried the following code, but I'm not getting a delay.
#include <linux/kernel.h>
#include <linux/module.h>

int init_module(void)
{
    unsigned long int i, j, k, l;
    for (l = 0; l < 100; l++)
    {
        for (i = 0; i < 10000; i++)
        {
            for (j = 0; j < 10000; j++)
            {
                for (k = 0; k < 10000; k++);
            }
        }
    }
    printk("\nhello\n");
    return 0;
}

void cleanup_module(void)
{
    printk("bye");
}
When I run dmesg after inserting the module, as quickly as I can, the string "hello" is already there. If my calculation is right, the above code should give me at least a 10-second delay.
Why is it not working? Is there anything related to threading? How could a 2 GHz processor execute the above code instantly, without any noticeable delay?
The compiler is optimizing your loop away since it has no side effects.
To actually get a 10 second (non-busy) delay, you can do something like this:
#include <linux/sched.h>
//...
unsigned long to = jiffies + (10 * HZ); /* current time + 10 seconds */
while (time_before(jiffies, to))
{
    schedule();
}
or better yet:
#include <linux/delay.h>
//...
msleep(10 * 1000);
For short delays you may use mdelay, ndelay, and udelay.
I suggest you read Linux Device Drivers, 3rd edition, chapter 7.3, which deals with delays, for more information.
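Putting it together, a minimal sketch of the module with the busy loop replaced by msleep (using the modern module_init/module_exit macros; the function names are mine):
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/delay.h>

static int __init delay_init(void)
{
    msleep(10 * 1000); /* sleep, not busy-wait, for 10 seconds */
    printk(KERN_INFO "hello\n");
    return 0;
}

static void __exit delay_exit(void)
{
    printk(KERN_INFO "bye\n");
}

module_init(delay_init);
module_exit(delay_exit);
MODULE_LICENSE("GPL");
Note that insmod will block for the 10 seconds while delay_init sleeps.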
To answer the question directly: it's likely your compiler sees that these loops don't do anything and "optimizes" them away.
As for this technique, what it looks like you're trying to do is use all of the processor to create a delay. While this may work, an OS should be designed to maximize processor utilization, and this will just waste it.
I understand it's experimental, but just a heads up.
