Easiest way to make basic OpenMP like library - multithreading

I would like to make a basic library with some very basic features of OpenMP. For example, just to be able to write code like the example below. I wonder how I can use LLVM and pthreads to accomplish this.
I guess there are three steps:
Preprocessing to figure out the parallel task (parallel vs parallel-for)
Convert appropriate code blocks to a void* function needed for pthreads
Automation to create, run and join threads
Example code:
#Our_Own parallel
{
    printf("hello");
}

#Our_Own parallel for
for (int i = 0; i < 1000; i++)
{
    printf("%d\n", i);
}

There is no need to use pragmas to implement parallel loops (or tasks) in C++. You can see this in the implementation of Threading Building Blocks (TBB), or Kokkos, both of which provide parallelism without using pragmas. As a consequence, there is no need to make any compiler modifications to do this!
The key observation here is that C++ provides lambdas which allow you to abstract a chunk of code into an anonymous function and to bind the appropriate variables from the context so that it can later be invoked in some other thread.
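For instance, here is a minimal sketch of that idea using std::thread (my_parallel is an invented name, not part of any library): the lambda is the "parallel region", and the helper fans it out across threads and joins them.

#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical helper, for illustration only: run the given chunk of
// code on nthreads threads, passing each its thread id, then join.
template <typename Body>
void my_parallel(int nthreads, Body body) {
    std::vector<std::thread> threads;
    for (int tid = 0; tid < nthreads; ++tid)
        threads.emplace_back(body, tid);
    for (auto &t : threads)
        t.join();
}

int main() {
    my_parallel(4, [](int tid) {   // the lambda is the "parallel region"
        std::printf("hello from thread %d\n", tid);
    });
    return 0;
}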
Even if you do want to map to pragmas, for instance to provide your own, "improved" version of OpenMP, you can do that without using anything more than C macros, by using the _Pragma directive, which can be placed inside a macro, something like this:
#include <stdio.h>
#include <omp.h>

#define STRINGIFY1(...) #__VA_ARGS__
#define STRINGIFY(...) STRINGIFY1(__VA_ARGS__)

#define MY_PARALLEL _Pragma("omp parallel")
#define my_threadID() omp_get_thread_num()

int main (int, char **)
{
    MY_PARALLEL
    {
        printf ("Hello from thread %d\n", my_threadID());
    }
    return 0;
}
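As a sketch of why the STRINGIFY macros are defined above: _Pragma needs a string literal, and stringification lets macro arguments (such as a clause list) be pasted into it. MY_PARALLEL_FOR is an invented name, and this assumes the includes and macros from the snippet above:

// Sketch only: STRINGIFY turns the token sequence, including any
// clause arguments, into the string literal _Pragma requires.
#define MY_PARALLEL_FOR(...) _Pragma(STRINGIFY(omp parallel for __VA_ARGS__))

// Usage (inside a function):
MY_PARALLEL_FOR(schedule(static))
for (int i = 0; i < 1000; i++)
    printf("%d\n", i);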
However, we're rather in the dark about what you are really trying to achieve, and in what context:
Since OpenMP implementations almost all sit on top of pthreads, why do you need something different?
Which language is this for? (C, C++, other?)
What is the reason to avoid using existing implementations (such as TBB, RAJA, Kokkos, C++ Parallel Algorithms)?
Remember, "The best code is the code I do not have to write".
(P.S. If you want to see the kind of thing you are taking on, look at the Little OpenMP runtime, which implements some (not all) of the CPU OpenMP requirements, and the associated book.)

DISCLAIMER: This is a very hard task, and there are many hidden subtasks involved (how to access variables and set up their sharing attributes, implicit barriers, synchronization, etc.). I do not claim that you can solve all of them using the idea described below.
If you use a (Python) script to preprocess your code, here is a very minimal example of how to start in C (C++ may be a bit easier because of lambda functions).
Use macro definitions (preferably put them in a header file):
// Macro definitions (in own.h)
#include <pthread.h>
#include <stddef.h>

#define OWN_PARALLEL_BLOCK_START(f) \
void *f(void *data)                 \
{

#define OWN_PARALLEL_BLOCK_END \
    return NULL;               \
}

#define OWN_PARALLEL_MAIN(f)                          \
{                                                     \
    int THREAD_NUM = 4;                               \
    pthread_t thread_id[THREAD_NUM];                  \
    for (int i = 0; i < THREAD_NUM; i++) {            \
        pthread_create(&thread_id[i], NULL, f, NULL); \
    }                                                 \
    for (int i = 0; i < THREAD_NUM; i++) {            \
        pthread_join(thread_id[i], NULL);             \
    }                                                 \
}
Your (Python) script should convert this:
int main(){
    #pragma own parallel
    {
        printf("Hello world\n");
    }
}
to the following:
#include <stdio.h>
#include "own.h"

OWN_PARALLEL_BLOCK_START(ThreadBlock_1)
{
    printf("Hello world\n");
}
OWN_PARALLEL_BLOCK_END

int main(){
    OWN_PARALLEL_MAIN(ThreadBlock_1)
}
Check it on Compiler Explorer

I would like to make a basic library with some very basic features of OpenMP.
No, you really wouldn't. At minimum, because nothing of the sort you are considering can accurately be described as "basic" or as (only) a "library".
There are multiple good alternatives already available, not least OpenMP itself. Use one of those. Or if you insist on rolling your own then do so with the understanding that you are taking on a major project.
For example, just to be able to write code like the example below. I wonder how I can use LLVM and pthreads to accomplish this. I guess there are three steps:
Preprocessing to figure out the parallel task (parallel vs parallel-for)
Yes, but not with the C preprocessor. It is not nearly powerful enough. You need a source-to-source translator that has sufficient semantic understanding of the target language (C? C++?) to recognize your annotations and the source constructs to which they apply, and to perform the needed transformations and insert the necessary tooling.
I think the LLVM project contains pieces that would help with this (so you probably don't need to write a C parser from scratch, for example).
Convert appropriate code blocks to a void* function needed for pthreads
Yes, that would be among the transformations required.
Automation to create, run and join threads
Yes, that would also be among the transformations required.
But those are not all. For the parallel for case, for example, you also need to account for splitting the loop iterations among multiple threads. You probably also need a solution for recognizing and managing access to shared variables.
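To make that concrete, here is a minimal sketch (all names invented) of the kind of code a translator would have to emit for a parallel for: the iteration space is split into contiguous chunks, one per pthread, i.e. a static schedule.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N 1000

// Sketch only: each thread receives a contiguous slice of the
// 0..N-1 iteration space.
typedef struct { int begin; int end; } range_t;

static void *loop_body(void *arg)
{
    range_t *r = (range_t *)arg;
    for (int i = r->begin; i < r->end; i++)
        printf("%d\n", i);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    range_t ranges[NUM_THREADS];
    int chunk = (N + NUM_THREADS - 1) / NUM_THREADS;  // round up
    for (int t = 0; t < NUM_THREADS; t++) {
        ranges[t].begin = t * chunk;
        ranges[t].end = (t + 1) * chunk < N ? (t + 1) * chunk : N;
        pthread_create(&tid[t], NULL, loop_body, &ranges[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}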

Related

Openmp lock alternative that allows concurrent reads until a thread tries to write?

I want to replace:
omp_set_lock(&bestTimeSeenSoFar_lock);
temp_bestTimeSeenSoFar = bestTimeSeenSoFar; // this is a read
omp_unset_lock(&bestTimeSeenSoFar_lock);
...
omp_set_lock(&bestTimeSeenSoFar_lock);
// update/write bestTimeSeenSoFar
omp_unset_lock(&bestTimeSeenSoFar_lock);
with code that will allow multiple threads to be reading the variable at once UNLESS a thread is trying to write, in which case they wait until the write is done. Help?
What about using something like this?
#pragma omp flush( bestTimeSeenSoFar )
#pragma omp atomic read
temp_bestTimeSeenSoFar = bestTimeSeenSoFar;
...
#pragma omp atomic write
bestTimeSeenSoFar = whatever;
#pragma omp flush( bestTimeSeenSoFar )
My reading of the OpenMP standard, chapter 2.12.6 dealing with atomic, doesn't permit me to decide whether this will do exactly what you want, but it is the best / closest I can come up with. Moreover, even if this might work in theory, it will be highly dependent on the quality of the implementation of this feature in your compiler. So it not working for you won't necessarily imply that the idea is wrong.
Anyway, I would encourage you to give it a try and, please, to report whether it works for you.
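If you are willing to step outside OpenMP's own API, a readers-writer lock gives exactly the requested semantics: many concurrent readers, but a writer gets exclusive access. Here is a minimal sketch assuming C++17 is available (the double type and the helper names are invented for illustration; only bestTimeSeenSoFar comes from your code):

#include <mutex>
#include <shared_mutex>

std::shared_mutex bestTime_mutex;   // invented name
double bestTimeSeenSoFar;           // type assumed for illustration

double read_best_time()
{
    std::shared_lock<std::shared_mutex> lock(bestTime_mutex); // shared: readers may overlap
    return bestTimeSeenSoFar;
}

void write_best_time(double t)
{
    std::unique_lock<std::shared_mutex> lock(bestTime_mutex); // exclusive: blocks all readers
    bestTimeSeenSoFar = t;
}

In plain C with pthreads, pthread_rwlock_t provides the same read/write distinction.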

vector::empty() function doesn't work correctly in release mode

#include <iostream>
#include <vector>
#include <thread>
#include <string>
using namespace std;

vector<string> s;

void add()
{
    while (true)
    {
        getchar();
        s.push_back("added");
    }
}

void show()
{
    while (true)
    {
        //cout << "";
        while (!s.empty())
        {
            cout << (*s.begin()) << endl;
            s.erase(s.begin());
        }
    }
}

int main()
{
    thread one(add);
    thread two(show);
    one.join();
    two.join();
}
In debug mode there is no such problem. In release mode, if the commented-out line is uncommented, it works again. But as written, there is a problem. What is the problem?
std::vector (like any other std:: container) is not thread-safe by default. This means that concurrent modifying access to the same vector from multiple threads is not supported. While you can call non-modifying functions of the vector from many threads at the same time (for instance, you can call begin() and end() with no problems), modifying functions require exclusive access to the vector object. To achieve this exclusivity, you need to use thread-synchronization primitives to 'signal' your intention to obtain exclusive access to the vector, perform your modification, and then 'signal' that exclusive access is no longer needed.
Note, it is not enough to perform this routine only when you modify (insert) data into the vector. You will also have to do the same dance when you read data from the vector, since modifications need exclusive access, and even a read would violate that exclusivity. The non-technical term I've used here, 'signalling', has a technical counterpart: it is called a critical section. We say that you 'enter the critical section' and 'leave the critical section'.
There is more than one way to enter and leave a critical section. The staples here are so-called mutexes, and they should be enough for your learning. Just keep in mind there are other ways as well, which you'll learn in due course.
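As a minimal sketch of the above applied to your code (std::mutex plus std::lock_guard; the locking is the point, the structure is kept as-is), every access to s now happens inside a critical section:

#include <iostream>
#include <vector>
#include <thread>
#include <string>
#include <mutex>
using namespace std;

vector<string> s;
mutex s_mutex;   // guards every access to s

void add()
{
    while (true)
    {
        getchar();
        lock_guard<mutex> lock(s_mutex);  // enter critical section
        s.push_back("added");
    }                                     // leave critical section (lock destroyed)
}

void show()
{
    while (true)
    {
        lock_guard<mutex> lock(s_mutex);
        while (!s.empty())
        {
            cout << (*s.begin()) << endl;
            s.erase(s.begin());
        }
    }
}

int main()
{
    thread one(add);
    thread two(show);
    one.join();
    two.join();
}

A condition variable would avoid show() spinning while the vector is empty, but the locking alone is what removes the data race.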

Use of Tcl in c++ multithreaded application

I am facing a crash while trying to create one Tcl interpreter per thread. I am using Tcl version 8.5.9 on Linux RH6. It crashes in a different function each time, which looks like some kind of memory corruption. Searching the net suggests this is a valid approach. Has anybody faced a similar issue? Does multi-threaded use of Tcl need any kind of special support?
Here is a small program that reproduces the crash with Tcl version 8.5.9:
#include <tcl.h>
#include <pthread.h>
#include <unistd.h>

void *run(void *)
{
    Tcl_Interp *interp = Tcl_CreateInterp();
    sleep(1);
    Tcl_DeleteInterp(interp);
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, run, NULL);
    pthread_create(&t2, NULL, run, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
The default Tcl library isn't built thread-enabled (well, not in 8.5.9 AFAIK; 8.6 is).
So did you check that your tcl lib was built thread enabled?
If you have a tclsh built against the lib, you can simply run:
% parray ::tcl_platform
::tcl_platform(byteOrder) = littleEndian
::tcl_platform(machine) = intel
::tcl_platform(os) = Windows NT
::tcl_platform(osVersion) = 6.2
::tcl_platform(pathSeparator) = ;
::tcl_platform(platform) = windows
::tcl_platform(pointerSize) = 4
::tcl_platform(threaded) = 1
::tcl_platform(wordSize) = 4
If ::tcl_platform(threaded) is 0, your build isn't thread enabled. You would need to build a version with thread support by passing --enable-threads to the configure script.
Did you use the correct defines to declare that you want the thread-enabled macros from tcl.h?
You should add -DTCL_THREADS to your compiler invocation, otherwise the locking macros are compiled as no-ops.
You need to use a thread-enabled build of the library.
When built without thread-enabling, Tcl internally uses quite a bit of global static data in places like memory management. It's pretty pervasive. While it might be possible to eventually make things work (provided you do all the initialisation and setup within a single thread), it's going to be rather inadvisable. That things crash in strange ways in your case isn't very surprising at all.
When you use a thread-enabled build of Tcl, all that global static data is converted to either thread-specific data or to appropriate mutex-guarded global data. That then allows Tcl to be used from many threads at once. However, a particular Tcl_Interp is bound to the thread that created it (as it uses lots of thread-specific data). In your case, that will be no problem; your interpreters are happily per-thread entities.
(Well, provided you also add a call to initialise the Tcl library itself, which only needs to be done once. Put Tcl_FindExecutable(NULL); inside main() before you create any of those threads.)
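A sketch of the corrected main() under those assumptions (thread-enabled build, one-time initialisation before any threads are created):

int main()
{
    Tcl_FindExecutable(NULL);   /* one-time library initialisation */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, run, NULL);
    pthread_create(&t2, NULL, run, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}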
Tcl 8.5 defaulted to not being thread-enabled on Unix for backward-compatibility reasons (on Windows and Mac OS X it was thread-enabled due to the different ways they handle low-level events), but this was changed in 8.6. I don't know how to get a thread-enabled build on RH6 other than building it yourself from source, which should be straightforward.

what does sched_feat macro in scheduler mean

The following macro is defined in ./kernel/sched/sched.h
#if defined(SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
#define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
#else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
I am unable to understand what role it plays.
The sched_feat() macro is used in scheduler code to test if a certain scheduler feature is enabled. For example, in kernel/sched/core.c, there is a snippet of code
int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
    if (!sched_feat(OWNER_SPIN))
        return 0;
which is testing whether the "spin-wait on mutex acquisition if the mutex owner is running" feature is set. You can see the full list of scheduler features in kernel/sched/features.h, but the short summary is that they are tunables, settable at runtime through /sys/kernel/debug/sched_features without rebuilding the kernel.
For example if you have not changed the default settings on your system, you will see "OWNER_SPIN" in your /sys/kernel/debug/sched_features, which means the !sched_feat(OWNER_SPIN) in the snippet above will evaluate to false and the scheduler code will continue on into the rest of the code in mutex_spin_on_owner().
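To make the non-jump-label variant concrete, here is, as a sketch, what that test expands to after preprocessing (illustrative only, not kernel source):

/* Illustrative expansion only: without jump labels,
 * sched_feat(OWNER_SPIN) becomes a plain bitmask test against the
 * runtime-tunable feature word. */
if (!(sysctl_sched_features & (1UL << __SCHED_FEAT_OWNER_SPIN)))
    return 0;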
The reason the macro definition you partially copied is more complicated than you might expect is that it uses the jump labels feature, when available and needed, to eliminate the overhead of these conditional tests in frequently run scheduler code paths. (The jump label version is only used when HAVE_JUMP_LABEL is set in the config, for obvious reasons, and when SCHED_DEBUG is set, because otherwise the scheduler feature bits can't change at runtime.) You can follow the link above to lwn.net for more details, but in a nutshell, jump labels are a way to use runtime binary patching to make conditional tests of flags much cheaper, at the cost of making changes to the flags much more expensive.
You can also look at the scheduler commit that introduced jump label use to see how the code used to be a bit simpler but not quite as efficient.

Double-Checked Locking Pattern in C++11?

The new machine model of C++11 allows multi-processor systems to work reliably with respect to the reordering of instructions.
As Meyers and Alexandrescu pointed out, the "simple" Double-Checked Locking Pattern implementation is not safe in C++03:
Singleton* Singleton::instance() {
    if (pInstance == 0) { // 1st test
        Lock lock;
        if (pInstance == 0) { // 2nd test
            pInstance = new Singleton;
        }
    }
    return pInstance;
}
They showed in their article that no matter what you do as a programmer, in C++03 the compiler has too much freedom: it is allowed to reorder the instructions in such a way that you cannot be sure you end up with only one instance of Singleton.
My question is now:
Do the restrictions/definitions of the new C++11 machine model now constrain the sequence of instructions so that the above code would always work with a C++11 compiler?
What does a safe C++11 implementation of this singleton pattern look like when using the new library facilities (instead of the mock Lock here)?
If pInstance is a regular pointer, the code has a potential data race -- operations on pointers (or any builtin type, for that matter) are not guaranteed to be atomic (EDIT: or well-ordered)
If pInstance is an std::atomic<Singleton*> and Lock internally uses an std::mutex to achieve synchronization (for example, if Lock is actually std::lock_guard<std::mutex>), the code should be data race free.
Note that you need both explicit locking and an atomic pInstance to achieve proper synchronization.
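A minimal sketch of that combination (instanceMutex is an invented name; the acquire/release orderings shown are one reasonable choice, with std::memory_order_seq_cst as the conservative default):

#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance();
private:
    static std::atomic<Singleton*> pInstance;
    static std::mutex instanceMutex;   // invented name
};

std::atomic<Singleton*> Singleton::pInstance{nullptr};
std::mutex Singleton::instanceMutex;

Singleton* Singleton::instance() {
    Singleton* p = pInstance.load(std::memory_order_acquire);  // 1st test
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(instanceMutex);
        p = pInstance.load(std::memory_order_relaxed);         // 2nd test, ordered by the mutex
        if (p == nullptr) {
            p = new Singleton;
            pInstance.store(p, std::memory_order_release);     // publish
        }
    }
    return p;
}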
Since initialization of function-local static variables is now guaranteed to be thread-safe, the Meyers singleton should be thread-safe:
Singleton* Singleton::instance() {
    static Singleton _instance;
    return &_instance;
}
Now you need to address the main problem: there is a Singleton in your code.
EDIT: based on my comment below: This implementation has a major drawback when compared to the others. What happens if the compiler doesn't support this feature? The compiler will spit out thread unsafe code without even issuing a warning. The other solutions with locks will not even compile if the compiler doesn't support the new interfaces. This might be a good reason not to rely on this feature, even for things other than singletons.
C++11 doesn't change the meaning of that implementation of double-checked locking. If you want to make double-checked locking work, you need to erect suitable memory barriers/fences.
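For illustration, a sketch of the fence-based variant (same invented instanceMutex and atomic pInstance as in the earlier sketch): here the ordering comes from standalone std::atomic_thread_fence calls rather than from orderings attached to the atomic operations.

Singleton* Singleton::instance() {
    Singleton* p = pInstance.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);         // barrier after the 1st test
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(instanceMutex);
        p = pInstance.load(std::memory_order_relaxed);           // 2nd test
        if (p == nullptr) {
            p = new Singleton;
            std::atomic_thread_fence(std::memory_order_release); // barrier before publishing
            pInstance.store(p, std::memory_order_relaxed);
        }
    }
    return p;
}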
