Use of Tcl in c++ multithreaded application - multithreading

I am facing a crash while trying to create one Tcl interpreter per thread. I am using Tcl version 8.5.9 on Linux RH6. It crashes in a different function each time, which looks like some kind of memory corruption. From what I have read on the net this seems to be a valid approach. Has anybody faced a similar issue? Does multi-threaded use of Tcl need any kind of special support?
Here is the following small program causing crash with tcl version 8.5.9.
#include <tcl.h>
#include <pthread.h>
#include <unistd.h>   /* for sleep() */

void* run (void*)
{
    /* Create and destroy one interpreter in this thread. */
    Tcl_Interp *interp = Tcl_CreateInterp();
    sleep(1);
    Tcl_DeleteInterp(interp);
    return NULL;
}

int main ()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, run, NULL);
    pthread_create(&t2, NULL, run, NULL);
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);
    return 0;
}

The default Tcl library isn't built thread-enabled (well, not in 8.5.9 AFAIK; 8.6 is).
So did you check that your Tcl library was built with thread support?
If you have a tclsh built against the lib, you can simply run:
% parray ::tcl_platform
::tcl_platform(byteOrder) = littleEndian
::tcl_platform(machine) = intel
::tcl_platform(os) = Windows NT
::tcl_platform(osVersion) = 6.2
::tcl_platform(pathSeparator) = ;
::tcl_platform(platform) = windows
::tcl_platform(pointerSize) = 4
::tcl_platform(threaded) = 1
::tcl_platform(wordSize) = 4
If ::tcl_platform(threaded) is 0, your build isn't thread enabled. You would need to build a version with thread support by passing --enable-threads to the configure script.
Also, did you use the correct defines so that you get the thread-enabled macros from tcl.h?
You should add -DTCL_THREADS to your compiler invocation, otherwise the locking macros are compiled as no-ops.

You need to use a thread-enabled build of the library.
When built without thread support, Tcl internally uses quite a bit of global static data in places like memory management. It's pretty pervasive. While it might be possible to eventually make things work (provided you do all the initialisation and setup within a single thread), it's going to be rather inadvisable. That things crash in strange ways in your case isn't very surprising at all.
When you use a thread-enabled build of Tcl, all that global static data is converted to either thread-specific data or to appropriate mutex-guarded global data. That then allows Tcl to be used from many threads at once. However, a particular Tcl_Interp is bound to the thread that created it (as it uses lots of thread-specific data). In your case, that will be no problem; your interpreters are happily per-thread entities.
(Well, provided you also add a call to initialise the Tcl library itself, which only needs to be done once. Put Tcl_FindExecutable(NULL); inside main() before you create any of those threads.)
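By way of illustration, a minimal version of main() with that call added, reusing the run() function from the question above (the compile line is just an example; the Tcl library name may differ on your system):
int main ()
{
    /* One-time, process-wide initialisation of the Tcl library,
       done before any interpreter threads are created. */
    Tcl_FindExecutable(NULL);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, run, NULL);
    pthread_create(&t2, NULL, run, NULL);
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);
    return 0;
}
/* Compile against the thread-enabled build, e.g.:
   g++ -DTCL_THREADS test.cpp -ltcl8.5 -lpthread   */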
Tcl 8.5 defaulted to not being thread-enabled on Unix for backward-compatibility reasons — on Windows and Mac OS X it was thread-enabled due to the different ways they handle low-level events — but this was changed in 8.6. I don't know how to get a thread-enabled build on RH6 (other than building it yourself from source, which should be straight-forward).

Related

C++11 thread safe singleton using lambda and call_once: main function (g++, clang++, Ubuntu 14.04)

All!
I am new to C++11 and many of its features.
I am looking for a C++11 (non boost) implementation of a thread safe singleton, using lambda and call_once (Sorry... I have no rights to include the call_once tag in the post).
I have investigated quite a lot (I am using g++ (4.8, 5.x, 6.2), clang++3.8, Ubuntu 14.04, trying to avoid using boost), and I have found the following links:
http://www.nuonsoft.com/blog/2012/10/21/implementing-a-thread-safe-singleton-with-c11/comment-page-1/
http://silviuardelean.ro/2012/06/05/few-singleton-approaches/ (which seems to be very similar to the previous one, but it is more complete, and provides at the end its own implementation).
But I am facing these problems with the mentioned implementations: either I am writing a wrong implementation of the main function (probable), or there are mistakes in the posted code (less probable), but I am receiving various compiling / linking errors (or both at the same time, of course...).
Something similar happens with the following code, which seems to compile according to the comments (but this one uses neither lambda nor call_once):
How to ensure std::call_once really is only called once (in this case, it compiles fine, but throws the following error at runtime):
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
So, could you please help me with the correct way to call getInstance() in the main function, to get one (and only one) object, and then with how to call other functions that I might include in the Singleton? (Something like: Singleton::getInstance()->myFx(x, y, z);?)
(Note: I have also found several references on StackOverflow which are resolved as "thread safe", but there are similar implementations in other StackOverflow posts and other Internet places which are not considered "thread safe"; here are a few examples of both (these do not use lambda)):
Thread-safe singleton in C++11
c++ singleton implementation STL thread safe
Thread safe singleton in C++
Thread safe singleton implementation in C++
Thread safe lazy construction of a singleton in C++
Finally, I would very much appreciate it if you could suggest the best books to study these subjects. Thanks in advance!!
I just ran across this issue. In my case, I needed to add -lpthread to my compilation options.
Implementing a singleton with a static local variable, as suggested e.g. by Thread safe singleton implementation in C++, is thread safe with C++11. With C++11 the initialization of such a static variable is defined to happen on exactly one thread, and no other thread proceeds until that initialization is complete. (I can also back that up with problems we recently had on an embedded platform: we used call_once to implement a singleton, and it worked after we returned to the "classic" singleton implementation with the static variable.)
ISO/IEC 14882:2011 states, for example, in §3.6.2 that
Static initialization shall be performed before any dynamic initialization takes place.
and as part of §6.7:
The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage duration (3.7.2) is performed before any other initialization takes place.
(See also this answer)
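For reference, a minimal sketch of both forms under C++11, using nothing beyond the standard library (class and member names are illustrative, not taken from the linked posts):
#include <mutex>

class Singleton {
public:
    // Meyers singleton: the static local is initialized exactly once,
    // even when several threads call getInstance() concurrently (C++11).
    static Singleton& getInstance()
    {
        static Singleton instance;
        return instance;
    }

    void myFx(int x, int y, int z) { /* ... */ }

private:
    Singleton() = default;
    Singleton(const Singleton&) = delete;
    Singleton& operator=(const Singleton&) = delete;
};

// The same idea expressed with std::call_once and a lambda, if that form is preferred:
class SingletonCO {
public:
    static SingletonCO& getInstance()
    {
        static std::once_flag flag;
        std::call_once(flag, [] { instance = new SingletonCO(); });
        return *instance;
    }
private:
    SingletonCO() = default;
    static SingletonCO* instance;
};
SingletonCO* SingletonCO::instance = nullptr;
Usage would then be Singleton::getInstance().myFx(1, 2, 3); (a reference rather than a pointer, so there is nothing to delete). As noted in the other answer, remember to link with -lpthread on g++/clang++.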
A very good book I can recommend is "C++ Concurrency in Action" by A. Williams. (call_once and the Singleton pattern are discussed as part of Chapter 3 - that is how I know that the "classic Singleton" is thread safe since C++11.)

VC++ native mutex heap corruption

I have a native C++ library that gets used by a managed C++ application. The native library is compiled without CLR support and the managed C++ application with it (the /CLR compiler option).
When I use a std::mutex in the native library I get a heap corruption when the owning native class is deleted. The use of <mutex> is blocked under managed C++, so I'm guessing that could be part of the reason.
The minimal native class that demonstrates the issue is:
Header:
#pragma once
#include <stdio.h>

#ifndef __cplusplus_cli
#include <mutex>
#endif

namespace MyNamespace {
    class SomeNativeLibrary
    {
    public:
        SomeNativeLibrary();
        ~SomeNativeLibrary();
        void DoSomething();
#ifndef __cplusplus_cli
        std::mutex aMutex;
#endif
    };
}
Implementation:
#include "SomeNativeLibrary.h"
namespace MyNamespace {
SomeNativeLibrary::SomeNativeLibrary()
{}
SomeNativeLibrary::~SomeNativeLibrary()
{}
void SomeNativeLibrary::DoSomething(){
printf("I did something.\n");
}
}
Managed C++ Console Application:
int main(array<System::String ^> ^args)
{
    Console::WriteLine(L"Unit Test Console:");
    MyNamespace::SomeNativeLibrary *someNativelib = new MyNamespace::SomeNativeLibrary();
    someNativelib->DoSomething();
    delete someNativelib;
    getchar();
    return 0;
}
The heap corruption debug error occurs when the attempt is made to delete the someNativeLib pointer.
Is there anything I can do to use a std::mutex safely in the native library, or is there an alternative I could use? In my live code the mutex is used to ensure that only a single thread accesses a std::vector.
The solution was to use a CRITICAL_SECTION as the lock instead. It's actually more efficient than a mutex in my case anyway since the lock is only for threads in the same process.
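For reference, a rough sketch of what the CRITICAL_SECTION variant can look like; because <windows.h> (unlike <mutex>) can typically be included from both the native and the /clr compilation, the member no longer has to be hidden behind an #ifdef and the class layout stays identical on both sides. The names follow the example above, and the Append() method is illustrative:
// SomeNativeLibrary.h
#pragma once
#include <windows.h>
#include <vector>

namespace MyNamespace {
    class SomeNativeLibrary
    {
    public:
        SomeNativeLibrary()  { InitializeCriticalSection(&m_lock); }
        ~SomeNativeLibrary() { DeleteCriticalSection(&m_lock); }

        void Append(int value)
        {
            // Hold the lock only around the actual vector access.
            EnterCriticalSection(&m_lock);
            m_data.push_back(value);
            LeaveCriticalSection(&m_lock);
        }

    private:
        CRITICAL_SECTION m_lock;     // same layout in both compilations
        std::vector<int> m_data;
    };
}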
I'm not sure whether you re-read your own post, but there is a clue in your code:
#ifndef __cplusplus_cli
std::mutex aMutex;
#endif
The member 'aMutex' is compiled in only when '__cplusplus_cli' is undefined.
So the moment you included that header in the managed C++ project, the member vanished from the class definition.
Your native project and your managed project therefore have a mismatch in the class definition, which for starters mostly ends in an access violation when something writes to a location beyond the class memory (the member that does not exist in the CLI version, if the object is instantiated there).
Or, to put it simply, heap corruption in the managed code.
So what you have done there is a no-go, ever!
But I was actually amazed that you managed to include the native lib and successfully compile both projects. I have to ask how you set up the project properties to manage that. Or you may just have found yet another bug ;)
About the question itself, for posterity:
Yes, CRITICAL_SECTION helps, and yes, it is faster than a mutex since it works within a single process and takes a cheap user-mode path in the uncontended case. It has also seen plenty of changes since its introduction and has had some nasty deadlock issues, for example:
https://microsoft.public.win32.programmer.kernel.narkive.com/xS8sFPCG/criticalsection-deadlock-with-owningthread-of-zero
which could end up taking down the entire OS. So lock only the very small piece of code that actually accesses the shared data, and release the lock immediately.
As a replacement, if you are not planning cross-platform support (Linux/iOS/Android/Win/MCU/...), you could just use the plain "C" kernel mutex or event objects.
There are plenty of other alternatives in the Windows API as well.
// mutex
https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-createmutexw
HANDLE hMutex = CreateMutex(NULL, TRUE, _T("MutexName"));
NOTE: mind the mutex naming rules for the 'Global' vs 'Local' namespaces.
// or event
https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-createeventw
HANDLE hEvent = CreateEvent(NULL, TRUE, FALSE, _T("EventName"));
To set or clear/reset the event state, just call SetEvent(hEvent) or ResetEvent(hEvent).
To wait for the signalled (set) state, simply call:
DWORD nret = WaitForSingleObject(hEvent, INFINITE);
The INFINITE define (-1) waits forever; it can be replaced with an actual timeout in milliseconds, in which case the return value should be checked against WAIT_OBJECT_0. If it is not WAIT_OBJECT_0, it is most likely WAIT_TIMEOUT.
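A small sketch of that timeout check (the 5-second value is just an example):
DWORD nret = WaitForSingleObject(hEvent, 5000);   // wait at most 5 seconds
if (nret == WAIT_OBJECT_0) {
    // the event was signalled
} else if (nret == WAIT_TIMEOUT) {
    // the wait timed out without the event being set
}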
There is a sea of synchronization functions and techniques, depending on what you actually need to do:
https://learn.microsoft.com/en-us/windows/win32/sync/synchronization-functions
But use them with a sanity check.
You might just be trying to fix the wrong thing in the end; each solution depends on the actual problem. And the moment you think it's getting too complex, it's probably the wrong solution - the simplest solutions are always best, but always verify and validate.

OpenGL Rendering in a secondary thread

I'm writing a 3D model viewer application as a hobby project, and also as a test platform to try out different rendering techniques. I'm using SDL to handle window management and events, and OpenGL for the 3D rendering. The first iteration of my program was single-threaded, and ran well enough. However, I noticed that the single-threaded program caused the system to become very sluggish/laggy. My solution was to move all of the rendering code into a different thread, thereby freeing the main thread to handle events and prevent the app from becoming unresponsive.
This solution worked only intermittently: the program frequently crashed due to a changing (and, to my mind, bizarre) set of errors coming mainly from the X window system. This led me to question my initial assumption that as long as all of my OpenGL calls took place in the thread where the context was created, everything should still work out. After spending the better part of a day searching the internet for an answer, I am thoroughly stumped.
More succinctly: Is it possible to perform 3D rendering using OpenGL in a thread other than the main thread? Can I still use a cross-platform windowing library such as SDL or GLFW with this configuration? Is there a better way to do what I'm trying to do?
So far I've been developing on Linux (Ubuntu 11.04) using C++, although I am also comfortable with Java and Python if there is a solution that works better in those languages.
UPDATE: As requested, some clarifications:
When I say "The system becomes sluggish" I mean interacting with the desktop (dragging windows, interacting with the panel, etc) becomes much slower than normal. Moving my application's window takes time on the order of seconds, and other interactions are just slow enough to be annoying.
As for interference with a compositing window manager... I am using the GNOME shell that ships with Ubuntu 11.04 (staying away from Unity for now...) and I couldn't find any options to disable desktop effects such as there was in previous distributions. I assume this means I'm not using a compositing window manager...although I could be very wrong.
I believe the "X errors" are server errors due to the error messages I'm getting at the terminal. More details below.
The errors I get with the multi-threaded version of my app:
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0.0"
after 73 requests (73 known processed) with 0 events remaining.
X Error of failed request: BadColor (invalid Colormap parameter)
Major opcode of failed request: 79 (X_FreeColormap)
Resource id in failed request: 0x4600001
Serial number of failed request: 72
Current serial number in output stream: 73
Game: ../../src/xcb_io.c:140: dequeue_pending_request: Assertion `req == dpy->xcb->pending_requests' failed.
Aborted
I always get one of the three errors above; which one I get varies, apparently at random, which (to my eyes) would appear to confirm that my issue does in fact stem from my use of threads. Keep in mind that I'm learning as I go along, so there is a very good chance that in my ignorance I've done something rather stupid along the way.
SOLUTION: For anyone who is having a similar issue, I solved my problem by moving my call to SDL_Init(SDL_INIT_VIDEO) to the rendering thread, and locking the context initialization using a mutex. This ensures that the context is created in the thread that will be using it, and it prevents the main loop from starting before initialization tasks have finished. A simplified outline of the startup procedure:
1) Main thread initializes struct which will be shared between the two threads, and which contains a mutex.
2) Main thread spawns render thread and sleeps for a brief period (1-5ms), giving the render thread time to lock the mutex. After this pause, the main thread blocks while trying to lock the mutex.
3) Render thread locks mutex, initializes SDL's video subsystem and creates OpenGL context.
4) Render thread unlocks mutex and enters its "render loop".
5) The main thread is no longer blocked, so it locks and unlocks the mutex before finishing its initialization step.
Be sure to read the answers and comments; there is a lot of useful information there.
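A minimal sketch of that startup sequence, assuming pthreads and SDL 1.2 on Linux (the struct, function, and variable names are illustrative, not from the original project):
#include <SDL/SDL.h>
#include <pthread.h>
#include <unistd.h>

struct Shared {
    pthread_mutex_t initLock;
    // ... anything else the two threads share ...
};

void* renderThread(void* arg)
{
    Shared* shared = (Shared*)arg;
    pthread_mutex_lock(&shared->initLock);       // step 3: grab the lock first
    SDL_Init(SDL_INIT_VIDEO);                    // video subsystem lives in this thread
    SDL_SetVideoMode(800, 600, 32, SDL_OPENGL);  // the GL context is created here too
    pthread_mutex_unlock(&shared->initLock);     // step 4: init done, enter render loop
    for (;;) {
        // render loop: all GL calls stay in this thread
    }
    return NULL;
}

int main()
{
    Shared shared;
    pthread_mutex_init(&shared.initLock, NULL);               // step 1

    pthread_t renderer;
    pthread_create(&renderer, NULL, renderThread, &shared);   // step 2
    usleep(5000);                      // give the render thread time to lock first

    pthread_mutex_lock(&shared.initLock);    // blocks until GL init has finished
    pthread_mutex_unlock(&shared.initLock);  // step 5: continue main-thread setup

    // ... event handling on the main thread ...
    pthread_join(renderer, NULL);
    return 0;
}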
As long as the OpenGL context is touched from only one thread at a time, you should not run into any problems. You said even your single-threaded program made your system sluggish. Does that mean the whole system or only your own application? The worst that should happen with a single-threaded OpenGL program is that processing user input for that one program gets laggy, but the rest of the system is not affected.
If you use some compositing window manager (Compiz, KDE4 kwin), please try out what happens if you disable all compositing effects.
When you say X errors, do you mean client side errors, or errors reported in the X server log? The latter case should not happen, because the X server must be able to cope with any kind of malformed X command stream and at most emit a warning. If it (the X server) crashes, this is a bug and should be reported to X.org.
If your program crashes, then there's something wrong in its interaction with X; in that case please provide us with the error output in its variations.
What I did in a similar situation was to keep my OpenGL calls in the main thread but move the vertex arrays preparation to a separate thread (or threads).
Basically, if you manage to separate the cpu intensive stuff from the OpenGL calls you don't have to worry about the unfortunately dubious OpenGL multithreading.
It worked out beautifully for me.
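A rough sketch of that kind of split, assuming a simple mutex-protected hand-off of prepared data (the types and names here are illustrative, not from the answerer's code):
#include <mutex>
#include <thread>
#include <vector>

struct Mesh { std::vector<float> vertices; };

std::mutex        g_readyLock;
std::vector<Mesh> g_readyMeshes;   // filled by the worker, drained by the GL thread

void prepareThread()               // started once with: std::thread(prepareThread).detach();
{
    for (;;) {
        Mesh m;
        // ... CPU-intensive work: loading, skinning, tessellation, filling m.vertices ...
        std::lock_guard<std::mutex> lock(g_readyLock);
        g_readyMeshes.push_back(std::move(m));
    }
}

void renderFrame()                 // called from the main (GL) thread only
{
    std::vector<Mesh> batch;
    {
        std::lock_guard<std::mutex> lock(g_readyLock);
        batch.swap(g_readyMeshes); // take everything the worker has finished
    }
    for (const Mesh& m : batch) {
        // upload with glBufferData(...) and issue the draw calls - all GL stays here
    }
}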
Just in case - the X server has its own locking subsystem.
Try the following while drawing:
man XInitThreads - for initialization
man XLockDisplay / XUnlockDisplay - around drawing calls (not sure about event processing);
I was getting one of your errors:
../../src/xcb_io.c:140: dequeue_pending_request: Assertion `req == dpy->xcb->pending_requests' failed.
Aborted
and a whole host of different ones as well. It turns out that SDL_PollEvent needs a pointer to initialized memory. So this fails:
SDL_Event *event;
SDL_PollEvent(event);
while this works:
SDL_Event event;
SDL_PollEvent(&event);
In case anyone else runs across this from google.
This is half an answer and half a question.
Rendering in SDL in a separate thread is possible. It usually works on any OS. What you need to do is make sure the GL context is made current when the render thread takes over. At the same time, before you do so, you need to release it from the main thread, e.g.:
Called from the main thread:
void Renderer::Init()
{
#ifdef _WIN32
    m_CurrentContext = wglGetCurrentContext();
    m_CurrentDC = wglGetCurrentDC();
    // release current context
    wglMakeCurrent( nullptr, nullptr );
#endif

#ifdef __linux__
    if (!XInitThreads())
    {
        THROW( "XLib is not thread safe." );
    }
    SDL_SysWMinfo wm_info;
    SDL_VERSION( &wm_info.version );
    if ( SDL_GetWMInfo( &wm_info ) ) {
        Display *display = wm_info.info.x11.gfxdisplay;
        m_CurrentContext = glXGetCurrentContext();
        ASSERT( m_CurrentContext, "Error! No current GL context!" );
        glXMakeCurrent( display, None, nullptr );
        XSync( display, false );
    }
#endif
}
Called from the render thread:
void Renderer::InitGL()
{
    // This is important! Our renderer runs its own render thread.
    // All GL work happens on that thread from here on.
#ifdef _WIN32
    wglMakeCurrent( m_CurrentDC, m_CurrentContext );
#endif

#ifdef __linux__
    SDL_SysWMinfo wm_info;
    SDL_VERSION( &wm_info.version );
    if ( SDL_GetWMInfo( &wm_info ) ) {
        Display *display = wm_info.info.x11.gfxdisplay;
        Window window = wm_info.info.x11.window;
        glXMakeCurrent( display, window, m_CurrentContext );
        XSync( display, false );
    }
#endif

    // Init GLEW - we need this to use OGL extensions (e.g. for VBOs)
    GLenum err = glewInit();
    ASSERT( GLEW_OK == err, "Error: %s\n", glewGetErrorString(err) );
}
The risk here is that SDL does not have a native MakeCurrent() function, unfortunately, so we have to poke around a little in SDL internals (this is for 1.2; 1.3 might have solved it by now).
And one problem remains: for some reason I run into a problem when SDL is shutting down. Maybe someone can tell me how to safely release the context when the thread terminates.
C++, SDL, OpenGL:
On the main thread:
SDL_CreateWindow( );
SDL_CreateSemaphore( );
SDL_SemWait( );
On the render thread (spawned via SDL_CreateThread( run, "rendererThread", (void*)this )):
SDL_GL_CreateContext( )
"initialize the rest of OpenGL and GLEW"
SDL_SemPost( ) // unlock the previously created semaphore
P.S.: SDL_CreateThread( ) only takes plain functions as its first parameter, not member functions. If you want method-like behaviour, you can make the function a friend of your class; that way it has access to the class internals while still being usable as the entry point for SDL_CreateThread( ).
P.P.S.: Inside the "run( void* data )" function created for the thread, the "(void*)this" argument is important; to re-obtain "this" inside the function you need a line like "ClassName* me = (ClassName*)data;".
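A compact sketch of that pattern, assuming SDL2 (matching the three-argument SDL_CreateThread used above); the class name Viewer and the window parameters are illustrative, and a static member function stands in for the friend function since it serves the same purpose as a thread entry point:
#include <SDL.h>

class Viewer {
public:
    void start()
    {
        SDL_Init(SDL_INIT_VIDEO);
        initDone = SDL_CreateSemaphore(0);
        window = SDL_CreateWindow("viewer", SDL_WINDOWPOS_UNDEFINED,
                                  SDL_WINDOWPOS_UNDEFINED, 800, 600, SDL_WINDOW_OPENGL);
        thread = SDL_CreateThread(run, "rendererThread", (void*)this);
        SDL_SemWait(initDone);            // block until the render thread finished GL init
        // ... main-thread event loop goes here ...
    }

private:
    // Plain (static) function usable as the SDL thread entry point; it
    // recovers 'this' from the void* argument and calls into the class.
    static int run(void* data)
    {
        Viewer* me = (Viewer*)data;
        me->context = SDL_GL_CreateContext(me->window);  // context belongs to this thread
        // ... initialize the rest of OpenGL and GLEW here ...
        SDL_SemPost(me->initDone);        // unblock the main thread
        // ... render loop ...
        return 0;
    }

    SDL_Window*   window   = nullptr;
    SDL_GLContext context  = nullptr;
    SDL_Thread*   thread   = nullptr;
    SDL_sem*      initDone = nullptr;
};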

Is there some kind of incompatibility with Boost::thread() and Nvidia CUDA?

I'm developing a generic streaming CUDA kernel execution Framework that allows parallel data copy & execution on the GPU.
Currently I'm calling the cuda kernels within a C++ static function wrapper, so I can call the kernels from a .cpp file (not .cu), like this:
//kernels.cu:
//kernel definition
__global__ void kernelCall_kernel( dataRow* in, dataRow* out, void* additionalData){
//Do something
};
//kernel handler, so I can compile this .cu and link it with the main project and call it within a .cpp file
extern "C" void kernelCall( dataRow* in, dataRow* out, void* additionalData){
int blocksize = 256;
dim3 dimBlock(blocksize);
dim3 dimGrid(ceil(tableSize/(float)blocksize));
kernelCall_kernel<<<dimGrid,dimBlock>>>(in, out, additionalData);
}
If I call the handler as a normal function, the data printed is right.
//streamProcessing.cpp
//allocations and definitions of data omitted
//copy data to GPU
cudaMemcpy(data_d,data_h,tableSize,cudaMemcpyHostToDevice);
//call:
kernelCall(data_d,result_d,null);
//copy data back
cudaMemcpy(result_h,result_d,resultSize,cudaMemcpyDeviceToHost);
//show result:
printTable(result_h,resultSize);// this just iterate and shows the data
But to allow parallel copying and execution of data on the GPU I need to create a thread, so I call it by making a new boost::thread:
//allocations, definitions of data,copy data to GPU omitted
//call:
boost::thread* kernelThreadOwner = new boost::thread(kernelCall, data_d,result_d,null);
kernelThreadOwner->join();
//Copy data back and print ommited
I just get garbage when printing the result at the end.
Currently I'm just using one thread, for testing purposes, so there should not be much difference between calling it directly and creating a thread. I have no clue why calling the function directly gives the right result but calling it from a thread does not. Is this a problem with CUDA & boost? Am I missing something? Thanks in advance.
The problem is that (pre CUDA 4.0) CUDA contexts are tied to the thread in which they were created. When you are using two threads, you have two contexts. The context that the main thread is allocating and reading from, and the context that the thread which runs the kernel inside are not the same. Memory allocations are not portable between contexts. They are effectively separate memory spaces inside the same GPU.
If you want to use threads in this way, you either need to refactor things so that one thread only "talks" to the GPU and communicates with the parent via CPU memory, or use the CUDA context migration API, which allows a context to be moved from one thread to another (via cuCtxPushCurrent and cuCtxPopCurrent). Be aware that context migration isn't free and there is latency involved, so if you plan to migrate contexts frequently, you might find it more efficient to change to a different design which preserves context-thread affinity.
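A rough sketch of the context-migration route with the driver API, pre-CUDA-4.0 style (error checking omitted; names are illustrative and the "GPU work" is reduced to a bare allocation):
#include <cuda.h>
#include <boost/thread.hpp>

static CUcontext ctx;

void workerBody()
{
    cuCtxPushCurrent(ctx);            // attach the context created by the main thread

    CUdeviceptr buf;
    cuMemAlloc(&buf, 1024);           // allocations now live in the shared context
    // ... copy data / launch kernels here ...
    cuMemFree(buf);

    cuCtxPopCurrent(&ctx);            // detach so another thread can use it again
}

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);        // the context is current on the main thread now

    cuCtxPopCurrent(&ctx);            // release it before handing it to the worker

    boost::thread worker(workerBody);
    worker.join();

    cuCtxPushCurrent(ctx);            // re-attach on the main thread if needed
    cuCtxDestroy(ctx);
    return 0;
}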

Thread Communication

Is there any tool available for tracing communication among threads;
1. running in a single process
2. running in different processes (IPC)
I am presuming you need to trace this for debugging. Under normal circumstances it's hard to do without custom-written code. For a similar problem that I faced, I had a per-processor tracing buffer which briefly recorded the time and the interesting operation performed by the running thread. The log was a circular trace which stored data like this:
struct trace_data {
    int op;              /* operation code; also determines how 'u' is interpreted */
    void *data;          /* meaning depends on 'op' */
    struct time t;       /* timestamp, used to order events across CPUs */
    union {
        struct {
            int op1_field1;
            int op1_field2;
        } d1;
        struct {
            int op2_field1;
            int op2_field2;
        } d2;
    } u;
};
The trace log was an array of 1024 of these structures, one array per processor. Each thread traced its operations, as well as the time, to determine the causality of events. Which fields of the "union" were used to store data depended upon the operation being done; the meaning of the "data" pointer depended upon the "op" as well. When the program crashed, I'd open the core in gdb, and I had a gdb script which would go through the log of each processor and print out the ops and their corresponding data, to reconstruct the history of events.
For different processes you could do such logging to a file instead - one per process. This example is in C, but you can do this in whatever language you want to use, as long as you can figure out the CPU id on which the thread is running currently.
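A minimal sketch of recording into such a per-CPU circular log, assuming Linux and glibc's sched_getcpu(); the names NCPUS, TRACE_LEN and trace_logs are illustrative, and the write is not guarded against the thread migrating CPUs mid-call (the original scheme tolerates that too, since the entry still lands in some ring):
#include <sched.h>      /* sched_getcpu() */
#include <time.h>

#define NCPUS     64
#define TRACE_LEN 1024

struct trace_entry {
    int             op;
    void           *data;
    struct timespec t;
};

static struct trace_entry trace_logs[NCPUS][TRACE_LEN];
static unsigned           trace_pos[NCPUS];

static void trace(int op, void *data)
{
    int cpu = sched_getcpu();                       /* which per-CPU ring to use */
    unsigned idx = trace_pos[cpu]++ % TRACE_LEN;    /* wrap around: circular log */
    trace_logs[cpu][idx].op   = op;
    trace_logs[cpu][idx].data = data;
    clock_gettime(CLOCK_MONOTONIC, &trace_logs[cpu][idx].t);
}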
You might be looking for something like the Intel Thread Checker as long as you're using pthreads in (1).
For communication between different processes (2), you can use Aspect-Oriented Programming (AOP) if you have the source code, or write your own wrapper for the IPC functions and LD_PRELOAD it.
Edit: Whoops, you said tracing, not checking.
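A tiny sketch of the LD_PRELOAD wrapper idea for (2), assuming Linux/glibc; here write() is interposed as an example, and any IPC entry point such as send() or msgsnd() could be wrapped the same way (the logging goes through the real write() to avoid recursing into the wrapper):
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* Interposed write(): log the call, then forward to the real libc write(). */
ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t) = NULL;
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))dlsym(RTLD_NEXT, "write");

    char msg[64];
    int n = snprintf(msg, sizeof(msg), "[trace] write(fd=%d, count=%zu)\n", fd, count);
    real_write(2, msg, (size_t)n);      /* log via the real write to avoid recursion */

    return real_write(fd, buf, count);
}

/* Build:  gcc -shared -fPIC -o libtrace.so trace.c -ldl
   Run:    LD_PRELOAD=./libtrace.so ./your_program        */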
It depends very much on the operating system and development environment you are using. If you're using Visual Studio, look at the tools in Visual Studio 2010.
