I need to write an application using MPICH2 (64 bit, in case you're wondering). A GUI is entirely optional but would of course be a huge plus. Will mpiexec have any difficulties running managed VC++ code? Are there any other problems I might run into with compiling/linking (calling conventions, etc)?
Just to give you an idea, the general structure of the program would be like this:
int main(array<System::String ^> ^args)
{
    /* Get MPI rank */
    if ( rank == 0 )
    {
        // Enabling Windows XP visual effects before any controls are created
        Application::EnableVisualStyles();
        Application::SetCompatibleTextRenderingDefault(false);
        // Create the main window and run it
        // Send/receive messages in Form1's code
        Application::Run(gcnew Form1());
    }
    else
    {
        /* Send/receive messages to/from process #0 only */
    }
    return 0;
}
MPI is just another library so no magic. Your code should look something like this:
init MPI
if (rank == 0) init your GUI;
while (1) {
    if (rank == 0) get input;
    perform MPI computation on input
    make sure rank 0 ends up with the final result
    if (rank == 0) display result on GUI;
}
if (rank == 0) clean up GUI;
clean up MPI
So, working with Visual Studio 2008 developing native C++ code for a Windows CE 6.0 platform. Consider the following multithreaded application:
#include "stdafx.h"
void IncrementCounter(int& counter)
{
if (++counter >= 1000)
{
counter = 0;
}
}
unsigned long ThreadFunction(void* arguments)
{
int threadCounter = 0;
while (true)
{
Sleep(20);
IncrementCounter(threadCounter);
}
return 0;
}
int _tmain(int argc, _TCHAR* argv[])
{
CreateThread(
NULL,
0,
(LPTHREAD_START_ROUTINE)ThreadFunction,
NULL,
0,
NULL
);
int mainCounter = 0;
while (true)
{
Sleep(20);
IncrementCounter(mainCounter);
}
return 0;
}
When I build this to run on my Windows 7 dev machine and run a debug session from Visual Studio with a breakpoint on the counter = 0; statement, execution eventually breaks and two threads are displayed in the "Threads" debug window. I can switch back and forth between the two threads using either a double-click or right-click -> "Switch to Thread", and for both threads I can see a call stack, browse source and inspect symbols (for the call stack frames within my application code). However, when I do the same on Windows CE, connecting via ActiveSync/WMDC (I have tried both our custom CE 6.0 hardware with an in-house OS and SDK, and an old Windows Mobile 5.0 PDA with the stock MS SDK), I can see a call stack and browse source only for the thread in which the break has taken place (where the current execution point is within my application code). I don't get anything useful for the other thread, which is currently blocked in kernel space waiting for its sleep timeout.
Does anyone know whether it's possible to get this working better on Windows CE? I'm guessing it might have something to do with the debugger not knowing where to find the .pdb symbol files for the WinCE kernel elements, or perhaps I need to be running a Debug OS?
The question "Windows CE 6 remote debugging. No call stack when pause program" describes the same issue, but doesn't really provide a solution.
thanks
Richard
It's probably because the pdb file for coredll.dll is missing. If you build the OS image for your device yourself you will have access to this file; otherwise I am afraid it's platform dependent.
According to the following report, this issue is considered to be by design in VS2005, so maybe it's the same for VS2008:
http://connect.microsoft.com/VisualStudio/feedback/details/190785/unable-to-debug-windows-mobile-application-that-is-in-a-system-call
The following link has some instructions for finding the call stack of a "Thread That Is Not Running" using Platform Builder:
https://distrinet.cs.kuleuven.be/projects/SEESCOA/internal/workpackages/workpackage6/Task6dot2/ESCE/classes/331.pdf
Since I'm using only VS 2005, I cannot confirm whether it's of any help.
If logging is not sufficient (as was suggested in the SO link you provided), I suggest using the GetThreadCallStack function to find the call stack for a thread like the one in your example. Here is a step-by-step procedure:
1 - Add the following code to your project:
typedef struct _CallSnapshotEx {
    DWORD dwReturnAddr;
    DWORD dwFramePtr;
    DWORD dwCurProc;
    DWORD dwParams[4];
} CallSnapshotEx;

#define STACKSNAP_EXTENDED_INFO 2
#define MAX_FRAMES 40

DWORD dwGUIThread;   // set in WinMain, see step 2

void DumpGUIThreadCallStack() {
    // GetThreadCallStack is resolved at run time from coredll.dll
    typedef ULONG (*GETTHREADCALLSTACK)(HANDLE hThrd, ULONG dwMaxFrames, LPVOID lpFrames[], DWORD dwFlags, DWORD dwSkip);
    HINSTANCE hCore = LoadLibrary(_T("coredll.dll"));
    if ( !hCore )
        return;
    GETTHREADCALLSTACK pGetThreadCallStack = (GETTHREADCALLSTACK)GetProcAddress(hCore, _T("GetThreadCallStack"));
    if ( !pGetThreadCallStack )
        return;

    CallSnapshotEx lpFrames[MAX_FRAMES];
    DWORD dwCnt = pGetThreadCallStack((HANDLE)dwGUIThread, MAX_FRAMES, (void**)lpFrames, STACKSNAP_EXTENDED_INFO, 0);

    // Print the return address of every captured frame to the debugger output
    TCHAR szBuff[64];
    for ( DWORD i = 0; i < dwCnt; ++i ) {
        wsprintf(szBuff, L"[%d] %p\n", i, lpFrames[i].dwReturnAddr);
        OutputDebugString(szBuff);
    }
}
It dumps the return addresses of the call frames to the Output window (sample output is in step 3).
2 - Initialize dwGUIThread in WinMain:
dwGUIThread = GetCurrentThreadId();
3 - Execute DumpGUIThreadCallStack(); before the actual breakpoint inside ThreadFunction. It will write text similar to this to the Output window:
[0] 8C04D2C4
[1] 8C04D34C
[2] 40026D48
[3] 000111F4 <--- 1
[4] 00011BAC <--- 2
[5] 4003C2DC
1 and 2 are the return addresses you are interested in; you want to find the symbols nearest to them.
4 - While in the debugger, switch to disassembly mode (right-click on the source file and choose 'Go to disassembly'). At the top of the window in this mode you will see an Address: line. Paste the addresses from the Output window into it; in my case 000111F4 takes me to the following lines:
while (true)
{
Sleep(20);
000111F0 mov r0, #0x14
000111F4 bl 0001193C // <--- 1
IncrementCounter(mainCounter);
which shows what your GUI thread is actually doing.
The Visual Studio debugger allows functions to be executed from the Immediate window, but I was unable to call DumpGUIThreadCallStack that way; I always get 'Error: function evaluation not supported'.
To find the nearest symbols for frame return addresses you can also use .map files together with .cod files (sources compiled with /FAcs); there are some good tutorials on that on Google.
The example above was tested with VS 2005 and Standard SDK 5.0 on a WCE 6.0 (end user) device.
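For the .map-file route, finding the "nearest symbol" amounts to looking up the last symbol whose start address is not above the return address. Here is a minimal C++ sketch of that lookup, assuming the symbol addresses have already been parsed out of the .map file into a table. The type SymbolTable and the function nearest_symbol are made up for illustration, and parsing the .map file itself is not shown:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Symbol start address -> symbol name, as it could be read from a .map file.
// (Hypothetical type for this sketch.)
using SymbolTable = std::map<std::uint32_t, std::string>;

// Returns the name of the symbol with the greatest start address <= addr,
// or an empty string if addr lies below every known symbol.
std::string nearest_symbol(const SymbolTable &symbols, std::uint32_t addr) {
    auto it = symbols.upper_bound(addr);   // first symbol starting after addr
    if (it == symbols.begin()) {
        return "";                         // nothing starts at or before addr
    }
    --it;                                  // step back to the enclosing symbol
    return it->second;
}
```

Feeding it a return address from the dump in step 3 should give the function containing that address, provided the addresses in the table match the addresses at which the module is actually loaded.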
In one of my moments of illumination I started to wonder how Windows and Linux actually implement a mutex. I have implemented this synchronizer in a hundred different ways on many different architectures, but I never thought about how it is really implemented in a big OS. For example, in the ARM world I built some of my synchronizers by disabling interrupts, but I always thought that wasn't a really good way to do it.
I tried to "swim" through the Linux kernel, but I couldn't find anything that satisfied my curiosity. I'm not an expert in threading, but I have a solid grasp of the basic and intermediate concepts.
So does anyone know how a mutex is implemented?
A quick look at code apparently from one Linux distribution seems to indicate that it is implemented using an interlocked compare and exchange. So, in some sense, the OS isn't really implementing it since the interlocked operation is probably handled at the hardware level.
Edit: As Hans points out, the interlocked exchange does the compare and exchange in an atomic manner. Here is documentation for the Windows version. For fun, I just wrote a small test showing a really simple example of creating a mutex like that. It is a simple acquire and release test.
#include <windows.h>
#include <assert.h>
#include <stdio.h>

struct homebrew {
    LONG *mutex;
    int *shared;
    int mine;
};

#define NUM_THREADS 10
#define NUM_ACQUIRES 100000

DWORD WINAPI SomeThread( LPVOID lpParam )
{
    struct homebrew *test = (struct homebrew*)lpParam;
    while ( test->mine < NUM_ACQUIRES ) {
        // Test and set the mutex. If it currently has value 0, then it
        // is free. Setting 1 means it is owned. This interlocked function does
        // the test and set as an atomic operation
        if ( 0 == InterlockedCompareExchange( test->mutex, 1, 0 )) {
            // this thread now owns the mutex. Increment the shared variable
            // without an atomic increment (relying on mutex ownership to protect it)
            (*test->shared)++;
            test->mine++;
            // Release the mutex (4 byte aligned assignment is atomic)
            *test->mutex = 0;
        }
    }
    return 0;
}

int main( int argc, char* argv[] )
{
    LONG mymutex = 0; // zero means free, one means owned
    int shared = 0;
    HANDLE threads[NUM_THREADS];
    struct homebrew test[NUM_THREADS];
    int i;
    // Initialize each thread's structure. All share the same mutex and a shared
    // counter
    for ( i = 0; i < NUM_THREADS; i++ ) {
        test[i].mine = 0; test[i].shared = &shared; test[i].mutex = &mymutex;
    }
    // create the threads and then wait for all to finish
    for ( i = 0; i < NUM_THREADS; i++ )
        threads[i] = CreateThread(NULL, 0, SomeThread, &test[i], 0, NULL);
    for ( i = 0; i < NUM_THREADS; i++ )
        WaitForSingleObject( threads[i], INFINITE );
    // Verify all increments occurred atomically
    printf( "shared = %d (%s)\n", shared,
            shared == NUM_THREADS * NUM_ACQUIRES ? "correct" : "wrong" );
    for ( i = 0; i < NUM_THREADS; i++ ) {
        if ( test[i].mine != NUM_ACQUIRES ) {
            printf( "Thread %d cheated. Only %d acquires.\n", i, test[i].mine );
        }
    }
    return 0;
}
If I comment out the call to InterlockedCompareExchange and just let all threads run the increments in a free-for-all fashion, the results are wrong. Running it 10 times, for example, without the interlocked compare call:
shared = 748694 (wrong)
shared = 811522 (wrong)
shared = 796155 (wrong)
shared = 825947 (wrong)
shared = 1000000 (correct)
shared = 795036 (wrong)
shared = 801810 (wrong)
shared = 790812 (wrong)
shared = 724753 (wrong)
shared = 849444 (wrong)
The curious thing is that one run came out correct anyway. That might be because there is no "everyone start now" synchronization; maybe all threads started and finished in order in that case. But when I have the InterlockedCompareExchange call in place, it runs without failure (or at least it ran 100 times without failure ... that doesn't prove I didn't write a subtle bug into the example).
Here is the discussion from the people who implemented it ... very interesting as it shows the tradeoffs ..
Several posts from Linus T ... of course
In earlier days pre-POSIX etc I used to implement synchronization by using a native mode word (e.g. 16 or 32 bit word) and the Test And Set instruction lurking on every serious processor. This instruction guarantees to test the value of a word and set it in one atomic instruction. This provides the basis for a spinlock and from that a hierarchy of synchronization functions could be built. The simplest is of course just a spinlock which performs a busy wait, not an option for more than transitory sync'ing, then a spinlock which drops the process time slice at each iteration for a lower system impact. Notional concepts like Semaphores, Mutexes, Monitors etc can be built by getting into the kernel scheduling code.
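The test-and-set spinlock described above can be sketched in portable C++ roughly as follows. This is only an illustration, not how any particular kernel does it: std::atomic_flag stands in for the hardware Test And Set instruction, std::this_thread::yield() stands in for "dropping the time slice", and the names YieldingSpinLock and count_with_spinlock are made up for the example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Spinlock built on an atomic test-and-set; gives up the time slice while waiting.
class YieldingSpinLock {
public:
    void lock() {
        // test_and_set returns the previous value: true means the lock is already held.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical helper: several threads increment a plain counter under the lock.
long count_with_spinlock(int num_threads, int increments_per_thread) {
    YieldingSpinLock lock;
    long shared = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++shared;       // protected by the spinlock, no atomic increment needed
                lock.unlock();
            }
        });
    }
    for (auto &th : threads) {
        th.join();
    }
    return shared;
}
```

Calling count_with_spinlock(4, 10000) should return 40000 if the lock does its job. A real mutex goes further and puts waiters to sleep in the scheduler instead of spinning.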
As I recall the prime usage was to implement message queues to permit multiple clients to access a database server. Another was a very early real time car race result and timing system on a quite primitive 16 bit machine and OS.
These days I use Pthreads and Semaphores and Windows Events/Mutexes (mutices?) etc. and don't give a thought as to how they work, although I must admit that having been down in the engine room does give one an intuitive feel for better and more efficient multiprocessing.
In the Windows world:
Before Windows Vista, the mutex was implemented with a compare-exchange that changes the state of the mutex from Empty to BeingUsed. For the other threads that start waiting on the mutex the CAS will obviously fail, and they must be added to the mutex queue for later notification. Those queue operations (add/remove/check) were protected by a common lock in the Windows kernel.
After Windows XP, the mutex started to use a spin lock for performance reasons, making it a self-sufficient object.
In the Unix world I didn't get much further, but it is probably very similar to Windows 7.
Finally, for kernels that run on a single processor, the best way is to disable interrupts when entering the critical section and re-enable them when exiting.
I am writing a Linux kernel module and am seeing strange behavior in it.
Here is my code:
int data = 0;
void threadfn1()
{
int j;
for( j = 0; j < 10; j++ )
printk(KERN_INFO "I AM THREAD 1 %d\n",j);
data++;
}
void threadfn2()
{
int j;
for( j = 0; j < 10; j++ )
printk(KERN_INFO "I AM THREAD 2 %d\n",j);
data++;
}
static int __init abc_init(void)
{
struct task_struct *t1 = kthread_run(threadfn1, NULL, "thread1");
struct task_struct *t2 = kthread_run(threadfn2, NULL, "thread2");
while( 1 )
{
printk("debug\n"); // runs ok
if( data >= 2 )
{
kthread_stop(t1);
kthread_stop(t2);
break;
}
}
printk(KERN_INFO "HELLO WORLD\n");
}
Basically I was trying to wait for threads to finish and then print something after that.
The above code does achieve that target, but only WITH "printk("debug\n");" left in. As soon as I comment out printk("debug\n"); to run the code without debugging and load the module with the insmod command, the module hangs, as if it were lost in an endless loop. I don't know why printk affects my code in such a big way.
Any help would be appreciated.
regards.
You're not synchronizing the access to the data variable. What happens is that the compiler generates an infinite loop. Here is why:
while( 1 )
{
    if( data >= 2 )
    {
        kthread_stop(t1);
        kthread_stop(t2);
        break;
    }
}
The compiler can detect that the value of data never changes within the while loop. Therefore it can completely move the check out of the loop and you'll end up with a simple
while (1) {}
If you insert printk, the compiler has to assume that the global variable data may change (after all, the compiler has no idea what printk does in detail), so your code will start to work again (in an undefined-behavior kind of way).
How to fix this:
Use proper thread synchronization primitives. If you wrap the access to data into a code section protected by a mutex the code will work. You could also replace the variable data and use a counted semaphore instead.
Edit:
This link explains how locking in the linux-kernel works:
http://www.linuxgrill.com/anonymous/fire/netfilter/kernel-hacking-HOWTO-5.html
With the call to printk() removed the compiler is optimising the loop into while (1);. When you add the call to printk() the compiler is not sure that data isn't changed and so checks the value each time through the loop.
You can insert a barrier into the loop, which forces the compiler to reevaluate data on each iteration, e.g.:
while (1) {
    if (data >= 2) {
        kthread_stop(t1);
        kthread_stop(t2);
        break;
    }
    barrier();
}
Maybe data should be declared volatile? It could be that the compiler is not going to memory to get data in the loop.
Nils Pipenbrinck's answer is spot on. I'll just add some pointers.
Rusty's Unreliable Guide to Kernel Locking (every kernel hacker should read this one).
Goodbye semaphores?, The mutex API (lwn.net articles on the new mutex API introduced in early 2006, before that the Linux kernel used semaphores as mutexes).
Also, since your shared data is a simple counter, you can just use the atomic API (basically, declare your counter as atomic_t and access it using atomic_* functions).
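The kernel's atomic_t cannot be shown running outside a module, so here is a user-space analogue of the same idea. It is only a sketch: C++ std::atomic<int> is assumed in place of atomic_t and std::thread in place of kthread_run, and the function name run_workers_and_wait is invented for the example.

```cpp
#include <atomic>
#include <thread>

// Each worker bumps the shared counter once when it is done;
// the caller waits until both workers have reported in.
int run_workers_and_wait() {
    std::atomic<int> done{0};
    auto worker = [&done] {
        // ... the thread's real work would go here ...
        done.fetch_add(1, std::memory_order_release);
    };
    std::thread t1(worker);
    std::thread t2(worker);
    // An atomic load is performed on every iteration,
    // so the compiler cannot hoist the check out of the loop.
    while (done.load(std::memory_order_acquire) < 2) {
        std::this_thread::yield();
    }
    t1.join();
    t2.join();
    return done.load();
}
```

In the module itself the corresponding calls would be atomic_inc() in each thread function and atomic_read() in the waiting loop.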
Volatile might not always be a "bad idea". One needs to distinguish the case where volatile is needed from the case where a mutual exclusion mechanism is needed. Using or misusing one mechanism in place of the other is suboptimal. In the case above, I would suggest that the optimal solution needs both mechanisms: a mutex to provide mutual exclusion, and volatile to tell the compiler that "info" must be read fresh from hardware. Otherwise, in some situations (optimization levels -O2, -O3), the compiler might inadvertently leave out the needed code.
Are there locks in Linux where the waiting queue is FIFO? This seems like such an obvious thing, and yet I just discovered that pthread mutexes aren't FIFO, and semaphores apparently aren't FIFO either (I'm working on kernel 2.4 (homework))...
Does Linux have a lock with FIFO waiting queue, or is there an easy way to make one with existing mechanisms?
Here is a way to create a simple queueing "ticket lock", built on pthreads primitives. It should give you some ideas:
#include <pthread.h>

typedef struct ticket_lock {
    pthread_cond_t cond;
    pthread_mutex_t mutex;
    unsigned long queue_head, queue_tail;
} ticket_lock_t;

#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }

void ticket_lock(ticket_lock_t *ticket)
{
    unsigned long queue_me;
    pthread_mutex_lock(&ticket->mutex);
    queue_me = ticket->queue_tail++;
    while (queue_me != ticket->queue_head)
    {
        pthread_cond_wait(&ticket->cond, &ticket->mutex);
    }
    pthread_mutex_unlock(&ticket->mutex);
}

void ticket_unlock(ticket_lock_t *ticket)
{
    pthread_mutex_lock(&ticket->mutex);
    ticket->queue_head++;
    pthread_cond_broadcast(&ticket->cond);
    pthread_mutex_unlock(&ticket->mutex);
}
If you are asking what I think you are asking the short answer is no. Threads/processes are controlled by the OS scheduler. One random thread is going to get the lock, the others aren't. Well, potentially more than one if you are using a counting semaphore but that's probably not what you are asking.
You might want to look at pthread_setschedparam but it's not going to get you where I suspect you want to go.
You could probably write something but I suspect it will end up being inefficient and defeat using threads in the first place since you will just end up randomly yielding each thread until the one you want gets control.
Chances are good you are just thinking about the problem in the wrong way. You might want to describe your goal and get better suggestions.
I had a similar requirement recently, except dealing with multiple processes. Here's what I found:
If you need 100% correct FIFO ordering, go with caf's pthread ticket lock.
If you're happy with 99% and favor simplicity, a semaphore or a mutex can do really well actually.
Ticket lock can be made to work across processes:
You need to use shared memory, a process-shared mutex and condition variable, and handle processes dying with the mutex locked (-> robust mutex) ... which is a bit overkill here: all I need is for the different instances not to run at the same time and for the order to be mostly fair.
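For reference, the setup that the cross-process variant needs could look roughly like this. It is a sketch under several assumptions: Linux with glibc, an anonymous shared mapping (so it only works for processes forked after the call; unrelated processes would need named shared memory instead), and the names SharedLock, create_shared_lock and shared_lock_acquire are made up here.

```cpp
#include <cerrno>
#include <pthread.h>
#include <sys/mman.h>

// Lock state that lives in shared memory so that several processes can use it.
struct SharedLock {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    unsigned long queue_head;
    unsigned long queue_tail;
};

// Maps an anonymous shared region and initialises a process-shared,
// robust mutex plus a process-shared condition variable in it.
// Returns nullptr on failure.
SharedLock *create_shared_lock() {
    void *mem = mmap(nullptr, sizeof(SharedLock), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;
    SharedLock *lk = static_cast<SharedLock *>(mem);

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&ma, PTHREAD_MUTEX_ROBUST);
    int rc = pthread_mutex_init(&lk->mutex, &ma);
    pthread_mutexattr_destroy(&ma);

    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    if (rc == 0) rc = pthread_cond_init(&lk->cond, &ca);
    pthread_condattr_destroy(&ca);

    if (rc != 0) {
        munmap(mem, sizeof(SharedLock));
        return nullptr;
    }
    lk->queue_head = lk->queue_tail = 0;
    return lk;
}

// Locks the mutex; if the previous owner died while holding it,
// marks the mutex consistent again and carries on as the new owner.
int shared_lock_acquire(SharedLock *lk) {
    int rc = pthread_mutex_lock(&lk->mutex);
    if (rc == EOWNERDEAD) {
        // The protected state may be inconsistent; a real program would repair it here.
        rc = pthread_mutex_consistent(&lk->mutex);
    }
    return rc;
}
```

The ticket lock logic itself stays the same as in caf's answer; only the initialisation and the EOWNERDEAD handling are added.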
Using a semaphore:
#include <errno.h>
#include <fcntl.h>      /* O_CREAT */
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>     /* usleep */

/* fail() was not shown in the original post; this is one plausible definition */
static void fail(const char *what)
{
    perror(what);
    exit(EXIT_FAILURE);
}

static sem_t *sem = NULL;

void fifo_init()
{
    sem = sem_open("/server_fifo", O_CREAT, 0600, 1);
    if (sem == SEM_FAILED) fail("sem_open");
}

void fifo_lock()
{
    int r;
    struct timespec ts;
    if (clock_gettime(CLOCK_REALTIME, &ts) == -1) fail("clock_gettime");
    ts.tv_sec += 5;      /* 5s timeout */
    while ((r = sem_timedwait(sem, &ts)) == -1 && errno == EINTR)
        continue;        /* Restart if interrupted */
    if (r == 0) return;
    if (errno == ETIMEDOUT) fprintf(stderr, "timeout ...\n");
    else fail("sem_timedwait");
}

void fifo_unlock()
{
    /* If we somehow end up with more than one token, don't increment the semaphore... */
    int val;
    if (sem_getvalue(sem, &val) == 0 && val <= 0)
        if (sem_post(sem)) fail("sem_post");
    usleep(1);           /* Yield to other processes */
}
Ordering is almost 100% FIFO.
Note: This is with a 4.4 Linux kernel, 2.4 might be different.
I'm writing a system (X-Platform Windows/Linux) that talks to a custom device using an FTDI USB chip. I use their D2XX driver for device open/close/read/write. So far, so good.
I need to know when the device is disconnected so the program can respond gracefully. At present, under Windows the application receives a sudden unexpected close. Under Linux, when the device is disconnected, there is a segmentation fault.
I have found information about listening for the WM_DEVICECHANGE message under Windows. However, I have not found how to detect this event under Linux. There is information about interacting with the kernel at the device driver level, but I can't figure out how to do this at the application level. The FTDI driver does not offer any such service.
The system is written using the Qt framework with C++. The device driver is FTDI's D2XX driver.
Can anyone point me in the right direction?
Thanks so much in advance!
Judy
You'll probably want to use HAL (freedesktop.org's Hardware Abstraction Layer).
In the future you will probably want to use DeviceKit. It is a project to fix the many problems with HAL. It hasn't been adopted by all major distros yet though (I think just Fedora), so you probably don't want to use it right now.
Edit: As Jeach said, you can also use udev. I wouldn't suggest this, as it is much lower level and harder to program, but if latency is very important, this might be the best option.
Although what I'm about to tell you won't directly answer your question, it may give you a hint as to your next move.
I use udev rules configured in '/etc/udev/rules.d/' which run various scripts. When a USB device gets connected/disconnected I run a script which sends a HUP signal to my binary. Since my requirements can handle a bit of lag it works perfectly fine for me.
But my point is that maybe there is a udev library you can link to and register events programmatically (instead of scripts).
Hope it helps... good luck!
I recently had a project which involved reading via an FTDI chip. I also tried using libftdi but found out that it is much simpler to use /dev/ttyUSB* for reading and writing. This way, you can use QFile('/dev/ttyUSB*') to write and read. You can also check if the device actually exists and it won't segfault. Of course, this is not a very 'Platform independent' way. To get a platform independent method, you can use a Serial library for Qt.
You obviously have to write different implementations for the different operating systems unless you want to create a thread to continuously run:
FT_ListDevices(&numDevs, nullptr, FT_LIST_NUMBER_ONLY);
and enumerate the devices if numDevs changed compared to previous checks.
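The polling approach boils down to remembering the last device count and comparing it with the current one. A small sketch of that logic follows; the class DeviceCountWatcher is invented for illustration, and the counting function is passed in as a callback so that the sketch does not depend on the D2XX library.

```cpp
#include <functional>
#include <utility>

// Remembers the last device count and reports whether it has changed.
class DeviceCountWatcher {
public:
    explicit DeviceCountWatcher(std::function<unsigned long()> count_devices)
        : count_devices_(std::move(count_devices)), last_(count_devices_()) {}

    // Call periodically from the polling thread.
    // Returns true when the count differs from the previous poll.
    bool poll() {
        unsigned long now = count_devices_();
        bool changed = (now != last_);
        last_ = now;
        return changed;
    }

    unsigned long last_count() const { return last_; }

private:
    std::function<unsigned long()> count_devices_;  // declared first: initialised before last_
    unsigned long last_;
};
```

With D2XX the callback would wrap the FT_ListDevices call shown above and return numDevs; whenever poll() returns true you re-enumerate the devices.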
If you are like me and don't really like to do that sort of continuous polling on your USB devices then you will have to target your specific operating system.
Here's a link to some sample code from FTDI:
http://www.ftdichip.com/Support/SoftwareExamples/CodeExamples/VC.htm
Example 7 shows how to detect the USB insertion and removal on windows:
http://www.ftdichip.com/Support/Documents/AppNotes/AN_152_Detecting_USB_%20Device_Insertion_and_Removal.pdf
On Linux I can personally recommend using udev.
This code is for enumerating the devices:
#include <sys/types.h>
#include <dirent.h>
#include <cstdlib>
#include <iostream>
#include <string>
#include <libudev.h>
#include <fcntl.h>
using namespace std;

// sysattr values may be NULL, and a std::string must never be built from NULL
static string sysattr(struct udev_device *dev, const char *name) {
    const char *value = udev_device_get_sysattr_value(dev, name);
    return value ? value : "";
}

struct udev *udev = udev_new();
if (!udev) {
    cout << "Can't create udev" << endl;
}
struct udev_enumerate *enumerate = udev_enumerate_new(udev);
udev_enumerate_add_match_subsystem(enumerate, "usb");
udev_enumerate_scan_devices(enumerate);
struct udev_list_entry *dev_list_entry, *devices = udev_enumerate_get_list_entry(enumerate);
struct udev_device *dev;
udev_list_entry_foreach(dev_list_entry, devices) {
    const char *path = udev_list_entry_get_name(dev_list_entry);
    dev = udev_device_new_from_syspath(udev, path);
    if (!dev) {
        continue;
    }
    if( udev_device_get_devnode(dev) != nullptr ){
        string vendor = sysattr(dev, "idVendor");
        string product = sysattr(dev, "idProduct");
        string description = sysattr(dev, "product");
        cout << vendor << product << description << endl;
    }
    udev_device_unref(dev);
}
udev_enumerate_unref(enumerate);
I put this code in a separate thread that waits for an insertion or removal event:
struct udev_device *dev;
struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
udev_monitor_filter_add_match_subsystem_devtype(mon, "usb", NULL);
udev_monitor_enable_receiving(mon);
int fd = udev_monitor_get_fd(mon);
// Clear O_NONBLOCK so that udev_monitor_receive_device() blocks until an event arrives
int flags = fcntl(fd, F_GETFL, 0);
if (flags == -1){
    debugError("Can't get flags for fd");
} else {
    fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);
}
while( _running ){
    cout << "waiting for udev" << endl;
    dev = udev_monitor_receive_device(mon);
    if (!dev) {
        continue;
    }
    if (udev_device_get_devnode(dev) != nullptr ) {
        const char *rawAction = udev_device_get_action(dev);
        string action = rawAction ? rawAction : "";
        if( action == "add" ){
            cout << "do something with your device... " << endl;
        } else if( action == "remove" ){
            string path = (std::string)udev_device_get_devnode(dev);
            // Use an iterator so that the matching entry can actually be erased
            for( auto iter = _DevicesList.begin(); iter != _DevicesList.end(); ++iter ){
                if( iter->getPath() == path ){
                    _DevicesList.erase(iter);
                    cout << "Erased Device from list" << endl;
                    break;
                }
            }
        }
    }
    // Release every received device, even those without a device node
    udev_device_unref(dev);
}
udev_monitor_unref(mon);
Some of the functions and variables are obviously not defined when you copy/paste this code.
I keep a list of detected devices to check the path and other information like the location ID of the inserted device. I need the location ID later to FT_OpenEx via FT_OPEN_BY_LOCATION. To get the location id I read the contents of the following files:
string getFileContent(string file ){
    string content = "";
    ifstream readfile( file );
    if( readfile.is_open() ){
        getline(readfile, content );
        readfile.close();
    }
    return content;
}
string usbdirectory = "/sys/bus/usb/devices";
string dev1content = getFileContent(usbdirectory+"/usb"+udev_device_get_sysattr_value(dev, "busnum" )+"/dev");
int dev1num = std::atoi(dev1content.substr(dev1content.find_first_of(":")+1).c_str());
string dev2content = (std::string)udev_device_get_sysattr_value(dev, "dev");
int dev2num = std::atoi(dev2content.substr(dev2content.find_first_of(":")+1).c_str());
int locationid = dev1num+dev2num+257;
I can't guarantee that the locationid is correct but it seemed to work for me until now.
Don't forget that you have two problems here:
Detecting device insertion / removal
Properly terminating your application.
The first problem has been addressed by Zifre.
But the second problem remains: your Linux app should not be segfaulting when the device is removed, and I think the two problems are unrelated. If the device is removed in the middle of a read or write system call, that system call will return with an error before you get any notification, and this should not segfault your app.
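To illustrate the point about checking return values, here is a small sketch for the case where the device is accessed through a plain file descriptor (for example a /dev/ttyUSB* node, as suggested in another answer). The helper name checked_write is made up, and which errno you actually get after an unplug depends on the driver; with D2XX you would check the FT_STATUS returned by each call in the same spirit.

```cpp
#include <cerrno>
#include <unistd.h>

// Writes the whole buffer and reports failure instead of assuming success.
// Returns 0 on success, otherwise the errno value reported by write().
int checked_write(int fd, const void *buf, size_t len) {
    const char *p = static_cast<const char *>(buf);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;   // interrupted by a signal, retry
            return errno;                   // the caller decides how to shut down cleanly
        }
        p += n;
        len -= static_cast<size_t>(n);
    }
    return 0;
}
```

On a non-zero result the application can close the handle, drop the device from its list and tell the user, rather than carrying on with a handle that is no longer valid.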