I have two CPUs on the chip and they have a shared memory. This is not a SMP architecture. Just two CPUs on the chip with shared memory.
There is a Unix-like operating system on the first CPU and there is a Linux operating system on the second CPU.
The first CPU does some job and the result of this job is some data. After first CPU finishes its job it should say to another CPU that job is finished and the second CPU have to process this data.
What is the way to handle interprocessor communication? What algorithm should I use to do that?
Any reference to an article about it would be greatly appreciated.
It all depends on the hardware. If all you have is shared memory, and no other way of communication, then you have to use a polling of some sort.
Are both of your processor running linux ? How do they handle the shared memory ?
A good solution is to use a linked list as a fifo. On this fifo you put data descriptor, like adress and size.
For example, you can have an input and output fifo, and go like this :
Processor A does some calculation
Processor A push the data descriptoron the output fifo
Processor A wait for data descriptor on the input fifo
loop
Processor B wait for data descriptor on the output fifo
Processor B works with data
Processor B push used data descriptor on the input fifo
loop
Of course, the hard part is in the locking. May be you should reformulate your question to emphasize this is not 'standard' SMP.
If you have no atomic test and set bit operation available on the memory, I guess you have to go with a scheme where some zone of memory is write only for one processor, and read only for the other.
Edit : See Hasturkun answer, for a way of passing messages from one processor to the other, using ordered write instead of atomicity to provide serialized access to some predefined data.
Ok. I understand the question.I have worked on this kind of an issue.
Now first thing that you need to understand is the working of the shared memory that exists between the 2 CPUs. Because these shared memory can be accessed in different ways, u need to figure out which one suits u the best.
Most times hardware semaphores will be provided in the shared memory along with the hardware interrupt to notify the message transfer from one processor to the other processor.
So have a look at this first.
A really good method is to just send IP packets back and forth (using sockets). this has the advantage that you can test stuff off-chip - as in, run a test version of one process on a PC, if you have networking.
If both processors are managed by a single os, then you can use any of the standard IPC to communicate with each other as OS takes care of everything. If they are running on different OSes then sockets would be your best bet.
EDIT
Quick unidirectional version:
Ready flag
Done flag
init:
init()
{
ready = 0;
done = 1;
}
writer:
send()
{
while (!done)
sleep();
/* copy data in */
done = 0;
ready = 1;
}
reader:
poll()
{
while (1)
{
if (ready)
{
recv();
}
sleep();
}
}
recv()
{
/* copy data out */
ready = 0;
done = 1;
}
Build a message passing system via the shared mem (which should be coherent, either by being uncached for both processors, or by use of cache flush/invalidate calls).
Your shared memory structure should have (at least) the following fields:
Current owner
Message active (as in, should be read)
Request usage fields
Flow will probably be like this: (assumed send/recv synchronized not to run at same time)
poll()
{
/* you're better off using interrupts instead, if you have them */
while(1)
{
if (current_owner == me)
{
if (active)
{
recv();
}
else if (!request[me] && request[other])
{
request[other] = 0;
current_owner = other;
}
}
sleep();
}
}
recv()
{
/* copy data... */
active = 0;
/* check if we still want it */
if (!request[me] && request[other])
{
request[other] = 0;
current_owner = other;
}
}
send()
{
request[me] = 1;
while (current_owner != me || active)
{
sleep();
}
request[me] = 0;
/* copy data in... */
/* pass to other side */
active = 1;
current_owner = other;
}
How about using the shared mem?
I don't have a good link right now, but if you google for IPC + shared mem I bet you find some good info :)
Are you sure you need to do this? In my experience you're better off letting your compiler & operating system manage how your process uses multiple CPUs.
Related
I am writing a multi-thread application in Linux.
There is no RT patch in kernel, yet I use threads with priorities.
On checking the time it takes to execute printf , I measure different values every time I measure, although it is done in the highest priority thread :
if(clock_gettime(CLOCK_MONOTONIC, &start))
{ /* handle error */
}
for(int i=0; i< 1000; i++)
printf("hello world");
if(clock_gettime(CLOCK_MONOTONIC, &end))
{
/* handle error */
}
elapsedSeconds = TimeSpecToSeconds(&end) - TimeSpecToSeconds(&start);
Why does printf change timing and in non deterministic way , i.e. each
How should printf be used with RT threads ?
Can it be used inside RT thread or should it be totally avoided ?
Is write to disk should be treated in the same way as printf ? Should it be used only in separate low priority thread ?
printf under the hood triggers a non-realtime (even blocking) mechanism of the buffered IO.
It's not only non-deterministic, but opens the possibility of a priority inversion.
You should be very careful using it from a real time thread (I would say totally avoid it.
Normally, in a latency bound code you would use a wait-free binary audit into a chain of (pre-allocated or memory mapped) ring buffers and flush them using a background lower priority thread (or even a separate process).
Today a friend of mine told me that Go programs can scale themselves on multiple CPU cores. I were quite surprised to hear that knowing that system task schedulers do not know anything about goroutines and hence can't run them on multiple cores.
I did some search and found out that Go programs can spawn multiple OS tasks to run them on different cores (the number is controlled by GOMAXPROCS environment variable). But as far as I know forking a process leads to complete copy of process data and different processes run in different address spaces.
So what about global variables in Go programs? Are they safe to use with multiple goroutines? Do they somehow synchronize between system processes? And if they do then how? I am mainly concerned about linux and freebsd implementations.
I figured it out! It's all in go sources.
There is a Linux system call that I were unaware of.
It's called "clone". It is more flexible than fork and it allows
a child process to live in its parent's address space.
Here is a short overview of the thread creation process.
First there is a newm function in src/runtime/proc.go. This
function is responsible for creating a new working thread
(or machine as it is called in comments).
// Create a new m. It will start off with a call to fn, or else the scheduler.
// fn needs to be static and not a heap allocated closure.
// May run with m.p==nil, so write barriers are not allowed.
//go:nowritebarrier
func newm(fn func(), _p_ *p) {
// ... some code skipped ...
newosproc(mp, unsafe.Pointer(mp.g0.stack.hi))
}
This function calls newosproc which is OS-specific.
For Linux it can be found in src/runtime/os_linux.go. Here
are relevant parts of that file:
var (
// ...
cloneFlags = _CLONE_VM | /* share memory */
_CLONE_FS | /* share cwd, etc */
_CLONE_FILES | /* share fd table */
_CLONE_SIGHAND | /* share sig handler table */
_CLONE_THREAD /* revisit - okay for now */
)
// May run with m.p==nil, so write barriers are not allowed.
//go:nowritebarrier
func newosproc(mp *m, stk unsafe.Pointer) {
// ... some code skipped ...
ret := clone(cloneFlags, /* ... other flags ... */)
// ... code skipped
}
And the clone function is defined in architecture-specific
files. For amd64 it is in src/runtime/sys_linux_amd64.s.
It is the actual system call.
So Go programs do run in multiple OS threads which enables
spanning across CPUs, but they use one shared address space.
Phew... I love Go.
As the title says, how do two or more threads share memory on the heap that they have allocated? I've been thinking about it and I can't figure out how they can do it. Here is my understanding of the process, presumably I am wrong somewhere.
Any thread can add or remove a given number of bytes on the heap by making a system call which returns a pointer to this data, presumably by writing to a register which the thread can then copy to the stack.
So two threads A and B can allocate as much memory as they want. But I don't see how thread A could know where the memory that thread B has allocated is located. Nor do I know how either thread could know where the other thread's stack is located. Multi-threaded programs share the heap and, I believe, can access one another's stack but I can't figure out how.
I tried searching for this question but only found language specific versions that abstract away the details.
Edit:
I am trying not to be language or OS specific but I am using Linux and am looking at it from a low level perspective, assembly I guess.
My interpretation of your question: How can thread A get to know a pointer to the memory B is using? How can they exchange data?
Answer: They usually start with a common pointer to a common memory area. That allows them to exchange other data including pointers to other data with each other.
Example:
Main thread allocates some shared memory and stores its location in p
Main thread starts two worker threads, passing the pointer p to them
The workers can now use p and work on the data pointed to by p
And in a real language (C#) it looks like this:
//start function ThreadProc and pass someData to it
new Thread(ThreadProc).Start(someData)
Threads usually do not access each others stack. Everything starts from one pointer passed to the thread procedure.
Creating a thread is an OS function. It works like this:
The application calls the OS using the standard ABI/API
The OS allocates stack memory and internal data structures
The OS "forges" the first stack frame: It sets the instruction pointer to ThreadProc and "pushes" someData onto the stack. I say "forge" because this first stack frame does not arise naturally but is created by the OS artificially.
The OS schedules the thread. ThreadProc does not know it has been setup on a fresh stack. All it knows is that someData is at the usual stack position where it would expect it.
And that is how someData arrives in ThreadProc. This is the way the first, initial data item is shared. Steps 1-3 are executed synchronously by the parent thread. 4 happens on the child thread.
A really short answer from a bird's view (1000 miles above):
Threads are execution paths of the same process, and the heap actually belongs to the process (and as a result shared by the threads). Each threads just needs its own stack to function as a separate unit of work.
Threads can share memory on a heap if they both use the same heap. By default most languages/frameworks have a single default heap that code can use to allocate memory from the heap. In unmanaged languages you generally make explicit calls to allocate heap memory. In C, that might be malloc, etc. for example. In managed languages heap allocation is usually automatic and how allocation is done depends on the language--usually through the use of the new operator. but, that depends slightly on context. If you provide the OS or language context you're asking about, I might be able to provide more detail.
A Thread shared with other threads belonging to the same process: its code section, data section and other operating system resources such as open files and signals.
The part you are missing is static memory containing static variables.
This memory is allocated when the program is started, and assigned known adresses (determined at the linking time). All threads can access this memory without exchanging any data runtime, because the addresses are effectively hardcoded.
A simple example might look like this:
// Global variable.
std::atomic<int> common_var;
void thread1() {
common_var = compute_some_value();
}
void thread2() {
do_something();
int current_value = common_var;
do_more();
}
And of course the global value may be a pointer, that can be used to exchange heap memory. The producer allocates some objects, the consumer takes and uses them.
// Global variable.
std::atomic<bool> produced;
SomeData* data_pointer;
void producer_thread() {
while (true) {
if (!produced) {
SomeData* new_data = new SomeData();
data_pointer = new_data;
// Let the other thread know there is something to read.
produced = true;
}
}
}
void consumer_thread() {
while (true) {
if (produced) {
SomeData* my_data = data_pointer;
data_pointer = nullptr;
// Let the other thread know we took the data.
produced = false;
do_something_with(my_data);
delete my_data;
}
}
}
Please note: these are not examples of good concurrent code, but they show the general idea without too much clutter.
I want to read messages sent from an Arduino via the FTDI (serial) interface in a simple C or C++ program under Linux. The Arduino sends a two character 'header', a command byte followed by a few bytes of data depending on the command.
My first attempt was to simply poll the data using open() and read() but doing so causes about 12% CPU use. This didn't seem to be the appropriate way of doing things.
Second I read up on libevent on implemented an event loop that fires an event when data is present on the file descriptor. My cpu usage was next to nothing but I couldn't read the entire message before another event was called. The events didn't fire when an entire message was received but as soon as any/some data was available on the file descriptor. Looking at it more it was obvious that this wouldn't work quite the way I wanted it. This is my event code: http://pastebin.com/b9W0jHjb
Third I implemented a buffered event with libevent. It seemed to work somewhat better but still split some of the messages up. My event code is: http://pastebin.com/PQNriUCN
Fourth I dumped libevent and tried out Boost's ASIO class. The example I was following was http://www.webalice.it/fede.tft/serial_port/serial_port.html. It seemed to work alright but the "event loop" was a "while(1) {}" which caused CPU usage to go up again. The loop just checks for error status while serial reading happens in a callback on a different thread. I added a usleep(1) to the while loop and it brought my CPU usage to 2% which is ok, but still seems heavy for such a light program.
Most of the examples of libevent and even the underlying epoll use TCP sockets which doesn't seem to behave quite the same as serial port data.
So my main question is: what is a good lightweight way to read messages from a serial port without heavy polling? (in linux, using C or C++)
The OP has probably long since solved this, but for the sake of anyone who gets here by google:
#include <sys/poll.h>
struct pollfd fds[1];
fds[0].fd = serial_fd;
fds[0].events = POLLIN ;
int pollrc = poll( fds, 1, 1000);
if (pollrc < 0)
{
perror("poll");
}
else if( pollrc > 0)
{
if( fds[0].revents & POLLIN )
{
char buff[1024];
ssize_t rc = read(serial_fd, buff, sizeof(buff) );
if (rc > 0)
{
/* You've got rc characters. do something with buff */
}
}
}
Make sure the serial port is opened in nonblocking mode as poll() can sometimes return when there are no characters waiting.
On constrained devices, I often find myself "faking" locks between 2 threads with 2 bools. Each is only read by one thread, and only written by the other. Here's what I mean:
bool quitted = false, paused = false;
bool should_quit = false, should_pause = false;
void downloader_thread() {
quitted = false;
while(!should_quit) {
fill_buffer(bfr);
if(should_pause) {
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
}
}
quitted = true;
}
void ui_thread() {
// new Thread(downloader_thread).start();
// ...
should_pause = true;
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
}
Of course on a PC I wouldn't do this, but on constrained devices, it seems reading a bool value would be much quicker than obtaining a lock. Of course I trade off for slower recovery (see "sleep(50)") when a change to the buffer is needed.
The question -- is it completely thread-safe? Or are there hidden gotchas I need to be aware of when faking locks like this? Or should I not do this at all?
Using bool values to communicate between threads can work as you intend, but there are indeed two hidden gotchas as explained in this blog post by Vitaliy Liptchinsky:
Cache Coherency
A CPU does not always fetch memory values from RAM. Fast memory caches on the die are one of the tricks used by CPU designers to work around the Von Neumann bottleneck. On some multi-cpu or multi-core architectures (like Intel's Itanium) these CPU caches are not shared or automatically kept in sync. In other words, your threads may be seeing different values for the same memory address if they run on different CPU's.
To avoid this you need to declare your variables as volatile (C++, C#, java), or do explicit volatile read/writes, or make use of locking mechanisms.
Compiler Optimizations
The compiler or JITter may perform optimizations which are not safe if multiple threads are involved. See the linked blog post for an example. Again, you must make use of the volatile keyword or other mechanisms to inform you compiler.
Unless you understand the memory architecture of your device in detail, as well as the code generated by your compiler, this code is not safe.
Just because it seems that it would work, doesn't mean that it will. "Constrained" devices, like the unconstrained type, are getting more and more powerful. I wouldn't bet against finding a dual-core CPU in a cell phone, for instance. That means I wouldn't bet that the above code would work.
Concerning the sleep call, you could always just do sleep(0) or the equivalent call that pauses your thread letting the next in line a turn.
Concerning the rest, this is thread safe if you know the implementation details of your device.
Answering the questions.
Is this completely thread safe? I would answer no this is not thread safe and I would just not do this at all. Without knowing the details of our device and compiler, if this is C++, the compiler is free to reorder and optimize things away as it sees fit. e.g. you wrote:
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
but the compiler may choose to reorder this into something like this:
sleep(50);
is_paused = false;
this probably won't work even a single core device as others have said.
Rather than taking a lock, you may try to do better to just do less on the UI thread rather than yield in the middle of processing UI messages. If you think that you have spent too much time on the UI thread then find a way to cleanly exit and register an asynchronous call back.
If you call sleep on a UI thread (or try to acquire a lock or do anyting that may block) you open the door to hangs and glitchy UIs. A 50ms sleep is enough for a user to notice. And if you try to acquire a lock or do any other blocking operation (like I/O) you need to deal with the reality of waiting for an indeterminate amount of time to get the I/O which tends to translate from glitch to hang.
This code is unsafe under almost all circumstances. On multi-core processors you will not have cache coherency between cores because bool reads and writes are not atomic operations. This means each core is not guarenteed to have the same value in the cache or even from memory if the cache from the last write hasn't been flushed.
However, even on resource constrained single core devices this is not safe because you do not have control over the scheduler. Here is an example, for simplicty I'm going to pretend these are the only two threads on the device.
When the ui_thread runs, the following lines of code could be run in the same timeslice.
// new Thread(downloader_thread).start();
// ...
should_pause = true;
The downloader_thread runs next and in it's time slice the following lines are executed:
quitted = false;
while(!should_quit)
{
fill_buffer(bfr);
The scheduler prempts the downloader_thread before fill_buffer returns and then activates the ui_thread which runs.
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
The resize buffer operation is done while the downloader_thread is in the process of filling the buffer. This means the buffer is corrupted and you'll likely crash soon. It won't happen everytime, but the fact that you are filling the buffer before you set is_paused to true makes it more likely to happen, but even if you switched the order of those two operations on the downloader_thread you would still have a race condition, but you'd likely deadlock instead of corrupting the buffer.
Incidentally, this is a type of spinlock, it just doesn't work. Spinlock's aren't very for wait times that are likely to span to many time slices cause the spin the processor. Your implmentation does sleep which is a bit nicer but the scheduler still has to run your thread and thread context switches aren't cheap. If you are waiting on a critical section or semaphore, the scheduler doesn't active your thread again till the resource has become free.
You might be able to get away with this in some form on a specific platform/architecture, but it is really easy to make a mistake that is very hard to track down.