Golang, processes and shared memory - linux

Today a friend of mine told me that Go programs can scale themselves on multiple CPU cores. I were quite surprised to hear that knowing that system task schedulers do not know anything about goroutines and hence can't run them on multiple cores.
I did some search and found out that Go programs can spawn multiple OS tasks to run them on different cores (the number is controlled by GOMAXPROCS environment variable). But as far as I know forking a process leads to complete copy of process data and different processes run in different address spaces.
So what about global variables in Go programs? Are they safe to use with multiple goroutines? Do they somehow synchronize between system processes? And if they do then how? I am mainly concerned about linux and freebsd implementations.

I figured it out! It's all in go sources.
There is a Linux system call that I were unaware of.
It's called "clone". It is more flexible than fork and it allows
a child process to live in its parent's address space.
Here is a short overview of the thread creation process.
First there is a newm function in src/runtime/proc.go. This
function is responsible for creating a new working thread
(or machine as it is called in comments).
// Create a new m. It will start off with a call to fn, or else the scheduler.
// fn needs to be static and not a heap allocated closure.
// May run with m.p==nil, so write barriers are not allowed.
//go:nowritebarrier
func newm(fn func(), _p_ *p) {
// ... some code skipped ...
newosproc(mp, unsafe.Pointer(mp.g0.stack.hi))
}
This function calls newosproc which is OS-specific.
For Linux it can be found in src/runtime/os_linux.go. Here
are relevant parts of that file:
var (
// ...
cloneFlags = _CLONE_VM | /* share memory */
_CLONE_FS | /* share cwd, etc */
_CLONE_FILES | /* share fd table */
_CLONE_SIGHAND | /* share sig handler table */
_CLONE_THREAD /* revisit - okay for now */
)
// May run with m.p==nil, so write barriers are not allowed.
//go:nowritebarrier
func newosproc(mp *m, stk unsafe.Pointer) {
// ... some code skipped ...
ret := clone(cloneFlags, /* ... other flags ... */)
// ... code skipped
}
And the clone function is defined in architecture-specific
files. For amd64 it is in src/runtime/sys_linux_amd64.s.
It is the actual system call.
So Go programs do run in multiple OS threads which enables
spanning across CPUs, but they use one shared address space.
Phew... I love Go.

Related

Will Go's scheduler yield control from one goroutine to another for CPU-intensive work?

The accepted answer at golang methods that will yield goroutines explains that Go's scheduler will yield control from one goroutine to another when a syscall is encountered. I understand that this means if you have multiple goroutines running, and one begins to wait for something like an HTTP response, the scheduler can use this as a hint to yield control from that goroutine to another.
But what about situations where there are no syscalls involved? What if, for example, you had as many goroutines running as logical CPU cores/threads available, and each were in the middle of a CPU-intensive calculation that involved no syscalls. In theory, this would saturate the CPU's ability to do work. Would the Go scheduler still be able to detect an opportunity to yield control from one of these goroutines to another, that perhaps wouldn't take as long to run, and then return control back to one of these goroutines performing the long CPU-intensive calculation?
There are few if any promises here.
The Go 1.14 release notes says this in the Runtime section:
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.
A consequence of the implementation of preemption is that on Unix systems, including Linux and macOS systems, programs built with Go 1.14 will receive more signals than programs built with earlier releases. This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. ...
I quoted part of the third paragraph here because this gives us a big clue as to how this asynchronous preemption works: the runtime system has the OS deliver some OS signal (SIGALRM, SIGVTALRM, etc.) on some sort of schedule (real or virtual time). This allows the Go runtime to implement the same kind of schedulers that real OSes implement with real (hardware) or virtual (virtualized hardware) timers. As with OS schedulers, it's up to the runtime to decide what to do with the clock ticks: perhaps just run the GC code, for instance.
We also see a list of platforms that don't do it. So we probably should not assume it will happen at all.
Fortunately, the runtime source is actually available: we can go look to see what does happen, should any given platform implement it. This shows that in runtime/signal_unix.go:
// We use SIGURG because it meets all of these criteria, is extremely
// unlikely to be used by an application for its "real" meaning (both
// because out-of-band data is basically unused and because SIGURG
// doesn't report which socket has the condition, making it pretty
// useless), and even if it is, the application has to be ready for
// spurious SIGURG. SIGIO wouldn't be a bad choice either, but is more
// likely to be used for real.
const sigPreempt = _SIGURG
and:
// doSigPreempt handles a preemption signal on gp.
func doSigPreempt(gp *g, ctxt *sigctxt) {
// Check if this G wants to be preempted and is safe to
// preempt.
if wantAsyncPreempt(gp) && isAsyncSafePoint(gp, ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr()) {
// Inject a call to asyncPreempt.
ctxt.pushCall(funcPC(asyncPreempt))
}
// Acknowledge the preemption.
atomic.Xadd(&gp.m.preemptGen, 1)
atomic.Store(&gp.m.signalPending, 0)
}
The actual asyncPreempt function is in assembly, but it just does some assembly-only trickery to save user registers, and then calls asyncPreempt2 which is in runtime/preempt.go:
//go:nosplit
func asyncPreempt2() {
gp := getg()
gp.asyncSafePoint = true
if gp.preemptStop {
mcall(preemptPark)
} else {
mcall(gopreempt_m)
}
gp.asyncSafePoint = false
}
Compare this to runtime/proc.go's Gosched function (documented as the way to voluntarily yield):
//go:nosplit
// Gosched yields the processor, allowing other goroutines to run. It does not
// suspend the current goroutine, so execution resumes automatically.
func Gosched() {
checkTimeouts()
mcall(gosched_m)
}
We see the main differences include some "async safe point" stuff and that we arrange for an M-stack-call to gopreempt_m instead of gosched_m. So, apart from the safety check stuff and a different trace call (not shown here) the involuntary preemption is almost exactly the same as voluntary preemption.
To find this, we had to dig rather deep into the (Go 1.14, in this case) implementation. One might not want to depend too much on this.
A little bit more on this to complete #torek's answer.
Goroutines are interruptible when there is a syscall, but also when a routine is waiting on a lock, a chan or sleeping.
As #torek's said, since 1.14 routines can also be preempted even when they do none of the above. The scheduler can mark any routine as preemptible after it ran for more than 10ms.
More reading there: https://medium.com/a-journey-with-go/go-goroutine-and-preemption-d6bc2aa2f4b7

How do two or more threads share memory on the heap that they have allocated?

As the title says, how do two or more threads share memory on the heap that they have allocated? I've been thinking about it and I can't figure out how they can do it. Here is my understanding of the process, presumably I am wrong somewhere.
Any thread can add or remove a given number of bytes on the heap by making a system call which returns a pointer to this data, presumably by writing to a register which the thread can then copy to the stack.
So two threads A and B can allocate as much memory as they want. But I don't see how thread A could know where the memory that thread B has allocated is located. Nor do I know how either thread could know where the other thread's stack is located. Multi-threaded programs share the heap and, I believe, can access one another's stack but I can't figure out how.
I tried searching for this question but only found language specific versions that abstract away the details.
Edit:
I am trying not to be language or OS specific but I am using Linux and am looking at it from a low level perspective, assembly I guess.
My interpretation of your question: How can thread A get to know a pointer to the memory B is using? How can they exchange data?
Answer: They usually start with a common pointer to a common memory area. That allows them to exchange other data including pointers to other data with each other.
Example:
Main thread allocates some shared memory and stores its location in p
Main thread starts two worker threads, passing the pointer p to them
The workers can now use p and work on the data pointed to by p
And in a real language (C#) it looks like this:
//start function ThreadProc and pass someData to it
new Thread(ThreadProc).Start(someData)
Threads usually do not access each others stack. Everything starts from one pointer passed to the thread procedure.
Creating a thread is an OS function. It works like this:
The application calls the OS using the standard ABI/API
The OS allocates stack memory and internal data structures
The OS "forges" the first stack frame: It sets the instruction pointer to ThreadProc and "pushes" someData onto the stack. I say "forge" because this first stack frame does not arise naturally but is created by the OS artificially.
The OS schedules the thread. ThreadProc does not know it has been setup on a fresh stack. All it knows is that someData is at the usual stack position where it would expect it.
And that is how someData arrives in ThreadProc. This is the way the first, initial data item is shared. Steps 1-3 are executed synchronously by the parent thread. 4 happens on the child thread.
A really short answer from a bird's view (1000 miles above):
Threads are execution paths of the same process, and the heap actually belongs to the process (and as a result shared by the threads). Each threads just needs its own stack to function as a separate unit of work.
Threads can share memory on a heap if they both use the same heap. By default most languages/frameworks have a single default heap that code can use to allocate memory from the heap. In unmanaged languages you generally make explicit calls to allocate heap memory. In C, that might be malloc, etc. for example. In managed languages heap allocation is usually automatic and how allocation is done depends on the language--usually through the use of the new operator. but, that depends slightly on context. If you provide the OS or language context you're asking about, I might be able to provide more detail.
A Thread shared with other threads belonging to the same process: its code section, data section and other operating system resources such as open files and signals.
The part you are missing is static memory containing static variables.
This memory is allocated when the program is started, and assigned known adresses (determined at the linking time). All threads can access this memory without exchanging any data runtime, because the addresses are effectively hardcoded.
A simple example might look like this:
// Global variable.
std::atomic<int> common_var;
void thread1() {
common_var = compute_some_value();
}
void thread2() {
do_something();
int current_value = common_var;
do_more();
}
And of course the global value may be a pointer, that can be used to exchange heap memory. The producer allocates some objects, the consumer takes and uses them.
// Global variable.
std::atomic<bool> produced;
SomeData* data_pointer;
void producer_thread() {
while (true) {
if (!produced) {
SomeData* new_data = new SomeData();
data_pointer = new_data;
// Let the other thread know there is something to read.
produced = true;
}
}
}
void consumer_thread() {
while (true) {
if (produced) {
SomeData* my_data = data_pointer;
data_pointer = nullptr;
// Let the other thread know we took the data.
produced = false;
do_something_with(my_data);
delete my_data;
}
}
}
Please note: these are not examples of good concurrent code, but they show the general idea without too much clutter.

Is there some kind of incompatibility with Boost::thread() and Nvidia CUDA?

I'm developing a generic streaming CUDA kernel execution Framework that allows parallel data copy & execution on the GPU.
Currently I'm calling the cuda kernels within a C++ static function wrapper, so I can call the kernels from a .cpp file (not .cu), like this:
//kernels.cu:
//kernel definition
__global__ void kernelCall_kernel( dataRow* in, dataRow* out, void* additionalData){
//Do something
};
//kernel handler, so I can compile this .cu and link it with the main project and call it within a .cpp file
extern "C" void kernelCall( dataRow* in, dataRow* out, void* additionalData){
int blocksize = 256;
dim3 dimBlock(blocksize);
dim3 dimGrid(ceil(tableSize/(float)blocksize));
kernelCall_kernel<<<dimGrid,dimBlock>>>(in, out, additionalData);
}
If I call the handler as a normal function, the data printed is right.
//streamProcessing.cpp
//allocations and definitions of data omitted
//copy data to GPU
cudaMemcpy(data_d,data_h,tableSize,cudaMemcpyHostToDevice);
//call:
kernelCall(data_d,result_d,null);
//copy data back
cudaMemcpy(result_h,result_d,resultSize,cudaMemcpyDeviceToHost);
//show result:
printTable(result_h,resultSize);// this just iterate and shows the data
But to allow parallel copy and execution of data on the GPU I need to create a thread, so when I call it making a new boost::thread:
//allocations, definitions of data,copy data to GPU omitted
//call:
boost::thread* kernelThreadOwner = new boost::thread(kernelCall, data_d,result_d,null);
kernelThreadOwner->join();
//Copy data back and print ommited
I just get garbage when printing the result on the end.
Currently I'm just using one thread, for testing purpose, so there should be no much difference in calling it directly or creating a thread. I have no clue why calling the function directly gives the right result, and when creating a thread not. Is this a problem with CUDA & boost? Am I missing something? Thank you in advise.
The problem is that (pre CUDA 4.0) CUDA contexts are tied to the thread in which they were created. When you are using two threads, you have two contexts. The context that the main thread is allocating and reading from, and the context that the thread which runs the kernel inside are not the same. Memory allocations are not portable between contexts. They are effectively separate memory spaces inside the same GPU.
If you want to use threads in this way, you either need to refactor things so that one thread only "talks" to the GPU, and communicates with the parent via CPU memory, or use the CUDA context migration API, which allows a context to be moved from one thread to another (via cuCtxPushCurrent and cuCtxPopCurrent). Be aware that context migration isn't free, and there is latency involved, so if you plan to migrating contexts around frequently, you might find it more efficient to change to a different design which preserves context-thread affinity.

ucontext across threads

Are contexts (the objects manipulated by functions in ucontext.h) allowed to be shared across threads? That is, can I swapcontext with the second argument being a context created in makecontext on another thread? A test program seems to show this working on Linux. I can't find documentation one way or the other on this, whereas Windows fibers appear to explicitly support such a use case. Is this safe and OK to do in general? Is it standard POSIX behavior that this should work?
Actually, there was an NGPT - threading library for linux, which uses not a current 1:1 threading model (each user thread is the kernel thread or LWP), but a M:N threading model (several user threads corresponds to another, smaller number of kernel threads).
According to ftp://ftp.uni-duisburg.de/Linux/NGPT/ngpt-0.9.4.tar.gz/ngpt-0.9.4/pth_sched.c:170 pth_scheduler it was possible of moving user thread contexts between native (kernel) threads:
/*
* See if the thread is unbound...
* Break out and schedule if so...
*/
if (current->boundnative == 0)
break;
/*
* See if the thread is bound to a different native thread...
* Break out and schedule if not...
*/
if (current->boundnative == this_sched->lastrannative)
break;
To save and restore user threads, the ucontext can be used ftp://ftp.uni-duisburg.de/Linux/NGPT/ngpt-0.9.4.tar.gz/ngpt-0.9.4/pth_mctx.c:64 and seems this was a preferred method (mcsc):
/*
* save the current machine context
*/
#if PTH_MCTX_MTH(mcsc)
#define pth_mctx_save(mctx) \
( (mctx)->error = errno, \
getcontext(&(mctx)->uc) )
#elif
....
/*
* restore the current machine context
* (at the location of the old context)
*/
#if PTH_MCTX_MTH(mcsc)
#define pth_mctx_restore(mctx) \
( errno = (mctx)->error, \
(void)setcontext(&(mctx)->uc) )
#elif PTH_MCTX_MTH(sjlj)
...
#if PTH_MCTX_MTH(mcsc)
/*
* VARIANT 1: THE STANDARDIZED SVR4/SUSv2 APPROACH
*
* This is the preferred variant, because it uses the standardized
* SVR4/SUSv2 makecontext(2) and friends which is a facility intended
* for user-space context switching. The thread creation therefore is
* straight-foreward.
*/
So, even if NGPT is dead and unused, it selected *context() for switching user threads even between kernel threads. I assume, that using *context() family is safe enough on Linux.
There can be some problems when mixing ucontexts and other native threads library. I will consider a NPTL, which is standard linux native threading library since glibc 2.4. The main problem is THREAD_SELF - pointer to struct pthread of the current thread. TLS (Thread-local storage) also works via THREAD_SELF. The THREAD_SELF is usually stored on register (r2 on powerpc, %gs on x86, etc). get/setcontext might save and restore this register breaking internals of native pthread library (e.g. thread-local storage, thread identification, etc).
The glibc setcontext will not save/restore %gs register to be compatible with pthreads:
/* Restore the FS segment register. We don't touch the GS register
since it is used for threads. */
movl oFS(%eax), %ecx
movw %cx, %fs
You should check, does setcontext saves THREAD_SELF register on the architecture you are interested in. Also, your code can be not portable between OSes and libcs.
From the man page
In a System V-like environment, one
has the type ucontext_t defined in
and the four functions
getcontext(2), setcontext(2),
makecontext() and swapcontext() that
allow user-level context switching
between multiple threads of control
within a process.
Sounds like that's what it's for.
EDIT: although this discussion seems to indicate that you shouldn't be mixing them.

What is the best way for interprocessor communication in Linux?

I have two CPUs on the chip and they have a shared memory. This is not a SMP architecture. Just two CPUs on the chip with shared memory.
There is a Unix-like operating system on the first CPU and there is a Linux operating system on the second CPU.
The first CPU does some job and the result of this job is some data. After first CPU finishes its job it should say to another CPU that job is finished and the second CPU have to process this data.
What is the way to handle interprocessor communication? What algorithm should I use to do that?
Any reference to an article about it would be greatly appreciated.
It all depends on the hardware. If all you have is shared memory, and no other way of communication, then you have to use a polling of some sort.
Are both of your processor running linux ? How do they handle the shared memory ?
A good solution is to use a linked list as a fifo. On this fifo you put data descriptor, like adress and size.
For example, you can have an input and output fifo, and go like this :
Processor A does some calculation
Processor A push the data descriptoron the output fifo
Processor A wait for data descriptor on the input fifo
loop
Processor B wait for data descriptor on the output fifo
Processor B works with data
Processor B push used data descriptor on the input fifo
loop
Of course, the hard part is in the locking. May be you should reformulate your question to emphasize this is not 'standard' SMP.
If you have no atomic test and set bit operation available on the memory, I guess you have to go with a scheme where some zone of memory is write only for one processor, and read only for the other.
Edit : See Hasturkun answer, for a way of passing messages from one processor to the other, using ordered write instead of atomicity to provide serialized access to some predefined data.
Ok. I understand the question.I have worked on this kind of an issue.
Now first thing that you need to understand is the working of the shared memory that exists between the 2 CPUs. Because these shared memory can be accessed in different ways, u need to figure out which one suits u the best.
Most times hardware semaphores will be provided in the shared memory along with the hardware interrupt to notify the message transfer from one processor to the other processor.
So have a look at this first.
A really good method is to just send IP packets back and forth (using sockets). this has the advantage that you can test stuff off-chip - as in, run a test version of one process on a PC, if you have networking.
If both processors are managed by a single os, then you can use any of the standard IPC to communicate with each other as OS takes care of everything. If they are running on different OSes then sockets would be your best bet.
EDIT
Quick unidirectional version:
Ready flag
Done flag
init:
init()
{
ready = 0;
done = 1;
}
writer:
send()
{
while (!done)
sleep();
/* copy data in */
done = 0;
ready = 1;
}
reader:
poll()
{
while (1)
{
if (ready)
{
recv();
}
sleep();
}
}
recv()
{
/* copy data out */
ready = 0;
done = 1;
}
Build a message passing system via the shared mem (which should be coherent, either by being uncached for both processors, or by use of cache flush/invalidate calls).
Your shared memory structure should have (at least) the following fields:
Current owner
Message active (as in, should be read)
Request usage fields
Flow will probably be like this: (assumed send/recv synchronized not to run at same time)
poll()
{
/* you're better off using interrupts instead, if you have them */
while(1)
{
if (current_owner == me)
{
if (active)
{
recv();
}
else if (!request[me] && request[other])
{
request[other] = 0;
current_owner = other;
}
}
sleep();
}
}
recv()
{
/* copy data... */
active = 0;
/* check if we still want it */
if (!request[me] && request[other])
{
request[other] = 0;
current_owner = other;
}
}
send()
{
request[me] = 1;
while (current_owner != me || active)
{
sleep();
}
request[me] = 0;
/* copy data in... */
/* pass to other side */
active = 1;
current_owner = other;
}
How about using the shared mem?
I don't have a good link right now, but if you google for IPC + shared mem I bet you find some good info :)
Are you sure you need to do this? In my experience you're better off letting your compiler & operating system manage how your process uses multiple CPUs.

Resources