Lua operations that work in a multithreaded environment

My application uses Lua in a multithreaded environment with a global mutex. It is implemented like this:
1. The thread locks the mutex,
2. calls lua_newthread,
3. performs some initialization on the coroutine,
4. runs lua_resume on the coroutine,
5. unlocks the mutex.
lua_lock/unlock is not implemented, and the GC is stopped while Lua works with the coroutine.
My question is: can I perform steps 2 and 3 without locking, if the initialization process does not require any global Lua structs? And can I perform the whole process without locking at all, if the coroutine does not need globals either?
In what cases can I generally use Lua functions without locking?

Lua does not guarantee thread safety if you're trying to use a single Lua state in separate OS threads without lua_lock/unlock. If you want a multithreaded environment, you need an individual state for each OS thread.
Look at some existing multithreading solutions, e.g. https://github.com/effil/effil.
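As a minimal sketch of that advice (assuming Lua 5.x with pthreads; the worker body and the scripts are made up for illustration), each OS thread owns a private lua_State, so no locking is needed at all:

    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>
    #include <pthread.h>

    /* Each OS thread creates, uses and destroys its own private state. */
    static void *worker(void *arg)
    {
        lua_State *L = luaL_newstate();   /* no sharing, hence no mutex */
        luaL_openlibs(L);
        luaL_dostring(L, (const char *)arg);
        lua_close(L);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, "print('thread A')");
        pthread_create(&b, NULL, worker, "print('thread B')");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }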

In what cases can I generally use Lua functions without locking?
On the same Lua state (or threads derived from the same source Lua state)?
None.
Lua is thread-safe in the sense that separate Lua state instances can be executed in parallel. There are absolutely no thread safety guarantees when you call any Lua API function from two different threads on the same Lua state instance.
You cannot do any of steps 2, 3, or 4 outside of some synchronization mechanism that prevents concurrent access to the same state. It doesn't matter whether it's just creating a new thread (which allocates memory) or some "initialization process" (which will likely allocate memory). Even operations that don't allocate memory are still not allowed.
Lua offers no guarantees about thread-safety within a Lua state.
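To make the rule concrete, here is a hedged sketch of the questioner's scheme (Lua 5.3 API, where lua_resume takes a `from` parameter; the global function name "task" is a made-up placeholder). Every step that touches the shared state, including lua_newthread itself, stays under the mutex:

    #include <lua.h>
    #include <lauxlib.h>
    #include <pthread.h>

    static lua_State *G;   /* shared main state, created elsewhere via luaL_newstate() */
    static pthread_mutex_t lua_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        pthread_mutex_lock(&lua_mutex);       /* step 1 */

        lua_State *co = lua_newthread(G);     /* step 2: allocates memory */
        lua_getglobal(co, "task");            /* step 3: initialization */
        lua_pushstring(co, (const char *)arg);
        int status = lua_resume(co, NULL, 1); /* step 4 */
        (void)status;                         /* error handling omitted */

        pthread_mutex_unlock(&lua_mutex);     /* step 5 */
        return NULL;
    }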

Related

Is a Lua table thread-safe?

Say I have a Lua table t = { a = {}, b = {} }.
My question is: I have two threads, thread A and thread B.
If I create two lua_States individually for these two threads through lua_newthread, and thread A only reads/writes t.a while thread B only reads/writes t.b:
Should I use lua_lock in each thread above?
If the answer is YES, then does every operation on t need a lua_lock?
TL;DR: A Lua engine-state is not thread-safe, so there's no reason to make the Lua table thread-safe.
A lua_State is not the engine state, though it references it. Instead, it is the state of a Lua thread, which bears no relation to an application thread. Lua threads sharing the same engine state cannot execute concurrently, as the Lua engine is inherently single-threaded (you may use an arbitrary number of engines at the same time, though); instead they are cooperatively multitasked.
So lua_State *lua_newthread (lua_State *L); creates a new Lua thread, not an OS thread.
lua_lock and the like do not refer to thread safety; they were the way native code could keep hold of Lua objects across calls into the engine in version 2 of the implementation: https://www.lua.org/manual/2.1/subsection3_5_6.html
The modern way is using the registry, a Lua table accessible from native code: http://www.lua.org/manual/5.3/manual.html#4.5
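A small sketch of that registry mechanism (hypothetical helper names; standard Lua 5.x auxiliary API): native code anchors a Lua value with luaL_ref() and retrieves it later by integer reference:

    #include <lua.h>
    #include <lauxlib.h>

    static int keep_value(lua_State *L)          /* value on top of the stack */
    {
        return luaL_ref(L, LUA_REGISTRYINDEX);   /* pops and anchors it */
    }

    static void use_value(lua_State *L, int ref)
    {
        lua_rawgeti(L, LUA_REGISTRYINDEX, ref);  /* push it back */
        /* ... use the value ... */
        lua_pop(L, 1);
    }

    static void drop_value(lua_State *L, int ref)
    {
        luaL_unref(L, LUA_REGISTRYINDEX, ref);   /* let the GC reclaim it */
    }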
A Lua table isn't thread-safe, but there's no need to lock, since the threads don't read/write the same element.
NO, a Lua table is not thread-safe.
And yes, every operation on table t will need lua_lock, because none of them is atomic.
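Since stock Lua does not define lua_lock for you, the practical version of this advice is an ordinary mutex around every access to the shared state. A hedged sketch (the helper and key names are made up): even though the two OS threads touch disjoint keys of t, both writes go through the same engine state and may allocate or rehash, so they must be serialized:

    #include <lua.h>
    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Called from either OS thread with its own coroutine 'co'. */
    static void write_field(lua_State *co, const char *key, int value)
    {
        pthread_mutex_lock(&m);
        lua_getglobal(co, "t");       /* push t */
        lua_pushinteger(co, value);
        lua_setfield(co, -2, key);    /* t[key] = value; may allocate */
        lua_pop(co, 1);               /* pop t */
        pthread_mutex_unlock(&m);
    }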

Does a pthread mutex only avoid simultaneous access to a resource, or does it do anything more?

Example:
A thread finishes writing to a shared variable and then unlocks the mutex, but continues to use that variable's value (without changing it).
And immediately, another thread successfully lock()s that mutex and reads the shared variable.
From my (mis-)understanding, several things could be happening in this situation:
On the WRITER thread:
A compiler optimization could make the write occur only at some later point.
The written value could be retained in the current CPU core's cache and only flushed to memory at some later point.
On the READER thread:
The value of the variable may have been read before the mutex lock(), and because of some compiler optimization or just the usual work of the CPU cache, still be considered "already read from memory" and thus not be fetched from memory again.
Thus, the value we have here is not the updated one from the other thread.
Do the pthread mutex lock()/unlock() functions execute any code to "flush" the current cache to memory, and do anything else needed to make sure the current thread is synchronized with everything else (I cannot think of anything other than the cache)? Or is that just not needed (at least on all known architectures)?
Because if all a mutex does is what its name says - mutual exclusion on whatever it protects - then, if I have thousands of threads dealing with the same data and, from my algorithm's point of view, I already know that when one thread is using a variable no other thread will try to use it at the same time, does that mean I don't need a mutex? Or will my code be missing some low-level, architecture-specific method(s) implemented inside the pthread library that avoid the problems above?
The pthreads mutex lock and unlock functions are among the list of functions in POSIX "...that synchronize thread execution and also synchronize memory with respect to other threads". So yes, they do more than just interlock execution.
Whether or not they need to issue additional instructions to the hardware is of course architecture dependent (noting that almost every modern CPU architecture will at least happily reorder reads with respect to each other unless told otherwise), but in every case those functions must act as "compiler barriers" - that is, they ensure that the compiler won't reorder, coalesce or omit memory accesses in situations where it would otherwise be allowed to.
It is allowed to have multiple threads reading a shared value without mutual exclusion though - all you need to ensure is that both the writing and reading threads executed some synchronising function between the write and the read. For example, an allowable situation is to have many reading threads that defer reading the shared state until they have passed a barrier (pthread_barrier_wait()) and a writing thread that performs all its writes to the shared state before it passes the barrier. Reader-writer locks (pthread_rwlock_*) are also built around this idea.
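A hedged sketch of that barrier pattern (names invented for illustration): the writer performs all its writes before pthread_barrier_wait(), the readers read only after passing it, and no mutex is needed because the barrier also synchronizes memory:

    #include <pthread.h>
    #include <stdio.h>

    #define NREADERS 4
    static pthread_barrier_t barrier;
    static int shared_value;              /* deliberately unprotected */

    static void *reader(void *arg)
    {
        pthread_barrier_wait(&barrier);   /* wait until the write is published */
        printf("reader %ld sees %d\n", (long)arg, shared_value);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NREADERS];
        pthread_barrier_init(&barrier, NULL, NREADERS + 1);
        for (long i = 0; i < NREADERS; i++)
            pthread_create(&tid[i], NULL, reader, (void *)i);

        shared_value = 42;                /* all writes happen before the barrier */
        pthread_barrier_wait(&barrier);

        for (int i = 0; i < NREADERS; i++)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }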

Threads in Tcl are not really working like C threads

In the Tcl thread package, a created thread does not share variables and namespaces with the main thread, which is quite different from the C implementation of threads. Why this contradiction in the Tcl thread design? Or am I missing something in the code? Do all scripting languages have a similar threading design?
Below is a quote from the Tcl thread documentation PDF, under thread::create:
... All other extensions must be loaded explicitly into each thread that needs to use them.
It's not a contradiction. It's just a different model. It has its advantages and its disadvantages. The key disadvantage you already know: scripts and variables are not shared (unless you take special steps). The key advantage is that the Tcl implementation has no big global locks, and that makes it much easier to use multi-core hardware effectively and means that there are very few gotchas when doing so. Contrast this with the Python Global Interpreter Lock, which is necessary because Python uses the C-like global shared state model.
At the low level, Tcl's threading is strongly isolated, with very little thread-shared state behind the scenes, so that locks can be avoided (including in the memory management much of the time, which would otherwise be a key bottleneck). Inter-thread communication is built on top of Tcl's event queueing system: when two threads communicate, one sends a message and (optionally) waits for the other to respond, and the message is placed on the receiver's internal event queue until it is in a state that is ready to handle it. This does slow down inter-thread communication, but it is much faster when the threads are not communicating.
It is actually similar to one way you'd use threads in C: message passing. Of course, you can use threads in other ways as well in C. But message passing is one way to completely avoid deadlocks since the semaphores/mutexes can be completely managed around the message queues and you don't need them anywhere else in your code.
This is in fact what Tcl implements at the C level. And it is in fact why it was done this way: to avoid the need for semaphores (to prevent the user from deadlocking himself).
Most other scripting languages simply provide a thin wrapper around pthreads, so you can deadlock yourself if you're not careful. I remember that way back in the early 2000s, the general advice for threaded programming in C and most other languages was to implement a message-passing architecture to avoid deadlocks.
Since Tcl generally takes the view that APIs exposed at the script level should be high-level, the thread implementation came with a message-passing architecture built in. Of course, there is also the convenient fact that this avoids having to make the Tcl interpreter thread-safe (and thus introducing mutexes all over the interpreter source code).
Making interpreters thread-safe is non-trivial. Some languages suffer mysterious crashes to this day when running threaded applications; some took over a decade to iron out all their threading bugs. Tcl simply decided not to try. The Tcl interpreter is small enough and spins up quickly enough that the solution was to run one interpreter per thread.
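For comparison, a minimal sketch of that message-passing style in C (the queue type and function names are invented): the only mutex and condition variable live inside the queue, so application code never takes a lock directly, which is how deadlocks are kept out of the rest of the program:

    #include <pthread.h>
    #include <stdlib.h>

    struct msg   { struct msg *next; int payload; };
    struct queue {
        struct msg *head, *tail;
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
    };

    void queue_init(struct queue *q)
    {
        q->head = q->tail = NULL;
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->nonempty, NULL);
    }

    void queue_send(struct queue *q, int payload)
    {
        struct msg *m = malloc(sizeof *m);
        m->next = NULL;
        m->payload = payload;
        pthread_mutex_lock(&q->lock);
        if (q->tail) q->tail->next = m; else q->head = m;
        q->tail = m;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    int queue_receive(struct queue *q)    /* blocks until a message arrives */
    {
        pthread_mutex_lock(&q->lock);
        while (!q->head)
            pthread_cond_wait(&q->nonempty, &q->lock);
        struct msg *m = q->head;
        q->head = m->next;
        if (!q->head) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        int payload = m->payload;
        free(m);
        return payload;
    }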

performance of pthread_cond_broadcast when no one is waiting on condition

If I call pthread_cond_broadcast and no one is waiting on the condition, will pthread_cond_broadcast invoke a context switch and/or a call into the kernel?
If not, can I rely on it being very fast (by fast I mean just running a small number of assembly instructions in the current process and then returning)?
There are no guarantees in POSIX, but since your question is tagged linux and nptl an answer in that context can be given.
If there are no waiters on the condition variable, then the nptl glibc code for pthread_cond_broadcast() just takes a low-level lock protecting the internals of the condition variable itself, tests a value then unlocks the low-level lock. The low-level lock itself uses a futex, which will only enter the kernel if there is contention on that lock.
This means that unless there is a lot of contention on the condition variable itself (ie. a large number of threads frequently calling pthread_cond_broadcast() / pthread_cond_signal() on the same condition variable) there will be no system call to the kernel, and the overhead will only be a few locked instructions.
The Open Group Base Specifications for pthreads state:
The pthread_cond_broadcast() and pthread_cond_signal() functions shall have no effect if there are no threads currently blocked on cond.
To get a measure of whether or not this takes "just running a small number of assembly instructions", you'll have to get out a run-time performance-analysis tool (e.g. IBM's Quantify) and run it against your code.
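A rough way to measure it yourself, as suggested above, is a hedged micro-benchmark sketch (Linux; the numbers will vary by machine and glibc version) that times broadcasts on a condition variable that never has waiters:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        struct timespec t0, t1;
        enum { N = 10 * 1000 * 1000 };

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            pthread_cond_broadcast(&cond);   /* no waiters: no syscall expected */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg %.1f ns per broadcast\n", ns / N);
        return 0;
    }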

thread synchronization vs process synchronization

Can we use the same synchronization mechanisms for both thread synchronization and process synchronization?
What synchronization mechanisms are available only within a process?
Semaphores are generally what is used for multi-process synchronization in terms of shared memory access, etc.
Critical sections, mutexes and condition variables are the more common tools for thread synchronization within a process.
Generally speaking, the methods used to synchronize threads are not used to synchronize processes, but the reverse is usually not true. In fact, it's fairly common to use semaphores for thread synchronization.
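A small illustration of that last point (a sketch, not tied to any particular application): the same POSIX semaphore API serves both cases, and the second argument to sem_init() selects thread scope or process scope:

    #include <semaphore.h>
    #include <pthread.h>
    #include <stdio.h>

    static sem_t sem;

    static void *worker(void *arg)
    {
        sem_wait(&sem);                /* blocks until main() posts */
        printf("worker woke up\n");
        return NULL;
    }

    int main(void)
    {
        sem_init(&sem, 0, 0);          /* pshared = 0: threads of one process */
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sem_post(&sem);                /* wake the worker */
        pthread_join(t, NULL);
        sem_destroy(&sem);
        return 0;
    }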
There are several synchronization entities. They have different purposes and scopes. Different languages and operating systems implement them differently. On Windows, for one, you can use monitors for synching threads within a process, or a mutex for synching processes. There are semaphores, events, barriers... It all depends on the case. .NET provides so-called "slim" versions that have improved performance but target only in-process synching.
One thing to remember, though: synching processes requires system resources, the allocation and manipulation (locking and releasing) of which take quite a while.
An application consists of one or more processes. A process, in the simplest terms, is an executing program. One or more threads run in the context of the process. A thread is the basic unit to which the operating system allocates processor time. A thread can execute any part of the process code, including parts currently being executed by another thread.
Ref.
As to specific synchronisation constructs, that will depend on the OS/environment/language.
One difference: Threads within a process have equal access to the memory of the process. Memory is typically private to a process, but can be explicitly shared.
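That explicit sharing is exactly what lets a thread-style primitive cross a process boundary. A hedged sketch (POSIX; error handling omitted): a pthread mutex placed in shared memory and marked PTHREAD_PROCESS_SHARED can synchronize two processes:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous shared mapping, inherited by the child after fork(). */
        pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);

        if (fork() == 0) {                   /* child process */
            pthread_mutex_lock(m);
            printf("child in critical section\n");
            pthread_mutex_unlock(m);
            _exit(0);
        }
        pthread_mutex_lock(m);               /* parent process */
        printf("parent in critical section\n");
        pthread_mutex_unlock(m);
        wait(NULL);
        return 0;
    }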
