Multithreading in Lua

I was having a discussion with my friend the other day. I was saying that, in pure Lua, you couldn't build a preemptive multitasking system. He claims you can, for the following reason:
Both C and Lua have no built-in threading libraries [OP's note: well, Lua technically does, but AFAIK it's not useful for our purposes]. Windows, which is written mostly in C(++), has pre-emptive multitasking, which they built from scratch. Therefore, you should be able to do the same in Lua.
The big problem I see with that is that the main way preemptive multitasking works (to my knowledge) is by generating regular interrupts, which the manager uses to get control and determine what code it should be working on next. I also don't think Lua has any facility that can do that.
My question is: is it possible to write a pure-Lua library that allows people to have pre-emptive multitasking?

I can't see how to do it, although without a formal semantics of Lua (like the semantics of yield for example), it's really hard to come up with an ironclad argument why it can't be done. (I've been wanting a formal semantics for ages, but evidently Roberto and lhf have better things to do.)
If I wanted pre-emptive multitasking for Lua, I wouldn't even try to do it in pure Lua. Instead I'd use an old trick I first saw 20 years ago in Standard ML of New Jersey:
Interrupt sets a flag in the lua_State saying "current coroutine has been preempted".
Alter the VM so that on every loop and every function call, it checks the flag and yields if necessary.
This patch would be easy to write and easy to maintain. It doesn't solve the problem of the long-running C function that can't be pre-empted, but if you have to solve that problem, you are wandering into much harder territory, and you may as well do all your threading at the C level, not the Lua level.
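For illustration, here is a minimal C sketch of a closely related trick that needs no VM patch: a timer signal asynchronously installs a count hook (lua_sethook is commonly considered safe to call asynchronously, e.g. from a signal handler), and the hook yields the preempted coroutine back to a round-robin scheduler. This assumes Lua 5.4 (since 5.3, a count hook may yield with no values) and POSIX signals; it is a sketch of the idea, not the VM patch described above, and it inherits the same limitation for long-running C calls.

#include <signal.h>
#include <unistd.h>
#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>

static lua_State *running;            /* coroutine currently on the CPU */

static void preempt_hook(lua_State *L, lua_Debug *ar) {
    (void)ar;
    lua_sethook(L, NULL, 0, 0);       /* one-shot: remove the hook */
    lua_yield(L, 0);                  /* bounce back into the scheduler */
}

static void on_alarm(int sig) {
    (void)sig;
    if (running)                      /* mark the current coroutine */
        lua_sethook(running, preempt_hook, LUA_MASKCOUNT, 1);
    alarm(1);                         /* re-arm the timer */
}

int main(void) {
    lua_State *L = luaL_newstate();
    luaL_openlibs(L);

    lua_State *co[2];                 /* two hypothetical busy loops */
    for (int i = 0; i < 2; i++) {
        co[i] = lua_newthread(L);
        luaL_loadstring(co[i], "local n = 0; while true do n = n + 1 end");
    }

    signal(SIGALRM, on_alarm);        /* simplified; real code: sigaction */
    alarm(1);

    for (int next = 0; ; next = (next + 1) % 2) {   /* round-robin */
        int nres;
        running = co[next];
        if (lua_resume(co[next], L, 0, &nres) != LUA_YIELD)
            break;                    /* a coroutine finished or errored */
    }
    lua_close(L);
    return 0;
}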

No. It's not possible to write a preemptive scheduler in pure Lua. At some point a preemptive scheduler needs some mechanism like an interrupt service routine to take control away from the current thread and give it to the scheduler which can then give it to another thread. Pure Lua doesn't have this mechanism.
You mention that Windows is written mostly in C/C++. The keyword is mostly. You can't write a preemptive scheduler in pure ANSI C/C++. Usually, part of the interrupt service routine is written in assembly language. Or, the C/C++ compiler implements a non-standard extension that allows interrupt service routines to be written in C/C++. Some compilers allow you to declare a function with an __interrupt modifier that causes the compiler to generate a prolog/epilog allowing the function to be used as an interrupt service routine.
Also, the code that sets up the interrupt service routine fiddles with CPU registers, memory-mapped I/O, or I/O instructions. None of this code is portable ANSI C/C++, and it depends on the CPU architecture.

Not that I know of, no. It would almost be absurdly simple if you could yield from hooks set on coroutines with debug.sethook, but that doesn't work. You can yield from C hooks set from C (lua_sethook), but I couldn't figure out exactly how to do that, and it's not pure Lua anyway.
Even if it were possible, it wouldn't be true threading. Everything would still run within the same operating system thread, for example. Your hook would take a variety of factors into account (such as time, perhaps memory, etc.) and then determine whether to yield. The yielded-to coroutine would then decide which child coroutine to run next. You'd also need to decide when the hook should be called. The most frequent option would be on every Lua instruction, but that carries a performance penalty. And if the coroutine calls into a C function, Lua has no jurisdiction; if that C call takes a long time, there's nothing you can do about it.
Here's a related thread from the Lua-L mailing list which you might find interesting.

Related

Threads in Tcl don't really work like C threads

In the Tclsh thread package, a created thread does not share variables and namespaces with the main thread, which is quite different from the C implementation of threads. Why this contradiction in the Tcl thread design? Or am I missing something in the code? Do all scripting languages have a similar threading design?
Below is a quote from the Tcl thread documentation PDF:
thread::create ... All other extensions must be loaded explicitly into each thread that needs to use them.
It's not a contradiction. It's just a different model. It has its advantages and its disadvantages. The key disadvantage you already know: scripts and variables are not shared (unless you take special steps). The key advantage is that the Tcl implementation has no big global locks, and that makes it much easier to use multi-core hardware effectively and means that there are very few gotchas when doing so. Contrast this with the Python Global Interpreter Lock, which is necessary because Python uses the C-like global shared state model.
At the low level, Tcl's threading is strongly isolated with plenty of thread-shared variables behind the scenes so that locks can be avoided (including in the memory management a lot of time, which would otherwise be a key bottleneck). Inter-thread communications are based on top of Tcl's built-in event queueing system; when two threads communicate, one sends a message and (optionally) waits for the other to respond, with the receiver getting the message placed on its internal queue of events until it is in a state that is ready to handle it. This does slow down inter-thread communications, but is much faster when they're not communicating.
It is actually similar to one way you'd use threads in C: message passing. Of course, you can use threads in other ways as well in C. But message passing is one way to completely avoid deadlocks since the semaphores/mutexes can be completely managed around the message queues and you don't need them anywhere else in your code.
This is in fact what Tcl implements at the C level. And it is in fact why it was done this way: to avoid the need for semaphores (to prevent the user from deadlocking himself).
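As a concrete illustration of that design, here is a minimal C sketch (assuming POSIX threads; all names are mine): the only lock lives inside the queue, and threads interact solely by dropping messages into each other's inboxes.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct msg { struct msg *next; int payload; } msg;

typedef struct {
    pthread_mutex_t lock;             /* the only lock in the design */
    pthread_cond_t  ready;
    msg *head, *tail;
} queue;

static void queue_send(queue *q, int payload) {
    msg *m = malloc(sizeof *m);
    m->payload = payload;
    m->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
}

static int queue_receive(queue *q) {
    pthread_mutex_lock(&q->lock);
    while (!q->head)                  /* block until a message arrives */
        pthread_cond_wait(&q->ready, &q->lock);
    msg *m = q->head;
    q->head = m->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    int payload = m->payload;
    free(m);
    return payload;
}

static queue inbox = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, NULL };

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int n = queue_receive(&inbox);
        if (n < 0) return NULL;       /* sentinel: shut down */
        printf("worker got %d\n", n); /* everything else is private state */
    }
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    for (int i = 0; i < 3; i++)
        queue_send(&inbox, i);
    queue_send(&inbox, -1);
    pthread_join(t, NULL);
    return 0;
}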
Most other scripting languages simply provide a thin wrapper around pthreads, so you can deadlock yourself if you're not careful. I remember that way back in the early 2000s the general advice for threaded programming in C and most other languages was to implement a message-passing architecture to avoid deadlocks.
Since Tcl generally takes the view that the API exposed at the script level should be high level, the thread support was implemented with a message-passing architecture built in. Of course, there is also the convenient fact that this avoids having to make the Tcl interpreter thread-safe (and thus introduce mutexes all over the interpreter source code).
Making interpreters thread-safe is non-trivial. Some languages suffer mysterious crashes to this day when running threaded applications; some took over a decade to iron out all their threading bugs. Tcl just decided not to try. The Tcl interpreter is small enough and spins up quickly enough that the solution was simply to run one interpreter per thread.

When should a thread generally yield?

In most languages/frameworks, there exists a way for a thread to yield control to other threads. However, I can't really think of a time when yielding from a thread was the correct solution to a given problem. When, in general, should one use Thread.yield(), sleep(0), etc?
One use case could be testing concurrent programs: try to find interleavings that reveal flaws in your synchronization patterns. For instance, in Java:
A useful trick for increasing the number of interleavings, and therefore more effectively exploring the state space of your programs, is to use Thread.yield to encourage more context switches during operations that access shared state. (The effectiveness of this technique is platform-specific, since the JVM is free to treat Thread.yield as a no-op [JLS 17.9]; using a short but nonzero sleep would be slower but more reliable.) — JCIP
Also interesting from the Java point of view is that their semantics are not defined:
The semantics of Thread.yield (and Thread.sleep(0)) are undefined [JLS 17.9]; the JVM is free to implement them as no-ops or treat them as scheduling hints. In particular, they are not required to have the semantics of sleep(0) on Unix systems (put the current thread at the end of the run queue for that priority, yielding to other threads of the same priority), though some JVMs implement yield in this way. — JCIP
This makes them, of course, rather unreliable. That is very Java-specific; however, in general I believe the following is true:
Both are low-level mechanisms that can be used to influence the scheduling order. If this is used to achieve a certain functionality, then that functionality depends on the probabilistic behavior of the OS scheduler, which seems a rather bad idea. It should be managed by higher-level synchronization constructs instead.
For testing purposes, or for forcing the program into a certain state, they seem a handy tool.
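Transposed to C for consistency with the other examples on this page (a sketch; sched_yield stands in for Thread.yield, and the names are mine), the testing trick from the JCIP quote looks like this: yields injected at shared-state access points widen race windows so broken interleavings surface sooner.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#ifdef INTERLEAVE_TEST
#  define TEST_YIELD() sched_yield()  /* test builds: invite a context switch */
#else
#  define TEST_YIELD() ((void)0)      /* production builds: no-op */
#endif

static int shared = 0;                /* intentionally unsynchronized */

static void *racy_increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        int tmp = shared;             /* read ...                */
        TEST_YIELD();                 /* ... widen the race here */
        shared = tmp + 1;             /* ... write back          */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, racy_increment, NULL);
    pthread_create(&b, NULL, racy_increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("shared = %d (200000 only if no update was lost)\n", shared);
    return 0;
}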
When, in general, should one use Thread.yield(), sleep(0), etc?
It depends on the VM and thread model we are talking about. For me the answer is rarely, if ever.
Traditionally, some thread models were non-preemptive, and others are (or were) not mature, hence the need for Thread.yield().
I feel that Thread.yield() is like using register in C. We used to rely on it to improve the performance of our programs because in many cases the programmer was better at this than the compiler. But modern compilers are much smarter, and these days there are far fewer cases where the programmer can actually improve the performance of a program with register or Thread.yield().
Let your OS scheduler decide for you?
So never yield, and never sleep(0), until you hit a case where sleep(0) is absolutely necessary, and document it here.
Also, context switches are costly, so I don't think many people want more of them.
I know this is old, but you didn't get any good answers here.
In general yielding is a way to be polite to other threads/processes and give them a chance to run on the same CPU with minimal delay to the yielding thread.
Not all yielding is equal either. On Windows, SwitchToThread() only releases the CPU if another thread of equal or greater priority is scheduled to run on the same CPU, which means it may very well simply resume the calling thread, while Sleep(0) has looser scheduler semantics. On Linux, sched_yield() is similar to SwitchToThread(), while nanosleep() with a 0 timespec seemingly marks the thread as unready for whatever period the timer slack is set to (inferred from profiling and substantiated here). Behavior on macOS is seemingly similar to Linux, but with much less timer slack; I haven't looked into it that much though.
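For concreteness, here are the two flavors as minimal POSIX helpers (a sketch; the "short"/"long" naming anticipates the terminology used at the end of this answer):

#include <sched.h>
#include <time.h>

/* Short yield: cedes the CPU only if a suitable runnable thread is
 * queued; may return immediately (cf. SwitchToThread/sched_yield). */
static void short_yield(void) {
    sched_yield();
}

/* Long yield: a zero-length sleep; the kernel may round it up to the
 * timer slack, so the thread can stay off-CPU noticeably longer. */
static void long_yield(void) {
    struct timespec ts = { 0, 0 };
    nanosleep(&ts, NULL);
}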
Yielding was way more useful in the days when uniprocessor systems were abundant, because it really helped keep the system moving. For example, on Windows, where by default Sleep(1) is actually, predictably, at least a 15.6 ms delay (note that this is nearly an entire frame at 60 fps if you're making a game or media player or something), it's still pretty valid, although MessageWaitForMultipleObjectsEx should be preferred in general UI applications. Windows 10 added a new type of high-resolution waitable timer with microsecond granularity that should probably be preferred over other methods, so hopefully that kind of yielding won't be so necessary anymore either.
In the context of N:1 and N:M cooperative threading models (not common at the OS level anymore, but still employed often enough at the application level through libraries providing fibers and coroutines), yielding is still definitely useful to keep things moving.
Unfortunately it's also abused pretty often, for example yielding in a busy loop rather than waiting on a synchronization primitive because the appropriate primitive isn't obvious or because the developer is overly optimistic about how long their threads will wait for / overly pessimistic about the scheduler. But in practice on most modern multitasking OSes unless the system is extremely busy, threads waiting on a synchronization primitive will get run almost instantly when the primitive is triggered/released/whatever.
You should try to avoid yielding, especially as an alternative to using a proper synchronization method. When you do need to yield, a zero sleep or waiting on a high-resolution time source is probably better than a normal yield; I call the former a "long yield" as opposed to a "short yield". But unless you're using the system interface, the implementation of sleep in your programming language/framework of choice might "optimize" sleep(0) into a short yield or even a no-op for you, sadly.

Lua :: How to write a simple program that will load multiple CPUs?

I haven't been able to write a program in Lua that will load more than one CPU. Since Lua supports the concept via coroutines, I believe it's achievable.
The reason I'm failing must be one of:
It's not possible in Lua
I'm not able to write it ☺ (and I hope that's the case)
Can someone more experienced (I discovered Lua two weeks ago) point me in the right direction?
The point is to write a number-crunching script that puts high load on ALL cores...
For demonstrative purposes, to show the power of Lua.
Thanks...
Lua coroutines are not the same thing as threads in the operating system sense.
OS threads are preemptive. That means that they will run at arbitrary times, stealing timeslices as dictated by the OS. They will run on different processors if they are available. And they can run at the same time where possible.
Lua coroutines do not do this. Coroutines may have the type "thread", but there can only ever be a single coroutine active at once. A coroutine will run until the coroutine itself decides to stop running by issuing a coroutine.yield command. And once it yields, it will not run again until another routine issues a coroutine.resume command to that particular coroutine.
Lua coroutines provide cooperative multithreading, which is why they are called coroutines. They cooperate with each other. Only one thing runs at a time, and you only switch tasks when the tasks explicitly say to do so.
You might think that you could just create OS threads, create some coroutines in Lua, and then just resume each one in a different OS thread. This would work so long as each OS thread was executing code in a different Lua instance. The Lua API is reentrant; you are allowed to call into it from different OS threads, but only if you are calling from different Lua instances. If you try to multithread through the same Lua instance, Lua will likely do unpleasant things.
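For illustration, here is a minimal C sketch of the safe pattern just described, with one completely independent Lua instance per OS thread (assuming Lua 5.2+ and POSIX threads; the busy-work chunk is made up):

#include <pthread.h>
#include <stdio.h>
#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>

static void *run_instance(void *arg) {
    (void)arg;
    lua_State *L = luaL_newstate();   /* a private instance: nothing shared */
    luaL_openlibs(L);
    if (luaL_dostring(L, "local s = 0; for i = 1, 1e8 do s = s + i end") != LUA_OK)
        fprintf(stderr, "%s\n", lua_tostring(L, -1));
    lua_close(L);
    return NULL;
}

int main(void) {
    pthread_t t[2];                   /* one kernel thread per instance */
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, run_instance, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}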
All of the Lua threading modules that exist create alternate Lua instances for each thread. Lua-lltreads just makes an entirely new Lua instance for each thread; there is no API for thread-to-thread communication outside of copying parameters passed to the new thread. LuaLanes does provide some cross-connecting code.
It is not possible with the core Lua libraries (if you don't count creating multiple processes and communicating via input/output), but I think there are Lua bindings for different threading libraries out there.
The answer from jpjacobs to one of the related questions links to LuaLanes, which seems to be a multi-threading library. (I have no experience, though.)
If you embed Lua in an application, you will usually want to have the multithreading somehow linked to your application's multithreading.
In addition to LuaLanes, take a look at llthreads
In addition to the already suggested LuaLanes, llthreads, and the other stuff mentioned here, there is a simpler way.
If you're on a POSIX system, try doing it the old-fashioned way with posix.fork() (from luaposix). You know: split the task into batches, fork the same number of processes as there are cores, crunch the numbers, collate the results.
Also, make sure you're using LuaJIT 2 to get maximum speed.
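Here is that pattern as a minimal C sketch, for consistency with the other examples on this page (luaposix exposes the equivalent fork/wait calls to Lua; the batch count and the stand-in crunch function are made up):

#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

#define NCORES 4                      /* assumption: use sysconf(_SC_NPROCESSORS_ONLN) */

static int crunch(int batch) {        /* stand-in for real number crunching */
    long s = 0;
    for (long i = 1; i <= 50000000L; i++)
        s += i % (batch + 2);
    return (int)(s & 0x7f);           /* squeeze a result into an exit code */
}

int main(void) {
    for (int i = 0; i < NCORES; i++)
        if (fork() == 0)              /* child: crunch one batch, then exit */
            _exit(crunch(i));
    for (int i = 0; i < NCORES; i++) {/* parent: collate the results */
        int status;
        wait(&status);
        if (WIFEXITED(status))
            printf("batch result: %d\n", WEXITSTATUS(status));
    }
    return 0;
}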
It's very easy: just create multiple Lua interpreters and run Lua programs inside all of them.
Lua multithreading is a shared-nothing model. If you need to exchange data, you must serialize it into strings and pass them from one interpreter to the other with either a C extension, sockets, or some other kind of IPC.
Serializing data via IPC-like transport mechanisms is not the only way to share data across threads.
If you're programming in an object-oriented language like C++, it's quite possible for multiple threads to access shared objects via object pointers; it's just not safe to do so unless you provide some kind of guarantee that no two threads will attempt to simultaneously read and write the same data.
There are many options for how you might do that; lock-free and wait-free mechanisms are becoming increasingly popular.
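As one tiny illustration of the lock-free end of that spectrum, a sketch assuming C11 atomics and POSIX threads:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_long hits = 0;          /* shared, but updated atomically */

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add(&hits, 1);   /* no mutex needed */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("%ld\n", atomic_load(&hits));  /* always 2000000 */
    return 0;
}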

Why might threads be considered "evil"?

I was reading the SQLite FAQ, and came upon this passage:
Threads are evil. Avoid them.
I don't quite understand the statement "Threads are evil". If that is true, then what is the alternative?
My superficial understanding of threads is:
Threads make concurrency happen. Otherwise, CPU horsepower is wasted waiting for (e.g.) slow I/O.
But the bad thing is that you must synchronize your logic to avoid contention and you have to protect shared resources.
Note: As I am not familiar with threads on Windows, I hope the discussion will be limited to Linux/Unix threads.
When people say that "threads are evil", they usually do so in the context of saying "processes are good". Threads implicitly share all application state and handles (and thread-locals are opt-in). This means that there are plenty of opportunities to forget to synchronize (or not even understand that you need to synchronize!) while accessing that shared data.
Processes have separate memory space, and any communication between them is explicit. Furthermore, primitives used for interprocess communication are often such that you don't need to synchronize at all (e.g. pipes). And you can still share state directly if you need to, using shared memory, but that is also explicit in every given instance. So there are fewer opportunities to make mistakes, and the intent of the code is more explicit.
Simple answer the way I understand it...
Most threading models use "shared state concurrency," which means that two execution processes can share the same memory at the same time. If one thread doesn't know what the other is doing, it can modify the data in a way that the other thread doesn't expect. This causes bugs.
Threads are "evil" because you need to wrap your mind around n threads all working on the same memory at the same time, and all of the fun things that go with it (deadlocks, racing conditions, etc).
You might read up on the Clojure (immutable data structures) and Erlang (message passing) concurrency models for alternative ideas on how to achieve similar ends.
What makes threads "evil" is that once you introduce more than one stream of execution into your program, you can no longer count on your program to behave in a deterministic manner.
That is to say: Given the same set of inputs, a single-threaded program will (in most cases) always do the same thing.
A multi-threaded program, given the same set of inputs, may well do something different every time it is run, unless it is very carefully controlled. That is because the order in which the different threads run different bits of code is determined by the OS's thread scheduler combined with a system timer, and this introduces a good deal of "randomness" into what the program does when it runs.
The upshot is: debugging a multi-threaded program can be much harder than debugging a single-threaded program, because if you don't know what you are doing it can be very easy to end up with a race condition or deadlock bug that only appears (seemingly) at random once or twice a month. The program will look fine to your QA department (since they don't have a month to run it) but once it's out in the field, you'll be hearing from customers that the program crashed, and nobody can reproduce the crash.... bleah.
To sum up, threads aren't really "evil", but they are strong juju and should not be used unless (a) you really need them and (b) you know what you are getting yourself into. If you do use them, use them as sparingly as possible, and try to make their behavior as stupid-simple as you possibly can. Especially with multithreading, if anything can go wrong, it (sooner or later) will.
I would interpret it another way. It's not that threads are evil, it's that side-effects are evil in a multithreaded context (which is a lot less catchy to say).
A side effect in this context is something that affects state shared by more than one thread, be it global or just shared. I recently wrote a review of Spring Batch and one of the code snippets used is:
private static Map<Long, JobExecution> executionsById = TransactionAwareProxyFactory.createTransactionalMap();
private static long currentId = 0;

public void saveJobExecution(JobExecution jobExecution) {
    Assert.isTrue(jobExecution.getId() == null);
    Long newId = currentId++;
    jobExecution.setId(newId);
    jobExecution.incrementVersion();
    executionsById.put(newId, copy(jobExecution));
}
Now there are at least three serious threading issues in less than 10 lines of code here. An example of a side effect in this context would be updating the currentId static variable.
Functional programming languages (Haskell, Scheme, OCaml, Lisp, and others) tend to espouse "pure" functions. A pure function is one with no side effects. Many imperative languages (e.g., Java, C#) also encourage the use of immutable objects (an immutable object is one whose state cannot change once created).
The reason for (or at least the effect of) both of these things is largely the same: they make multithreaded code much easier. A pure function by definition is threadsafe. An immutable object by definition is threadsafe.
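In C terms, a toy contrast just to ground those definitions:

static int total = 0;

/* Impure: mutates shared state; needs a lock once threads appear. */
void add_to_total(int x) { total += x; }

/* Pure: the result depends only on the inputs; trivially thread-safe. */
int add(int a, int b) { return a + b; }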
The advantage processes have is that there is less shared state (generally). In traditional UNIX C programming, doing a fork() to create a new process would result in shared process state, and this was used as a means of IPC (inter-process communication), but generally that state is replaced (with exec()) with something else.
But threads are much cheaper to create and destroy, and they take fewer system resources. (In fact, the operating system itself may have no concept of threads, yet you can still create multithreaded programs; such threads are called green threads.)
The paper you linked to seems to explain itself very well. Did you read it?
Keep in mind that a thread can refer to the programming-language construct (as in most procedural or OOP languages, where you create a thread manually and tell it to execute a function), or to the hardware construct (each CPU core executes one thread at a time).
The hardware-level thread is obviously unavoidable; it's just how the CPU works. But the CPU doesn't care how the concurrency is expressed in your source code. It doesn't have to be by a "beginthread" function call, for example. The OS and the CPU just have to be told which streams of instructions should be executed.
His point is that if we used better languages than C or Java with a programming model designed for concurrency, we could get concurrency basically for free. If we'd used a message-passing language, or a functional one with no side-effects, the compiler would be able to parallelize our code for us. And it would work.
Threads aren't any more "evil" than hammers or screwdrivers or any other tools; they just require skill to utilize. The solution isn't to avoid them; it's to educate yourself and up your skill set.
Creating a lot of threads without constraint is indeed evil; using a pooling mechanism (thread pool) will mitigate this problem.
Another way threads are 'evil' is that most framework code is not designed to deal with multiple threads, so you have to manage your own locking mechanism for those data structures.
Threads are good, but you have to think about how and when you use them, and remember to measure whether there really is a performance benefit.
A thread is a bit like a lightweight process. Think of it as an independent path of execution within an application. The thread runs in the same memory space as the application and therefore has access to all the same resources, global objects, and global variables.
The good thing about them: you can parallelize a program to improve performance. Some examples: 1) in an image-editing program, a thread may run the filter processing independently of the GUI; 2) some algorithms lend themselves to multiple threads.
What's bad about them? If a program is poorly designed, they can lead to deadlock issues where both threads are waiting on each other to access the same resource. And secondly, program design can be more complex because of this. Also, some class libraries don't support threading; e.g., the C library function strtok is not thread-safe. In other words, if two threads were to use it at the same time they would clobber each other's results. Fortunately, there are often thread-safe alternatives, e.g., the Boost library.
Threads are not evil; they can be very useful indeed.
Under Linux/Unix, threading wasn't well supported in the past, although I believe Linux now has POSIX thread support, and other Unices support threading via libraries or natively (i.e., pthreads).
The most common alternative to threading under Linux/Unix platforms is fork. Fork is simply a copy of a program, including its open file handles and global variables. fork() returns 0 to the child process and the process id to the parent. It's an older way of doing things under Linux/Unix but still well used. Threads use less memory than forked processes and are quicker to start up. Also, inter-process communication is more work than simply using threads.
In a simple sense you can think of a thread as another instruction pointer in the current process. In other words, it points the IP of another processor to some code in the same executable. So instead of having one instruction pointer moving through the code, there are two or more IPs executing instructions from the same executable and address space simultaneously.
Remember the executable has its own address space with data, stack, etc. So now that two or more instructions are being executed simultaneously, you can imagine what happens when more than one of them wants to read/write to the same memory address at the same time.
The catch is that threads operate within the process address space and are not afforded the protection mechanisms from the processor that full-blown processes are. (Forking a process on UNIX is standard practice and simply creates another process.)
Out-of-control threads can consume CPU cycles, chew up RAM, cause exceptions, etc., and the only way to stop them is to tell the OS process scheduler to forcibly terminate the thread by nullifying its instruction pointer (i.e., stop executing). If you forcibly tell a CPU to stop executing a sequence of instructions, what happens to the resources that have been allocated or are being operated on by those instructions? Are they left in a stable state? Are they properly freed? etc.
So, yes, threads require more thought and responsibility than executing a process because of the shared resources.
For any application that requires stable and secure execution for long periods of time without failure or maintenance, threads are always a tempting mistake. They invariably turn out to be more trouble than they are worth. They produce rapid results and prototypes that seem to be performing correctly but after a couple weeks or months running you discover that they have critical flaws.
As mentioned by another poster, once you use even a single thread in your program you have now opened a non-deterministic path of code execution that can produce an almost infinite number of conflicts in timing, memory sharing and race conditions. Most expressions of confidence in solving these problems are expressed by people who have learned the principles of multithreaded programming but have yet to experience the difficulties in solving them.
Threads are evil. Good programmers avoid them wherever humanly possible. The alternative of forking was offered here and it is often a good strategy for many applications. The notion of breaking your code down into separate execution processes which run with some form of loose coupling often turns out to be an excellent strategy on platforms that support it. Threads running together in a single program is not a solution. It is usually the creation of a fatal architectural flaw in your design that can only be truly remedied by rewriting the entire program.
The recent drift towards event oriented concurrency is an excellent development innovation. These kinds of programs usually prove to have great endurance after they are deployed.
I've never met a young engineer who didn't think threads were great. I've never met an older engineer who didn't shun them like the plague.
Being an older engineer, I heartily agree with the answer by Texas Arcane.
Threads are very evil because they cause bugs that are extremely difficult to solve. I have literally spent months solving sporadic race conditions. One example caused trams to suddenly stop, about once a month, in the middle of the road and block traffic until towed away. Luckily I didn't create the bug, but I did get to spend 4 months full-time solving it...
It's a tad late to add to this thread, but I would like to mention a very interesting alternative to threads: asynchronous programming with co-routines and event loops. This is being supported by more and more languages, and does not have the problem of race conditions like multi-threading has.
It can replace multi-threading in cases where it is used to wait on events from multiple sources, but not where calculations need to be performed in parallel on multiple CPU cores.

Is it possible to create threads without system calls in Linux x86 GAS assembly?

While learning "assembler language" (on Linux on an x86 architecture, using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary, as your program runs in user space.
However, system calls are rather expensive in terms of performance, as they require an interrupt (and of course a system call), which means that a context switch must be made from your currently active program in user space to the system running in kernel space.
The point I want to make is this: I'm currently implementing a compiler (for a university project) and one of the extra features I wanted to add is the support for multi-threaded code in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be automatically generated by the compiler itself, this will almost guarantee that there will be really tiny bits of multi-threaded code in it as well. In order to gain a performance win, I must be sure that using threads will make this happen.
My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...
My question is therefore twofold (with an extra bonus question underneath it):
Is it possible to write assembler code which can run multiple threads simultaneously on multiple cores at once, without the need of system calls?
Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), performance loss, or isn't it worth the effort at all?
My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better, some real code) for implementing threads as efficiently as possible?
The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.
User threads won't give you the threading support that you want, because they all run on the same hardware thread.
Edit: Incorporating Ira Baxter's answer about PARLANSE. If you ensure that your program has a thread running on each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.
Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".
The short answer is that you can do multithreaded programming without calling expensive OS task-management primitives. Simply ignore the OS for thread-scheduling operations. This means you have to write your own thread scheduler, and simply never pass control back to the OS. (And you have to be cleverer somehow about your thread overhead than the pretty smart OS guys.) We chose this approach precisely because Windows process/thread/fiber calls were all too expensive to support computation grains of a few hundred instructions.
Our PARLANSE programming language is a parallel programming language: see http://www.semdesigns.com/Products/Parlanse/index.html
PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism construct, and schedules such grains by a combination of a highly tuned hand-written scheduler and scheduling code generated by the PARLANSE compiler that takes into account the context of a grain to minimize scheduling overhead. For instance, the compiler ensures that the registers of a grain contain no information at the point where scheduling (e.g., "wait") might be required, and thus the scheduler code only has to save the PC and SP. In fact, quite often the scheduler code doesn't get control at all; a forked grain simply stores the forking PC and SP, switches to a compiler-preallocated stack, and jumps to the grain code. Completion of the grain will restart the forker.
Normally there's an interlock to synchronize grains, implemented by the compiler using native LOCK DEC instructions that implement what amounts to counting semaphores. Applications can fork logically millions of grains; the scheduler limits parent grains from generating more work if the work queues are long enough that more work won't be helpful. The scheduler implements work stealing to allow work-starved CPUs to grab ready grains from neighboring CPUs' work queues. This has been implemented to handle up to 32 CPUs, but we're a bit worried that the x86 vendors may actually swamp us with more than that in the next few years!
PARLANSE is a mature language; we've been using it since 1997, and have implemented a several-million-line parallel application in it.
Implement user-mode threading.
Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-mode threads. Modern usage is 1:1, but it wasn't always like that and it doesn't have to be like that.
You are free to maintain, in a single kernel thread, an arbitrary number of user-mode threads. It's just that it's your responsibility to switch between them sufficiently often that it all looks concurrent. Your threads are of course cooperative rather than pre-emptive; you basically scatter yield() calls throughout your own code to ensure regular switching occurs.
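A minimal sketch of that idea in C, assuming glibc's <ucontext.h> (obsolescent in POSIX but widely available); a real implementation would track thread completion rather than stopping when the first thread returns:

#include <ucontext.h>
#include <stdio.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, ctx[2];
static char stacks[2][STACK_SIZE];
static int current = 0;

/* The explicit switch point you scatter through your code. */
static void yield(void) {
    int prev = current;
    current ^= 1;                          /* round-robin: 0 <-> 1 */
    swapcontext(&ctx[prev], &ctx[current]);
}

static void worker(int id) {
    for (int i = 0; i < 3; i++) {
        printf("user thread %d, step %d\n", id, i);
        yield();
    }
}   /* returning follows uc_link back to main */

int main(void) {
    for (int i = 0; i < 2; i++) {
        getcontext(&ctx[i]);
        ctx[i].uc_stack.ss_sp = stacks[i];
        ctx[i].uc_stack.ss_size = STACK_SIZE;
        ctx[i].uc_link = &main_ctx;        /* where a finished thread goes */
        makecontext(&ctx[i], (void (*)(void))worker, 1, i);
    }
    swapcontext(&main_ctx, &ctx[0]);       /* run until thread 0 returns */
    return 0;                              /* sketch: no completion tracking */
}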
If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O-bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, but the one CPU that your real thread is running on will still be running at 100% either way.
System calls have gotten faster. Modern CPUs have support for the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.
Make sure that the automatically-generated multithreading has the threads run for long enough that you gain performance. Don't try to parallelize short pieces of code, you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict) -- if multiple threads are accessing independent data sets, they will run much faster than if they were accessing the same data repeatedly due to the cache coherency problem.
Quite a bit late now, but I was interested in this kind of topic myself.
In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.
Obligatory BLUF:
Q1: No. At least initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.
Q2: It depends. If you create/destroy threads that perform tiny operations, then you're wasting resources (the thread-creation process would greatly exceed the time used by the thread before it exits). If you create N threads (where N is ~the number of cores/hyper-threads on the system) and re-task them, then the answer COULD be yes, depending on your implementation.
Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP-chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP-chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack) where that stack is continuously prepended (or appended in the case where it rolls over to the beginning). In such a (weird!) model the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but is intriguing nonetheless.
Onto my 2-cents worth of content.
I recently created what effectively operates as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap that I subdivide into each thread's 'private' area. Thus only a single syscall would be required (although guard pages between the areas would be smart, and these would require additional syscalls).
This implementation uses only the base kernel thread created when the process spawns and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and such are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.
To implement this system (entirely in userspace, and also via non-root access if desired), the following were required:
A notion of what threads boil down to:
A stack for stack operations (kinda self-explanatory and obvious)
A set of instructions to execute (also obvious)
A small block of memory to hold individual register contents
What a scheduler boils down to:
A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).
A thread context switcher:
A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.
So, it is indeed possible (entirely in assembly, and without system calls other than the initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.
I only added this answer because you specifically mention x86 assembly and this answer was entirely derived via a self-contained program written entirely in x86 assembly that achieves the goals (minus multi-core capabilities) of minimizing system calls and also minimizes system-side thread overhead.
System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be an overhead when you create or destroy the threads. Once they are running, there are no system calls. User mode threads will not really help you, since they only run on one core.
First you should learn how to use threads in C (pthreads, POSIX threads). On GNU/Linux you will probably want to use POSIX threads or GLib threads.
Then you can simply call the C from assembly code.
Here are some pointers:
POSIX threads: link text
A tutorial where you will learn how to call C functions from assembly: link text
Butenhof's book on POSIX threads link text
