Multithreading in Embedded Systems

Multithreading in Embedded Systems - multithreading

I am confused about the following:
I am hoping to get a job in the field of embedded systems. However, every interview I've had seems to end up with a conversation about threads in C and how to do thread-safe programming
My question is how do I go about learning multithreading in embedded systems? Are they the same as POSIX threads? For example, the tasks in FreeRTOS. Are they same thing as pthreads?
Can someone give me some tips on what to do and where to start?

Every OS has it's own threads/task/processes characteristics.
Despite the differences, the methods to synchronize, guard and interchange data between those, are roughly the same.
If someone knows that you don't know a specific OS, invited you to an interview - he/she probably expects you to answer in general and not to be OS specific.
You can solve any problem with POSIX (or any other) tool-set in mind and to mention that migration of the solution to a non-POSIX environment will keep same logic with some minor adaptations.

Multithreading concept is almost same everywhere, whether in RTOS or Linux.
The difference is in the operational behavior.
My question is how do I go about learning multithreading in embedded
systems?
My suggestion is to first learn and understand the concepts of multithreading by referring some online material, you can practice by writing some simple codes on your desktop running any flavor of Linux.
The go for some advanced topics like synchronization mechanism using Semaphore and Mutexes, you will then get to learn about the basic concept of when to use a semaphore and when to use a mutex for thread synchronization.
Then move to some Embedded Targets and try out some code using uCOS-II/uCOS-III or FreeRTOS.
Are they the same as POSIX threads?
No, they are not exactly same, POSIX thread library is a bit advance and is highly portable on different OS. For e.g. a multithread code written on Linux using pthread can also be compiled and executed on Windows with little or no change.
On the other hand, a thread implementation on RTOS is different, threads in RTOS are treated as tasks and they start executing only when a call to start the scheduler is made.

From my own experience trying to find learning resources, I found the the FreeRTOS docs very useful. They have both a reference manual as well as the Mastering the FreeRTOS Kernal doc which includes code snippets and covers topics such as task management, software timers, resource management, and general thread safe programming techniques. I dont think this would be the best place to start out, but once you've familiarized yourself with basics the other answers and comments have mentioned, this could help with the next step of learning by doing.

Related

How Do I Choose Between the Various Ways to do Threading in Delphi?

It seems that I've finally got to implement some sort of threading into my Delphi 2009 program. If there were only one way to do it, I'd be off and running. But I see several possibilities.
Can anyone explain what's the difference between these and why I'd choose one over another.
The TThread class in Delphi
AsyncCalls by Andreas Hausladen
OmniThreadLibrary by Primoz Gabrijelcic (gabr)
... any others?
Edit:
I have just read an excellent article by Gabr in the March 2010 (No 10) issue of Blaise Pascal Magazine titled "Four Ways to Create a Thread". You do have to subscribe to gain content to the magazine, so by copyright, I can't reproduce anything substantial about it here.
In summary, Gabr describes the difference between using TThreads, direct Windows API calls, Andy's AsyncCalls, and his own OmniThreadLibrary. He does conclude at the end that:
"I'm not saying that you have to choose anything else than the classical Delphi way (TThread) but it is still good to be informed of options you have"
Mghie's answer is very thorough and suggests OmniThreadLibrary may be preferable. But I'm still interested in everyone's opinions about how I (or anyone) should choose their threading method for their application.
And you can add to the list:
. 4. Direct calls to the Windows API
. 5. Misha Charrett's CSI Distributed Application Framework as suggested by LachlanG in his answer.
Conclusion:
I'm probably going to go with OmniThreadLibrary. I like Gabr's work. I used his profiler GPProfile many years ago, and I'm currently using his GPStringHash which is actually part of OTL.
My only concern might be upgrading it to work with 64-bit or Unix/Mac processing once Embarcadero adds that functionality into Delphi.

If you are not experienced with multi-threading you should probably not start with TThread, as it is but a thin layer over native threading. I consider it also to be a little rough around the edges; it has not evolved a lot since the introduction with Delphi 2, mostly changes to allow for Linux compatibility in the Kylix time frame, and to correct the more obvious defects (like fixing the broken MREW class, and finally deprecating Suspend() and Resume() in the latest Delphi version).
Using a simple thread wrapper class basically also causes the developer to focus on a level that is much too low. To make proper use of multiple CPU cores a focus on tasks instead of threads is better, because the partitioning of work with threads does not adapt well to changing requirements and environments - depending on the hardware and the other software running in parallel the optimum number of threads may vary greatly, even at different times on the same system. A library that you pass only chunks of work to, and which schedules them automatically to make best use of the available resources helps a lot in this regard.
AsyncCalls is a good first step to introduce threads into an application. If you have several areas in your program where a number of time-consuming steps need to be performed that are independent of each other, then you can simply execute them asynchronously by passing each of them to AsyncCalls. Even when you have only one such time-consuming action you can execute it asynchronously and simply show a progress UI in the VCL thread, optionally allowing for cancelling the action.
AsyncCalls is IMO not so good for background workers that stay around during the whole program runtime, and it may be impossible to use when some of the objects in your program have thread affinity (like database connections or OLE objects that may have a requirement that all calls happen in the same thread).
What you also need to be aware of is that these asynchronous actions are not of the "fire-and-forget" kind. Every overloaded AsyncCall() function returns an IAsyncCall interface pointer that you may need to keep a reference to if you want to avoid blocking. If you don't keep a reference, then the moment the ref count reaches zero the interface will be freed, which will cause the thread releasing the interface to wait for the asynchronous call to complete. This is something that you might see while debugging, when exiting the method that created the IAsyncCall may take a mysterious amount of time.
OTL is in my opinion the most versatile of your three options, and I would use it without a second thought. It can do everything TThread and AsyncCalls can do, plus much more. It has a sound design, which is high-level enough both to make life for the user easy, and to let a port to a Unixy system (while keeping most of the interface intact) look at least possible, if not easy. In the last months it has also started to acquire some high-level constructs for parallel work, highly recommended.
OTL has a few dozen samples too, which is important to get started. AsyncCalls has nothing but a few lines in comments, but then it is easy enough to understand due to its limited functionality (it does only one thing, but it does it well). TThread has only one sample, which hasn't really changed in 14 years and is mostly an example of how not to do things.
Whichever of the options you choose, no library will eliminate the need to understand threading basics. Having read a good book on these is a prerequisite to any successful coding. Proper locking for example is a requirement with all of them.

There is another lesser known Delphi threading library, Misha Charrett's CSI Application Framework.
It's based around message passing rather than shared memory. The same message passing mechanism is used to communicate between threads running in the same process or in other processes so it's both a threading library and a distributed inter-process communication library.
There's a bit of a learning curve to get started but once you get going you don't have to worry about all the traditional threading issues such as deadlocks and synchronisation, the framework takes care of most of that for you.
Misha's been developing this for years and is still actively improving the framework and documentation all the time. He's always very responsive to support questions.

TThread is a simple class that encapsulates a Windows thread. You make a descendant class with an Execute method that contains the code this thread should execute, create the thread and set it to run and the code executes.
AsyncCalls and OmniThreadLibrary are both libraries that build a higher-level concept on top of threads. They're about tasks, discrete pieces of work that you need to have execute asynchronously. You start the library, it sets up a task pool, a group of special threads whose job is to wait around until you have work for them, and then you pass the library a function pointer (or method pointer or anonymous method) containing the code that needs to be executed, and it executes it in one of the task pool threads and handles a lot of the the low-level details for you.
I haven't used either library all that much, so I can't really give you a comparison between the two. Try them out and see what they can do, and which one feels better to you.

(sorry, I don't have enough points to comment so I'm putting this in as an answer rather than another vote for OTL)
I've used TThread, CSI and OmniThread (OTL). The two libraries both have non-trivial learning curves but are much more capable than TThread. My conclusion is that if you're going to do anything significant with threading you'll end up writing half of the library functionality anyway, so you might as well start with the working, debugged version someone else wrote. Both Misha and Gabr are better programmers than most of us, so odds are they've done a better job than we will.
I've looked at AsyncCalls but it didn't do enough of what I wanted. One thing it does have is a "Synchronize" function (missing from OTL) so if you're dependent on that you might go with AynscCalls purely for that. IMO using message passing is not hard enough to justify the nastiness of Synchronize, so buckle down and learn how to use messages.
Of the three I prefer OTL, largely because of the collection of examples but also because it's more self-contained. That's less of an issue if you're already using the JCL or you work in only one place, but I do a mix including contract work and selling clients on installing Misha's system is harder than the OTL, just because the OTL is ~20 files in one directory. That sounds silly, but it's important for many people.
With OTL the combination of searching the examples and source code for keywords, and asking questions in the forums works for me. I'm familiar with the traditional "offload CPU-intensive tasks" threading jobs, but right now I'm working on backgrounding a heap of database work which has much more "threads block waiting for DB" and less "CPU maxed out", and the OTL is working quite well for that. The main differences are that I can have 30+ threads running without the CPU maxing out, but stopping one is generally impossible.

I know this isn't the most advanced method :-) and maybe it has limitations too, but I just tried System.BeginThread and found it quite simple - probably because of the quality of the documentation I was referring to... http://www.delphibasics.co.uk/RTL.asp?Name=BeginThread (IMO Neil Moffatt could teach MSDN a thing or two)
That's the biggest factor I find in trying to learn new things, the quality of the documentation, not it's quantity. A couple of hours was all it took, then I was back to the real work rather than worrying about how to get the thread to do it's business.
EDIT actually Rob Kennedy does a great job explaining BeginThread here BeginThread Structure - Delphi
EDIT actually the way Rob Kennedy explains TThread in the same post, I think I'll change my code to use TThread tommorrow. Who knows what it will look like next week! (AsyncCalls maybe)

threads and high level languages

can someone tell me why if i use threads it's better to use an low level languages like c++
and not c# and JAVA? someone asked me that in an interview and i did'nt know the answer

It's news to me. Higher level languages provide easy to use abstractions over thread management, for example.
I expect the interviewer's point would make sense in context. It's dependent on the problem in hand - the level of timing control you need if you're writing a computer game or software for an engine management system may be greater than if you are writing a conference room booking system.
You trade off the low-level control and the associated learning curve and risk you get with lower-level languages for ease of use, safety and productivity of higher-level languages.

I don't think this is necessarily true. In Java (I can't comment on C#) a thread maps directly to a native thread. From here:
The Java HotSpot™ virtual machine
currently associates each Java thread
with a unique native thread. The
relationship between the Java thread
and the native thread is stable and
persists for the lifetime of the Java
thread.
plus you have the additional high level constructs such as the Executor framework.
Going forwards, functional languages (such as F# and Scala) encourage immutability, which contribute to a safer threaded environment.
There may well be scenarios where a low-level language offers more control (as for most requirements), but I suspect those will be fairly specialised situations. You have to balance that against the safety/productivity that the higher-level languages offer.
EDIT : From your comments supplementing the question, this may relate to running a garbage collector and consequent garbage-collection pauses and the impact on providing real-time performance and predictability. Threading in C/C++ may well offer some benefits in this area since a garbage collection cycle is not going to kick off during some critical time-dependent code. For this reason (amongst others) Java can't be considered as a real-time platform.

like most answers : it depends. languages with built in threading facilities like C# and Java
will do some or most of the work needed for thread usage and synchronization for you.
with C++ you have do it yourself but you can employ better optimization techniques for your specific OS and platform

Will you use threads or not - depends solely on application, not on language. And language is a function of design.
C++ provides more control, c# provides more abstraction, Java provides simplification, but in the end they all work the same way.

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines. etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.

Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawerence Livermore National Laboratory.

Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.

If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.

For a 6 month project I'd say it definitely pays out to start reading a good book about the subject first. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I've written multithreaded programs for 10 years when I discovered this gem and still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.

There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you)
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.

One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.

There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.

The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Application".

Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.

Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first, because threading adds a lot of complexity you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).

I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
2: I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now. At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For and/or PLINQ where possible. If you do this, a lot of the heavy lifting will be done for you in a very efficient way. 1: http://msdn.microsoft.com/en-us/library/dd321424%28VS.100%29.aspx

Sorry i just want to add a pessimistic or better realistic answer here.
You are under time pressure. 6 month deadline and you don't even know for sure what language is this system and what it does and how it is organized. If it is not a trivial calculation then it is a very bad start.
Most importantly: You say you have never done mulitithreading programming before. This is where i get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn it when you want to do it right - and you need to do it right when you want to win a huge speed increase. Debugging is extremely nasty even with good tools like Total Views debugger or Intels VTune.
Then you say you want to rewrite the app in another lanugage - well this isn't as bad as you have to rewrite it anyway. THe chance to turn a single threaded Program into a well working multithreaded one without total redesign is almost zero.
But learning multithreading and a new language (what is your C++ skills?) with a timeline of 3 month (you have to write a throw away prototype - so i cut the timespan into two halfs) is extremely challenging.
My advise here is simple and will not like it: Learn multithreadings now - because it is a required skill set in the future - but leave this job to someone who already has experience. Well unless you don't care about the program being successfull and are just looking for 6 month payment.

If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.

You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously

Whatever technology your going to write this, take a look a this must read book on concurrency "Concurrent programming in Java" and for .Net I highly recommend the retlang library for concurrent app.

I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multiple threaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...

Which scripting languages support multi-core programming?

I have written a little python application and here you can see how Task Manager looks during a typical run.
(source: weinzierl.name)
While the application is perfectly multithreaded, unsurprisingly it uses only one CPU core.
Regardless of the fact that most modern scripting languages support multithreading, scripts can run on one CPU core only.
Ruby, Python, Lua, PHP all can only run on a single core.
Even Erlang, which is said to be especially good for concurrent programming, is affected.
Is there a scripting language that has built in
support for threads that are not confined to a single core?
WRAP UP
Answers were not quite what I expected, but the TCL answer comes close.
I'd like to add perl, which (much like TCL) has interpreter-based threads.
Jython, IronPython and Groovy fall under the umbrella of combining a proven language with the proven virtual machine of another language. Thanks for your hints in this
direction.
I chose Aiden Bell's answer as Accepted Answer.
He does not suggest a particular language but his remark was most insightful to me.

You seem use a definition of "scripting language" that may raise a few eyebrows, and I don't know what that implies about your other requirements.
Anyway, have you considered TCL? It will do what you want, I believe.
Since you are including fairly general purpose languages in your list, I don't know how heavy an implementation is acceptable to you. I'd be surprised if one of the zillion Scheme implementations doesn't to native threads, but off the top of my head, I can only remember the MzScheme used to but I seem to remember support was dropped. Certainly some of the Common LISP implementations do this well. If Embeddable Common Lisp (ECL) does, it might work for you. I don't use it though so I'm not sure what the state of it's threading support is, and this may of course depend on platform.
Update Also, if I recall correctly, GHC Haskell doesn't do quite what you are asking, but may do effectively what you want since, again, as I recall, it will spin of a native thread per core or so and then run its threads across those....

You can freely multi-thread with the Python language in implementations such as Jython (on the JVM, as #Reginaldo mention Groovy is) and IronPython (on .NET). For the classical CPython implementation of the Python language, as #Dan's comment mentions, multiprocessing (rather than threading) is the way to freely use as many cores as you have available

Thread syntax may be static, but implementation across operating systems and virtual machines may change
Your scripting language may use true threading on one OS and fake-threads on another.
If you have performance requirements, it might be worth looking to ensure that the scripted threads fall through to the most beneficial layer in the OS. Userspace threads will be faster, but for largely blocking thread activity kernel threads will be better.

As Groovy is based on the Java virtual machine, you get support for true threads.

F# on .NET 4 has excellent support for parallel programming and extremely good performance as well as support for .fsx files that are specifically designed for scripting. I do all my scripting using F#.

An answer for this question has already been accepted, but just to add that besides tcl, the only other interpreted scripting language that I know of that supports multithreading and thread-safe programming is Qore.
Qore was designed from the bottom up to support multithreading; every aspect of the language is thread-safe; the language was designed to support SMP scalability and multithreading natively. For example, you can use the background operator to start a new thread or the ThreadPool class to manage a pool of threads. Qore will also throw exceptions with common thread errors so that threading errors (like potential deadlocks or errors with threading APIs like trying to grab a lock that's already held by the current thread) are immediately visible to the programmer.
Qore additionally supports and thread resources; for example, a DatasourcePool allocation is treated as a thread-local resource; if you forget to commit or roll back a transaction before you end your thread, the thread resource handling for the DatasourcePool class will roll back the transaction automatically and throw an exception with user-friendly information about the problem and how it was solved.
Maybe it could be useful for you - an overview of Qore's features is here: Why use Qore?.

CSScript in combination with Parallel Extensions shouldn't be a bad option. You write your code in pure C# and then run it as a script.

It is not related to the threading mechanism. The problem is that (for example in python) you have to get interpreter instance to run the script. To acquire the interpreter you have to lock it as it is going to keep the reference count and etc and need to avoid concurrent access to this objects. Python uses pthread and they are real threads but when you are working with python objects just one thread is running an others waiting. They call this GIL (Global Interpreter Lock) and it is the main problem that makes real parallelism impossible inside a process.
https://wiki.python.org/moin/GlobalInterpreterLock
The other scripting languages may have kind of the same problem.

Guile supports POSIX threads which I believe are hardware threads.

What high level languages support multithreading? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
Improve this question
I'm wondering which languages support (or don't support) native multithreading, and perhaps get some details about the implementation. Hopefully we can produce a complete overview of this specific functionality.

Erlang has built-in support for concurrent programming.
Strictly speaking, Erlang processe are greenlets. But the language and virtual machine are designed from the ground up to support concurrency. The language has specific control structures for asynchronous inter-process messaging.
In Python, greenlet is a third-party package that provides lightweight threads and channel-based messaging. But it does not bear the comparison with Erlang.

I suppose that the list of languages that are higher-level than Haskell is pretty short, and it has pretty good support for concurrency and parallelism.

With CPython, one has to remember about the GIL. To summarize: only one processor is used, even on multiprocessor machines. There are multiple ways around this, as the comment shows.

Older versions of C and C++ (namely, C89, C99, C++98, and C++03) have no support at all in the core language, although libraries such as POSIX threads are available for pretty much every platform in common user today.
The newest versions of C and C++, C11 and C++11, do have built-in threading support in the language, but it's an optional feature of C11, so implementations such as single-core embedded systems can choose not to support it while supporting the rest of C11 if they desire.

Delphi/FreePascal also has support for threads.
I'll assume, from other answers, that it's only native on the Windows platforms.
Some nice libraries that implement better features on top of the TThread Object:
OmniThreadLibrary
BMThread

Clojure is an up and coming Lisp-dialect for the JVM that is specifically designed to handle concurrency well.
It features a functional style API, some very efficient implementations of various immutable data structures, and agent system (bit like actors in Scala and processes in Erlang). It even has software transactional memory.
All in all, Clojure goes to great lenght to help you write correct multithreaded and concurrent code.

I believe that the official squeak VM does not support native (OS) threads, but that the Gemstone version does.
(Feel free to edit this if not correct).

You need to define "native" in this context.
Java claims some sort of built-in multithreading, but is just based on coarse grained locking and some library support. At this moment, it is not more 'native' than C with the POSIX threads. The next version of C++ (0x) will include a threading library as well.

I know Java and C# support multithreading and that the next version of C++ will support it directly... (The planned implementation is available as part of the boost.org libraries...)

Boost::thread is great, I'm not sure whether you can say its part of the language though. It depends if you consider the CRT/STL/Boost to be 'part' of C++, or an optional add-on library.
(otherwise practically no language has native threading as they're all a feature of the OS).

This question doesn't make sense: whether a particular implementation chooses to implement threads as native threads or green threads has nothing to do with the language, that is an internal implementation detail.
There are Java implementations that use native threads and Java implementations that use green threads.
There are Ruby implementations that use native threads and Ruby implementations that use green threads.
There are Python implementations that use native threads and Python implementations that use green threads.
There are even POSIX Thread implementations that use green threads, e.g. the old LinuxThreads library or the GNU pth library.
And just because an implementation uses native threads doesn't mean that these threads can actually run in parallel; many implementations use a Global Interpreter Lock to ensure only one thread can run at a time. On the other hand, using green threads doesn't mean that they can't run in parallel: the BEAM Erlang VM for example can schedule its green threads (more precisely green processes) across mulitple CPU cores, the same is planned for the Rubinius Ruby VM.

Perl doesn't usefully support native threads.
Yes, there is a Perl threads module, and yes it uses native platform threads in its implementation. The problem is, it isn't very useful in the general case.
When you create a new thread using Perl threads, it copies the entire state of the Perl interpreter. This is very slow and uses lots of RAM. In fact it's probably slower than using fork() on Unix, as the latter uses copy-on-write and Perl threads do not.
But in general each language has its own threading model, some are different from others. Python (mostly) uses native platform threads but has a big lock which ensures that only one runs (Python code) at once. This actually has some advantages.
Aren't threads out of fashion these days in favour of processes? (Think Google Chrome, IE8)

I made a multithreading extension for Lua recently, called Lua Lanes. It merges multithreading concepts so naturally to the language, that I would not see 'built in' multithreading being any better.
For the record, Lua's built in co-operative multithreading (coroutines) can often also be used. With or without Lanes.
Lanes has no GIL, and runs code in separate Lua universes per thread. Thus, unless your C libraries would crash, it is immune to the problems associated with thread usage. In fact, the concept is more like processes and message passing, although only one OS process is used.

Finally Go is here with multi-threading with its own pkg Goroutine.
People say it is on the structure of C-language.
It is also easy to use and understand .

Perl and Python do. Ruby is working on it, but the threads in Ruby 1.8 are not really threads.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string