GO: runtime: program exceeds 10000-thread limit

GO: runtime: program exceeds 10000-thread limit - multithreading

when i using go, i got a error 'runtime: program exceeds 10000-thread limit'.
That is the information about goroutine.
'SCHED 229357ms: gomaxprocs=16 idleprocs=0 threads=8797 idlethreads=8374 runqueue=2131 gcwaiting=0 nmidlelocked=141 nmspinning=0 stopwait=0 sysmonwait=0'
we can see that has 8374 idlethreads, no reason to create any more thread by os.
why the program exceeds 10000-thread limit?

No way to provide a definitive answer without a lot of program code. And I doubt that you can provide a simplified example that will hit the problem. 10,000 threads is a lot of threads.
What I guess is that because Go creates a new thread for each goroutine's blocking operations, you have a lot of goroutines doing blocking calls.
I am not sure, but I think each goroutine keeps its own blocking call thread and they are not pooled. So having more than 10,000 goroutines that each make blocking calls may be the issue.
All guesses though.
EDIT:
Found a way to increase the 10,000 thread limit:
https://golang.org/pkg/runtime/debug/#SetMaxThreads
debug.SetMaxThreads(20000)

Related

should I create threads before hand to save time?

I am using python 2.7 .I am using multi-threading.Now if a thread dies I again
create one to compensate for it.So should I create a lot of threads before hand and store them
and use from them when one or more existing threads die or should I create one when some thread dies??
Which is more efficient in terms of time ??

When you say a thread "dies", do you mean you intentionally terminate it or it fails due to error?
If you're intentionally terminating it and you're worried about the time required to spawn a new thread, why not keep the thread persistent and simply have it do the job that the new thread would have done? This is a pretty standard approach - maintain a pool of "worker" threads and have a work queue with pending items to execute. They all run an identical loop which is to pull an item off the queue and execute it. These items can be objects with methods which contain the code to execute if it's convenient to work that way - if the tasks are all very similar then it might be easier to put the code into the thread's own function instead.
If you're talking about threads failing due to error, I wouldn't have imagined this was common enough to worry about it. If it is, you probably need to look at making your code more robust.
In either case, spawning a thread on most systems should be a lightweight activity - a lot more lightweight than spawning a whole new process, for example. As a result, I really wouldn't worry about keeping a pool of threads in reserve to use - that really sounds like early optimisation to me.
Even if spawning threads were slow, consider what you would be doing by spawning threads in advance - you would be taking up more memory (some memory in the OS to keep track of a the thread, some in Python for the objects that it uses to track the thread), although not a great deal; you'd also be spending more time at the start of your program creating all these threads. So, you might save a little time while you were running, but instead your program takes significantly longer to start. That doesn't sound like a sensible trade-off to me unless the speed and latency of your code is absolutely critical while it's running, and if speed is that critical then I'm not sure a pure Python solution is the right approach anyway. Something like C/C++ is going to give you better control of scheduling, at the expense of much more complexity.
In summary: seriously, don't worry about it, just spawn threads as you need them. Trust me, there will be much bigger speed problems elsewhere in your code which are much more deserving of your time.

Can code running in a background thread be faster than in the main VCL thread in Delphi?

If anybody has had a lot of experience timing code running on the main VCL thread vs a background thread, I'd like to get an opinion. I have some code that does some heavy string processing running in my Delphi 6 application on the main thread. Each time I run an operation, the time for each operation hovers around 50 ms on a single thread on my i5 Quad core. What makes me really suspicious is that the same code running on an old Pentium 4 that I have, shows the same time for the operation when usually I see code running about 4 times slower on the Pentium 4 than the Quad Core. I am beginning to wonder if the code might be consuming significantly less time than 50 ms but that there's something about the main VCL thread, perhaps Windows message handling or executing Windows API calls, that is creating an artificial "floor" for the operation. Note, an operation is triggered by an incoming request on a socket if that matters, but the time measurement does not take place until the data is fully received.
Before I undertake the work of moving all the code on to a background thread for testing, I am wondering if anyone has any general knowledge in this area? What have your experiences been with code running on and off the main VCL thread? Note, the timing measurements are being done when there is absolutely no user triggered activity going on during the tests.
I'm also wondering if raising the priority of the thread to just below real-time would do any good. I've never seen much improvement in my run times when experimenting with those flags.
-- roschler

Given all threads have the same priority, as they normally do, there can't be a difference, for the following reasons. If you're seeing a difference, re-evaluate the code (make sure you run the same thing in both VCL and background threads) and make sure you time it properly:
The compiler generates the exact same code, it doesn't care if the code is going to run in the main thread or in a background thread. In fact you can put the whole code in a procedure and call that from both your worker thread's Execute() and from the main VCL thread.
For the CPU all cores, and all threads, are equal. Unless it's actually a Hyper Threading CPU, where not all cores are real, but then see the next bullet.
Even if not all CPU cores are equal, your thread will very unlikely run on the same core, the operating system is free to move it around at will (and does actually schedule your thread to run on different cores at different times).
Messaging overhead doesn't matter for the main VCL thread, because unless you're calling Application.ProcessMessages() manually, the message pump is simply stopped while your procedure does it's work. The message pump is passive, your thread needs to request messages from the queue, but since the thread is busy doing your work, it's not requesting any messages so no overhead there.
There's just one place where threads are not equal, and this can change the perceived speed of execution: It's the operating system that schedules threads to execution units (cores), and for the operating system threads have different priorities. You can tell the OS a certain thread needs to be treated differently using the SetThreadPriority() API (which is used by the TThread.Priority property).

Without simple source code to reproduce the issue, and how you are timing your threads, it will be difficult to understand what occurs in your software.
Sounds definitively like either:
An Architecture issue - how are your threads defined?
A measurement issue - how are you timing your threads?
A typical scaling issue of both the memory manager and the RTL string-related implementation.
About the latest point, consider this:
The current memory manager (FastMM4) is not scaling well on multi-core CPU; try with a per-thread memory manager, like our experimental SynScaleMM - note e.g. that the Free Pascal Compiler team has written a new scaling MM from scratch recently, to avoid such issue;
Try changing the string process implementation to avoid memory allocation (use static buffers), and string reference-counting (every string reference counting access produces a LOCK DEC/INC which do not scale so well on multi-code CPU - use per-thread char-level process, using e.g. PChar on static buffers instead of string).
I'm sure that without string operations, you'll find that all threads are equivalent.
In short: neither the current Delphi MM, neither the current string implementation scales well on multi-core CPU. You just found out a known issue of the current RTL. Read this SO question.

When your code has control of the VCL thread, for instance if it is in one method and doesn't call out to any VCL controls or call Application.ProcessMessages, then the run time will not be affected just because it's in the main VCL thread.
There is no overhead, since you "own" the whole processing power of the thread when you are in your own code.
I would suggest that you use a profiling tool to find where the actual bottleneck is.

Performance can't be assessed statically. For that you need to get AQTime, or some other performance profiler for Delphi. I use AQtime, and I love it, but I'm aware it's considered expensive.
Your code will not magically get faster just because you moved it to a background thread. If anything, your all-inclusive-time until you see results in your UI might get a little slower, if you have to send a lot of data from the background thread to the foreground thread via some synchronization mechanisms.
If however you could execute parts of your algorithm in parallel, that is, split your work so that you have 2 or more worker threads processing your data, and you have a quad core processor, then your total time to do a fixed load of work, could decrease. That doesn't mean the code would run any faster, but depending on a lot of factors, you might achieve a slight benefit from multithreading, up to the number of cores in your computer. It's never ever going to be a 2x performance boost, to use two threads instead of one, but you might get 20%-40% better performance, in your more-than-one-threaded parallel solutions, depending on how scalable your heap is under multithreaded loads, and how IO/memory/cache bound your workload is.
As for raising thread priorities, generally all you will do there is upset the delicate balance of your Windows system's performance. By raising the priorities you will achieve (sometimes) a nominal, but unrepeatable and non-guaranteeable increase in performance. Depending on the other things you do in your code, and your data sources, playing with priorities of threads can introduce subtle problems. See Dining Philosophers problem for more.
Your best bet for optimizing the speed of string operations is to first test it and find out exactly where it is using most of its time. Is it heap operations? Memory Copy and move operations? Without a profiler, even with advice from other people, you will still be comitting a cardinal sin of programming; premature optimization. Be results oriented. Be science based. Measure. Understand. Then decide.
Having said that, I've seen a lot of horrible code in my time, and there is one killer thing that people do that totally kills their threaded app performance; Using TThread.Synchronize too much.
Here's a pathological (Extreme) case, that sadly, occurs in the wild fairly frequently:
procedure TMyThread.Execute;
begin
while not Terminated do
Synchronize(DoWork);
end;
The problem here is that 100% of the work is really done in the foreground, other than the "if terminated" check, which executes in the thread context. To make the above code even worse, add a non-interruptible sleep.
For fast background thread code, use Synchronize sparingly or not at all, and make sure the code it calls is simple and executes quickly, or better yet, use TThread.Queue or PostMessage if you could really live with queueing main thread activity.

Are you asking if a background thread would be faster? If your background thread would run the same code as the main thread and there's nothing else going on in the main thread, you don't stand to gain anything with a background thread. Threads should be used to split and distribute processing loads that would otherwise contend with one another and/or block one another when running in the main thread. Since you seem to be dealing with a case where your main thread is otherwise idle, simply spawning a thread to run slow code will not help.
Threads aren't magic, they can't speed up slow code or eliminate processing bottlenecks in a particular segment not related to contention on the main thread. Make sure your code isn't doing something you don't know about and that your timing methodology is correct.
My first hunch would be that your interaction with the socket is affecting your timing in a way you haven't detected... (I know you said you're sure that's not involved - but maybe check again...)

TPL Tasks, Threads, etc

Could someone clear up to me how these things correlate:
Task
Thread
ThreadPool's thread
Paraller.For/ForEach/Invoke
I.e. when I create a Task and run it, where does it get a thread to execute on? And when I call Parallel.* what is really going on under the covers?
Any links to articles, blogposts, etc are also very welcomed!

The ideal state of a system is to have 1 actively running thread per CPU core. By defining work in more general terms of "tasks", the TPL can dynamically decide how many threads to use and which tasks to do on each one in order to come closest to achieving that ideal state. These are decisions that are almost always best made dynamically at runtime because when writing the code you can't know for sure how many CPU cores will be available to your application, how busy they are with other work, etc.

Thread: is a real OS thread, has handle and ID.
ThreadPool: is a collection of already-created OS Threads. These threads are owned/maintained by the runtime, and your code is only allowed to "borrow" them for a while, you can only do work short-termed work in these threads, and you can't modify any thread state, nor delete these threads.
Best guesses on these two:
Task: might get run on a pre-created thread in the thread pool, or might get run as part of user-mode scheduling, this is all depending on what the runtime thinks is best. Another guess: with TPL, the user-mode scheduling is NOT based on OS Fibers, but is its own complete (and working) implementation).
Parallel.For: actually, no clue how this is implemented. The runtime might create new threads to do the parallel bits, or much more likely use the thread pool's threads for the parallelism.

How many threads can I spawn before efficiency drops?

Is there any formula, maybe involving RAM & number of CPUs, which can give me a rough idea of how many threads I can spawn before it starts to be inefficient and slows the PC?
I want to load test another machine, so want to send requests as quickly as pobbile. But there's no point of spawning a million threads if they will just get in each other's way.
Edit: The threads are making Remote Procedure Calls (SOAP), so will be blocking waiting for the call to return.

It depends on what the threads are doing. If they're doing calculations, then the number will be lower. If they're waiting on I/O, then you can have more.
However, if they're waiting on I/O then you may be able to achieve the same result using async I/O apis better than using multiple threads.

If all threads are active and not blocking waiting for something then basically one thread per CPU (core really). Any more than that and you're relying on the operating system to context switch between the threads on a given CPU.
But it all depends on what the threads are doing. If they're sleeping most of the time or waiting on asynchronous IO operations, then you mostly just need to worry about the memory used for the stack which defaults to about 1MB per thread I believe.

The other answers are of course correct; "it depends". If the threads are busy doing CPU-intensive work, there's no point having more than the number of cores available. But assuming they are waiting on external results, it can vary widely.
I often find that this question is answered by the architecture and requirements of an application; you need as many threads as you need.
But if you potentially have an unlimited number of threads you might end up spawning, I think that probably sounds like a task for the ThreadPool myself; let it decide how many threads to actually have running.

First of all starting a thread may be quite a slow operation itself. When you start a thread stack space must be allocated, entry points in DLLs may be called etc. If you have a lot more threads than available cores then the majority of your threads will not be running at any given moment. I.e. they use resources and contribute nothing.
It is hard to give an exact number of threads for optimal performance, cause it depends on what the threads are doing, but generally you shouldn't go way above the number of available cores. Keep in mind that you cannot have more running threads than the number of available cores.

Mutithreading thread control

How do I control the number of threads that my program is working on?
I have a program that is now ready for mutithreading but one problem is that the program is extremely memory intensive and i have to limit the number of threads running so that i don't run out of ram. The main program goes through and creates a whole bunch of handles and associated threads in suspended state.
I want the program to activate a set number of threads and when one thread finishes, it will automatically unsuspended the next thread in line until all the work has been completed. How do i do this?
Someone has once mentioned something about using a thread handler, but I can't seem to find any information about how to write one or exactly how it would work.
If anyone can help, it would be greatly appreciated.
Using windows and visual c++.
Note: i don't need to worry about the traditional problems of access with the threads, each one is completely independent of each other, its more of like batch processing rather than true mutithreading of a program.
Thanks,
-Faken

Don't create threads explicitly. Create a thread pool, see Thread Pools and queue up your work using QueueUserWorkItem. The thread pool size should be determined by the number of hardware threads available (number of cores and ratio of hyperthreading) and the ratio of CPU vs. IO your work items do. By controlling the size of the thread pool you control the number of maximum concurrent threads.

A Suspended thread doesn't use CPU resources, but it still consumes memory, so you really shouldn't be creating more threads than you want to run simultaneously.
It is better to have only as many threads as your maximum number of simultaneous tasks, and to use a queue to pass units of work to the pool of worker threads.
You can give work to the standard pool of threads created by Windows using the Windows Thread Pool API.
Be aware that you will share these threads and the queue used to submit work to them with all of the code in your process. If, for some reason, you don't want to share your worker threads with other code in your process, then you can create a FIFO queue, create as many threads as you want to run simultaneously and have each of them pull work items out of the queue. If the queue is empty they will block until work items are added to the queue.

There is so much to say here.
There are a few ways
You should only create as many thread handles as you plan on running at the same time, then reuse them when they complete. (Look up thread pool).
This guarantees that you can never have too many running at the same time. This raises the question of funding out when a thread completes. You can have a callback be called just before a thread terminates where a parameter in that callback is the thread handle that just finished. Use Boost bind and boost signals for that. When the callback is called, look for another task for that thread handle and restart the thread. That way all you have to do is add to the "tasks to do" list and the callback will remove the tasks for you. No polling needed, and no worries about too many threads.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string