OmniThreadLibrary memory leak (consumption) on pipeline running from another thread - multithreading

I'm running a pipeline (a thread pipeline from OmniThreadLibrary) from another thread and I see a memory leak, or rather excessive memory consumption. When the application closes everything is fine and no memory leak is reported (ReportMemoryLeaksOnShutdown := True;).
Here is an example: click the button 10 times and the test app will use ~600 MB of memory. Windows 7 x64, Delphi XE6, latest Omni source.
Is it a bug? Or do I need to use different code?
uses
  OtlParallel,
  OtlCommon;
procedure TForm75.Button1Click(Sender: TObject);
begin
  // run an empty pipeline from other threads
  Parallel.&For(1, 100).Execute(
    procedure(value: integer)
    var
      pipe: IOmniPipeline;
    begin
      pipe := Parallel.Pipeline
        .Stage(procedure(const input: TOmniValue; var output: TOmniValue) begin end)
        .Run;
      pipe.Cancel;
      pipe.WaitFor(100000);
      pipe := nil;
    end
  );
end;
Edit 1:
I tested the code with Process Explorer and found that the thread count at runtime is constant, but the handle count keeps growing. If I insert Application.ProcessMessages; at the end of the "for loop" body (after the pipeline code), the test app runs fine: handles are closed and memory consumption stays constant. I don't know why.

How many threads does it create?
Check it in Sysinternals Process Explorer, for example.
Or in the Delphi IDE (View -> Debug Windows -> Threads).
I think that because you block each For worker for quite a long WaitFor, your application creates many worker threads for every button click, and when you click 10 times it consequently creates 10 times as many threads.
And yes, in general-purpose operating systems like Windows, threads are expensive! Google for "windows thread memory footprint" and multiply it by the number of threads created by the 10 parallel-for loops you spawn.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686774.aspx
https://blogs.technet.microsoft.com/markrussinovich/2009/07/05/pushing-the-limits-of-windows-processes-and-threads/
This fact is the reason why, for highly parallel server applications, special approaches were developed to create lightweight application-level threads and bypass OS threads. To name a few:
Make a special language that spawns tens of thousands of cooperative threads and enforces cross-thread memory safety through strict language rules: https://www.erlang.org/docs
Make a library, which cannot enforce those rules but can at least ask the programmer to follow them voluntarily: https://en.wikipedia.org/wiki/Actor_model
Fibers: no-protection threads within threads: What is the difference between a thread and a fiber?
However, OTL, being a generic library for generic threads, imposes few restrictions but relies on OS-provided native threads. Those are expensive both in the CPU time needed to create and release Windows threads (mitigated by the thread pool concept) and in the memory footprint the OS needs to maintain each Windows thread (which is unavoidable, and is what you are seeing).
Of course later, when all those loops have been worked through, their threads are closed and released, together with the memory that was used to maintain them. So there is indeed no memory leak: once you wait long enough for all your threads to terminate, they are gone, along with all the temporarily allocated memory they used as their workspace.
UPD. How to check that hypothesis? The easiest way would be to change how many threads are spawned by every instance of the For loop (by every button click).
See the .NumTasks option of your Parallel.For object:
http://otl.17slon.com/book/chap04.html#leanpub-auto-iomniparallelsimpleloop-interface
By default every button click should spawn one thread for every CPU core, but you can enforce your own thread pool size. Add a .NumTasks(1) call and check memory consumption, then change it to .NumTasks(10) and check again. If the memory consumption grows approximately tenfold, then that is the cause.
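The experiment described above could look roughly like the sketch below. It is the question's own handler with a single .NumTasks call added; it assumes, as the linked book chapter describes, that the interface returned by Parallel.&For exposes NumTasks. Run it once with 1 and once with 10 and compare memory use in Process Explorer.

```pascal
uses
  OtlParallel,
  OtlCommon;

procedure TForm75.Button1Click(Sender: TObject);
begin
  Parallel.&For(1, 100)
    .NumTasks(1) // experiment: try 1 first, then 10, and compare memory consumption
    .Execute(
      procedure(value: integer)
      var
        pipe: IOmniPipeline;
      begin
        // same empty pipeline as in the question
        pipe := Parallel.Pipeline
          .Stage(procedure(const input: TOmniValue; var output: TOmniValue) begin end)
          .Run;
        pipe.Cancel;
        pipe.WaitFor(100000);
        pipe := nil;
      end
    );
end;
```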

Related

Are thread pools safe and is use of them recommended?

I was researching the answer to this question and ran across this post. Is ThreadPool safe? How does ThreadPool compare with the OmniThreadLibrary? What are the pluses and minuses of using each?
Here is an example of what I am doing:
procedure DoWork(nameList: TList<Integer>);
var
  i: Integer;
  oneThread: PerNameThread;
begin
  // one thread is created and started per list item
  for i := 0 to nameList.Count - 1 do
  begin
    oneThread := PerNameThread.Create(Self);
    oneThread.nameID := nameList[i];
    oneThread.Start();
  end;
end;
I am creating a thread for each nameList item, and there could be up to 500 names. All these threads are too many and slow the process down, to the point where it would be faster with just one thread.
First, you need to understand what a thread pool is.
A thread pool is a concept where you have a list of multiple threads that are suspended when they are not performing any tasks.
These threads are defined a bit differently than you are probably used to. Instead of them having all the necessary code inside their Execute() method, their Execute() method only contains a few lines of code to execute external code (giving the threads the ability to perform practically any processing that you require), take care of synchronizing the result back to the caller/UI, and returning the thread to the pool and putting it into a suspended state. When a new task is needed at a later time, a thread is given the task and resumed.
So by providing each thread with a method pointer for a task, you actually define what kind of job each thread will be processing each time it is run.
The main advantage of using a thread pool is that doing so avoids the overhead of creating and destroying a thread for each specific task.
As for OmniThreadLibrary, it is a full blown task management library. It uses its own thread pool and a pretty advanced task managing system that allows you to easily define which tasks can be executed in parallel, which tasks need to be executed in sequence, and which tasks have higher priority than others.
The only drawback of OmniThreadLibrary is that it is still limited to Windows only, so if you are thinking of providing multiplatform support for your application then you will have to find another solution.
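As a rough illustration of the pool idea applied to the 500-name case, here is a minimal sketch. It does not use OmniThreadLibrary; it assumes Delphi XE7 or later, where the System.Threading unit provides TParallel.For, which spreads the iterations over a shared thread pool instead of creating one thread per item. ProcessName is a hypothetical placeholder for whatever PerNameThread did for one name.

```pascal
program NamePoolDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Threading,
  System.Generics.Collections;

// Hypothetical stand-in for the per-name work done by PerNameThread.
procedure ProcessName(NameID: Integer);
begin
  // real work for one name would go here
end;

procedure DoWork(NameList: TList<Integer>);
begin
  // TParallel.For returns only after every index has been processed.
  // The iterations run on pool threads, so 500 names do not mean 500 threads.
  TParallel.For(0, NameList.Count - 1,
    procedure(Index: Integer)
    begin
      ProcessName(NameList[Index]); // the list is only read, never modified
    end);
end;

var
  Names: TList<Integer>;
  I: Integer;
begin
  Names := TList<Integer>.Create;
  try
    for I := 1 to 500 do
      Names.Add(I);
    DoWork(Names);
    Writeln('Processed ', Names.Count, ' names');
  finally
    Names.Free;
  end;
end.
```

If the per-name work touches shared state, that state still needs its own synchronization; the pool only limits how many threads exist.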

What happens if a process keeps creating threads?

What happens if a process keeps creating threads especially when the number of threads exceeds the limit of the OS? What will Windows and Linux do?
If the threads aren't doing any work (i.e. you don't start them), then on Windows you're subject to resource limitations as pointed out in the blog post that Hans linked. A Linux system, too, will have some limit on the number of threads it can create; after all, your computer doesn't have infinite virtual memory, so at some point the call to create a thread is going to fail.
If the threads are actually doing work, what usually happens is that the system starts thrashing. Each thread (including the program's main thread) gets a small timeslice (typically measured in tens of milliseconds), and then it gets swapped out for the next available thread. With so many threads, their working sets are large enough to occupy all available RAM, so every thread context switch requires that pages of the currently running thread are written out to the page file (disk) and pages of the next available thread are read back from disk. So the system spends more time doing thread context switches than it does actually running the threads.
The threads will continue to execute, but very very slowly, and eventually you will run out of virtual memory. However, it's likely that it would take an exceedingly long time to create that many threads. You would probably give up and shut the machine off.
Most often, a machine that's suffering from this type of thrashing acts exactly like a machine that's stuck in an infinite loop on all cores. Even pressing Control+Break (or similar) won't take effect immediately because the thread that's handling that signal has to be in memory and running in order to process it. And after the thread does respond to such a signal, it takes an exceedingly long time for it to terminate all of the threads and clean up virtual memory.

Why threads starve even on preemptive multitasking OS (Windows 7)

I wrote a Win32 application (in Delphi-7 which is 32-bit using TThread class) to create 100 threads. Each thread when resumed will continuously (in a loop) increment a 64 bit counter associated with the thread object (so no locking or sharing of data).
If you let the system run for 10 to 15 seconds and stop after that, you should see roughly the same counts in each of the threads. But what I observed was that 81 threads ran under 400 million loops and the remaining ones looped more than 950 million times. Slowest thread got only 230 million compared to the fastest 2111 million.
According to MSDN, preemptive multitasking works at the thread level (not the process level), so each of my threads should have gotten its time slice in a round-robin fashion. What am I missing here, and why is there such a discrepancy?
Edit1: Machine configuration: Intel i7 Quad Core 3.4GHz with hyper-threading turned on (8 active threads at a time). Running Windows-7 64 bit professional (and the test application is 32 bit)
Edit2 (thread code): The test application is built with optimization turned on and without any debug info. Run the test application outside of IDE.
type
  TMyThread = class(TThread)
  protected
    FCount: Int64;
  public
    constructor Create;
    procedure Execute; override;
    property Count: Int64 read FCount;
  end;
{ TMyThread }
constructor TMyThread.Create;
begin
  inherited Create(True);
  FCount := 0;
end;
procedure TMyThread.Execute;
begin
  inherited;
  while not Terminated do
  begin
    Inc(FCount);
  end;
end;
Round-robin scheduling is an obvious strategy for a kernel. However, that's not the way the Windows scheduler works. It used to, back in the Windows 9x days, which had a scheduler that was very capable of giving various VMs equal time. But not in the NT branch, started by Dave Cutler's group, where scheduling is purely based on priority.
Whatever thread has the highest priority gets the CPU. There's another chunk of code in Windows that tinkers with a thread's priority, modifying it from the default priority it got when the thread was created. That code is aware of things like a thread owning a window that's in the foreground, or a thread that's waiting for a synchronization object that got signaled. It also handles the more bizarre scheduling cases, such as trying to solve a priority inversion problem by randomly giving a thread a chance to run even though it wasn't its turn.
Focus on writing sane code first. Starting a hundred threads isn't a very sane thing to do. You are trying to consume resources that the machine doesn't actually have available, nobody has a machine with a hundred cores. Yet. Powers of two, get a machine with 128 cores first.
I have reproduced and confirm your results. Additionally, disabling thread priority boost doesn't change the distribution. GetThreadTimes reports that threads with higher Values took more UserTime and vice versa, while KernelTime seems to have no correlation with Values.
Thread 97: 1081,5928 Ke:0 Us:25116161
Thread 98: 1153,8029 Ke:0 Us:26988173
Thread 99: 704,6996 Ke:0 Us:16848108
Clearly, some threads really get to run more often than others.
I haven't graphed the results, but I suppose what we're seeing is a normal distribution, which means the results depend on a number of factors, some of which are random.
I tried disabling hyper-threading (this kinda smoothed the results), then assigning each thread a single physical processor (by using SetThreadAffinityMask). In the second case, Values were much closer to each other.
SetThreadAffinityMask(Self.Handle, 1 shl (FIndex mod 4));
I can sort of understand how running on a hyper-threaded system can make some threads "unlucky": they are scheduled to compete with other threads on the same physical processor, and because of "soft affinity" to this virtual core they get to run on it again and again, thus scoring lower than others.
But as to why binding each thread to a fixed core helps on a non-hyperthreaded system, I don't know.
There are probably other random things involved, such as activity on the cores from other processes. A thread can get "unlucky" if some other process's thread associated with the same core suddenly wakes up and starts doing some (relatively) heavy work.
All of this is guessing though.
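For anyone who wants to repeat the experiment, here is a minimal sketch of how the pieces mentioned above (SetThreadAffinityMask, disabling priority boost, GetThreadTimes) might fit together. It assumes Delphi XE2 or later unit names (on Delphi 7 the units are Windows, SysUtils and Classes, and DWORD_PTR may need replacing with DWORD). ThreadCount, CoreCount and the two-second run time are arbitrary illustration values, not the numbers used in the original test.

```pascal
program AffinityDemo;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  System.SysUtils,
  System.Classes;

const
  ThreadCount = 8; // illustration value; the question used 100 threads
  CoreCount = 4;   // matches the "mod 4" above; adjust to your machine

type
  TPinnedThread = class(TThread)
  private
    FIndex: Integer;
    FCount: Int64;
  protected
    procedure Execute; override;
  public
    constructor Create(AIndex: Integer);
    property Count: Int64 read FCount;
  end;

constructor TPinnedThread.Create(AIndex: Integer);
begin
  inherited Create(True); // created suspended, as in the question
  FIndex := AIndex;
end;

procedure TPinnedThread.Execute;
begin
  // Bind this thread to one logical processor.
  SetThreadAffinityMask(GetCurrentThread, DWORD_PTR(1) shl (FIndex mod CoreCount));
  // Passing True disables the dynamic priority boost for this thread.
  SetThreadPriorityBoost(GetCurrentThread, True);
  while not Terminated do
    Inc(FCount);
end;

// User-mode CPU time of a thread in 100 ns units, or -1 if the call fails.
function UserTime(AThread: TThread): Int64;
var
  CreationTime, ExitTime, KernelTime, UserFileTime: TFileTime;
begin
  Result := -1;
  if GetThreadTimes(AThread.Handle, CreationTime, ExitTime, KernelTime, UserFileTime) then
    Result := (Int64(UserFileTime.dwHighDateTime) shl 32) or UserFileTime.dwLowDateTime;
end;

var
  Threads: array[0..ThreadCount - 1] of TPinnedThread;
  I: Integer;
begin
  for I := 0 to ThreadCount - 1 do
    Threads[I] := TPinnedThread.Create(I);
  for I := 0 to ThreadCount - 1 do
    Threads[I].Start;
  Sleep(2000); // let the counters run for a while
  for I := 0 to ThreadCount - 1 do
    Threads[I].Terminate;
  for I := 0 to ThreadCount - 1 do
  begin
    Threads[I].WaitFor;
    Writeln(Format('Thread %d: count=%d user=%d',
      [I, Threads[I].Count, UserTime(Threads[I])]));
    Threads[I].Free;
  end;
end.
```

Comparing the printed counts with and without the two API calls in Execute is one way to see how much of the spread comes from thread placement.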
Windows 7 is designed for user land. When your first thread wants to do work, the OS gives it a time slice. You, the user, just started it after all. By the time the 50th thread in succession (from the same process !) wants to do work, higher priority threads (background processes controlled by Windows 7 itself) step in. This is happening in such a fashion as to make some threads luckier.
You and I don't really want a personal OS that hands out CPU time based on the whims of user land processes. I would be curious to see how 2008 R2 server handled this. You also might play around with the Advanced tab setting: "Choose how to allocate processor resources".
There is some good reasoning here, but there are a few more things to take into consideration.
Windows does its multitasking in software.
Your hardware isn't multitasking by itself; it uses raw speed to do what a parallel-processing system would do.
Windows assigns priority in many ways, and it is confusing.
Let me explain it this way.
I have a small program that watches how my cores are used.
When Windows loads, you would think that ALL the cores would get used. Nope.
As Windows loads, the other cores only gradually start to get used.
You would then think that loading would accelerate as Windows gets access to the cores. It doesn't accelerate. It doesn't use the cores at FULL speed to load faster.
Even if Windows assigned programs to one core EACH as they were loading and running, it WAITS for them to finish. If it used ALL the cores to process each program, it would use software (about 100 times slower than hardware) to assemble the parts at the other end.
Long ago, Intel wanted to change the hardware to parallel processing, and MS said 'NO' because their software wasn't designed for it. NOW they are trying to push serial hardware design to the Nth degree. Even after MS bought the NT software, they have recently forgotten to use much of its design.
There need to be some hardware changes. There need to be programming language changes (MS created the programming language), and the core of Windows needs to be designed again. NOT changed: it needs to go back and start from scratch. Good luck with that.
To tell you how old this idea is... VIVA La' Amiga.

How to get the lowest cpu consumption when having an infinite loop in a thread

1. I have some infinite loops. How can I get the lowest CPU consumption? Should I use a delay?
2. If I have multiple threads running in my application and one of them is THREAD_PRIORITY_IDLE, does it affect the other threads?
My code looks like this for every thread:
procedure TMatchLanLon.Execute;
begin
  while not Terminated do
  begin
    //some code
    Sleep(1000);
  end;
end;
Typically a thread should sleep until signalled, but not by using Sleep or SleepEx.
You create an event and wait for it to be signalled, either using TEvent or going directly to the Win32 API with WaitForSingleObject.
Sleep causes so many problems, including what I call "Sleeping Beauty" disease. Your thread has slept for a "million years" in relative computer timing terms, and when it wakes up the rest of your application terminated and shut down long ago (a few hundred microseconds, perhaps). The next thing your background thread is likely to do is access some object it holds a reference to, which has been freed, and then (if you're lucky) it will crash. Don't use Sleep in threads. Wait for events, or use some pre-built worker thread (like the OmniThreadLibrary one).
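The event-based wait described above might be sketched as follows. This is only one possible shape, assuming Delphi XE2 or later, where TThread has the virtual TerminatedSet method; TWorkerThread and FWakeUp are illustrative names. The one-second timeout keeps the periodic behaviour of the original loop, but Terminate now wakes the thread at once instead of leaving it asleep.

```pascal
program EventWaitDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes,
  System.SyncObjs;

type
  TWorkerThread = class(TThread)
  private
    FWakeUp: TEvent;
  protected
    procedure Execute; override;
    procedure TerminatedSet; override; // called by Terminate
  public
    constructor Create;
    destructor Destroy; override;
  end;

constructor TWorkerThread.Create;
begin
  // Manual-reset event, initially non-signalled.
  FWakeUp := TEvent.Create(nil, True, False, '');
  inherited Create(False);
end;

destructor TWorkerThread.Destroy;
begin
  inherited; // terminates and waits for the thread while the event still exists
  FWakeUp.Free;
end;

procedure TWorkerThread.TerminatedSet;
begin
  inherited;
  FWakeUp.SetEvent; // wake the thread immediately so it can exit
end;

procedure TWorkerThread.Execute;
begin
  while not Terminated do
  begin
    // ... periodic work goes here ...
    // Blocks for up to one second, or returns at once when the event is set.
    FWakeUp.WaitFor(1000);
  end;
end;

var
  Worker: TWorkerThread;
begin
  Worker := TWorkerThread.Create;
  try
    Writeln('Worker running, press Enter to stop');
    Readln;
  finally
    Worker.Free; // returns promptly instead of waiting out a Sleep
  end;
end.
```

If the thread should wake up for new work as well as for termination, the same event (or a second one) can be set by whoever queues the work.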
I have some infinite loops, how can I get the lowest CPU consumption?
By blocking the loop until there is something to do.
If I have multiple threads running in my application and one of them is THREAD_PRIORITY_IDLE, does it affect other threads?
It depends. Probably not, but if any other threads are waiting on output from this thread, or on the release of a lock it holds, then those threads are effectively 'dragged down' to THREAD_PRIORITY_IDLE as well.
Apart from this priority inversion (which can cause deadlocks when threads have several priority levels), spinlocks, a synchronization construct that is normally merely bad, can become disastrous.

How to program number of your threads in Delphi

I found this on the Dr Dobbs site today at
http://www.ddj.com/hpc-high-performance-computing/220300055?pgno=3
It's a nice suggestion regarding thread implementation.
What is the best way of achieving this with TThread in Delphi, I wonder?
Thanks
Brian
=== From Dr Dobbs ==============
Make multithreading configurable! The number of threads used in a program should always be configurable from 0 (no additional threads at all) to an arbitrary number. This not only allows a customization for optimal performance, but it also proves to be a good debugging tool and sometimes a lifesaver when unknown race conditions occur on client systems. I remember more than one situation where customers were able to overcome fatal bugs by switching off multithreading. This of course does not only apply to multithreaded file I/O.
Consider the following pseudocode:
int CMyThreadManger::AddThread(CThreadObj theTask)
{
    if(mUsedThreadCount >= gConfiguration.MaxThreadCount())
        return theTask.Execute(); // execute task in main thread
    // add task to thread pool and start the thread
    ...
}
Such a mechanism is not very complicated (though a little bit more work will probably be needed than shown here), but it sometimes is very effective. It also may be used with prebuilt threading libraries such as OpenMP or Intel's Threaded Building Blocks. Considering the measurements shown here, it's a good idea to include more than one configurable thread count (for example, one for file I/O and one for core CPU tasks). The default might be 0 for file I/O and <number of cores found> for CPU tasks. But all multithreading should be detachable. A more sophisticated approach might even include some code to test multithreaded performance and set the number of threads used automatically, maybe even individually for different tasks.
===================
I would create an abstract class TTask. This class is meant to execute the task, through its Execute method:
type
  TTask = class abstract
  protected
    procedure DoExecute; virtual; abstract;
  public
    procedure Execute;
  end;
  TTaskThread = class (TThread)
  private
    FTask : TTask;
  public
    constructor Create(const ATask: TTask);
    // Assigns FTask and enables thread, free on terminate.
    procedure Execute; override; // Calls FTask.DoExecute.
  end;
The method Execute checks the number of threads. If the max is not reached, it starts a thread using TTaskThread, which calls DoExecute and so executes the task in a thread. If the max is reached, DoExecute is called directly.
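A minimal sketch of how the two method bodies might be filled in, under several assumptions that are not in the answer: the thread count is tracked in a unit-level counter (ActiveThreadCount), the limit lives in a plain variable (MaxThreadCount, standing in for a real configuration value), and the caller keeps ownership of the task object and must not free it while it may still be running. Note that this TTask would clash by name with System.Threading.TTask in newer Delphi versions if both units are used together.

```pascal
unit TaskSketch;

interface

uses
  System.Classes,
  System.SyncObjs;

type
  TTask = class abstract
  protected
    procedure DoExecute; virtual; abstract;
  public
    procedure Execute;
  end;

  TTaskThread = class(TThread)
  private
    FTask: TTask;
  protected
    procedure Execute; override;
  public
    constructor Create(const ATask: TTask);
  end;

var
  // Hypothetical configuration value; 0 means "no additional threads at all".
  MaxThreadCount: Integer = 2;

implementation

var
  ActiveThreadCount: Integer = 0; // number of TTaskThread instances currently running

constructor TTaskThread.Create(const ATask: TTask);
begin
  // The thread only starts running after the constructor has finished,
  // so it is safe to set the fields after the inherited call.
  inherited Create(False);
  FTask := ATask;
  FreeOnTerminate := True;
end;

procedure TTaskThread.Execute;
begin
  try
    FTask.DoExecute;
  finally
    TInterlocked.Decrement(ActiveThreadCount); // give the slot back
  end;
end;

procedure TTask.Execute;
begin
  // Reserve a slot atomically; if that exceeds the limit, undo and run inline.
  if TInterlocked.Increment(ActiveThreadCount) <= MaxThreadCount then
    TTaskThread.Create(Self)
  else
  begin
    TInterlocked.Decrement(ActiveThreadCount);
    DoExecute; // limit reached: run in the calling thread
  end;
end;

end.
```

With MaxThreadCount set to 0 every task runs in the calling thread, which is the "switch multithreading off" behaviour the article asks for.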
The answer by Gamecat is good as far as the abstract task class is concerned, but I think calling DoExecute() for a task in the calling thread (as the article itself does too) is a bad idea. I would always queue the tasks to be executed by background threads, unless threading was disabled completely, and here's why.
Consider the following (contrived) case, where you need to execute three independent CPU-bound procedures:
Procedure1_WhichTakes200ms;
Procedure2_WhichTakes400ms;
Procedure3_WhichTakes200ms;
For better utilisation of your dual core system you want to execute them in two threads. You would limit the number of background threads to one, so with the main thread you have as many threads as cores.
Now the first procedure will be executed in a worker thread, and it will finish after 200 milliseconds. The second procedure will start immediately and be executed in the main thread, as the single configured worker thread is already occupied, and it will finish after 400 milliseconds. Then the last procedure will be executed in the worker thread, which has already been sleeping for 200 milliseconds now, and will finish after 200 milliseconds. Total execution time 600 milliseconds, and for 2/3 of that time only one of both threads was actually doing meaningful work.
You could reorder the procedures (tasks), but in real life it's probably impossible to know in advance how long each task will take.
Now consider the common way of employing a thread pool. As per configuration you would limit the number of threads in the pool to 2 (number of cores), use the main thread only to schedule the threads into the pool, and then wait for all tasks to complete. With above sequence of queued tasks thread 1 would take the first task, thread two would take the second task. After 200 milliseconds the first task would complete, and the first worker thread would take the third task from the pool, which is empty afterwards. After 400 milliseconds both the second and the third task would complete, and the main thread would be unblocked. Total time for execution 400 milliseconds, with 100% load on both cores in that time.
At least for CPU-bound threads it's of vital importance to always have work queued for the OS scheduler. Calling DoExecute() in the main thread interferes with that, and shouldn't be done.
I generally have only one class inheriting from TThread, one that takes 'worker items' from a queue or stack, and have them suspend when no more items are available. The main program can then decide how many instances of this thread to instantiate and start. (using this config value).
This 'worker items queue' should also be smart enough to resume suspended threads or create a new thread when required (and when the limit permits it), when a worker item is queued or a thread has finished processing a worker item.
My framework allows for a thread pool count for any of the threads in a configuration file, if you wish to have a look (http://www.csinnovations.com/framework_overview.htm).
From a certain version on (I think it was one of the XE versions), Delphi has included a Parallel Programming Library:
https://docwiki.embarcadero.com/RADStudio/Sydney/en/Using_the_Parallel_Programming_Library
It has the TTask class to schedule work, several configuration options, and the possibility to create your own thread pool(s).
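A minimal sketch of a configurable thread count built on that library, assuming Delphi XE7 or later (System.Threading). RunAll and WorkerCount are illustrative names; a WorkerCount of 0 runs everything in the calling thread, as the Dr Dobbs text suggests. The order of the two limit calls reflects my understanding that a pool rejects a maximum below its current minimum, which is worth verifying against your Delphi version.

```pascal
program ConfigurablePoolDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.SyncObjs,
  System.Threading;

// Runs every procedure in Work, using WorkerCount pool threads,
// or the calling thread only when WorkerCount is 0.
procedure RunAll(const Work: array of TProc; WorkerCount: Integer);
var
  Pool: TThreadPool;
  Tasks: TArray<ITask>;
  I: Integer;
begin
  if WorkerCount <= 0 then
  begin
    for I := Low(Work) to High(Work) do
      Work[I](); // threading switched off: run inline
    Exit;
  end;
  Pool := TThreadPool.Create;
  try
    // Lower the minimum first so that a small maximum is accepted.
    Pool.SetMinWorkerThreads(1);
    Pool.SetMaxWorkerThreads(WorkerCount);
    SetLength(Tasks, Length(Work));
    for I := Low(Work) to High(Work) do
      Tasks[I] := TTask.Run(Work[I], Pool); // queue on our own pool
    TTask.WaitForAll(Tasks); // block until every queued task has finished
  finally
    Pool.Free;
  end;
end;

var
  Done: Integer = 0;
  Work: TArray<TProc>;
  I: Integer;
begin
  SetLength(Work, 3);
  for I := 0 to High(Work) do
    Work[I] :=
      procedure
      begin
        TInterlocked.Increment(Done); // stands in for real work
      end;
  RunAll(Work, 2); // two worker threads
  RunAll(Work, 0); // threading disabled
  Writeln('Tasks executed: ', Done);
end.
```

All tasks are queued rather than run by the scheduling thread, which follows the advice given earlier about keeping work queued for the pool.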

Resources