Delphi Thread timing - multithreading

I am trying to evaluate the speed of simultaneous threads. I don't understand the result; it's as if there is a lock somewhere. I am running the following on a Dell 3571 with a 20-core/thread i9:
unit Unit1;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TMyThread = class(TThread)
  public
    procedure Execute; override;
  end;

  TForm1 = class(TForm)
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
    procedure Log(Sender: TMyThread; Log: string);
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Log(Sender: TMyThread; Log: string);
begin
  Memo1.Lines.Add(Log);
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  Thr: array[0..19] of TMyThread;
begin
  for var t := 0 to 9 do
  begin
    var Thread := TMyThread.Create(True);
    Thr[t] := Thread;
    Thread.Priority := tpHigher;
  end;
  for var t := 0 to 9 do
    Thr[t].Start;
end;

{ TMyThread }

procedure TMyThread.Execute;
begin
  Sleep(500);
  try
    var ii: NativeInt := 0;
    var Start := GetTickCount;
    for var i := 0 to 750000000 do
      Inc(ii);
    var Delta := GetTickCount - Start;
    Synchronize(
      procedure
      begin
        Form1.Log(Self, Format('Done Loading : %dms', [Delta]));
      end);
  except
    // swallow exceptions
  end;
end;

end.
While running this with 1 thread, I am getting 320 ms for one calculation.
While running this with 10 threads, I am getting:
Done Loading : 344ms
Done Loading : 375ms
Done Loading : 391ms
Done Loading : 422ms
Done Loading : 438ms
Done Loading : 469ms
Done Loading : 469ms
Done Loading : 469ms
Done Loading : 516ms
Done Loading : 531ms
Should all the results be almost the same, at 320 ms?
PS: I have tried with the Windows CreateThread API, ITask... same result whatever the number of threads.
Any idea? Thank you.

You are spinning up 10 threads (in addition to the main thread running the application), but you are not telling Windows anything about scheduling them other than setting the priority to "Higher". All that can be determined by that "higher" thread priority is that your 10 threads all have the same priority when the Windows scheduler comes to allocate timeslice on a CPU/core.
Unless told otherwise, the scheduler will look at many factors to determine which core/CPU to schedule any thread on at any given point in time.
As a result, each thread could find itself being switched from one core to another on each timeslice, incurring a relatively expensive "context switch" each time. Or it may be scheduled on the same core it is already on. Threads that are consistently scheduled on the same core will perform "better" than any threads doing the same work with the overhead of numerous context switches.
To ensure that threads consistently run on the same core, avoiding potentially costly context switching, you need to set the processor affinity of each thread. This is accomplished using SetThreadAffinityMask.
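As a minimal sketch (assuming the TMyThread class from the question, with a hypothetical FIndex field assigned at creation to identify each thread), each thread could pin itself to one logical processor before doing its timed work:

```delphi
procedure TMyThread.Execute;
begin
  // Pin this thread to a single logical processor:
  // bit N of the mask corresponds to CPU N.
  // FIndex is a hypothetical 0-based thread index set at creation.
  SetThreadAffinityMask(GetCurrentThread, 1 shl FIndex);
  // ... perform the timed work as before ...
end;
```

SetThreadAffinityMask is declared in Winapi.Windows; it returns the previous mask, or zero on failure.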
But It's More Complicated Than That
In addition, on modern CPUs there may be a mix of higher performance vs higher efficiency (typically slower) cores, so again depending on which core a particular thread is scheduled on at any given moment, it may be running "faster" or "slower" than other threads (though the OS should be ensuring that the most demanding threads are scheduled onto higher performance cores, this cannot be relied on as there are other factors).
Whilst you can contrive to schedule each thread on a separate, consistent core, what you can't do (so easily) is also determine what else Windows decides to schedule on each core, so there will still be some variability in performance between threads performing ostensibly the same work, depending on what the core they are assigned to is also doing.
If you are embarking on a project intended to extract maximum performance from a system via threading on a range of different hardware configurations (dual/quad/hexa/octa/more-core systems), be aware that the ideal configuration of your threads will vary across those different configurations.
This is particularly true when you develop real workloads to be performed by your threads rather than synthetic metrics-gathering workloads. If those workloads are periodically blocked by I/O they may be better off running on efficiency cores, or running on performance cores for CPU-bound work then re-scheduling onto efficiency cores when in an I/O wait-state (if you have that mix in a given workload). This is precisely the job that the OS scheduler will do for you (or try to).
If necessary, you will either need to devise heuristic techniques to adapt the configuration dynamically or provide some mechanism for the software to be manually configured to "tune" performance (or both).
Or, don't worry about it; accept that such variability is unavoidable and allow the OS to do the best job it can and only worry about stepping in to "do a better job" via configuration if it actually proves necessary.

Your processor is the i9 12900H. This has 6 performance cores, and 8 efficient cores. The performance cores are fast at the cost of power consumption, the efficient cores are slower, but consume much less power.
So this means that you do not have a symmetric set of processors. Although Intel might like you to think that you have 20 processors, you actually only have 6 fast processors. So I predict that if you changed your program to run 6 threads you would get results closer to your expectation.
Even accounting for all of this, you still cannot always expect linear scaling in a real world application. Your code is an artificial test, and the inner loop can be implemented entirely using registers. But with real world applications you can expect code to use main memory. And then in order to achieve linear scaling you depend on the memory system to deliver data to the processors efficiently. Whether this can be achieved depends on interplay with the program and the hardware.

Related

OmniThreadLibrary memory leak (consumption) on pipeline running from another thread

I'm running a pipeline (a thread pipeline from OmniThreadLibrary) from another thread and got a memory leak, or rather memory consumption. But when the application closes it's OK and there is no memory leak report (ReportMemoryLeaksOnShutdown := True;).
Here's an example: click the button 10 times and the test app will use ~600 MB of memory. Windows 7 x64, Delphi XE6, latest Omni source.
Is it a bug? Or do I need to use different code?
uses
  OtlParallel,
  OtlCommon;

procedure TForm75.Button1Click(Sender: TObject);
begin
  // run an empty pipeline from other threads
  Parallel.&For(1, 100).Execute(
    procedure(value: integer)
    var
      pipe: IOmniPipeline;
    begin
      pipe := Parallel.Pipeline
        .Stage(procedure(const input: TOmniValue; var output: TOmniValue) begin end)
        .Run;
      pipe.Cancel;
      pipe.WaitFor(100000);
      pipe := nil;
    end
  );
end;
Edit 1:
Tested that code with Process Explorer and found that the thread count at runtime is constant, but the handle count grows. If I insert Application.ProcessMessages; at the end of the for loop (after the pipe's code) then the test app runs fine, the handles are closed, and memory consumption is constant. Don't know why.
How many threads does it create?
Check it in SysInternals Process Explorer, for example.
Or in the Delphi IDE (View -> Debug Windows -> Threads).
I think that because you block each For-worker with quite a long WaitFor, your application creates many worker threads for every button click, and when you click it 10 times it consequently creates 10 times as many threads.
And yes, in general-purpose operating systems like Windows, threads are expensive! Google for "windows thread memory footprint" - and multiply it by the number of threads created by the 10 parallel-for loops you spawn.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686774.aspx
https://blogs.technet.microsoft.com/markrussinovich/2009/07/05/pushing-the-limits-of-windows-processes-and-threads/
This fact is the reason that, to make highly parallel server applications possible, special approaches were devised to create lightweight application-level threads and bypass OS threads. To name a few:
Make a special language that can spawn tens of thousands of cooperative threads and enforce cross-thread memory safety by strict language rules: https://www.erlang.org/docs
Make a library, which cannot enforce those regulations but can at least ask the programmer to follow them voluntarily: https://en.wikipedia.org/wiki/Actor_model
Fibers: no-protection threads within threads: What is the difference between a thread and a fiber?
However, OTL, being a generic library for generic threads, imposes few restrictions and relies on OS-provided native threads, which are expensive both in the CPU time needed to create/release Windows threads (mitigated by the thread-pool concept) and in the memory footprint the OS needs to maintain each Windows thread (which is unavoidable, and you are seeing its manifestation).
Of course, later, when all those loops have been worked through, their threads get closed and released, together with the memory that was used to maintain them. So there is no memory leak indeed: once you wait long enough for all your threads to terminate, they do, along with all the temporarily allocated memory they used as their workspaces.
UPD. How to check that hypothesis? The easiest way would be to change how many threads are spawned by every instance of the For loop (by every button click).
See the .NumTasks option of your Parallel.For object:
http://otl.17slon.com/book/chap04.html#leanpub-auto-iomniparallelsimpleloop-interface
By default, every button click should spawn one thread for every CPU core. But you can enforce your own thread-pool size. Add a .NumTasks(1) call and check memory consumption, then change it to .NumTasks(10) and check again. If the memory consumption grows approximately tenfold after that, then that's it.
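A sketch of that experiment, assuming the OTL Parallel.For API from the question (vary the NumTasks value between runs and compare memory use):

```delphi
uses
  OtlParallel;

procedure TForm75.Button1Click(Sender: TObject);
begin
  // Limit the For loop to a single worker task; compare memory
  // consumption against .NumTasks(10) to test the hypothesis that
  // the per-click thread count drives the memory growth.
  Parallel.&For(1, 100).NumTasks(1).Execute(
    procedure(value: integer)
    begin
      // ... pipeline code from the question ...
    end);
end;
```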

How to disperse single core process to multi core

I use a dual-core machine, Delphi XE6, and an API which doesn't support multi-core. An application built with these fully loads one core. Can I disperse the load to the other core?
If you want to utilize all available CPUs and increase performance of the app, then you have to rewrite your app as it is suggested in another answer. IT WILL IMPROVE PERFORMANCE.
If you just want to spread execution of the app over all available CPUs, for example to get 25% load on every CPU of a quad-core processor (instead of 100% load on a single CPU), then it should be enough to set the correct affinity mask for the process; such a task has already been discussed here, for example. But it also depends on OS settings: Windows may limit the number of CPUs available to an app. IT WILL NOT IMPROVE PERFORMANCE.
Demo project for David:
procedure TForm18.FormCreate(Sender: TObject);
begin
  while not Application.Terminated do
    Application.ProcessMessages;
end;
Affinity mask 1 (4 CPUs allowed):
Affinity mask 2 (1 CPU allowed):
Definition from MSDN:
A process affinity mask is a bit vector in which each bit represents the processors that a process is allowed to run on.
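As a sketch, restricting the current process to the first four logical processors could look like this (SetProcessAffinityMask is declared in Winapi.Windows; error handling omitted):

```delphi
uses
  Winapi.Windows;

procedure AllowFirstFourCPUs;
begin
  // Bits 0..3 set ($F): the process may run on
  // logical processors 0-3 only.
  SetProcessAffinityMask(GetCurrentProcess, $F);
end;
```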
What you are looking for is more than likely multithreading. (Or maybe multiprocessing.)
You should rewrite your application to process data and perform tasks in a parallel manner. While that is not a really difficult task, the wording of your question suggests that you are not familiar with the concept of parallel programming.
If you wish to learn how to utilize threads, this SO question, this Embarcadero article, and this DelphiGeek article might help you find the general direction.

Are thread pools safe and is use of them recommended?

I was researching the answer to this question and ran across this post. Is ThreadPool safe? How does ThreadPool compare with the OmniThreadLibrary? What are the pluses and minuses of using each?
Here is an example of what I am doing:
procedure DoWork(nameList: TList<Integer>);
var
  i: Integer;
  oneThread: PerNameThread;
begin
  for i := 0 to nameList.Count - 1 do
  begin
    oneThread := PerNameThread.Create(Self);
    oneThread.nameID := nameList[i];
    oneThread.Start();
  end;
end;
I am creating a thread for each nameList item, and this could be up to 500 names. All these threads are too much and slow down the process, to the point where it would be faster with just one thread.
First, you need to understand what a thread pool is.
A thread pool is a concept where you have a list of multiple threads that are suspended when they are not performing any tasks.
These threads are defined a bit differently than you are probably used to. Instead of having all the necessary code inside their Execute() method, their Execute() method contains only a few lines of code that execute external code (giving the threads the ability to perform practically any processing you require), take care of synchronizing the result back to the caller/UI, and return the thread to the pool, putting it into a suspended state. When a new task is needed later, a thread is given the task and resumed.
So by providing each thread with a method pointer for a task, you actually define what kind of job each thread will be processing each time it is run.
The main advantage of using a thread pool is that doing so avoids the overhead of creating and destroying a thread for each specific task.
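For the 500-name scenario above, a minimal sketch using Delphi's built-in thread pool via TTask (System.Threading, available since XE7) instead of one thread per name; ProcessName is a hypothetical per-name worker routine:

```delphi
uses
  System.Threading, System.Generics.Collections;

procedure DoWork(nameList: TList<Integer>);
begin
  for var i := 0 to nameList.Count - 1 do
  begin
    var nameID := nameList[i]; // capture a stable copy per iteration
    // Each task is queued to the shared thread pool rather than
    // getting a dedicated thread of its own.
    TTask.Run(
      procedure
      begin
        ProcessName(nameID); // hypothetical per-name worker
      end);
  end;
end;
```

The pool decides how many OS threads actually run, so 500 queued tasks no longer means 500 simultaneous threads.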
As for OmniThreadLibrary, it is a full blown task management library. It uses its own thread pool and a pretty advanced task managing system that allows you to easily define which tasks can be executed in parallel, which tasks need to be executed in sequence, and which tasks have higher priority than others.
The only drawback of OmniThreadLibrary is that it is still limited to Windows only, so if you are thinking of providing multiplatform support for your application then you will have to find another solution.

Why threads starve even on preemptive multitasking OS (Windows 7)

I wrote a Win32 application (in Delphi 7, which is 32-bit, using the TThread class) to create 100 threads. Each thread, when resumed, continuously (in a loop) increments a 64-bit counter associated with the thread object (so no locking or sharing of data).
If you let the system run for 10 to 15 seconds and then stop it, you should see roughly the same counts in each of the threads. But what I observed was that 81 threads ran under 400 million loops while the remaining ones looped more than 950 million times. The slowest thread got only 230 million, compared to 2111 million for the fastest.
According to MSDN, preemptive multitasking is at the thread level (not the process level), so each of my threads should have gotten its time slice in a round-robin fashion. What am I missing here, and why this discrepancy?
Edit1: Machine configuration: Intel i7 Quad Core 3.4GHz with hyper-threading turned on (8 active threads at a time). Running Windows-7 64 bit professional (and the test application is 32 bit)
Edit2 (thread code): The test application is built with optimization turned on and without any debug info. Run the test application outside of IDE.
type
  TMyThread = class(TThread)
  protected
    FCount: Int64;
  public
    constructor Create;
    procedure Execute; override;
    property Count: Int64 read FCount;
  end;

{ TMyThread }

constructor TMyThread.Create;
begin
  inherited Create(True);
  FCount := 0;
end;

procedure TMyThread.Execute;
begin
  inherited;
  while not Terminated do
  begin
    Inc(FCount);
  end;
end;
Round-robin scheduling is an obvious strategy for a kernel. That is, however, not the way the Windows scheduler works. It used to be, back in the Windows 9x days, when the scheduler was very capable of giving the various VMs equal time. But not in the NT branch, started by Dave Cutler's group: there, scheduling is purely based on priority.
Whatever thread has the highest priority gets the CPU. There's another chunk of code in Windows that tinkers with a thread's priority, modifying it from the default priority it got when the thread was created. That code is aware of things like a thread owning the window that's in the foreground, or a thread waiting on a synchronization object that got signaled, or the more bizarre adjustment that tries to solve the priority-inversion problem: randomly giving a thread a chance to run even though it wasn't its turn.
Focus on writing sane code first. Starting a hundred threads isn't a very sane thing to do. You are trying to consume resources the machine doesn't actually have; nobody has a machine with a hundred cores. Yet. Powers of two: get a machine with 128 cores first.
I have reproduced and can confirm your results. Additionally, disabling thread priority boost doesn't change the distribution. GetThreadTimes reports that threads with higher counter values took more UserTime and vice versa, while KernelTime seems to have no correlation with the values.
Thread 97: 1081,5928 Ke:0 Us:25116161
Thread 98: 1153,8029 Ke:0 Us:26988173
Thread 99: 704,6996 Ke:0 Us:16848108
Clearly, some threads really get to run more often than others.
I haven't graphed the results, but I suppose what we're seeing is a Normal distribution, which means the results depend on a number of factors, some which are random.
I tried disabling hyper-threading (this kinda smoothed the results), then assigning each thread a single physical processor (by using SetThreadAffinityMask). In the second case, Values were much closer to each other.
SetThreadAffinityMask(Self.Handle, 1 shl (FIndex mod 4));
I can sort of understand how running on a hyper-threaded system can make some threads "unlucky": they are scheduled to compete with other threads on the same physical processor, and because of "soft affinity" to this virtual core they get to run on it again and again, thus scoring lower than others.
But as to why binding each thread to a fixed core helps on a non-hyperthreaded system, I don't know.
There are probably other random things involved, such as the activity on the cores by other processes. Thread can get "unlucky" if some other process' thread associated with the same core suddenly wakes up and starts doing some (relatively) heavy work.
All of this is guessing though.
Windows 7 is designed for user land. When your first thread wants to do work, the OS gives it a time slice. You, the user, just started it after all. By the time the 50th thread in succession (from the same process !) wants to do work, higher priority threads (background processes controlled by Windows 7 itself) step in. This is happening in such a fashion as to make some threads luckier.
You and I don't really want a personal OS that hands out CPU time based on the whims of user land processes. I would be curious to see how 2008 R2 server handled this. You also might play around with the Advanced tab setting: "Choose how to allocate processor resources".
Some good reasoning here, but there are some things to take into consideration.
Windows is trying to do multitasking with software. Your hardware isn't multitasking; it's using power to do what a parallel-processing system would do.
Windows assigns priority in many ways, and it's confusing. Let me explain it this way.
I have a small program that watches my cores and their usage. When Windows loads, you would think that ALL the cores would get used. Nope. Only as Windows loads do the other cores start to get used. Then you would think that, as Windows loads, it would accelerate loading since it has access to the cores. It doesn't accelerate; it doesn't use the cores at FULL speed to load faster.
Even if Windows shoved each program onto one core as it was loading and running, it WAITS for them to finish. If it used ALL the cores to process each program, it would be using software (about 100 times slower than hardware) to assemble the parts at the other end.
Long ago, Intel wanted to change the hardware to parallel processing, and MS said "no", as their software wasn't designed for it. Now they are trying to push serial-based hardware design to the Nth degree. Even after MS bought the NT software, they have forgotten to use much of its design recently.
There need to be some hardware changes. There need to be programming-language changes (MS created the programming language), and the core of Windows needs to be designed again, not merely changed; it needs to go back and start from scratch. Good luck with that.
To tell you how old this idea is... VIVA La' Amiga.

How to program number of your threads in Delphi

I found this on the Dr Dobbs site today at
http://www.ddj.com/hpc-high-performance-computing/220300055?pgno=3
It's a nice suggestion regarding thread implementation.
What is the best way of achieving this with TThread in Delphi, I wonder?
Thanks
Brian
=== From Dr Dobbs ==============
Make multithreading configurable! The number of threads used in a program should always be configurable from 0 (no additional threads at all) to an arbitrary number. This not only allows a customization for optimal performance, but it also proves to be a good debugging tool and sometimes a lifesaver when unknown race conditions occur on client systems. I remember more than one situation where customers were able to overcome fatal bugs by switching off multithreading. This of course does not only apply to multithreaded file I/O.
Consider the following pseudocode:
int CMyThreadManger::AddThread(CThreadObj theTask)
{
if(mUsedThreadCount >= gConfiguration.MaxThreadCount())
return theTask.Execute(); // execute task in main thread
// add task to thread pool and start the thread
...
}
Such a mechanism is not very complicated (though a little bit more work will probably be needed than shown here), but it sometimes is very effective. It also may be used with prebuilt threading libraries such as OpenMP or Intel's Threaded Building Blocks. Considering the measurements shown here, its a good idea to include more than one configurable thread count (for example, one for file I/O and one for core CPU tasks). The default might probably be 0 for file I/O and <number of cores found> for CPU tasks. But all multithreading should be detachable. A more sophisticated approach might even include some code to test multithreaded performance and set the number of threads used automatically, may be even individually for different tasks.
===================
I would create an abstract class TTask. This class is meant to execute the task, via the method Execute:
type
  TTask = class abstract
  protected
    procedure DoExecute; virtual; abstract;
  public
    procedure Execute;
  end;

  TTaskThread = class(TThread)
  private
    FTask: TTask;
  public
    constructor Create(const ATask: TTask);
    // Assigns FTask and enables the thread; free on terminate.
    procedure Execute; override; // Calls FTask.Execute.
  end;
The method Execute checks the number of threads. If the max is not reached, it starts a thread using TTaskThread that calls DoExecute, and as such executes the task in a thread. If the max is reached, DoExecute is called directly.
The answer by Gamecat is good as far as the abstract task class is concerned, but I think calling DoExecute() for a task in the calling thread (as the article itself does too) is a bad idea. I would always queue the tasks to be executed by background threads, unless threading was disabled completely, and here's why.
Consider the following (contrived) case, where you need to execute three independent CPU-bound procedures:
Procedure1_WhichTakes200ms;
Procedure2_WhichTakes400ms;
Procedure3_WhichTakes200ms;
For better utilisation of your dual core system you want to execute them in two threads. You would limit the number of background threads to one, so with the main thread you have as many threads as cores.
Now the first procedure will be executed in a worker thread, and it will finish after 200 milliseconds. The second procedure will start immediately and be executed in the main thread, as the single configured worker thread is already occupied, and it will finish after 400 milliseconds. Then the last procedure will be executed in the worker thread, which has already been sleeping for 200 milliseconds now, and will finish after 200 milliseconds. Total execution time 600 milliseconds, and for 2/3 of that time only one of both threads was actually doing meaningful work.
You could reorder the procedures (tasks), but in real life it's probably impossible to know in advance how long each task will take.
Now consider the common way of employing a thread pool. As per configuration you would limit the number of threads in the pool to 2 (number of cores), use the main thread only to schedule the threads into the pool, and then wait for all tasks to complete. With above sequence of queued tasks thread 1 would take the first task, thread two would take the second task. After 200 milliseconds the first task would complete, and the first worker thread would take the third task from the pool, which is empty afterwards. After 400 milliseconds both the second and the third task would complete, and the main thread would be unblocked. Total time for execution 400 milliseconds, with 100% load on both cores in that time.
At least for CPU-bound threads it's of vital importance to always have work queued for the OS scheduler. Calling DoExecute() in the main thread interferes with that, and shouldn't be done.
I generally have only one class inheriting from TThread, one that takes 'worker items' from a queue or stack, and have them suspend when no more items are available. The main program can then decide how many instances of this thread to instantiate and start. (using this config value).
This 'worker item queue' should also be smart enough to resume suspended threads, or create a new thread when required (and when the limit permits it), whenever a worker item is queued or a thread finishes processing a worker item.
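A minimal sketch of such a worker thread (assuming a hypothetical TWorkItem type, and using TThreadedQueue from System.Generics.Collections, whose blocking PopItem waits until an item is pushed):

```delphi
uses
  System.Classes, System.SyncObjs, System.Generics.Collections;

type
  TWorkItem = reference to procedure; // hypothetical work-item type

  TWorkerThread = class(TThread)
  private
    FQueue: TThreadedQueue<TWorkItem>;
  public
    constructor Create(AQueue: TThreadedQueue<TWorkItem>);
    procedure Execute; override;
  end;

constructor TWorkerThread.Create(AQueue: TThreadedQueue<TWorkItem>);
begin
  inherited Create(False);
  FQueue := AQueue;
end;

procedure TWorkerThread.Execute;
var
  Item: TWorkItem;
begin
  // PopItem blocks until an item arrives or the queue shuts down,
  // so idle workers consume no CPU while waiting.
  while not Terminated do
    if FQueue.PopItem(Item) = wrSignaled then
      Item();
end;
```

The main program creates the shared queue and as many TWorkerThread instances as the configuration value allows.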
My framework allows for a thread pool count for any of the threads in a configuration file, if you wish to have a look (http://www.csinnovations.com/framework_overview.htm).
Since a certain version (I think it was one of the XE versions), Delphi has included a Parallel Programming Library:
https://docwiki.embarcadero.com/RADStudio/Sydney/en/Using_the_Parallel_Programming_Library
It has the TTask class to schedule work, and also several configuration options and the possibility to create your own thread pool(s).
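A sketch of the Dr Dobbs "configurable thread count" idea on top of the PPL (System.Threading); MaxWorkers is a hypothetical configuration value, and a custom TThreadPool is used so the limit doesn't affect the default pool:

```delphi
uses
  System.SysUtils, System.Threading;

procedure RunConfigurable(const Work: TProc; MaxWorkers: Integer);
var
  Pool: TThreadPool;
begin
  if MaxWorkers <= 0 then
  begin
    Work; // threading disabled: execute in the calling thread
    Exit;
  end;
  Pool := TThreadPool.Create;
  try
    // Cap the number of worker threads in this private pool.
    Pool.SetMaxWorkerThreads(MaxWorkers);
    TTask.Run(Work, Pool).Wait;
  finally
    Pool.Free;
  end;
end;
```

In a real application you would queue many tasks into the pool and wait for all of them, rather than waiting on each one; this only illustrates the 0-to-N configurability the article recommends.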
