I have a program which accepts an SQL query as a command-line argument, queries a PostgreSQL database and produces a file formatted in one of several ways (most often it's used to produce CSV files).
However, this program has some serious memory leaks - with one particular query that produces a 12 MB file, the program uses 8 GB of RAM plus several GB of swap space before the operating system kills it. I want to find the cause of this memory leak. I don't know Delphi very well (and judging by the quality of the program, neither did the original author), but I am tasked with finding a quick fix.
The following portion of the doData function outputs a single row of the result set. I'd hazard a guess that the problem is with the "copy" call (creating a string on the heap that is never freed), but I'm sure someone more experienced than myself will be able to confirm this or point me in the right direction.
procedure doData;
var
  s, fldVal: string;
  i, fldLen: integer;
begin
  s := '';
  for i := 0 to ds.Fields.Count-1 do
  begin
    if (ds.Fields[i].DataType = ftDate) or
       (ds.Fields[i].DataType = ftDateTime) then
    begin
      if psql.outDate = 'i' then
        fldLen := 8
      else
        fldLen := 10;
      if ds.Fields[i].IsNull then
        fldVal := ''
      else
        fldVal := formatDate(ds.Fields[i].AsDateTime);
    end
    else
    begin
      fldLen := ds.Fields[i].DisplayWidth;
      fldVal := ds.Fields[i].AsString;
    end;
    if (psql.outType = 'd') or (psql.outType = 's') then
      s := s + trim(fldVal)
    else if psql.outType = 'f' then
    begin
      s := s + fldVal;
      if fldLen - length(fldVal) > 0 then
        s := s + copy(spaces, 1, fldLen - length(fldVal));
      // Is this a memory leak above?
    end;
    if psql.outType = 's' then
    begin
      if i < ds.Fields.Count-1 then
        s := s + psql.outDelimChar;
    end
    else
      s := s + psql.outDelimChar;
  end;
  writeln(psql.outPrefixData + s);
end;
There are no leaks in this code. The Delphi string type is managed by the compiler and requires no explicit memory deallocations from the programmer.
If you wish to find your leak you should include the full debug version of FastMM. This will produce diagnostics reports of any leaks in your code, including stack traces that help identify where the leaked memory was originally allocated.
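For reference, a minimal sketch of how the full debug version is usually wired in; the project and unit names here are placeholders, and it assumes the standard FastMM4 distribution where FullDebugMode is defined in FastMM4Options.inc and FastMM_FullDebugMode.dll sits next to the executable:

program MyExportTool;  // hypothetical project name

uses
  // FastMM4 must be the very first unit in the project's uses clause so it
  // installs itself before any other unit allocates memory.
  FastMM4,
  SysUtils,
  MainUnit in 'MainUnit.pas';  // hypothetical unit containing doData

begin
  // Built with FullDebugMode (and EnableMemoryLeakReporting) defined in
  // FastMM4Options.inc, FastMM writes a leak report on shutdown that
  // includes the call stack of each leaked allocation.
  RunExport;  // hypothetical entry point of the existing program
end.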
I encounter some strange behaviors with threads so I guess there is something I'm doing wrong or something I don't understand.
My application (Delphi Berlin) has two processes: a service and a console app. They communicate via a socket (Indy).
Each process has a thread dedicated to communication.
I use TCriticalSection when I need to read/write variables used by the main thread and by the communication thread.
I also make intensive use of logging. The log can be written (one log file per process) by the main thread and by the communication thread.
So when I want to write a trace to the log file, I use a TCriticalSection variable to prevent the main thread and the communication thread from writing to the log file at the same time:
procedure TApp.trace(logLevel: byte; procName, pi_str: string);
var
  F: textfile;
  LogFileName: String;
  vl_log: Boolean;
  vc_LogHeader: String;
  th, thcurrent: TTh;
begin
  if GetLog() then begin // False if log is deactivated
    for th := Low(TTh) to High(TTh) do begin
      if TThread.CurrentThread.ClassName = ThreadsLog.Name[th] then begin
        thcurrent := th;
        break;
      end;
    end;
    if ThreadsLog.LogLevel[thcurrent] < logLevel then exit;
    LogFileName := gc_tmp + WinProc.Name[WHO_AM_I] + '.log';
    vc_LogHeader := '[' + GetLogTime + ' ' + ThreadsLog.Name[thcurrent] + ' ' + procName + ' ' + IntToStr(logLevel) + ']';
    if Length(vc_LogHeader) < 60 then vc_LogHeader := vc_LogHeader + StringOfChar(' ', 60 - Length(vc_LogHeader));
    LockTrace.Acquire;
    try
      try
        {$IFDEF MACOS}
        AssignFile(F, LogFileName, CP_UTF8);
        {$ELSE}
        AssignFile(F, LogFileName);
        {$ENDIF}
        if FileExists(LogFileName) then Append(F) else Rewrite(F);
        {$IFDEF MACOS}
        Writeln(F, UTF8String(vc_LogHeader + AnsiString(pi_str)));
        {$ELSE}
        Writeln(F, vc_LogHeader + pi_str);
        {$ENDIF}
        CloseFile(F);
      except
        on e: exception do begin
          dbg(LogFileName + ' ' + e.Message);
        end;
      end;
    finally
      LockTrace.Release;
    end;
  end;
end;
function TApp.GetLog(): boolean;
begin
  gl_logLock.Acquire;
  try
    result := gl_log;
  finally
    gl_logLock.Release;
  end;
end;
However, sometimes some lines are not written to the file.
But dbg(LogFileName + ' ' + e.Message) does not seem to execute, because it is supposed to write to another log file and that file stays empty. So no exception appears to be raised.
Is it possible to use TCriticalSection this way?
What I understand about TCriticalSection is that it puts a lock in place, so other threads trying to acquire the same lock have to wait until it is released. Is that right?
I guess I can use either one TCriticalSection variable or several. If I use only one, there will be more cases where the lock is held, so more waiting. If I use one TCriticalSection per shared variable, there will be less contention and therefore better performance. Is that right?
Thanks for any correction or clarification.
There are many problems with your code, not all thread / critical section related.
function TApp.GetLog(): boolean;
begin
  gl_logLock.Acquire;
  try
    result := gl_log;
  finally
    gl_logLock.Release;
  end;
end;
The above lock code is useless; it doesn't provide any protection whatsoever. Reading a boolean variable is already atomic. It's also symptomatic of a common misunderstanding of how to make code thread-safe.
Locks are intended to protect access to data.
The above pattern is often incorrectly used to protect access to an object.
But once the calling code is able to start using the object, you're already outside the lock.
I.e. the underlying data of the object is no longer protected from concurrent access by other threads.
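To illustrate "locks protect data" with a hedged sketch (the field names here are invented): the lock must be held across every read and write of the shared data, not just while fetching a reference to it.

// Hypothetical shared counter guarded by a critical section.
// The whole read-modify-write happens while the lock is held.
procedure TApp.IncrementCounter;
begin
  FCounterLock.Acquire;        // FCounterLock: TCriticalSection (assumed field)
  try
    FCounter := FCounter + 1;  // shared data is only ever touched inside the lock
  finally
    FCounterLock.Release;
  end;
end;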
for th := Low(TTh) to High(TTh) do begin
  if TThread.CurrentThread.ClassName = ThreadsLog.Name[th] then begin
    thcurrent := th;
    break;
  end;
end;
if ThreadsLog.LogLevel[thcurrent] < logLevel then exit;
In the above, if the loop ever ends without the if condition evaluating to True, thcurrent will be uninitialised, leading to undefined behaviour: anything from AV exceptions to things just not behaving as you'd expect.
Quite possibly ThreadsLog.LogLevel[thcurrent] < logLevel could evaluate to True (and Exit) without triggering an AV for some undefined values of thcurrent.
Also note that looping through your threads and doing string comparisons is a pretty inefficient way to check your current thread. It's not clear what you're trying to achieve, but you should be able to figure certain things out simply from the current thread id.
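A hedged sketch of the thread-id approach (the variable names are invented): record the communication thread's ID once when it starts, then identify the caller by comparing IDs instead of class-name strings.

var
  CommThreadId: TThreadID;  // assumed to be assigned once by the communication thread

function CurrentThreadLabel: string;
begin
  if TThread.CurrentThread.ThreadID = CommThreadId then
    Result := 'Comm'
  else if TThread.CurrentThread.ThreadID = MainThreadID then  // MainThreadID comes from System
    Result := 'Main'
  else
    Result := 'Other';
end;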
You say dbg(LogFileName + ' ' + e.Message); is not called. Well, there are many reasons it might not be called. You'll have to figure out which (one or more) apply.
You could Exit early.
GetLog() might return False.
Any exception before the try..except block won't get there.
If you've disabled I/O errors, an exception won't be raised by old-style file operations. You would have to check them manually using IOResult (see the sketch after this list).
And of course dbg might be called, but could itself also fail in some way.
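For the IOResult point, a minimal hedged sketch of what checking old-style I/O looks like when {$I-} is in effect; it reuses F, LogFileName, vc_LogHeader, pi_str and dbg from the trace procedure in the question.

// With I/O checking disabled, Append/Rewrite do not raise exceptions,
// so the outcome has to be inspected by hand via IOResult.
{$I-}
if FileExists(LogFileName) then Append(F) else Rewrite(F);
{$I+}
if IOResult <> 0 then
  dbg('could not open ' + LogFileName)
else
begin
  Writeln(F, vc_LogHeader + pi_str);
  CloseFile(F);
end;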
I'm developing an application with a security device that does some heavy checking in a separate thread. The return code differs depending on the situation; mostly it's just do or die, but there is a specific case where the user has an old license and is given the chance to upgrade it on the fly. If he chooses to do so, the upgrade takes some time, so I create a simple form with just one label and display a simple message with it, all at runtime, using a function that returns a TForm. This is the code for that function:
function TAuth.FrmWait: TForm;
var
  lbl: TLabel;
  frm: TForm;
  regn: HRGN;
begin
  frm := TForm.Create(Application);
  with frm do
  begin
    Parent := F_MainForm;
    ClientWidth := 479;
    ClientHeight := 97;
    Position := poMainFormCenter;
    BorderStyle := bsNone;
    Visible := true;
    Color := clWebFloralWhite;
    regn := CreateRoundRectRgn(0, 0, ClientWidth, ClientHeight, 40, 40);
    SetWindowRgn(Handle, regn, True);
    with Font do
    begin
      Size := 12;
      Color := clNavy;
      Style := [fsBold];
    end;
  end;
  lbl := TLabel.Create(frm);
  with lbl do
  begin
    Parent := frm;
    Width := 400;
    Height := 18;
    Top := (frm.ClientHeight - Height) div 2;
    Left := (frm.ClientWidth - Width) div 2;
    Caption := 'Please Wait...';
    ParentFont := true;
  end;
  result := frm;
  Application.ProcessMessages; //without this the label won't appear
end;
So far so good, but the real problem comes when I actually use FrmWait. The message gets displayed all right, but when I dispose of the form I keep getting an "Index out of bounds" error, which is really odd since I'm not using any lists. The form is used here:
if _signal = 0 then
begin
  frm := FrmWait;
  frm.Show;
end;
{
  processing gets done here
}
if _signal = 0 then
begin
  frm.Close();
  FreeAndNil(frm);
end;
The issue happens almost every time at the FreeAndNil and, to make things more confusing, it doesn't always happen - around 8 out of 10 times or so (sometimes less, sometimes more). I tried looking everywhere; all I managed to find is that some memory overwrite might be happening, but still no way to work around it. Bear in mind that this is all being done in a worker thread, not the main one. Except for this one special case, all other validations work just fine.
I'm totally lost here. Could this be bad design on my end, or could this be a Delphi bug? I'm using Delphi XE2 on Windows 8.1.
Thanks for everything!
Bear in mind that this is all being done at a worker thread, not the main one.
As has been stated so many times, all VCL access must be from the main thread. You break that rule and that's the cause of your problem.
Separate the UI code from the worker code. Put the worker code in a thread. Keep the UI code in the main thread.
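A hedged sketch of that separation (the thread class, the UpgradeLicense call and the Auth instance of TAuth are assumed names): the worker never touches the VCL itself; it asks the main thread to create, show and free the wait form via TThread.Synchronize.

// Hypothetical sketch: all VCL access is marshalled to the main thread.
procedure TLicenseCheckThread.Execute;   // assumed worker thread class
var
  WaitForm: TForm;
begin
  WaitForm := nil;
  TThread.Synchronize(nil,
    procedure
    begin
      WaitForm := Auth.FrmWait;  // runs in the main thread, so VCL access is safe
      WaitForm.Show;
    end);
  try
    UpgradeLicense;              // the long-running work stays in the worker thread
  finally
    TThread.Synchronize(nil,
      procedure
      begin
        WaitForm.Close;
        FreeAndNil(WaitForm);
      end);
  end;
end;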
The goal is to achieve full usage of the available cores when converting floats to strings in a single Delphi application. I think this problem applies to string processing in general, but in my example I am specifically using the FloatToStr method.
What I am doing (I've kept this very simple so there is little ambiguity around the implementation):
Using Delphi XE6
Create thread objects which inherit from TThread, and start them.
In the thread Execute procedure, convert a large number of doubles into strings via the FloatToStr method.
To simplify, these doubles are just the same constant, so there is no
shared or global memory resource required by the threads.
Although multiple cores are used, the total CPU usage always maxes out at the equivalent of a single core. I understand this is an established issue. So I have some specific questions.
The same operation could trivially be done by multiple instances of the application, thereby achieving fuller usage of the available CPU. Is it possible to do this effectively within a single executable?
I.e. assign threads different process IDs at the OS level, or some equivalent division recognised by the OS? Or is this simply not possible in out-of-the-box Delphi?
On scope:
I know there are different memory managers available & other groups have tried changing some of the lower level asm lock usage http://synopse.info/forum/viewtopic.php?id=57
But, I am asking this question in the scope of not doing things at such a low level.
Thanks
Hi J. My code is deliberately very simple:
TTaskThread = class(TThread)
public
  procedure Execute; override;
end;

procedure TTaskThread.Execute;
var
  i: integer;
begin
  Self.FreeOnTerminate := True;
  for i := 0 to 1000000000 do
    FloatToStr(i*1.31234);
end;
procedure TfrmMain.Button1Click(Sender: TObject);
var
  t1, t2, t3: TTaskThread;
begin
  t1 := TTaskThread.Create(True);
  t2 := TTaskThread.Create(True);
  t3 := TTaskThread.Create(True);
  t1.Start;
  t2.Start;
  t3.Start;
end;
This is test code: the CPU usage (via Performance Monitor) maxes out at 25% (I have 4 cores). If the FloatToStr line is swapped for a non-string operation, e.g. Power(i, 2), then Performance Monitor shows the expected 75% usage.
(Yes there are better ways to measure this, but I think this is sufficient for the scope of this question)
I have explored this issue fairly thoroughly. The purpose of the question was to put forth the crux of the issue in a very simple form.
I am asking about limitations when using the FloatToStr method, and whether there is an implementation that permits better usage of the available cores.
Thanks.
I second what everyone else has said in the comments. It is one of the dirty little secrets of Delphi that the FastMM memory manager is not scalable.
Since memory managers can be replaced you can simply replace FastMM with a scalable memory manager. This is a rapidly changing field. New scalable memory managers pop up every few months. The problem is that it is hard to write a correct scalable memory manager. What are you prepared to trust? One thing that can be said in FastMM's favour is that it is robust.
Rather than replacing the memory manager, it is better to remove the need to replace the memory manager. Simply avoid heap allocation. Find a way to do your work without the need for repeated calls to allocate dynamic memory. Even if you had a scalable heap manager, heap allocation would still have a cost.
Once you decide to avoid heap allocation the next decision is what to use instead of FloatToStr. In my experience the Delphi runtime library does not offer much support. For example, I recently discovered that there is no good way to convert an integer to text using a caller supplied buffer. So, you may need to roll your own conversion functions. As a simple first step to prove the point, try calling sprintf from msvcrt.dll. This will provide a proof of concept.
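As a hedged sketch of that proof of concept (the declaration below is an assumption about msvcrt's C-style varargs sprintf, and the buffer size is arbitrary):

// Hypothetical proof of concept: format a double into a caller-supplied
// stack buffer via msvcrt's sprintf, so no heap allocation happens per call.
function sprintf(Buffer: PAnsiChar; Format: PAnsiChar): Integer;
  cdecl; varargs; external 'msvcrt.dll';

procedure TTaskThread.Execute;
var
  i: Integer;
  V: Double;
  Buf: array[0..63] of AnsiChar;  // reused for every conversion
begin
  FreeOnTerminate := True;
  for i := 0 to 1000000000 do
  begin
    V := i*1.31234;
    sprintf(Buf, '%g', V);
  end;
end;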
If you can't change the memory manager (MM), the only thing to do is to avoid using it where it could be a bottleneck.
As for float-to-string conversion (disclaimer: I tested the code below with Delphi XE), instead of
procedure Test1;
var
  i: integer;
  S: string;
begin
  for i := 0 to 10 do begin
    S := FloatToStr(i*1.31234);
    Writeln(S);
  end;
end;
you can use
procedure Test2;
var
  i: integer;
  S: string;
  Value: Extended;
begin
  SetLength(S, 64);
  for i := 0 to 10 do begin
    Value := i*1.31234;
    // zero the whole buffer (64 characters, not 64 bytes, on Unicode Delphi)
    FillChar(PChar(S)^, Length(S) * SizeOf(Char), 0);
    FloatToText(PChar(S), Value, fvExtended, ffGeneral, 15, 0);
    Writeln(S);
  end;
end;
which produces the same result but does not allocate memory inside the loop.
And pay attention to the following:
function FloatToStr(Value: Extended): string; overload;
function FloatToStr(Value: Extended; const FormatSettings: TFormatSettings): string; overload;
The first form of FloatToStr is not thread-safe, because it uses localization information contained in global variables. The second form of FloatToStr, which is thread-safe, refers to localization information contained in the FormatSettings parameter. Before calling the thread-safe form of FloatToStr, you must populate FormatSettings with localization information. To populate FormatSettings with a set of default locale values, call GetLocaleFormatSettings.
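A minimal sketch of the thread-safe form, as a hedged variant of the question's Execute; it assumes the Delphi XE6 used in the question, where TFormatSettings.Create is available (the documentation above mentions GetLocaleFormatSettings, which is the older way to fill the record). This only addresses the thread-safety point, not the allocation bottleneck.

// Hypothetical sketch: each thread fills its own TFormatSettings once and
// passes it to the thread-safe FloatToStr overload.
procedure TTaskThread.Execute;
var
  i: Integer;
  FS: TFormatSettings;
begin
  FreeOnTerminate := True;
  FS := TFormatSettings.Create;   // local copy of the locale settings
  for i := 0 to 1000000000 do
    FloatToStr(i*1.31234, FS);    // no access to the global format variables
end;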
Many thanks for your knowledge and help so far. As per your suggestions, I've attempted to write an equivalent of FloatToStr in a way which avoids heap allocation, with some success. This is by no means a solid, foolproof implementation, just a nice and simple proof of concept which could be extended to achieve a more satisfying solution.
(I should also note I'm using XE6, 64-bit.)
Experiment result/observations:
the CPU usage % was proportional to the number of threads started
(i.e. each thread = 1 core maxed out via performance monitor).
as expected, with more threads started, performance degraded somewhat for each individual one (i.e. time measured to perform task - see code).
times are just rough averages
8 cores 3.3GHz - 1 thread took 4200ms. 6 threads took 5200ms each.
8 cores 2.5GHz - 1 thread took 4800ms. 2=>4800ms, 4=>5000ms, 6=>6300ms.
I did not calculate the overall time for a total multi thread run. Just observed CPU usage % and measured individual thread times.
Personally I find it a little hilarious that this actually works :) Or perhaps I have done something horribly wrong?
Surely there are library units out there which resolve these things?
The code:
unit Main;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls,
  Generics.Collections,
  DateUtils;

type
  TfrmParallel = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

  TTaskThread = class(TThread)
  private
    Fl: TList<double>;
  public
    procedure Add(l: TList<double>);
    procedure Execute; override;
  end;

var
  frmParallel: TfrmParallel;

implementation

{$R *.dfm}

{ TTaskThread }

procedure TTaskThread.Add(l: TList<double>);
begin
  Fl := l;
end;

procedure TTaskThread.Execute;
var
  i, j: integer;
  s, xs: shortstring;
  FR: TFloatRec;
  V: double;
  Precision, D: integer;
  ZeroCount: integer;
  Start, Finish: TDateTime;

  procedure AppendByteToString(var Result: shortstring; const B: Byte);
  const
    A1 = '1';
    A2 = '2';
    A3 = '3';
    A4 = '4';
    A5 = '5';
    A6 = '6';
    A7 = '7';
    A8 = '8';
    A9 = '9';
    A0 = '0';
  begin
    if B = 49 then
      Result := Result + A1
    else if B = 50 then
      Result := Result + A2
    else if B = 51 then
      Result := Result + A3
    else if B = 52 then
      Result := Result + A4
    else if B = 53 then
      Result := Result + A5
    else if B = 54 then
      Result := Result + A6
    else if B = 55 then
      Result := Result + A7
    else if B = 56 then
      Result := Result + A8
    else if B = 57 then
      Result := Result + A9
    else
      Result := Result + A0;
  end;

  procedure AppendDP(var Result: shortstring);
  begin
    Result := Result + '.';
  end;

begin
  Precision := 9;
  D := 1000;
  Self.FreeOnTerminate := True;
  //
  Start := Now;
  for i := 0 to Fl.Count - 1 do
  begin
    V := Fl[i];
    // //orignal way - just for testing
    // xs := shortstring(FloatToStrF(V, TFloatFormat.ffGeneral, Precision, D));
    //1. get float rec
    FloatToDecimal(FR, V, TFloatValue.fvExtended, Precision, D);
    //2. check sign
    if FR.Negative then
      s := '-'
    else
      s := '';
    //2. handle negative exponent
    if FR.Exponent < 1 then
    begin
      AppendByteToString(s, 0);
      AppendDP(s);
      for j := 1 to Abs(FR.Exponent) do
        AppendByteToString(s, 0);
    end;
    //3. count consecutive zeroes
    ZeroCount := 0;
    for j := Precision - 1 downto 0 do
    begin
      if (FR.Digits[j] > 48) and (FR.Digits[j] < 58) then
        Break;
      Inc(ZeroCount);
    end;
    //4. build string
    for j := 0 to Length(FR.Digits) - 1 do
    begin
      if j = Precision then
        Break;
      //cut off where there are only zeroes left up to precision
      if (j + ZeroCount) = Precision then
        Break;
      //insert decimal point - for positive exponent
      if (FR.Exponent > 0) and (j = FR.Exponent) then
        AppendDP(s);
      //append next digit
      AppendByteToString(s, FR.Digits[j]);
    end;
    // //use just to test agreement with FloatToStrF
    // if s <> xs then
    //   frmParallel.Memo1.Lines.Add(string(s + '|' + xs));
  end;
  Fl.Free;
  Finish := Now;
  //
  frmParallel.Memo1.Lines.Add(IntToStr(MillisecondsBetween(Start, Finish)));
  //!YES LINE IS NOT THREAD SAFE!
end;

procedure TfrmParallel.Button1Click(Sender: TObject);
var
  i: integer;
  t: TTaskThread;
  l: TList<double>;
begin
  //pre generating the doubles is not required, is just a more useful test for me
  l := TList<double>.Create;
  for i := 0 to 10000000 do
    l.Add(Now/(-i-1)); //some double generation
  //
  t := TTaskThread.Create(True);
  t.Add(l);
  t.Start;
end;

end.
By default, on thread contention - when one thread cannot acquire access to data locked by another thread - FastMM4 calls the Windows API function Sleep(0), and then, if the lock is still not available, enters a loop calling Sleep(1) after each check of the lock.
Each call to Sleep(0) incurs the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As for Sleep(1): besides the costs associated with Sleep(0), it also delays execution by at least 1 millisecond, ceding control to other threads, and, if there are no threads waiting to be executed by a physical CPU core, puts the core into sleep, effectively reducing CPU usage and power consumption.
That’s why, in your case, CPU use never reached 100% - because of the Sleep(1) issued by FastMM4.
This way of acquiring locks is not optimal.
A better way would have been a spin-lock of about 5000 pause instructions and, if the lock is still busy, a call to the SwitchToThread() API. If pause is not available (on very old processors with no SSE2 support) or the SwitchToThread() API call is not available (on very old Windows versions, prior to Windows 2000), the best solution would be to use EnterCriticalSection / LeaveCriticalSection, which don't have the latency associated with Sleep(1), and which also very effectively cede control of the CPU core to other threads.
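A hedged, simplified sketch of that kind of acquisition loop (the pause instruction itself is omitted; this only illustrates the spin-then-SwitchToThread idea and is not the actual FastMM4-AVX code):

// InterlockedCompareExchange, InterlockedExchange and SwitchToThread
// are declared in Winapi.Windows.
procedure AcquireSpinLock(var Lock: Integer);
var
  Spin: Integer;
begin
  while InterlockedCompareExchange(Lock, 1, 0) <> 0 do
  begin
    for Spin := 1 to 5000 do
      if InterlockedCompareExchange(Lock, 1, 0) = 0 then
        Exit;            // acquired the lock while spinning
    SwitchToThread;      // cede the core without the 1 ms penalty of Sleep(1)
  end;
end;

procedure ReleaseSpinLock(var Lock: Integer);
begin
  InterlockedExchange(Lock, 0);
end;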
I have modified FastMM4 to use a new approach to waiting for a lock: critical sections instead of Sleep(). With these options enabled, Sleep() is never used; EnterCriticalSection / LeaveCriticalSection are used instead. Testing has shown that using critical sections instead of Sleep (which was the FastMM4 default) provides a significant gain in situations where the number of threads working with the memory manager is equal to or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options that replace the original FastMM4 approach of Sleep(InitialSleepTime) followed by Sleep(AdditionalSleepTime) (i.e. Sleep(0) and Sleep(1)) with EnterCriticalSection / LeaveCriticalSection, to save the CPU cycles wasted by Sleep(0) and to reduce the latency that was previously penalised by at least 1 millisecond on every Sleep(1); critical sections are much more CPU-friendly and have definitely lower latency than Sleep(1).
When these options are enabled, FastMM4-AVX checks:
whether the CPU supports SSE2 and thus the "pause" instruction, and
whether the operating system has the SwitchToThread() API call;
if both are available, it uses a "pause" spin-loop for 5000 iterations and then SwitchToThread() instead of critical sections. If the CPU doesn't have the "pause" instruction or Windows doesn't have the SwitchToThread() API function, it uses EnterCriticalSection / LeaveCriticalSection instead.
I have made available the fork called FastMM4-AVX at https://github.com/maximmasiutin/FastMM4
Here is a comparison of the original FastMM4 version 4.992 (default options, compiled for Win64 by Delphi 10.2 Tokyo, Release with optimization) and the current FastMM4-AVX branch. Under some scenarios, the FastMM4-AVX branch is more than twice as fast as the original FastMM4. The tests were run on two different computers: one with a Xeon E6-2543v2 with 2 CPU sockets, each with 6 physical cores (12 logical threads), with only 5 physical cores per socket enabled for the test application; the other test was done on an i7-7700K CPU.
I used the "Multi-threaded allocate, use and free" and "NexusDB" test cases from the FastCode Challenge Memory Manager test suite, modified to run under 64-bit.
                      Xeon E6-2543v2 2*CPU          i7-7700K CPU
                      (allocated 20 logical         (allocated 8 logical
                      threads, 10 physical          threads, 4 physical
                      cores, NUMA)                  cores)

                      Orig.   AVX-br.  Ratio        Orig.  AVX-br.  Ratio
                      ------  -------  ------       -----  -------  ------
02-threads realloc     96552    59951  62.09%       65213    49471  75.86%
04-threads realloc     97998    39494  40.30%       64402    47714  74.09%
08-threads realloc     98325    33743  34.32%       64796    58754  90.68%
16-threads realloc    116708    45855  39.29%       71457    60173  84.21%
16-threads realloc    116273    45161  38.84%       70722    60293  85.25%
31-threads realloc    122528    53616  43.76%       70939    62962  88.76%
64-threads realloc    137661    54330  39.47%       73696    64824  87.96%
NexusDB 02 threads    122846    90380  73.72%       79479    66153  83.23%
NexusDB 04 threads    122131    53103  43.77%       69183    43001  62.16%
NexusDB 08 threads    124419    40914  32.88%       64977    33609  51.72%
NexusDB 12 threads    181239    55818  30.80%       83983    44658  53.18%
NexusDB 16 threads    135211    62044  43.61%       59917    32463  54.18%
NexusDB 31 threads    134815    48132  33.46%       54686    31184  57.02%
NexusDB 64 threads    187094    57672  30.25%       63089    41955  66.50%
Your code that calls FloatToStr is OK as a test, since it allocates a result string using the memory manager, then reallocates it, and so on. An even better idea would have been to explicitly deallocate it, for example:
procedure TTaskThread.Execute;
var
  i: integer;
  s: string;
begin
  for i := 0 to 1000000000 do
  begin
    s := FloatToStr(i*1.31234);
    Finalize(s);
  end;
end;
You can find better tests of the memory manager in the FastCode challenge test suite at https://github.com/maximmasiutin/FastCodeBenchmark
Also, please note that the reference counters in Delphi strings use locking operations, which are inherently slow. For example, on an Intel 2400 MHz processor with Tiger Lake microarchitecture (released in October 2020), LOCK ADD takes about 18 CPU cycles (7.5 ns), while a non-locked simple ADD takes about 0.75 CPU cycles (0.3 ns). If your code ensures that strings are not assigned and modified from different threads, then you may not need this locking. One of the approaches to ensure that a string with multiple references is not manipulated from different threads is to call UniqueString() before such use. Therefore, to improve speed, you may modify System.pas to remove the LOCK prefix from the assembly instructions that operate on the string reference counters. For example, instead of
LOCK INC [EDX-skew].StrRec.refCnt
there will be
INC [EDX-skew].StrRec.refCnt
However, compiling and using your own, custom version of System.pas may not be an easy task. You can find more information about reference counter locking in Delphi strings in a separate answer.
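For the UniqueString() approach mentioned above, a hedged sketch of the idea (the surrounding names, including StartWorker, are invented):

// Hypothetical sketch: make sure the string has a reference count of 1 before
// it is handed to a worker thread, so no other thread shares its data.
procedure QueueWork(const Text: string);
var
  OwnText: string;
begin
  OwnText := Text;
  UniqueString(OwnText);   // forces a private copy with refcount = 1
  StartWorker(OwnText);    // hypothetical: the worker thread is now the sole user
end;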
I am working on an implementation of the Differential Evolution optimization algorithm, and want to speed up the calculation time by calculating population members in parallel.
I am using the OmniThread library, and have successfully parallelized my loop, only to find that it runs SLOWER than the serial implementation.
I have reduced the code to its essence to test the parallelization, and the reduced version exhibits the same problem: the parallel version is slower than the serial one.
The key is that I pass multiple dynamic arrays to which output should be written for each member of the population. Each array has one of the dimensions dedicated to the population member, so for each population member a different set of array indices is accessed. This also means that in the parallel implementation no 2 threads will write to the same array element.
Below is the code I used to test (the actual Differential Evolution code has a DoWork procedure with even more const parameters and var arrays):
unit Unit1;

interface

type
  TGoalFunction = reference to function(const X, B: array of extended): extended;

  TArrayExtended1D = array of extended;
  TArrayExtended2D = array of TArrayExtended1D;

  TClassToTest = class abstract
  private
    class procedure DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex, AIndex2: integer);
  public
    class procedure RunSerial;
    class procedure RunParallel;
  end;

function HyperSphere(const X, B: array of extended): extended;

const
  DIMENSION1 = 5000;
  DIMENSION2 = 5000;
  LOOPS = 10;

implementation

uses
  OtlParallel;

function HyperSphere(const X, B: array of extended): extended;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to Length(X) - 1 do
    Result := Result + X[I]*X[I];
end;

{ TClassToTest }

class procedure TClassToTest.DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex, AIndex2: integer);
var
  I: Integer;
begin
  AOutputArray1[AIndex] := AGoalFunction(AInputArray[AIndex], []);
  for I := 0 to Length(AOutputArray2[AIndex]) - 1 do
    AOutputArray2[AIndex, I] := Random*AIndex2;
end;

class procedure TClassToTest.RunParallel;
var
  LGoalFunction: TGoalFunction;
  LInputArray: TArrayExtended2D;
  LOutputArray1: TArrayExtended1D;
  LOutputArray2: TArrayExtended2D;
  I, J, K: Integer;
begin
  SetLength(LInputArray, DIMENSION1, DIMENSION2);
  for I := 0 to DIMENSION1 - 1 do
  begin
    for J := 0 to DIMENSION2 - 1 do
      LInputArray[I, J] := Random;
  end;
  SetLength(LOutputArray1, DIMENSION1);
  SetLength(LOutputArray2, DIMENSION1, DIMENSION2);
  LGoalFunction := HyperSphere;
  for I := 0 to LOOPS - 1 do
  begin
    Parallel.ForEach(0, DIMENSION1 - 1).Execute(
      procedure (const value: integer)
      begin
        DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, value, I);
      end
    );
    for J := 0 to DIMENSION1 - 1 do
    begin
      for K := 0 to DIMENSION2 - 1 do
        LInputArray[J, K] := LOutputArray2[J, K];
    end;
  end;
end;

class procedure TClassToTest.RunSerial;
var
  LGoalFunction: TGoalFunction;
  LInputArray: TArrayExtended2D;
  LOutputArray1: TArrayExtended1D;
  LOutputArray2: TArrayExtended2D;
  I, J, K: Integer;
begin
  SetLength(LInputArray, DIMENSION1, DIMENSION2);
  for I := 0 to DIMENSION1 - 1 do
  begin
    for J := 0 to DIMENSION2 - 1 do
      LInputArray[I, J] := Random;
  end;
  SetLength(LOutputArray1, DIMENSION1);
  SetLength(LOutputArray2, DIMENSION1, DIMENSION2);
  LGoalFunction := HyperSphere;
  for I := 0 to LOOPS - 1 do
  begin
    for J := 0 to DIMENSION1 - 1 do
    begin
      DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, J, I);
    end;
    for J := 0 to DIMENSION1 - 1 do
    begin
      for K := 0 to DIMENSION2 - 1 do
        LInputArray[J, K] := LOutputArray2[J, K];
    end;
  end;
end;

end.
end.
I was expecting a speedup of around x6 on my 8-core processor, but was faced with a slight slowdown. What should I change to get the speedup from running the DoWork procedure in parallel?
Note that I'd prefer to keep the actual work in the DoWork procedure, since I have to be able to call the same algorithm with and without parallelization (boolean flag) while keeping the body of the code shared for easy maintenance (see the sketch below).
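For reference, a hedged sketch of that shape (setup and the copy-back loop are elided with comments; the LoopIndex local avoids capturing the for-loop control variable in the anonymous method):

// Hypothetical sketch: one body, parallelism toggled by a flag.
class procedure TClassToTest.Run(const AParallel: Boolean);
var
  LGoalFunction: TGoalFunction;
  LInputArray: TArrayExtended2D;
  LOutputArray1: TArrayExtended1D;
  LOutputArray2: TArrayExtended2D;
  I, J, LoopIndex: Integer;
begin
  // ... same setup as in RunSerial / RunParallel ...
  for I := 0 to LOOPS - 1 do
  begin
    LoopIndex := I;
    if AParallel then
      Parallel.ForEach(0, DIMENSION1 - 1).Execute(
        procedure (const value: integer)
        begin
          DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, value, LoopIndex);
        end)
    else
      for J := 0 to DIMENSION1 - 1 do
        DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, J, LoopIndex);
    // ... copy LOutputArray2 back into LInputArray as before ...
  end;
end;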
This is due to the lack of thread safety of Random, the implementation of which is:
// global var
var
  RandSeed: Longint = 0;    { Base for random number generator }

function Random: Extended;
const
  two2neg32: double = ((1.0/$10000) / $10000); // 2^-32
var
  Temp: Longint;
  F: Extended;
begin
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;
Because RandSeed is a global variable, which is modified by a call to Random, the threads end up having contended writes to RandSeed. And those contended writes cause your performance problem. They effectively serialize your parallel code. Severely enough to make it slower than the true serial code.
Add the code below to the top of the implementation section of your unit and you'll see the difference:
threadvar
  RandSeed: Longint;

function Random: Double;
const
  two2neg32: double = ((1.0/$10000) / $10000); // 2^-32
var
  Temp: Longint;
  F: Double;
begin
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;
With that change to avoid shared, contended writes, you'll find that the parallel version is faster, as expected. You don't get linear scaling with processor count; my guess is that this is because your pattern of memory access is sub-optimal in the parallel version of the code.
I'm guessing that you are only using Random as a means to generate some data. But if you do need an RNG, you'll want to arrange for each task to use its own private instance of an RNG.
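A hedged sketch of what such a private per-task RNG could look like (this just wraps the same LCG as System.Random in a record so that each task owns its state; the type name is invented):

type
  // Hypothetical per-task RNG: the same LCG as System.Random, but with
  // private state, so there are no contended writes to a shared seed.
  TTaskRandom = record
    Seed: Longint;
    function NextDouble: Double;
  end;

function TTaskRandom.NextDouble: Double;
const
  two2neg32: Double = ((1.0/$10000) / $10000); // 2^-32
begin
  Seed := Seed * $08088405 + 1;
  Result := Int64(Cardinal(Seed)) * two2neg32;
end;

Each task would then create its own TTaskRandom (seeded differently, e.g. from the task index) and call NextDouble inside DoWork.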
You can also speed up your code a little using Sqr(X) rather than X*X, and also by switching to Double instead of Extended.
Some time ago I was experiencing exactly the same issue. It turned out that the bottleneck was that OTL's Parallel.ForEach, when called with a range, creates a hidden enumerator, which becomes the bottleneck in cases where each task is very small and the loop is called often.
A more performant solution looked something like this:
Parallel.ForEach(0, MAXCORES)
  .NumTasks(MAXCORES)
  .Execute(
    procedure (const p: Integer)
    var
      chunkSize: Integer;
      myStart, myEnd: Integer;
      i: Integer;
    begin
      chunkSize := DIMENSION div MAXCORES;
      myStart := p * chunkSize;
      myEnd := min(myStart + chunkSize - 1, DIMENSION - 1);
      for i := myStart to myEnd do
        DoSomething(i);
    end);
This code scaled up quite linearly, regardless of the load within the DoSomething call.
I've tried running this (with the Random fix and using Doubles) on an i7 (8 hyper-threads) and get times of 1650 ms for parallel and 5240 ms for serial. Given the code content, I don't find this to be a particularly unexpected scale-up. The code as it stands will have close to 100% successful pipeline prediction - all branches predicted, function-call returns cached, even cache prefetch working well. On a typical modern PC this means that the code is probably going to be memory-bandwidth limited, in which case scale-up is going to depend a great deal on your memory performance rather than on how many cores you have.
The only other issue is potential contention for FPU resources which will be highly dependent on your internal processor architecture.
I suspect that if the workload was more complex a greater scale up would be seen between serial and parallel as the serial version will be losing time to code triggered pipeline breaks whilst the parallel version will remain memory limited. I've done a fair bit of high performance computing work in Delphi and well optimised algorithms doing simple calculations can become totally memory bound with multi-threaded performance at scale ups of as little as 2 on a good 8 core machine due to memory bandwidth limits. This sort of issue can be particularly well illustrated if you have over-clocking capability as performance yield from over-clocking the CPU gives a very good indication of the level of memory waits since everything else speeds up proportionally to the over-clocking.
If you want to get into the details of processor architecture and how they impact what you are doing then http://www.agner.org/optimize/ is a good place to learn how much there is to learn.
I have a legacy Delphi 5 application, and there's a part in which a string value is being assigned to an OleVariant variable. Something like this:
var
  X: OleVariant;
  S: string;
begin
  S := 'This string should contain 200 characters as per design';
  X := S;
end;
If the length of “S” is greater than 128, then the value of “X” gets truncated and it only holds a maximum of 128 characters.
Is there a way to overcome this?
I believe there is a way, because if I create my own demo application from scratch (in the same PC, with the same Delphi 5), it allows me to pass longer string values and no truncating is done.
Maybe it is something about the project settings or compiler directives. I have played around with this idea, but I have no workaround yet.
Any help is appreciated. Thanks.
Demo:
procedure TForm1.Button1Click(Sender: TObject);
var
  X: OleVariant;
  S: string;
begin
  //in the Edit I pass a string of 240 chars, let's say
  S := Edit1.Text;
  X := S;
  ShowMessage(IntToStr(Length(X)) + ' : ' + IntToStr(Length(S)));
  //this showmessage shows "128 : 240"
end;
Try these OleVariantToString and StringToOleVariant functions from http://www.foxbase.ru/delphi/vzaimnye-preobrazovaniya-olevariant-i-string.htm
They work perfectly for me.
uses Classes, Variants;

function OleVariantToString(const Value: OleVariant): string;
var
  ss: TStringStream;
  Size: integer;
  Data: PByteArray;
begin
  Result := '';
  if Length(Value) = 0 then Exit;
  ss := TStringStream.Create;
  try
    Size := VarArrayHighBound(Value, 1) - VarArrayLowBound(Value, 1) + 1;
    Data := VarArrayLock(Value);
    try
      ss.Position := 0;
      ss.WriteBuffer(Data^, Size);
      ss.Position := 0;
      Result := ss.DataString;
    finally
      VarArrayUnlock(Value);
    end;
  finally
    ss.Free;
  end;
end;
function StringToOleVariant(const Value: string): OleVariant;
var
  Data: PByteArray;
  ss: TStringStream;
begin
  Result := null;
  if Value = '' then Exit;
  ss := TStringStream.Create(Value);
  try
    Result := VarArrayCreate([0, ss.Size - 1], varByte);
    Data := VarArrayLock(Result);
    try
      ss.Position := 0;
      ss.ReadBuffer(Data^, ss.Size);
    finally
      VarArrayUnlock(Result);
    end;
  finally
    ss.Free;
  end;
end;
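For illustration, a hedged usage sketch based on the demo from the question (note that the variant produced this way is a varByte array, not an OLE string, so the receiving side has to convert it back with OleVariantToString):

// Hypothetical usage of the helpers above in the Button1Click demo
procedure TForm1.Button1Click(Sender: TObject);
var
  X: OleVariant;
  S: string;
begin
  X := StringToOleVariant(Edit1.Text);   // stored as a varByte array
  S := OleVariantToString(X);            // converted back to a string
  ShowMessage(IntToStr(Length(S)) + ' : ' + IntToStr(Length(Edit1.Text)));
end;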
One explanation is that OleVariant holds the entire string but that you are looking at the debugger tooltip. In older Delphi versions the debugger tooltip truncates at 128 characters for strings held in a variant. Note that the debugger tooltip for a plain string does not truncate at this length. Try showing the variant in a dialog box and you will see that the entire string is present.
I checked this out on Delphi 6 and there was no truncation with your code (other than the debugger tooltip). Andreas did likewise on Delphi 4 and Rodrigo did so with Delphi 5. I cannot imagine that it could really be the case that strings in a Delphi 5 OleVariant are truncated at 128 characters.
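A hedged way to verify this at runtime rather than in the debugger, reusing the variables from the demo above:

// Hypothetical check: inspect the variant's content directly instead of
// trusting the debugger tooltip.
ShowMessage(X);                               // should display the full text
ShowMessage(IntToStr(Length(VarToStr(X))));   // should report the full length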
If you really are seeing what you report, then I can think of the following explanations:
Your code is erroneously truncating the string, but you have not yet found the code that does this. Only you can debug that.
You have a local bug private to your Delphi installation. Are you by any chance compiling your own RTL?
I made this work. Summary: instead of filling an OleVariant with a string, I filled a Variant and then typecast that Variant to OleVariant. Take a look at the code below to get the idea.
procedure TForm1.Button1Click(Sender: TObject);
var
  //X: OleVariant;
  X: Variant;
  S: string;
begin
  //Let's say in the Edit1 I pass a string of 240 chars
  S := Edit1.Text;
  X := S;
  //ShowMessage(IntToStr(Length(X)) + ' : ' + IntToStr(Length(S)));
  ShowMessage(IntToStr(Length(OleVariant(X))) + ' : ' + IntToStr(Length(S)));
  //This ShowMessage shows "128 : 240"
end;
Honestly, I don’t know for sure why this makes a difference, but it does. It works ok now.
Thanks a lot for your help folks!