Invoke progressively getting slower - multithreading

I've been investigating performance issues in my app, and it boils down to the time taken to call Invoke progressively getting longer. I am using System.Diagnostics.Stopwatch to time the Invoke call itself, and while it starts off at 20ms, after a few hundred calls it is around 4000ms. Logging shows the time steadily increasing (at first by ~2ms per call, then by ~100ms and more). I have three Invoke calls, and all exhibit the same behaviour.
I am loading medical images, and I need to keep my UI responsive while doing so, hence the use of a BackgroundWorker, where I can load and process images; but once loaded they need to be added to the main UI for the user to see.
The problem didn't present itself until I tried to load a study of over 800 images. Previously my test sets have been ~100 images, ranging in total size from 400MB to 16GB. The problem set is only 2GB in size and takes close to 10 minutes to approach 50%, and the 16GB set loads in ~30s total, thus ruling out total image size as the issue. For reference my development machine has 32GB RAM. I have ensured that it is not the contents of the invoked method by commenting the entire thing out.
What I want to understand is how it is possible for the time taken by Invoke to progressively increase. Is this actually a thing? My call stacks are not getting deeper, the number of threads is consistent, so what resource is being consumed to cause this? What am I missing?
public void UpdateThumbnailInfo(Thumbnail thumb, ThumbnailInfo info)
{
    if (InvokeRequired)
    {
        var sw = new Stopwatch();
        sw.Start();
        Invoke((Action<Thumbnail, ThumbnailInfo>) UpdateThumbnailInfo, thumb, info);
        Log.Debug("Update Thumbnail Info Timer: {Time} ms - {File}", (int) sw.ElapsedMilliseconds, info.Filename);
    }
    else
    {
        // Do stuff here
    }
}

It looks like you are calling UpdateThumbnailInfo from a different thread. If so, then this is the expected behavior: you are queuing hundreds of tasks on the UI thread. For every loaded image the UI needs to do a lot of work, so as the number of images increases, the overall operation slows down.
A few things that you can do:
* Use BeginInvoke in place of Invoke. As your function returns void, you will not need EndInvoke (a sketch of this is shown below).
* Use SuspendLayout and ResumeLayout to prevent the UI from updating incrementally, and instead update everything once all images are loaded.
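For illustration, here is a minimal, untested sketch of the BeginInvoke variant (same signature as the method above, with the timing code removed):
// Queue the update on the UI thread without blocking the worker thread.
// BeginInvoke returns immediately, so the background loader is no longer
// serialized behind everything already queued on the UI message loop.
public void UpdateThumbnailInfo(Thumbnail thumb, ThumbnailInfo info)
{
    if (InvokeRequired)
    {
        BeginInvoke((Action<Thumbnail, ThumbnailInfo>) UpdateThumbnailInfo, thumb, info);
        return;
    }
    // Do stuff here (runs on the UI thread)
}
Note that BeginInvoke only stops the worker from blocking; if the UI thread still cannot keep up, batching the updates (for example between SuspendLayout and ResumeLayout) is what actually reduces the work queued on it.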

Related

Performance issue while using Parallel.ForEach() with MaxDegreeOfParallelism set to ProcessorCount

I wanted to process records from a database concurrently and in minimum time, so I thought of using a Parallel.ForEach() loop to process the records, with MaxDegreeOfParallelism set to Environment.ProcessorCount.
ParallelOptions po = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};
Parallel.ForEach(listUsers, po, (user) =>
{
    // Parallel processing
    ProcessEachUser(user);
});
But to my surprise, the CPU utilization was not even close to 20%. When I dug into the issue and read the MSDN article on MaxDegreeOfParallelism (http://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism(v=vs.110).aspx), I tried setting it to the specific value -1. As the article says, this value removes the limit on the number of concurrently running operations, and the performance of my program improved to a great extent.
But that still did not meet my requirement for the maximum time allowed to process all the records in the database. So I analyzed further and found that the thread pool has two settings: MinThreads and MaxThreads. By default the values of MinThreads and MaxThreads are 10 and 1000 respectively; at startup only 10 threads are created, and this number keeps increasing up to a maximum of 1000 with every new user, unless a previous thread has finished its execution.
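As an aside, the pool limits the poster is describing can be inspected directly (a small sketch; the actual defaults vary by runtime version and machine):
// Query the current thread-pool limits for worker and I/O completion threads.
int minWorker, minIo, maxWorker, maxIo;
System.Threading.ThreadPool.GetMinThreads(out minWorker, out minIo);
System.Threading.ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
Console.WriteLine("Min worker threads: {0}, max worker threads: {1}", minWorker, maxWorker);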
So I set the initial value of MinThreads to 900 in place of 10 using
System.Threading.ThreadPool.SetMinThreads(100, 100);
so that a minimum of 900 threads would be created right from the start, thinking it would improve the performance significantly. This did create 900 threads, but it also greatly increased the number of failures when processing each user, so I did not achieve much with this approach. I then changed the value of MinThreads to just 100 and found that the performance was much better.
But I wanted to improve it further, as my time constraint was still not met: it was still exceeding the time limit to process all the records. You may think I was already using all the best possible options to get maximum performance from parallel processing; I was thinking the same.
But to meet the time limit I decided to take a shot in the dark. I created two different executable files (Slaves) in place of only one and assigned each of them half of the users from the DB. Both executables did the same thing and ran concurrently, and I created another Master program to start the two Slaves at the same time.
To my surprise, it reduced the time taken to process all the records nearly to the half.
Now my question is simply this: I do not understand why the Master/Slave arrangement gives better performance than a single EXE, when the logic is the same in both the Slaves and the original EXE. I would highly appreciate it if someone could explain this in detail.
But to my surprise, the CPU utilization was not even close to 20%.
…
It uses HTTP requests to some Web APIs hosted on other networks.
This means that CPU utilization is entirely the wrong thing to look at. When using the network, it's your network connection that's going to be the limiting factor, or possibly some network-related limit, certainly not CPU.
Now I created two different executable files … To my surprise, it reduced the time taken to process all the records nearly to the half.
This points to an artificial, per-process limit, most likely ServicePointManager.DefaultConnectionLimit. Try setting it to a larger value than the default at the start of your program and see if it helps.
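For example (the value 100 is purely illustrative; the right number depends on the remote API and your network):
// Raise the per-host outbound connection limit before any HTTP requests are made.
// The default for a non-ASP.NET .NET Framework application is only 2 connections per host.
System.Net.ServicePointManager.DefaultConnectionLimit = 100;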

Limiting work in progress of parallel operations of a streamed resource

I've found myself recently using the SemaphoreSlim class to limit the work in progress of a parallelisable operation on a (large) streamed resource:
// The below code is an example of the structure of the code; there are some
// omissions around handling of tasks that do not run to completion that should be in production code
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
foreach (var result in StreamResults())
{
    semaphore.Wait();
    var task = DoWorkAsync(result).ContinueWith(t => semaphore.Release());
    ...
}
This is to avoid bringing too many results into memory and the program being unable to cope (generally evidenced via an OutOfMemoryException). Though the code works and is reasonably performant, it still feels ungainly. Notably the someMagicNumber multiplier, which although tuned via profiling, may not be as optimal as it could be and isn't resilient to changes to the implementation of DoWorkAsync.
In the same way that thread pooling can overcome the obstacle of scheduling many things for execution, I would like something that can overcome the obstacle of scheduling many things to be loaded into memory based on the resources that are available.
Since it is deterministically impossible to decide whether an OutOfMemoryException will occur, I appreciate that what I'm looking for may only be achievable via statistical means or even not at all, but I hope that I'm missing something.
Here I'd say that you're probably overthinking this problem. The consequences of overshooting are rather high (the program crashes). The consequences of being too low are that the program might be slowed down. As long as you still have some buffer beyond a minimum value, further increases to the buffer will generally have little to no effect, unless the processing time of that task in the pipe is extraordinarily volatile.
If your buffer is constantly filling up, it generally means that the task before it in the pipe executes quite a bit quicker than the task that follows it, so even with a fairly small buffer the following task is likely to always have some work. The buffer size needed to get 90% of the benefits of a buffer is usually quite small (a few dozen items maybe), whereas the size needed to get an OOM error is something like 6+ orders of magnitude higher. As long as you're somewhere in between those two numbers (and that's a pretty big range to land in) you'll be just fine.
Just run your static tests, pick a static number, maybe add a few percent extra for "just in case" and you should be good. At most, I'd move some of the magic numbers to a config file so that they can be altered without a recompile in the event that the input data or the machine specs change radically.
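A minimal sketch of that last suggestion, assuming an appSettings entry (the key name "WorkBufferMultiplier" is made up for illustration, and it needs a reference to System.Configuration):
// Read the buffer multiplier from App.config, falling back to a compiled-in default.
int multiplier;
if (!int.TryParse(ConfigurationManager.AppSettings["WorkBufferMultiplier"], out multiplier))
{
    multiplier = 4; // fallback if the setting is missing or malformed
}
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * multiplier);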

Need thoughts on profiling of multi-threading in C on Linux

My application scenario is like this: I want to evaluate the performance gain one can achieve on a quad-core machine for processing the same amount of data. I have the following two configurations:
i) 1-process: a program without any threading that processes data from 1M to 1G, while the system is assumed to run only a single core of its 4 cores.
ii) 4-thread process: a program with 4 threads (all threads performing the same operation), each processing 25% of the input data.
In my program for creating the 4 threads, I used pthread's default options (i.e., without any specific pthread_attr_t). I believe the performance gain of the 4-thread configuration compared to the 1-process configuration should be close to 400% (or somewhere between 350% and 400%).
I profiled the time spent in thread creation like this:
timer_start(&threadCreationTimer);
pthread_create( &thread0, NULL, fun0, NULL );
pthread_create( &thread1, NULL, fun1, NULL );
pthread_create( &thread2, NULL, fun2, NULL );
pthread_create( &thread3, NULL, fun3, NULL );
threadCreationTime = timer_stop(&threadCreationTimer);
pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);
Since increasing the size of the input data may also increase the memory requirement of each thread, loading all data in advance is definitely not a workable option. Therefore, in order not to increase the memory requirement of each thread, each thread reads data in small chunks, processes them, reads the next chunk, processes it, and so on. Hence, the structure of the code of the functions run by my threads is like this:
timer_start(&threadTimer[i]);
while (!dataFinished[i])
{
    threadTime[i] += timer_stop(&threadTimer[i]);
    data_source();
    timer_start(&threadTimer[i]);
    process();
}
threadTime[i] += timer_stop(&threadTimer[i]);
The variable dataFinished[i] is set to true by process() when it has received and processed all the data it needs. process() knows when to do that :-)
In the main function, I am calculating the time taken by 4-threaded configuration as below:
execTime4Thread = max(threadTime[0], threadTime[1], threadTime[2], threadTime[3]) + threadCreationTime.
And performance gain is calculated by simply
gain = execTime1process / execTime4Thread * 100
Issue:
On small data sizes, around 1M to 4M, the performance gain is generally good (between 350% and 400%). However, the performance gain decreases roughly exponentially as the input size increases. It keeps decreasing until a data size of about 50M or so, and then stabilizes at around 200%. Once it has reached that point, it remains almost stable even for 1GB of data.
My question is: can anybody suggest the main reason for this behaviour (i.e., the performance drop at the start that later stabilizes)?
And any suggestions for how to fix it?
For your information, I also investigated the behaviour of threadCreationTime and threadTime for each thread to see what is happening. For 1M of data the values of these variables are small, but with increasing data size both variables grow sharply (even though threadCreationTime should remain almost the same regardless of data size, and threadTime should increase at a rate corresponding to the data being processed). After increasing until about 50M or so, threadCreationTime becomes stable, and threadTime (just as the performance drop stabilizes) keeps increasing at a constant rate corresponding to the increase in the data to be processed (which is understandable).
Do you think increasing the stack size of each thread, adjusting process priority, or using custom values for other scheduler parameters (via pthread_attr_init) could help?
PS: The results were obtained while running the programs in Linux's fail-safe mode as root (i.e., a minimal OS running without GUI and networking).
Since increasing the size of the input data may also increase the memory requirement of each thread, loading all data in advance is definitely not a workable option. Therefore, in order not to increase the memory requirement of each thread, each thread reads data in small chunks, processes them, reads the next chunk, processes it, and so on.
Just this, alone, can cause a drastic speed decrease.
If there is sufficient memory, reading one large chunk of input data will always be faster than reading data in small chunks, especially from each thread. Any I/O benefit from caching disappears when you break the reads down into small pieces. Even allocating one big chunk of memory is much cheaper than allocating small chunks many, many times.
As a sanity check, you can run htop to ensure that at least all your cores are being topped out during the run. If not, your bottleneck could be outside of your multi-threading code.
Within the threading:
* Context switches due to many threads can cause sub-optimal speedup.
* As mentioned by others, a cold cache due to not reading memory contiguously can cause slowdowns.
But re-reading your OP, I suspect the slowdown has something to do with your data input/memory allocation. Where exactly are you reading your data from? Some kind of socket? Are you sure you need to allocate memory more than once in your thread?
Some algorithm in your worker threads is likely to be suboptimal/expensive.
Are your threads starting on creation? If that is the case, then the following will happen: while your parent thread is still creating threads, the threads already created start to run. By the time you hit timer_stop(&threadCreationTimer), all four have already been running for some time, so threadCreationTime overlaps threadTime[i].
As it is now, you don't know what you are measuring. This won't solve your problem, because you obviously do have a problem since threadTime does not increase linearly, but at least you won't be adding overlapping times.
To get more info you can use the perf tool if it is available on your distro.
For example:
perf stat -e cache-misses <your_prog>
and see what happens with a two-thread version, a three-thread version, etc.

How to solve memory leak caused by System.Diagnostics.PerformanceCounter

Summary
I have written a process monitor command-line application that takes as parameters:
The process name or process ID
A CPU Threshold percent.
What the program does is watch all processes with the given name or PID, and if their CPU usage goes over the threshold percentage, it kills them.
I have two classes:
ProcessMonitor and ProcessMonitorList
The former wraps around System.Diagnostics.PerformanceCounter.
The latter is an IEnumerable that allows a list-like structure of the former.
The problem
The program itself works fine, however if I watch the Memory Usage on Task Manager, it grows in increments of about 20kB per second. Note: the program polls the CPU counter through PerformanceCounter every second.
This program needs to run on a heavily used server, and there are a great number of processes it is watching (20-30).
Investigation So far
I have used PerfMon to monitor the Private Bytes of the process versus the Total number of Bytes in all Heaps, and according to the logic presented in the article referenced below, my results indicate that while the value fluctuates, it remains bounded within an acceptable range, and hence there is no memory leak:
Article
I have also used FxCop to analyze my code, and it did not come up with anything relevant.
The Plot Thickens
Not being comfortable with just saying, Oh then there's no memory leak, I investigated further, and found (through debugging) that the following lines of code demonstrate where the leak is occurring, with the arrow showing the exact line.
_pc = new PerformanceCounter("Process", "% Processor Time", processName);
The above is where _pc is initiated, and is in the constructor of my ProcessMonitor class.
The below is the method that is causing the memory leak. This method is being called every second from my main.
public float NextValue()
{
    if (HasExited()) return PROCESS_ENDED;
    if (_pc != null)
    {
        _lastSample = _pc.NextValue(); //<-----------------------
        return _lastSample;
    }
    else return -1;
}
This indicates to me that the leak exists inside the NextValue() method, which is inside the System.Diagnostics.PerformanceCounter class.
My Questions:
Is this a known problem, and how do I get around it?
Is my assumption correct that the memory usage increasing in Task Manager implies that there is indeed a memory leak?
Are there any better ways to monitor multiple instances of a specific process and shut them down if they go over a specific threshold CPU usage, and then send an email?
So I think I figured it out.
Using the Reflector tool, I was able to examine the code inside System.Diagnostics.
It appears that the NextValue method calls
GC.SuppressFinalize();
This means (I think, and please correct me if I am wrong) that I needed to explicitly call Dispose() on all my classes.
So, what I did is implement IDisposable on all of my classes, especially the one that wrapped around PerformanceCounter.
I wrote more explicit cleanup of my IList<PerformanceMonitor> and its internals, and voilà, the memory behaviour changed.
It oscillates, but the memory usage is clearly bounded within an acceptable range over a long period of time.
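A minimal sketch of the idea (not the poster's actual classes; the names and the omitted error handling are illustrative only):
// Wrapper that owns a System.Diagnostics.PerformanceCounter and releases it deterministically.
public sealed class ProcessMonitor : IDisposable
{
    private PerformanceCounter _pc;

    public ProcessMonitor(string processName)
    {
        _pc = new PerformanceCounter("Process", "% Processor Time", processName);
    }

    public float NextValue()
    {
        // Return a sentinel once the counter has been disposed.
        return _pc != null ? _pc.NextValue() : -1;
    }

    public void Dispose()
    {
        if (_pc != null)
        {
            _pc.Dispose();
            _pc = null;
        }
    }
}
The list that holds these wrappers should dispose each monitor when it is removed, and dispose all of them when the list itself is torn down.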

Please tell me what is wrong with my threading!

I have a function that compresses a bunch of files into a single compressed file. It was taking a long time to compress, so I tried implementing threading in my application. Say I have 20 files to compress; I split them as 5*4=20, and to avoid locks I gave each of the 4 threads its own set of variables (which are used for compression), and I wait until all 4 threads finish. The threads are working, but I see no improvement in performance: normally it takes 1 minute for 20 files (for example), and after implementing threading there is only a 3 or 5 second difference, sometimes none at all.
Here I will show the code for one thread (it is the same for the other three threads):
// main thread
myClassObject->thread1 = AfxBeginThread((AFX_THREADPROC)MyThreadFunction1, myClassObject);
....
HANDLE threadHandles[4];
threadHandles[0] = myClassObject->thread1->m_hThread;
....
WaitForSingleObject(myClassObject->thread1->m_hThread, INFINITE);

UINT MyThreadFunction(LPARAM lparam)
{
    CMerger* myClassObject = (CMerger*)lparam;
    CString outputPath = myClassObject->compressedFilePath.GetAt(0); // contains the o/p path
    wchar_t* compressInputData[] = { myClassObject->thread1outPath,
                                     COMPRESS, (wchar_t*)(LPCTSTR)(outputPath) };
    HINSTANCE loadmyDll;
    loadmyDll = LoadLibrary(myClassObject->thread1outPath);
    fp_Decompress callCompressAction = NULL;
    int getCompressResult = 0;
    myClassObject->MyCompressFunction(compressInputData, loadClient7zdll, callCompressAction,
                                      myClassObject->thread1outPath, getCompressResult,
                                      minIndex, myClassObject->firstThread, myClassObject);
    return 0;
}
Firstly, you only wait on one of the threads. I think you want WaitForMultipleObjects.
As for the lack of speed up have you considered that your actual bottleneck is NOT the compression but the file loading? File loading is slow and 4 threads contending for time slices of the hard disk "could" even result in lower performance.
This is why premature optimisation is evil. You need to profile, profile and profile again to work out where your REAL bottlenecks are.
Edit: I can't really comment on your WaitForMultipleObjects unless I see the code. I have never had any problems with it myself ...
As for a bottleneck, it's a metaphor: if you try to pour a large amount of liquid out of a cylinder by tipping it upside down, the liquid leaves at a constant rate. If you try to do this with a bottle, you will notice that it can't empty as fast. This is because only so much liquid can flow through the thin part of the bottle (not to mention the air entering it). Thus the rate at which the container empties is limited by the neck of the bottle (the thin part).
In programming, when you talk about a bottleneck you are talking about the slowest part of the code. In this case, if your threads spend most of their time waiting for the disk load to complete, then you will get very little speedup from multi-threading, because you can only load so much at once. In fact, when you try to load 4 times as much at once, you will find that you have to wait around just as long for the loads to complete. In the single-threaded case you wait, and once the file is loaded you compress. In the 4-threaded case you wait roughly 4 times as long for all the loads to complete, and then you compress all 4 files simultaneously. This is why you get a small speedup. Unfortunately, because you spend most of your time waiting for the loads to complete, you won't see anything approaching a 4x speedup. Hence the limiting factor of your method is not the compression but loading the files from disk, which is why it is called a bottleneck.
Edit2: In a case such as you are describing, you will find the best speedup is obtained by reducing the amount of time you spend waiting for data to load from disk.
1) If you load a file as a multiple of the disk page size (usually 2048 bytes, but you can query Windows to get the actual size), you get the best possible load performance. If you load sizes that aren't a multiple of this, you will take quite a serious performance hit.
2) Look at asynchronous loading. For example, you could be loading all of file 2 (or more) into memory while you are processing file 1. This means that you aren't waiting around for the load to complete. It is unlikely, though, that you'll get a vast speedup here, as you'll probably still end up waiting for the load. The other thing to try is to load "chunks" of the file asynchronously, i.e.:
Load chunk 1.
Start chunk 2 loading.
Process chunk 1.
Wait for chunk 2 to load.
Start chunk 3 loading.
Process chunk 2.
(And so on)
3) You could just buy a faster disk drive.
