GCD dispatch_async memory leak? - memory-leaks

The following code will occupy ~410MB of memory and will not release it again. (The version using dispatch_sync instead of dispatch_async will require ~8MB memory)
I would expect a spike of high memory usage but it should go down again... Where is the leak?
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        for (int i = 0; i < 100000; i++) {
            dispatch_async(dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
                NSLog(@"test");
            });
        }
        NSLog(@"Waiting.");
        [[NSRunLoop mainRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:60]];
    }
    return 0;
}
I tried:
Adding @autoreleasepool around and inside the loop
Adding an NSRunLoop run to the loop
I tried several combinations and never saw a decrease in memory (even after waiting for minutes).
I'm aware of the GCD reference guide which contains the following statement:
Although GCD dispatch queues have their own autorelease pools, they make no guarantees as to when those pools are drained.
Is there a memory leak in this code? If not, is there a way to force the queue to release/drain the finished blocks?

An Objective-C block is a C structure. You are creating 100000 block objects to execute on background threads, and they all wait until the system can run them. Your device can execute only a limited number of threads, which means many blocks will wait before the OS starts them.
If you change "async" to "sync", the next block object is only created after the previous block has finished and been destroyed.
Update: about the GCD pool.
GCD executes tasks on a thread pool whose threads are created and managed by the system. The system caches threads to save CPU time, and every dispatched task executes on a free thread.
From the documentation:
"Blocks submitted to dispatch queues are executed on a pool of threads fully managed by the system. No guarantee is made as to the thread on which a task executes."
If you run the tasks synchronously, there is always a free thread (from the GCD thread pool) to execute the next task once the current one finishes (the main thread is blocked while each task executes and does not add new tasks to the queue), so the system does not allocate new NSThreads (on my Mac I've seen 2 threads). If you run the tasks asynchronously, the system can allocate many NSThreads to achieve maximum performance (on my Mac, around 67 threads), because the global queue contains many pending tasks.
Here you can read about the maximum size of the GCD thread pool.
In the Allocations profiler I've seen many NSThreads allocated and never destroyed. I think this is a system pool that will be freed if necessary.
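If the goal is to keep the spike bounded in the first place, a common pattern (not from either answer here, just a hedged sketch) is to throttle how many blocks are in flight with a dispatch semaphore, so only a fixed number of pending blocks ever exist at once:

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        // Allow at most 16 blocks in flight; the value is an arbitrary example.
        dispatch_semaphore_t sem = dispatch_semaphore_create(16);
        for (int i = 0; i < 100000; i++) {
            dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER); // blocks until a slot frees up
            dispatch_async(dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
                @autoreleasepool {
                    NSLog(@"test");
                }
                dispatch_semaphore_signal(sem); // release the slot
            });
        }
        NSLog(@"Waiting.");
        [[NSRunLoop mainRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:60]];
    }
    return 0;
}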

Always put an @autoreleasepool inside every GCD block and you will have no problems. I had the same problem, and this is the only workaround.
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        for (int i = 0; i < 100000; i++) {
            dispatch_async(dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
                // everything INSIDE an @autoreleasepool
                @autoreleasepool {
                    NSLog(@"test");
                }
            });
        }
        NSLog(@"Waiting.");
        [[NSRunLoop mainRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:60]];
    }
    return 0;
}

Related

What would be the right way to go for my scenario: thread array, thread pool, or tasks?

I am working on a small microfinance application that processes financial transactions. The frequency of these transactions is quite high, which is why I am planning to make it a multi-threaded application that can process multiple transactions in parallel.
I have already designed all the workers to be thread safe.
What I need help with is how to manage these threads. Here are some of my options:
1. Make a specified number of thread pool threads at startup and keep them running in an infinite loop, where they keep looking for new transactions and start processing whenever any are found.
example code:
void Start_Job()
{
    for (int l_ThreadId = 0; l_ThreadId < PaymentNoOfWorkerThread; l_ThreadId++)
    {
        ThreadPool.QueueUserWorkItem(Execute, (object)l_ThreadId);
    }
}
void Execute(object l_TrackingId)
{
    while (true)
    {
        var new_txns = Get_New_Txns(); // get new txns, if any; returns a queue
        while (new_txns.Count > 0)
        {
            process_txn(new_txns.Dequeue());
        }
        Thread.Sleep(some_time);
    }
}
2. Look for new transactions and assign a thread pool thread to each transaction (my understanding is that these threads are reused for new txns after their execution completes).
example code:
void Start_Job()
{
    while (true)
    {
        var new_txns = Get_New_Txns(); // get new txns, if any; returns a queue
        while (new_txns.Count > 0)
        {
            ThreadPool.QueueUserWorkItem(Execute, (object)new_txns.Dequeue());
        }
        Thread.Sleep(some_time);
    }
}
void Execute(object txn)
{
    process_txn(txn);
}
3. Do the above, but with tasks.
Which option would be the most efficient and well suited for my application?
Thanks in advance :)
ThreadPool.QueueUserWorkItem is an older API and you shouldn't be using it directly anymore. Tasks are the way to go, and the thread pool is managed automatically for you.
What suits your application depends on what happens in process_txn and is subjective, so this is a very generic guideline:
If process_txn is a compute-bound operation, for example it performs only CPU-bound calculations, then you should look at the Task Parallel Library. It will help you use the CPU cores more efficiently.
If process_txn is less CPU-bound and more IO-bound, meaning it reads/writes files or a database or connects to some other remote service, then what you should look at is asynchronous programming, making sure your IO operations are all asynchronous so that your threads are never blocked on IO. This will help your service be more scalable. Also, depending on what your queue is, see if you can await on the queue asynchronously, so that none of your application threads are blocked just waiting on the queue (a sketch follows).
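To illustrate that last point, here is a hedged sketch that assumes a System.Threading.Channels channel as the queue (the post doesn't say what backs Get_New_Txns) and a hypothetical ProcessTxnAsync standing in for an async version of process_txn:

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class TxnProcessor
{
    // An unbounded channel standing in for the transaction queue (assumption).
    private readonly Channel<int> _queue = Channel.CreateUnbounded<int>();

    public bool Enqueue(int txnId) => _queue.Writer.TryWrite(txnId);
    public void CompleteAdding() => _queue.Writer.Complete();

    // Each worker awaits the queue asynchronously; no thread is blocked
    // while the queue is empty or while a transaction performs IO.
    public Task RunWorkerAsync() => Task.Run(async () =>
    {
        await foreach (var txnId in _queue.Reader.ReadAllAsync())
            await ProcessTxnAsync(txnId);
    });

    // Hypothetical async stand-in for process_txn.
    private static async Task ProcessTxnAsync(int txnId)
    {
        await Task.Delay(10); // placeholder for real asynchronous IO
        Console.WriteLine($"processed txn {txnId}");
    }
}

For the compute-bound case, Parallel.ForEach(transactions, process_txn) from the Task Parallel Library is the corresponding one-liner.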

mmap: performance when using multithreading

I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.
The problem I am facing right now is that whenever I use just 1 process with N worker threads, it has worse performance than spawning 2 processes each with N/2 worker threads. I can see this in iotop because 1 process+N threads uses only around 75% of the disk bandwidth whereas 2 processes+N/2 threads use full bandwidth.
Some notes:
This happens only if I use mmap()/munmap(). Replacing them with fopen()/fread() works just fine. But since the mmap()/munmap() calls come from a 3rd-party library, I would like to use them in their original form.
madvise() is called with MADV_SEQUENTIAL, but removing it or changing the advice argument doesn't seem to change anything (or it just slows things down).
Thread affinity doesn't seem to matter. I have tried limiting each thread to a specific core. I have also tried limiting threads to core pairs (Hyper-Threading). No results so far.
The load reported by htop seems to be the same in both cases.
So my questions are:
Is there anything about mmap() I am not aware of when used in multithreaded environment?
If so, why do 2 processes have better performance?
EDIT:
As pointed out in the comments, it is running on a server with two CPUs. I should probably try to set thread affinities so that it always runs on the same CPU, but I think I already tried that and it didn't work.
Here is a piece of code with which I can reproduce the same issue as with my production software.
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef WORKERS
#define WORKERS 16
#endif

bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;

std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
    int fd = open(file_path.data(), O_RDONLY);
    if (fd != -1)
    {
        auto dir_ent = std::filesystem::directory_entry{file_path.data()};
        if (dir_ent.is_regular_file())
        {
            auto size = dir_ent.file_size();
            auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            madvise(data, size, MADV_SEQUENTIAL);
            close(fd);
            return { reinterpret_cast<const std::uint8_t*>(data), size };
        }
        close(fd);
    }
    return { nullptr, 0 };
}

void unmap_file(const std::uint8_t* data, std::size_t size)
{
    munmap((void*)data, size);
}

int main(int argc, char* argv[])
{
    std::deque<std::string> queue;
    std::vector<std::thread> threads;

    for (std::size_t i = 0; i < WORKERS; ++i)
    {
        threads.emplace_back([&]() {
            std::string path;
            while (true)
            {
                {
                    std::unique_lock<std::mutex> lock(queue_mutex);
                    while (!stop && queue.empty())
                        queue_cv.wait(lock);
                    if (stop && queue.empty())
                        return;
                    path = queue.front();
                    queue.pop_front();
                }
                auto [data, size] = map_file(path);
                std::uint8_t b = 0;
                for (auto itr = data; itr < data + size; ++itr)
                    b ^= *itr;
                unmap_file(data, size);
                std::cout << (int)b << std::endl;
            }
        });
    }

    for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        if (p.is_regular_file())
        {
            queue.push_back(p.path().native());
            queue_cv.notify_one();
        }
    }

    stop = true;
    queue_cv.notify_all();

    for (auto& t : threads)
        t.join();

    return 0;
}
Is there anything about mmap() I am not aware of when used in multithreaded environment?
Yes. mmap() requires significant virtual memory manipulation - effectively single-threading your process in places. Per this post from one Linus Torvalds:
... playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to mmap:
- quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
- page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.
So the virtual memory manipulations mapping files requires are not only expensive, they really can't be done in parallel - there's only one chunk of actual physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a process's virtual address space.
You'd almost certainly get better performance by reusing a memory buffer for each file, where each buffer is created once and is large enough to hold any file read into it, then reading from the file using low-level POSIX read() call(s). You might want to experiment with page-aligned buffers and direct IO, calling open() with the O_DIRECT flag (Linux-specific) to bypass the page cache, since you apparently never re-read any data and any caching is a waste of memory and CPU cycles.
Reusing the buffer also completely eliminates any munmap() or delete/free().
You'd have to manage the buffers, though. Perhaps prepopulate a queue with N precreated buffers, and return a buffer to the queue when done with a file? A sketch of the idea follows.
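This is only a hedged sketch of the reused-buffer idea under the assumptions above (plain read(), no O_DIRECT, buffer grown on demand, minimal error handling):

#include <cstddef>
#include <cstdint>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Reads a whole file into `buf` with POSIX read(); returns the number of
// bytes read, or -1 on error. `buf` is grown if the file exceeds its
// current capacity but is otherwise reused across calls, so there is no
// per-file mmap()/munmap() or allocation.
ssize_t read_file(const char* path, std::vector<std::uint8_t>& buf)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    std::size_t total = 0;
    for (;;)
    {
        if (total == buf.size())
            buf.resize(buf.empty() ? 1 << 20 : buf.size() * 2);
        ssize_t n = read(fd, buf.data() + total, buf.size() - total);
        if (n < 0) { close(fd); return -1; }
        if (n == 0) break; // EOF
        total += static_cast<std::size_t>(n);
    }
    close(fd);
    return static_cast<ssize_t>(total);
}

Each worker thread would own one such buffer (or borrow one from a prepopulated queue, as suggested above) and pass it to read_file for every file it processes.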
As for:
If so, why do 2 processes have better performance?
The use of two processes splits the process-specific virtual memory manipulations caused by the mmap() calls into two separate sets that can run in parallel.
A few notes:
Try running your application with perf stat -ddd <app> and have a look at the context-switches, cpu-migrations and page-faults numbers.
The threads probably contend for vm_area_struct in the kernel process structure on mmap and on page faults. Try passing the MAP_POPULATE or MAP_LOCKED flag into mmap to minimize page faults. Alternatively, try mmap with MAP_POPULATE or MAP_LOCKED in the main thread only (you may like to ensure that all threads run on the same NUMA node in this case).
You may also like to experiment with MAP_HUGETLB and one of the MAP_HUGE_2MB, MAP_HUGE_1GB flags.
Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory, e.g. numactl --membind=0 --cpunodebind=0 <app>.
Lock the mutex before setting stop = true, otherwise the condition variable notification can get lost and deadlock a waiting thread forever.
The p.is_regular_file() check doesn't require the mutex to be locked.
std::deque can be replaced with std::list, using splice to push and pop elements, to minimize the time the mutex is held (see the sketch after this list).
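A hedged sketch of that splice idea (the names are illustrative, not from the post); the point is that allocation happens outside the lock and each transfer is an O(1) pointer operation:

#include <list>
#include <mutex>
#include <string>

std::mutex m;
std::list<std::string> shared_queue;

// Producer: build the node outside the lock, then splice it in.
void push(std::string path)
{
    std::list<std::string> tmp;
    tmp.push_back(std::move(path)); // allocation happens outside the lock
    std::lock_guard<std::mutex> lock(m);
    shared_queue.splice(shared_queue.end(), tmp); // O(1), no allocation
}

// Consumer: splice one node out under the lock, consume it outside.
bool pop(std::string& out)
{
    std::list<std::string> tmp;
    {
        std::lock_guard<std::mutex> lock(m);
        if (shared_queue.empty())
            return false;
        tmp.splice(tmp.begin(), shared_queue, shared_queue.begin()); // O(1)
    }
    out = std::move(tmp.front());
    return true;
}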

thread synchronization using only shared memory

Recently, I was interviewed at a couple of companies, and was asked the same question:
"You've got N worker threads that can communicate only via shared memory, any other synchronization primitives are not available. The shared memory contains a counter which is initially 0, and each thread must increment it once. Another thread may be added, and there is more space on the shared memory in addition to the counter"
In other words, there are multiple threads, and their access to a shared resource (in this case, a counter, but can be anything else) must be synchronized using shared memory only.
So my solution was as follows:
Define 3 more integer variables in the shared memory: REQUEST, GRANTED, FINISHED, and initialize them to -1.
Before starting the worker threads, start another manager thread that will coordinate between the worker threads.
Manager thread pseudocode:
while (true) {
    if (GRANTED equals FINISHED) {
        GRANTED = REQUEST;
    }
}
Worker thread pseudocode:
incremented = false;
while (incremented equals false) {
    REQUEST = this thread ID;
    if (GRANTED equals this thread ID) {
        increment the counter;
        incremented = true;
        FINISHED = this thread ID;
    }
}
The question is whether this solution is OK. Are there other solutions?
Also, this solution is not fair, because a worker may try many times before it gets a chance to actually increment the counter. How can it be made fair?
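For concreteness, the pseudocode above might be transcribed to C# roughly as follows. This is a hedged sketch: Volatile.Read/Write is used so the shared fields behave like shared memory rather than being cached in registers, and whether the protocol itself is correct and fair is exactly what the question asks.

using System;
using System.Collections.Generic;
using System.Threading;

class SharedMemorySync
{
    // The "shared memory": the counter plus the three coordination words.
    static int counter = 0;
    static int REQUEST = -1, GRANTED = -1, FINISHED = -1;

    // Manager: grant the latest request once the previous grant has finished.
    static void Manager()
    {
        while (true) // spins forever; runs as a background thread in this sketch
        {
            if (Volatile.Read(ref GRANTED) == Volatile.Read(ref FINISHED))
                Volatile.Write(ref GRANTED, Volatile.Read(ref REQUEST));
        }
    }

    // Worker: keep requesting until granted, then do the protected increment.
    static void Worker(int id)
    {
        bool incremented = false;
        while (!incremented)
        {
            Volatile.Write(ref REQUEST, id);
            if (Volatile.Read(ref GRANTED) == id)
            {
                counter++; // the protected section
                incremented = true;
                Volatile.Write(ref FINISHED, id);
            }
        }
    }

    static void Main()
    {
        new Thread(Manager) { IsBackground = true }.Start();
        var workers = new List<Thread>();
        for (int id = 0; id < 4; id++)
        {
            int captured = id;
            var t = new Thread(() => Worker(captured));
            workers.Add(t);
            t.Start();
        }
        workers.ForEach(t => t.Join());
        Console.WriteLine(counter); // expected: 4
    }
}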

Creating multithreads continuously

I have a string list that contains file paths; the list has 80 elements. I want to keep 8 threads running continuously until all the files in the list have been moved: whenever a thread finishes its work, I will create another one, so that the thread count stays at 8.
Can anybody help me?
Unless each thread is writing to a different drive, having multiple threads copying files is slower than doing it with a single thread. The disk drive can only do one thing at a time. If you have eight threads all trying to write to the same disk drive, then it takes extra time to do disk head seeks and such.
Also, if you don't have at least eight CPU cores, then trying to run eight concurrent threads is going to require extra thread context switches. If you're doing this on a four-core machine, then you shouldn't have more than four threads working on it.
If you really need to have eight threads doing this, then put all of the file paths into a BlockingCollection, start eight threads, and have them go to work. So you have eight persistent threads rather than starting and stopping threads all the time. Something like this:
BlockingCollection<string> filePaths = new BlockingCollection<string>();
List<Thread> threads = new List<Thread>();

// add paths to queue
foreach (var path in ListOfFilePaths)
    filePaths.Add(path);
filePaths.CompleteAdding();

// start threads to process the paths
for (int i = 0; i < 8; ++i)
{
    Thread t = new Thread(CopyFiles);
    threads.Add(t);
    t.Start();
}

// threads are working. At some point you'll need to clean up:
foreach (var t in threads)
{
    t.Join();
}
Your CopyFiles method looks like this:
void CopyFiles()
{
    foreach (var path in filePaths.GetConsumingEnumerable())
    {
        CopyTheFile(path);
    }
}
Since you're working with .NET 4.0, you could use Task instead of Thread. The code would be substantially similar.
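For example, a hedged sketch of the Task version (using Task.Factory.StartNew, since Task.Run only arrived in .NET 4.5):

var tasks = new List<Task>();
for (int i = 0; i < 8; ++i)
    tasks.Add(Task.Factory.StartNew(CopyFiles, TaskCreationOptions.LongRunning));
// wait for all of the copies to finish
Task.WaitAll(tasks.ToArray());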

Limit number of concurrent threads in a thread pool

In my code I have a loop; inside this loop I send several requests to a remote webservice. The WS provider said: "The webservice can host at most n threads", so I need to cap my code since I can't send n+1 concurrent requests.
If I have to send m requests, I would like the first n to be executed immediately, and as soon as one of them completes, a new one (one of the remaining m-n) to be executed, and so on, until all m are executed.
I have thought of a thread pool with the max thread count explicitly set to n. Is this enough?
For this I would avoid the use of multiple threads and instead wrap the entire loop up so that it can run on a single thread. However, if you do want to launch multiple threads using a thread pool, then I would use the Semaphore class to enforce the required thread limit; here's how...
A semaphore is like a mean night club bouncer: it has been given a club capacity and is not allowed to exceed this limit. Once the club is full, no one else can enter, and a queue builds up outside. Then, as one person leaves, another can enter (analogy thanks to J. Albahari).
A Semaphore with a value of one is equivalent to a Mutex or Lock, except that the Semaphore has no owner, so it is thread-agnostic. Any thread can call Release on a Semaphore, whereas with a Mutex/Lock only the thread that obtained the Mutex/Lock can release it.
Now, for your case we are able to use Semaphores to limit concurrency and prevent too many threads from executing a particular piece of code at once. In the following example five threads try to enter a night club that only allows entry to three...
using System;
using System.Threading;

class BadAssClub
{
    static SemaphoreSlim sem = new SemaphoreSlim(3);

    static void Main()
    {
        for (int i = 1; i <= 5; i++)
            new Thread(Enter).Start(i);
    }

    // Enforce only three threads running this method at once.
    static void Enter(object i)
    {
        try
        {
            Console.WriteLine(i + " wants to enter.");
            sem.Wait();
            Console.WriteLine(i + " is in!");
            Thread.Sleep(1000 * (int)i);
            Console.WriteLine(i + " is leaving...");
        }
        finally
        {
            sem.Release();
        }
    }
}
I hope this helps.
Edit: you can also use the ThreadPool.SetMaxThreads method. This method restricts the number of threads allowed to run in the thread pool, but it does so 'globally' for the thread pool itself. This means that if your application runs SQL queries or other methods from libraries it uses, new threads will not be spun up because of this limit. That may not be relevant to you, in which case use the SetMaxThreads method. If you want to limit concurrency for a particular method, however, it is safer to use Semaphores.
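For completeness, a hedged sketch of the same throttling idea with tasks and SemaphoreSlim.WaitAsync, so that waiting for a free slot doesn't block a thread (CallWebServiceAsync is a hypothetical stand-in for the real request):

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class ThrottledRequests
{
    // Hypothetical stand-in for the real webservice call.
    static Task CallWebServiceAsync(int requestId) => Task.Delay(100);

    // Runs all m requests while keeping at most n in flight at any moment.
    static async Task RunAllAsync(IReadOnlyList<int> requestIds, int n)
    {
        var sem = new SemaphoreSlim(n);
        var tasks = new List<Task>();
        foreach (var id in requestIds)
        {
            await sem.WaitAsync(); // asynchronously wait for a free slot
            tasks.Add(Task.Run(async () =>
            {
                try { await CallWebServiceAsync(id); }
                finally { sem.Release(); } // free the slot for the next request
            }));
        }
        await Task.WhenAll(tasks);
    }
}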
