I've a multithreaded (Cilk) program where each thread use a temporary
std::set. There are a lot of allocations on these std::set so that I'm
trying to use some pool allocators namely boost::fast_pool_allocator:
using allocator = boost::fast_pool_allocator< SGroup::type >;
using set = std::set<SGroup::type, std::less<SGroup::type>, allocator>;
But now the performances are much worse because of concurrent access to the
allocator. One crucial fact is that the sets are never communicated among the
threads so that I can use a thread local allocators. However, as shown in the
previous code, I'm not constructing allocator objects but passing template
parameters to the std::set constructor.
So here is my question: is it possible to construct multiple
boost::fast_pool_allocator to use them as thread local pool allocator ?
Edit : I removed stupid std::pair allocations.
EDIT
Mmm. I had an answer here that I pieced together from things I remembered seeing. However, upon further inspection it looks like all the allocators actually work with Singleton Pools that are never thread safe without synchronization. In fact, the null_mutex is likely in a detail namespace for this very reason: it only makes sense to use it if you know the program doesn't use threads (well, outisde the main thread) at all.
Aside from this apparent debacle, you could probably use object_pool directly. But it's not an allocator, so it wouldn't serve you for your container example.
Original Answer Text:
You can pass an allocator instance at construction:
#include <boost/pool/pool.hpp>
#include <boost/pool/pool_alloc.hpp>
#include <boost/thread.hpp>
#include <set>
struct SGroup
{
int data;
typedef int type;
};
using allocator = boost::fast_pool_allocator<SGroup::type>;
using set = std::set<SGroup::type, std::less<SGroup::type>, allocator>;
void thread_function()
{
allocator alloc; // thread local
set myset(set::key_compare(), alloc);
// do stuff
}
int main()
{
boost::thread_group group;
for (int i = 0; i<10; ++i)
group.create_thread(thread_function);
group.join_all();
}
Let me read the docs on how to disable thread-awareness on the allocator :)
Found it in an example:
typedef boost::fast_pool_allocator<SGroup::type,
boost::default_user_allocator_new_delete,
boost::details::pool::null_mutex> allocator;
The example in boost/libs/pool/example/time_pool_alloc.hpp should help you get started benchmarking the difference(s) in performance
Related
I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.
The problem I am facing right now is that whenever I use just 1 process with N worker threads, it has worse performance than spawning 2 processes each with N/2 worker threads. I can see this in iotop because 1 process+N threads uses only around 75% of the disk bandwidth whereas 2 processes+N/2 threads use full bandwidth.
Some notes:
This happens only if I use mmap()/munmap(). I have tried to replace it with fopen()/fread() and it worked just fine. But since the mmap()/munmap() comes with 3rd party library, I would like to use it in its original form.
madvise() is called with MADV_SEQUENTIAL but it doesn't seem to change anything (or it just slows it down) if I remove it or change the advise argument.
Thread affinity doesn't seem to matter. I have tried to limit each thread to specific core. I have also tried to limit threads to core pairs (Hyper Threading). No results so far.
Load reported by htop seems to be the same even in both cases.
So my questions are:
Is there anything about mmap() I am not aware of when used in multithreaded environment?
If so, why do 2 processes have better performance?
EDIT:
As pointed out in the comments, it is running on server with 2xCPU. I should probably try to set thread affinities such that it is always running on the same CPU but I think I already tried that and it didn't work.
Here is a piece of code with which I can reproduce the same issue as with my production software.
#include <condition_variable>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#ifndef WORKERS
#define WORKERS 16
#endif
bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;
std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
int fd = open(file_path.data(), O_RDONLY);
if (fd != -1)
{
auto dir_ent = std::filesystem::directory_entry{file_path.data()};
if (dir_ent.is_regular_file())
{
auto size = dir_ent.file_size();
auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
madvise(data, size, MADV_SEQUENTIAL);
close(fd);
return { reinterpret_cast<const std::uint8_t*>(data), size };
}
close(fd);
}
return { nullptr, 0 };
}
void unmap_file(const std::uint8_t* data, std::size_t size)
{
munmap((void*)data, size);
}
int main(int argc, char* argv[])
{
std::deque<std::string> queue;
std::vector<std::thread> threads;
for (std::size_t i = 0; i < WORKERS; ++i)
{
threads.emplace_back(
[&]() {
std::string path;
while (true)
{
{
std::unique_lock<std::mutex> lock(queue_mutex);
while (!stop && queue.empty())
queue_cv.wait(lock);
if (stop && queue.empty())
return;
path = queue.front();
queue.pop_front();
}
auto [data, size] = map_file(path);
std::uint8_t b = 0;
for (auto itr = data; itr < data + size; ++itr)
b ^= *itr;
unmap_file(data, size);
std::cout << (int)b << std::endl;
}
}
);
}
for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
{
std::unique_lock<std::mutex> lock(queue_mutex);
if (p.is_regular_file())
{
queue.push_back(p.path().native());
queue_cv.notify_one();
}
}
stop = true;
queue_cv.notify_all();
for (auto& t : threads)
t.join();
return 0;
}
Is there anything about mmap() I am not aware of when used in multithreaded environment?
Yes. mmap() requires significant virtual memory manipulation - effectively single-threading your process in places. Per this post from one Linus Torvalds:
... playing games with the virtual memory mapping is very expensive
in itself. It has a number of quite real disadvantages that people tend
to ignore because memory copying is seen as something very slow, and
sometimes optimizing that copy away is seen as an obvious improvment.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything
cleanly. It's the book-keeping for maintaining a list of all the
mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated,
and it's quite slow.
Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.
So the virtual memory manipulations mapping files requires are not only expensive, they really can't be done in parallel - there's only one chunk of actual physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a process's virtual address space.
You'd almost certainly get better performance reusing a memory buffer for each file, where each buffer is created once and is large enough to hold any file read into it, then reading from the file using low-level POSIX read() call(s). You might want to experiment with using page-aligned buffers and using direct IO by calling open() with the O_DIRECT flag (Linux-specific) to bypass the page cache since you apparently never re-read any data and any caching is a waste of memory and CPU cycles.
Reusing the buffer also completely eliminates any munmap() or delete/free().
You'd have to manage the buffers, though. Perhaps prepopulating a queue with N precreated buffers, and returning a buffer to the queue when done with a file?
As far as
If so, why do 2 processes have better performance?
The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.
A few notes:
Try running your application with perf stat -ddd <app> and have a look at context-switches, cpu-migrations and page-faults numbers.
The threads probably contend for vm_area_struct in the kernel process structure on mmap and page faults. Try passing MAP_POPULATE or MAP_LOCKED flag into mmap to minimize page faults. Alternatively, try mmap with MAP_POPULATE or MAP_LOCKED flag in the main thread only (you may like to ensure that all threads run on the same NUMA node in this case).
You may also like to experiment with MAP_HUGETLB and one of MAP_HUGE_2MB, MAP_HUGE_1GB flags.
Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
Lock the mutex before stop = true, otherwise the condition variable notification can get lost and deadlock the waiting thread forever.
p.is_regular_file() check doesn't require the mutex to be locked.
std::deque can be replaced with std::list and use splice to push and pop elements to minimize the time the mutex is locked.
I am unable to find a function to convert a thread id (pid_t) into a pthread_t which would allow me to call pthread_cancel() or pthread_kill().
Even if pthreads doesn't provide one is there a Linux specific function?
I don't think such a function exists but I would be happy to be corrected.
Background
I am well aware that it is usually preferable to have threads manage their own lifetimes via condition variables and the like.
This use is for testing purposes. I am trying to find a way to test how an application behaves when one of its threads 'dies'. So I'm really looking for a way to kill a thread. Using syscall(tgkill()) kills the process, so instead I provided a means for a tester to give the process the id of the thread to kill. I now need to turn that id into a pthread_t so that I can then:
use pthread_kill(tid,0) to check for its existence followed by
calling pthread_kill() or pthread_cancel() as appropriate.
This is probably taking testing to an unnecessary extreme. If I really want to do that some kind of mock pthreads implementation might be better.
Indeed if you really want robust isolation you are typically better off using processes rather than threads.
I don't think such a function exists but I would be happy to be corrected.
As a workaround I can create a table mapping &pthread_t to pid_t and ensure that I always invoke pthread_create() via a wrapper that adds an entry to this table. This works very well and allows me to convert an OS thread id to a pthread_t which I can then terminate using pthread_cancel(). Here is a snippet of the mechanism:
typedef void* (*threadFunc)(void*);
static void* start_thread(void* arg)
{
threadFunc threadRoutine = routine_to_start;
record_thread_start(pthread_self(),syscall(SYS_gettid));
routine_to_start = NULL; //let creating thread know its safe to continue
return threadRoutine(arg);
}
Sensible conversion requires there to be a 1:1 mapping between pthread_t and pid_t tid, which is the case with NPTL, but hasn't always been the case, and won't be the case on every pthread platform. That said...
Two options:
A) override the actual pthread_create, using LD_PRELOAD and dlsym, and keep track of each pthread_t and their corresponding pid_t there. To get the thread pid_t you can either take advantage of the pthread private headers to de-opaque the pthread_t and access the pid_t inside there, or if you want to stick to documented APIs pthread_sigqueue each pthread_t thread as it is created and have a sigaction signal handler call gettid and pass you back the thread pid_t, with appropriate synchronisation between your new pthread_create and the signal handler[1].
B) You can read the all of the thread pid_t from /proc/<process pid_t>/task/. Then use the SYS_rt_tgsigqueueinfo[2] syscall to implement a new function thread_sigqueue, a pid_t variant of pthread_sigqueue so that you can signal the pid_t thread, and from the sigaction signal handler call pthread_self passing out the value with suitable synchronization, etc.
Notes:
1 - I think it's worth writing 2 executeOnThread variants (one for pthread_t and one for pid_t style thread ids) that take a std::function<void()> (for C++), or a void(*)(void*) function pointer and void* parameter (for C), and SIGUSR1 that thread to execute the passed function in a sigaction that you also setup to perform relevant synchronization with the calling thread. It's handy to be able to use the thread-dependent APIs like pthread_self, gettid, backtrace, getrusage, etc. without devising a custom execution scheme each time.
2 - SYS_rt_tgsigqueueinfo is a low level syscall meant for implementing sigqueue/pthread_sigqueue, rather than application use, but is still a documented API, and we're using it to implement another variant of sigqueue, so fair game IMHO.
#include<iostream>
#include<vector>
#include<thread>
#include<string>
using namespace std;
vector<string> s;
void add()
{
while(true)
{
getchar();
s.push_back("added");
}
}
void show()
{
while(true)
{
//cout<<"";
while(!s.empty())
{
cout<<(*s.begin())<<endl;
s.erase(s.begin());
}
}
}
int main()
{
thread one(add);
thread two(show);
one.join();
two.join();
}
In debug mode there is no such a problem. In release mode if the comment line is uncommented it works again. But with just like this, there is a problem. What is the problem?
std::vector (as any other std:: container) is not generally thread-safe. It means that concurrent modifying access to the same vector from multiple thread is generally not supported. What that means is that while you can call non-modifying functions of the vector from many threads at the same time (for instance, you can call begin() and end() with no problems), modification functions should have exclusive access to the vector object. To achieve this exclusivity, you need to use thread-synchronization primitives to 'signal' your intention to obtain exclusive access to the vector, perform your modification and than 'signal' that exclusive access is no longer need.
Note, this is not enough to perform that sort of routine when you modify (insert) data to the vector. You will also have to do the same dance when you read data from the vector, since modifications need exclusive access, and even the read will violate this exclusivity. The non-technical term I've used here, 'signalling', has a technical counterpart - it is called critical section. Here we say that you 'enter critical section' and 'leave critical section'.
There a more than one way to enter and leave critical section. The stapples of this are so-called mutexes, and they should be enough for your learning. Just keep in mind there are other ways as well, which you'll learn in the due course.
I have a native c++ library that gets used by a managed C++ application. The native library is compiled with no CLR support and the managed C++ application with it (/CLR compiler option).
When I use a std::mutex in the native library I get a heap corruption when the owning native class is deleted. The use of mutex.h is blocked by managed C++ so I'm guessing that could be part of the reason.
The minimal native class that demonstrates the issue is:
Header:
#pragma once
#include <stdio.h>
#ifndef __cplusplus_cli
#include <mutex>
#endif
namespace MyNamespace {
class SomeNativeLibrary
{
public:
SomeNativeLibrary();
~SomeNativeLibrary();
void DoSomething();
#ifndef __cplusplus_cli
std::mutex aMutex;
#endif
};
}
Implementation:
#include "SomeNativeLibrary.h"
namespace MyNamespace {
SomeNativeLibrary::SomeNativeLibrary()
{}
SomeNativeLibrary::~SomeNativeLibrary()
{}
void SomeNativeLibrary::DoSomething(){
printf("I did something.\n");
}
}
Managed C++ Console Application:
int main(array<System::String ^> ^args)
{
Console::WriteLine(L"Unit Test Console:");
MyNamespace::SomeNativeLibrary *someNativelib = new MyNamespace::SomeNativeLibrary();
someNativelib->DoSomething();
delete someNativelib;
getchar();
return 0;
}
The heap corruption debug error occurs when the attempt is made to delete the someNativeLib pointer.
Is there anything I can do to use a std::mutex safely in the native library or is there an alternative I could use? In my live code the mutex is used for is to ensure that only a single thread accesses a std::vector.
The solution was to use a CRITICAL_SECTION as the lock instead. It's actually more efficient than a mutex in my case anyway since the lock is only for threads in the same process.
Not sure were you reading your own post but there is a clue in your code:
#ifndef __cplusplus_cli
std::mutex aMutex;
#endif
Member 'aMutex' compiles only if compile condition '__cplusplus_cli' is undefined.
So the moment you included that header in Managed C++ it vanished from definition.
So your Native project and Managed project have mismatch in class definition for beginners == mostly ends in Access Violation if attempted to write to location beyond class memory (non existing member in CLI version if instantiated there).
Or just HEAP CORRUPTION in managed code to put it simply.
So what you have done is no go, ever!
But I was actually amazed that you managed to include native lib, and successfully compile both projects. I must ask how did you hack project properties to manage that. Or you just may have found yet another bug ;)
About question: for posterity
Yes CRITICAL_SECTION helps, Yes it's more faster from mutex since it is implemented in single process and some versions of it even in hardware (!). Also had plenty of changes since it's introduction and some nasty DEAD-LOCKS issues.
example: https://microsoft.public.win32.programmer.kernel.narkive.com/xS8sFPCG/criticalsection-deadlock-with-owningthread-of-zero
end up just killing entire OS. So lock only very small piece of code that actually only accesses the data, and exit locks immediately.
As replacement, you could just use plain "C" kernel mutex or events if not planning cross-platform support (Linux/iOS/Android/Win/MCU/...).
There is a ton of other replacements coming from Windows kernel.
// mutex
https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-createmutexw
HANDLE hMutex = CreateMutex(NULL, TRUE, _T("MutexName"));
NOTE: mutex name rules 'Global' vs 'Local'.
// or event
https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-createeventw
HANDLE hEvent = CreateEvent(NULL, TRUE, FALSE, _T("EventName"));
to either set, or clear/reset event state, just call SetEvent(hEvent), or ResetEvent(hEvent).
To wait for signal (set) again simply
int nret = WaitForSingleObject(hEvent, -1);
INFINITE define (-1) wait for infinity, a value which can be replaced with actual milliseconds timeout, and then return value could be evaluated against S_OK. If not S_OK it's most likely TIMEOUT.
There is a sea of synchronization functions/techniques depending on which you actually need to do !?:
https://learn.microsoft.com/en-us/windows/win32/sync/synchronization-functions
But use it with sanity check.
Since you might be just trying to fix wrong thing in the end, each solution depends on the actual problem. And the moment you think it's too complex know it's wrong solution, simplest solutions were always best, but always verify and validate.
I observed that my copy of MSVC10 came with containers that appeared to allow state based allocators, and wrote a simple pool allocator, that allocates pools for a specific type.
However, I discovered that if _ITERATOR_DEBUG_LEVEL != 0 the MSVC vector creates a proxy allocator from the passed allocator (for iterator tracking?), uses the proxy, then lets the proxy fall out of scope, expecting the allocated memory to remain. This causes problems because my allocator attempts to release it's pool upon destruction. Is this allowed by the C++0x standard?
The code is roughly like this:
class _Container_proxy{};
template<class T, class _Alloc>
class vector {
_Alloc _Alval;
public:
vector() {
// construct _Alloc<_Container_proxy> _Alproxy
typename _Alloc::template rebind<_Container_proxy>::other
_Alproxy(_Alval);
//allocate
this->_Myproxy = _Alproxy.allocate(1);
/*other stuff, but no deallocation*/
} //_Alproxy goes out of scope
~_Vector_val() { // destroy proxy
// construct _Alloc<_Container_proxy> _Alproxy
typename _Alloc::template rebind<_Container_proxy>::other
_Alproxy(_Alval);
/*stuff, but no allocation*/
_Alproxy.deallocate(this->_Myproxy, 1);
} //_Alproxy goes out of scope again
According to the giant table of allocator requirements in section 17.6.3.5, an allocator must be copyable. Containers are allowed to copy them freely. So you need to store the pool in a std::shared_ptr or something similar in order to prevent deletion while one of the allocators is in existence.