mmap: performance when using multithreading - linux

I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.
The problem I am facing right now is that whenever I use just 1 process with N worker threads, it has worse performance than spawning 2 processes each with N/2 worker threads. I can see this in iotop because 1 process+N threads uses only around 75% of the disk bandwidth whereas 2 processes+N/2 threads use full bandwidth.
Some notes:
This happens only if I use mmap()/munmap(). I have tried replacing it with fopen()/fread() and that worked just fine. But since the mmap()/munmap() comes from a third-party library, I would like to use it in its original form.
madvise() is called with MADV_SEQUENTIAL, but removing it or changing the advice argument doesn't seem to change anything (or only slows things down).
Thread affinity doesn't seem to matter. I have tried limiting each thread to a specific core. I have also tried limiting threads to core pairs (Hyper-Threading). No results so far.
The load reported by htop seems to be the same in both cases.
So my questions are:
Is there anything about mmap() I am not aware of when it is used in a multithreaded environment?
If so, why do 2 processes have better performance?
EDIT:
As pointed out in the comments, it is running on a server with two CPUs. I should probably try to set thread affinities so that it always runs on the same CPU, but I think I already tried that and it didn't work.
Here is a piece of code with which I can reproduce the same issue as with my production software.
#include <condition_variable>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef WORKERS
#define WORKERS 16
#endif

bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;

std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
    int fd = open(file_path.data(), O_RDONLY);
    if (fd != -1)
    {
        auto dir_ent = std::filesystem::directory_entry{file_path.data()};
        if (dir_ent.is_regular_file())
        {
            auto size = dir_ent.file_size();
            auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            madvise(data, size, MADV_SEQUENTIAL);
            close(fd);
            return { reinterpret_cast<const std::uint8_t*>(data), size };
        }
        close(fd);
    }
    return { nullptr, 0 };
}

void unmap_file(const std::uint8_t* data, std::size_t size)
{
    munmap((void*)data, size);
}

int main(int argc, char* argv[])
{
    std::deque<std::string> queue;
    std::vector<std::thread> threads;

    for (std::size_t i = 0; i < WORKERS; ++i)
    {
        threads.emplace_back(
            [&]() {
                std::string path;
                while (true)
                {
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        while (!stop && queue.empty())
                            queue_cv.wait(lock);
                        if (stop && queue.empty())
                            return;
                        path = queue.front();
                        queue.pop_front();
                    }
                    auto [data, size] = map_file(path);
                    std::uint8_t b = 0;
                    for (auto itr = data; itr < data + size; ++itr)
                        b ^= *itr;
                    unmap_file(data, size);
                    std::cout << (int)b << std::endl;
                }
            }
        );
    }

    for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        if (p.is_regular_file())
        {
            queue.push_back(p.path().native());
            queue_cv.notify_one();
        }
    }

    stop = true;
    queue_cv.notify_all();

    for (auto& t : threads)
        t.join();

    return 0;
}

Is there anything about mmap() I am not aware of when it is used in a multithreaded environment?
Yes. mmap() requires significant virtual memory manipulation - effectively single-threading your process in places. Per this post from one Linus Torvalds:
... playing games with the virtual memory mapping is very expensive
in itself. It has a number of quite real disadvantages that people tend
to ignore because memory copying is seen as something very slow, and
sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything
cleanly. It's the book-keeping for maintaining a list of all the
mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated,
and it's quite slow.
Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.
So the virtual memory manipulations that mapping files requires are not only expensive, they really can't be done in parallel - there's only one chunk of actual physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a single process's virtual address space.
You'd almost certainly get better performance by reusing a memory buffer for each file: create each buffer once, make it large enough to hold any file read into it, and read the file with low-level POSIX read() call(s). You might also want to experiment with page-aligned buffers and direct IO, opening the file with the O_DIRECT flag (Linux-specific) to bypass the page cache - since you apparently never re-read any data, any caching is a waste of memory and CPU cycles.
Reusing the buffer also completely eliminates any munmap() or delete/free() calls.
You'd have to manage the buffers, though. Perhaps prepopulate a queue with N precreated buffers, and return a buffer to the queue when done with a file?
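To make that concrete, here is a rough sketch of reading a whole file into a caller-owned, reused buffer with POSIX read(). The function name, the 1 MiB starting size and the doubling growth policy are placeholders, and plain buffered reads are used rather than O_DIRECT:
// Sketch only: read a whole file into a reused, caller-owned buffer using
// plain POSIX read(). Error handling and the growth policy are illustrative.
#include <cstdint>
#include <fcntl.h>
#include <string>
#include <unistd.h>
#include <vector>

// Returns the number of bytes read into `buffer`, or -1 on error.
// The caller keeps `buffer` alive between files, so it is allocated only
// once (plus occasional growth) instead of one mmap()/munmap() per file.
ssize_t read_whole_file(const std::string& path, std::vector<std::uint8_t>& buffer)
{
    int fd = open(path.c_str(), O_RDONLY);
    if (fd == -1)
        return -1;

    std::size_t total = 0;
    for (;;)
    {
        if (total == buffer.size())
            buffer.resize(buffer.empty() ? 1 << 20 : buffer.size() * 2);
        ssize_t n = read(fd, buffer.data() + total, buffer.size() - total);
        if (n == -1) { close(fd); return -1; }
        if (n == 0)  break;                         // EOF
        total += static_cast<std::size_t>(n);
    }
    close(fd);
    return static_cast<ssize_t>(total);
}
Each worker can keep one such buffer for its whole lifetime, so the only address-space churn left is whatever the allocator does internally when the buffer occasionally grows. O_DIRECT is left out of the sketch because it adds alignment requirements (the buffer address, file offset and length typically have to be multiples of the block size).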
As far as
If so, why do 2 processes have better performance?
The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.

A few notes:
Try running your application with perf stat -ddd <app> and have a look at context-switches, cpu-migrations and page-faults numbers.
The threads probably contend for vm_area_struct in the kernel process structure on mmap and page faults. Try passing the MAP_POPULATE or MAP_LOCKED flag to mmap to minimize page faults. Alternatively, try calling mmap with MAP_POPULATE or MAP_LOCKED in the main thread only (you may like to ensure that all threads run on the same NUMA node in this case).
You may also like to experiment with MAP_HUGETLB and one of the MAP_HUGE_2MB or MAP_HUGE_1GB flags.
Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
Lock the mutex before setting stop = true, otherwise the condition variable notification can get lost and deadlock the waiting threads forever (see the sketch after these notes).
The p.is_regular_file() check doesn't require the mutex to be locked.
std::deque can be replaced with std::list, using splice() to push and pop elements so that the mutex is held for as little time as possible.
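A minimal, self-contained sketch of those last two points (this is not the original program; the worker count, item names and the std::cout output are made up for illustration). It sets stop while holding the mutex and uses std::list::splice() so the lock is only held for a pointer relink:
// Sketch only: producer/consumer skeleton showing `stop` being flipped under
// the mutex and splice()-based push/pop that does no allocation under the lock.
#include <condition_variable>
#include <iostream>
#include <list>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::list<std::string> queue;   // std::list so single nodes can be spliced
std::mutex queue_mutex;
std::condition_variable queue_cv;
bool stop = false;

void worker()
{
    while (true)
    {
        std::list<std::string> one;  // receives exactly one spliced node
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            queue_cv.wait(lock, [] { return stop || !queue.empty(); });
            if (queue.empty())
                return;              // stop requested and nothing left to do
            // O(1) pointer relinking; no allocation or copy under the lock.
            one.splice(one.begin(), queue, queue.begin());
        }
        std::cout << "processing " << one.front() << '\n';
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker);

    for (int i = 0; i < 100; ++i)
    {
        std::list<std::string> one{"file_" + std::to_string(i)};  // allocate outside the lock
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            queue.splice(queue.end(), one);
        }
        queue_cv.notify_one();
    }

    {
        // Take the mutex before flipping `stop`, otherwise a worker can check
        // the predicate, miss the flag, and then sleep through notify_all().
        std::lock_guard<std::mutex> lock(queue_mutex);
        stop = true;
    }
    queue_cv.notify_all();

    for (auto& t : threads)
        t.join();
}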

Related

High availability computing: How to deal with a non-returning system call, without risking false positives?

I have a process that's running on a Linux computer as part of a high-availability system. The process has a main thread that receives requests from the other computers on the network and responds to them. There is also a heartbeat thread that sends out multicast heartbeat packets periodically, to let the other processes on the network know that this process is still alive and available -- if they don't hear any heartbeat packets from it for a while, one of them will assume this process has died and will take over its duties, so that the system as a whole can continue to work.
This all works pretty well, but the other day the entire system failed, and when I investigated why I found the following:
Due to (what is apparently) a bug in the box's Linux kernel, there was a kernel "oops" induced by a system call that this process's main thread made.
Because of the kernel "oops", the system call never returned, leaving the process's main thread permanently hung.
The heartbeat thread, OTOH, continued to operate correctly, which meant that the other nodes on the network never realized that this node had failed, and none of them stepped in to take over its duties... and so the requested tasks were not performed and the system's operation effectively halted.
My question is, is there an elegant solution that can handle this sort of failure? (Obviously one thing to do is fix the Linux kernel so it doesn't "oops", but given the complexity of the Linux kernel, it would be nice if my software could handle other future kernel bugs more gracefully as well).
One solution I don't like would be to put the heartbeat generator into the main thread, rather than running it as a separate thread, or in some other way tie it to the main thread so that if the main thread gets hung up indefinitely, heartbeats won't get sent. The reason I don't like this solution is because the main thread is not a real-time thread, and so doing this would introduce the possibility of occasional false-positives where a slow-to-complete operation was mistaken for a node failure. I'd like to avoid false positives if I can.
Ideally there would be some way to ensure that a failed syscall either returns an error code, or if that's not possible, crashes my process; either of those would halt the generation of heartbeat packets and allow a failover to proceed. Is there any way to do that, or does an unreliable kernel doom my user process to unreliability as well?
My second suggestion is to use ptrace to find the current instruction pointer. You can have a parent thread that ptraces your process and interrupts it every second to check the current RIP value. This is somewhat complex, so I've written a demonstration program: (x86_64 only, but that should be fixable by changing the register names.)
#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <linux/ptrace.h>
#include <sys/user.h>
#include <time.h>

// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // "main" thread is now running under the monitor
    printf("Hello from main!");
    while (1) {
        int c = getchar();
        if (c == EOF) { break; }
        nanosleep(&(struct timespec) {0, 200 * 1000 * 1000}, NULL);
        putchar(c);
    }
    return 0;
}

int main(int argc, char *argv[]) {
    void *vstack = malloc(STACK_SIZE);
    pid_t v;
    if (clone(main_thread, vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags
        perror("failed to spawn child task");
        return 3;
    }
    printf("Target: %d; %d\n", v, getpid());
    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        exit(1);
    }
    struct user_regs_struct regs;
    fprintf(stderr, "beginning monitor...\n");
    while (1) {
        sleep(1);
        long ptv = ptrace(PTRACE_INTERRUPT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to interrupt main thread");
            break;
        }
        int status;
        if (waitpid(v, &status, __WCLONE) == -1) {
            perror("target wait failed");
            break;
        }
        if (!WIFSTOPPED(status)) { // this section is messy. do it better.
            fputs("target wait went wrong", stderr);
            break;
        }
        if ((status >> 8) != (SIGTRAP | PTRACE_EVENT_STOP << 8)) {
            fputs("target wait went wrong (2)", stderr);
            break;
        }
        ptv = ptrace(PTRACE_GETREGS, v, NULL, &regs);
        if (ptv == -1) {
            perror("failed to peek at registers of thread");
            break;
        }
        fprintf(stderr, "%ld -> RIP %llx RSP %llx\n",
                (long)time(NULL), (unsigned long long)regs.rip, (unsigned long long)regs.rsp);
        ptv = ptrace(PTRACE_CONT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to resume main thread");
            break;
        }
    }
    return 2;
}
Note that this is not production-quality code. You'll need to fix up a number of things.
Based on this, you should be able to figure out whether or not the program counter is advancing, and could combine this with other pieces of information (such as /proc/PID/status) to find if it's busy in a system call. You might also be able to extend the usage of ptrace to check what system calls are being used, so that you can check if it's a reasonable one to be waiting on.
This is a hacky solution, but I don't think that you'll find a non-hacky solution for this problem. Despite the hackiness, I don't think (this is untested) that it would be particularly slow; my implementation pauses the monitored thread once per second for a very short amount of time - which I would guess would be in the 100s of microseconds range. That's around 0.01% efficiency loss, theoretically.
I think you need a shared activity marker.
Have the main thread (or in a more general application, all worker threads) update the shared activity marker with the current time (or clock tick, e.g. by computing the "current" nanosecond from clock_gettime(CLOCK_MONOTONIC, ...)), and have the heartbeat thread periodically check when this activity marker was last updated, cancelling itself (and thus stopping the heartbeat broadcast) if there has not been any activity update within a reasonable time.
This scheme can easily be extended with a state flag if the workload is very sporadic. The main work thread sets the flag and updates the activity marker when it begins a unit of work, and clears the flag when the work has completed. If there is no work being done then the heartbeat is sent without checking the activity marker. If work is being done then the heartbeat is stopped if the time since the activity marker was updated exceeds the maximum processing time allowed for a unit of work. (Multiple worker threads each need their own activity marker and flag in this case, and the heartbeat thread can be designed to stop when any one worker thread gets stuck, or only when all worker threads get stuck, depending on their purposes and importance to the overall system).
(The activity marker value (and the work flag) will of course have to be protected by a mutex that must be acquired before reading or writing the value.)
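As a rough illustration of the basic version of this scheme (using C++ and std::chrono::steady_clock rather than clock_gettime(); the 5-second threshold, the names, and the placeholder work/send calls are invented here):
// Sketch only: a mutex-protected "last activity" timestamp shared between a
// worker loop and a heartbeat loop. The thresholds are arbitrary.
#include <chrono>
#include <mutex>
#include <thread>

using clock_type = std::chrono::steady_clock;

std::mutex marker_mutex;
clock_type::time_point last_activity = clock_type::now();

void touch_activity_marker()
{
    std::lock_guard<std::mutex> lock(marker_mutex);
    last_activity = clock_type::now();
}

bool main_thread_looks_alive(std::chrono::seconds max_silence)
{
    std::lock_guard<std::mutex> lock(marker_mutex);
    return clock_type::now() - last_activity < max_silence;
}

void heartbeat_loop()
{
    while (main_thread_looks_alive(std::chrono::seconds(5)))
    {
        // send_multicast_heartbeat();  // placeholder for the real packet send
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    // Falling out of the loop stops the heartbeat, so peers can fail over.
}

void main_loop()
{
    for (;;)
    {
        // handle_one_request();        // placeholder for the real work
        touch_activity_marker();        // proves this iteration completed
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

int main()
{
    std::thread hb(heartbeat_loop);
    main_loop();                        // never returns in this sketch
    hb.join();
}
The state-flag extension described above would add a flag next to last_activity, set when a unit of work begins and cleared when it completes, with the heartbeat thread only applying the timeout while the flag is set.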
Perhaps the heartbeat thread can also cause the whole process to commit suicide (e.g. kill(getpid(), SIGQUIT)) so that it can be restarted by having it be called in a loop in a wrapper script, especially if a process restart clears the condition in the kernel which would cause the problem in the first place.
One possible method would be to have another set of heartbeat messages from the main thread to the heartbeat thread. If it stops receiving messages for a certain amount of time, it stops sending them out as well. (And could try other recovery such as restarting the process.)
To solve the issue of the main thread actually just being in a long sleep, have a (properly-synchronized) flag that the heartbeat thread sets when it has decided that the main thread must have failed - and the main thread should check this flag at appropriate times (e.g. after the potential wait) to make sure that it hasn't been reported as dead. If it has, it stops running, because its job would have already been taken up by a different node.
The main thread can also send I-am-alive events to the heartbeat thread at other times than once around the loop - for example, if it's going into a long-running operation. Without this, there's no way to tell the difference between a failed main thread and a sleeping main thread.

GCD dispatch_async memory leak?

The following code will occupy ~410MB of memory and will not release it again. (The version using dispatch_sync instead of dispatch_async will require ~8MB memory)
I would expect a spike of high memory usage but it should go down again... Where is the leak?
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        for (int i = 0; i < 100000; i++) {
            dispatch_async(dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
                NSLog(@"test");
            });
        }
        NSLog(@"Waiting.");
        [[NSRunLoop mainRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:60]];
    }
    return 0;
}
I tried:
Adding @autoreleasepool around and inside the loop
Adding NSRunLoop run to the loop
I tried several combinations and never saw a decrease of memory (even after waiting minutes).
I'm aware of the GCD reference guide which contains the following statement:
Although GCD dispatch queues have their own autorelease pools, they make no guarantees as to when those pools are drained.
Is there a memory leak in this code? If not, is there a way to enforce the queue to release/drain the finished blocks?
An Objective-C block is backed by a C structure; here you create 100,000 block objects to be executed on background threads and then wait until the system can run them. Your device can only execute a limited number of threads at a time, so many blocks will sit around waiting before the OS starts them.
If you change "async" to "sync", the next block object is only created after the previous block has finished and been destroyed.
UPDATE
About the GCD thread pool.
GCD executes tasks on a thread pool whose threads are created and managed by the system. The system caches threads to save CPU time, and every dispatched task executes on a free thread.
From documentation:
——
Blocks submitted to dispatch queues are executed on a pool of threads fully managed by the system. No guarantee is made as to the thread on which a task executes.
——
If you run the tasks synchronously, there is always a free thread (from the GCD thread pool) available to execute the next task once the current one has finished (the main thread waits while each task executes and does not add new tasks to the queue), so the system does not allocate new NSThreads (on my Mac I've seen 2 threads). If you run the tasks asynchronously, the system can allocate many NSThreads to achieve maximum performance (on my Mac it was near 67 threads), because the global queue contains many tasks.
Here you can read about the maximum size of the GCD thread pool.
I've seen in the Allocations profiler that there are many NSThreads allocated and not destroyed. I think this is the system's pool, which will be freed if necessary.
Always put @autoreleasepool inside every GCD call and you will have no problems. I had the same problem and this is the only workaround.
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        for (int i = 0; i < 100000; i++) {
            dispatch_async(dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
                // everything INSIDE in an @autoreleasepool
                @autoreleasepool {
                    NSLog(@"test");
                }
            });
        }
        NSLog(@"Waiting.");
        [[NSRunLoop mainRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:60]];
    }
    return 0;
}

Thread local boost fast_pool_allocator

I have a multithreaded (Cilk) program where each thread uses a temporary std::set. There are a lot of allocations on these std::sets, so I'm trying to use a pool allocator, namely boost::fast_pool_allocator:
using allocator = boost::fast_pool_allocator< SGroup::type >;
using set = std::set<SGroup::type, std::less<SGroup::type>, allocator>;
But now the performance is much worse because of concurrent access to the allocator. One crucial fact is that the sets are never shared among the threads, so I could use thread-local allocators. However, as shown in the code above, I'm not constructing allocator objects but passing template parameters to std::set.
So here is my question: is it possible to construct multiple boost::fast_pool_allocator instances and use them as thread-local pool allocators?
Edit: I removed the stupid std::pair allocations.
EDIT
Mmm. I had an answer here that I pieced together from things I remembered seeing. However, upon further inspection it looks like all the allocators actually work with singleton pools that are never thread safe without synchronization. In fact, the null_mutex is likely in a detail namespace for this very reason: it only makes sense to use it if you know the program doesn't use threads (well, outside the main thread) at all.
Aside from this apparent debacle, you could probably use object_pool directly. But it's not an allocator, so it wouldn't serve you for your container example.
Original Answer Text:
You can pass an allocator instance at construction:
#include <boost/pool/pool.hpp>
#include <boost/pool/pool_alloc.hpp>
#include <boost/thread.hpp>
#include <set>

struct SGroup
{
    int data;
    typedef int type;
};

using allocator = boost::fast_pool_allocator<SGroup::type>;
using set = std::set<SGroup::type, std::less<SGroup::type>, allocator>;

void thread_function()
{
    allocator alloc; // thread local
    set myset(set::key_compare(), alloc);
    // do stuff
}

int main()
{
    boost::thread_group group;
    for (int i = 0; i < 10; ++i)
        group.create_thread(thread_function);
    group.join_all();
}
Let me read the docs on how to disable thread-awareness on the allocator :)
Found it in an example:
typedef boost::fast_pool_allocator<SGroup::type,
                                   boost::default_user_allocator_new_delete,
                                   boost::details::pool::null_mutex> allocator;
The example in boost/libs/pool/example/time_pool_alloc.hpp should help you get started benchmarking the difference(s) in performance.

linux RT scheduling

Our product runs Linux 2.6.32, and we have some user-space processes which run periodically - 'keep-alive'-like processes. We don't place hard requirements on these processes - they just need to run once every several seconds and refresh a watchdog.
We gave these processes a scheduling class of RR or FIFO with max priority, and yet, we see many false positives - it seems like they don't get the CPU for several seconds.
I find it very odd because I know that Linux, while not an RT OS, can still yield very good latency (I see people talking about orders of several msec), and yet I can't even get the process to run once in 5 seconds.
The logic of the Linux RT scheduler seems very straightforward and simple, so I suspected that these processes get blocked by something else - I/O contention, interrupts or kernel threads taking too long - but now I'm not so sure:
I wrote a really basic program to simulate such a process - it wakes up every 1 second and measures the time spent since the last time it finished running. The time measurement doesn't include blocking on any I/O as far as I understand, so the results printed by this process reflect the behavior of the scheduler:
#include <sched.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/param.h>
#include <time.h>
#include <unistd.h>

#define MICROSECONDSINASEC 1000000
#define MILLISECONDSINASEC 1000

int main()
{
    struct sched_param schedParam;
    struct timeval now, start;
    int spent_time = 0;
    time_t current_time;

    schedParam.sched_priority = sched_get_priority_max(SCHED_RR);
    int retVal = sched_setscheduler(0, SCHED_RR, &schedParam);
    if (retVal != 0)
    {
        printf("failed setting RT sched");
        return 0;
    }
    gettimeofday(&start, 0);
    start.tv_sec -= 1;
    start.tv_usec += MICROSECONDSINASEC;
    while (1)
    {
        sleep(1);
        gettimeofday(&now, 0);
        now.tv_sec -= 1;
        now.tv_usec += MICROSECONDSINASEC;
        spent_time = MILLISECONDSINASEC * (now.tv_sec - start.tv_sec) + ((now.tv_usec - start.tv_usec) / MILLISECONDSINASEC);
        FILE *fl = fopen("output_log.txt", "a");
        if (spent_time > 1100)
        {
            time(&current_time);
            fprintf(fl, "\n (%s) - had a gap of %d [msec] instead of 1000\n", ctime(&current_time), spent_time);
        }
        fclose(fl);
        gettimeofday(&start, 0);
    }
    return 0;
}
I ran this process overnight on several machines, including ones that don't run our product (just plain Linux), and I still saw gaps of several seconds, even though I made sure the process DID get the priority. I can't figure out why - technically this process should preempt any other running process, so how can it wait so long to run?
A few notes:
I ran these processes mostly on virtual machines, so maybe the hypervisor is interfering. But in the past I've seen such behavior on physical machines as well.
Making the process RT did improve the results drastically, but did not totally prevent the problem.
There are no other RT processes running on the machine except for the Linux migration and watchdog threads (which I don't believe can starve my processes).
What can I do? I feel like I'm missing something very basic here.
thanks!

why doesn't the System Monitor show correct CPU affinity?

I have searched for questions/answers on CPU affinity and read the results, but I still cannot get my threads to stay pinned to a single CPU.
I am working on an application that will be run on a dedicated Linux box, so I am not concerned about other processes, only my own. This app currently spawns off one pthread, and then the main thread enters a while loop to process control messages using POSIX message queues. This while loop blocks waiting for a control message to come in and then processes it. So the main thread is very simple and non-critical. My code is working very well, as I can send this app messages and it will process them just fine. All control messages are very small and are used just to control the functionality of the application; that is, only a few control messages are ever sent/received.
Before I enter this while loop, I use sched_getaffinity() to log all of the CPUs available. Then I use sched_setaffinity() to set this process to a single CPU. Then I call sched_getaffinity() again to check if it is set to run on only one CPU and it is indeed correct.
The single pthread that was spawned off does a similar thing. The first thing I do in the newly created pthread is call pthread_getaffinity_np() and check the available CPUs, then call pthread_setaffinity_np() to set it to a different CPU then call pthread_getaffinity_np() to check if it is set as desired and it is indeed correct.
This is what is confusing. When I run the app and view the CPU History in System Monitor, I see no difference from when I run the app without all of this set affinity stuff. The scheduler still runs a couple of seconds in each of the 4 CPUs on this quad core box. So it appears that the scheduler is ignoring my affinity settings.
Am I wrong to expect some proof that the main thread and the pthread are actually each running on their own single CPU? Or have I forgotten to do something more to get this to work as I intend?
Thanks,
-Andres
You have no answers yet, so I will give you what I can: some partial help.
Assuming you checked the return values from pthread_setaffinity_np:
How you assign your cpuset is very important: create it in the main thread, with the CPUs you want, and it will propagate to threads created afterwards. Did you check the return codes?
The cpuset you actually get will be the intersection of hardware available cpus and the cpuset you define.
min.h in the code below is a generic build include file. You have to define _GNU_SOURCE - please note the comment on the last line of the code. CPU_SET and CPU_SETSIZE are macros. I think I define them somewhere else, I do not remember; they may be in a standard header.
#define _GNU_SOURCE
#include "min.h"
#include <pthread.h>

int
main(int argc, char **argv)
{
    int s, j;
    cpu_set_t cpuset;
    pthread_t tid = pthread_self();

    // Set affinity mask to include CPUs 0 & 1
    CPU_ZERO(&cpuset);
    for (j = 0; j < 2; j++)
        CPU_SET(j, &cpuset);
    s = pthread_setaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_setaffinity_np");
        exit(1);
    }
    // let's see what we really have in the actual affinity mask assigned our thread
    s = pthread_getaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_getaffinity_np");
        exit(1);
    }
    printf("my cpuset has:\n");
    for (j = 0; j < CPU_SETSIZE; j++)
        if (CPU_ISSET(j, &cpuset))
            printf(" CPU %d\n", j);
    // #Andres note: any pthread_create call from here on creates a thread with the identical
    // cpuset - you do not have to call it in every thread.
    return 0;
}
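If you want direct proof of where each thread actually runs, rather than inferring it from the System Monitor graphs, you can ask the kernel with sched_getcpu() (glibc, requires _GNU_SOURCE). A small sketch, separate from the code above, assuming at least two CPUs are online:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for pthread_setaffinity_np() and sched_getcpu()
#endif
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Sketch only: pin the calling thread to `cpu`, then repeatedly report the
// CPU it is actually executing on.
static void pin_and_report(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
    {
        std::fprintf(stderr, "pthread_setaffinity_np(%d) failed: %d\n", cpu, rc);
        return;
    }
    for (int i = 0; i < 5; ++i)
    {
        std::printf("thread pinned to CPU %d is running on CPU %d\n",
                    cpu, sched_getcpu());
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int cpu = 0; cpu < 2; ++cpu)    // assumes CPUs 0 and 1 exist
        threads.emplace_back(pin_and_report, cpu);
    for (auto& t : threads)
        t.join();
}
With the affinity calls in place, each thread should consistently report the CPU it was pinned to. The System Monitor CPU history, by contrast, aggregates everything running on the box (other processes, kernel threads, interrupts), so activity on all four cores there does not mean your threads are migrating.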
