C++ data sharing between threads

Coming originally from Java, I'm having a problem sharing data between two threads in C++11. I have read thoroughly through the multithreading posts here without success, and I would simply like to know why my approach is not valid C++ multithreading syntax.
My application in short:
I have one thread reading a hardware sensor and dumping that data to a shared data monitor
I want another thread listening for data changes on that monitor and drawing some graphics based on the new data (yes, I'm using a condition variable in my monitor)
Below is my Main class with the main method:
#include <cstdlib>
#include <iostream>
#include <thread>
#include <sweep/sweep.hpp>
#include <pcl/ModelCoefficients.h>
#include <pcl/point_types.h>
#include <pcl/io/pcd_io.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/features/normal_3d.h>
#include "include/LiDAR.h"
#include "include/Visualizer.h"
void run_LiDAR(LiDAR* lidar) {
    lidar->run();
}

void visualize(Visualizer* visualizer) {
    visualizer->run();
}
int main(int argc, char* argv[]) try {
    Monitor mon;         // The monitor holding shared data
    LiDAR sensor(&mon);  // Sensor object dumping data to the monitor
    Visualizer vis(&mon); // Visualizer listening for data changes and updating the visuals accordingly
    std::thread sweep_thread(run_LiDAR, &sensor); // Starting the LiDAR thread
    std::cout << "Started Sweep thread" << std::endl;
    std::thread visualizer_thread(visualize, vis);
    std::cout << "Started Visualizer thread" << std::endl;
    while (1) {
        // Do some calculations on the data in Monitor mon
        mon.cluster();
    }
}
The sensor thread dumping the data works fine, and so does the main thread running the clustering algorithms. However, I get the following error message:
In file included from MY_DIRECTORY/Main.cpp:3: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/thread:336:5: error: attempt to use a deleted function
__invoke(_VSTD::move(_VSTD::get<1>(__t)), _VSTD::move(_VSTD::get<_Indices>(__t))...);
If I comment the line:
std::thread visualizer_thread(visualize, vis);
My program builds and works...
What am I not getting?
Kind regards,

What is happening is that Visualizer doesn't have a move constructor.
std::thread visualizer_thread(visualize, vis);
visualize() expects a pointer, but vis is passed by value, so std::thread tries to copy/move the Visualizer into the thread.
As an aside, you should make sure you have a mechanism to end your threads in an orderly manner, since the objects (sensor, vis) are destroyed when main() exits, leaving the threads reading/writing unallocated data on the stack!
Using dynamic allocation via std::unique_ptr or std::shared_ptr (which are movable) can also eliminate the issue.


Do QThreads run in parallel?

I have two threads running that simply print a message. Here is a minimal example.
Here is my Header.h:
#pragma once
#include <QtCore/QThread>
#include <QtCore/QDebug>

class WorkerOne : public QObject {
    Q_OBJECT
public Q_SLOTS:
    void printFirstMessage() {
        while (1) {
            qDebug() << "<<< Message from the FIRST worker" << QThread::currentThreadId();
        }
    }
};

class WorkerTwo : public QObject {
    Q_OBJECT
public Q_SLOTS:
    void printSecondMessage() {
        while (1) {
            qDebug() << ">>> Message from the SECOND worker" << QThread::currentThreadId();
        }
    }
};
And, of course, my main:
#include <QtCore/QCoreApplication>
#include "Header.h"
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    WorkerOne kek1;
    QThread t1;
    kek1.moveToThread(&t1);
    t1.setObjectName("FIRST THREAD");

    QThread t2;
    WorkerTwo kek2;
    kek2.moveToThread(&t2);
    t2.setObjectName("SECOND THREAD");

    QObject::connect(&t1, &QThread::started, &kek1, &WorkerOne::printFirstMessage);
    QObject::connect(&t2, &QThread::started, &kek2, &WorkerTwo::printSecondMessage);

    t1.start();
    t2.start();

    return a.exec();
}
When I start the application I see the expected output:
As you can see, the thread ids differ; printing them was added to be sure the workers run on different threads.
I set a single breakpoint in printFirstMessage and ran the application under the debugger. Once the debugger stopped at my breakpoint, I waited for a while and pressed Continue, so the debugger stopped at the same breakpoint again.
What did I expect to see? Only one <<< Message from the FIRST worker and many messages from the second worker. But what did I actually see? Only two messages: one from the first worker and one from the second.
I pressed Continue many times and the result was more or less the same. That's weird to me, because I expected the second thread to keep running while the first one is stopped by the debugger.
I decided to test it using std::thread and wrote the following code:
#include <thread>
#include <iostream>
void foo1() {
while (true) {
std::cout << "Function ONE\n";
}
}
void foo2() {
while (true) {
std::cout << "The second function\n";
}
}
int main() {
std::thread t1(&foo1);
std::thread t2(&foo2);
t1.join();
t2.join();
}
I set a breakpoint in the first function and started the app; after stopping at the breakpoint I hit Continue and see that the console contains a lot of messages from the second function and only one from the first (exactly what I expected from QThread as well):
Could someone explain how this works with QThread? By the way, I tested it using QtConcurrent::run instead of QThread and the result was as expected: the second function keeps running while the first one is stopped at a breakpoint.
Yes, multiple QThread instances are allowed to run in parallel. Whether they effectively run in parallel is up to your OS and depends on multiple factors:
The number of physical (and logical) CPU cores, typically no more than 4 or 8 on a consumer computer. This is the maximum number of threads (including the threads of other programs and the OS itself) that can effectively run in parallel, and it is much lower than the number of threads typically alive on a computer. If your computer has only one core you can still use multiple QThreads, but the OS scheduler will alternate between them. QThread::idealThreadCount() can be used to query the number of logical CPU cores.
Each thread has a QThread::Priority. The OS thread scheduler may use this value to prioritize (or de-prioritize) one thread over another. A thread with a lower priority may get less CPU time than a thread with a higher priority when the CPU cores are busy.
The (workload on the) other threads that are currently running.
Debugging definitely alters the normal execution of a multithreaded program:
Interrupting and continuing a thread has a certain overhead. In the meantime, the other threads may still/already perform some operations.
As pointed out by G.M., most of the time all threads are interrupted when a breakpoint is hit. How quickly the other threads are interrupted is not well defined.
Often a debugger has a configuration option to allow interrupting a single thread while the others continue running; see e.g. this question.
The number of loops that are executed while the other thread is interrupted/started again, depends on the number of CPU instructions that are needed to perform a single loop. Calling qDebug() and QThread::currentThreadId() is definitely slower than a single std::cout.
Conclusion: you don't have any hard guarantee about the scheduling of a thread. However, in normal operation both threads will get almost the same amount of CPU time on average, since the OS scheduler has no reason to favor one over the other. Using a debugger completely alters this normal behavior.

mmap: performance when using multithreading

I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.
The problem I am facing right now is that 1 process with N worker threads has worse performance than 2 processes, each with N/2 worker threads. I can see this in iotop: 1 process + N threads uses only around 75% of the disk bandwidth, whereas 2 processes with N/2 threads each use the full bandwidth.
Some notes:
This happens only if I use mmap()/munmap(). I tried replacing them with fopen()/fread() and it worked just fine. But since the mmap()/munmap() calls come from a 3rd-party library, I would like to use them in their original form.
madvise() is called with MADV_SEQUENTIAL, but removing it or changing the advice argument doesn't seem to change anything (or just slows things down).
Thread affinity doesn't seem to matter. I have tried to limit each thread to specific core. I have also tried to limit threads to core pairs (Hyper Threading). No results so far.
Load reported by htop seems to be the same even in both cases.
So my questions are:
Is there anything about mmap() I am not aware of when used in multithreaded environment?
If so, why do 2 processes have better performance?
EDIT:
As pointed out in the comments, this is running on a server with two physical CPUs. I should probably try to set thread affinities so that everything runs on the same CPU, but I think I already tried that and it didn't work.
Here is a piece of code with which I can reproduce the same issue as with my production software.
#include <condition_variable>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef WORKERS
#define WORKERS 16
#endif

bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;

std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
    int fd = open(file_path.data(), O_RDONLY);
    if (fd != -1)
    {
        auto dir_ent = std::filesystem::directory_entry{file_path.data()};
        if (dir_ent.is_regular_file())
        {
            auto size = dir_ent.file_size();
            auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            madvise(data, size, MADV_SEQUENTIAL);
            close(fd);
            return { reinterpret_cast<const std::uint8_t*>(data), size };
        }
        close(fd);
    }
    return { nullptr, 0 };
}

void unmap_file(const std::uint8_t* data, std::size_t size)
{
    munmap((void*)data, size);
}

int main(int argc, char* argv[])
{
    std::deque<std::string> queue;
    std::vector<std::thread> threads;

    for (std::size_t i = 0; i < WORKERS; ++i)
    {
        threads.emplace_back(
            [&]() {
                std::string path;
                while (true)
                {
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        while (!stop && queue.empty())
                            queue_cv.wait(lock);
                        if (stop && queue.empty())
                            return;
                        path = queue.front();
                        queue.pop_front();
                    }
                    auto [data, size] = map_file(path);
                    std::uint8_t b = 0;
                    for (auto itr = data; itr < data + size; ++itr)
                        b ^= *itr;
                    unmap_file(data, size);
                    std::cout << (int)b << std::endl;
                }
            }
        );
    }

    for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        if (p.is_regular_file())
        {
            queue.push_back(p.path().native());
            queue_cv.notify_one();
        }
    }

    stop = true;
    queue_cv.notify_all();

    for (auto& t : threads)
        t.join();

    return 0;
}
Is there anything about mmap() I am not aware of when used in multithreaded environment?
Yes. mmap() requires significant virtual memory manipulation - effectively single-threading your process in places. Per this post from one Linus Torvalds:
... playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.
So the virtual memory manipulations mapping files requires are not only expensive, they really can't be done in parallel - there's only one chunk of actual physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a process's virtual address space.
You'd almost certainly get better performance by reusing a memory buffer for each file: each buffer is created once and is large enough to hold any file read into it, and each file is read with low-level POSIX read() call(s). You might also want to experiment with page-aligned buffers and direct I/O, calling open() with the (Linux-specific) O_DIRECT flag to bypass the page cache: since you apparently never re-read any data, any caching is a waste of memory and CPU cycles.
Reusing the buffer also completely eliminates any munmap() or delete/free().
You'd have to manage the buffers, though. Perhaps prepopulating a queue with N precreated buffers, and returning a buffer to the queue when done with a file?
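A minimal sketch of the buffer-reuse idea (the function name and chunk size are made up for illustration): each worker keeps one such buffer alive for its whole lifetime, so no mmap()/munmap() happens per file.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cstdint>
#include <string>
#include <vector>

// Read a whole file into `buffer`, reusing its capacity across calls.
// Returns the number of bytes read, or -1 on error.
ssize_t read_file_into(std::vector<std::uint8_t>& buffer, const std::string& path)
{
    int fd = open(path.c_str(), O_RDONLY);
    if (fd == -1)
        return -1;

    buffer.clear();               // keeps the allocated capacity
    std::uint8_t chunk[1 << 16];  // 64 KiB read granularity
    ssize_t n;
    while ((n = read(fd, chunk, sizeof chunk)) > 0)
        buffer.insert(buffer.end(), chunk, chunk + n);

    close(fd);
    return n == 0 ? static_cast<ssize_t>(buffer.size()) : -1;
}
```

After the first few files, buffer.clear() leaves the vector's allocation in place, so the steady state does no per-file memory management at all.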
As far as
If so, why do 2 processes have better performance?
The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.
A few notes:
Try running your application with perf stat -ddd <app> and have a look at context-switches, cpu-migrations and page-faults numbers.
The threads probably contend for vm_area_struct in the kernel process structure on mmap and page faults. Try passing MAP_POPULATE or MAP_LOCKED flag into mmap to minimize page faults. Alternatively, try mmap with MAP_POPULATE or MAP_LOCKED flag in the main thread only (you may like to ensure that all threads run on the same NUMA node in this case).
You may also like to experiment with MAP_HUGETLB and one of MAP_HUGE_2MB, MAP_HUGE_1GB flags.
Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
Lock the mutex before setting stop = true; otherwise the condition-variable notification can get lost and deadlock a waiting thread forever.
The p.is_regular_file() check doesn't require the mutex to be locked.
std::deque can be replaced with std::list, using splice() to push and pop elements and so minimize the time the mutex is held.

Linux kernel module blocking entire Linux

I wrote my first simple Linux module, which flashes an LED. If I use ssleep(1) for the pause between LED on and LED off then everything is okay, but if I use udelay(40) then all of Linux and applications such as SSH, the web server etc. freeze. Could you help me understand why this happens and how to fix it?
#include <linux/init.h>
#include <linux/module.h>
#include <linux/delay.h>
#include <linux/gpio.h>
#include <mach/gpio.h>

MODULE_LICENSE("GPL");

static int led_on_init(void)
{
    gpio_direction_output(AT91_PIN_PA24, 0);
    int i = 1;
    while (i == 1)
    {
        gpio_set_value(AT91_PIN_PA24, 1);
        /* udelay(40); */
        ssleep(1);
        gpio_set_value(AT91_PIN_PA24, 0);
        ssleep(1);
        /* udelay(40); */
    }
}

static void led_on_exit(void)
{
    gpio_set_value(AT91_PIN_PA24, 0);
}

module_init(led_on_init);
module_exit(led_on_exit);
udelay() is a busy-waiting function, while ssleep() schedules the current task out (so other tasks can run) and resumes it when the time is up.
So if your kernel is not configured as a preemptible one, the CPU running the udelay() loop never gets a chance to schedule anything else. If your machine has only one CPU, the entire machine is blocked.
In your circumstances, it's recommended to use a sleeping delay such as ssleep() instead of udelay().

threads and locks

I do not know anything about multithreaded programming, so I wanted to post a general question here. How can I do the following:
main()
    run MyMethod every 30 seconds

MyMethod()
    1. get data
    2. do calculations
    3. save result into file
How can I make sure that saving the results (MyMethod step 3) has finished before main runs MyMethod again? Basically I have to block somehow until MyMethod is done. Feel free to use any language as an example; I'm more interested in how such things are done in practice.
Thanks
You don't need extra synchronization. You only need to make sure the thread's work has completed before the next run starts, and join() does exactly that; since saving happens at the end of MyMethod, the save is finished once join() returns.
#include <chrono>
#include <thread>

void MyMethod() {
    // 1. get data, 2. do calculations, 3. save result into file
}

void run() {
    std::thread thrd(MyMethod);
    std::this_thread::sleep_for(std::chrono::seconds(30));
    thrd.join(); // blocks until MyMethod (including the save) has finished
}

int main() {
    while (true)
        run();
}

Saving gmon.out before killing a process

I would like to use gprof to profile a daemon. My daemon uses a 3rd party library, with which it registers some callbacks, then calls a main function, that never returns. I need to call kill (either SIGTERM or SIGKILL) to terminate the daemon. Unfortunately, gprof's manual page says the following:
The profiled program must call "exit"(2) or return normally for the
profiling information to be saved in the gmon.out file.
Is there is way to save profiling information for processes which are killed with SIGTERM or SIGKILL ?
First, I would like to thank @wallyk for the good initial pointers. I solved my issue as follows. Apparently, libc's gprof exit handler is called _mcleanup, so I registered a signal handler for SIGUSR1 (unused by the 3rd-party library) that calls _mcleanup and then _exit. Works perfectly! The code looks as follows:
#include <dlfcn.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void sigUsr1Handler(int sig)
{
    fprintf(stderr, "Exiting on SIGUSR1\n");
    void (*_mcleanup)(void);
    _mcleanup = (void (*)(void))dlsym(RTLD_DEFAULT, "_mcleanup");
    if (_mcleanup == NULL)
        fprintf(stderr, "Unable to find gprof exit hook\n");
    else
        _mcleanup();
    _exit(0);
}

int main(int argc, char* argv[])
{
    signal(SIGUSR1, sigUsr1Handler);
    neverReturningLibraryFunction();
}
You could add a signal handler for a signal the third-party library doesn't catch or ignore. SIGUSR1 is probably good enough, but you will either have to experiment or read the library's documentation, if it is thorough enough.
Your signal handler can simply call exit().