Parallel ray tracing in 16x16 chunks - multithreading

My ray tracer is currently multithreaded: I'm dividing the image into as many chunks as the system has hardware threads and rendering them in parallel. However, not all chunks take the same time to render, so for roughly half of the run time the CPU usage sits at only about 50%.
Code
std::shared_ptr<bitmap_image> image = std::make_shared<bitmap_image>(WIDTH, HEIGHT);
auto nThreads = std::thread::hardware_concurrency();
std::cout << "Resolution: " << WIDTH << "x" << HEIGHT << std::endl;
std::cout << "Supersampling: " << SUPERSAMPLING << std::endl;
std::cout << "Ray depth: " << DEPTH << std::endl;
std::cout << "Threads: " << nThreads << std::endl;
std::vector<RenderThread> renderThreads(nThreads);
std::vector<std::thread> tt;
auto size = WIDTH*HEIGHT;
auto chunk = size / nThreads;
auto rem = size % nThreads;
//launch threads
for (unsigned i = 0; i < nThreads - 1; i++)
{
tt.emplace_back(std::thread(&RenderThread::LaunchThread, &renderThreads[i], i * chunk, (i + 1) * chunk, image));
}
tt.emplace_back(std::thread(&RenderThread::LaunchThread, &renderThreads[nThreads-1], (nThreads - 1)*chunk, nThreads*chunk + rem, image));
for (auto& t : tt)
t.join();
I would like to divide the image into 16x16 chunks or something similar and render them in parallel, so that after each chunk is rendered, the thread switches to the next one, and so on. This would greatly increase CPU usage and reduce the run time.
How do I set up my ray tracer to render these 16x16 chunks in a multithreaded manner?

I assume the question is "How to distribute the blocks to the various threads?"
In your current solution, you're figuring out the regions ahead of time and assigning them to the threads. The trick is to turn this idea on its head. Make the threads ask for what to do next whenever they finish a chunk of work.
Here's an outline of what the threads will do:
void WorkerThread(Manager *manager) {
while (auto task = manager->GetTask()) {
task->Execute();
}
}
So you create a Manager object that returns a chunk of work (in the form of a Task) each time a thread calls its GetTask method. Since that method will be called from multiple threads, you have to be sure it uses appropriate synchronization.
std::unique_ptr<Task> Manager::GetTask() {
std::lock_guard guard(mutex);
std::unique_ptr<Task> t;
if (next_row < HEIGHT) {
t = std::make_unique<Task>(next_row);
++next_row;
}
return t;
}
In this example, the manager creates a new task to ray trace the next row. (You could use 16x16 blocks instead of rows if you like.) When all the tasks have been issued, it just returns an empty pointer, which essentially tells the calling thread that there's nothing left to do, and the calling thread will then exit.
If you made all the Tasks in advance and had the manager dole them out as they are requested, this would be a typical "work queue" solution. (General work queues also allow new Tasks to be added on the fly, but you don't need that feature for this particular problem.)
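Here is a minimal sketch of a block-based manager along those lines. The names BlockTask and BlockManager are illustrative rather than from the question; it assumes <mutex>, <memory> and <algorithm> are included and reuses the WIDTH/HEIGHT constants:
struct BlockTask {
    unsigned x0, y0, x1, y1;   // pixel rectangle [x0, x1) x [y0, y1)
};

class BlockManager {
public:
    std::unique_ptr<BlockTask> GetTask() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (next_y >= HEIGHT)
            return nullptr;                          // all blocks handed out
        auto t = std::make_unique<BlockTask>();
        t->x0 = next_x;
        t->y0 = next_y;
        t->x1 = std::min<unsigned>(next_x + 16, WIDTH);
        t->y1 = std::min<unsigned>(next_y + 16, HEIGHT);
        next_x += 16;
        if (next_x >= WIDTH) {                       // wrap to the next row of blocks
            next_x = 0;
            next_y += 16;
        }
        return t;
    }
private:
    std::mutex mutex_;
    unsigned next_x = 0;
    unsigned next_y = 0;
};
Each worker thread then loops on GetTask() exactly as in the WorkerThread outline above, rendering its 16x16 rectangle before asking for the next one.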

I do this a bit differently:
obtain the number of CPUs and/or cores
You did not specify the OS, so you need to use your OS API for this; search for "system affinity mask".
divide screen into threads
I divide the screen by lines instead of 16x16 blocks, so I do not need a queue or anything similar. Simply create a thread for each CPU/core that processes only its own horizontal lines of rays. Each thread gets an ID number counting from zero and knows the number of CPUs/cores n, so the lines belonging to each thread are:
y = ID + i*n
where i = {0,1,2,3,...}; stop once y is greater than or equal to the vertical screen resolution. This type of access has its advantages: for example, accessing the screen buffer via scanlines will not conflict between threads, since each thread touches only its own lines. A sketch of the scheme follows below.
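As a rough sketch of that scheme (RenderScanline, HEIGHT and nThreads are placeholders standing in for your own rendering code and constants):
void RenderInterleaved(unsigned id, unsigned n)
{
    // thread 'id' of 'n' renders rows id, id + n, id + 2n, ...
    for (unsigned y = id; y < HEIGHT; y += n)
        RenderScanline(y);              // each thread writes only its own scanlines
}

// one thread per CPU/core:
std::vector<std::thread> threads;
for (unsigned id = 0; id < nThreads; ++id)
    threads.emplace_back(RenderInterleaved, id, nThreads);
for (auto &t : threads)
    t.join();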
I also set the affinity mask for each thread so it uses only its own CPU/core. That gave me a small boost, since there is less process switching (but that was on older OS versions; hard to say what it does now).
synchronize threads
Basically, you should wait until all the threads are finished; once they are, render the result on screen. Your threads can either terminate (and you create new ones for the next frame) or drop into Sleep loops until rendering is triggered again.
I am using the latter approach so I do not need to create and configure the threads over and over again, but beware: Sleep(1) can sleep a lot longer than just 1 ms.

Related

Memory not freed on Mac when vector push_back string

Code is below. I found that when a vector push_back's a string in a Mac demo app, the memory is not freed. I thought the stack variable would be freed when it goes out of function scope; am I wrong? Thanks for any tips.
in model.h:
#pragma once
namespace NS {
const uint8_t kModel[8779041] = {4,0,188,250,....};
}
in ViewController.mm:
- (void)start {
std::vector<std::string> params = {};
std::string strModel(reinterpret_cast<const char *>(NS::kModel), sizeof(NS::kModel));
params.push_back(strModel);
}
The answer to your question depends on your understanding of "free" memory. The behaviour you are observing can be reproduced with just a couple of lines of code:
void myFunc() {
const auto *ptr = new uint8_t[8779041]{};
delete[] ptr;
}
Let's run this function and see how the memory consumption graph changes:
int main() {
myFunc(); // 1 MB
std::cout << "Check point" << std::endl; // 9.4 MB
return 0;
}
If you put one breakpoint right at the line with the myFunc() invocation and another one at the line with the "Check point" console output, you will witness how the memory consumption of the process jumps by about 8 MB (with my system and machine configuration, Xcode shows a sudden jump from 1 MB to 9.4 MB). But wait, isn't it supposed to be 1 MB again after the function, since the allocated memory is freed at the end of the function? Well, not exactly. The system doesn't reclaim this memory right away, because that's not a cheap operation to begin with, and if your process requests the same amount of memory one CPU cycle later, reclaiming it would have been quite redundant. Thus, the system usually doesn't bother shrinking the memory dedicated to a process until it's needed for another process or until it runs out of available resources (it can also be some kind of fixed timer, but overall I would say this is implementation-defined).
Another common reason the memory is not freed is that you often observe it in debug mode, where the memory remains dedicated to the process to track some tricky scenarios (like NSZombie objects, whose addresses need to remain accessible to the process in order to report use-after-free occasions).
The most important point here is that, internally, the process can differentiate between "deleted" and "occupied" memory pages, so it can re-occupy memory which has already been deleted. As a result, no matter how many times you call the same function, the memory dedicated to the process remains the same:
int main() {
myFunc(); // 1 MB
std::cout << "Check point" << std::endl; // 9.4 MB
for (int i = 0; i < 10000; ++i) {
myFunc();
}
std::cout << "Another point" << std::endl; // 9.4 MB
return 0;
}

Is std::map insert thread-safe in C++11?

I have some very simple code in which multiple threads try to insert data into a std::map, and as I understand it this should lead to a program crash because it is a data race:
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <map>
#include <thread>

std::map<long long,long long> k1map;
void Ktask()
{
for(int i=0;i<1000;i++)
{
long long random_variable = (std::rand())%1000;
std::cout << "Thread ID -> " << std::this_thread::get_id() << " with looping index " << i << std::endl;
k1map.insert(std::make_pair(random_variable, random_variable));
}
}
int main()
{
std::srand((int)std::time(0)); // use current time as seed for random generator
for (int i = 0; i < 1000; ++i)
{
std::thread t(Ktask);
std::cout << "Thread created " << t.get_id() << std::endl;
t.detach();
}
return 0;
}
However, I ran it multiple times and there was no application crash, and when I run the same code with pthreads and C++03 the application crashes. So I am wondering: is there some change in C++11 that makes map insertion thread-safe?
No, std::map::insert is not thread-safe.
There are many reasons why your example may not crash. Your threads may be running in a serial fashion due to the system scheduler, or because they finish very quickly (1000 iterations isn't that much). Your map will fill up quickly (there are only 1000 possible keys), so later insertions won't actually modify the structure, which reduces the possibility of crashes. Or perhaps the implementation you're using IS thread-safe.
For most standard library types, the only thread safety guarantee you get is that it is safe to use separate object instances in separate threads. That's it.
And std::map is not one of the exceptions to that rule. An implementation might offer you more of a guarantee, or you could just be getting lucky.
And when it comes to fixing threading bugs, there's only one kind of luck.
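For completeness, a minimal sketch of one common fix: serializing access to the shared map with a std::mutex (the mutex and its name are an addition for illustration, not part of the original code). Note that std::rand() itself is also not guaranteed to be thread-safe:
#include <cstdlib>
#include <map>
#include <mutex>

std::map<long long, long long> k1map;
std::mutex k1mutex;

void Ktask()
{
    for (int i = 0; i < 1000; ++i)
    {
        long long random_variable = std::rand() % 1000;
        std::lock_guard<std::mutex> lock(k1mutex);   // one thread modifies the map at a time
        k1map.insert(std::make_pair(random_variable, random_variable));
    }
}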

How to parallelize "while" loop by the using of PPL

I need to parallelize "while" loop by the means of PPL. I have the following code in Visual C++ in MS VS 2013.
int WordCount::CountWordsInTextFiles(basic_string<char> p_FolderPath, vector<basic_string<char>>& p_TextFilesNames)
{
// Word counter in all files.
atomic<unsigned> wordsInFilesTotally = 0;
// Critical section.
critical_section cs;
// Set specified folder as current folder.
::SetCurrentDirectory(p_FolderPath.c_str());
// Concurrent iteration through p_TextFilesNames vector.
parallel_for(size_t(0), p_TextFilesNames.size(), [&](size_t i)
{
// Create a stream to read from file.
ifstream fileStream(p_TextFilesNames[i]);
// Check if the file is opened
if (fileStream.is_open())
{
// Word counter in a particular file.
unsigned wordsInFile = 0;
// Read from file.
while (fileStream.good())
{
string word;
fileStream >> word;
// Count total number of words in all files.
wordsInFilesTotally++;
// Count total number of words in a particular file.
wordsInFile++;
}
// Verify the values.
cs.lock();
cout << endl << "In file " << p_TextFilesNames[i] << " there are " << wordsInFile << " words" << endl;
cs.unlock();
}
});
// The critical section is destroyed automatically at the end of the scope.
// Return total number of words in all files in the folder.
return wordsInFilesTotally;
}
This code performs parallel iteration through the std::vector in the outer loop; the parallelism is provided by the concurrency::parallel_for() algorithm. But the code also has a nested "while" loop that reads from the file. I need to parallelize this nested "while" loop as well. How can this nested "while" loop be parallelized by means of PPL? Please help.
As user High Performance Mark hints in his comment, parallel reads from the same ifstream instance will cause undefined and incorrect behavior. (For some more discussion, see question "Is std::ifstream thread-safe & lock-free?".) You're basically at the parallelization limit here with this particular algorithm.
As a side note, even reading multiple different file streams in parallel will not really speed things up if they are all being read from the same physical volume. The disk hardware can only actually support so many parallel requests (typically not more than one at a time, queuing up any requests that come in while it is busy). For some more background, you might want to check out Mark Friedman's Top Six FAQs on Windows 2000 Disk Performance; the performance counters are Windows-specific, but most of the information is of general use.
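For reference, here is a hedged sketch of keeping the per-file read sequential inside the parallel_for body while using the idiomatic extraction loop (this replaces only the while (fileStream.good()) part of the question's code):
// Inside the parallel_for body: each thread reads its own file sequentially.
std::ifstream fileStream(p_TextFilesNames[i]);
unsigned wordsInFile = 0;
std::string word;
while (fileStream >> word)     // stops cleanly at end-of-file, no extra count
{
    ++wordsInFilesTotally;     // atomic total across all files
    ++wordsInFile;             // per-file count
}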

SystemC: channels vs port value update

While working on a SystemC project, I discovered that I probably have some confused ideas about signals and ports. Let's say I have something like this:
//cell.hpp
SC_MODULE(Cell)
{
sc_in<sc_uint<16> > datain;
sc_in<sc_uint<1> > addr_en;
sc_in<sc_uint<1> > enable;
sc_out<sc_uint<16> > dataout;
SC_CTOR(Cell)
{
SC_THREAD(memory_cell);
sensitive << enable << datain << addr_en;
}
private:
void memory_cell();
};
//cell.cpp
void Cell::memory_cell()
{
unsigned short data_cell=11;
while(true)
{
//wait for some input
wait();
if (enable->read()==1 && addr_en->read()==1)
{
data_cell=datain->read();
}
else
{
if(enable->read()==0 && addr_en->read()==1)
{
dataout->write(data_cell);
}
}
}
}
//test.cpp
SC_MODULE(TestBench)
{
sc_signal<sc_uint<1> > address_en_s;
sc_signal<sc_uint<16> > datain_s;
sc_signal<sc_uint<1> > enable_s;
sc_signal<sc_uint<16> > dataout_s;
Cell cella;
SC_CTOR(TestBench) : cella("cella")
{
// Binding
cella.addr_en(address_en_s);
cella.datain(datain_s);
cella.enable(enable_s);
cella.dataout(dataout_s);
SC_THREAD(stimulus_thread);
}
private:
void stimulus_thread() {
//write a value:
datain_s.write(81);
address_en_s.write(1);
enable_s.write(1);
wait(SC_ZERO_TIME);
//read what we have written:
enable_s.write(0);
address_en_s.write(1);
wait(SC_ZERO_TIME);
cout << "Output value: " << dataout_s.read() << endl;
//let's cycle the memory again:
address_en_s.write(0);
wait(SC_ZERO_TIME);
cout << "Output value: " << dataout_s.read() << endl;
}
};
I've tried running these modules and I've noticed something weird (at least, weird for me): when the stimulus writes a value (81), after the wait(SC_ZERO_TIME) the memory thread finds its datain, enable and addr_en values already updated. This is what I expected to happen. The same happens when the stimulus changes the enable_s value in order to run another cycle of the memory thread and copy the data_cell value to the memory cell's dataout port. What I don't understand is why, after the memory module writes into its dataout port and returns to the wait() statement at the beginning of the while loop, the stimulus module still sees the old value on its dataout_s channel (0) and not the new value (81) that the memory module has just written. Then, if I run another cycle of the memory loop (for example by changing some values on the stimulus channels), the dataout channel finally updates.
In other words, it looks like if I write into the stimulus channels and then switch to the memory thread, the memory finds the values updated; but if the memory thread writes into its ports and I then switch to the stimulus thread, the stimulus thread still sees the old values on its channels (bound to the memory ports).
The example above does not work as I expected because of incorrect delta-cycle synchronization.
Generally speaking, let's suppose we have two threads running in two modules, A and B, connected through a channel. If I write something in thread A during delta cycle 1, it will only be available in thread B during delta cycle 2. And if thread B writes something during its delta cycle 2, thread A has to wait until delta cycle 3 in order to read it.
With this in mind, the stimulus thread needs two consecutive wait(SC_ZERO_TIME) statements in order to read the correct output from the memory: the first delta lets the memory thread react and write dataout, and the second lets that write become visible on the channel.
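As a minimal sketch based on the code above, the stimulus thread would become (comments mark the delta cycles):
void stimulus_thread() {
    // write a value into the cell:
    datain_s.write(81);
    address_en_s.write(1);
    enable_s.write(1);
    wait(SC_ZERO_TIME);     // memory thread sees the write request in the next delta

    // read it back:
    enable_s.write(0);
    address_en_s.write(1);
    wait(SC_ZERO_TIME);     // delta 1: memory thread sees enable==0 and writes dataout
    wait(SC_ZERO_TIME);     // delta 2: the dataout update becomes visible on dataout_s
    cout << "Output value: " << dataout_s.read() << endl;  // now prints 81
}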

Linux clock_gettime() elapse spikes?

I'm trying to get high-resolution timestamps on Linux. Using clock_gettime() as shown below, I get "spike" elapses that look pretty horrible, up to almost 26 microseconds. Most of the "dt" values are around 30 ns. I am on Linux 2.6.32, Red Hat 4.4.6. 'lscpu' shows CPU MHz = 2666.121. I thought that means each clock tick takes about 2 ns, so asking for ns resolution didn't seem too unreasonable here.
Output of the program:
1397534268,40823395 1397534268,40827950,dt=4555
1397534268,41233555 1397534268,41236716,dt=3161
1397534268,41389902 1397534268,41392922,dt=3020
1397534268,46488430 1397534268,46491674,dt=3244
1397534268,46531297 1397534268,46534279,dt=2982
1397534268,46823368 1397534268,46849336,dt=25968
1397534268,46915657 1397534268,46918663,dt=3006
1397534268,51488643 1397534268,51491791,dt=3148
1397534268,51530490 1397534268,51533496,dt=3006
1397534268,51823307 1397534268,51826904,dt=3597
1397534268,55823359 1397534268,55827826,dt=4467
1397534268,60531184 1397534268,60534183,dt=2999
1397534268,60823381 1397534268,60844866,dt=21485
1397534268,60913003 1397534268,60915998,dt=2995
1397534268,65823269 1397534268,65827742,dt=4473
1397534268,70823376 1397534268,70835280,dt=11904
1397534268,75823489 1397534268,75828872,dt=5383
1397534268,80823503 1397534268,80859500,dt=35997
1397534268,86823381 1397534268,86831907,dt=8526
Any ideas? Thanks.
#include <vector>
#include <iostream>
#include <time.h>
long long elapse( const timespec& t1, const timespec& t2 )
{
return ( t2.tv_sec * 1000000000L + t2.tv_nsec ) -
( t1.tv_sec * 1000000000L + t1.tv_nsec );
}
int main()
{
const unsigned n=30000;
timespec ts;
std::vector<timespec> t( n );
for( unsigned i=0; i < n; ++i )
{
clock_gettime( CLOCK_REALTIME, &ts );
t[i] = ts;
}
std::vector<long> dt( n );
for( unsigned i=1; i < n; ++i )
{
dt[i] = elapse( t[i-1], t[i] );
if( dt[i] > 1000 )
{
std::cerr <<
t[i-1].tv_sec << ","
<< t[i-1].tv_nsec << " "
<< t[i].tv_sec << ","
<< t[i].tv_nsec
<< ",dt=" << dt[i] << std::endl;
}
else
{
//normally I get dt[i] = approx 30-35 nano secs
}
}
return 0;
}
The numbers you quoted are in the 3 to 30 microsecond range (3,000 to 30,000 nanoseconds). That is too short a time to be a context switch to another thread/process, let the other thread run, and context switch back to your thread. Most likely the core where your process was running was used by the kernel to service an external interrupt (e.g. network card, disk, timer), then returned to running your process.
You can watch the linux interrupt counters (per CPU core and per source) with this command
watch -d -n 0.2 cat /proc/interrupts
The -n 0.2 causes the command to be issued at 5 Hz; the -d flag highlights what has changed.
The source of the interrupt could also be a TLB shootdown, which results in an IPI (Inter-Processor Interrupt). You can read more about TLB shootdowns here.
If you want to reduce the number of interrupts serviced by the core running your thread/process, you need to set the interrupt affinity. You can learn more about Red Hat Interrupts and IRQ (Interrupt requests) tuning here, and here.
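For example, an interrupt can be pinned to a CPU by writing a bitmask to /proc (the IRQ number 24 here is only an illustration; pick the one shown in /proc/interrupts):
echo 2 > /proc/irq/24/smp_affinity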
It is also worth noting that you are using CLOCK_REALTIME, which isn't guaranteed to be "smooth": it can jump around as the system clock is "disciplined" to keep accurate time by a service like NTP (Network Time Protocol) or PTP (Precision Time Protocol). For your purposes it is better to use CLOCK_MONOTONIC; you can read more about the difference here. When a clock is "disciplined" it can jump by a "step"; this is unusual and certainly not the cause of the many spikes you see.
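If you want to try that, only the clock id in the sampling loop needs to change (a minimal sketch based on the question's code):
// Same sampling loop as in the question, but using the monotonic clock,
// which is not stepped by NTP/PTP adjustments.
for( unsigned i=0; i < n; ++i )
{
    clock_gettime( CLOCK_MONOTONIC, &ts );
    t[i] = ts;
}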
Could you check the resolution with clock_getres()?
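For instance, a small self-contained sketch (not part of the original comment):
#include <time.h>
#include <iostream>

int main()
{
    // Print the resolution the kernel reports for the chosen clock.
    timespec res;
    clock_getres( CLOCK_MONOTONIC, &res );
    std::cout << "resolution: " << res.tv_sec << " s " << res.tv_nsec << " ns\n";
    return 0;
}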
I suspect what you are measuring here is called "OS Noise". This is often caused by your program getting pre-empted by the operating system. The operating system then performs other work. There are numerous causes, but commonly it is: other runnable tasks, hardware interrupts, or timer events.
The FTQ/FWQ benchmarks were designed to measure this characteristic and the summary contains some further information:
https://asc.llnl.gov/sequoia/benchmarks/FTQ_summary_v1.1.pdf
