How to parallelize a "while" loop using PPL - visual-c++

I need to parallelize a "while" loop using PPL. I have the following code in Visual C++ (MS VS 2013).
int WordCount::CountWordsInTextFiles(basic_string<char> p_FolderPath, vector<basic_string<char>>& p_TextFilesNames)
{
// Word counter in all files.
atomic<unsigned> wordsInFilesTotally = 0;
// Critical section.
critical_section cs;
// Set specified folder as current folder.
::SetCurrentDirectory(p_FolderPath.c_str());
// Concurrent iteration through p_TextFilesNames vector.
parallel_for(size_t(0), p_TextFilesNames.size(), [&](size_t i)
{
// Create a stream to read from file.
ifstream fileStream(p_TextFilesNames[i]);
// Check if the file is opened
if (fileStream.is_open())
{
// Word counter in a particular file.
unsigned wordsInFile = 0;
// Read from file.
while (fileStream.good())
{
string word;
fileStream >> word;
// Count total number of words in all files.
wordsInFilesTotally++;
// Count total number of words in a particular file.
wordsInFile++;
}
// Verify the values.
cs.lock();
cout << endl << "In file " << p_TextFilesNames[i] << " there are " << wordsInFile << " words" << endl;
cs.unlock();
}
});
// Destroy critical section.
cs.~critical_section();
// Return total number of words in all files in the folder.
return wordsInFilesTotally;
}
This code iterates in parallel through a std::vector in the outer loop. The parallelism is provided by the concurrency::parallel_for() algorithm. But the code also has a nested "while" loop that reads from the file, and I need to parallelize this nested loop as well. How can this nested "while" loop be parallelized using PPL? Please help.

As user High Performance Mark hints in his comment, parallel reads from the same ifstream instance will cause undefined and incorrect behavior. (For some more discussion, see question "Is std::ifstream thread-safe & lock-free?".) You're basically at the parallelization limit here with this particular algorithm.
As a side note, even reading multiple different file streams in parallel will not really speed things up if they are all being read from the same physical volume. The disk hardware can only actually support so many parallel requests (typically not more than one at a time, queuing up any requests that come in while it is busy). For some more background, you might want to check out Mark Friedman's Top Six FAQs on Windows 2000 Disk Performance; the performance counters are Windows-specific, but most of the information is of general use.
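A minimal sketch of the arrangement this answer implies: each task owns its own stream and reads it sequentially, and only the per-input totals are combined. It uses std::async and in-memory strings instead of PPL and files so that it is self-contained; countWords and countWordsInAll are hypothetical names, not part of the question's code. With real files, an std::ifstream per task would take the place of the std::istringstream.

```cpp
#include <future>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Counts whitespace-separated words in one stream.
// The extraction itself is the loop condition, so a failed final read is not counted.
unsigned countWords(std::istream& in) {
    unsigned n = 0;
    std::string word;
    while (in >> word) ++n;
    return n;
}

// One task per input text; every task creates and reads its own stream.
unsigned countWordsInAll(const std::vector<std::string>& texts) {
    std::vector<std::future<unsigned>> jobs;
    for (const std::string& t : texts) {
        jobs.push_back(std::async(std::launch::async, [&t] {
            std::istringstream in(t);  // private to this task, never shared
            return countWords(in);
        }));
    }
    unsigned total = 0;
    for (auto& j : jobs) total += j.get();  // combine per-input counts
    return total;
}
```

No stream is touched by more than one thread here, which is the constraint described above. Whether this is faster than a serial loop still depends on the storage device.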

Related

How to integrate the outputs from Callbacks in C++/CPLEX

I use macro callbacks in my code and I want to combine their outputs. At each node I need the pseudocosts, slacks, variableBranch and so on, but I don't know how to combine the data that I retrieve with different callbacks. I don't run all callbacks together, and each time I run the code with a different callback the NodeID or Node values differ. For example, in pic1 I ran a BranchCallback to retrieve pseudocosts, and in pic3 I used a UserCutCallback to retrieve the values of the variables at each node. As can be seen, the last node in pic1 is 126, but in pic3 it is 164. I want to build a data structure in Excel with one entry per node, but I don't know which node count I have to use: 126 or 164?
For example, can I say that all information (pseudocost values) about node 10 in pic1 belongs to node 10 in pic3, and that all information (slack values) about node 10 in pic3 belongs to node 10 in pic1?
ILOUSERCUTCALLBACK1(Myvalue, IloArray<IloNumVarArray>, vars) {
for (int i = 0; i < nbworkers; ++i) {
for (int j = 0; j < nbmachines; j++) {
cout << "getvalue(" << vars[i][j] << ") = "
<< getValue(vars[i][j]) << endl;
}
}
}
You ask many things at the same time. I am going to answer the question that is related to the subject. For everything else please create a new question and show your code, your actual output and explain how that differs from the expected output.
The IloCplex class has a function out(). That function returns a reference to the stream to which CPLEX sends all its output. You can pass that reference to your callbacks and then write to that stream from your callbacks.
For example:
ILOUSERCUTCALLBACK1(MyCallback, std::ostream &, output) {
output << "Message from callback" << std::endl;
}
and then
cplex.use(MyCallback(cplex.getEnv(), cplex.out()));
to create and register the callback.
UPDATE After you edited your question, it seems your problem is not to print output from a callback but something else.
First of all note that it is expected to get different search paths if you do one run with a user cut callback and another run with a branch callback. There is no way to relate the node numbers or node ids from one run to the other. The statistics you want to obtain must be acquired with one single run.
Moreover, in order to identify a node you should not use the node number or the number of nodes processed (the number in the left-most column in the log). Instead you should use the ID of a node. This is what is known as the sequence number in the C++ API, and it is the only thing you can use to identify a node. These IDs should match the node IDs shown at the very right of the log if dynamic search is disabled (which happens automatically if you use control callbacks). These node IDs are available from all callbacks, and you can use them to combine information collected by different callbacks for the same node.
Update 2: Actually, I was wrong. The number returned by getNodeId() and the number shown in the NodeID column of the log are different, see my answer to this question. There is no way to relate these two numbers. Sorry for the confusion. I think you asked a similar question in another forum as well and I claimed the two numbers to be the same. That was wrong as well. Sorry again.
So basically your only option to relate things in a callback to things in a log is to perform a single-threaded run and then look at the order in which things are printed.
However, in order to trace the tree (along with pseudo costs etc.) you don't need the log. You can do everything from a callback by just using sequence numbers. The most difficult part is tracking the parent/child relationship, which can be done like this (note that this is not thread safe):
struct Parent {
typedef IloCplex::MIPCallbackI::NodeId NodeId;
struct Less {
bool operator()(NodeId const &n1, NodeId const &n2) const {
return n1._id < n2._id;
}
};
typedef std::map<NodeId,NodeId,Less> MapType;
MapType parents;
void set(NodeId child, NodeId parent) { parents[child] = parent; }
IloCplex::MIPCallbackI::NodeId get(NodeId child) const {
MapType::const_iterator it = parents.find(child);
return (it == parents.end()) ? NodeId() : it->second;
}
};
Parent parent;
ILOBRANCHCALLBACK0(BranchCallback) {
std::cout << "CALLBACK[B]: " << getNodeId()
<< " (" << parent.get(getNodeId()) << ")" << std::endl;
int const n = getNbranches();
for (int i = 0; i < n; ++i) {
NodeId id = makeBranch(i);
parent.set(id, getNodeId());
}
}
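As a rough illustration of collecting data from several callbacks in a single run, the sketch below keeps one row per node, keyed by sequence number. It is plain C++ and does not call CPLEX: NodeTable, its methods and the field names are hypothetical, and the assumption is that the node's sequence number can be reduced to an integer key (the code above compares the `_id` member for the same purpose). Unlike the Parent map above, it takes a mutex on each access.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical per-node table: callbacks of different kinds write
// their values under the same node key during one solve.
class NodeTable {
public:
    // Records one named value for the node with the given sequence number.
    void put(long nodeSeq, const std::string& field, double value) {
        std::lock_guard<std::mutex> lock(mutex_);
        rows_[nodeSeq][field] = value;
    }

    // Returns a copy of everything recorded for a node (empty if unknown).
    std::map<std::string, double> row(long nodeSeq) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = rows_.find(nodeSeq);
        return it == rows_.end() ? std::map<std::string, double>() : it->second;
    }

    // Number of distinct nodes seen so far.
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return rows_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<long, std::map<std::string, double>> rows_;
};
```

A branch callback could then call something like put(id, "pseudocost", value) and a user cut callback put(id, "slack", value) for the same node, and the rows can be exported after the solve.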

Parallel ray tracing in 16x16 chunks

My ray tracer is currently multithreaded: I basically divide the image into as many chunks as the system has hardware threads and render them in parallel. However, not all chunks take the same time to render, so for about half of the run time CPU usage is only around 50%.
Code
std::shared_ptr<bitmap_image> image = std::make_shared<bitmap_image>(WIDTH, HEIGHT);
auto nThreads = std::thread::hardware_concurrency();
std::cout << "Resolution: " << WIDTH << "x" << HEIGHT << std::endl;
std::cout << "Supersampling: " << SUPERSAMPLING << std::endl;
std::cout << "Ray depth: " << DEPTH << std::endl;
std::cout << "Threads: " << nThreads << std::endl;
std::vector<RenderThread> renderThreads(nThreads);
std::vector<std::thread> tt;
auto size = WIDTH*HEIGHT;
auto chunk = size / nThreads;
auto rem = size % nThreads;
//launch threads
for (unsigned i = 0; i < nThreads - 1; i++)
{
tt.emplace_back(std::thread(&RenderThread::LaunchThread, &renderThreads[i], i * chunk, (i + 1) * chunk, image));
}
tt.emplace_back(std::thread(&RenderThread::LaunchThread, &renderThreads[nThreads-1], (nThreads - 1)*chunk, nThreads*chunk + rem, image));
for (auto& t : tt)
t.join();
I would like to divide the image into 16x16 chunks or something similar and render them in parallel, so that after a chunk has been rendered, the thread moves on to the next one, and so on. This would greatly improve CPU usage and reduce run time.
How do I set up my ray tracer to render these 16x16 chunks in a multithreaded manner?
I assume the question is "How to distribute the blocks to the various threads?"
In your current solution, you're figuring out the regions ahead of time and assigning them to the threads. The trick is to turn this idea on its head. Make the threads ask for what to do next whenever they finish a chunk of work.
Here's an outline of what the threads will do:
void WorkerThread(Manager *manager) {
while (auto task = manager->GetTask()) {
task->Execute();
}
}
So you create a Manager object that returns a chunk of work (in the form of a Task) each time a thread calls its GetTask method. Since that method will be called from multiple threads, you have to be sure it uses appropriate synchronization.
std::unique_ptr<Task> Manager::GetTask() {
std::lock_guard<std::mutex> guard(mutex); // explicit template argument also compiles before C++17
std::unique_ptr<Task> t;
if (next_row < HEIGHT) {
t = std::make_unique<Task>(next_row);
++next_row;
}
return t;
}
In this example, the manager creates a new task to ray trace the next row. (You could use 16x16 blocks instead of rows if you like.) When all the tasks have been issued, it just returns an empty pointer, which essentially tells the calling thread that there's nothing left to do, and the calling thread will then exit.
If you made all the Tasks in advance and had the manager dole them as they are requested, this would be a typical "work queue" solution. (General work queues also allow new Tasks to be added on the fly, but you don't need that feature for this particular problem.)
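The same "ask for the next chunk" idea can be sketched with 16x16 tiles. This variant swaps the mutex-protected Manager for a shared atomic tile counter, which is a different mechanism from the one shown above but serves the same purpose. renderTiled and shadePixel are hypothetical names, and shadePixel merely stands in for the real per-pixel ray tracing.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for tracing the ray(s) of one pixel.
inline int shadePixel(int x, int y) { return x + y; }

// Renders a width x height image; workers grab tile indices on demand,
// so a thread that finishes a cheap tile immediately takes another one.
std::vector<int> renderTiled(int width, int height, int tile, unsigned nThreads) {
    std::vector<int> image(static_cast<std::size_t>(width) * height, -1);
    const int tilesX = (width + tile - 1) / tile;   // round up for partial tiles
    const int tilesY = (height + tile - 1) / tile;
    std::atomic<int> nextTile{0};

    auto worker = [&] {
        for (;;) {
            const int t = nextTile.fetch_add(1);     // claim the next tile
            if (t >= tilesX * tilesY) return;        // nothing left to do
            const int x0 = (t % tilesX) * tile;
            const int y0 = (t / tilesX) * tile;
            const int x1 = std::min(x0 + tile, width);
            const int y1 = std::min(y0 + tile, height);
            for (int y = y0; y < y1; ++y)
                for (int x = x0; x < x1; ++x)
                    image[static_cast<std::size_t>(y) * width + x] = shadePixel(x, y);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < std::max(1u, nThreads); ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return image;
}
```

Tiles never overlap, so the workers write to disjoint pixels and the image buffer itself needs no lock.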
I do this a bit differently:
1. Obtain the number of CPUs and/or cores.
You did not specify the OS, so you need to use your OS API for this; search for "system affinity mask".
2. Divide the screen among threads.
I divide the screen by lines instead of 16x16 blocks, so I do not need a queue or anything similar. Simply create one thread for each CPU/core that processes only the rays of its own horizontal lines. Each thread has an ID number counting from zero, and with n CPUs/cores the lines belonging to each thread are:
y = ID + i*n
where i = {0,1,2,3,...}; stop once y is greater than or equal to the vertical screen resolution. This access pattern has its advantages: for example, accessing the screen buffer via ScanLines does not conflict between threads, as each thread accesses only its own lines.
I also set the affinity mask for each thread so that it uses only its own CPU/core. That gave me a small boost because there is less process switching (but that was on older OS versions; it is hard to say what it does now).
3. Synchronize threads.
Basically, you should wait until all threads are finished, and then render the result on screen. Your threads can either stop, in which case you create new ones for the next frame, or enter Sleep loops until rendering is forced again.
I use the latter approach so I do not need to create and configure the threads over and over again, but beware: Sleep(1) can sleep a lot longer than just 1 ms.
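The interleaved row assignment y = ID + i*n can be sketched as a small helper. rowsForThread is a hypothetical name used only for illustration; a real renderer would trace each row inside the loop rather than collect row indices.

```cpp
#include <vector>

// Rows handled by the thread with the given id when n threads share
// an image of `height` rows: y = id, id + n, id + 2n, ...
std::vector<int> rowsForThread(int id, int n, int height) {
    std::vector<int> rows;
    for (int y = id; y < height; y += n) rows.push_back(y);
    return rows;
}
```

Every row belongs to exactly one thread, and expensive regions of the image tend to be spread across all threads because neighbouring rows go to different threads.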

std::map insert thread safe in c++11?

I have very simple code in which multiple threads try to insert data into a std::map. As I understand it, this should lead to a program crash because it is a data race.
std::map<long long,long long> k1map;
void Ktask()
{
for(int i=0;i<1000;i++)
{
long long random_variable = (std::rand())%1000;
std::cout << "Thread ID -> " << std::this_thread::get_id() << " with looping index " << i << std::endl;
k1map.insert(std::make_pair(random_variable, random_variable));
}
}
int main()
{
std::srand((int)std::time(0)); // use current time as seed for random generator
for (int i = 0; i < 1000; ++i)
{
std::thread t(Ktask);
std::cout << "Thread created " << t.get_id() << std::endl;
t.detach();
}
return 0;
}
However, I ran it multiple times and the application never crashed, whereas the same code using pthreads and C++03 does crash. So I am wondering: is there some change in C++11 that makes map insert thread-safe?
No, std::map::insert is not thread-safe.
There are many reasons why your example may not crash. Your threads may be running in a serial fashion due to the system scheduler, or because they finish very quickly (1000 iterations isn't that much). Your map will fill up quickly (only having 1000 nodes) and therefore later insertions won't actually modify the structure and reduce possibility of crashes. Or perhaps the implementation you're using IS thread-safe.
For most standard library types, the only thread safety guarantee you get is that it is safe to use separate object instances in separate threads. That's it.
And std::map is not one of the exceptions to that rule. An implementation might offer you more of a guarantee, or you could just be getting lucky.
And when it comes to fixing threading bugs, there's only one kind of luck.
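One conventional way to make concurrent inserts well-defined is to guard the map with a mutex and join the threads before reading the result. The sketch below is an illustration under those assumptions, not the asker's program: fillMapSafely is a hypothetical name, and the keys are deterministic so that the final size can be checked.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Inserts nThreads * perThread distinct keys from several threads
// and returns the final number of elements in the map.
std::size_t fillMapSafely(unsigned nThreads, int perThread) {
    std::map<long long, long long> shared;
    std::mutex m;  // protects every access to `shared`
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&shared, &m, t, perThread] {
            for (int i = 0; i < perThread; ++i) {
                const long long key = static_cast<long long>(t) * perThread + i;
                std::lock_guard<std::mutex> lock(m);  // one writer at a time
                shared.insert(std::make_pair(key, key));
            }
        });
    }
    for (auto& w : workers) w.join();  // wait instead of detaching
    return shared.size();
}
```

Joining also avoids returning from main while detached threads are still using the map.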

Improve serial building of a string with openMP {Copeland-Erdős constant}

I'm building a program to find substrings of Copeland-Erdős constant in C++11
Copeland-Erdős constant is a string with all primes in order:
2,3,5,7,11,13… → 23571113…
I need to check if a substring given is inside that constant, and do it in a quick way.
So far I've built a serial program that uses a Miller-Rabin function to check whether the numbers generated by a counter are prime, and appends the primes to the main string (the constant). To reach the 8th Mersenne prime (2^31-1) the program takes 8 minutes.
And then, I use find to check if the substring given is in the constant and the position where it starts.
PROBLEMS:
I use serial programming. I start at 0 and check if all numbers are prime to add them or not... I don't know if there is any other way to do it. The substring can be a mix of primes. ex: 1..{1131}..7 (substring of 11,13,17)
Do you have any proposal to improve the program execution time by using OpenMP?
I want to calculate the 9th Mersenne prime in "human time". I've spent more than one day and the program still hasn't found it (well, hasn't reached that number).
gcc version 4.4.7 20120313
Main.cpp
while (found == -1 && lastNumber < LIMIT) //while not found & not pass our limit
{
//I generate at least a string with double size of the input (llargada)
for (lastNumber; primers.length() <= 2*llargada; lastNumber++){
if (is_prime_mr(lastNumber))
primers += to_string(lastNumber); //if prime, we add it to the main string
}
found = primers.find(sequencia); //search substring and keep position
if (found == string::npos){ //if not found
indexOfZero += primers.length()/2; //keep IndexOfZero, the position of string in global constant
primers.erase(0,primers.length()/2); //delete first middle part of calculated string
}
}
if (found != -1){
cout << "FOUNDED!" << endl;
cout << "POS: " << indexOfZero << " + " << found << " = " << indexOfZero+found << endl;} //that give us the real position of the substring in the main string
//although we only spend 2*inputString.size() memory
else
cout << "NOT FOUND" << endl;
Improving serial execution:
For starters, you do not need to check every number to see if it's prime, but rather every odd number (except for 2). We know that no even number past two can be prime. This should cut down your execution time in half.
Also, I do not understand why you have a nested loop. You should only have to check your list once.
Also, I fear that your algorithm might not be correct. Currently, if you do not find the substring, you delete half of your string and move on. However, if you have 50 non-primes in a row, you could end up deleting the entire string except for the very last character. But what if the substring you're looking for is 3 digits and needed 2 of the previous characters? Then you've erased some of the information needed to find your solution!
Finally, you should only search for your substring if you've actually found a prime number. Otherwise, you have already searched for it last iteration and nothing has been added to your string.
Combining all of these ideas, you have:
primers = "23";
lastNumber = 3;
found = -1;
while (found == -1)
{
lastNumber += 2;
if (is_prime_mr(lastNumber)) {
primers += to_string(lastNumber); //if prime, we add it to the main string
found = primers.find(sequencia); //search substring and keep position
if (found == string::npos)
found = -1;
else
break;
}
}
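Also, you should write your own find function that only checks the last few digits (where few = the length of your most recent concatenation to the global string primers, plus the part of the substring that may overlap the older digits). If the substring wasn't in the previous global string, there are only a few places it could show up in your newest string. The cost of that search no longer depends on the length of the whole string, so it is effectively O(1) in the string length as opposed to O(n).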
int findSub(std::string total, std::string substring, std::string lastAddition);
With this change your if statement should change to:
if (found != -1)
break;
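A possible shape for such a tail-only search is sketched below. It is a hypothetical variant of the findSub declared above: it is named findSubTail, takes the length of the last addition rather than the addition itself, and returns the match position or -1. It assumes that the string was already searched before the latest digits were appended.

```cpp
#include <cstddef>
#include <string>

// Looks for `sub` only where a match can overlap the newly appended digits.
// `addedLen` is the number of characters appended since the last search.
// Returns the position of the match in `total`, or -1 if there is none.
long findSubTail(const std::string& total, const std::string& sub, std::size_t addedLen) {
    if (sub.empty() || sub.size() > total.size()) return -1;
    // A match overlapping the new digits starts at most
    // addedLen + sub.size() - 1 characters before the end.
    const std::size_t window = addedLen + sub.size() - 1;
    const std::size_t start = total.size() > window ? total.size() - window : 0;
    const std::size_t pos = total.find(sub, start);
    return pos == std::string::npos ? -1 : static_cast<long>(pos);
}
```

For example, after appending "11" to "2357", searching for "71" only inspects the last three characters of "235711".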
Adding parallelism:
Unfortunately, as-is, your algorithm is inherently serial because you have to iterate through all the primes one-by-one, adding them to the list in a row in order to find your answer. There's no simple OpenMP way to parallelize your algorithm.
However, you can take advantage of parallelism by breaking up your string into pieces and having each thread work separately. Then, the only tricky thing you have to do is consider the boundaries between the final strings to double check you haven't missed anything. Something like as follows:
bool globalFound = false;
bool found;
std::vector<std::string> primers;
#pragma omp parallel private(lastNumber, myFinalNumber, found, my_id, num_threads)
{
my_id = omp_get_thread_num();
num_threads = omp_get_num_threads();
// Exactly one thread resizes the vector; the implicit barrier at the end of
// "single" makes every thread wait until primers has the correct size.
#pragma omp single
primers.resize(num_threads);
if (my_id == 0) { // first thread starts at 0... well, actually 3
primers[my_id] = "23";
lastNumber = 3;
}
else {
primers[my_id] = "";
lastNumber = (my_id/(double)num_threads)*LIMIT - 2; // figure out my starting place
if (lastNumber % 2 == 0) // ensure I'm not even
lastNumber++;
}
found = false;
myFinalNumber = ((my_id+1)/(double)num_threads)*LIMIT - 2;
while (!globalFound && lastNumber < myFinalNumber)
{
lastNumber += 2;
if (is_prime_mr(lastNumber)) {
primers[my_id] += to_string(lastNumber);
found = findSub(primers[my_id], sequencia, to_string(lastNumber)); // your new version of find
if (found) {
#pragma omp critical
globalFound = true; // a plain assignment is not a valid form for "omp atomic", so use a critical section
break;
}
}
}
}
if (!globalFound) {
// Result was not found in any thread, so check for boundaries/endpoints
globalFound = findVectorSubstring(primers, sequencia);
}
I'll let you finish this (by writing the smart find, findVectorSubstring - should only be checking for boundaries between elements of primers, and double checking you understand the logic of this new algorithm). Furthermore, if the arbitrary LIMIT that you setup turns out to be too small, you can always wrap this whole thing in a loop that searches between i*LIMIT and (i+1)*LIMIT.
Lastly, yes there will be load balancing issues. I can certainly imagine threads finding an uneven amount of prime numbers. Therefore, certain threads will be doing more work in the find function than others. However, a smart version of find() should be O(1) whereas is_prime_mr() is probably O(n) or O(logn), so I'm assuming that the majority of the execution time will be spent in the is_prime_mr() function. Therefore, I do not believe the load balancing will be too bad.
Hope this helps.

SystemC: channels vs port value update

While working on a SystemC project, I discovered that probably I have some confused ideas about signals and ports. Let's say I have something like this:
//cell.hpp
SC_MODULE(Cell)
{
sc_in<sc_uint<16> > datain;
sc_in<sc_uint<1> > addr_en;
sc_in<sc_uint<1> > enable;
sc_out<sc_uint<16> > dataout;
SC_CTOR(Cell)
{
SC_THREAD(memory_cell);
sensitive << enable << datain << addr_en;
}
private:
void memory_cell();
};
//cell.cpp
void Cell::memory_cell()
{
unsigned short data_cell=11;
while(true)
{
//wait for some input
wait();
if (enable->read()==1 && addr_en->read()==1)
{
data_cell=datain->read();
}
else
{
if(enable->read()==0 && addr_en->read()==1)
{
dataout->write(data_cell);
}
}
}
}
//test.cpp
SC_MODULE(TestBench)
{
sc_signal<sc_uint<1> > address_en_s;
sc_signal<sc_uint<16> > datain_s;
sc_signal<sc_uint<1> > enable_s;
sc_signal<sc_uint<16> > dataout_s;
Cell cella;
SC_CTOR(TestBench) : cella("cella")
{
// Binding
cella.addr_en(address_en_s);
cella.datain(datain_s);
cella.enable(enable_s);
cella.dataout(dataout_s);
SC_THREAD(stimulus_thread);
}
private:
void stimulus_thread() {
//write a value:
datain_s.write(81);
address_en_s.write(1);
enable_s.write(1);
wait(SC_ZERO_TIME);
//read what we have written:
enable_s.write(0);
address_en_s.write(1);
wait(SC_ZERO_TIME);
cout << "Output value: " << dataout_s.read() << endl;
//let's cycle the memory again:
address_en_s.write(0);
wait(SC_ZERO_TIME);
cout << "Output value: " << dataout_s.read() << endl;
}
};
I've tried running these modules and I've noticed something weird (at least, weird to me): when the stimulus writes a value (81), after the wait(SC_ZERO_TIME) the memory thread finds its datain, enable and addr_en values already updated. This is what I expected to happen. The same happens when the stimulus changes the enable_s value in order to run another cycle in the memory thread and copy the data_cell value to the memory cell's dataout port. What I don't understand is why, after the memory module writes to its dataout port and goes back to the wait() statement at the beginning of the while loop, the stimulus module still has the old value (0) on its dataout_s channel, and not the new value (81) that has just been written by the memory module. Then, if I run another cycle of the memory loop (for example by changing some values on the stimulus channels), the dataout channel finally updates.
In other words, it looks like if I write to the stimulus channels and then switch to the memory thread, the memory finds the values updated. But if the memory thread writes to its ports and then I switch to the stimulus thread, the thread still sees the old values on its channels (bound to the memory ports).
The example above does not work as I expected because of incorrect delta-cycle synchronization.
Generally speaking, let's suppose we have two threads running in two modules, A and B, connected through a channel. If thread A writes something during delta cycle 1, it only becomes available to thread B during delta cycle 2. And if thread B writes something during delta cycle 2, thread A has to wait until delta cycle 3 in order to read it.
Being aware of this, the stimulus thread needs two consecutive wait(SC_ZERO_TIME) statements in order to read the correct output from the memory, because it has to advance by two delta cycles.
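The timing can be illustrated with a toy model in plain C++. This is not SystemC: ToySignal and simulateReadback are hypothetical names, and the model only mimics the rule that a written value becomes readable after the update phase that ends the current delta cycle.

```cpp
#include <vector>

// Toy signal: write() only requests a new value, update() makes it visible.
template <class T>
struct ToySignal {
    T current{};
    T pending{};
    T read() const { return current; }
    void write(T v) { pending = v; }
    void update() { current = pending; }
};

// Returns what the stimulus reads from dataout after one and after two
// zero-time waits, following the read-back part of the example above.
std::vector<int> simulateReadback() {
    ToySignal<int> enable, dataout;
    enable.current = enable.pending = 1;  // value left over from the write phase
    const int storedCell = 81;            // value already stored in the cell
    std::vector<int> seen;

    // Delta 1, evaluate: the stimulus requests enable = 0, then waits.
    enable.write(0);
    enable.update();   // delta 1, update phase
    dataout.update();

    // Delta 2, evaluate: the cell sees enable == 0 and requests dataout = 81,
    // while the stimulus (after its first wait) still reads the old value.
    if (enable.read() == 0) dataout.write(storedCell);
    seen.push_back(dataout.read());
    enable.update();   // delta 2, update phase
    dataout.update();

    // Delta 3, evaluate: after a second wait the stimulus reads the new value.
    seen.push_back(dataout.read());
    return seen;
}
```

In this model the first read returns the old value 0 and the second returns 81, which matches the need for two consecutive zero-time waits.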
