I am researching fuzzing approaches, and I want to determine which approach is suitable for race condition problems. Therefore I have a question about race conditions themselves.
Suppose we have a global variable that several threads access without any restriction. How can we trigger the existing race condition? Is it enough to simply run the function that uses the global variable from several threads? In other words, will merely running the function trigger the race condition anyway?
Below I have put some code that I know has a race condition problem. I want to know which inputs I should give the functions to trigger that race condition.
#include <thread>
#include <vector>
#include <iostream>
#include <experimental/filesystem>
#include <Windows.h>
#include <atomic>
#include <cstdlib>

using namespace std;
namespace fs = experimental::filesystem;

volatile int totalSum;
//atomic<int> totalSum;
volatile int* numbersArray;

void threadProc(int startIndex, int endIndex)
{
    Sleep(300);
    for (int i = startIndex; i < endIndex; i++)
    {
        totalSum += numbersArray[i];   // unsynchronized read-modify-write on the shared sum
    }
}

void performAddition(int maxNum, int threadCount)
{
    totalSum = 0;
    numbersArray = new int[maxNum];
    for (int i = 0; i < maxNum; i++)
    {
        numbersArray[i] = i + 1;
    }

    int numbersPerThread = maxNum / threadCount;
    vector<thread> workerThreads;
    for (int i = 0; i < threadCount; i++)
    {
        int startIndex = i * numbersPerThread;
        int endIndex = startIndex + numbersPerThread;
        if (i == threadCount - 1)
            endIndex = maxNum;
        workerThreads.emplace_back(threadProc, startIndex, endIndex);
    }
    for (int i = 0; i < workerThreads.size(); i++)
    {
        workerThreads[i].join();
    }
    delete[] numbersArray;
}

void printUsage(char* progname)
{
    cout << "usage: " << fs::path(progname).filename()
         << " maxNum threadCount\t with 1<maxNum<=10000, 0<threadCount<=maxNum" << endl;
}

int main(int argc, char* argv[])
{
    if (argc != 3)
    {
        printUsage(argv[0]);
        return -1;
    }
    long int maxNum = strtol(argv[1], nullptr, 10);
    long int threadCount = strtol(argv[2], nullptr, 10);
    if (maxNum <= 1 || maxNum > 10000 || threadCount <= 0 || threadCount > maxNum)
    {
        printUsage(argv[0]);
        return -2;
    }
    performAddition(maxNum, threadCount);
    cout << "Result: " << totalSum << " (expected: " << (maxNum * (maxNum + 1)) / 2 << ")" << endl;
    return totalSum;
}
Thanks for your help
There may be many kinds of race conditions. One example for your case:
one thread:
reads the commonly accessible variable (1)
increments it (2)
writes the resulting value back to the common variable (so it becomes 2)
the second thread starts just after the first thread has read the common value:
it reads the same value (1)
increments the value it read (2)
then writes the calculated value to the common variable at the same time as the first thread (2)
As a result
the common value was incremented only by one (to 2), but it should have been incremented by two (to 3), since two threads were acting on it.
Testing race conditions:
For your purpose (in the above example) you can detect the race condition when you get a different result than expected.
Triggering:
If you want the described situation to happen reliably for testing purposes, you will need to coordinate the work of the two threads. This will allow you to do your testing.
Nevertheless, coordinating the two threads conflicts with the definition of a race condition: "A race condition or race hazard is the behavior of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events." So you need to know what you want: a race condition is normally unwanted behavior, but in your case you want it to happen, which makes sense for testing purposes.
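As a rough sketch of such coordination (this is not your program: the two flag variables and the split read/increment/write are my own way of forcing the "both threads read before either writes" interleaving):
#include <atomic>
#include <functional>
#include <iostream>
#include <thread>

// Shared counter with a deliberately split read/increment/write; the two
// flags exist only to force the "both read before either writes" ordering.
int counter = 0;                           // the commonly accessible variable
std::atomic<bool> t1HasRead{false};
std::atomic<bool> t2HasRead{false};

void increment(std::atomic<bool>& mine, std::atomic<bool>& other)
{
    int local = counter;                   // (1) read the shared value
    mine = true;
    while (!other) { }                     // wait until the other thread has read too
    ++local;                               // (2) increment the private copy
    counter = local;                       // (3) write it back -> one update is lost
}

int main()
{
    std::thread t1(increment, std::ref(t1HasRead), std::ref(t2HasRead));
    std::thread t2(increment, std::ref(t2HasRead), std::ref(t1HasRead));
    t1.join();
    t2.join();
    std::cout << "counter = " << counter << " (expected 2)" << std::endl;   // prints 1
}
Strictly speaking this is still a data race on counter, which is exactly the point; the coordination only pins the timing down so the lost update happens on every run.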
If you are asking generally when a race condition can occur: it depends on your software design (e.g. shared atomic integers are fine to use), your hardware design (e.g. variables stored in temporary registers), and, generally, luck.
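As a concrete illustration of the "shared atomic integers" point: switching to the commented-out atomic<int> totalSum; variant from your own code makes the concurrent increments well-defined. A minimal standalone sketch (the thread and iteration counts are arbitrary):
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> totalSum{0};   // atomic: concurrent increments are well-defined

int main()
{
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            for (int i = 0; i < 100000; ++i)
                totalSum.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    std::cout << totalSum << std::endl;   // always 400000, unlike the volatile int version
}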
Hope this helps,
Witold
I've recently learned about false sharing, which, in my understanding, stems from the CPU's effort to maintain cache coherence between different cores.
However, doesn't the following example demonstrate that cache coherence is violated?
The example below launches several threads that increment a global variable x, several threads that assign the value of x to y, and an observer that tests whether y > x. The condition y > x should never occur if there were memory coherence between the cores, as y is only increased after x was increased. However, this condition does happen according to the results of running this program. I tested it in Visual Studio, both x64 and x86, both Debug and Release, with pretty much the same results.
So, does memory coherence only happen when it's bad and never when it's good? :)
Please explain how cache coherence works and how it doesn't work. If you can guide me to a book that explains the subject I'll be grateful.
edit: I've added mfence wherever possible; still there is no memory coherence (presumably due to stale cache).
Also, I know the program has a data race; that's the whole point. My question is: why is there a data race if the CPU maintains cache coherence? (And if it wasn't maintaining cache coherence, then what is false sharing and how does it happen?) Thank you.
#include <intrin.h>
#include <windows.h>
#include <iostream>
#include <thread>
#include <atomic>
#include <list>
#include <chrono>
#include <ratio>

#define N 1000000
#define SEPARATE_CACHE_LINES 0
#define USE_ATOMIC 0

#pragma pack(1)
struct
{
    __declspec (align(64)) volatile long x;
#if SEPARATE_CACHE_LINES
    __declspec (align(64))
#endif
    volatile long y;
} data;

volatile long &g_x = data.x;
volatile long &g_y = data.y;
int g_observed;
std::atomic<bool> g_start;

void Observer()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
        _mm_mfence();
        long y = g_y;
        _mm_mfence();
        long x = g_x;
        _mm_mfence();
        if (y > x)
        {
            ++g_observed;
        }
    }
}

void XIncreaser()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
#if USE_ATOMIC
        InterlockedAdd(&g_x, 1);
#else
        _mm_mfence();
        int x = g_x + 1;
        _mm_mfence();
        g_x = x;
        _mm_mfence();
#endif
    }
}

void YAssigner()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
#if USE_ATOMIC
        long x = g_x;
        InterlockedExchange(&g_y, x);
#else
        _mm_mfence();
        int x = g_x;
        _mm_mfence();
        g_y = x;
        _mm_mfence();
#endif
    }
}

int main()
{
    using namespace std::chrono;

    g_x = 0;
    g_y = 0;
    g_observed = 0;
    g_start = false;

    const int NAssigners = 4;
    const int NIncreasers = 4;
    std::list<std::thread> threads;
    for (int i = 0; i < NAssigners; ++i)
    {
        threads.emplace_back(YAssigner);
    }
    for (int i = 0; i < NIncreasers; ++i)
    {
        threads.emplace_back(XIncreaser);
    }
    threads.emplace_back(Observer);

    auto tic = high_resolution_clock::now();
    g_start = true;
    for (std::thread& t : threads)
    {
        t.join();
    }
    auto toc = high_resolution_clock::now();

    std::cout << "x = " << g_x << " y = " << g_y << " number of times y > x = " << g_observed << std::endl;
    std::cout << "&x = " << (int*)&g_x << " &y = " << (int*)&g_y << std::endl;
    std::chrono::duration<double> t = toc - tic;
    std::cout << "time elapsed = " << t.count() << std::endl;
    std::cout << "USE_ATOMIC = " << USE_ATOMIC << " SEPARATE_CACHE_LINES = " << SEPARATE_CACHE_LINES << std::endl;
    return 0;
}
Example output:
x = 1583672 y = 1583672 number of times y > x = 254
&x = 00007FF62BE95800 &y = 00007FF62BE95804
time elapsed = 0.187785
USE_ATOMIC = 0 SEPARATE_CACHE_LINES = 0
False sharing is mainly related to performance, not coherence or program order. The CPU cache works on a granularity which is typically 16, 32, 64, ... bytes: the cache line. That means that if two independent data items are close together in memory, they will experience each other's cache operations. Specifically, if &a / CACHE_LINE_SIZE == &b / CACHE_LINE_SIZE, then they share a cache line.
For example, if CPUs 0 and 1 are fighting over a, and CPUs 2 and 3 are fighting over b, the cache line containing a and b will thrash between each of the four caches. This is the effect of false sharing, and it causes a large performance drop.
False sharing happens because the coherence algorithm in the caches demands that there is a consistent view of memory. A good way to examine it is to put two atomic counters in a structure, spaced apart by one or two KB:
struct a {
    long a;
    long pad[1024];
    long b;
};
and find a nice little machine-language function to do an atomic increment. Then cut loose NCPU/2 threads incrementing a and NCPU/2 threads incrementing b until they reach a big number.
Then repeat with the pad array commented out, and compare the times.
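A minimal sketch of that experiment, using std::atomic::fetch_add in place of a hand-written machine-language increment (the struct and field names, pad size, iteration count, and thread split are my own arbitrary choices):
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Counters {
    std::atomic<long> a{0};
    long pad[1024];              // comment this out to force a and b onto the same cache line
    std::atomic<long> b{0};
};

int main()
{
    Counters c;
    const long target = 10000000;
    const unsigned ncpu = std::max(2u, std::thread::hardware_concurrency());

    // Each worker hammers one counter with relaxed atomic increments.
    auto worker = [&](std::atomic<long>& ctr) {
        for (long i = 0; i < target; ++i)
            ctr.fetch_add(1, std::memory_order_relaxed);
    };

    auto tic = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < ncpu / 2; ++i) threads.emplace_back(worker, std::ref(c.a));
    for (unsigned i = 0; i < ncpu / 2; ++i) threads.emplace_back(worker, std::ref(c.b));
    for (auto& t : threads) t.join();
    auto toc = std::chrono::steady_clock::now();

    std::printf("a=%ld b=%ld elapsed=%.3fs\n", c.a.load(), c.b.load(),
                std::chrono::duration<double>(toc - tic).count());
}
With the pad in place the two counters live on different cache lines and the two groups of threads barely interact; with it commented out, the run time typically goes up noticeably even though the program computes exactly the same result.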
When you are trying to get at machine details, clarity and precision are your friends; C++ and weird attribute declarations aren’t.
As a hobby project I have been developing an IDE for Chef, an esoteric programming language. While writing various test programs in Chef, I've realised that implementing a simple sort algorithm, or even comparing two integers to see which one is greater, is a major challenge when the only compare-and-branch statement in the language is a loop which repeats while a number is non-zero. For example:
Dissolve the sugar. <-- execute loop if value of 'sugar' is non zero
Add flour to mixing bowl. <-- add value of 'flour' into mixing bowl
Set aside. <-- break out of the loop
Stir until dissolved. <-- mark the end of the loop
I do have a working solution to compare two integers in Chef, but it is 40 lines long!
Here is an equivalent of my approach in Java, which most will find more readable than the Chef code I wrote :-)
public static void main(String[] args) {
    int first = 100;
    int second = 200;
    int looper = 1;
    int tester;
    int difference = first - second;
    int inverse = difference * -1;
    while (looper != 0) {
        difference -= 1;
        inverse -= 1;
        tester = 1;
        while (difference != 0) {
            tester--;
            break;
        }
        while (tester != 0) {
            System.out.println("First is bigger");
            System.exit(1);
        }
        tester = 1;
        while (inverse != 0) {
            tester--;
            break;
        }
        while (tester != 0) {
            System.out.println("Second is bigger");
            System.exit(1);
        }
    }
}
My question is: what's the best way of comparing two numbers when all I have is a loop-while-non-zero?
For example, suppose there is a given string consisting of 1s and 0s:
s = "00000000001111111111100001111111110000";
What is an efficient way to get the length of the longest substring of 1s in s? (11)
What is an efficient way to get the length of the longest substring of 0s in s? (10)
I would appreciate it if the question were answered from an algorithmic perspective.
I think the most straightforward way is to walk through the bit string while recording the max lengths of the all-0 and all-1 substrings. This is of O(n) complexity, as suggested by others.
If you can afford some sort of data-parallel computation, you might want to look at parallel patterns as explained here. Specifically, take a look at parallel reduction. I think this problem can be implemented in O(log n) time if you can afford one of those methods.
I'm trying to think of a parallel reduction for this problem:
On the first level of the reduction, each thread will process a chunk of the bit string (say 8 bits, depending on the number of threads you have and the length of the string) and produce a summary of it, like: 0 -> x, 1 -> y, 0 -> z, ...
On the next level, each thread will merge two of these summaries into one; any possible joins are performed at this phase (basically, if the previous summary ends with a 0 (or 1) and the next summary begins with a 0 (or 1), the last entry of the first summary and the first entry of the second can be collapsed into one).
On the top level there will be just one structure with the overall summary of the bit string, which you'll have to step through to figure out the longest sequences (but this time they are all in summary form, so it should be faster). Or you can make each summary structure keep track of the longest 0 and 1 substrings it has seen, which makes walking through the final structure unnecessary.
I guess this approach only makes sense in a very limited scope, but since you seem to be very keen on getting better than O(n)...
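A sketch of the summary/merge idea (the sequential chunking and left-to-right merging below stand in for the actual parallel dispatch, which depends on your threading framework; the Summary fields and the chunk size are my own choices):
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

// Per-chunk summary: just enough information to merge neighbouring chunks.
struct Summary {
    int  len;        // number of bits covered by this summary
    char firstBit;   // bit value at the left edge
    int  prefixRun;  // length of the run touching the left edge
    char lastBit;    // bit value at the right edge
    int  suffixRun;  // length of the run touching the right edge
    int  best0;      // longest run of '0' seen inside this summary
    int  best1;      // longest run of '1' seen inside this summary
};

Summary summarize(std::string_view chunk)
{
    Summary s{};
    s.len = static_cast<int>(chunk.size());
    s.firstBit = chunk.front();
    s.prefixRun = 1;
    bool inPrefix = true;
    int run = 1;
    auto record = [&](char bit, int r) {
        int& best = (bit == '0') ? s.best0 : s.best1;
        best = std::max(best, r);
    };
    for (std::size_t i = 1; i < chunk.size(); ++i) {
        if (chunk[i] == chunk[i - 1]) {
            ++run;
            if (inPrefix) ++s.prefixRun;
        } else {
            record(chunk[i - 1], run);   // a run ended inside the chunk
            run = 1;
            inPrefix = false;
        }
    }
    record(chunk.back(), run);
    s.lastBit = chunk.back();
    s.suffixRun = run;
    return s;
}

// Merge two adjacent summaries; a run crossing the boundary is joined here.
Summary merge(const Summary& a, const Summary& b)
{
    Summary m{};
    m.len = a.len + b.len;
    m.firstBit = a.firstBit;
    m.lastBit = b.lastBit;
    m.prefixRun = a.prefixRun;
    m.suffixRun = b.suffixRun;
    m.best0 = std::max(a.best0, b.best0);
    m.best1 = std::max(a.best1, b.best1);
    if (a.lastBit == b.firstBit) {
        int joined = a.suffixRun + b.prefixRun;
        int& best = (a.lastBit == '0') ? m.best0 : m.best1;
        best = std::max(best, joined);
        if (a.prefixRun == a.len) m.prefixRun = a.len + b.prefixRun;  // a is one single run
        if (b.suffixRun == b.len) m.suffixRun = b.len + a.suffixRun;  // b is one single run
    }
    return m;
}

int main()
{
    std::string s = "00000000001111111111100001111111110000";
    const std::size_t chunkSize = 8;   // in a parallel version: one chunk per thread/lane

    std::vector<Summary> parts;
    for (std::size_t i = 0; i < s.size(); i += chunkSize)
        parts.push_back(summarize(std::string_view(s).substr(i, chunkSize)));

    Summary total = parts.front();
    for (std::size_t i = 1; i < parts.size(); ++i)
        total = merge(total, parts[i]);   // pairwise tree merging would work the same way

    std::cout << "longest 1s: " << total.best1 << "\n"    // 11
              << "longest 0s: " << total.best0 << "\n";   // 10
}
Because merge only looks at a fixed-size summary, the merging levels can be done pairwise in parallel, which is where the O(log n) depth would come from.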
OK, here is one solution I came up with. I'm not sure whether it is bug-free, so correct me if you discover a bug, or suggest a better way to do it. Vote for it if you agree with this solution. Thanks!
#include <iostream>
using namespace std;

int main(){
    int s[] = {0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0};
    int length = sizeof(s) / sizeof(s[0]);
    int one_start = 0;
    int one_n = 0;
    int max_one_n = 0;
    int zero_start = 0;
    int zero_n = 0;
    int max_zero_n = 0;
    for(int i=0; i<length; i++){
        // Calculate 1s
        if(one_start==0 && s[i]==1){
            one_start = 1;
            one_n++;
        }
        else if(one_start==1 && s[i]==1){
            one_n++;
        }
        else if(one_start==1 && s[i]==0){
            one_start = 0;
            if(one_n > max_one_n){
                max_one_n = one_n;
            }
            one_n = 0; // Reset
        }
        // Calculate 0s
        if(zero_start==0 && s[i]==0){
            zero_start = 1;
            zero_n++;
        }
        else if(zero_start==1 && s[i]==0){
            zero_n++;
        }
        else if(zero_start==1 && s[i]==1){ // close the current run of 0s
            zero_start = 0;
            if(zero_n > max_zero_n){
                max_zero_n = zero_n;
            }
            zero_n = 0; // Reset
        }
    }
    // Flush the runs that reach the end of the array
    if(one_n > max_one_n){
        max_one_n = one_n;
    }
    if(zero_n > max_zero_n){
        max_zero_n = zero_n;
    }
    cout << "max_one_n: " << max_one_n << endl;
    cout << "max_zero_n: " << max_zero_n << endl;
    return 0;
}
Worst case is always O(n): you can always construct an input which forces the algorithm to check every bit.
But you can probably get the average case slightly better than that (more easily if you scan for just 0s or just 1s, not both), because you can skip ahead by the length of the currently found longest sequence and scan backwards from the position you land on. At the very least this reduces the constant factor of O(n), and at least with random input, more items also means longer sequences, and thus longer and longer skips. But the difference from O(n) will not be much...
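A sketch of that skipping idea for a single target bit (the probe positions and the backward/forward scan around a matching probe are my own way of spelling it out):
#include <algorithm>
#include <iostream>
#include <string>

// Longest run of `target`: probe every (best+1)-th position; a run longer
// than `best` must cover at least one probe, so mismatching probes let us
// skip whole windows without looking at them.
int longestRun(const std::string& s, char target)
{
    int n = static_cast<int>(s.size());
    int best = 0;
    int p = 0;                                            // next probe position
    while (p < n) {
        if (s[p] != target) {                             // no run longer than best covers p
            p += best + 1;
            continue;
        }
        int lo = p, hi = p;
        while (lo > 0 && s[lo - 1] == target) --lo;       // scan backwards
        while (hi + 1 < n && s[hi + 1] == target) ++hi;   // scan forwards
        best = std::max(best, hi - lo + 1);
        p = hi + 1 + best;                                // next probe a longer run must cover
    }
    return best;
}

int main()
{
    std::string s = "00000000001111111111100001111111110000";
    std::cout << longestRun(s, '1') << " " << longestRun(s, '0') << "\n";   // 11 10
}
Consecutive probes are never more than best+1 apart, so any run longer than the current best is guaranteed to contain a probe and cannot be missed; on long runs the probes leap far ahead, which is where the average-case saving comes from.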
My MPI code deadlocks when I run this simple code on 512 processes on a cluster. I am far from the memory limit. If I increase the number of processes to 2048, which is far too many for this problem, the code runs again. The deadlock occurs in the line containing the MPI_File_write_all.
Any suggestions?
int count = imax*jmax*kmax;

// CREATE THE SUBARRAY
MPI_Datatype subarray;
int totsize [3] = {kmax, jtot, itot};
int subsize [3] = {kmax, jmax, imax};
int substart[3] = {0, mpicoordy*jmax, mpicoordx*imax};
MPI_Type_create_subarray(3, totsize, subsize, substart, MPI_ORDER_C, MPI_DOUBLE, &subarray);
MPI_Type_commit(&subarray);

// SET THE VALUE OF THE GRID EQUAL TO THE PROCESS ID FOR CHECKING
if(mpiid == 0) std::printf("Setting the value of the array\n");
for(int i=0; i<count; i++)
    u[i] = (double)mpiid;

// WRITE THE FULL GRID USING MPI-IO
if(mpiid == 0) std::printf("Write the full array to disk\n");
char filename[] = "u.dump";
MPI_File fh;
if(MPI_File_open(commxy, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL, MPI_INFO_NULL, &fh))
    return 1;

// select noncontiguous part of 3d array to store the selected data
MPI_Offset fileoff = 0; // the offset within the file (header size)
char name[] = "native";
if(MPI_File_set_view(fh, fileoff, MPI_DOUBLE, subarray, name, MPI_INFO_NULL))
    return 1;

if(MPI_File_write_all(fh, u, count, MPI_DOUBLE, MPI_STATUS_IGNORE))
    return 1;

if(MPI_File_close(&fh))
    return 1;
Your code looks right upon quick inspection, so I would suggest that you let your MPI-IO library help tell you what's wrong: instead of just returning on error, why don't you at least display the error? Here's some code that might help:
static void handle_error(int errcode, char *str)
{
    char msg[MPI_MAX_ERROR_STRING];
    int resultlen;

    MPI_Error_string(errcode, msg, &resultlen);
    fprintf(stderr, "%s: %s\n", str, msg);
    MPI_Abort(MPI_COMM_WORLD, 1);
}
Is MPI_SUCCESS guaranteed to be 0? I'd rather see
errcode = MPI_File_routine();
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_open(1)");
Put that in and if you are doing something tricky like setting a file view with offsets that are not monotonically non-decreasing, the error string might suggest what's wrong.
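Applied to the calls from your snippet it would look roughly like this (commxy, filename, fileoff, name, subarray, u and count are the variables from your code; the label strings are arbitrary):
int errcode;

errcode = MPI_File_open(commxy, filename,
                        MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL,
                        MPI_INFO_NULL, &fh);
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_open");

errcode = MPI_File_set_view(fh, fileoff, MPI_DOUBLE, subarray, name, MPI_INFO_NULL);
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_set_view");

errcode = MPI_File_write_all(fh, u, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_write_all");

errcode = MPI_File_close(&fh);
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_close");
Note that file operations default to the MPI_ERRORS_RETURN error handler (unlike communicator operations, which default to MPI_ERRORS_ARE_FATAL), so errors from these routines really do come back as return codes for you to inspect.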
public void DoSomething(byte[] array, byte[] array2, int start, int counter)
{
    int length = array.Length;
    int index = 0;
    while (count >= needleLen)
    {
        index = Array.IndexOf(array, array2[0], start, count - length + 1);
        int i = 0;
        int p = 0;
        for (i = 0, p = index; i < length; i++, p++)
        {
            if (array[p] != array2[i])
            {
                break;
            }
        }
Given that your for loop appears to have a loop body that depends on ordering, it's most likely not a candidate for parallelization.
However, you aren't showing the "work" involved here, so it's difficult to tell what it's doing. Since the loop relies on both i and p, and it appears that they vary independently, it's unlikely to be rewritable as a simple Parallel.For without reworking or rethinking your algorithm.
In order for a loop body to be a good candidate for parallelization, it typically needs to be order-independent and free of ordering constraints. The fact that you're basing your loop on two independently varying variables suggests that these requirements are not met by this algorithm.