Please explain cache coherence - multithreading

I've recently learned about false sharing, which, as I understand it, stems from the CPU's attempt to maintain cache coherence between different cores.
However, doesn't the following example demonstrate that cache coherence is violated?
The example below launches several threads that increment a global variable x, several threads that assign the value of x to y, and an observer that tests whether y > x. The condition y > x should never hold if there were memory coherence between the cores, since y is only increased after x has been increased. However, this condition does occur according to the results of running this program. I tested it in Visual Studio, both x64 and x86, both Debug and Release, with pretty much the same results.
So, does memory coherence only happen when it's bad and never when it's good? :)
Please explain how cache coherence works and how it doesn't work. If you can guide me to a book that explains the subject I'll be grateful.
Edit: I've added mfence wherever possible; still there is no memory coherence (presumably due to a stale cache).
Also, I know the program has a data race; that's the whole point. My question is: why is there a data race if the CPU maintains cache coherence? (And if it weren't maintaining cache coherence, then what is false sharing and how does it happen?) Thank you.
#include <intrin.h>
#include <windows.h>
#include <iostream>
#include <thread>
#include <atomic>
#include <list>
#include <chrono>
#include <ratio>
#define N 1000000
#define SEPARATE_CACHE_LINES 0
#define USE_ATOMIC 0
#pragma pack(1)
struct
{
__declspec (align(64)) volatile long x;
#if SEPARATE_CACHE_LINES
__declspec (align(64))
#endif
volatile long y;
} data;
volatile long &g_x = data.x;
volatile long &g_y = data.y;
int g_observed;
std::atomic<bool> g_start;
void Observer()
{
while (!g_start);
for (int i = 0;i < N;++i)
{
_mm_mfence();
long y = g_y;
_mm_mfence();
long x = g_x;
_mm_mfence();
if (y > x)
{
++g_observed;
}
}
}
void XIncreaser()
{
while (!g_start);
for (int i = 0;i < N;++i)
{
#if USE_ATOMIC
InterlockedAdd(&g_x,1);
#else
_mm_mfence();
int x = g_x+1;
_mm_mfence();
g_x = x;
_mm_mfence();
#endif
}
}
void YAssigner()
{
while (!g_start);
for (int i = 0;i < N;++i)
{
#if USE_ATOMIC
long x = g_x;
InterlockedExchange(&g_y, x);
#else
_mm_mfence();
int x = g_x;
_mm_mfence();
g_y = x;
_mm_mfence();
#endif
}
}
int main()
{
using namespace std::chrono;
g_x = 0;
g_y = 0;
g_observed = 0;
g_start = false;
const int NAssigners = 4;
const int NIncreasers = 4;
std::list<std::thread> threads;
for (int i = 0;i < NAssigners;++i)
{
threads.emplace_back(YAssigner);
}
for (int i = 0;i < NIncreasers;++i)
{
threads.emplace_back(XIncreaser);
}
threads.emplace_back(Observer);
auto tic = high_resolution_clock::now();
g_start = true;
for (std::thread& t : threads)
{
t.join();
}
auto toc = high_resolution_clock::now();
std::cout << "x = " << g_x << " y = " << g_y << " number of times y > x = " << g_observed << std::endl;
std::cout << "&x = " << (int*)&g_x << " &y = " << (int*)&g_y << std::endl;
std::chrono::duration<double> t = toc - tic;
std::cout << "time elapsed = " << t.count() << std::endl;
std::cout << "USE_ATOMIC = " << USE_ATOMIC << " SEPARATE_CACHE_LINES = " << SEPARATE_CACHE_LINES << std::endl;
return 0;
}
Example output:
x = 1583672 y = 1583672 number of times y > x = 254
&x = 00007FF62BE95800 &y = 00007FF62BE95804
time elapsed = 0.187785
USE_ATOMIC = 0 SEPARATE_CACHE_LINES = 0

False sharing is mainly a performance issue, not a coherence or program-order issue. The CPU cache works at a granularity that is typically 16, 32, 64, ... bytes: the cache line. That means that if two independent data items are close together in memory, they experience each other's cache operations. Specifically, if &a / CACHE_LINE_SIZE == &b / CACHE_LINE_SIZE, then they share a cache line.
For example, if CPUs 0 and 1 are fighting over a, and CPUs 2 and 3 are fighting over b, the cache line containing a and b will thrash between all four caches. This is the effect of false sharing, and it causes a large performance drop.
False sharing happens because the coherence protocol in the caches demands a consistent view of memory. A good way to observe it is to put two atomic counters in a structure, spaced out by one or two KiB:
struct counters {
long a;
long pad[1024];
long b;
};
and find a nice little machine language function to do an atomic increment. Then cut loose NCPU/2 threads incrementing a and NCPU/2 threads incrementing b until they reach a big number.
Then repeat, commenting out the pad array. Compare the times.
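A rough sketch of that experiment (using std::atomic for the atomic increment instead of a hand-written machine-language routine; the names, thread counts, and iteration counts here are mine):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

#define PAD 1   // set to 0 to put both counters on the same cache line

struct Counters {
    std::atomic<long> a{0};
#if PAD
    long pad[1024];                        // spacing between the two counters
#endif
    std::atomic<long> b{0};
};

Counters g_counters;
const long kIterations = 10000000;

void bump(std::atomic<long>* c) {
    for (long i = 0; i < kIterations; ++i)
        c->fetch_add(1, std::memory_order_relaxed);   // the atomic increment
}

int main() {
    auto tic = std::chrono::high_resolution_clock::now();
    std::thread t1(bump, &g_counters.a), t2(bump, &g_counters.a);
    std::thread t3(bump, &g_counters.b), t4(bump, &g_counters.b);
    t1.join(); t2.join(); t3.join(); t4.join();
    std::chrono::duration<double> dt = std::chrono::high_resolution_clock::now() - tic;
    std::cout << "PAD = " << PAD << " time = " << dt.count() << " s" << std::endl;
}

With PAD set to 1 the two counters land on different cache lines and the two pairs of threads barely disturb each other; with PAD set to 0 the same line ping-pongs between all four cores and the run takes noticeably longer.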
When you are trying to get at machine details, clarity and precision are your friends; C++ and weird attribute declarations aren’t.

Related

How to trigger a race condition?

I am researching fuzzing approaches, and I want to know which approach is suitable for the race condition problem. Therefore I have a question about race conditions themselves.
Let's suppose we have a global variable and some threads have access to it without any restriction. How can we trigger the existing race condition? Is it enough to just run the function that uses the global variable from several threads? I mean, will simply running the function trigger the race condition anyway?
Below I put some code, and I know it has a race condition problem. I want to know which inputs I should give the functions to trigger the corresponding race condition.
#include<thread>
#include<vector>
#include<iostream>
#include<experimental/filesystem>
#include<Windows.h>
#include<atomic>
using namespace std;
namespace fs = experimental::filesystem;
volatile int totalSum;
//atomic<int> totalSum;
volatile int* numbersArray;
void threadProc(int startIndex, int endIndex)
{
Sleep(300);
for(int i = startIndex; i < endIndex; i++)
{
totalSum += numbersArray[i];
}
}
void performAddition(int maxNum, int threadCount)
{
totalSum = 0;
numbersArray = new int[maxNum];
for(int i = 0; i < maxNum; i++)
{
numbersArray[i] = i + 1;
}
int numbersPerThread = maxNum / threadCount;
vector<thread> workerThreads;
for(int i = 0; i < threadCount; i++)
{
int startIndex = i * numbersPerThread;
int endIndex = startIndex + numbersPerThread;
if (i == threadCount - 1)
endIndex = maxNum;
workerThreads.emplace_back(threadProc, startIndex, endIndex);
}
for(int i = 0; i < workerThreads.size(); i++)
{
workerThreads[i].join();
}
delete[] numbersArray;
}
void printUsage(char* progname)
{
cout << "usage: " << fs::path(progname).filename() << " maxNum threadCount\t with 1<maxNum<=10000, 0<threadCount<=maxNum" << endl;
}
int main(int argc, char* argv[])
{
if(argc != 3)
{
printUsage(argv[0]);
return -1;
}
long int maxNum = strtol(argv[1], nullptr, 10);
long int threadCount = strtol(argv[2], nullptr, 10);
if(maxNum <= 1 || maxNum > 10000 || threadCount <= 0 || threadCount > maxNum)
{
printUsage(argv[0]);
return -2;
}
performAddition(maxNum, threadCount);
cout << "Result: " << totalSum << " (soll: " << (maxNum * (maxNum + 1))/2 << ")" << endl;
return totalSum;
}
Thanks for your help
There may be many kinds of race conditions. One example for your case:
one thread:
reads the commonly accessible variable (value 1)
increments it (to 2)
writes the resulting value back to the common variable (now 2)
a second thread starts just after the first thread has read the common value:
it reads the same value (1)
increments the value it read (to 2)
then writes the calculated value to the common variable at about the same time as the first one (2)
As a result,
the variable was incremented only by one (to 2), but it should have been incremented by two (to 3), since two threads were acting on it.
Testing race conditions:
For your purpose (in the above example), you can detect the race condition when you get a different result than expected, as in the sketch below.
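A minimal sketch (not your program; the names and counts are mine): several threads hammer a plain, unsynchronized counter, and the race is detected simply by comparing the final value with the expected one.

#include <iostream>
#include <thread>
#include <vector>

volatile int counter = 0;                  // deliberately not atomic, not locked
const int kIncrementsPerThread = 100000;

void incrementer() {
    for (int i = 0; i < kIncrementsPerThread; ++i)
        counter = counter + 1;             // unsynchronized read-modify-write
}

int main() {
    const int kThreads = 4;
    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; ++i)
        workers.emplace_back(incrementer);
    for (auto& t : workers)
        t.join();
    const int expected = kThreads * kIncrementsPerThread;
    // A result smaller than expected means increments were lost: the race fired.
    std::cout << "expected " << expected << ", got " << counter << std::endl;
}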
Triggering:
If you want the described situation to happen every time, for testing purposes, you will need to coordinate the work of the two threads.
Nevertheless, coordinating the two threads arguably violates the definition of a race condition, if it is defined as: "A race condition or race hazard is the behavior of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events." So you need to know what you want; in summary, a race condition is normally unwanted behavior, but in your case you want it to happen, which makes sense for testing purposes.
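If you do want to force the lost-update interleaving deterministically (purely as a test harness; this sketch and its names are mine, and it deliberately keeps the data race on the final writes), you can make each thread wait until the other has performed its read before either of them writes back:

#include <atomic>
#include <iostream>
#include <thread>

int shared_value = 1;                      // the commonly accessible variable
std::atomic<bool> a_read{false}, b_read{false};

void thread_a() {
    int local = shared_value;              // reads 1
    a_read = true;
    while (!b_read) {}                     // wait until B has also read the old value
    shared_value = local + 1;              // writes back 2
}

void thread_b() {
    int local = shared_value;              // also reads 1
    b_read = true;
    while (!a_read) {}
    shared_value = local + 1;              // also writes 2 -- one increment is lost
}

int main() {
    std::thread a(thread_a), b(thread_b);
    a.join();
    b.join();
    std::cout << "value = " << shared_value << " (two increments should have given 3)" << std::endl;
}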
If you are asking generally when a race condition can occur: it depends on your software design (e.g., you can have shared atomic integers which are fine to use), on the hardware design (e.g., variables stored in temporary registers), and, generally, on luck.
Hope this helps,
Witold

C++ Threaded Template Vector Quicksort

Threaded quick sort method:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "MD5.h"
#include <thread>
using namespace std;
template<typename T>
void quickSort(vector<T> &arr, int left, int right) {
int i = left, j = right; //Make local copies to modify
T tmp; //Temporary variable used for swapping.
T pivot = arr[(left + right) / 2]; //Pick the middle element as pivot (integer division truncates).
while (i <= j) {
while (arr[i] < pivot) //is i < pivot?
i++;
while (arr[j] > pivot) //Is j > pivot?
j--;
if (i <= j) { //Swap
tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
i++;
j--;
}
};
thread left_t; //Left thread
thread right_t; //Right thread
if (left < j)
left_t = thread(quickSort<T>, ref(arr), left, j);
if (i < right)
right_t = thread(quickSort<T>, ref(arr), i, right);
if (left < j)
left_t.join();
if (left < j)
right_t.join();
}
int main()
{
vector<int> table;
for (int i = 0; i < 100; i++)
{
table.push_back(rand() % 100);
}
cout << "Before" << endl;
for each(int val in table)
{
cout << val << endl;
}
quickSort(table, 0, 99);
cout << "After" << endl;
for each(int val in table)
{
cout << val << endl;
}
char temp = cin.get();
return 0;
}
The above program lags like mad and spams "abort() has been called".
I'm thinking it has something to do with the vector and it having threading issues.
I've seen the question asked by Daniel Makardich; his utilizes a vector<int> while mine uses a vector<T>.
You don't have a problem with quicksort, but with passing a function template to a thread. There is no function quickSort; you need to explicitly give the type to instantiate the function template:
#include <thread>
#include <iostream>
template<typename T>
void f(T a) { std::cout << a << '\n'; }
int main () {
std::thread t;
int a;
std::string b("b");
t = std::thread(f, a); // Won't work
t = std::thread(f<int>, a);
t.join();
t = std::thread(f<decltype(b)>, b); // a bit fancier, more dynamic way
t.join();
return 0;
}
I suspect in your case this should do:
left_t = thread(quickSort<T>, ref(arr), left, j);
And similarly for right_t. Also, you have a mistake there: you are trying to use operator()() instead of constructing an object. That is why the error is different.
I can't verify, though, because there's no minimal verifiable example =/
I don't know if it's possible to make the compiler use automatic type deduction for f passed as a parameter; if anyone knows, that would probably make this a better answer.
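One way to sidestep the explicit template argument (a small sketch of my own, not part of the original answer) is to wrap the call in a lambda, so ordinary deduction happens at the call site inside the lambda body:

left_t = thread([&arr, left, j] { quickSort(arr, left, j); });
right_t = thread([&arr, i, right] { quickSort(arr, i, right); });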
The problem was with the thread joins, plus what #luk32 said.
I needed to convert the threads to pointers to threads.
thread* left_t = nullptr; //Left thread
thread* right_t = nullptr; //Right thread
if (left < j)
left_t = new thread(quickSort<T>, ref(arr), left, j);
if (i < right)
right_t = new thread(quickSort<T>, ref(arr), i, right);
if (left_t)
{
left_t->join();
delete left_t;
}
if (right_t)
{
right_t->join();
delete right_t;
}
Note that a default-constructed thread object that never has a real thread assigned to it is not joinable: destroying it is fine, but calling join() on it throws. The abort() comes from the opposite case, destroying a thread object that is still joinable (started but never joined).
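For what it's worth, the raw pointers and new/delete aren't strictly needed: std::thread::joinable() tells you whether a thread object actually owns a thread, so the value-based version can be kept (a sketch of the tail of quickSort, same names as above):

thread left_t;                         // default-constructed: not joinable yet
thread right_t;
if (left < j)
    left_t = thread(quickSort<T>, ref(arr), left, j);
if (i < right)
    right_t = thread(quickSort<T>, ref(arr), i, right);
if (left_t.joinable())
    left_t.join();
if (right_t.joinable())
    right_t.join();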

Standard Deviation Code Output Error

I can't see any output from this code I have written. I'm trying to calculate the mean and standard deviation of a set of numbers from a file. I'm lost as to what the problem is, and I won't know if my code is right until I can see output. Here's what I have so far:
#include <iostream>
#include <cmath>
#include <fstream>
using namespace std;
int main()
{
// Declare Variables
int n;
int xi;
int sdv;
int sum;
int sum2;
int sum3;
int mean;
// Declare and open input files
ifstream inData;
inData.open("score.dat");
if (!inData) // Incorrect Files
{
cout << "Cannot open input file." << endl;
return 1;
}
// Initialize variables and output
inData >> xi;
n = 0;
sum = 0;
sum3 = 0;
sdv = 0;
mean = 0;
while (inData)
{
sum += xi;
sum2 = sum * sum;
sum3 += (xi * xi);
mean = sum / n;
sdv = (sum3 - sum2) / (n * (n - 1));
inData >> xi;
}
// Print commands
cout << "The Standard Deviation of the Tests is:" << sdv << endl;
cout << "The Mean of the Tests is: " << mean << endl;
inData.close();
system("pause");
return 0;
}
After running your code I found a few bugs. You probably don't get any output because the program crashes the first time through the while loop on the line:
mean = sum / n;
due to a divide-by-zero error (n is still 0 at that point).
The other bug is that you never increment n inside your loop, so it stays at zero instead of counting the numbers you read.
Adding an n++ at the beginning of your loop will fix that:
while (inData)
{
n++;
sum += xi;
...
But you still get a divide by zero on the sdv when n == 1:
sdv = (sum3 - sum2) / (1 * (1 - 1));
If you add a condition before the division, it will work:
if (n >= 2)
sdv = (sum3 - sum2) / (n * (n - 1));
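Putting the two fixes together, the loop becomes (a sketch, keeping your original integer variables):

while (inData)
{
    n++;                                   // count the value just read
    sum += xi;
    sum2 = sum * sum;
    sum3 += (xi * xi);
    mean = sum / n;                        // n is at least 1 here
    if (n >= 2)                            // n * (n - 1) would be 0 for n == 1
        sdv = (sum3 - sum2) / (n * (n - 1));
    inData >> xi;
}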
Look into debugging tools like gdb to help catch things like this.

How to solve http://www.spoj.com/problems/MST1/ when n is up to 10^9

Using a bottom-up DP approach, I am able to solve http://www.spoj.com/problems/MST1/ for n up to about 10^8.
If the input n is very large, up to 10^9, I will not be able to create a lookup table of that size. So what would be a better approach to solve the problem?
Is there any heuristic solution ?
#include <iostream>
#include <climits>
#include <algorithm>
using namespace std;
int main()
{
const int N_MAX = 20000001;
int *DP = new int[N_MAX];
DP[1] = 0;
for (int i = 2; i < N_MAX; i++) {
int minimum = DP[i - 1];
if (i % 3 == 0) minimum = min(minimum, DP[i/3]);
if (i % 2 == 0) minimum = min(minimum, DP[i/2]);
DP[i] = minimum + 1;
}
int T, N; cin >> T;
int c = 1;
while (T--) {
cin >> N;
cout << "Case " << c++ << ": " << DP[N] << endl;
}
delete[] DP;
}
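One way around the full table (a sketch of my own, not tested against the judge): recurse top-down and memoize only the states actually visited, using the observation that before dividing it is never worse to subtract just n mod 2 ones (then divide by 2) or n mod 3 ones (then divide by 3). The reachable states are roughly of the form n / (2^a * 3^b), so only a few hundred exist even for n = 10^9.

#include <algorithm>
#include <iostream>
#include <unordered_map>
using namespace std;

unordered_map<long long, long long> memo;

// Minimum steps to reduce n to 1 using n-1, n/2 (if divisible), n/3 (if divisible).
long long steps(long long n) {
    if (n <= 1) return 0;
    if (n == 2 || n == 3) return 1;
    auto it = memo.find(n);
    if (it != memo.end()) return it->second;
    long long best = min(n % 2 + 1 + steps(n / 2),   // pay n%2 subtractions, then halve
                         n % 3 + 1 + steps(n / 3));  // pay n%3 subtractions, then divide by 3
    memo[n] = best;
    return best;
}

int main() {
    int T;
    cin >> T;
    for (int c = 1; c <= T; ++c) {
        long long n;
        cin >> n;
        cout << "Case " << c << ": " << steps(n) << endl;
    }
}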

ArrayFire frame search algorithm crash

I am new to ArrayFire and CUDA development in general; I just started using ArrayFire a couple of days ago after failing miserably with Thrust.
I am building an ArrayFire-based algorithm that is supposed to search for a single 32x32-pixel frame in a database of a couple hundred thousand 32x32 frames stored in device memory.
First I initialize a matrix that has 1024 + 1 pixels as rows (I need an extra one to keep a frame group id) and a predefined number (in this case 1000) of frames, indexed by column.
Here's the function that performs the search. If I uncomment "pixels_uint32 = device_frame_ptr[pixel_group_idx];" the program crashes. The pointer seems to be valid, so I do not understand why this happens. Maybe there is something I do not know about accessing device memory in this way?
#include <iostream>
#include <stdio.h>
#include <sys/types.h>
#include <arrayfire.h>
#include "utils.h"
using namespace af;
using namespace std;
/////////////////////////// CUDA settings ////////////////////////////////
#define TEST_DEBUG false
#define MAX_NUMBER_OF_FRAMES 1000 // maximum (2499999 frames) X (1024 + 1 pixels per frame) x (2 bytes per pixel) = 5.124.997.950 bytes (~ 5GB)
#define BLOB_FINGERPRINT_SIZE 1024 //32x32
//percentage of macroblocks that should match: 0.9 means 90%
#define MACROBLOCK_COMPARISON_OVERALL_THRESHOLD 768 //1024 * 0.75
//////////////////////// End of CUDA settings ////////////////////////////
array search_frame(array d_db_vec)
{
try {
uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;
// create one-element array to hold the result of the computation
array frame_found(1,MAX_NUMBER_OF_FRAMES, u32);
frame_found = 0;
gfor (array frame_idx, MAX_NUMBER_OF_FRAMES) {
// get the blob id it's the last coloumn of the matrix
array blob_id = d_db_vec(number_of_uint32_for_frame, frame_idx); // addressing with (pixel_idx, frame_idx)
// define some hardcoded pixel to search for
uint8_t searched_r = 0x0;
uint8_t searched_g = 0x3F;
uint8_t searched_b = 0x0;
uint8_t b1 = 0;
uint8_t g1 = 0;
uint8_t r1 = 0;
uint8_t b2 = 0;
uint8_t g2 = 0;
uint8_t r2 = 0;
uint32_t sum1 = 0;
uint32_t sum2 = 0;
uint32_t *device_frame_ptr = NULL;
uint32_t pixels_uint32 = 0;
uint pixel_match_counter = 0;
//uint pixel_match_counter = 0;
array frame = d_db_vec(span, frame_idx);
device_frame_ptr = frame.device<uint32_t>();
for (uint pixel_group_idx = 0; pixel_group_idx < number_of_uint32_for_frame; pixel_group_idx++) {
// test to see if the whole matrix is traversed
// d_db_vec(pixel_group_idx, frame_idx) = 0;
/////////////////////////////// PROBLEMATIC CODE ///////////////////////////////////
pixels_uint32 = 0x7E007E0;
//pixels_uint32 = device_frame_ptr[pixel_group_idx]; //why does this crash the program?
// if I uncomment the above line the program tries to copy the u32 frame into the pixels_uint32 variable
// something goes wrong, since the pointer device_frame_ptr is not NULL and the elements should be there judging by the lines above
////////////////////////////////////////////////////////////////////////////////////
// splitting the first pixel into its components
b1 = (pixels_uint32 & 0xF8000000) >> 27; //(input & 11111000000000000000000000000000)
g1 = (pixels_uint32 & 0x07E00000) >> 21; //(input & 00000111111000000000000000000000)
r1 = (pixels_uint32 & 0x001F0000) >> 16; //(input & 00000000000111110000000000000000)
// splitting the second pixel into its components
b2 = (pixels_uint32 & 0xF800) >> 11; //(input & 00000000000000001111100000000000)
g2 = (pixels_uint32 & 0x07E0) >> 5; //(input & 00000000000000000000011111100000)
r2 = (pixels_uint32 & 0x001F); //(input & 00000000000000000000000000011111)
// checking if they are a match
sum1 = abs(searched_r - r1) + abs(searched_g - g1) + abs(searched_b - b1);
sum2 = abs(searched_r - r2) + abs(searched_g - g2) + abs(searched_b - b2);
// if they match, increment the local counter
pixel_match_counter = (sum1 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;
pixel_match_counter = (sum2 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;
}
bool is_found = pixel_match_counter > MACROBLOCK_COMPARISON_OVERALL_THRESHOLD;
// write down if the frame is a match or not
frame_found(0,frame_idx) = is_found ? frame_found(0,frame_idx) : blob_id;
}
// test to see if the whole matrix is traversed - this has to print zeroes
if (TEST_DEBUG)
print(d_db_vec);
// return the matches array
return frame_found;
} catch (af::exception& e) {
fprintf(stderr, "%s\n", e.what());
throw;
}
}
// make 2 green pixels
uint32_t make_test_pixel_group() {
uint32_t b1 = 0x0; //11111000000000000000000000000000
uint32_t g1 = 0x7E00000; //00000111111000000000000000000000
uint32_t r1 = 0x0; //00000000000111110000000000000000
uint32_t b2 = 0x0; //00000000000000001111100000000000
uint32_t g2 = 0x7E0; //00000000000000000000011111100000
uint32_t r2 = 0x0; //00000000000000000000000000011111
uint32_t green_pix = b1 | g1 | r1 | b2 | g2 | r2;
return green_pix;
}
int main(int argc, char ** argv)
{
info();
/////////////////////////////////////// CREATE THE DATABASE ///////////////////////////////////////
uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;
array d_db_vec(number_of_uint32_for_frame + 1, // fingerprint size + 1 extra u32 for blob id
MAX_NUMBER_OF_FRAMES, // number of frames
u32); // type of elements is 32-bit unsigned integer (unsigned) with the configuration RGBRGB (565565)
if (TEST_DEBUG == true) {
for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {
for (uint pix_idx = 0; pix_idx < number_of_uint32_for_frame; pix_idx++) {
d_db_vec(pix_idx, frame_idx) = make_test_pixel_group(); // fill everything with green :D
}
}
} else {
d_db_vec = rand(number_of_uint32_for_frame + 1, MAX_NUMBER_OF_FRAMES);
}
cout << "Setting blob ids. \n\n";
for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {
// set the blob id to 123456
d_db_vec(number_of_uint32_for_frame, frame_idx) = 123456; // blob_id = 123456
}
if (TEST_DEBUG)
print(d_db_vec);
cout << "Done setting blob ids. \n\n";
//////////////////////////////////// CREATE THE SEARCHED FRAME ///////////////////////////////////
// to be done, for now we use the hardcoded values at line 37-39 to simulate the searched pixel:
//37 uint8_t searched_r = 0x0;
//38 uint8_t searched_g = 0x3F;
//39 uint8_t searched_b = 0x0;
///////////////////////////////////////////// SEARCH /////////////////////////////////////////////
clock_t timer = startTimer();
for (int i = 0; i< 1000; i++) {
array frame_found = search_frame(d_db_vec);
if (TEST_DEBUG)
print(frame_found);
}
stopTimer(timer);
return 0;
}
Here is the console output with the line commented:
arrayfire/examples/helloworld$ ./helloworld
ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)
License: Server (27000#server.accelereyes.com)
CUDA toolkit 5.0, driver 304.54
GPU0 Tesla C2075, 5376 MB, Compute 2.0
Memory Usage: 5312 MB free (5376 MB total)
Setting blob ids.
Done setting blob ids.
Time: 0.03 seconds.
Here is the console output with the line uncommented:
arrayfire/examples/helloworld$ ./helloworld
ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)
License: Server (27000#server.accelereyes.com)
CUDA toolkit 5.0, driver 304.54
GPU0 Tesla C2075, 5376 MB, Compute 2.0
Memory Usage: 5312 MB free (5376 MB total)
Setting blob ids.
Done setting blob ids.
Segmentation fault
Thanks in advance for any help on this issue. I really tried everything but without success.
Disclaimer: I am the lead developer of ArrayFire. I see that you have posted on the AccelerEyes forums as well, but I am posting here to clear up some common issues with your code.
Do not use .device(), .host(), or .scalar() inside a gfor loop. This causes divergences inside the GFOR loop, which GFOR was not designed for.
You cannot index into a device pointer from host code. The pointer refers to a location on the GPU; when you do device_frame_ptr[pixel_group_idx], the system looks up the equivalent position on the CPU. This is the reason for your segmentation fault.
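To illustrate that point with plain CUDA (this is the generic CUDA runtime API, not ArrayFire-specific): data behind a device pointer has to be copied back to host memory before host code can index it.

// Generic CUDA illustration: a device pointer may only be dereferenced in device code.
#include <stdint.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    uint32_t host_value = 0x07E007E0, result = 0;
    uint32_t* d_ptr = NULL;
    cudaMalloc((void**)&d_ptr, sizeof(uint32_t));
    cudaMemcpy(d_ptr, &host_value, sizeof(uint32_t), cudaMemcpyHostToDevice);
    // result = d_ptr[0];                  // WRONG: host dereference of a device pointer
    cudaMemcpy(&result, d_ptr, sizeof(uint32_t), cudaMemcpyDeviceToHost);  // correct
    cudaFree(d_ptr);
    printf("%08X\n", result);
    return 0;
}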
Use vectorized code. For example, you don't need the inner for loop inside the gfor. Instead of computing b1 = (pixels_uint32 & 0xF8000000) >> 27; one element at a time in a for loop, you can do array B1 = (frame & 0xF8000000) >> 27;. That is, instead of bringing the data back to the CPU and using a for loop, you do the entire operation on the GPU.
Don't use if-else or ternary operators inside GFOR; these cause divergences again. For example, use pixel_match_counter = sum(sum1 <= 16) + sum(sum2 <= 16); and frame_found(0, frame_idx) = is_found * frame_found(0, frame_idx) + (1 - is_found) * blob_id;.
I have answered the particular problem you are facing. If you have any follow-up questions, please follow up on our forums and/or our support email. Stack Overflow is good for asking a specific question, but not for debugging your entire program.
