Interpreting time command output on a multi-threaded program - linux

I have a multi-threaded program, and I am profiling the time taken from just before all the pthread_create calls to just after all the pthread_join calls.
Now I find that this time, let's call it X (shown below as "Done in Xms"), is actually the user + sys time of the time output rather than the real time. In my app, the numeric argument to a.out controls how many threads to spawn: ./a.out 1 spawns 1 pthread and ./a.out 2 spawns 2 threads, where each thread does the same amount of work.
I was expecting X to be the real time instead of the user + sys time. Can someone please tell me why this is not so? If it were, it would mean my app is indeed running in parallel without any locking between threads.
[jithin@whatsoeverclever tests]$ time ./a.out 1
Done in 320ms
real 0m0.347s
user 0m0.300s
sys 0m0.046s
[jithin@whatsoeverclever tests]$ time ./a.out 2
Done in 450ms
real 0m0.266s
user 0m0.383s
sys 0m0.087s
[jithin@whatsoeverclever tests]$ time ./a.out 3
Done in 630ms
real 0m0.310s
user 0m0.532s
sys 0m0.105s
Code
int main(int argc, char **argv) {
    //Read the words
    getWords();
    //Set number of words to use
    int maxWords = words.size();
    if(argc > 1) {
        int numWords = atoi(argv[1]);
        if(numWords > 0 && numWords < maxWords) maxWords = numWords;
    }
    //Init model
    model = new Model(MODEL_PATH);
    pthread_t *threads = new pthread_t[maxWords];
    pthread_attr_t attr;
    void *status;
    // Initialize and set thread joinable
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    int rc;
    clock_t startTime = clock();
    for(int i = 0; i < maxWords; i++) {
        //create thread
        rc = pthread_create(&threads[i], NULL, processWord, (void *)&words[i]);
        if(rc) {
            cout << "Error: unable to create thread: " << i << "," << rc << endl;
            exit(-1);
        }
    }
    // free attribute and wait for the other threads
    pthread_attr_destroy(&attr);
    for(int i = 0; i < maxWords; i++) {
        rc = pthread_join(threads[i], &status);
        if(rc) {
            cout << "Error: unable to join thread: " << i << "," << rc << endl;
            exit(-1);
        }
    }
    clock_t endTime = clock();
    float diff = (((float)endTime - (float)startTime) / CLOCKS_PER_SEC) * 1000;
    cout << "Done in " << diff << "ms\n";
    delete[] threads;
    delete model;
}

The clock function is specifically documented to return the processor time used by a process. If you want to measure wall time elapsed, it's not the right function.

Related

read variable value in main from thread in c++

I need a thread that executes a function in a while loop (say, it increments an int value). In main I need a while loop that executes some function (say, a for loop that counts from 0 to 5) and then reads the current value of the variable in the thread. The thread must keep running its own while loop irrespective of what's going on in main; however, the value of the thread's variable must not change while main reads it.
I guess this problem can be solved using std::atomic. However, this is a toy problem in which the variable in the thread is an int. In my actual problem the thread variable is of type Eigen::Quaternionf or float[4], so I need to ensure that the entire Eigen::Quaternionf or float[4] is held constant while it is read from main.
The cout in the thread is only for debugging; if the code runs with thread safety, it can be removed. I read in another post that using cout in a thread-safe manner may require writing a new wrapper around cout with a mutex, which I want to avoid.
My main concern is reading the variable in the correct order in main.
My code fails (today is my first day with multithreading): it does not keep the order of the output using cout (garbled output), and I am also not sure that the thread variable is correctly read by main. The code and selected parts of the observed output are below.
#include <thread>
#include <mutex>
#include <iostream>

int i = 0;

void safe_increment(std::mutex& i_mutex)
{
    while(1)
    {
        std::lock_guard<std::mutex> lock(i_mutex);
        ++i;
        std::cout << "thread: " << std::this_thread::get_id() << ", i=" << i << '\n';
    }
}

int main()
{
    std::mutex i_mutex;
    std::thread t1(safe_increment, std::ref(i_mutex));
    while(1)
    {
        for(int k = 0; k < 5; k++)
        {
            std::cout << "main: k =" << k << '\n';
        }
        std::lock_guard<std::mutex> lock(i_mutex);
        std::cout << "main: i=" << i << '\n';
    }
}
The output(selected parts) I get is
thread: 139711042705152, i=223893
thread: 139711042705152, i=223894
thread: 139711042705152, i=223895
main: i=223895
main: k =0
thread: main: k =1139711042705152
main: k =2
main: k =3
, i=main: k =4
223896
thread: 139711042705152, i=223897
thread: 139711042705152, i=223898
thread: 139711042705152, i=224801
thread: 139711042705152, i=224802
main: i=224802
main: k =0
main: k =1
thread: main: k =2
main: k =3
main: k =4
139711042705152, i=224803
thread: 139711042705152, i=224804
thread: 139711042705152, i=224805
i is properly synchronized with the mutex, well done! Obviously this runs until you force it to stop, so when you do find a better way to end execution, be sure to join your thread.
To fix the garbling, you need to synchronize on std::cout as well:
int main()
{
    std::mutex i_mutex;
    std::thread t1(safe_increment, std::ref(i_mutex));
    while(1)
    {
        std::lock_guard<std::mutex> lock(i_mutex); // moved up here to sync on std::cout << k
        for(int k = 0; k < 5; k++)
        {
            std::cout << "main: k =" << k << '\n';
        }
        std::cout << "main: i=" << i << '\n';
        if (i > 100) break;
    }
    t1.join(); // main will wait until the thread is done
    // your thread needs to have some way out of its while(1) as well
}
The thread function could then look like this:
void safe_increment(std::mutex& i_mutex)
{
    while(1)
    {
        std::lock_guard<std::mutex> lock(i_mutex);
        ++i;
        std::cout << "thread: " << std::this_thread::get_id() << ", i=" << i << '\n';
        if (i > 111) break;
    }
}

How many threads per core?

I am running a multi-threaded program on my computer, which has 4 cores. I am creating threads that run with the SCHED_FIFO, SCHED_OTHER, and SCHED_RR scheduling policies. What is the maximum number of each type of thread that can run simultaneously?
For example,
I'm pretty sure only four SCHED_FIFO threads can run at a time (one per core)
but I'm not sure about the other two.
Edit: my code, as asked (it's long, but most of it is for timing how long each thread takes to complete a delay task)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>
#include <string.h>
void *ThreadRunner(void *vargp);
void DisplayThreadSchdStats(void);
void delayTask(void);
int threadNumber = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
#define NUM_THREADS 9
//used to store the information of each thread
typedef struct {
    pthread_t threadID;
    int policy;
    struct sched_param param;
    long startTime;
    long taskStartTime;
    long endTime1;
    long endTime2;
    long endTime3;
    long runTime;
    char startDate[30];
    char endDate[30];
} ThreadInfo;

ThreadInfo myThreadInfo[NUM_THREADS];
//main function
int main(void){
    printf("running...\n");
    int fifoPri = 60;
    int rrPri = 30;
    //create the 9 threads and assign their scheduling policies
    for(int i = 0; i < NUM_THREADS; i++){
        if(i % 3 == SCHED_OTHER){
            myThreadInfo[i].policy = SCHED_OTHER;
            myThreadInfo[i].param.sched_priority = 0;
        }
        else if(i % 3 == SCHED_FIFO){
            myThreadInfo[i].policy = SCHED_RR;
            myThreadInfo[i].param.sched_priority = rrPri++;
        }
        else{
            myThreadInfo[i].policy = SCHED_FIFO;
            myThreadInfo[i].param.sched_priority = fifoPri++;
        }
        pthread_create(&myThreadInfo[i].threadID, NULL, ThreadRunner, &myThreadInfo[i]);
        pthread_cond_wait(&cond, &mutex);
    }
    printf("\n\n");
    //join each thread
    for(int g = 0; g < NUM_THREADS; g++){
        pthread_join(myThreadInfo[g].threadID, NULL);
    }
    //print out the stats for each thread
    DisplayThreadSchdStats();
    return 0;
}
//used to print out all of the threads, along with their stats
void DisplayThreadSchdStats(void){
    int otherNum = 0;
    long task1RR = 0;
    long task2RR = 0;
    long task3RR = 0;
    long task1FIFO = 0;
    long task2FIFO = 0;
    long task3FIFO = 0;
    long task1OTHER = 0;
    long task2OTHER = 0;
    long task3OTHER = 0;
    for(int g = 0; g < threadNumber; g++){
        printf("\nThread# [%d] id [0x%x] exiting...\n", g + 1, (int) myThreadInfo[g].threadID);
        printf("DisplayThreadSchdStats:\n");
        printf(" threadID = 0x%x \n", (int) myThreadInfo[g].threadID);
        if(myThreadInfo[g].policy == 0){
            printf(" policy = SCHED_OTHER\n");
            task1OTHER += (myThreadInfo[g].endTime1 - myThreadInfo[g].taskStartTime);
            task2OTHER += (myThreadInfo[g].endTime2 - myThreadInfo[g].endTime1);
            task3OTHER += (myThreadInfo[g].endTime3 - myThreadInfo[g].endTime2);
            otherNum++;
        }
        if(myThreadInfo[g].policy == 1){
            printf(" policy = SCHED_FIFO\n");
            task1FIFO += (myThreadInfo[g].endTime1 - myThreadInfo[g].taskStartTime);
            task2FIFO += (myThreadInfo[g].endTime2 - myThreadInfo[g].endTime1);
            task3FIFO += (myThreadInfo[g].endTime3 - myThreadInfo[g].endTime2);
        }
        if(myThreadInfo[g].policy == 2){
            printf(" policy = SCHED_RR\n");
            task1RR += (myThreadInfo[g].endTime1 - myThreadInfo[g].taskStartTime);
            task2RR += (myThreadInfo[g].endTime2 - myThreadInfo[g].endTime1);
            task3RR += (myThreadInfo[g].endTime3 - myThreadInfo[g].endTime2);
        }
        printf(" priority = %d \n", myThreadInfo[g].param.sched_priority);
        printf(" startTime = %s\n", myThreadInfo[g].startDate);
        printf(" endTime = %s\n", myThreadInfo[g].endDate);
        printf(" Task start TimeStamp in micro seconds [%ld]\n", myThreadInfo[g].taskStartTime);
        printf(" Task end TimeStamp in micro seconds [%ld] Delta [%lu]us\n", myThreadInfo[g].endTime1, (myThreadInfo[g].endTime1 - myThreadInfo[g].taskStartTime));
        printf(" Task end Timestamp in micro seconds [%ld] Delta [%lu]us\n", myThreadInfo[g].endTime2, (myThreadInfo[g].endTime2 - myThreadInfo[g].endTime1));
        printf(" Task end Timestamp in micro seconds [%ld] Delta [%lu]us\n\n\n", myThreadInfo[g].endTime3, (myThreadInfo[g].endTime3 - myThreadInfo[g].endTime2));
        printf("\n\n");
    }
    printf("Analysis: \n");
    printf(" for SCHED_OTHER, task 1 took %lu, task2 took %lu, and task 3 took %lu. (average = %lu)\n", (task1OTHER/otherNum), (task2OTHER/otherNum), (task3OTHER/otherNum), (task1OTHER/otherNum + task2OTHER/otherNum + task3OTHER/otherNum)/3);
    printf(" for SCHED_RR, task 1 took %lu, task2 took %lu, and task 3 took %lu. (average = %lu)\n", (task1RR/otherNum), (task2RR/otherNum), (task3RR/otherNum), (task1RR/otherNum + task2RR/otherNum + task3RR/otherNum)/3);
    printf(" for SCHED_FIFO, task 1 took %lu, task2 took %lu, and task 3 took %lu. (average = %lu)\n", (task1FIFO/otherNum), (task2FIFO/otherNum), (task3FIFO/otherNum), (task1FIFO/otherNum + task2FIFO/otherNum + task3FIFO/otherNum)/3);
}
//the function that runs the threads
void *ThreadRunner(void *vargp){
    pthread_mutex_lock(&mutex);
    char date[30];
    struct tm *ts;
    size_t last;
    time_t timestamp = time(NULL);
    ts = localtime(&timestamp);
    last = strftime(date, 30, "%c", ts);
    threadNumber++;
    ThreadInfo* currentThread;
    currentThread = (ThreadInfo*)vargp;
    //set the start time
    struct timeval tv;
    gettimeofday(&tv, NULL);
    long milltime0 = (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000;
    currentThread->startTime = milltime0;
    //set the start date
    strcpy(currentThread->startDate, date);
    if(pthread_setschedparam(pthread_self(), currentThread->policy, (const struct sched_param *)&(currentThread->param))){
        perror("pthread_setschedparam failed");
        pthread_exit(NULL);
    }
    if(pthread_getschedparam(pthread_self(), &currentThread->policy, (struct sched_param *)&currentThread->param)){
        perror("pthread_getschedparam failed");
        pthread_exit(NULL);
    }
    gettimeofday(&tv, NULL);
    long startTime = (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000;
    currentThread->taskStartTime = startTime;
    //delay task #1
    delayTask();
    //set the end time of task 1
    gettimeofday(&tv, NULL);
    long milltime1 = (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000;
    currentThread->endTime1 = milltime1;
    //delay task #2
    delayTask();
    //set the end time of task 2
    gettimeofday(&tv, NULL);
    long milltime2 = (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000;
    currentThread->endTime2 = milltime2;
    //delay task #3
    delayTask();
    //set the end time of task 3
    gettimeofday(&tv, NULL);
    long milltime3 = (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000;
    currentThread->endTime3 = milltime3;
    //set the end date
    timestamp = time(NULL);
    ts = localtime(&timestamp);
    last = strftime(date, 30, "%c", ts);
    strcpy(currentThread->endDate, date);
    //set the total run time of the thread
    long runTime = milltime3 - milltime0;
    currentThread->runTime = runTime;
    //unlock mutex
    pthread_mutex_unlock(&mutex);
    pthread_cond_signal(&cond);
    pthread_exit(NULL);
}
//used to delay each thread
void delayTask(void){
    for(int i = 0; i < 5000000; i++){
        printf("%d", i % 2);
    }
}
In short: there are no guarantees about how many threads will run in parallel, but all of them will run concurrently.
No matter how many threads you start in an application controlled by a general-purpose operating system, they will all run concurrently. That is, each thread will be given some non-zero time to run, and no particular order of execution of the threads' sections outside OS-defined synchronization primitives (waiting on mutexes, locks, etc.) is guaranteed. The only limit on the number of threads may be imposed by the OS's policies.
How many of your threads will be chosen to run in parallel at any given moment is not defined. The number obviously cannot exceed the number of logical processors visible to the OS (remember that the OS itself may be running inside a virtual machine, and that there are hardware tricks like SMT), and your threads will be competing with other threads present in the same system. OSes do offer APIs to query which threads/processes are currently in the running state and which are blocked or ready but not scheduled; otherwise writing programs like top would be problematic.
Explicitly setting thread priorities may affect the operating system's choices and increase the average number of your threads executing in parallel. Note that it can either help or hurt if used without thought. Still, the number will never be strictly equal to four inside a multitasking OS as long as there are other processes. The only way to make sure 100% of the CPU's hardware is dedicated to your threads 100% of the time is to run a bare-metal application, outside of any OS and any hypervisor (and even then there are peculiarities; see "Intel System Management Mode").
Inside a mostly idle general-purpose OS, if your threads are compute-intensive, I would guess the average parallel utilization ratio would be 3.9 to 4.0. But the slightest perturbation, and all bets are off.

Multiple threads running on one core instead of four, depending on the OS

I am using Raspbian on a Raspberry Pi 3.
I need to divide my code into a few blocks (2 or 4) and assign a thread per block to speed up calculations.
At the moment, I am testing with simple loops (see attached code), first on one thread and then on 4 threads. The execution time on 4 threads is always 4 times longer, so it looks like these 4 threads are scheduled to run on the same CPU.
How do I assign each thread to run on a different CPU? Even 2 threads on 2 CPUs would make a big difference for me.
I even tried g++ 6 with no improvement, and using OpenMP with "#pragma omp for" in the code still runs on one CPU.
I tried running this code on Fedora Linux x86 and saw the same behavior, but on Windows 8.1 with VS2015 I got different results: the time was the same on one thread as on 4 threads, so there it was running on different CPUs.
Would you have any suggestions?
Thank you.
#include <iostream>
//#include <arm_neon.h>
#include <ctime>
#include <thread>
#include <mutex>
#include <vector>
using namespace std;

float simd_dot0() {
    unsigned int i;
    unsigned long rezult;
    for (i = 0; i < 0xfffffff; i++) {
        rezult = i;
    }
    return rezult;
}

int main() {
    unsigned num_cpus = std::thread::hardware_concurrency();
    std::mutex iomutex;
    std::vector<std::thread> threads(num_cpus);
    cout << "Start Test 1 CPU" << endl;
    double t_start, t_end, scan_time;
    scan_time = 0;
    t_start = clock();
    simd_dot0();
    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 1 CPU: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 1 CPU" << endl;
    cout << "Start Test 4 CPU" << endl;
    scan_time = 0;
    t_start = clock();
    for (unsigned i = 0; i < 4; ++i) {
        threads[i] = std::thread([&iomutex, i] {
            simd_dot0();
            std::cout << "\nExecution time on CPU: " << i << std::endl;
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 4 CPUs: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 4 CPU" << endl;
    cout << "!!!Hello World!!!" << endl;
    while (1);
    return 0;
}
Edit :
On the Raspberry Pi 3 with Raspbian I used g++ 4.9 and 6 with the following flags:
-std=c++11 -ftree-vectorize -Wl,--no-as-needed -lpthread -march=armv8-a+crc -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -funsafe-math-optimizations -O3

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I get significant performance increases from raising the thread count from 1 to the number of cores available; however, the very same code sees no performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);

double generate_randn(uint64_t iterations)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();
    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();
    // Take the average of the threads' results
    total /= threads;
    std::cout << std::endl;
    std::cout << total << std::endl;
    std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable with data races" — as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();
    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);
    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    std::random_device make_seed;
    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count/threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();
    // Take the average of the threads' results
    total /= threads;
    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                     std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting and links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s) as #zwol posted in his answer, you will see it.
Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

Multithreading in MSVC is showing no improvement

I am trying to run the following code to test the speedup I can get on my system, and to check that my code is multi-threading. Using gcc on Linux, I get a factor of about 7. Using Visual Studio on Windows, I get no improvement. In MSVS 2012 I set /Qpar and /MD ... am I missing something? What am I doing wrong?
#include <iostream>
#include <thread>
#include <chrono>

#ifdef WIN32
#include <windows.h>
double getTime() {
    LARGE_INTEGER freq, val;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&val);
    return 1000 * (double)val.QuadPart / (double)freq.QuadPart;
}
#define clock_type double
#else
#include <ctime>
#define clock_type std::clock_t
#define getTime std::clock
#endif

static const int num_threads = 10;

//This function will be called from a thread
void f()
{
    volatile double d = 0;
    for(int n = 0; n < 10000; ++n)
        for(int m = 0; m < 10000; ++m)
            d += d*n*m;
}

int main()
{
    clock_type c_start = getTime();
    auto t_start = std::chrono::high_resolution_clock::now();
    std::thread t[num_threads];
    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(f);
    }
    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
    clock_type c_end = getTime();
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "CPU time used: "
              << 1000.0 * (c_end - c_start) / CLOCKS_PER_SEC
              << " ms\n";
    std::cout << "Wall clock time passed: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count()
              << " ms\n";
    std::cout << "Acceleration factor: "
              << 1000.0 * (c_end - c_start) / CLOCKS_PER_SEC / std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count() << "\n";
    return 0;
}
The output using MSVS is:
CPU time used: 1003.64 ms
Wall clock time passed: 998 ms
Acceleration factor: 1.00565
In Linux, I get:
CPU time used: 5264.83 ms
Wall clock time passed: 698 ms
Acceleration factor: 7.54274
EDIT 1: increased size of matrix in f() from 1000 to 10000.
EDIT 2: added getTime() function using QueryPerformanceCounter, and included #define's to switch between std::clock() and getTime()
On MSVC, clock returns wall time, and is thus not standards compliant (the standard specifies processor time). Since both of your measurements are then wall time, the "acceleration factor" comes out as 1 regardless of how many cores are in use.
