Visual Studio VC2013 enabling SSE4.1 without AVX - visual-c++

I have a rather simple question, but after searching for quite some time I have found no real answer. Microsoft suggests enabling the AVX enhanced instruction set in order to also make use of SSE4-optimized code.
Unfortunately, from what I have read, this also forces the use of an AVX-capable CPU. Is there a known way to enable SSE4 without requiring AVX in VC2013?
The background of this question is obvious, I think: SSE4 has been supported for much longer and only requires older CPUs (the first from around 2006, I believe), while AVX requires CPUs from 2011 onward. The DLL in question only uses SSE4 optimizations, but for now I have to stick with SSE2, sacrificing performance, in order to keep it working.

It seems that the /arch:SSE2 flag still allows SSE2 and later intrinsics to be used. I don't have Visual Studio installed, but this example works (_mm_floor_ps is SSE4.1 specific):
#include <smmintrin.h>
#include <iostream>

using namespace std;

int main()
{
    __declspec(align(16)) float values[4] = {1.3f, 2.1f, 4.3f, 5.1f};

    for (int i = 0; i < 4; i++)
        cout << values[i] << ' ';
    cout << endl;

    __m128 x = _mm_load_ps(values);
    x = _mm_floor_ps(x);
    _mm_store_ps(values, x);

    for (int i = 0; i < 4; i++)
        cout << values[i] << ' ';
    cout << endl;
}
You can try it online here.
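If the DLL might also be loaded on machines without SSE4.1, a runtime check is another option. The sketch below is not from the original answer; it uses MSVC's __cpuid intrinsic from <intrin.h> (CPUID leaf 1, ECX bit 19 reports SSE4.1), and the idea of dispatching to separate SSE4.1/SSE2 code paths is only an assumption about how the DLL is organised:
#include <intrin.h>   // __cpuid (MSVC)
#include <iostream>

static bool cpu_has_sse41()
{
    int info[4] = {0};
    __cpuid(info, 1);                    // leaf 1: processor feature flags
    return (info[2] & (1 << 19)) != 0;   // ECX bit 19 = SSE4.1
}

int main()
{
    // A real DLL would use this check to choose between an SSE4.1 and an SSE2 code path.
    if (cpu_has_sse41())
        std::cout << "SSE4.1 available\n";
    else
        std::cout << "falling back to SSE2\n";
}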

Related

Why is the run time shorter when I use a lock in a C++ program?

I am practising multithreaded programming with C++. When I use std::lock_guard in the code below, the run time becomes shorter than without it. That's surprising; why?
The lock version:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>

using namespace std;

class test {
    std::mutex m;
    int a;
public:
    test() : a(0) {}
    void add() {
        std::lock_guard<std::mutex> guard(m);
        for (int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};

int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    t.print();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
2000000000
time = 5.71852s
The no-lock version is:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>

using namespace std;

class test {
    std::mutex m;
    int a;
public:
    test() : a(0) {}
    void add() {
        // std::lock_guard<std::mutex> guard(m);
        for (int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};

int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    t.print();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
1010269798
time = 10.765s
I'm using Ubuntu 18.04; the g++ version is:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
In my opinion the lock is an extra operation, so it should cost more time, of course.
Can someone explain this? Thanks.
Modifying a variable from multiple threads causes undefined behaviour. This means the compiler and the processor are free to do whatever they want in this case (for example removing the loop, or never reloading the variable from memory, since it is not supposed to be modified by another thread in the first place). As a result, studying the performance of this case is not really meaningful.
Assuming the compiler does not perform any such (allowed) optimizations, the program contains a race condition. It is almost certainly slower because of cache-line bouncing: multiple cores compete for the same cache line, and moving it from one core to another is very slow compared to incrementing the variable from the L1 cache (this is most likely the overhead you see). Indeed, on standard x86-64 platforms such as mainstream Intel processors, moving a contended cache line from one core to another means invalidating the copies held in the other cores' L1/L2 caches and fetching it from the L3 cache, which is much slower than the L1 (lower throughput and much higher latency). This behaviour depends on the target platform (mainly the processor, besides compiler optimizations), but most platforms work similarly. For more information, read about cache-coherence protocols (e.g. MESI).
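To observe the contention effect without the undefined behaviour, a minimal sketch (not part of the original answer) is to make the counter a std::atomic<int>; the result is then always correct, yet the run time is still dominated by the two cores bouncing the counter's cache line between them:
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> a{0};
    auto work = [&a] {
        for (int i = 0; i < 100000000; i++)
            a.fetch_add(1, std::memory_order_relaxed);  // contended read-modify-write on one cache line
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << a << std::endl;  // always 200000000, but slow because of cache-line bouncing
    return 0;
}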

Difference between Linux time and Performance clocks in code

I was running a simple timing test on some C++ code, and I ran across an artifact that I am not 100% sure how to explain.
Setup
My code uses C++11 high_resolution_clock to measure elapsed time. I also wrap the execution of my program using Linux's time command (/usr/bin/time). For my program, the high_resolution_clock reports ~2s while time reports ~7s (~6.5s user and ~.5s system). Also using the verbose option on time shows that my program used 100% of the CPU with 1 voluntary context switch and 10 involuntary context switches (/usr/bin/time -v).
Question
My question is what causes such a dramatic difference between OS time measurements and performance time measurements?
My initial thoughts
Through my knowledge of operating systems, I am assuming these differences are solely caused by context switches with other programs (as noted by time -v).
Is this the only reason for this difference? And should I trust the time reported by my program or the system when looking at code performance?
Again, my assumption is to trust the time computed by my program over Linux's time, because the latter measures more than just my program's CPU usage.
Caveats
I am not posting code, as it isn't really relevant to the issue at hand. If you wish to know, it is a simple test that times 100,000,000 random floating-point arithmetic operations.
I know other clocks in my C++ code might be more or less appropriate for different circumstances (see this Stack Overflow question); high_resolution_clock is just an example.
Edit: Code as requested
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

int main() {
    size_t n = 100000000;
    double d = 1;
    auto start_hrc = high_resolution_clock::now();
    for (size_t i = 0; i < n; ++i) {
        switch (rand() % 4) {
            case 0: d += 0.0001; break;
            case 1: d -= 0.0001; break;
            case 2: d *= 0.0001; break;
            case 3: d /= 0.0001; break;
        }
    }
    auto end_hrc = high_resolution_clock::now();
    duration<double> diff_hrc = end_hrc - start_hrc;
    cout << d << endl << endl;
    cout << "Time-HRC: " << diff_hrc.count() << " s" << endl;
}
My question is what causes such a dramatic difference between OS time measurements and performance time measurements?
It looks like your system takes a while to start your application, probably a resource issue: not enough free memory (swapping) or an oversubscribed CPU. Keep in mind that high_resolution_clock only measures the region between the two now() calls, while /usr/bin/time measures the whole process from start to exit.
No dramatic difference is observed on my desktop:
Time-HRC: 1.39005 s
real 0m1.391s
user 0m1.387s
sys 0m0.004s
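One way to narrow down where the extra time goes (a sketch, not part of the original answer) is to measure wall-clock time and process CPU time around the same region, for example with std::chrono::steady_clock and std::clock. If the two agree with each other but not with /usr/bin/time, the remaining seconds are being spent outside the timed region (process startup, dynamic loading, shutdown):
#include <chrono>
#include <cstdio>
#include <ctime>

int main() {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    // Stand-in workload; replace with the code being measured.
    volatile double d = 1.0;
    for (long i = 0; i < 100000000; ++i) d = d + 0.0001;

    std::clock_t cpu_end = std::clock();
    auto wall_end = std::chrono::steady_clock::now();

    std::printf("wall: %.3f s\n",
                std::chrono::duration<double>(wall_end - wall_start).count());
    std::printf("cpu:  %.3f s\n",
                double(cpu_end - cpu_start) / CLOCKS_PER_SEC);
}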

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I'm getting significant performance increases from increasing the threads from 1 to the amount of cores available; however, the very same code does not see any performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);

double generate_randn(uint64_t iterations)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();
    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;
    std::cout << std::endl;
    std::cout << total << std::endl;
    std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable with data races" — as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();

    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);

    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    std::random_device make_seed;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count/threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;
    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                     std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting and links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s) as #zwol posted in his answer, you will see it.
Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

Creating pointers to the sub-arrays of a mass-allocated one-dimensional array, and the VC++ release build

This is my first post; I hope I am not making any mistakes.
I have the following code. I am trying to allocate and access a two-dimensional array in one shot and, more importantly, as one contiguous byte array. I also need to be able to access each sub-array individually, as shown in the code. It works fine in debug mode, but the VS 2012 release build causes problems at runtime once compiler optimizations are applied; if I disable the release optimizations, it works. Do I need some kind of special cast to inform the compiler?
My priorities are fast allocation and network communication of the complete array, while at the same time being able to work with its sub-arrays.
I prefer not to use boost.
Thanks a lot :)
void PrintBytes(char* x, byte* data, int length)
{
    using namespace std;
    cout << x << endl;
    for (int i = 0; i < length; i++)
    {
        std::cout << "0x" << std::setbase(16) << std::setw(2) << std::setfill('0');
        std::cout << static_cast<unsigned int>(data[i]) << " ";
    }
    std::cout << std::dec;
    cout << endl;
}

byte* set = new byte[SET_SIZE*input_size];
for (int i = 0; i < SET_SIZE; i++)
{
    sprintf((char*)&set[i*input_size], "M%06d", i+1);
    PrintBytes("sub-array", &set[i*input_size], input_size);  // call adjusted to match PrintBytes and moved inside the loop
}
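For reference, a minimal self-contained sketch of the same pattern (one contiguous allocation plus a pointer to the start of each sub-array) is shown below. It is not a fix for the release-build issue, and SET_SIZE/input_size are given placeholder values only so that it compiles:
#include <cstdio>
#include <vector>

int main()
{
    const int SET_SIZE = 4;    // placeholder values for illustration
    const int input_size = 8;

    // One contiguous allocation holding all sub-arrays back to back.
    std::vector<unsigned char> set(SET_SIZE * input_size);

    // Pointers to the start of each sub-array, usable individually.
    std::vector<unsigned char*> rows(SET_SIZE);
    for (int i = 0; i < SET_SIZE; i++) {
        rows[i] = set.data() + i * input_size;
        std::snprintf(reinterpret_cast<char*>(rows[i]), input_size, "M%06d", i + 1);
    }

    // The whole block can still be sent in one shot via set.data() / set.size().
    for (int i = 0; i < SET_SIZE; i++)
        std::printf("%s\n", reinterpret_cast<char*>(rows[i]));
}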

setitimer on linux rounding up?

When I set a short timeout with setitimer and then query the set value (with getitimer or another setitimer) on a Linux 2.6.26 system (Debian 5.0.5), I get back a value higher than I set:
#include <sys/time.h>
#include <iostream>

int main() {
    struct itimerval wanted, got;
    wanted.it_value.tv_sec = 0;
    wanted.it_value.tv_usec = 7000;
    wanted.it_interval.tv_sec = 0;
    wanted.it_interval.tv_usec = 0;
    setitimer(ITIMER_VIRTUAL, &wanted, NULL);
    getitimer(ITIMER_VIRTUAL, &got);
    std::cerr << "we said: " << wanted.it_value.tv_usec << "\n"
              << "linux set: " << got.it_value.tv_usec << std::endl;
    return 0;
}
returns:
we said: 7000
linux set: 12000
This is problematic, since we use the reported remaining times after some computations, and those are far too large as well.
Is this a known problem? (googling did not work.) Does anyone have a good workaround?
In the POSIX documentation of the setitimer function there is a note
Implementations may place limitations on the granularity of timer values. For each interval timer, if the requested timer value requires a finer granularity than the implementation supports, the actual timer value shall be rounded up to the next supported value
The granularity on your system seems to be coarser than 1000 usec (apparently about 6000 usec), and the timer value is rounded up accordingly. The timer granularity is the problem if you need that kind of precision.
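As a quick check of what the kernel actually offers, one option (a sketch, not from the original answer; older glibc may need -lrt) is to query clock_getres for the process CPU-time clock. A POSIX timer created on that clock with timer_create can be finer-grained than the tick-based setitimer, depending on the kernel configuration:
#include <time.h>
#include <iostream>

int main() {
    timespec res;
    if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) == 0) {
        std::cout << "process CPU clock resolution: "
                  << res.tv_sec << " s " << res.tv_nsec << " ns" << std::endl;
    }
    // setitimer(ITIMER_VIRTUAL, ...) on this kernel is still rounded to the
    // tick-based granularity described above; a timer_create()-based POSIX
    // timer on CLOCK_PROCESS_CPUTIME_ID may offer finer resolution.
    return 0;
}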
