zLib transparent write mode "wT" performance degradation - linux

I would expect zlib's transparent mode (gzprintf()) to be as fast as regular fprintf(). Instead, I found that gzprintf() with "wT" is about 2.5x slower than fprintf(). Is there any workaround for this performance issue?
I'm using libz.so.1.2.8 on Linux (Fedora 22, kernel 4.0.5, Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz) to provide an output file compression option for my event trace collector. To keep legacy compatibility I need the transparent file format writing mode.
As far as I can see, the "T" option in gzopen() allows writing files with no compression and no gzip header record.
The problem is performance: the transparent mode is ~2.5x slower than plain fprintf().
Here is a quick test result (values are in TSC ticks):
zLib]$ ./zlib_transparent
Performance fprintf vs gzprintf (transparent):
fprintf 22883026324
zLib transp 62305122876
ratio 2.72277
The source for this test:
#include <stdio.h>
#include <zlib.h>
#include <iostream>
#include <sstream>
#include <iomanip>
#define NUMITERATIONS 10000000
static double buffer[NUMITERATIONS];
// Read the CPU time-stamp counter (x86 only).
static __inline__ unsigned long long rdtsc(void){
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
// Write all records with plain stdio.
long long test_fprintf(double *buffer){
    long long t = rdtsc();
    double tmp = 0;
    FILE *file = fopen("fprintf_file.txt", "w");
    for (int i = 0; i < NUMITERATIONS; ++i) {
        fprintf(file, "[%f:%f]\n", buffer[i], buffer[i] - tmp);
        tmp = buffer[i] + i;
    }
    fclose(file);
    return rdtsc() - t;
}
// Write the same records through zlib in transparent ("wT") mode.
long long test_zlib_transparent(double *buffer){
    long long t = rdtsc();
#ifdef USE_ZLIB
    double tmp = 0;
    gzFile file = gzopen("zlib_file.txt.gz", "wT");
    for (int i = 0; i < NUMITERATIONS; ++i) {
        gzprintf(file, "[%f:%f]\n", buffer[i], buffer[i] - tmp);
        tmp = buffer[i] + i;
    }
    gzclose(file);
#endif
    return rdtsc() - t;
}
int main(){
    std::cout << "Performance fprintf vs gzprintf (transparent):" << std::endl;
    long long dPrint = test_fprintf(buffer);
    std::cout << " fprintf " << dPrint << std::endl;
    long long dStream = test_zlib_transparent(buffer);
    std::cout << "zLib transp " << dStream << std::endl;
    std::cout << "ratio " << double(dStream)/double(dPrint) << std::endl;
    return 0;
}
g++ -g -O3 -DUSE_ZLIB=1 -DUSE_FPRINTF=1 zlib_transparent.cpp -o zlib_transparent -lz
Thank you

My bad. (I wrote gzprintf().)
write() is being called too often. You will get approximately the same performance as zlib if you replace fprintf() with snprintf() and write().
I will improve this in the next version of zlib. If you would like to try it, apply this diff. I don't know how it will perform on Linux, but on Mac OS X, gzprintf() in transparent mode is now 10% faster than fprintf(). (Wasn't expecting that.)
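The remark that write() is being called too often can be illustrated with a small sketch. It formats each record into a large in-memory buffer and calls write() only when the buffer is nearly full. The class name BufferedWriter, the 64 KiB buffer, and the 512-byte record limit are assumptions made for illustration; this is not the diff mentioned above, and it uses POSIX open()/write() directly rather than anything from zlib.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: gathers formatted records in memory so that the
// number of write() system calls stays small.
class BufferedWriter {
public:
    explicit BufferedWriter(const char *path, std::size_t capacity = 1 << 16)
        : fd_(::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)),
          buf_(capacity), used_(0), writes_(0) {
        assert(capacity > kMaxRecord);  // a record must always fit
    }
    ~BufferedWriter() { close(); }
    BufferedWriter(const BufferedWriter &) = delete;
    BufferedWriter &operator=(const BufferedWriter &) = delete;

    bool ok() const { return fd_ >= 0; }
    std::size_t write_calls() const { return writes_; }

    // printf-style append; flushes first if a full-size record might not fit.
    // Records longer than the remaining room are truncated (sketch only).
    void appendf(const char *fmt, ...) {
        if (buf_.size() - used_ < kMaxRecord) flush();
        std::size_t room = buf_.size() - used_;
        va_list ap;
        va_start(ap, fmt);
        int n = std::vsnprintf(buf_.data() + used_, room, fmt, ap);
        va_end(ap);
        if (n > 0) used_ += std::min<std::size_t>(n, room - 1);
    }

    // Hand the buffered bytes to the OS, retrying on short writes.
    void flush() {
        std::size_t off = 0;
        while (off < used_) {
            ssize_t w = ::write(fd_, buf_.data() + off, used_ - off);
            if (w <= 0) break;  // sketch: give up on error
            off += static_cast<std::size_t>(w);
            ++writes_;
        }
        used_ = 0;
    }

    void close() {
        if (fd_ >= 0) {
            flush();
            ::close(fd_);
            fd_ = -1;
        }
    }

private:
    static constexpr std::size_t kMaxRecord = 512;  // assumed upper bound
    int fd_;
    std::vector<char> buf_;
    std::size_t used_;
    std::size_t writes_;
};
```

In the benchmark from the question, a loop calling appendf("[%f:%f]\n", ...) on such an object would issue one write() per 64 KiB of output instead of one per small chunk; whether that matches or beats fprintf() on a given system would have to be measured.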


Why is the run time shorter when I use a lock in a C++ program?

I am practicing multithreaded programming in C++. When I use std::lock_guard in otherwise identical code, its run time becomes shorter than before. That's surprising; why?
The lock version:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>
using namespace std;
class test {
    std::mutex m;
    int a;
public:
    test() :a(0) {}
    void add() {
        // The lock is taken once, before the loop, and held until add() returns.
        std::lock_guard<std::mutex> guard(m);
        for(int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};
int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
time = 5.71852s
the no lock version is:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>
using namespace std;
class test {
    std::mutex m;
    int a;
public:
    test() :a(0) {}
    void add() {
        // std::lock_guard<std::mutex> guard(m);
        // Without the lock, both threads increment `a` concurrently (a data race).
        for(int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};
int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
time = 10.765s
I'm using Ubuntu 18.04; the g++ version is:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
In my opinion, the lock is an extra operation, so it should of course cost more time.
Maybe someone can help me? Thanks.
Modifying a variable from multiple threads without synchronization causes undefined behaviour. This means the compiler and the processor are free to do whatever they want in this case (for example, removing the loop, or not reloading the variable from memory, since it is not supposed to be modified by another thread in the first place). As a result, studying the performance of this case is not really relevant.
Assuming the compiler does not perform any (allowed) advanced optimizations, the no-lock program contains a race condition. It is most likely slower because of a cache-line bouncing effect: multiple cores compete for the same locked cache line, and moving it from one core to another is very slow compared to incrementing the variable in the L1 cache (this is most likely the overhead you see). Indeed, on standard x86-64 platforms like mainstream Intel processors, moving a locked cache line from one core to another means invalidating the copies held in the other cores' L1/L2 caches and fetching it from the L3 cache, which is much slower than the L1 (lower throughput and much higher latency). In the lock version, by contrast, the lock_guard is taken once before the loop, so each thread runs its whole loop while the other waits, and the cache line does not bounce. Note that this behaviour depends on the target platform (mainly the processor, besides compiler optimizations), but most platforms work similarly. For more information, please read this and that about cache-coherence protocols.
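For comparison, here is a minimal sketch of two race-free ways to write the counter from the question. The function names count_shared_atomic and count_thread_local are made up for this illustration. The first keeps one shared counter but makes every increment atomic, so it is well defined yet still subject to cache-line bouncing; the second lets each thread count in a private variable and publish its total once, so the shared cache line is touched only once per thread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Variant 1: one shared counter, every increment is an atomic
// read-modify-write. No data race, but both cores still fight over
// the cache line that holds `counter`.
long count_shared_atomic(long per_thread) {
    assert(per_thread >= 0);
    std::atomic<long> counter{0};
    auto work = [&counter, per_thread] {
        for (long i = 0; i < per_thread; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter.load();
}

// Variant 2: each thread counts in a private local variable and adds
// its total to the shared counter once at the end.
long count_thread_local(long per_thread) {
    assert(per_thread >= 0);
    std::atomic<long> counter{0};
    auto work = [&counter, per_thread] {
        long local = 0;
        for (long i = 0; i < per_thread; ++i)
            ++local;
        counter.fetch_add(local, std::memory_order_relaxed);
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```

Both functions return 2 * per_thread. Timing them against each other on a given machine is one way to see how much of the cost comes from sharing the cache line rather than from the arithmetic itself.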

Multiple threads running on one core instead of four depending on the OS

I am using Raspbian on a Raspberry Pi 3.
I need to divide my code into a few blocks (2 or 4) and assign a thread per block to speed up calculations.
At the moment, I am testing with simple loops (see the attached code) on one thread and then on 4 threads. The execution time on 4 threads is always 4 times longer, so it looks like these 4 threads are scheduled to run on the same CPU.
How can I assign each thread to run on a different CPU? Even 2 threads on 2 CPUs would make a big difference to me.
I even tried g++ 6, with no improvement. Using the OpenMP parallel library in the code with "#pragma omp for" still runs on one CPU.
I tried running this code on Fedora Linux x86 and saw the same behavior, but on Windows 8.1 with VS2015 I got different results: the time was the same on one thread and on 4 threads, so it was running on different CPUs.
Do you have any suggestions?
Thank you.
#include <iostream>
//#include <arm_neon.h>
#include <ctime>
#include <thread>
#include <mutex>
#include <vector>
using namespace std;
// Dummy workload: a simple counting loop.
float simd_dot0() {
    unsigned int i;
    unsigned long rezult;
    for (i = 0; i < 0xfffffff; i++) {
        rezult = i;
    }
    return rezult;
}
int main() {
    unsigned num_cpus = std::thread::hardware_concurrency();
    std::mutex iomutex;
    std::vector<std::thread> threads(num_cpus);
    cout << "Start Test 1 CPU" << endl;
    double t_start, t_end, scan_time;
    scan_time = 0;
    t_start = clock();
    simd_dot0();
    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 1 CPU: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 1 CPU" << endl;
    cout << "Start Test 4 CPU" << endl;
    scan_time = 0;
    t_start = clock();
    for (unsigned i = 0; i < 4; ++i) {
        threads[i] = std::thread([&iomutex, i] {
            simd_dot0();
            // Serialize access to std::cout between the worker threads.
            std::lock_guard<std::mutex> iolock(iomutex);
            std::cout << "\nExecution time on CPU: "
                      << i << std::endl;
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 4 CPUs: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 4 CPU" << endl;
    cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
    while (1);
    return 0;
}
Edit :
On Raspberry Pi3 Raspbian I used g++4.9 and 6 with the following flags :
-std=c++11 -ftree-vectorize -Wl--no-as-needed -lpthread -march=armv8-a+crc -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -funsafe-math-optimizations -O3

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I'm getting significant performance increases from increasing the threads from 1 to the amount of cores available; however, the very same code does not see any performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
#include <cstdint>
#include <cstdlib>
// Global generator and distribution, shared by all threads.
std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);
double generate_randn(uint64_t iterations)
{
    // Print "S" when a thread starts
    std::cout << "S";
    double rvalue = 0;
    for (uint64_t i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    return rvalue/iterations;
}
int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();
    // Take the average of the threads' results
    total /= threads;
    std::cout << std::endl;
    std::cout << total << std::endl;
    std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
    return 0;
}
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races" (see N3337). The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable without data races", as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
#include <cstdint>
#include <cstdlib>
static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    // Each task owns its generator and distribution; nothing is shared.
    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);
    double rvalue = 0;
    for (uint64_t i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    return rvalue/iterations;
}
int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    std::random_device make_seed;
    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < threads; i++)
    {
        // Start async tasks, each with its own seed drawn in main
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count/threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();
    // Take the average of the threads' results
    total /= threads;
    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                     std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
    return 0;
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting and links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s) as @zwol posted in his answer, you will see it.
Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

Using thrust with openmp: no substantial speed up obtained

I am interested in porting code I had written, mostly using the Thrust GPU library, to multicore CPUs. Thankfully, the website says that Thrust code can be used with threading environments such as OpenMP / Intel TBB.
I wrote the simple code below, which sorts a large array, to see the speedup on a machine that can support up to 16 OpenMP threads.
The timings obtained on this machine for sorting a random array of size 16 million are
STL : 1.47 s
Thrust (16 threads) : 1.21 s
There seems to be barely any speed-up. I would like to know how to get a substantial speed-up for sorting arrays using OpenMP like I do with GPUs.
The code is below (the file sort.cu). Compilation was performed as follows:
nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
The NVCC version is 5.5
The Thrust library version being used is v1.7.0
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <time.h>
#include "thrust/sort.h"
int main(int argc, char *argv[])
{
    int N = 16000000;
    double* myarr = new double[N];
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
    }
    std::cout << "-------------\n";
    clock_t start,stop;
    // Time the STL sort
    start=clock();
    std::sort(myarr,myarr+N);
    stop=clock();
    std::cout << "Time taken for sorting the array with STL is " << (stop-start)/(double)CLOCKS_PER_SEC;
    // Refill the array with fresh random values
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
        //std::cout << myarr[i] << std::endl;
    }
    // Time the Thrust sort on the same raw pointer
    start=clock();
    thrust::sort(myarr,myarr+N);
    stop=clock();
    std::cout << "------------------\n";
    std::cout << "Time taken for sorting the array with Thrust is " << (stop-start)/(double)CLOCKS_PER_SEC;
    return 0;
}
The device backend refers to the behavior of operations performed on a thrust::device_vector or similar reference. Thrust interprets the array/pointer you are passing it as a host pointer, and performs host-based operations on it, which are not affected by the device backend setting.
There are a variety of ways to fix this issue. If you read the device backend documentation you will find general examples and omp-specific examples. You could even specify a different host backend which should have the desired behavior (OMP usage) with your code, I think.
Once you fix this, you'll get an additional result surprise, perhaps: thrust appears to sort the array quickly, but reports a very long execution time. I believe this is due (on linux, anyway) to the clock() function being affected by the number of OMP threads in use.
The following code/sample run has those issues addressed, and seems to give me a ~3x speedup for 4 threads.
$ cat t592.cu
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <sys/time.h>
#include <time.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
int main(int argc, char *argv[])
{
    int N = 16000000;
    double* myarr = new double[N];
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
    }
    std::cout << "-------------\n";
    timeval t1, t2;
    // Time the STL sort with wall-clock time
    gettimeofday(&t1, NULL);
    std::sort(myarr,myarr+N);
    gettimeofday(&t2, NULL);
    float et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
    std::cout << "Time taken for sorting the array with STL is " << et << std::endl;
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
        //std::cout << myarr[i] << std::endl;
    }
    // Wrap the raw pointer so Thrust dispatches to the device (OMP) backend
    thrust::device_ptr<double> darr = thrust::device_pointer_cast<double>(myarr);
    gettimeofday(&t1, NULL);
    thrust::sort(darr,darr+N);
    gettimeofday(&t2, NULL);
    et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
    std::cout << "------------------\n";
    std::cout << "Time taken for sorting the array with Thrust is " << et << std::endl ;
    return 0;
}
$ nvcc -O2 -o t592 t592.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t592
Time taken for sorting the array with STL is 1.31956
Time taken for sorting the array with Thrust is 0.468176
Your mileage may vary. In particular, you may not see any improvement as you go above 4 threads. There may be a number of factors which prevent an OMP code from scaling beyond a certain number of threads. Sorting generally tends to be a memory-bound algorithm, so you will probably observe an increase until you have saturated the memory subsystem, and then no further increase from additional cores. Depending on your system, it's possible you could be in this situation already, in which case you may not see any improvement from OMP style multithreading.

Multithreading in MSVC is showing no improvement

I am trying to run the following code to test the speedup I can get on my system, and to check that my code is multithreading. Using gcc on Linux, I get a factor of about 7. Using Visual Studio on Windows, I get no improvement. In MSVS 2012, I set /Qpar and /MD ... am I missing something? What am I doing wrong?
#include <iostream>
#include <thread>
#include <chrono>
#include <ctime>
#ifdef WIN32
#include <windows.h>
// Returns the QueryPerformanceCounter reading converted to milliseconds.
double getTime() {
    LARGE_INTEGER freq, val;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&val);
    return 1000*(double)val.QuadPart / (double)freq.QuadPart;
}
#define clock_type double
#else
#define clock_type std::clock_t
#define getTime std::clock
#endif
static const int num_threads = 10;
//This function will be called from a thread
void f()
{
    volatile double d=0;
    for(int n=0; n<10000; ++n)
        for(int m=0; m<10000; ++m)
            d += d*n*m;
}
int main()
{
    clock_type c_start = getTime();
    auto t_start = std::chrono::high_resolution_clock::now();
    std::thread t[num_threads];
    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(f);
    }
    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
    clock_type c_end = getTime();
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "CPU time used: "
              << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC
              << " ms\n";
    std::cout << "Wall clock time passed: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count()
              << " ms\n";
    std::cout << "Acceleration factor: "
              << 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC / std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count() << "\n";
    return 0;
}
The output using MSVS is:
CPU time used: 1003.64 ms
Wall clock time passed: 998 ms
Acceleration factor: 1.00565
In Linux, I get:
CPU time used: 5264.83 ms
Wall clock time passed: 698 ms
Acceleration factor: 7.54274
EDIT 1: increased size of matrix in f() from 1000 to 10000.
EDIT 2: added getTime() function using QueryPerformanceCounter, and included #define's to switch between std::clock() and getTime()
On MSVC, clock returns wall time and is thus not standards compliant.
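Since the meaning of clock() differs between platforms, a portable way to see what is going on is to record both clock() and a std::chrono wall-clock reading around the same threaded work. The sketch below does that; the names Timing, spin and time_threads are invented for this example, and it uses std::chrono::steady_clock as the wall clock.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

struct Timing {
    double cpu_ms;   // difference of two std::clock() readings
    double wall_ms;  // elapsed time according to steady_clock
};

// Busy loop; the volatile accumulator keeps the compiler from removing it.
static void spin(long iterations) {
    volatile double d = 0;
    for (long i = 0; i < iterations; ++i)
        d = d + 1.0;
}

// Run `spin` on num_threads threads and report both measurements.
Timing time_threads(unsigned num_threads, long iterations) {
    assert(num_threads > 0);
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < num_threads; ++i)
        pool.emplace_back(spin, iterations);
    for (auto &t : pool)
        t.join();

    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();

    Timing result;
    result.cpu_ms = 1000.0 * (c1 - c0) / CLOCKS_PER_SEC;
    result.wall_ms =
        std::chrono::duration<double, std::milli>(w1 - w0).count();
    return result;
}
```

Where clock() reports processor time summed over all threads, cpu_ms / wall_ms approaches the number of cores actually in use; where clock() reports wall time, as stated above for MSVC, the ratio stays near 1 regardless of how well the work is parallelized. The wall_ms value is the one to compare when judging speedup.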
