What could cause a program running on two identical computers to show a big difference in IPC (instructions per cycle)? - linux

OS: Linux 4.4.198-1.el7.elrepo.x86_64
CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz * 4
MEM: 376 GB
I have a program (it does some LSTM model inference based on TensorFlow 1.14) that runs on two machines with the same hardware. One gets bad performance while the other performs much better (about a 10x difference).
I used the Intel pqos tool to diagnose the two processes and got very different IPC numbers (one is 0.07 while the other is 2.5). Both processes are bound to specific CPU cores, and neither machine is heavily loaded. The problem appeared two weeks ago; before that, the bad machine worked fine, and the shell history shows no configuration changes.
I checked a lot of environment information, including the kernel, filesystem, process scheduler, I/O scheduler, the program itself, and the MD5 checksums of the libraries; they are all the same. The bad machine's iLO shows no errors, and the program mainly burns CPU.
I used sysbench to test both machines (CPU & memory), which showed about a 25% performance difference: the bad machine is slower at the prime-number calculation. Could it be a hardware problem?
I don't know what the root cause of the difference in IPC (which tracks the performance difference) is. How can I dig into the situation?
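One way to dig further, independent of pqos, is to count instructions and cycles around the hot region yourself on both machines and compare the resulting IPC. Below is a minimal sketch using the kernel's perf_event_open interface, following the pattern from the perf_event_open(2) man page; the busy loop is just a stand-in for the real inference workload:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: count for this process on whatever CPU it runs on */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (cyc < 0 || ins < 0) { perror("perf_event_open"); return 1; }

    ioctl(cyc, PERF_EVENT_IOC_RESET, 0);  ioctl(ins, PERF_EVENT_IOC_RESET, 0);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0); ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                        /* stand-in for the real workload */
    for (long i = 0; i < 100000000L; i++) x += 1e-9;

    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0); ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0, instructions = 0;
    read(cyc, &cycles, sizeof(cycles));
    read(ins, &instructions, sizeof(instructions));
    printf("instructions=%llu cycles=%llu IPC=%.2f\n",
           (unsigned long long)instructions, (unsigned long long)cycles,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}

If the low-IPC machine retires the same instruction count but burns far more cycles on the same code, the next step is to look at where those cycles go (stalls, clock frequency, thermal throttling, memory) rather than at the software stack.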

Related

OpenMP: performance decreases with multiple threads on my desktop, but the opposite on my server

I believe I've read most of the Stack Overflow threads about performance decreasing as the number of OpenMP threads increases. Mostly the cause was false sharing. My situation is quite different because my two machines show opposite results.
I'm running the STREAM benchmark, and information about my machines is below:
Intel Xeon Gold-6148. 20 cores (40 threads), 2.4 GHz, 27.5MB LLC
Intel Core i5-9400. 6 cores (6 threads), 2.9 GHz, 9MB LLC
Memory information is similar between the two machines so I will omit it.
I ran the benchmark many times and checked that the variance between runs is small enough. The result is quite interesting.
The Gold-6148 (a.k.a. the server) gets better results as the number of threads increases via the OMP_NUM_THREADS option. However, the results on the i5-9400 (a.k.a. the desktop) get worse as the number of threads increases.
I set STREAM_ARRAY_SIZE to 20m and double-checked with various sizes, so that does not affect the result.
I also suspected a difference in the glibc/gomp libraries between the two machines, but there is no difference.
Any idea why this is happening? I've just been absentmindedly staring at the charts below again and again...
i5-9400 result
gold-6148 result
Memory information is similar between the two machines so I will omit it.
You cannot simply omit the most important information when talking about the STREAM benchmark. The Xeon Gold 6148 has six DDR4-2666 memory channels split over two separate memory controllers, while the i5-9400 (assuming i5-9700 is a typo, since the i7-9700 is an octo-core i7 and not a hexa-core i5 CPU) has only two DDR4-2666 memory channels on a single memory controller. Therefore, a 6148 with memory modules installed on all six channels can deliver 3x the memory bandwidth of an i5-9400 with the same type of memory modules on both channels. It can also handle more simultaneous memory requests and therefore provides better memory utilisation with more than one thread. Thus, the actual memory configuration is quite important.
Interpreting the STREAM results requires deep understanding of the underlying CPU architecture. There is a nice article by Georg Hager on that topic.
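To see the bandwidth ceiling directly, a stripped-down triad-style loop (a stand-in for STREAM, not the benchmark itself) is enough; on the six-channel Gold 6148 the reported GB/s should keep climbing with OMP_NUM_THREADS, while on the two-channel i5-9400 it typically saturates after a couple of threads:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000L   /* same order as STREAM_ARRAY_SIZE=20m, ~160 MB per array */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* first-touch initialisation so pages are spread across the NUMA nodes */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    /* triad: two reads and one write per element, i.e. 24 bytes of traffic */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    printf("threads=%d  triad ~ %.1f GB/s  (a[1]=%g)\n",
           omp_get_max_threads(), 3.0 * 8.0 * N / (t1 - t0) / 1e9, a[1]);
    free(a); free(b); free(c);
    return 0;
}

Build with something like gcc -O2 -fopenmp and run with different OMP_NUM_THREADS values to reproduce the shape of the two charts.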

Performance check between a shared cluster and a laptop with an Intel(R) Core™ i7

I am not really familiar with shared clusters, but I assume performance should not differ much for completing a single task compared with a laptop processor. I have a C++ code that I ran on my laptop with an Intel(R) Core™ i7-4558U 2.80 GHz CPU and 16.0 GB RAM, on 64-bit Windows 10. On the other hand, I have results for the same code from a publication, from tests conducted on a shared cluster with an Intel Xeon 2.3 GHz CPU and a 4 GB memory limit, running Linux. The program uses CPLEX as the solver: my laptop has IBM CPLEX 12.7 whereas the previous runs used IBM CPLEX 12.4 (CPLEX, 2012). My runs seem to take 300 times longer than the reported results of the previous run. Does this much difference make sense? If so, what could be the driver behind it?
This could be attributed to performance variability (see, for example, section 5 of the MIPLIB 2010 paper here). In a nutshell, minor differences in problem formulation (e.g., order of constraints, input format, etc.), or running on different platforms, can have a great effect on the time to solve. With CPLEX 12.7, you can use the interactive optimizer to help you evaluate variability.

Why does the performance improvement of a CPU-intensive task differ between Windows and Linux when using multiple processes?

Here is my situation:
My company needs to run tests on tons of test samples. If we start a single process on a Windows PC, the test can last for hours, even days, so we try to split the test set and start one process per slice on a multi-core Linux server.
We expected a linear performance improvement from the server solution, but in fact we only observe a 2-3x improvement when the test task is split across 10-20 processes.
I tried several things to locate the problem:
disable hyper-threading;
use max-performance power policy
use taskset to pin each process to a different core
but no luck, the problem remains.
Why does this happen? What is the root cause: our code, the OS, or the hardware?
here is the info of my pc and server:
PC: OS: Win10; CPU: i5-4570, 2 physical cores; mem: 16 GB
server: OS: Red Hat 6.5; CPU: E5-2630 v3, 2 physical CPUs; mem: 32 GB
Edit:
About the CPU: the server has 2 processors, and each of them has 8 physical cores. Check this link for more information.
About my test: it's handwriting-recognition related (that's why it's a CPU-intensive task).
About I/O: the performance checkpoints do not involve much I/O, if logging doesn't count.
We expected a linear performance improvement from the server solution, but in fact we only observe a 2-3x improvement when the test task is split across 10-20 processes.
This seems very logical considering there are only 2 cores on the system. Starting 10-20 processes will only add some overhead due to task switching.
Also, I/O could be a bottleneck here too, if multiple processes are reading from disk at the same time.
Ideally, the number of running threads should not exceed 2 x the number of cores.
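As a side note on the pinning experiment mentioned in the question: it is worth confirming that the affinity actually took effect. A minimal sketch (not the asker's code) of pinning the current process to one core from inside the program, which is essentially what taskset -c does:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int core = (argc > 1) ? atoi(argv[1]) : 0;  /* core index passed on the command line */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = current process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", core);
    /* ... run the CPU-bound test slice here ... */
    return 0;
}

With per-worker pinning in place (and hyper-threading accounted for), throughput on a 16-core server should scale well beyond 2-3x unless something else, such as shared I/O or a shared lock, is the bottleneck.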

Experienced strange rdtsc behavior comparing physical hardware and KVM-based VMs

I have the following problem. I run several stress tests on a Linux machine:
$ uname -a
Linux debian 3.14-2-686-pae #1 SMP Debian 3.14.15-2 (2014-08-09) i686 GNU/Linux
It's an Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, 8 GB RAM, 300 GB HDD.
These tests are not I/O intensive; I mostly compute double arithmetic in the following way:
uint64_t start, stop, diff;   /* TSC tick counts (requires <stdint.h>) */

start = rdtsc();
do_arithmetic();              /* the double-arithmetic workload under test */
stop  = rdtsc();
diff  = stop - start;         /* elapsed TSC ticks for this run */
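The rdtsc() helper itself isn't shown in the question; on GCC/Clang for x86 it is commonly just a thin wrapper around the compiler intrinsic, for example:

#include <stdint.h>
#include <x86intrin.h>

/* one common way to implement the rdtsc() helper used above (GCC/Clang, x86) */
static inline uint64_t rdtsc(void) {
    return __rdtsc();
}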
I repeat these tests many times, running my benchmarking application on a physical machine or on a KVM based VM:
qemu-system-i386 disk.img -m 2000 -device virtio-net-pci,netdev=net1,mac=52:54:00:12:34:03 -netdev type=tap,id=net1,ifname=tap0,script=no,downscript=no -cpu host,+vmx -enable-kvm -nographic
I collect statistics (i.e., the diffs) over many trials. On the physical machine (not loaded), the distribution of processing delays looks like a very narrow lognormal.
When I repeat the experiment on the virtual machine (with neither the physical nor the virtual machine loaded), the lognormal distribution is still there (a little bit wider), but I also collect a few points with completion times much shorter (about two times shorter) than the absolute minimum measured on the physical machine! (Note that the completion-time distribution on the physical machine is very narrow and lies close to the minimum value.) There are also some points with completion times much longer than the average completion time on the hardware machine.
I guess my rdtsc benchmarking method is not very accurate in the VM environment. Can you suggest a way to improve my benchmarking setup so that it provides reliable (comparable) statistics between the physical and the KVM-based virtual environment? At least something that won't show the VM being 2x faster than a hardware PC in a small number of cases.
Thanks in advance for any suggestions or comments on this subject.
Best regards
Maybe you can try clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts); see man clock_gettime for more information.
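For example, a minimal sketch of that suggestion, timing a stand-in workload with per-thread CPU time instead of raw TSC ticks:

#include <stdio.h>
#include <time.h>

static double elapsed_ns(const struct timespec *a, const struct timespec *b) {
    return (b->tv_sec - a->tv_sec) * 1e9 + (b->tv_nsec - a->tv_nsec);
}

int main(void) {
    struct timespec start, stop;
    volatile double x = 1.0;

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    for (long i = 0; i < 10000000L; i++)       /* stand-in for do_arithmetic() */
        x = x * 1.0000001 + 0.5;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop);

    printf("CPU time for this thread: %.0f ns\n", elapsed_ns(&start, &stop));
    return 0;
}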
It turns out it's not a problem with rdtsc at all. I am running my Intel i5 at a fixed, limited frequency through the acpi_cpufreq driver with the userspace governor. Even though the CPU speed is fixed at, say, 2.4 GHz (out of 3.3 GHz), some calculations are performed at the maximum speed of 3.3 GHz. Roughly speaking, I also encountered a very small number of such cases on the physical machine, about 1 per 10,000. On KVM, this behavior is much more frequent, say a few percent. I will investigate this further.
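One quick way to watch for that while the benchmark runs is to poll the cpufreq sysfs entries; a minimal sketch using the standard cpufreq paths (adjust the CPU index as needed):

#include <stdio.h>

static void dump(const char *path) {
    char buf[128];
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    if (fgets(buf, sizeof(buf), f))
        printf("%s: %s", path, buf);
    fclose(f);
}

int main(void) {
    dump("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    dump("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");   /* in kHz */
    return 0;
}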

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales smoothly with the number of CPU cores: I can run it with 1, 2, 3, etc. threads and get linear speedup, up to about 5.5x on a 6-core CPU on an Ubuntu Linux box.
I had an opportunity to run the program on a very high-end Sunfire X4450 with four quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads.
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads and they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads run about 1.7x faster than 1, but 3, 4, 8, 10, or 16 threads all run at just 1.9x! I can see all the threads are running (not stalled or sleeping); they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently and simultaneously. They all ran at full speed. There really are 16 cores, they really do run at full speed, and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is whether there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
My program does not access the disk or network. It's CPU limited. Its speed scales linearly on a single-CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads; 6 threads is effectively a 6x speedup.
My program never runs faster than a 2x speedup on this 16-core Sunfire Xeon box, for any number of threads from 2-16.
Running 16 copies of my program single-threaded runs perfectly, all 16 running at once at full speed.
top shows 1600% of CPU allocated, and /proc/cpuinfo shows all 16 cores running at the full 2.9GHz speed (not the low-frequency idle speed of 1.6GHz).
There's 48GB of RAM free; it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Red Hat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64-bit SMP kernels in both tests.
It's unlikely that the motherboard would peak at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyper-threading turned on on the new machine? (And how does that compare to the old machine?) You're not, by chance, running in a virtualized environment?
Overall, your evidence points to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware or something is wrong with the software. Test one by changing the other, and you'll narrow down the possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user account you're running in has some RH-default or admin-set resource limits in place.
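As a quick first pass on that suggestion, a small program can dump the limits the current process actually inherited (roughly what ulimit -a reports in the shell); the particular limits printed here are just my pick for illustration:

#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int which) {
    struct rlimit rl;
    if (getrlimit(which, &rl) == 0)
        printf("%-12s soft=%-20llu hard=%llu\n", name,
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
}

int main(void) {
    show("RLIMIT_CPU",   RLIMIT_CPU);     /* CPU seconds per process */
    show("RLIMIT_NPROC", RLIMIT_NPROC);   /* processes/threads per user */
    show("RLIMIT_AS",    RLIMIT_AS);      /* virtual memory size */
    return 0;
}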
When you see this kind of odd scaling behaviour, especially when problems appear with multiple threads but not with multiple processes, one thing to start looking at is the impact of lock contention and other synchronisation primitives, which can cause threads running on different processors to wait for each other, potentially forcing multiple cores to flush their caches to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single CPU case likely isn't needing to hit main memory for locking operations at all - everything is likely being handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if scaling up to 4 threads improves when the process affinity is set to lock it to a single physical CPU. Even better would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html
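To make the lock-contention point concrete, here is a deliberately pathological toy (not the OP's program): every thread does nothing but take one shared lock, so extra threads mostly add cache-line ping-pong between cores and, on a multi-socket box, between sockets, and wall time typically gets worse rather than better as threads are added:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 2000000L

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        /* every increment drags the lock's cache line to this core/socket */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 4;   /* thread count from the command line */
    if (n < 1) n = 1;
    if (n > 64) n = 64;
    pthread_t t[64];
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    printf("%d threads, %ld total increments, %.2f s\n", n, counter, secs);
    return 0;
}

Build with gcc -O2 -pthread, pass the thread count as the first argument, and compare timings for 1, 2, 4, 8, and 16 threads on the single-socket and four-socket boxes.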
