Benchmarking Disk Performance (Windows and Linux on Xen) - linux

I am kind of new to the world of virtualization...I am doing some tests using different tools (iometer, fio, dd tool, and bonnie++). The idea is to benchmark the disk performance for different operating systems in a virtualized environment and for different types of virtualization (Paravirt. and Full-virt.).
The results out of those tests for Windows (XP, 7 and 8) were not as I expected with almost all tools, since I got a relatively high performance results without installing the paravirt. drivers for Windows, and what was more surprisingly that after installing the paravirt. drivers the performance decreased.
Samples of my tests using fio tool:
Writing a sequential file the with size of 16 GB and with block size of 512 KB
Windows 7 (Full-Virt.): 87.2 MB/s average aggregate bandwidth
Ubuntu (Paravirt.): 72.9 MB/s average aggregate bandwidth
Any ideas about what is going on here (I am using openSUSE as an OS in case it matters) !!
Thanks

I'd downvote this if I could. Your core issue is you need to accurately identify the IO profile of the applications you are actually running. Writing 16 GB of sequential data is a very atypical workload.
Once you have a realistic understanding of the IO workload, you can then configure the testing tools to match.

Related

Why does some nodeJS tasks makes Ubuntu freeze and not Windows?

I have been facing an intriguing problem lastly.
I am working on a project with a pretty heavy front in Angular JS with a hundred of Jest tests. I have 16 Go of ram but the project is so heavy that sometimes it fills up completely the ram and often the computer cannot handle the project running plus a yarn test at the same time (which takes up to 3 to 4 Go of ram) or a cypress workflow test without big latency problems.
To avoid big freezes (up to several minutes) and crashes, I increased the swap to 16 Go.
That being said, for various reasons I had to work on the project on Windows 10 and faced none of these problems.
Everything runs smoothly, the graphical interface doesn't lag even with screen sharing even-though the ram is also completely filled up and the CPU at 100%.
I am even able to run 20 yarn test at the same time without much lag which seems completely impossible on Linux even with the increased swap.
I've seen that windows use ram compression by default and not linux but I only had up to 549 Mo of compressed ram during my comparisons.
I firstly though that it could be a problem with gnome which is known to be heavy and sometimes buggy but I also tested it with KDE and have the same results.
I also heard that windows allocate special resources to the graphical environment where linux may treat it like any other process but that alone cannot explain all the problems because the whole computer freezes on linux and not in windows.
So I'm starting to wonder if there is something about the memory or process management that windows do significantly better than linux.
My config :
Computer model : Dell XPS-15-7590
Processor : Intel core i7 9750H, 2,6 GHz, 4,5 GHz turbo max (6 cores, 12 threads)
RAM : 16 Go
Graphic card : GTX 1650M
Screen : 4K 16:9
SSD : NVME 512 Go
I was facing the same issue on Ubuntu 22.04 with 16GB RAM and Intel i5-12400 Processor
My solution was to limit the number of max workers on jest config
"maxWorkers": 4

How to utilize the High Performance cores on Apple Silicon

I have developed a macOS app which is heavily relying on multithreading (a call center simulator). It runs fine on my iMac 2019 and fills up all cores nicely. In my test scenario it simulates app. 1.4 mio. telephone calls in total in 100 iterations, each iteration as a dispatch item on a parallel dispatch queue.
Now I have bought a new Mac mini with M1 Apple Silicon and I was eager to see how the performance develops on that test machine. Well, it’s not bad but not as good as I expected:
System
Duration
iMac 2019, Intel 6-core i5, 3.0 GHz, Catalina macOS 10.15.7
19.95 s
Mac mini, M1 8-core, Big Sur macOS 11.2, Rosetta2
26.85 s
Mac mini, M1 8-core, Big Sur macOS 11.2, native ARM
17.07 s
Investigating a little bit further I noticed that at the start of the simulation all 8 cores of the M1 Mac are filled up properly but after a few seconds only the 4 high efficiency cores are used any more.
I have read the Apple docs „Optimize for Apple Silicon with performance and efficiency cores“ and double checked that the dispatch queue for the iterations is set up properly:
let simQueue = DispatchQueue.global(qos: .userInitiated)
But no success. After a few seconds of running the high performance cores are obviously not utilized any more. I even tried to set up the queue with qos set to .userInteracive up that didn’t help either. I also flagged the dispatch items with proper qos but that didn’t change anything. It looks to me that other apps (e.g. XCode) do utilize the high performance cores even for a longer time.
Does anybody know how to force a M1 Mac to utilize the high performance cores?
"M1 8 core" is really "M1 4 performance + 4 power saving cores". I expect it to have be a bit more performance than an Intel 6 core, but not much. Exactly has you see, 15% faster than six Intel cores or about as fast as 7 Intel cores would be. The current M1 chips are low end processors. "A bit better than Intel six cores" is quite good.
Your code must be running on the performance cores, otherwise there would be no chance at all to come close to the Intel performance. In that graph, nothing tells you which cores are used.
What happens most likely is that all cores start running, each trying to do one eighth of the work, and after about 8 seconds the performance cores have their work done. Then the power saving cores move their work to the performance cores. And you are just misinterpreting the image as only low performance cores doing the work.
I would guess that Apple has put a preference on using efficiency cores over performance for many reasons. Battery life being one, and most likely thermal reasons as well. This is the big question mark with a SoC that originally was designed for smartphones and tablets. MacOS is a much heavier OS then IOS or iPad OS. Apple most likely felt that performance cores were only needed in the cases where maximum throughput was needed. No doubt, I think some including myself with a M1 Mac Mini would like a way to adjust this balance between efficiency and performance cores. Personally overall, I would prefer all cores be capable of switching between efficiency and performance such as in Intel's Speed shift technology. This may come along with the M1's advancements in terms of Mac Pro models and other Pro models.

openmp: performance decreases with multiple threads on my desktop, but the opposite over my server

I believe I read pretty much about StackOverflow threads regarding decreasing performance when increasing threads number with OpenMP. Mostly they were due to false sharing. My situation is quite different because my two machines are showing the opposite result.
I'm running STREAM benchmark, and information about my machines is written below:
Intel Xeon Gold-6148. 20 cores (40 threads), 2.4 GHz, 27.5MB LLC
Intel Core i5-9400. 6 cores (6 threads), 2.9 GHz, 9MB LLC
Memory information is similar between the two machines so I will omit it.
I ran the benchmark quite a lot of times and checked the variance between runs is small enough. The result is quite interesting.
Gold-6148(a.k.a server) gets enhanced results when increasing threads number with the OMP_NUM_THREADS option. However, the result of i5-9400(a.k.a desktop) decreases with multiple threads number.
I set the STREAM_ARRAY_SIZE with 20m, and double-checked with various sizes, so it won't affect the result.
Also, I doubted maybe because of any difference over glibc/gomp library between the two, but no difference.
Any idea why this is happening? I just absentmindedly watching the chart below again and again...
i5-9400 result
gold-6148 result
Memory information is similar between the two machines so I will omit it.
You cannot simply omit the most important information when talking about the STREAM benchmark. Xeon Gold 6148 has six DDR4-2666 memory channels split over two separate memory controllers while i5-9400 (assuming i5-9700 is a typo since i7-9700 is an octo-core i7 and not a hexa-core i5 CPU) has only two DDR4-2666 memory channels on a single memory controller. Therefore, 6148 with memory modules installed on all six channels is capable of delivering 3x the memory bandwidth of i5-9400 with the same type of memory modules on both channels. It can also handle more simultaneous memory requests and therefore provides better memory utilisation with more than one thread. Thus, the actual memory configuration is quite important.
Interpreting the STREAM results requires deep understanding of the underlying CPU architecture. There is a nice article by Georg Hager on that topic.

what could be the reason made a program running on two compute with a big different ipc(instrument per second)?

OS: Linux 4.4.198-1.el7.elrepo.x86_64
CPU: Intel(R) Xeon(R) Gold 6148 CPU # 2.40GHz * 4
MEM: 376 GB
I have a program(do some LSTM model inference based on TensorFlow 1.14),it runs on two machines with the same hardware, one got a bad performance while the other got a better one(about 10x diff).
I used intel pqos tool diagnosed that two processes and got a big different IPC number (one is 0.07 while the other one is 2.5), both processes are binding on some specified CPU core, and each machine payload does not heavy. This problem appears two weeks ago, before that this bad machine works as well, history command shows nothing configuration changed.
I checked many environ information, including the kernel, fs, process scheduler, io scheduler, program, and libraries md5, they are the same, the bad computer iLo show none error, the program mainly burns the CPU.
I used sysbench to test two machines(cpu & memory) which show about 25% performance difference, the bad machine does prime calculation is slower. Could it be some hardware problem?
I don't know what is the root cause resulted in the difference in IPC (equaled to performance), How can I dig into the situation?

Performance check between shared cluster and laptop with Intel(R)Core™ i7

I am not really familiar with shared clusters, but I am assuming performance should not differ much in terms of completing a single task when compared with a laptop processor. I have a C++ code which I ran on my laptop with Intel(R)Core™ i7-4558U 2.80 GHz CPU and 16.0 GB RAM, with the operating system of 64 bit Windows 10. On the other hand, I have results of the same code from a publication which belong to the tests conducted on a shared cluster with Intel Xeon 2.3 GHz CPU and 4 GB memory limit with Linux operating system. The program uses CPLEX as the solver: my laptop has IBM Cplex 12.7 whereas previous runs used IBM CPLEX 12.4 (Cplex, 2012). My results seem to take 300 times more than the reported results of the previous run. Does this much difference make sense? If so what could be the driver behind it?
This could be attributed to performance variability (see, for example, section 5 of the MIPLIB 2010 paper here). In a nutshell, minor differences in problem formulation (e.g., order of constraints, input format, etc.), or running on different platforms, can have a great effect on the time to solve. With CPLEX 12.7, you can use the interactive to help you evaluate variability.

Resources