I found some benchmarks of GPU-accelerated vs. CPU-only Spark-based systems that were not performed on the same hardware. Is this fair, since a powerful CPU server could conceivably outperform a GPU server?
For example,
Here, the performance comparison is done on different hardware: an AWS p3.2xlarge (Tesla V100) and an AWS r4.2xlarge
https://youtu.be/tGqEZYUqexY?t=44
Here, the performance comparison is done on different hardware: a Tesla V100 SXM2 and an AWS m5.8xlarge
https://nlp.johnsnowlabs.com/docs/en/CPUvsGPUbenchmark
Another question: isn't it fairer to compare Spark performance on the same hardware? Spark can run on a Tesla V100 server in both modes (pure CPU, or with GPU acceleration).
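For example, a minimal PySpark sketch of what I mean by running both modes on the same V100 node (this assumes the NVIDIA RAPIDS Accelerator plugin, com.nvidia.spark.SQLPlugin, is installed on the cluster; the query and data sizes are only placeholders):

from pyspark.sql import SparkSession

# Assumption: the RAPIDS Accelerator jars are already on the driver/executor classpath.
spark = (SparkSession.builder
         .appName("cpu-vs-gpu-same-hardware")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .getOrCreate())

df = spark.range(0, 100_000_000).selectExpr("id % 1000 AS k", "id AS v")

# Run the same aggregation with the GPU path switched on and then off,
# timing each run separately on the identical hardware.
for gpu_enabled in ("true", "false"):
    spark.conf.set("spark.rapids.sql.enabled", gpu_enabled)
    df.groupBy("k").sum("v").collect()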
Thanks,
I have developed a macOS app that relies heavily on multithreading (a call center simulator). It runs fine on my iMac 2019 and fills up all cores nicely. In my test scenario it simulates approximately 1.4 million telephone calls in total over 100 iterations, each iteration submitted as a dispatch item on a concurrent dispatch queue.
Now I have bought a new Mac mini with M1 Apple Silicon and I was eager to see how the performance develops on that test machine. Well, it’s not bad but not as good as I expected:
System / Duration:
iMac 2019 (Intel 6-core i5, 3.0 GHz, macOS Catalina 10.15.7): 19.95 s
Mac mini (M1 8-core, macOS Big Sur 11.2, under Rosetta 2): 26.85 s
Mac mini (M1 8-core, macOS Big Sur 11.2, native ARM): 17.07 s
Investigating a little further, I noticed that at the start of the simulation all 8 cores of the M1 Mac are filled up properly, but after a few seconds only the 4 efficiency cores are still being used.
I have read the Apple docs "Optimize for Apple Silicon with performance and efficiency cores" and double-checked that the dispatch queue for the iterations is set up properly:
let simQueue = DispatchQueue.global(qos: .userInitiated)
But no success. After a few seconds of running, the performance cores are evidently no longer utilized. I even tried setting up the queue with qos set to .userInteractive, but that didn't help either. I also flagged the dispatch items with the proper qos, but that didn't change anything. It looks to me like other apps (e.g. Xcode) do utilize the performance cores even for longer periods.
Does anybody know how to force a M1 Mac to utilize the high performance cores?
"M1 8 core" is really "M1 4 performance + 4 power saving cores". I expect it to have be a bit more performance than an Intel 6 core, but not much. Exactly has you see, 15% faster than six Intel cores or about as fast as 7 Intel cores would be. The current M1 chips are low end processors. "A bit better than Intel six cores" is quite good.
Your code must be running on the performance cores, otherwise there would be no chance at all to come close to the Intel performance. In that graph, nothing tells you which cores are used.
What happens most likely is that all cores start running, each trying to do one eighth of the work, and after about 8 seconds the performance cores have finished their share. Then the work remaining on the efficiency cores migrates to the performance cores, and you are simply misinterpreting the graph as showing only the efficiency cores doing the work.
I would guess that Apple has put a preference on using the efficiency cores over the performance cores for several reasons, battery life being one and most likely thermals as well. This is the big question mark with an SoC that was originally designed for smartphones and tablets: macOS is a much heavier OS than iOS or iPadOS. Apple most likely felt that the performance cores were only needed in cases where maximum throughput is required. No doubt some of us with an M1 Mac mini, myself included, would like a way to adjust this balance between efficiency and performance cores. Personally, I would prefer that all cores be capable of switching between efficiency and performance, as with Intel's Speed Shift technology. That may come as the M1 design advances into Mac Pro and other Pro models.
OS: Linux 4.4.198-1.el7.elrepo.x86_64
CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz * 4
MEM: 376 GB
I have a program (it does some LSTM model inference based on TensorFlow 1.14). It runs on two machines with the same hardware; one gets bad performance while the other performs well (about a 10x difference).
I used Intel's pqos tool to diagnose the two processes and got very different IPC numbers (one is 0.07 while the other is 2.5). Both processes are pinned to specific CPU cores, and neither machine is heavily loaded. The problem appeared two weeks ago; before that, the bad machine worked fine, and the history command shows no configuration changes.
I checked a lot of environment information, including the kernel, filesystem, process scheduler, I/O scheduler, and the MD5 sums of the program and its libraries; they are all the same. The bad machine's iLO shows no errors, and the program mainly burns CPU.
I used sysbench to test both machines (CPU and memory), which shows about a 25% performance difference; the bad machine is slower at the prime-number calculation. Could it be a hardware problem?
I don't know the root cause of the difference in IPC (which here translates directly to performance). How can I dig into this?
I am running a very large TensorFlow model on Google Cloud ML Engine.
When using the scale tier basic_gpu (with batch_size=1) I get errors like:
Resource exhausted: OOM when allocating tensor with shape[1,155,240,240,16]
because the model is too large to fit in one GPU.
Using the tier complex_model_m_gpu, which provides 4 GPUs, I can spread the operations between the 4 GPUs.
However, I remember reading that communication between GPUs is slow and can create a bottleneck in training. Is this true?
If so, is there a recommended way of spreading operations between the GPUs that prevents this problem?
I recommend the following guide:
Optimizing for GPU
From the guide:
The best approach to handling variable updates depends on the model,
hardware, and even how the hardware has been configured.
A few suggestions based on the guide:
Try using P100s, which have 16 GB of RAM (compared to 12 GB on the K80s). They are also significantly faster, although they cost more.
Place the variables on CPU: tf.train.replica_device_setter(worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
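For the second suggestion, here is a minimal TF 1.x sketch of keeping the variables on the CPU while the compute runs on the 4 GPUs. The layer size, tensor shapes, and the batch splitting are placeholders rather than anything from your model (your case spreads the model's operations instead of the batch), but the device setter is used the same way:

import tensorflow as tf

NUM_GPUS = 4

def tower(x, gpu_id):
    # replica_device_setter keeps variables on /cpu:0 while the ops of this
    # tower are placed on the given GPU.
    setter = tf.train.replica_device_setter(
        worker_device='/gpu:%d' % gpu_id, ps_device='/cpu:0', ps_tasks=1)
    with tf.device(setter):
        with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
            return tf.layers.dense(x, 128)

inputs = tf.placeholder(tf.float32, [None, 64])
shards = tf.split(inputs, NUM_GPUS)  # batch size must be divisible by NUM_GPUS
outputs = tf.concat([tower(s, i) for i, s in enumerate(shards)], axis=0)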
Using Tesla P100 GPUs instead of Tesla K80 GPUs fixes this issue because P100s have something called Page Migration Engine.
Page Migration Engine frees developers to focus more on tuning for
computing performance and less on managing data movement. Applications
can now scale beyond the GPU's physical memory size to virtually
limitless amount of memory.
Google's Machine types page states that:
For the n1 series of machine types, a virtual CPU is implemented as a
single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy
Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge)...etc
Assuming that a single physical CPU core with hyper-threading appears as two logical CPUs to an operating system, does the n1-standard-2 machine, described as 2 virtual CPUs and 7.5 GB of memory, essentially mean 1 physical CPU core?
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4 vCPUs, correct?
Typically, when software tells you how many cores it needs, it doesn't take hyper-threading into account. Remember, AMD didn't even have an equivalent (simultaneous multithreading) until fairly recently. So 2 cores means 2 vCPUs. Yes, a single HT CPU core shows up as 2 CPUs to the OS, but it does NOT perform like 2 truly independent CPU cores.
That's correct, you should aim for a GCE machine type that has 4 vCPUs. When you're migrating from an on-premises world, you're used to physical cores, which have hyper-threading. In GCP these are called vCPUs, or virtual CPUs; a vCPU is equivalent to one hyper-thread. Therefore, if you have a single-core hyper-threaded CPU on premises, that would essentially be two virtual CPUs to one physical core. Keep that in mind, because oftentimes people will immediately run a test, say "I have a four-core physical machine and I'm going to run four cores in the cloud," and then ask why the performance isn't the same.
if the n1-standard-2 machine that is described as 2 virtual CPUs and 7.5 GB of memory, then this essentially means 1 CPU core, right?
I believe, yes.
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4 vCPUs, correct?
I think they mean 2 physical cores regardless of hyper-threading (HT), because the performance gain from HT is not a stable reference.
But IMO, the recommendation should also specify the speed of each physical core.
If the software recommends 2 CPU cores, you need 4 vCPUs on GCP.
https://cloud.google.com/compute/docs/cpu-platforms says:
On Compute Engine, each virtual CPU (vCPU) is implemented as a single hardware multithread on one of the available CPU processors. On Intel Xeon processors, Intel Hyper-Threading Technology supports multiple app threads running on each physical processor core. You configure your Compute Engine VM instances with one or more of these multithreads as vCPUs. The specific size and shape of your VM instance determines the number of its vCPUs.
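As a tiny sketch of the arithmetic (the value of 2 threads per core is an assumption that holds for the Intel-based n1 machine types):

THREADS_PER_CORE = 2  # one vCPU = one hyper-thread on GCE n1 machines

def vcpus_needed(recommended_physical_cores):
    # e.g. a recommendation of 2 physical cores -> 4 vCPUs (roughly an n1-standard-4)
    return recommended_physical_cores * THREADS_PER_CORE

print(vcpus_needed(2))  # 4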
Long ago and far away, there was a 1 to 1 equivalence between a 'CPU' (such as what one sees in the output of "top"), a socket, a core, and a thread. (And "processor" and/or "chip" too if you like.)
So, many folks got into the habit of using two or more of those terms interchangeably. Particularly "CPU" and "core."
Then CPU designers started putting multiple cores on a single die/chip. So a "socket" or "processor" or "chip" was no longer a single core, but a "CPU" was still 1 to 1 with a "core." So, interchanging those two terms was still "ok."
Then CPU designers started putting multiple "threads" (eg hyperthreads) in a single core. The operating systems would present each hyperthread as a "CPU" so there was no longer a 1 to 1 correspondence between "CPU" and "thread" and "core."
And, different CPU families can have different numbers of threads per core.
But referring to "cores" when one means "CPUs" persists.
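One quick way to see the distinction on a given machine is to compare logical and physical counts, as in this small sketch (it assumes the third-party psutil package is installed; os.cpu_count() on its own only reports logical CPUs):

import os
import psutil

logical = os.cpu_count()                    # hardware threads, what "top" shows as CPUs
physical = psutil.cpu_count(logical=False)  # actual cores
print("logical CPUs:", logical)
print("physical cores:", physical)
print("threads per core:", logical // physical)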
I need some help understanding the concept of cores on a GPU vs. cores in a CPU for the purpose of doing parallel calculations.
When it comes to cores in a CPU, it seems pretty simple. I have a super-intensive "for" loop that iterates four times. I have four cores in my Intel i5 2.26 GHz CPU. I give one iteration to each core. Each of the four iterations is independent of the others. Boom - I now have four threads created and 100% CPU usage (instead of 25% CPU usage with only one core). My "for" loop now runs almost four times faster than it would have if I had not parallelized it. By the way, for the "for" loop I was using the auto-parallelization available in Microsoft Visual Studio 2012, as in this online example: http://msdn.microsoft.com/en-us/library/hh872235.aspx
In contrast, I don't even know the number of cores in my laptop's GPU (Intel Graphics Media Accelerator HD, or Intel HD Graphics, with 1696 MB shared memory) that I can use for parallel calculations. I don't even know a valid way of comparing the GPU to the CPU. When I see "12@500MHz" next to my graphics card description, I wonder if that means the graphics card has 12 cores for parallelization that can work kind of like the 4 cores in a CPU, except that the GPU cores run at 500 MHz [slow] instead of 2.26 GHz [fast]. Is there a GPU usage figure comparable to the CPU usage in Windows Task Manager? I'm an utter novice trying to use the C++ AMP library in Visual Studio 2012, if that makes any difference. When I write the actual GPU software, the parallelization code looks like this: http://msdn.microsoft.com/en-us/library/hh265137.aspx
So, would you please fill in some of the gaps or correct mistakes in my knowledge, or help me compare the two? I don't need a super complicated answer; something as simple as "You can't compare a CPU core with a GPU core because of blankity blank" or "a GPU core isn't really a core like a CPU core is" would be very much appreciated.
First, the OS will schedule work onto more cores only if you ask for them in your code. Try using OpenMP or Win32 threads to achieve parallelism on your i5.
Second, CPU clock speeds are higher than GPU clock speeds. If a GPU were clocked the same as a CPU, you could use it as a stove to cook on. On the other hand, a GPU has many more cores than a CPU. Also note that there is a difference between a thread and a core.
Third, I recommend reading the specifications and reference manuals for your CPU and GPU. And don't forget PCIe; it is often the bottleneck for a parallel-programming implementation.
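To illustrate the first point about explicitly asking for parallelism: the question uses C++ AMP and Visual Studio auto-parallelization, so this is only an analogous sketch in Python using the standard-library multiprocessing pool, splitting four independent iterations across four cores (the work inside heavy_iteration is a made-up stand-in):

from multiprocessing import Pool

def heavy_iteration(i):
    # Stand-in for one independent iteration of the intensive "for" loop.
    total = 0
    for n in range(10_000_000):
        total += (n * i) % 7
    return total

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # one worker per CPU core
        results = pool.map(heavy_iteration, range(4))
    print(results)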
Hope this clarifies your doubts. Any more questions, feel free to ask.