Emulating a heterogenous system, like an ARM Processor with P and E Cores

Emulating a heterogenous system, like an ARM Processor with P and E Cores - linux

I'm trying to emulate a processor which consists processor cores with different max frequencies per core, like ARM processors or newer Intel processors which have a couple of Performance Cores and Efficiency Cores.
I tried it with Qemu, but I only didn't get far, the only thing I found was qemu-system-aarch64 where you can configure cores per die and die count using nema but i did't find a possibilty to change frequency or core architechture for a specific die. Is it even possible with qemu or is there a alternative? Preferably the emulation should be able to run linux.
For clarification, I'm trying to show that on a heterogeneus system i.e. a processor with different core speeds a certain framework works better then another one.

Thanks to Nate I found Intel Simics which is able to simulate heterogeneous systems.

Related

How to detect heterogeneous CPU topologies on Linux?

Many recent CPUs out there--Alder Lake from Intel and many big.LITTLE designs from ARM--have heterogeneous CPU topologies: some cores are faster than others. There exists good ways to detect such CPUs on Windows and macOS, but Linux/Android seems lacking.
On x86, there is a CPUID bit: leaf 7 page 0 EDX bit 15. But ARM's CPUID equivalent is privileged, so OS help is required.
Is there a good way on Linux to detect this?
For posterity:
On Windows, use GetLogicalProcessorInformationEx to enumerate packages and cores, then look for .Processor.EfficiencyClass being different values among cores (queries using RelationProcessorCore).
On macOS, use sysctlbyname on hw.nperflevels and check whether it exists and its value is greater than 1.

Run a process on a specific CPU

Problem
I have a Soc containing let's say an Arm M7-core and an Arm A53-core, I want to only program the M7-core (Linux) and run a specific process on the A53-core.
Questions
Is that possible or should I program both of them?
I read about thread Affinity in this article, and here I am not sure whether Affinity controls the running CPU in the Soc or the running core in the CPU (ARM cpu has several cores), please help.

ARM big.LITTLE have three implementations: https://en.wikipedia.org/wiki/ARM_big.LITTLE . On the first two implementations you can only select a cpu pair (with an Arm M7-core and an Arm A53-core) to run your thread. Depending on the workload, your threads will be executed in an M7 or A53.
Only in Heterogeneous Multi-Processing (HMP) implementation OS scheduler sees all M7 and A53 cores and you can select a specific cpu type.
If the hardware has HMP, you can restrict your thread to a arbitrary set of cores using pthread_setaffinity_np ( https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html ). The cpu set macros (which manipulate core sets) identify cores by number, so you will have to discover which numbers are M7 or A53. Probably it is the same numbering in /proc/cpuinfo or /sys/devices/system/cpu/.

GPU vs CPU? Number of cores/threads in a GPU for program calculation acceleration?

I need some help understanding the concept of cores on a GPU vs. cores in a CPU for the purpose of doing parallel calculations.
When it comes to cores in a CPU, it seems pretty simple. I have a super intensive "for" loop that iterates four times. I have four cores in my Intel i5 2.26GHz CPU. I give one loop to each core. Each of the four loops is independent of the other. Boom - I now have four threads created and 100% CPU usage (instead of 25% CPU usage with only one core). My "for" loop now runs almost four times faster than it would have if I did not parallelize it. By the way, for the "for" loop, I was using the auto-parallelization available on Microsoft Visual Studio 2012, as in this online example:(http://msdn.microsoft.com/en-us/library/hh872235.aspx).
In contrast, I don't even know the number of cores in my laptop's GPU (Intel Graphics Media Accelerator HD, or Intel HD Graphics, with 1696MB shared memory) that I can use for parallel calculations. I don't even know a valid way of comparing the GPU to the CPU. When I see "12#500MHz" next to my graphics card description, I wonder if that means the graphics card has 12 cores for parallelization that can work kinda like the 4 cores in a CPU, except that the GPU cores run at 500MHz [slow] instead of 2.26GHz [fast]? Is there a GPU usage comparable to the CPU usage in Windows task manager? I'm an utter novice trying to use the C++ library in visual studio 2012, if that makes any difference. When I write the actual GPU software, the parallelization code looks like this:(http://msdn.microsoft.com/en-us/library/hh265137.aspx).
So, would you please fill some of the gaps or mistakes in my knowledge or help me compare the two? I don't need a super complicated answer, something as simple as "You can't compare a CPU core with a GPU core because of blankity blank" or "a GPU core isn't really a core like a CPU core is" would be very much appreciated.

First, the OS initiate more cores only if you ask for them in your code. Try using OpenMP or Win32 threads to achieve parallelism on your i5.
Second, the CPU clocking is more than GPU clocking. If the clocking of GPU is same as CPU, you can use it as a stove to cook. The cores in the GPU are more than CPU. There is a difference between a thread and core.
Third, I recommend you to read specifications and reference manuals for your CPU and GPU. Also, dont forget PCI-e. It is the bottleneck for Parallel Programming implementation.
Hope this clarifies your doubts. Any more questions, feel free to ask.

Direct Cpu Threads or OpenCL

I have search the various questions (and web) but did not find any satisfactory answer.
I am curious about whether to use threads to directly load the cores of the CPU or use an OpenCL implementation. Is OpenCl just there to make multi processors/cores just more portable, meaning porting the code to either GPU or CPU or is OpenCL faster and more efficient? I am aware that GPU's have more processing units but that is not the question. Is it indirect multi threading in code or using OpneCL?
Sorry I have another question...
If the IGP shares PCI lines with the Descrete Graphics Card and its drivers can not be loaded under Windows 7, I have to assume that it will not be available, even if you want to use the processing cores of the integrated GPU only. Is this correct or is there a way to access the IGP without drivers.

EDIT: As #Yann Vernier point out in the comment section, I haven't be strict enough with the terms I used. So in this post I use the term thread as a synonym of workitem. I'm not refering to the CPU threads.
I can’t really compare OCL with any other technologies that will allow using the different cores of a CPU as I only used OCL so far.
However I might bring some input about OCL especially that I don’t really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will run as well on a CPU it doesn’t mean that it’ll be efficient. The reason is simply that OCL doesn’t work the same way on CPU and GPU. To have a good understanding of how it differs, see the chap 6 of “heterogeneous computing with opencl”. To summary, while the GPU will launch a bunch of threads within a given workgroup at the same time, the CPU will execute on a core one thread after another within the same workgroup. See as well the point 3.4 of the standard about the two different types of programming models supported by OCL. This can explain why an OCL kernel could be less efficient on a CPU than a “classic” code: because it was design for a GPU. Whether a developer will target the CPU or the GPU is not a problem of “serious work” but is simply dependent of the type of programming model that suits best your need. Also, the fact that OCL support CPU as well is nice since it can degrade gracefully on computer not equipped with a proper GPU (though it must be hard to find such computer).
Regarding the AMD platform I’ve noticed some problem with the CPU as well on a laptop with an ATI. I observed low performance on some of my code and crashes as well. But the reason was due to the fact that the processor was an Intel. The AMD platform will declare to have a CPU device available even if it is an Intel CPU. However it won’t be able to use it as efficiently as it should. When I run the exact same code targeting the CPU but after installing (and using) the Intel platform all the issues were gone. That’s another possible reason for poor performance.
Regarding the iGPU, it does not share PCIe lines, it is on the CPU die (at least of Intel) and yes you need the driver to use it. I assume that you tried to install the driver and got a message like” your computer does not meet the minimum requirement…” or something similar. I guess it depends on the computer, but in my case, I have a desktop equipped with a NVIDIA and an i7 CPU (it has an HD4000 GPU). In order to use the iGPU I had first to enable it in the BIOS, which allowed me to install the driver. Of Course only one of the two GPU is used by the display at a time (depending on the BIOS setting), but I can access both with OCL.

In recent experiments using the Intel opencl tools we experienced that the opencl performance was very similar to CUDA and intrincics based AVX code on gcc and icc -- way better than earlier experiments (some years ago) where we saw opencl perform worse.

Highly concurrent multi-threaded application requires hardware

I am looking for a hardware, which must run about 256 computationally intensive real-time concurrent tasks in 24 hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPs, so all tasks require about 10 GFLOPs. CPU-RAM speed is insignificant. All tasks must be managed by a Linux Kernel (32 bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such CPU exist). If such CPU doesn't exist, then I need one mulit-socket mainboard solution (with multiple CPUs).
Can you please recommend me any professional CPU/Mainboard solution which will satisfy such requirements? It is also very important that there are no issues with Linux Kernel (2.6.25). No virtualization, no needs in huge RAM or CPU cache. I also would prefer Intel architecture and well-proved stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.

UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)

Rent some Amazon EC2 nodes.
Updated: How about PS3's then? The NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.

Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.

Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64

Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10GFlops isn't exactly to be sneezed at so in a single machine, it'll be expensive. There's also the problem what you do when the machine breaks, you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.

MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.

I see you'd prefer intel, but if you need one chip, I will again suggest the cell processor -
its theoretical peak performance is arount 25GFlops - kernel 2.6.25 had support for it already.
You could try a pre-slim playstation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to re-write and fine tune your threads to take advabtage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single CELL (1 PPC core + 8 SPU's)
NB.: with a playstation 3, you'd have only 6 available co-processors - but you don't seen to be on a budget with this project -
So you could at least try IBM's cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
Thre are commercially available CELL products, both as stand-alone servers in blade form factory, and PCI Express add-on boards for PC workstations from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seens to be around the previoulsy mentioned U$8000.00 for these PCI Express cards.
A playstation 3 videogame can be purchased for about U$300.00 - and would allow you to prototype your application, and check if it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyst and have not, so far, used it for any calculations - I had also put together a Playstation-3 12 machinne cluster for Molecular simulations at the local University. The application they run did not take advantage of the multimedia SPU's, while I was in touch with then. But even so, clocked at 3.5GHz they performed better than standard ,s imlarly priced, PC's, even considering PS3's are priced 5x higher around here)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string