Something Similar to RAPL for non Sandy Bridge/xeon processors

First post ever here.
I wanted to know if there is something similar to the Running Average Power Limit (RAPL) for other processors (Intel i7) that aren't Sandy Bridge or Xeon processors, like the machine I'm working on in the lab.
For those who do not know, I pulled this description to bring you up to speed.
"RAPL(Running Average Power Limit) interface provides platform software
with the ability to monitor, control, and get notifications on SOC
power consumptions."
What I am looking for in particular is to acquire energy consumption measurements on a processor's individual cores after running some code like matrix multiplication or vector addition. Temperature would be excellent too, but that's another question for another day (lm-sensors is a bit puzzling to me).
Thanks and Take Care.

Late answer on this: there's PowerTOP on Linux, but it works on laptops only, because it needs the battery discharge rate for its estimates. It can display watts per process, but don't ask me how accurate that is (personally I think there might be some problems with it). IIRC it counts the number of CPU wakeups from a CPU sleep state to calculate the energy consumption per process. Also, for AMD processors there's the fam15h_power driver, whose readings show up through the lm-sensors software package. For rather new (2011 and newer) Bulldozer AMD CPUs you can get the power consumption that way.
Note that RAPL does not provide energy consumption per core on a multicore CPU, but only for the whole CPU. You can get the energy consumption of core and non-core (like integrated graphics) separately, but per-core is not possible.
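
For the RAPL side, here is a minimal sketch of measuring package energy around a piece of code. It assumes a Linux kernel with the powercap framework, which normally exposes the RAPL counters under /sys/class/powercap/intel-rapl:0 (package 0), and it assumes energy_uj is readable by your user (on many systems that needs root). The helper names are made up for illustration. As noted above, this gives the whole package, not individual cores.

```python
from pathlib import Path

# Assumed powercap location for package 0; adjust if your system differs.
RAPL_ZONE = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(zone: Path, name: str) -> int:
    """Read one integer file (e.g. energy_uj) from a powercap zone."""
    return int((zone / name).read_text().strip())


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy between two readings in microjoules, allowing for one wraparound."""
    delta = after - before
    if delta < 0:
        # The counter wrapped past max_energy_range_uj and restarted near zero.
        delta += max_range
    return delta


def measure(workload, zone: Path = RAPL_ZONE) -> float:
    """Run workload() and return the package energy consumed meanwhile, in joules."""
    max_range = read_counter(zone, "max_energy_range_uj")
    before = read_counter(zone, "energy_uj")
    workload()
    after = read_counter(zone, "energy_uj")
    return energy_delta_uj(before, after, max_range) / 1e6
```

Something like `measure(lambda: my_matrix_multiply())` (with your own function) would then report joules for the whole package, including whatever else ran at the same time.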

Related

How do I know if my CPU support high resolution timers?

As part of a Linux kernel course, it was explained to us that high resolution timers may or may not be supported by the hardware, and that the only hardware affecting this support is the CPU.
So I took my time and opened one of the Intel CPU specs.
I am trying to understand whether, by reading the specs of a given CPU, I can determine if the OS can support high resolution timers.
In that specific manual I am uncertain what to look for, but my first guess is the "Clock Topology" section (2.6 in the link).
The section lists under it three types of HW clocks:
Base Clock reference clock (100MHz), PCIe reference clock and fixed clock 38.4MHz.
Now, if high resolution clock support is based solely on the hardware, and not on some complex computation over multiple clocks and so forth, then the base reference clock's 100MHz gives a period of 10 nanoseconds, not 1. High resolution timers supposedly support 1 nanosecond resolution.
I assume Intel's high-end CPUs do support high resolution timers, but it seems I lack the knowledge of how to read the manual and of what is needed for that support.
Can someone elaborate? Does nanosecond resolution actually mean 1 nanosecond resolution? If this CPU does support HR-timers, what mechanisms are used to compensate for the lack of HW support? Can this information be obtained from the OS itself?
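
On the last point, a small sketch of both the arithmetic and the OS-side check. The first part just converts the reference clock frequencies quoted above into tick periods. The second part asks the OS for the resolution of its monotonic clock; on Linux a kernel running with high resolution timers typically reports 1 ns here, but treat that as something to verify on your own machine rather than a guarantee.

```python
import time


def period_ns(frequency_hz: float) -> float:
    """Length of one clock tick in nanoseconds."""
    return 1e9 / frequency_hz


# Reference clocks named in the question.
print(period_ns(100e6))   # 100 MHz base clock
print(period_ns(38.4e6))  # 38.4 MHz fixed clock

# Resolution the OS reports for its monotonic clock, converted to nanoseconds.
resolution_ns = time.clock_getres(time.CLOCK_MONOTONIC) * 1e9
print(resolution_ns)
```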

Understanding cpu frequency, thread selection and more

With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPUs got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing changes. (Is this a bug in Ubuntu 14.04?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPUs with the max frequency together with the thread jump around in the monitoring tools as they display the average usage, which then looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png

I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
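
To see which of those virtual CPUs share a core on a given machine, here is a rough sketch. It assumes the usual Linux sysfs layout, where each cpuN directory has a topology/thread_siblings_list file holding a list such as "0,4" or "0-1". The function names are made up for illustration.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # usual sysfs location on Linux


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-3,8' into individual CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def sibling_groups(root: Path = CPU_ROOT) -> list[tuple[int, ...]]:
    """Return one tuple of virtual CPU numbers per physical core."""
    groups = set()
    for path in root.glob("cpu[0-9]*/topology/thread_siblings_list"):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


if __name__ == "__main__":
    for group in sibling_groups():
        print("core with virtual CPUs:", group)
```

On a quad core with hyperthreading this should print four groups of two virtual CPUs each; which numbers are paired depends on the machine.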
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there is something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary, though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPUs got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing changes. (Is this a bug in Ubuntu 14.04?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
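
If -c does take a CPU index, as the manpage implies, then the 8 virtual CPUs are numbered 0 to 7, which would explain why "-c 8" did nothing. A sketch of applying one governor to every virtual CPU under that assumption follows. It only builds and prints the commands, using the same -c and -g options as in the question; the helper name is made up.

```python
import os


def governor_commands(governor: str, cpu_count: int) -> list[list[str]]:
    """Build one cpufreq-set invocation per virtual CPU (numbered from 0)."""
    return [["cpufreq-set", "-c", str(cpu), "-g", governor]
            for cpu in range(cpu_count)]


if __name__ == "__main__":
    for command in governor_commands("performance", os.cpu_count() or 1):
        # To actually apply it, run each command as root,
        # e.g. with subprocess.run(command, check=True).
        print(" ".join(command))
```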
However as I understand it, the CPUs with the max frequency together with the thread jump around in the monitoring tools as they display the average usage, which then looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference.] [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc.) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me
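
On "forcing the app to this core": a minimal sketch of pinning a process to one virtual CPU from Python, using the Linux-only os.sched_setaffinity call. The helper name is made up, and whether pinning helps at all depends on what else runs on the machine, as discussed above.

```python
import os


def pin_to_cpu(cpu: int, pid: int = 0) -> set[int]:
    """Restrict a process (pid 0 means the caller) to one virtual CPU.

    Returns the affinity mask that is in effect afterwards.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Pick a CPU this process is currently allowed to run on.
    target = max(os.sched_getaffinity(0))
    print("now restricted to:", pin_to_cpu(target))
```

For an already compiled app, `taskset -c 7 ./app` from the shell does the same job without touching the code.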

Thermal aware scheduler in linux

Currently I'm working on making a temperature-aware version of Linux for my university project. Right now I have to create a temperature-aware scheduler that takes processor temperature into account when making scheduling decisions. Is there a generalized way to get the temperature of the processor cores, or can I integrate the coretemp driver with the Linux kernel in some way? (I didn't find a way to do so on the internet.)

lm-sensors simply uses some device files exported by the kernel for CPU temperature; you can read whatever variables back these device files in the kernel to get the temperature information. As for the scheduler, I would not write one from scratch. I would start with the kernel's CFS implementation and, in your case, modify the load balancer check to include temperature. (Currently it uses a metric that is the calculated cost of moving a task from one core to another in terms of cache issues, etc. I'm not sure if you want to keep this or not.)
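
To get a feel for what those exported files contain before going into the kernel, here is a user-space sketch. It assumes the usual hwmon sysfs layout (/sys/class/hwmon/hwmonN with temp*_input files in millidegrees Celsius and optional temp*_label files). The function name is made up, and an in-kernel scheduler would of course read the values through the driver rather than through these files.

```python
from pathlib import Path

HWMON_ROOT = Path("/sys/class/hwmon")  # usual hwmon sysfs location


def read_temperatures(root: Path = HWMON_ROOT) -> dict[str, float]:
    """Map 'chip/label' to degrees Celsius for every temp*_input file found."""
    readings = {}
    for chip in sorted(root.glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                readings[f"{chip_name}/{label}"] = int(sensor.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # some sensors refuse reads or return nothing usable
    return readings


if __name__ == "__main__":
    for sensor_name, celsius in read_temperatures().items():
        print(f"{sensor_name}: {celsius:.1f} C")
```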

Temperature control is very difficult. The difficulty is with thermal capacity and conductance. It is quite easy to read a temperature. How you control it will depend on the system model. A Kalman filter or some higher order filter will be helpful. You don't know:
Sources of heat.
Distance from sensors.
Number of sensors.
Control elements, like a fan.
If you only measure at the CPU itself, the hard drive could have overheated 10 minutes ago, with the heat only arriving at the CPU now. Throttling the CPU at this instant is not going to help. Only by getting a good thermal model of the system can you control the heat. Yet, you say you don't really know anything about the system? I don't see how a scheduler by itself can do this.
I have worked on a mobile freezer application where operators would load pallets of ice cream, etc. from a freezer to a truck. Very small distances between sensors and control elements can create havoc with a control system. Also, you want your ambient temperature to be read instantly if possible. There is a lot of lag in temperature control. A small distance could delay a reading by 5-15 minutes (i.e., it takes 5-15 minutes for heat to transfer 1cm).
I don't see the utility of what you are proposing. If you want this for a PC, then video cards, hard drives, power supplies, sound cards, etc. can create as much heat as the CPU. You cannot generically model a PC; maybe you could with an Apple product. I don't think you will have a lot of success, but you will learn a lot from trying!
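
To make the lag argument concrete, here is a toy first-order model, not a model of any real PC. It assumes the sensor reading drifts toward the temperature of a heat source with a single time constant; the 300 second time constant and the temperatures are invented purely for illustration.

```python
def simulate_sensor(source_temps, tau_s, dt_s=1.0, initial=None):
    """First-order lag: each step moves the reading toward the source temperature.

    source_temps: source temperature at each time step (degrees C)
    tau_s:        assumed thermal time constant in seconds
    dt_s:         length of one time step in seconds
    """
    reading = source_temps[0] if initial is None else initial
    trace = []
    for source in source_temps:
        reading += (source - reading) * dt_s / tau_s
        trace.append(reading)
    return trace


if __name__ == "__main__":
    # The source jumps from 20 C to 60 C at t = 0; the sensor starts at 20 C.
    trace = simulate_sensor([60.0] * 900, tau_s=300.0, initial=20.0)
    for minute in (1, 5, 10, 15):
        print(f"after {minute:2d} min the sensor reads {trace[minute * 60 - 1]:.1f} C")
```

With these made-up numbers the sensor is still far from the source temperature after five minutes, so a scheduler reacting to the reading alone is always acting on old information.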

Intel MSR frequency scaling per-thread

I'm extending the Linux kernel in order to control the frequency of some threads: when they are scheduled onto a core (any core!), the core's frequency is changed by writing the proper p-state to the register IA32_PERF_CTL, as suggested in Intel's manual.
But when different threads with different "custom" frequencies are scheduled, it appears that the throughput of all the threads increases, as if all the cores were running at the highest frequency set.
I did many trials and measurements in different conditions of load and configuration, but the result is the same.
After some trials with CPUFreq (with no running app, I set different frequencies on each core, and finally the measured frequencies, with cpufreq-info -w, were equal), I wonder if the CPU cores can really run at different, independent frequencies, or if there are hardware policies or constraints.
Finally, is there a CPU model which makes this fine-grained frequency scaling feasible?
The CPU I am using is Intel Core i5 750

You cannot control individual core frequencies for active cores. You can, however, control frequencies of all active cores to be the same. The reasons are in the previous answers - all cores are on the same active voltage plane.
Hopefully, the next-gen Haswell processors will make it possible to control each core separately.

I think you're missing a big piece of the picture!
Read up on power and clock domains. All processor cores within a domain run at the same P-state (i.e., the same frequency and voltage). The P-state that all cores in that domain run at will always be the P-state of the core requesting the highest P-state in that domain. The MSRs don't reflect this at all, nor do the interfaces that the kernel exposes.
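
The rule is easy to express in code. This is only a sketch of the arbitration described here, with frequencies standing in for P-states and with invented request values; it ignores idle cores being power gated.

```python
def effective_frequencies(requested_mhz: dict[int, int],
                          domains: list[set[int]]) -> dict[int, int]:
    """Each core runs at the highest frequency requested by any core in its domain."""
    effective = {}
    for domain in domains:
        top = max(requested_mhz[core] for core in domain)
        for core in domain:
            effective[core] = top
    return effective


if __name__ == "__main__":
    # Illustrative per-core requests, e.g. written by a per-thread policy.
    requests = {0: 1200, 1: 2660, 2: 1600, 3: 1200}
    # One domain holding all four cores: every core ends up at 2660.
    print(effective_frequencies(requests, [{0, 1, 2, 3}]))
    # Hypothetical per-core domains: every request is honoured as written.
    print(effective_frequencies(requests, [{0}, {1}, {2}, {3}]))
```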
Anandtech has a good article on this:
http://www.anandtech.com/show/2658/2
"This is all very similar to AMD's Phenom, but where the two differ is in how they handle power management. While AMD will allow individual cores to request different clock speeds, Nehalem attempts to run all of its cores at the same frequency; if one core is idle then it's simply power gated and the core is effectively turned off."
I haven't hooked a power-meter up to SB/IB, but my guess is that the behavior is the same.

cpufreq-info will display information about which cores need to be synchronous in their P-states:
[root@navi ~]# cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0 1 <---- THIS
CPUs which need to have their frequency coordinated by software: 0 <--- and THIS
maximum transition latency: 10.0 us.
If only for that reason, I'd recommend going through the cpufreq interfaces instead of setting registers directly; it also makes it possible to run on non-Intel CPUs, which might have uncommon requirements.
Also check how to make kernel threads stick to a specific core, to avoid unexpected switching, if you didn't do so already.
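
The same two lists can be read without cpufreq-info. As far as I know they come from the related_cpus and affected_cpus files in each CPU's cpufreq directory in sysfs; the sketch below assumes that layout and uses a made-up helper name.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # usual sysfs location on Linux


def frequency_domain(cpu: int, root: Path = CPU_ROOT) -> dict[str, list[int]]:
    """Read the CPU lists that cpufreq publishes for one virtual CPU.

    related_cpus:  CPUs that run at the same hardware frequency
    affected_cpus: CPUs whose frequency must be coordinated by software
    """
    base = root / f"cpu{cpu}" / "cpufreq"
    return {
        name: [int(number) for number in (base / name).read_text().split()]
        for name in ("related_cpus", "affected_cpus")
    }


if __name__ == "__main__":
    print(frequency_domain(0))
```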

I want to thank everyone for the contribution!
Further investigating, I found other details I share with the community.
As suggested, Nehalem places all the cores in a single clock domain, so that the maximum frequency set among all the cores is applied to all of them; some tools may show different frequencies on idle cores, but it is sufficient to run any application to make the frequency rise to the maximum.
This, from my tests, also applies to Sandy Bridge, where cores and LLC slices all reside in the same frequency/voltage domain.
I assume that this behavior also happens with Ivy Bridge, as it is only a 'tick' iteration.
Instead, I believe that Haswell places cores and LLC slices in separate, individual domains, thus enabling per-core frequencies. This is also advertised on several pages, like
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4

Highly concurrent multi-threaded application requires hardware

I am looking for hardware that must run about 256 computationally intensive real-time concurrent tasks, 24 hours a day (one multi-threaded C application). Each task takes about 40-50 MFLOPS, so all tasks together require about 10 GFLOPS. CPU-RAM speed is insignificant. All tasks must be managed by a Linux kernel (32 bit, with SMP).
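
For reference, the arithmetic behind that total, as a tiny sketch using only the numbers given above:

```python
def total_gflops(tasks: int, mflops_per_task: float) -> float:
    """Aggregate sustained throughput needed, in GFLOPS."""
    return tasks * mflops_per_task / 1000.0


low = total_gflops(256, 40)   # lower bound of the per-task estimate
high = total_gflops(256, 50)  # upper bound of the per-task estimate
print(f"required: {low:.2f} to {high:.2f} GFLOPS sustained")
```

So "about 10 GFLOPS" is the low end; the upper end of the per-task estimate pushes the total closer to 13 GFLOPS.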
I am looking for a one-mainboard solution with one multi-core CPU (if such a CPU exists). If no such CPU exists, then I need a single multi-socket mainboard solution (with multiple CPUs).
Can you please recommend any professional CPU/mainboard solution that will satisfy these requirements? It is also very important that there are no issues with the Linux kernel (2.6.25). No virtualization, and no need for huge RAM or CPU cache. I would also prefer Intel architecture and well-proven stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.

UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
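
A back-of-the-envelope check of those figures against the question's 256 tasks at 40-50 MFLOPS each. It reads "nearly 90%" as 0.90 and assumes throughput scales linearly across the two chips of a T5140; both are assumptions, so treat the result as a rough estimate only.

```python
def sustained_gflops(peak_gflops: float, efficiency: float, chips: int = 1) -> float:
    """Estimated sustained throughput, assuming linear scaling across chips."""
    return peak_gflops * efficiency * chips


required_low = 256 * 40 / 1000   # GFLOPS, from the question
required_high = 256 * 50 / 1000  # GFLOPS, from the question

one_chip = sustained_gflops(11, 0.90)      # a single UltraSPARC T2
two_chips = sustained_gflops(11, 0.90, 2)  # T5140 with two of them

print(f"required:  {required_low:.2f} to {required_high:.2f} GFLOPS")
print(f"one chip:  {one_chip:.1f} GFLOPS")
print(f"two chips: {two_chips:.1f} GFLOPS")
```

On these assumptions a single chip lands just under the low end of the requirement, while the two-chip T5140 clears the high end with room to spare.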

Rent some Amazon EC2 nodes.
Updated: How about PS3's then? The NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.

Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.

Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64

Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10GFlops isn't exactly to be sneezed at, so in a single machine it'll be expensive. There's also the problem of what you do when the machine breaks: you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.

MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.

I see you'd prefer Intel, but if you need one chip, I will again suggest the Cell processor: its theoretical peak performance is around 25 GFLOPS, and kernel 2.6.25 already had support for it.
You could try a pre-slim PlayStation 3 for experimenting (that would cost you little) or get yourself a server-based solution at around US$8K. You will have to rewrite and fine-tune your threads to take advantage of the SPU co-processors there, but you could meet your computational needs without breaking a sweat with a single Cell (1 PPC core + 8 SPUs).
NB: with a PlayStation 3, you'd have only 6 available co-processors, but you don't seem to be on a budget with this project.
So you could at least try IBM's Cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
There are commercially available Cell products, both as stand-alone servers in blade form factor and as PCI Express add-on boards for PC workstations, from Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seems to be around the previously mentioned US$8,000 for these PCI Express cards.
A PlayStation 3 console can be purchased for about US$300 and would allow you to prototype your application and check whether it is up to the needed performance. (I got one myself and have Fedora 9 running on it, although I did that as a hobbyist and have not, so far, used it for any calculations. I also put together a 12-machine PlayStation 3 cluster for molecular simulations at the local university. The application they ran did not take advantage of the multimedia SPUs while I was in touch with them. But even so, clocked at 3.2GHz, they performed better than standard, similarly priced PCs, even considering that PS3s are priced 5x higher around here.)
