Specify CPU frequency as a kernel command-line parameter of Linux at boot? - linux

I replaced the i5 CPU in my laptop with an i7 so that it can run faster.
But because the i7 draws more power, and its temperature is also higher than before, my laptop crashed frequently. So I used cpupower to cap the MAX frequency of the CPU, and it works.
Now, my question is: "Is there a way to specify the CPU frequency as a command-line parameter of the Linux kernel, at boot time?", so I can be sure the system has booted stably and correctly.
Btw, if the new CPU runs at no more than 2.5 GHz, everything is OK, and the performance is still about twice that of the old one, so I think the CPU swap was worth it.
Thanks a lot!

UPDATE - 2018-11-25
Also, I want to mention the commands below, which use the CPUFreq subsystem directly, without any tool (such as cpufrequtils, which is used for the same purpose). Sometimes these tools lack features or simply don't work as we want. Because the CPUFreq core creates a sysfs directory under /sys/devices/system/cpu/, some attributes are available read-write and can be changed at the kernel level. These attribute changes are called policies, as CPUFreq exposes a policy interface in sysfs. The commands below should work at boot time and persist across reboots.
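For reference, the per-policy attributes live under each CPU's cpufreq directory; a quick way to see what a given machine exposes (the paths are standard, the exact attribute list depends on the driver):
ls /sys/devices/system/cpu/cpu0/cpufreq/
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq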
If intel_pstate is selected as the scaling driver (this part may help avoid higher frequencies if you decide to keep intel_pstate):
Turbo can be disabled, since we want to prevent higher frequencies.
echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
After this, the following command can be useful:
echo "70" | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
(70 can be replaced by another percentage if the base and turbo clock speeds are higher; 70-80 should be enough to stay below 2.5 GHz.)
This attribute is explained as below in https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt and may help bring down high CPU frequencies.
max_perf_pct: Limits the maximum P-State that will be requested by the driver. It states it as a percentage of the available performance.
P-states are operational states, and going from Pn to P0 the frequency increases. So, limiting the maximum P-state as a percentage of the maximum supported performance level can be useful. Check this link: https://software.intel.com/en-us/blogs/2008/05/29/what-exactly-is-a-p-state-pt-1
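As a quick sanity check, the current intel_pstate limits can be read back from the same sysfs directory (these attributes exist only while the intel_pstate driver is active):
cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
cat /sys/devices/system/cpu/intel_pstate/no_turbo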
Also, with intel_pstate, all CPUs share the same limits by default. While using intel_pstate as the scaling driver, per-CPU performance limits through the cpufreq attributes (e.g. scaling_max_freq) can be enabled by adding the kernel parameter below:
intel_pstate=per_cpu_perf_limits
With that, CPUs can be set separately:
echo -n 2457600 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
echo -n 2457600 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
echo -n 2457600 > /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq
echo -n 2457600 > /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq
But there is an important detail: a built-in script in Linux (/etc/init.d/ondemand). If ondemand or powersave is used as the scaling governor, the settings we make (like the ones above) can collide with this script. The script should be disabled with the command below:
sudo /usr/sbin/update-rc.d ondemand disable
Further info is in here: https://help.ubuntu.com/community/UbuntuStudio/Setting_CPU_Governor
After disabling ondemand, other scaling governors (like userspace or performance) can be set and used together with the configuration above.
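For example, with cpufrequtils installed and the ondemand script disabled, the userspace governor can be selected and a fixed frequency requested per CPU (a sketch; the frequency must be one the CPU actually supports):
sudo cpufreq-set -c 0 -g userspace
sudo cpufreq-set -c 0 -f 2.4GHz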
These are all fundamental commands (in both the section below and the section above) and they should help solve CPU frequency scaling problems; I also wanted to record this information for future reference.
First of all, I want to give some information about CPU frequency scaling.
Three terms are related to this process (they are layers of a subsystem called "CPU Performance Scaling") and they should be briefly reviewed to make sure everything is understood correctly.
CPUFreq Core
Scaling Driver
Scaling Governor
The CPUFreq core is the basic framework and contains the common code infrastructure for all platforms that support this feature.
The scaling driver changes the CPU P-states requested by the scaling governors and communicates with the hardware.
(P-states are operational states, in contrast to C-states, which are idle states, except for C0. C0 is the busy, active state.)
Scaling governors implement the scaling algorithms.
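You can see which driver and governor are currently active on a system through the same sysfs interface (standard paths; the output differs per machine):
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors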
By the way, CPU performance scaling is a deep topic and there are many things to consider. Still, with the information above, the commands below should meet your needs.
Firstly, I think intel_pstate is currently used as the scaling driver on your laptop. Disabling it may give us more advanced settings and more governors (intel_pstate only provides two governors, powersave and performance; I think powersave is the default for intel_pstate).
sudo vi /etc/default/grub
Add intel_pstate=disable to the GRUB_CMDLINE_LINUX_DEFAULT parameter.
GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable"
After adding the parameter, execute the commands below:
sudo modprobe acpi-cpufreq
sudo update-grub
You can check the kernel parameters used at boot with the command below:
cat /proc/cmdline
This way, acpi-cpufreq will be used as the scaling driver (because intel_pstate is disabled). The next step can be setting the governor to userspace to run the CPU at the desired frequencies, or leaving it at the default (ondemand should be the default for acpi-cpufreq).
First Way of Setting the Governor and Maximum Frequency
If you want to change the scaling governor (e.g. to userspace):
sudo update-rc.d ondemand disable (this prevents the settings above from being reset after a reboot)
sudo apt install cpufrequtils (to control the CPU frequency scaling daemon)
echo 'GOVERNOR="userspace"' | sudo tee /etc/default/cpufrequtils
After these steps, we should have acpi-cpufreq as the scaling driver and ondemand (if you didn't change the governor) as the scaling governor. So, the last thing seems to be setting the maximum frequency of the CPU.
Editing /etc/default/cpufrequtils like below should set CPU frequencies. If the file doesn't exist, create it.
MAX_SPEED="2457600"
MIN_SPEED="1536000"
Also check the lines below in the same file.
ENABLE="true"
GOVERNOR="ondemand" (or userspace)
But with this approach, I think there is no guarantee that all CPU cores get set to the same frequency values. I have seen people say that the second way below sets all CPU cores to the desired values, while the first way does not.
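If you want to be certain that every core gets the same limit regardless of the tool, you can also write the value to each policy directly; a minimal sketch using the standard sysfs files and the same frequency as above:
for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_max_freq; do
    echo 2457600 | sudo tee "$f"
done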
Second Way of Setting the Governor and Maximum Frequency
Install tlp (a Linux power management tool):
sudo apt install tlp
After installing, edit /etc/default/tlp like below:
# Select a CPU frequency scaling governor:
#   ondemand, powersave, performance, conservative
# Intel Core i processor with intel_pstate driver:
#   powersave, performance
# Important:
#   You must disable your distribution's governor settings or conflicts will occur.
#   ondemand is sufficient for almost all workloads, you should know what you're doing!
CPU_SCALING_GOVERNOR_ON_AC=ondemand
CPU_SCALING_GOVERNOR_ON_BAT=ondemand
# Set the min/max frequency available for the scaling governor.
# Possible values strongly depend on your CPU. For available frequencies see
# tlp-stat output, Section "+++ Processor".
CPU_SCALING_MIN_FREQ_ON_AC=0
CPU_SCALING_MAX_FREQ_ON_AC=0
CPU_SCALING_MIN_FREQ_ON_BAT=1536000
CPU_SCALING_MAX_FREQ_ON_BAT=2457600
The settings above should be preserved after restarting or suspending the device.
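If tlp is in use, its status tool can confirm the limits that were actually applied (tlp-stat ships with the tlp package; -p prints the processor section):
sudo tlp-stat -p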
I have tried to provide and explain ways to set the CPU frequency (and to keep the settings persistent), and I may have forgotten something. So, please check the information above and see whether it meets your needs. You can also use the command below to make sure everything is right:
cpufreq-info
Note: please check the pages below for more information.
Governors list
https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt
https://www.kernel.org/doc/html/v4.14/admin-guide/pm/cpufreq.html
https://www.kernel.org/doc/html/v4.12/admin-guide/pm/intel_pstate.html

Eventually I have found time to reply, as I have been busy with other things.
I tried all of the above solutions, and chose "tlp + lm-sensors + psensor".
The following are my opinions:
cpupower is a simple but relatively feature-poor tool; it can only set the MAX/MIN CPU frequency and the governor.
cpufrequtils is basically the same as cpupower, except that it is based on the ACPI driver, not the genuine Intel one. I guess a genuine Intel driver with P-state support should be the better choice for an Intel CPU.
tlp is my final choice; it has more features to monitor/throttle the temperature and frequency of the CPU, and more configurable options.
Yes, as Erdem Savasci said, with tlp the MAX/MIN frequencies of all CPU cores can be set in one step, which cannot be done with cpufrequtils.
In addition, I installed lm-sensors and psensor. The former can be thought of as a driver for querying the temperature/frequency/fan speed; the latter is a GUI panel that shows the information mentioned above.
With these tools, I believe my CPU will run stably.
But the solution to "ensure the CPU runs stably AT BOOT TIME" has not been found yet.
All of the above start after boot, don't they?
Sorry for my poor English; I am Chinese. I hope I have expressed things correctly.
Thanks again!

Related

Programmatically disable CPU core

The way to disable logical CPUs in Linux is known: basically echo 0 > /sys/devices/system/cpu/cpu<number>/online. This way, you are only telling the OS to ignore that given (<number>) CPU.
My question goes further: is it possible not only to ignore it but to turn it off physically, programmatically? I want that CPU to not receive any power, in order to make its energy consumption zero.
I know that it is possible to disable cores from the BIOS (though not always), but I want to know whether it is possible to do it from within a program or not.
When you do echo 0 > /sys/devices/system/cpu/cpu<number>/online, what happens next depends on the particular CPU. On ARM embedded systems the kernel will typically disable the clock that drives that particular core's PLL, so effectively you get what you want.
On Intel x86 systems, you can only disable the interrupts and call the hlt instruction (which the Linux kernel does). This effectively puts the CPU into a power-saving state until it is woken up by another CPU at the user's request. If you have a laptop, you can verify that the power draw indeed goes down when you disable the core by reading the power from /sys/class/power_supply/BAT{0,1}/current_now (or uevent for all values such as voltage) or by using the "powertop" utility.
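For example, a rough check on a laptop might look like the following (the CPU number and the BAT0 path are only examples and differ per machine):
echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online
cat /sys/class/power_supply/BAT0/current_now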
For example, here's the call chain for disabling the CPU core in Linux Kernel for Intel CPUs.
https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c
arch/x86/kernel/smp.c: smp_ops.play_dead = native_play_dead,
arch/x86/kernel/smpboot.c : native_play_dead() -> play_dead_common() -> local_irq_disable()
Before that, CPUFREQ also sets the CPU to its lowest power consumption level before disabling it, though this does not seem to be strictly necessary.
intel_pstate_stop_cpu -> intel_cpufreq_stop_cpu -> intel_pstate_set_min_pstate -> intel_pstate_set_pstate -> wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
On Intel x86 there does not seem to be an official way to disable the actual clocks and voltage regulators. Even if there were, it would be specific to the motherboard, so your closest bet might be looking into the BIOS, such as coreboot.
Hmm, I realize I have no idea about Intel beyond looking into the kernel sources.
In Windows 10 this became possible with the new power management settings CPMINCORES and CPMAXCORES.
Powercfg -setacvalueindex scheme_current sub_processor CPMAXCORES 50
Powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 25
Powercfg -setactive scheme_current
Here 50% of the cores are assigned for the desired deep sleep, and 25% are forbidden from being parked. Very good in numeric simulations requiring an increased clock rate (15% boost on Intel).
You cannot choose which cores to park, but on Intel's Comet Lake and newer the Windows 10 kernel checks for "preferred" (more power-efficient) cores and starts parking those that are not preferred.
It is not strict parking, so under high load the kernel can still use these cores at very low load.
Just in case you are looking for alternatives:
You can get closest to this by using the cpufreq governors. Make Linux exclude the CPU, and the power-saving mode will ensure that the core runs at the minimal frequency.
You can also isolate cpus from the scheduler at kernel boot time.
Add isolcpus=0,1,2 to the kernel boot parameters.
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
I know this is an old question, but one way to disable the CPU is via the grub config.
If you add it to the end of GRUB_CMDLINE_LINUX in /etc/default/grub (assuming you are using a standard Linux distribution; if you are using an appliance, the location of the grub config may be different), e.g.:
GRUB_CMDLINE_LINUX=".......Current config here maxcpus=2"
Then remake your grub config by running
grub2-mkconfig -o /boot/grub2/grub.cfg (or grub-mkconfig -o /boot/grub2/grub.cfg, depending on your installation). Some distros may require nr_cpus instead of maxcpus.
Just some extra info:
If you are running a server with multiple physical CPUs, then disabling one CPU will most likely also disable the memory bank linked to that CPU, therefore it may have an effect on the performance of the server.
Disabling the CPU this way will not prevent your type 1 hypervisor from accessing the CPU (this is based on the Xen hypervisor; I believe it applies to VMware as well, and confirmation from anyone would be great). Depending on the VirtualBox setup, it may restrict the number of CPUs you can allocate to VMs unless you are running para-virtualization.
I am unsure, however, whether you will see any power savings. Most servers and even desktops these days already manage power well, putting to sleep any device not needed for the current load. My concern would be that by reducing the number of CPUs (cores) you just move the load to the remaining CPUs, and due to the need to schedule processor time, instructions potentially being queued, and a smaller number of cores being available for interrupts (e.g. network traffic), it may have a negative effect on power consumption.
AFAIK there is no system call or library function available for this as of now, nor even an ioctl implementation. So apart from creating a new module / system call, there are two ways I can think of:
using inline assembly, asm(<assembly code>);, where the assembly code is architecture-specific code that modifies the CPU state.
using system() in C (man 3 system), assuming you just want to do it through C.

Evaluating SMI (System Management Interrupt) latency on Linux-CentOS/Intel machine

I am interested in evaluating the behavior (latency, frequency) of SMI handling on a Linux machine running CentOS and used for a (very) soft real-time application.
What tools are recommended (hwlatdetect for CentOS?), and what is the best course of action to go about this?
If no good tools are available for CentOS, am I correct to assume that installing a different OS on the same machine should yield the same results, since the underlying hardware/BIOS are the same?
Is there any source for ballpark figures on these parameters?
The machines are X86_64 architecture, running CentOS 6.4 (kernel 2.6.32-358.23.2.el2.centos.plus.x86_64.)
SMIs can certainly happen during normal operation. My home desktop has a chipset-driven SMI every second and a half which is enabled in the chipset. I've also seen some servers that have them twice a second due to a BIOS-driven CPU frequency scaling scheme. However, some systems can go long periods of time without an SMI occurring so it really depends.
Question #1: hwlatdetect is one option to detect the latency of SMIs occurring on your system. BIOSBITS is another option; it is a bootable CD that can identify whether SMIs are occurring. You can also write your own test by creating a kernel module that spins in a loop and takes timestamps (using RDTSC). If you see a long gap between two timestamp readings, you could consult CPU MSR 0x34 to see if the SMI counter incremented, which would indicate that an SMI happened.
If you want to generate an SMI, you can make a kernel module that does an OUT CPU instruction to port 0xb2, e.g. write a value of 0 to this port. (You can also time this SMI by gathering a timestamp just before and just after the write to port 0xB2).
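If you only want to watch the SMI counter from user space rather than write a kernel module, the msr-tools package can read MSR 0x34 directly (assuming a reasonably recent Intel CPU on which 0x34 is MSR_SMI_COUNT):
sudo modprobe msr
sudo rdmsr -p 0 0x34    # SMI count on CPU 0; read it again later and compare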
Question #2: SMIs operate at a layer below the OS, so which OS you choose shouldn't have any impact.
Question #3: BIOSBITS recommends that SMI latencies be kept under 150 microseconds.
An SMI will put your system into SMM (System Management Mode), which postpones the normal execution of the kernel for the duration of the SMI handling. In other words, SMM is neither real mode nor protected mode as we know them from normal kernel operation; instead it executes special code kept in SMRAM (stored in the BIOS firmware). To measure its latency you can try to trigger an SMI (it can be software-generated) and try to capture the total time spent in SMM mode. To accomplish this you can write a Linux kernel module, because you'll require special privileges to issue an SMI (I think).
For real-time systems I think it's best if you can avoid this sort of interrupt, like SMIs.
You can check whether System Management Interrupts (SMI) are serviced or not with turbostat. For example:
# turbostat sleep 120
[check column SMI for value greater than 0]
Of course, from that you can also compute an SMI frequency.
Knowing that SMIs are actually happening at a certain rate is important information. But you also want to know how much time System Management Mode (SMM) spends in those interrupts. For example, if an SMI interruption is only very short, then it might be irrelevant for your real-time application. On the other hand, if you have hardware with long SMI interruptions, you probably want to talk to the vendor, configure the firmware differently (if possible), and/or switch to other hardware with a less intrusive SMM.
The perf tool has a mode that measures how many cycles are spent in SMM during SMIs (using the information provided by certain CPU counters). Example:
# perf stat -a -A --smi-cost -- sleep 120
Performance counter stats for 'system wide':
SMI cycles% SMI#
CPU0 0.0% 0
CPU1 0.0% 0
CPU2 0.0% 0
CPU3 0.0% 0
120.002927948 seconds time elapsed
You can also look at the raw values with:
# perf stat -a -A --smi-cost --metric-only -- sleep 120
From that you can compute how much time an SMI takes on average on your machine. (divide cycles difference by the number of cycles per time unit).
It certainly makes sense to cross-check the CPU-counter-based results with empirical ones.
You can use the Linux Hardware Latency Detector that is integrated in the Linux Kernel. Usage example:
# echo hwlat > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /sys/kernel/debug/tracing/tracing_thresh
# watch -d -n 5 cat /sys/kernel/debug/tracing/tracing_max_latency
# echo "Don't forget to disable it again"
# echo nop > /sys/kernel/debug/tracing/current_tracer
Those tools are available on CentOS/RHEL 7 and should be available on other distributions, as well.
Regarding ballpark figures: recently I came across an HP 2011-ish ProLiant Gen8 Xeon server that fires 504 SMIs per minute. Perf computes a rate of 0.1% in SMM, and based on the counter values the average time spent in an SMI is as high as several microseconds - but the Linux hwlat detector doesn't detect such long interruptions on that system.
That SMI rate matches what HP documents in its Configuring and tuning HPE ProLiant Servers for low-latency applications guide (October 2017):
Disabling System Management Interrupts to the processor provides one of
the greatest benefits to low-latency environments.
Disabling the Processor Power and Utilization Monitoring SMI has the greatest
effect because it generates a processor interrupt eight times a second in G6
and later servers.
(emphasis mine; and that guide also documents other SMI sources)
On a Supermicro board with Intel Atom C3758 and an Intel NUC (i5-4250U) system of mine there are exactly zero SMIs counted.
On an Intel i7-6600U based Dell laptop, the system reports 8 SMIs per minute, but the APERF counter is lower than the (unhalted) cycles counter, which isn't supposed to happen.
Actually, SMI is used for more than just keyboard emulation. Servers use SMI to report and correct ECC memory errors, ACPI uses SMI to communicate with BIOS and perform some tasks, even enabling and disabling ACPI is done through SMI, BIOS often intercepts power state changes through SMI... there's more, this is just a few examples.
According to the Wikipedia page on System Management Mode, SMI is not used during normal operation, except perhaps to emulate a PS/2 keyboard with a USB physical keyboard.
And most Linux systems are able to drive a genuine USB keyboard without that emulation. You could configure your BIOS to disable it.

Understanding cpu frequency, thread selection and more

With a 1270v3 and a single-threaded app I'm at the limit of performance, but when I watch monitoring tools like atop I don't understand how all of this works. I tried to find a good article about this sort of topic, but they are either written in language I don't understand or are not about what I would like to know. I hope it is alright to ask this kind of thing here.
From my understanding, a single-threaded app only uses one thread for all/most of the work. So the performance is defined by the single-thread performance of the CPU.
A moment before I wrote this question I played around with the CPU frequency and noticed that although there are only two instances of the app running, the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2 GHz as before, except one that is permanently at 3.5 GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean, the app is running at about 300%, so why doesn't it stick to the CPU core at 100%?
Furthermore, as I noticed that only one of the CPUs got scaled up, I looked into the help page and found -r, which should scale all cores with the performance setting. Unfortunately nothing changes. (Is this a bug in Ubuntu 14.04?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works, but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering: does this not support hyper-threading, or is the whole program that buggy?
However, as I understand it, the CPUs at max frequency, together with the thread, jump around in the monitoring tools because they display the average usage, which then looks shared. Did I figure this right?
Would forcing one CPU to 3.5 GHz and forcing the app onto this core improve performance, or is everything I'm wondering about just the result of averaging the data they show each second?
If so, am I right that I would run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
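That coupling is visible in sysfs, too; a quick check of which virtual CPUs must share a frequency (a standard cpufreq attribute whose contents depend on the hardware and driver):
cat /sys/devices/system/cpu/cpu0/cpufreq/related_cpus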
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won't make a difference. I'm suspicious of what a "governor" does. The language appears to refer to power-efficiency/performance ("balanced", "powersave", "performance", etc) but the details don't match the capability of today's hardware.]
Thanks for reading so far, I hope you have a moment to help me

How to setup a computer for controlled experimentation profiling algorithms?

We work on empirically measuring the running time of certain algorithms (to check against their asymptotic behavior). I'm trying to come up with a set of rules to "clean up" our target computer before an experiment. This is not really performance at the level of Agner Fog, but I would still like to start with as clean a machine as possible (and keep its overhead as constant as I can). I have so far:
Disable all power management
Disable screen saver (perhaps disable X altogether?)
Disable network
Boot in single user mode [Kenneth Hoste]
Run the experiment more than once (to smooth out fortuitous events)
Frequency scaling configured to run at maximum frequency [binarym]
?
Obviously, repeating the experiment several times will give me some statistical power, but I'd still like to do this in as clean a machine as possible.
What other tricks do people know to keep a machine constant during program profiling? This is a Linux machine and it is ok if the "rules" are Linux specific.
Pull the network cable, so your system is not spending time on network traffic passing by.
Running in single-user mode also helps, because then fewer services are running which could disrupt your measurements.
Keep away from the system while the experiments are running; anything you do (log in, ssh into the machine, 'cat' a file, run an 'ls', etc.) will affect the measurements.
But do realize there is no such thing as a stable measurement; the only way to be sure is to run the experiment a large number of times and use proper statistical methods to report performance. This becomes especially important when you are going to compare performance between experiments.
I don't know if it's part of the "power management" topic you mentioned, but some CPUs implement frequency scaling. Make sure it's configured to run at its max frequency.
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu0/cpufreq# ls
affected_cpus cpuinfo_cur_freq cpuinfo_transition_latency scaling_available_governors scaling_governor scaling_setspeed
bios_limit cpuinfo_max_freq related_cpus scaling_cur_freq scaling_max_freq stats
cpb cpuinfo_min_freq scaling_available_frequencies scaling_driver scaling_min_freq
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_cur_freq
800000
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor
ondemand
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_governors
conservative ondemand userspace powersave performance
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu0/cpufreq# echo performance > scaling_governor
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor scaling_cur_freq
performance
1600000
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu0/cpufreq# cd ../../cpu1/cpufreq
root@chupa-ThinkPad-Edge:/sys/devices/system/cpu/cpu1/cpufreq# cat scaling_governor scaling_cur_freq
performance
1600000
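To apply the same governor to every CPU at once instead of walking the directories by hand, something like the following works on most systems (a sketch; the boost and no_turbo files only exist with certain drivers, hence the error suppression):
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g"
done
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost 2>/dev/null
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null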

Intel MSR frequency scaling per-thread

I'm extending the Linux kernel in order to control the frequency of some threads: when they are scheduled onto a core (any core!), the core's frequency is changed by writing the proper p-state to the register IA32_PERF_CTL, as suggested in Intel's manual.
But when different threads with different "custom" frequencies are scheduled, it appears that the throughput of all the threads increases, as if all the cores were running at the maximum set frequency.
I did many trials and measurements in different conditions of load and configuration, but the result is the same.
After some trials with CPUFreq (with no running app, I set different frequencies on each core, and finally the measured frequencies, with cpufreq-info -w, were equal), I wonder if the CPU cores can really run at different, independent frequencies, or if there are hardware policies or constraints.
Finally, is there a CPU model which makes this fine-grained frequency scaling feasible?
The CPU I am using is Intel Core i5 750
You cannot control individual core frequencies for active cores. You can, however, control frequencies of all active cores to be the same. The reasons are in the previous answers - all cores are on the same active voltage plane.
Hopefully, the next-gen Haswell processors will make it possible to control each core separately.
I think you're missing a big piece of the picture!
Read up on power and clock domains. All processor cores within a domain run at the same P-state (i.e., the same frequency and voltage). The P-state that all cores in that domain run at is always the P-state of the core requesting the highest P-state in that domain. The MSRs don't reflect this at all, nor do the interfaces that the kernel exposes.
Anandtech has a good article on this:
http://www.anandtech.com/show/2658/2
"This is all very similar to AMD's Phenom, but where the two differ is in how they handle power management. While AMD will allow individual cores to request different clock speeds, Nehalem attempts to run all of its cores at the same frequency; if one core is idle then it's simply power gated and the core is effectively turned off."
I haven't hooked a power-meter up to SB/IB, but my guess is that the behavior is the same.
cpufreq-info will display information about which cores need to be synchronous in their P-states:
[root@navi ~]# cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0 1 <---- THIS
CPUs which need to have their frequency coordinated by software: 0 <--- and THIS
maximum transition latency: 10.0 us.
At least because of that, I'd recommend going through the cpufreq interfaces instead of directly setting registers, as well as making it possible to run on non-Intel CPUs, which might have uncommon requirements.
Also check how to make kernel threads stick to a specific core, to avoid unexpected switching, if you didn't do so already.
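On the user-space side, pinning is straightforward with taskset (part of util-linux; the core number and program name below are placeholders), while kthread_bind() serves a similar purpose for kernel threads:
taskset -c 2 ./my_workload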
I want to thank everyone for the contribution!
Investigating further, I found other details to share with the community.
As suggested, Nehalem places all the cores in a single clock domain, so the maximum frequency set among all the cores is applied to all of them; some tools may show different frequencies on idle cores, but it is sufficient to run any application to make the frequency rise to the maximum.
This, from my tests, also applies to Sandy Bridge, where cores and LLC slices all reside in the same frequency/voltage domain.
I assume that this behavior also happens with Ivy Bridge, as it is only a 'tick' iteration.
Instead, I believe that Haswell places cores and LLC slices in separate, individual domains, thus enabling per-core frequencies. This is also advertised on several pages like
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4
