Increasing the speed of qemu clock - emulation

I'm trying to increase the speed at which qemu runs, so for example one tick of the real CPU will correspond to two ticks of the virtual time of qemu. Is this possible and if so does anyone have any pointers on how to go about doing this?

You can't do this; QEMU is not made for such a thing.
QEMU does not simulate execution timing. It knows nothing about CPU caches and the like, so it could not be accurate even if that were desirable. It simply executes guest code as fast as it can, with no acceleration or deceleration.
In any case, don't look for a solution along these lines.

Maybe you should take a look at the tickpolicy options from libvirt: https://docs.fedoraproject.org/en-US/Fedora_Draft_Documentation/0.1/html/Virtualization_Deployment_and_Administration_Guide/sect-Virtualization-Tips_and_tricks-Libvirt_Managed_Timers.html
With tickpolicy=catchup it is able to 'catch up' with host time, and the effect is that the guest appears accelerated.
libvirt translates this option into the qemu argument -rtc driftfix=slew.
Note that it won't change how fast code is executed.
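For illustration, a minimal sketch of passing the slewed RTC to qemu directly, if you launch it by hand rather than through libvirt (guest.img is a placeholder disk image):
qemu-system-x86_64 -rtc base=localtime,driftfix=slew -hda guest.img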

There are qemu options that can alter the advance of time as seen by the guest.
The arguments I used (with qemu v4.2.0) are:
qemu-system-x86_64 -rtc base=localtime,clock=vm -icount shift=7,align=off,sleep=off ...
With shift=N the virtual CPU executes one instruction every 2^N ns of virtual time, so larger values make the guest clock advance faster relative to the work actually done. Note that icount is not compatible with hardware acceleration. Also note that a shift value that is too high may cause the guest OS to misbehave: when I tried a value of 10, the Linux kernel kept complaining about tasks stalled for more than 120s.
Of possible interest: comments on https://github.com/zephyrproject-rtos/zephyr/issues/14173 and the linked issues/PRs.
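A quick way to gauge the effective speedup, assuming you can log into the guest over SSH or a serial console ('guest' below is a placeholder hostname), is to time a fixed guest-side sleep against the host clock:
time ssh guest 'sleep 60'
If the guest clock runs roughly twice as fast as real time, the 60-second guest sleep should finish in about 30 seconds of host time.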

Related

CPU/Threads usage on M1 Pro (Apple Silicon) using openMP

hope someone knows the answer to this...
I have a code that compiles perfectly well with openMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all the 8 or 10 cores I have.
I am setting the threads variable correctly with export OMP_NUM_THREADS=10, and the code correctly reports that it is supposed to be running with 10 threads (see the Activity Monitor screenshot below):
[Activity Monitor screenshot]
The screenshot shows that the code is compiled for Apple Silicon and uses 10 threads, but not much of the available CPU.
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial on x86 architectures.
Not really an answer, but too long for a comment...
If both LLVM and GCC behave the same, then it's not an OpenMP runtime issue. (And your Activity Monitor output shows that the correct number of threads has been created.) I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface for binding threads to logical CPUs.
On top of that, there is the issue of having some fast and some slow cores, which could mean that statically scheduled loops are inefficient; a cheap experiment for that is sketched below.
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly over-subscribing the logical CPUs...
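If static scheduling across the mixed performance/efficiency cores is the problem, one experiment is to switch the loop schedule at run time via the standard OpenMP environment variables (OMP_SCHEDULE only affects loops compiled with schedule(runtime), and ./your_program is a placeholder):
export OMP_NUM_THREADS=10
export OMP_SCHEDULE="dynamic,64"   # chunk size 64 is an arbitrary starting point
./your_program
If throughput improves noticeably, the imbalance between core types is the likely culprit rather than the compiler or the OpenMP runtime.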

How to prevent compiler & linker from taking over entire machine?

When I compile a large project the compiler slows down the machine tremendously, virtually freezes it out. If I'm lucky a keystroke in vim takes a few seconds to register. If I'm not I may as well go for a walk since nothing can be done on my workstation at all.
Is there any way to prevent compiler and linker from consuming the entire machine? More generally, is it possible to limit a family of processes to a portion of computing resources, such as threads, memory, disk access bandwidth?
Something like limiting the resources available to the process tree that originates from the shell that runs the build would be ideal.
Most Linux distros have a package called cpulimit. You can use it to limit the CPU usage of the GCC toolchain binaries.
It's mentioned as an answer to this question:
Limiting certain processes to CPU % - Linux
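For example (a sketch; a limit of 200 means roughly two cores' worth of CPU, and newer cpulimit releases can launch the command directly):
cpulimit -l 200 make -j8           # throttle the make process itself
sudo cpulimit -e cc1 -l 100 &      # or target a toolchain binary by name with -e
Note that cpulimit throttles one process at a time, so covering every compiler child of a parallel build may need one instance per binary (cc1, cc1plus, ld, ...).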
I'm not an expert on it, but you could try starting the compilation in a dedicated cgroup with limited resources. I don't know exactly how complicated that is to set up, but it shouldn't be too hard; a sketch follows below.
You could also try changing the nice value of the process to give it a lower priority, so that it does not take over the entire machine and is easily preempted by other processes.
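A sketch of both approaches on a systemd-based distro (the resource numbers are arbitrary examples; the --user form may require cgroup v2 with controller delegation, and on cgroup v1 MemoryLimit= replaces MemoryMax=):
# run the build in a transient cgroup scope with CPU and memory caps
systemd-run --user --scope -p CPUQuota=400% -p MemoryMax=8G make -j8
# or simply lower its CPU and I/O priority
nice -n 19 ionice -c3 make -j8
systemd-run places make and every child it spawns in the same cgroup, which matches the "limit the whole process tree originating from the build shell" idea from the question.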

Xen binary rewriting method

In full virtualization, what is the CPL of the guest OS?
In paravirtualization, the CPL of the guest OS is 1 (ring 1).
Is it the same in full virtualization?
I have also heard that some of the x86 privileged instructions are not easily handled, and that a "binary rewriting" method is therefore required...
How does this "binary rewriting" happen?
I understand that in virtualization the CPU is not emulated, so how can the hypervisor change the binary instruction codes before the CPU executes them? Does it predict the next instruction in memory and update the memory contents before the CPU gets there?
If that is the case, I think the hypervisor code performing the binary rewriting would need to intercept the CPU every time before an instruction of the guest OS is executed, which seems absurd.
A specific explanation would be appreciated.
Thank you in advance!
If by full virtualization you mean hardware-supported virtualization, then the CPL of the guest is identical to what it would be running on bare metal.
Xen never rewrites the binary.
That is something VMware does (as far as I understand). To the best of my understanding (I have never seen the VMware source code), the method consists of runtime patching of code that needs to run differently: typically an existing opcode is replaced with something else, either causing a trap to the hypervisor or jumping to a replacement piece of code that "does the right thing". As I understand how this works in VMware, the hypervisor "learns" the code by single-stepping through a block and either applies binary patches or marks the section as "clean" (doesn't need changing). The next time that code is executed, it has already been patched or is known to be clean, so it can run at "full speed".
In Xen, using paravirtualization (ring compression), the code in the OS has been modified to be aware of the virtualized environment, and as such is "trusted" to understand certain things. But the hypervisor will still trap, for example, writes to the page table (otherwise someone could write a malicious kernel module that modifies the page table to map in another guest's memory, or some such).
The HVM method does intercept CERTAIN instructions, but the rest of the code runs at full speed, thanks to the hardware support in modern processors, such as SVM in AMD and VMX in Intel processors. ARM has a similar technology in the latest models of its processors, but I'm not sure what it is called.
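As an aside, a quick shell check for whether a Linux host's CPU advertises these extensions (a non-zero count means VT-x or AMD-V is available):
egrep -c '(vmx|svm)' /proc/cpuinfo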
I'm not sure if I've answered all of your questions; if I've missed something or it's not clear enough, feel free to ask...

How to change kernel Timer frequency

I have a question about changing the kernel timer frequency.
I compiled the kernel using:
make menuconfig (making some changes in the config, under Processor type and features -> Timer frequency, to change the frequency)
1. fakeroot make-kpkg --initrd --append-to-version=-mm kernel-image kernel-headers
2. export CONCURRENCY_LEVEL=3
3. sudo dpkg -i linux-image-3.2.14-mm_3.2.14-mm-10.00.Custom_amd64.deb
4. sudo dpkg -i linux-headers-3.2.14-mm_3.2.14-mm-10.00.Custom_amd64.deb
Then, if I want to change the kernel frequency again, what I do is replace the .config file with my own config file (since I want to do this automatically, without opening the make menuconfig UI) and repeat steps 1-4.
Is there any way to avoid repeating the above 4 steps?
Thanks a lot!
The timer frequency is fixed in Linux (unless you build a tickless kernel with CONFIG_NO_HZ=y, but even then the upper limit is fixed). You cannot change it at runtime or at boot time; you can only change it at compile time.
So the answer is no: you need to rebuild the kernel whenever you want to change it.
The kernel timer frequency (CONFIG_HZ) is not configurable at runtime - you will have to compile a new kernel when you change the setting and you will have to reboot the system with the new kernel to see the effects of any change.
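For reference, you can check which HZ value the currently running kernel was built with (assuming the distro installs the kernel config under /boot, as Debian and Ubuntu do):
grep 'CONFIG_HZ' /boot/config-$(uname -r)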
If you are doing this a lot, though, you should be able to create a little shell script to automate the kernel configure/build/install process. For example, it should not be too hard to automate the procedure (a rough sketch follows below) so that e.g.
./kernel-prep-with-hz 100
would rebuild and install a new kernel, only requiring you to issue the final reboot command.
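A rough sketch of such a script, run from the kernel source tree; it assumes the make-kpkg flow from the question and the scripts/config helper shipped with the kernel source, so treat it as a starting point rather than a tested recipe:
#!/bin/sh
# usage: ./kernel-prep-with-hz <100|250|300|1000>
set -e
HZ=${1:?usage: $0 <100|250|300|1000>}
# select the requested CONFIG_HZ_* choice and clear the others
for f in 100 250 300 1000; do ./scripts/config --disable HZ_$f; done
./scripts/config --enable HZ_$HZ
./scripts/config --set-val HZ $HZ
yes "" | make oldconfig            # fill in any remaining prompts with defaults
export CONCURRENCY_LEVEL=3
fakeroot make-kpkg --initrd --append-to-version=-mm kernel-image kernel-headers
sudo dpkg -i ../linux-image-*-mm_*.deb ../linux-headers-*-mm_*.deb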
Keep in mind though, that the timer frequency may subtly affect various subsystems in unpredictable ways, although things have become a lot better since the tickless timer code was introduced.
Why do you want to do this anyway?
Maybe this will help. As the article says, you can switch between the available frequencies that your system supports. (Check whether CPUfreq is already enabled on your system.)
Example, mine.
#cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2000000 1667000 1333000 1000000
#echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
http://www.ibm.com/developerworks/linux/library/l-cpufreq-2/

Monitoring the instructions of a running program in ubuntu?

I'm a little stuck here.
The idea is that I'd like to get a file of every instruction run by a program during its execution. I'd like to do it with just the executable in hand (no source) and be able to determine what operation is occurring on what address, and when.
For example, I'd like to be able to run it on Google Chrome, Firefox, etc.
I want to use this for a performance prediction system I'm working on. I figure that if I can obtain each instruction in the order it is executed on system 1, I can attempt to simulate/model the run time of an identical program being run on system 2, because I'll be able to predict (although I know not with 100% accuracy) L1/L2 cache misses, L1/L2 cache hits, TLB hits/misses, page faults, time taken on floating-point multiplication operations, etc.
I'd like to try to do this on two different systems:
System 1: Ubuntu 10.10 on Intel Core 2 Duo CPU
System 2: Ubuntu 12.04 on system with 2x AMD Sixteen Core Opteron model 6274
(I can definitely change the OSes as necessary, but would prefer to stay with Ubuntu, if possible.)
Is this possible, and how could I go about doing it? I know you can use debuggers to step through everything, but I don't have the source available.
I think you can use qemu (or even bochs) or valgrind to monitor every executed instruction. qemu and valgrind are x86 binary translation tools, while bochs is an interpreter of x86 code. There is a valgrind tool called cachegrind (plus the kcachegrind GUI) which is ready to emulate caches by instrumenting every memory access and simulating an L1/L2 cache model (sizes may be configured via command-line options). For example:
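A sketch of the two valgrind approaches (./your_program is a placeholder; the lackey trace is extremely slow and produces huge output):
# log every instruction and memory access
valgrind --tool=lackey --trace-mem=yes ./your_program > trace.txt 2>&1
# simulate I1/D1/LL caches with explicit size, associativity, line size
valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 ./your_program
kcachegrind cachegrind.out.*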
To go deeper (into the pipeline), you may want to look at the free PTLsim (http://www.ptlsim.org/).
