Simulate an older machine respecting its overall capabilities?

Note: I reckon my question's main goal isn't programming-related; its means, though, are best known to programmers (for a reason ;)). That said, feel free to suggest some other place if you believe it would be seen by better-informed people in this complex field.
We're getting close to deploying refurbished PCs with carefully optimized systems at the local #diy facility I'm volunteering at. It's « the devil is in the details » time.
Question 1: What would be your spec sheet for simulating older machines while keeping reasonably close to their overall capabilities?
Goal
Help test OS and application capabilities against a few real-life end-user scenarios: kids/guests/digitally-illiterate-person as well as retro-gaming/experiment-and-discover scenarios. And then configure/optimize the chosen setups.
Means we ought to guarantee -- and therefore test -- the box-with-optimized-system duo's capability to (
browse today's Web -- HTML5 included,
play videos,
view photos,
edit office documents
), and that's it!
Target
Circa 2000-2005 boxes: mostly SSE (3DNow!) CPUs such as Athlons (K7, 32-bit) and Pentiums (P6), 7200 RPM IDE HDDs, FSB 400 / DDR400 (or below) SDRAM, and "some kind of" AGP graphics.
Means
Spec sheet; at the moment it looks at the target's:
cpu instructions *
cpu frequency *
FSB/bus speed and bandwidth
IO speed and bandwidth *
graphics *
* is possibly addressed in my current set-up (i.e. "well enough").
Current old box emulation set-up
Core i3 @ 3.3 GHz with DDR3-1600 running Arch Linux [1]
virtualization set up
cpulimit -l 60 \
qemu-system-i386 -cpu pentium3,enforce -enable-kvm \
-m 1G -vga std -display gtk -hda hdd.img
a) First, it tries to stay at or below the target CPU frequency;
b) and to stick to its instruction set:
/proc/cpuinfo (and google chrome ;) )
~$ test@guest inxi -f
CPU: Single core Pentium III (Katmai) (-UP-) cache: 2048 KB clocked at 3292.518 MHz
CPU Flags: apic cmov cx8 de fpu fxsr hypervisor mca mce mmx msr mtrr pae pge pse pse36
sep sse tsc vme x2apic
c) to the available RAM,
d) hdd.img sits on a spinning 7200 RPM SATA HDD and uses the qcow2 format, to try to stay closer to the target's IO specs.
e) and to the older 2D rendering capabilities: AFAIK QEMU/KVM's -vga std is such "great" GPU emulation ;) that it makes a good stand-in for that.
Question 2: Are the following limits really impacting?
It's unclear whether QEMU/KVM succeeds in restricting the vCPU to the chosen CPU model; e.g. shouldn't -cpu pentium3 report a 256 or 512 KB cache (depending on the model) rather than 2048 KB?
The IO subsystem (a big point here, especially for finding out the correct Linux kernel virtual memory settings). It'd be really cool to stay at or below the efficiency of an IDE bus and a 20 GB HDD (see the throttling sketch below).
What about the FSB/memory subsystem part?
Alternatively, do you know of tools, tips or tricks to achieve a VM set-up that is better at staying "reasonably equal-or-less" than the target's capabilities?
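For the IO point, QEMU's built-in block throttling may be worth testing on top of (or instead of) relying on qcow2 overhead. This is only a sketch: the throttling.* sub-option names have varied a bit across QEMU versions, and the 40 MB/s / 100 IOPS figures are ballpark guesses for an early-2000s 20 GB IDE drive that you would want to check against period benchmarks.
# cap the guest-visible disk at the QEMU block layer (bytes/s and IOPS)
qemu-system-i386 -cpu pentium3,enforce -enable-kvm -m 1G -vga std -display gtk \
-drive file=hdd.img,format=qcow2,if=ide,throttling.bps-total=40000000,throttling.iops-total=100
That caps the disk the guest sees at roughly IDE-era throughput, no matter how fast the host drive underneath is.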
[1] Limited opening hours at the DIY facility make it necessary to fine-tune at my place. The host has VT-x enabled but no VT-d capability.
EDIT: "How can I simulate a slow machine in a VM?" and "Slow down CPU to simulate slower computers in browser testing" cover the CPU speed part. "How To Simulate Lower CPU Processor Machines For Browser Testing" and "Emulate old PC? [closed]" do too, and are centered on a Windows host.

Related

Programmatically disable CPU core

The way to disable logical CPUs in Linux is known: basically echo 0 > /sys/devices/system/cpu/cpu<number>/online. This way, you are only telling the OS to ignore that given (<number>) CPU.
My question goes further: is it possible not only to ignore it but to turn it off physically, programmatically? I want that CPU to receive no power at all, so that its energy consumption is zero.
I know that it is possible to disable cores from the BIOS (not always), but I want to know whether it is possible to do it from within a program or not.
When you do echo 0 > /sys/devices/system/cpu/cpu<number>/online, what happens next depends on the particular CPU. On ARM embedded systems the kernel will typically disable the clock that drives the particular core PLL so effectively you get what you want.
On Intel X86 systems, you can only disable the interrupts and call the hlt instruction (which Linux Kernel does). This effectively puts CPU to the power-saving state until it is woken up by another CPU at user request. If you have a laptop, you can verify that power draw indeed goes down when you disable the core by reading the power from /sys/class/power_supply/BAT{0,1}/current_now (or uevent for all values such as voltage) or using the "powertop" utility.
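A quick way to check that on a laptop; a sketch only, assuming the usual sysfs paths (the battery may be BAT1, and some drivers expose power_now instead of current_now):
cat /sys/class/power_supply/BAT0/current_now              # baseline draw
echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online     # halt core 3 (example core)
sleep 10
cat /sys/class/power_supply/BAT0/current_now              # draw with the core halted
echo 1 | sudo tee /sys/devices/system/cpu/cpu3/online     # bring it back online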
For example, here's the call chain for disabling the CPU core in Linux Kernel for Intel CPUs.
https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c
arch/x86/kernel/smp.c: smp_ops.play_dead = native_play_dead,
arch/x86/kernel/smpboot.c : native_play_dead() -> play_dead_common() -> local_irq_disable()
Before that, CPUFREQ also sets the CPU to the lowest power consumption level before disabling it though this does not seem to be strictly necessary.
intel_pstate_stop_cpu -> intel_cpufreq_stop_cpu -> intel_pstate_set_min_pstate -> intel_pstate_set_pstate -> wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
On Intel X86 there does not seem to be an official way to disable the actual clocks and voltage regulators. Even if there was, it would be specific to the motherboard and thus your closest bet might be looking into BIOS such as coreboot.
Hmm, I realized I have no idea about Intel except looking into kernel sources.
In Windows 10 it became possible with the new power management settings CPMINCORES and CPMAXCORES.
Powercfg -setacvalueindex scheme_current sub_processor CPMAXCORES 50
Powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 25
Powercfg -setactive scheme_current
Here 50% of the cores are designated for deep sleep (parking), and 25% are forbidden from being parked. This is very good for numeric simulations requiring an increased clock rate (about a 15% boost on Intel).
You cannot choose which cores to park, but on Intel's Comet Lake and newer the Windows 10 kernel checks for "preferred" (more power-efficient) cores and starts parking those that are not preferred.
The parking is not strict, so at high load the kernel can still use these cores, with very low load.
Just in case you are looking for alternatives:
You can get closest to this by using cpufreq governors. Make Linux exclude the CPU, and power-saving mode will ensure that the core runs at its minimal frequency.
You can also isolate cpus from the scheduler at kernel boot time.
Add isolcpus=0,1,2 to the kernel boot parameters.
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
I know this is an old question, but one way to disable the CPU is via the grub config.
Add to the end of GRUB_CMDLINE_LINUX in /etc/default/grub (assuming you are using a standard Linux distribution; if you are using an appliance, the location of the grub config may be different), e.g.:
GRUB_CMDLINE_LINUX=".......current config here maxcpus=2"
Then remake your grub config by running
grub2-mkconfig -o /boot/grub2/grub.cfg (or grub-mkconfig -o /boot/grub2/grub.cfg, depending on your installation). Some distros may require nr_cpus instead of maxcpus.
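Put together, and assuming a standard distro layout (file locations and the grub-mkconfig/grub2-mkconfig name differ between distributions), the sequence is roughly:
sudoedit /etc/default/grub                    # append maxcpus=2 (or nr_cpus=2) to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # or: sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot
nproc                                         # should now report 2 online CPUs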
Just some extra info:
If you are running a server with multiple physical CPUs, then disabling one CPU will most likely also disable the memory bank linked to that CPU, so it may have an effect on the server's performance.
Disabling the CPU this way will not stop a type 1 hypervisor from accessing it (this is based on the Xen hypervisor; I believe it applies to VMware as well, and confirmation from anyone would be great). Depending on the VirtualBox setup, it may restrict the number of CPUs you can allocate to VMs unless you are running paravirtualization.
I am unsure, however, whether you will see any power savings. Most servers, and even desktops these days, already control power well, putting to sleep any device not needed for the current load. My concern would be that by reducing the number of CPUs (cores) you just move the load to the remaining ones; with processor time having to be scheduled, instructions potentially queuing up, and fewer cores available for interrupts (e.g. network traffic), it may have a negative effect on power consumption.
AFAIK there is no system call or library function available for this as of now, nor even an ioctl implementation. So, apart from creating a new module / system call, there are two ways I can think of:
using inline assembly, asm(<assembly code>);, where the assembly code is architecture-specific code that modifies the CPU state;
calling an external command from C via system() (man 3 system), assuming you just want to do it through C.

How to measure total boottime for linux kernel on intel rangeley board

I am working on an Intel Rangeley board. I want to measure the total time taken to boot the Linux kernel. Is there a possible and proven way to achieve this on an Intel board?
Try using rdtsc. According to the Intel insn ref manual:
The processor monotonically increments the time-stamp counter MSR
every clock cycle and resets it to 0 whenever the processor is reset.
See “Time Stamp Counter” in Chapter 17 of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3B, for specific
details of the time stamp counter behavior.
(see the x86 tag wiki for links to manuals)
Normally the TSC is only used for relative measurements between two points in time, or as a timesource. The absolute value is apparently meaningful. It ticks at the CPU's rated clock speed, regardless of the power-saving clock speed it's actually running at.
You might need to make sure you read the TSC from the boot CPU on a multicore system. The other cores might not have started their TSCs until Linux sent them an inter-processor interrupt to start them up. Linux might sync their TSCs to the boot CPU's TSC, since gettimeofday() does use the TSC. IDK, I'm just writing down stuff I'd be sure to check on if I wanted to do this myself.
You may need to take precautions to avoid having the kernel modify the TSC when using it as a timesource. Probably via a boot option that forces Linux to use a different timesource.
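One unproven way to turn that into an actual number from early userspace: a sketch assuming the msr-tools package (rdmsr), that MSR 0x10 is IA32_TIME_STAMP_COUNTER, and that the kernel logs its detected TSC frequency. Run it as early as possible (e.g. from an initramfs hook or a very early init script), because whatever runs before it is included in the result.
#!/bin/sh
modprobe msr                                   # expose /dev/cpu/*/msr
ticks=$(rdmsr -u -p 0 0x10)                    # TSC of CPU 0, unsigned decimal: cycles since reset
mhz=$(dmesg | sed -n 's/.*tsc: Detected \([0-9.]*\) MHz.*/\1/p' | head -n1)
echo "$ticks $mhz" | awk '{ printf "~%.2f s since reset\n", $1 / ($2 * 1e6) }'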

Experienced strange rdtsc behavior comparing physical hardware and kvm-based VMs

I have the following problem. I run several stress tests on a Linux machine
$ uname -a
Linux debian 3.14-2-686-pae #1 SMP Debian 3.14.15-2 (2014-08-09) i686 GNU/Linux
It's an Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, 8 GB RAM, 300 GB HDD.
These tests are not I/O intensive, I mostly compute double arithmetic in the following way:
start = rdtsc();
do_arithmetic();
stop = rdtsc();
diff = stop - start;
I repeat these tests many times, running my benchmarking application on a physical machine or on a KVM based VM:
qemu-system-i386 disk.img -m 2000 -device virtio-net-pci,netdev=net1,mac=52:54:00:12:34:03 -netdev type=tap,id=net1,ifname=tap0,script=no,downscript=no -cpu host,+vmx -enable-kvm -nographic
I collect data statistics (i.e., the diffs) over many trials. For the physical machine (not loaded), the distribution of processing delays looks like a very narrow lognormal.
When I repeat the experiment on the virtual machine (physical and virtual machines both unloaded), the lognormal distribution is still there (a little wider in shape); however, I collect a few points with completion times much shorter (about two times shorter) than the absolute minimum gathered on the physical machine! (Notice that the completion-time distribution on the physical machine is very narrow and lies close to the minimum value.) There are also some points with completion times much longer than the average completion time on the hardware machine.
I guess that my rdtsc benchmarking method is not very accurate in the VM environment. Can you suggest a method to improve my benchmarking system so that it provides reliable (comparable) statistics between the physical and the KVM-based virtual environment? At least something that won't show the VM being 2x faster than a hardware PC in a small number of cases.
Thanks in advance for any suggestions or comments on this subject.
Best regards
Maybe you can try clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts); see man clock_gettime for more information.
It seems it is not a problem with rdtsc at all. I am using my i5 Intel Core with a fixed, limited frequency through the acpi_cpufreq driver and the userspace governor. Even though the CPU speed is fixed at, say, 2.4 GHz (out of 3.3 GHz), some calculations are performed at the maximum speed of 3.3 GHz. Roughly speaking, I also encountered a very small number of such cases on the physical machine, ~1 per 10000. On KVM this behaviour occurs more often, say in a few percent of cases. I will investigate this problem further.
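For reference, pinning the clock the way this answer describes usually looks roughly like the following (a sketch, run as root, assuming the acpi_cpufreq driver with the userspace governor available; turbo/boost still has to be disabled separately, e.g. via /sys/devices/system/cpu/cpufreq/boost where available, or in the BIOS):
for c in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
    echo userspace > "$c/scaling_governor"    # hand frequency control to userspace
    echo 2400000   > "$c/scaling_setspeed"    # request a fixed 2.4 GHz (value in kHz)
done
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # verify the current frequency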

Evaluating SMI (System Management Interrupt) latency on Linux-CentOS/Intel machine

I am interested in evaluating the behavior (latency, frequency) of SMI handling on a Linux machine running CentOS and used for a (very) soft real-time application.
What tools are recommended (hwlatdetect for CentOS?), and what is the best course of action to go about this?
If no good tools are available for CentOS, am I correct to assume that installing a different OS on the same machine should yield the same results, since the underlying hardware/BIOS are the same?
Is there any source of ballpark figures for these parameters?
The machines are X86_64 architecture, running CentOS 6.4 (kernel 2.6.32-358.23.2.el2.centos.plus.x86_64.)
SMIs can certainly happen during normal operation. My home desktop has a chipset-driven SMI every second and a half, enabled in the chipset. I've also seen servers that get them twice a second due to a BIOS-driven CPU frequency scaling scheme. However, some systems can go long periods of time without an SMI occurring, so it really depends.
Question #1: hwlatdetect is one option to detect the latency of SMIs occurring on your system. BIOSBITS is another option; it is a bootable CD that can identify whether SMIs are occurring. You can also write your own test by creating a kernel module that spins in a loop and takes timestamps (using RDTSC). If you see a long gap between two timestamp readings, you could consult CPU MSR 0x34 to see whether the SMI counter incremented, which would indicate that an SMI happened.
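For a quick look at that counter without writing a module, the msr-tools package can read it from userspace (MSR 0x34 is MSR_SMI_COUNT on reasonably recent Intel CPUs); a sketch:
modprobe msr            # expose /dev/cpu/*/msr
rdmsr -p 0 0x34         # SMI count since reset on CPU 0 (printed in hex)
sleep 60
rdmsr -p 0 0x34         # read again: the difference is the number of SMIs in that minute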
If you want to generate an SMI, you can make a kernel module that does an OUT CPU instruction to port 0xb2, e.g. write a value of 0 to this port. (You can also time this SMI by gathering a timestamp just before and just after the write to port 0xB2).
Question #2: SMIs operate at a layer below the OS, so which OS you choose shouldn't have any impact.
Question #3: BIOSBITS recommends that SMI latencies be kept under 150 microseconds.
An SMI will put your system into SMM (System Management Mode), which postpones normal execution of the kernel for the duration of the SMI handling. In other words, SMM is neither real mode nor protected mode as in normal kernel operation; instead it executes special code kept in SMRAM (stored in the BIOS firmware). To measure its latency you can try to trigger an SMI (it can be software-generated) and catch the total time spent in SMM mode. To accomplish this you can write a Linux kernel module, because you'll need some special privileges to issue an SMI (I think).
For real-time systems, I think it's best if you can avoid this sort of interrupt (such as SMIs) where possible.
You can check whether System Management Interrupts (SMI) are serviced or not with turbostat. For example:
# turbostat sleep 120
[check column SMI for value greater than 0]
Of course, from that you can also compute an SMI frequency.
Knowing that SMIs are actually happening at a certain rate is important information. But you also want to know how much time is spent in System Management Mode (SMM) for those interrupts. For example, if an SMI interruption is only very short then it might be irrelevant for your realtime application. On the other hand, if you have hardware with long SMI interruptions you probably want to talk to the vendor, configure the firmware differently (if possible) and/or switch to other hardware with a less intrusive SMM.
The perf tool has a mode that measures how many cycles are spent in SMM during SMIs (using the information provided by certain CPU counters). Example:
# perf stat -a -A --smi-cost -- sleep 120
Performance counter stats for 'system wide':
SMI cycles% SMI#
CPU0 0.0% 0
CPU1 0.0% 0
CPU2 0.0% 0
CPU3 0.0% 0
120.002927948 seconds time elapsed
You can also look at the raw values with:
# perf stat -a -A --smi-cost --metric-only -- sleep 120
From that you can compute how much time an SMI takes on average on your machine. (divide cycles difference by the number of cycles per time unit).
It certainly makes sense to cross check the CPU counter based results with empiric ones.
You can use the Linux Hardware Latency Detector that is integrated in the Linux Kernel. Usage example:
# echo hwlat > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /sys/kernel/debug/tracing/tracing_thresh
# watch -d -n 5 cat /sys/kernel/debug/tracing/tracing_max_latency
# echo "Don't forget to disable it again"
# echo nop > /sys/kernel/debug/tracing/current_tracer
Those tools are available on CentOS/RHEL 7 and should be available on other distributions, as well.
Regarding ballpark figures: recently I came across an HP 2011-ish ProLiant Gen8 Xeon server that fires 504 SMIs per minute. perf computes a rate of 0.1% in SMM, and based on the counter values the average time spent in an SMI is as high as several microseconds, but the Linux hwlat detector doesn't detect such high interruptions on that system.
That SMI rate matches what HP documents in its Configuring and tuning
HPE ProLiant Servers for low-latency applications guide (October, 2017):
Disabling System Management Interrupts to the processor provides one of
the greatest benefits to low-latency environments.
Disabling the Processor Power and Utilization Monitoring SMI has the greatest
effect because it generates a processor interrupt eight times a second in G6
and later servers.
(emphasis mine; and that guide also documents other SMI sources)
On a Supermicro board with Intel Atom C3758 and an Intel NUC (i5-4250U) system of mine there are exactly zero SMIs counted.
On an Intel i7-6600U based Dell laptop, the system reports 8 SMIs per minute, but the aperf counter is lower than the (unhalted) cycles counter which isn't supposed to happen.
Actually, SMI is used for more than just keyboard emulation. Servers use SMIs to report and correct ECC memory errors, ACPI uses SMIs to communicate with the BIOS and perform some tasks, even enabling and disabling ACPI is done through SMIs, and the BIOS often intercepts power state changes through SMIs... there's more; these are just a few examples.
According to the Wikipedia page on System Management Mode, SMI is not used during normal operation, except perhaps to emulate a PS/2 keyboard with a physical USB keyboard.
And most Linux systems are able to drive a genuine USB keyboard without that emulation. You could configure your BIOS to disable the emulation.

CPU and memory usage of jemalloc as compared to glibc malloc

I recently learnt about jemalloc; it is the memory allocator used by Firefox. I have tried integrating jemalloc into my system by overriding the new and delete operators and calling the jemalloc equivalents of malloc and free, i.e. je_malloc and je_free. I have written a test application that does 100 million allocations. Running the application with both glibc malloc and jemalloc, jemalloc takes less time for these allocations, but its CPU utilization is quite high and its memory footprint is also larger compared to malloc. After reading this document on jemalloc analysis,
it seemed that jemalloc might have a larger footprint than malloc, as it employs techniques that optimize for speed rather than memory. However, I haven't found any pointers regarding CPU usage with jemalloc. I would like to state that I am working on a multiprocessor machine, the details of which are given below.
processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
stepping : 2
cpu MHz : 3325.117
cache size : 12288 KB
physical id : 1
siblings : 12
core id : 10
cpu cores : 6
apicid : 53
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 6649.91
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
I am using top -c -b -d 1.10 -p 24670 | awk -v time=$TIME '{print time,",",$9}' to keep track of the CPU usage.
Has anyone had similar experiences while integrating jemalloc?
Thanks!
One wise guy said at CppCon that you never have to guess about performance; you have to measure it instead.
I tried to use jemalloc with a multithreaded Linux application. It was a custom application-level protocol server (over TCP/IP). This C++ application used some Java code via JNI (roughly 5% of the time it ran Java, and 95% of the time C++ code). I ran 2 application instances in production mode. Each one had 150 threads.
After 72 hours of running, the glibc instance used 900 MB of memory and the jemalloc one used 2.2 GB. I didn't see a significant CPU usage difference. Actual performance (average client request serving time) was nearly the same for both instances.
So, in my test glibc was much better than jemalloc. Of course, this is specific to my application.
Conclusion: if you have reason to think that your application's memory management is not effective because of fragmentation, you have to make a test similar to the one I described. It is the only reliable source of information for your specific needs. If jemalloc were always better than glibc, glibc would make jemalloc its official allocator. If glibc were always better, jemalloc would cease to exist. When competitors coexist for a long time, it means that each one has its own usage niche.
Aerospike implemented jemalloc in our NoSQL database and publicly released the implementation about a year ago with v3.3.x. Just today Psi Mankoski published an article on High Scalability about why and how we did it, and the performance improvement it gave compared to glibc malloc.
We actually saw a decrease in RAM utilization because of the way we were able to use jemalloc's debugging capability to minimize RAM fragmentation. In the production environment, server % free memory was often a "spiky graph" and had often spiked as high as 54% prior to the implementation of jemalloc. After the implementation, you can see the decrease in RAM utilization over the 4-month analysis period: % free memory began to "flatline" and became far more predictable, hovering between ~22-40% depending on the server node.
As Preet says, there was a lot less fragmentation over time, which means less RAM utilization. Psi's article gives the "proof in the pudding" behind such a statement.
This question might not really belong here, since for real-world solutions it should be irrelevant what other people found on their different hardware/environments/usage scenarios. You should test on the target system and see what suits you.
As for the higher memory footprint: one of the most classical performance optimizations in computer science is the time-memory tradeoff, i.e. caching certain results for instant lookup later on and preventing frequent recalculation. Also, since jemalloc is presumably a lot more complex, there is probably a lot more internal bookkeeping. This kind of tradeoff should be more or less expected, especially when picking between variants of such low-level and widely used core modules. You have to match the performance characteristics to your usage characteristics; usually, there is no silver bullet.
You might also want to look at Google's TCMalloc, which is quite close, although I believe jemalloc is slightly more performant in general, as well as creating less heap fragmentation over time.
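One low-friction way to run such a test on your own workload is to preload each allocator into the same, unmodified binary and compare the numbers. A sketch: ./alloc_test stands in for your own test program, the libjemalloc path differs per distro, and GNU time (/usr/bin/time, not the shell built-in) is assumed:
/usr/bin/time -v ./alloc_test                                          # glibc malloc baseline
LD_PRELOAD=/usr/lib/libjemalloc.so.2 /usr/bin/time -v ./alloc_test     # same binary with jemalloc preloaded
# Compare "Maximum resident set size", "User time" and "System time" between the two runs.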
I am developing a simple NoSQL database.
(https://github.com/nmmmnu/HM4)
jemalloc vs standard malloc
When I use jemalloc, performance decreases, but memory "fragmentation" decreases as well. jemalloc also seems to use less memory at the peak, but the difference is only 5-6%.
What I mean by memory fragmentation is the following:
First I allocate lots of key value pairs (5-7 GB of memory)
Then I look at the memory usage.
Then I deallocate all pairs and any other memory my executable uses. The order of deallocation is the same as the order of allocation.
Finally I check memory usage again.
With standard malloc, usage stays almost at the peak level (I specifically checked for mmap'ed memory, and there is none).
With jemalloc usage is minimal.
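For reference, those two readings can be taken from outside the process, e.g. (a sketch; it assumes the test program pauses at the peak and again after the deallocation so there is time to look, and <pid> is its process id):
grep -E 'VmHWM|VmRSS' /proc/<pid>/status
# VmHWM = peak resident set size, VmRSS = current one.
# A VmRSS that stays near VmHWM after everything has been freed is the "fragmentation" described above.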
bonus information - tcmalloc
The last time I checked tcmalloc, it was really very fast - probably a 10% improvement over standard malloc.
At the peak it consumed less memory than standard malloc, but more than jemalloc.
I do not remember the exact memory fragmentation, but it was far from jemalloc's result.
This paper investigates the performance of different memory allocators. Some conclusions shared here:
Figure 1 shows the effects of different allocation strategies on TPC-DS with scale factor 100. We measure memory consumption and execution time with our multi-threaded database system on a 4-socket Intel Xeon server. In this experiment, our DBMS executes the query set sequentially using all available cores. Even this relatively simple workload already results in significant performance and memory usage differences. Our database linked with jemalloc can reduce the execution time to 1/2 in comparison to linking it with the
standard malloc of glibc 2.23.
