Is there a way to disable CPU cache (L1/L2) on a Linux system?

Is there a way to disable CPU cache (L1/L2) on a Linux system? - linux

I am profiling some code on a Linux system (running on Intel Core i7 4500U) to obtain the time of ONLY the execution costs. The application is the demo mpeg2dec from libmpeg2. I am trying to obtain a probability distribution for the mpeg2 execution times. However we want to see the raw execution cost when cache is switched off.
Is there a way I can disable the cpu cache of my system via a Linux command, or via a gcc flag ? or even set the cpu (L1/L2) cache size to 0KB ? or even add some code changed to disable cache ? Of course, without modifying or rebuilding the kernel.

See this 2012 thread, someone posted a tiny kernel module source to disable cache through asm.
http://www.linuxquestions.org/questions/linux-kernel-70/disabling-cpu-caches-936077/

If disabling the cache is really necessary, then so be it.
Otherwise, to know how much time a process takes in terms of user or system "cycles", then I would recommend the getrusage() function.
struct rusage usage;
getrusage(RUSAGE_SELF, &usage);
You can call it before/after your loop/test and subtracted the values to get a good idea of how much time your process took, even if many other processes run in parallel on the same machine. The main problem you'd get is if your process start swapping. In that case your timings will be off.
double user_usage = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1000000.0;
double system_uage = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1000000.0;
This is really precise from my own experience. To increase precision, you could be root when running your test and give it a negative priority (-1 or -2 is enough.) Then it won't be swapped out until you call a function that may require it.
Of course, you still get the effect of the cache... assuming you do not handle very large amount of data with code that goes on and on (opposed to having a loop).

Related

How to Configure and Sample Intel Performance Counters In-Process

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
pctr_start = sample_pctr();
the_benchmark();
pctr_stop = sample_pctr();
results[iteration] = pctr_stop - pctr_start;
}
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (In an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all).
My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().
I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.
Does anyone know a lightweight way to ask the kernel to configure the a performance counters from user-space so that I can then use RDPMC from within my benchmark harness?
Stuff I've looked into/thought about:
Perf tools for Linux. Seems to be geared up for sampling over the whole lifetime of a process, not within a process as specific points (before and after each iteration).
Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
PAPI builds on perf, so probably inherits the above problem.
Write a kernel module -- too much effort, too error prone.
Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.
Any help is most appreciated. Thanks.
EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.

It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.

How to prevent compiler & linker from taking over entire machine?

When I compile a large project the compiler slows down the machine tremendously, virtually freezes it out. If I'm lucky a keystroke in vim takes a few seconds to register. If I'm not I may as well go for a walk since nothing can be done on my workstation at all.
Is there any way to prevent compiler and linker from consuming the entire machine? More generally, is it possible to limit a family of processes to a portion of computing resources, such as threads, memory, disk access bandwidth?
Something like limiting the resources available to the process tree that originates from the shell that runs the build would be ideal.

Most linux distros have a package called cpulimit. You can use this to limit the CPU usage for the gcc tool chain binaries.
It's mention as an answer to this question.
Limiting certain processes to CPU % - Linux

I'm not an expert about it but you could try starting the compilation with a specific cgroup that has limited resources. I don't know exactly how complicated it is to do it but it shouldn't be too hard.
You could also try changing the nice of the process to give it a lower priority so that it does take the entire machine but will be easily bumped by another process.

Lowering linux kernel timer frequency

When I run my Virtual Machine with Gentoo as guest, I have found that there is considerable overhead coming from tick_periodic function. (This is the function which runs on every timer interrupt.) This function updates a global jiffy using write_seqlocks which leads to the overhead.
Here's a grep of HZ and relevant stuff in my kernel config file.
sharan013#sitmac4:~$ cat /boot/config | egrep 'HZ|TIME'
# CONFIG_RCU_FAST_NO_HZ is not set
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_MACHZ_WDT is not set
CONFIG_TIMERFD=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_X86_CYCLONE_TIMER=y
CONFIG_HPET_TIMER=y
Clearly it has set the configuration to 1000, but when I do sysconf(_SC_CLK_TCK), I get 100 as my timer frequency. So what is my system's timer frequency?
What I want to do is to bring the frequency down to 100, even lower if possible. Although it might effect the interactivity and precision of poll/select and schedulers time slice, I am ready to sacrifice these things for lesser timer interrupt as it will speed up VM.
When I tried to find out what has to be done I read in some place that you can do so by changing in the configuration file, else where I read that adding divider=10 to the boot parameter does the job, else where I read that none of it is needed if you can set the CONFIG_HIGH_RES_TIMERS to acheive low-latency timers even without increasing the timer frequency and the same is possible with a tickless system CONFIG_NO_HZ.
I am extermely confused about what is the right approach.
All I want is to bring down the timer interrupt to as low as possible.
Can I know the right way of doing this?

Don't worry! Your confusion is nothing but expected. Linux timer interrupts are very confusing and have had a long and quite exciting history.
CLK_TCK
Linux has no sysconf system call and glibc is just returning the constant value 100. Sorry.
HZ <-- what you probably want
When configuring your kernel you can choose a timer frequency of either 100Hz, 250Hz, 300Hz or 1000Hz. All of these are supported, and although 1000Hz is the default it's not always the best.
People will generally choose a high value when they value latency (a desktop or a webserver) and a low value when they value throughput (HPC).
CONFIG_HIGH_RES_TIMERS
This has nothing to do with timer interrupts, it's just a mechanism that allows you to have higher resolution timers. This basically means that timeouts on calls like select can be more accurate than 1/HZ seconds.
divider
This command line option is a patch provided by Red Hat. You can probably use this (if you're using Red Hat or CentOS), but I'd be careful. It's caused lots of bugs and you should probably just recompile with a different Hz value.
CONFIG_NO_HZ
This really doesn't do much, it's for power saving and it means that the ticks will stop (or at least become less frequent) when nothing is executing. This is probably already enabled on your kernel. It doesn't make any difference when at least one task is runnable.
Frederic Weisbecker actually has a patch pending which generalizes this to cases where only a single task is running, but it's a little way off yet.

linux- How to determine time taken by each function in C Program

I want to check time taken by each function and system calls made by each function in my project .My code is part of user as well as kernel space. So i need time taken in both space. I am interested to know performance in terms of CPU time and Disk IO. Should i use profiler tool ? if yes , which will be more preferable ? or what other option i have ?
Please help,
Thanks

As for kernel level profiling or time taken by some instructions or functions could be measured in clock tics used. To get actual how many clock ticks have been used to do a given task could be measured by kernel function as...
#include <sys/time.h>
unsigned long ini,end;
rdtscl(ini);
...your code....
rdtscl(end);
printk("time lapse in cpu clics: %lu\n",(end-ini));
for more details http://www.xml.com/ldd/chapter/book/ch06.html
and if your code is taking more time then you can also use jiffies effectively.
And for user-space profiling you can use various timing functions whicg give the time in nanosecond resolution or oprofile(http://oprofile.sourceforge.net/about/) & refer tis Timer function to provide time in nano seconds using C++

For kernel-space function tracing and profiling (which includes a call-graph format and the time taken by individual functions), consider using the Ftrace framework.
Specifically for function profiling (within the kernel), enable the CONFIG_FUNCTION_PROFILER kernel config: under Kernel Hacking / Tracing / Kernel function profiler.
It's Help :
CONFIG_FUNCTION_PROFILER:
This option enables the kernel function profiler. A file is created
in debugfs called function_profile_enabled which defaults to zero.
When a 1 is echoed into this file profiling begins, and when a
zero is entered, profiling stops. A "functions" file is created in
the trace_stats directory; this file shows the list of functions that
have been hit and their counters.
Some resources:
Documentation/trace/ftrace.txt
Secrets of the Ftrace function tracer
Using ftrace to Identify the Process Calling a Kernel Function

Well I only develop in userspace so I don't know, how much this will help you with disk IO or Kernelspace profiling, but I profiled a lot with oprofile.
I haven't used it in a while, so I cannot give you a step by step guide, but you may find more informations here:
http://oprofile.sourceforge.net/doc/results.html
Usually this helped me finding my problems.
You may have to play a bit with the opreport output, to get the results you want.

Can I tell Linux not to swap out a particular processes' memory?

Is there a way to tell Linux that it shouldn't swap out a particular processes' memory to disk?
Its a Java app, so ideally I'm hoping for a way to do this from the command line.
I'm aware that you can set the global swappiness to 0, but is this wise?

You can do this via the mlockall(2) system call under Linux; this will work for the whole process, but do read about the argument you need to pass.
Do you really need to pull the whole thing in-core? If it's a java app, you would presumably lock the whole JVM in-core. I don't know of a command-line method for doing this, but you could write a trivial program to call fork, call mlockall, then exec.
You might also look to see if one of the access pattern notifications in madvise(2) meets your needs. Advising the VM subsystem about a better paging strategy might work out better if it's applicable for you.
Note that a long time ago now under SunOS, there was a mechanism similar to madvise called vadvise(2).

If you wish to change the swappiness for a process add it to a cgroup and set the value for that cgroup:
https://unix.stackexchange.com/questions/10214/per-process-swapiness-for-linux#10227

There exist a class of applications in which you never want them to swap. One such class is a database. Databases will use memory as caches and buffers for their disk areas, and it makes absolutely no sense that these are ever put to swap. The particular memory may hold some relevant data that is not needed for a week until one day when a client asks for it. Without the caching/swapping, the database would simply find the relevant record on disk, which would be quite fast; but with swapping, your service might suddenly be taking a long time to respond.
mysqld includes code to use the OS / system call memlock. On Linux, since at least 2.6.9, this system call will work for non-root processes that have the CAP_IPC_LOCK capability[1]. When using memlock(), the process must still work within the bounds of the LimitMEMLOCK limit. [2]. One of the (few) good things about systemd is that you can grant the mysqld process these capabilities, without requiring a special program. If can also set the rlimits as you'd expect with ulimit. Here is an override file for mysqld that does the requisite steps, including a few others that you might need for a process such as a database:
[Service]
# Prevent mysql from swapping
CapabilityBoundingSet=CAP_IPC_LOCK
# Let mysqld lock all memory to core (don't swap)
LimitMEMLOCK=-1
# do not kills this process if low on memory
OOMScoreAdjust=-900
# Use higher io scheduling
IOSchedulingClass=realtime
Type=simple
ExecStart=
ExecStart=/usr/sbin/mysqld --memlock $MYSQLD_OPTS
Note The standard community mysql currently ships with Type=forking and adds --daemonize in the option to the service on the ExecStart line. This is inherently less stable than the above method.
UPDATE I am not 100% happy with this solution. After several days of runtime, I noticed the process still had enormous amounts of swap! Examining /proc/XXXX/smaps, I note the following:
The largest contributor of swap is from a stack segment! 437 MB and fluctuating. This presents obvious performance issues. It also indicates stack-based memory leak.
There are zero Locked pages. This indicates the memlock option in MySQL (or Linux) is broken. In this case, it wouldn't matter much because MySQL can't memlock stack.

You can do that by the mlock family of syscalls. I'm not sure, however, if you can do it for a different process.

As super user you can 'nice' it to the highest priority level -20 and hope that's enough to keep it from being swapped out. It usually is. Positive numbers lower scheduling priority. Normal users cannot nice upwards (negative nos.)

Except in extremely unusual circumstances, asking this question means that You're Doing It Wrong(tm).
Seriously, if Linux wants to swap and you're trying to keep your process in memory then you're putting an unreasonable demand on the OS. If your app is that important then 1) buy more memory, 2) remove other apps/daemons from the machine, or dedicate a machine to your app, and/or 3) invest in a really fast disk subsystem. These steps are reasonable for an important app. If you can't justify them, then you probably can't justify wiring memory and starving other processes either.

Why do you want to do this?
If you are trying to increase performance of this app then you are probably on the wrong track. The OS will swap out a process to increase memory for disk cache - even if there is free RAM, the kernel knows best (actauly the samrt guys that wrote the scheduler know best).
If you have a process that needs responsiveness (it's swapped out while not used and you need it to restart quickly) then nice it to high priority, mlock, or using a real time kernel might help.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string