I'm working on a monitoring agent that works with systems using the Linux kernel. By opening /proc/stat , you can easily tell how much time one one or all CPUs (aggregate) is burning waiting for I/O requests to complete.
I'm trying to find a way to break that number down so that I can differentiate between disk and network i/o. For instance, after converting the unit out of kernel ticks to seconds, you see that all CPU's combined have spent 1024 seconds waiting for I/O. I'd like to know how many of those were burned due to slow network connections.
I'm not sure if this is even possible, any help is appreciated :) I don't see anything in /proc/net or sysfs that would help.
Try to look at SystemTap. It is very similar to Solaris DTrace and you can get to a different levels of detail.
Related
We have a process that takes about 20 hours to run on our Linux box. We would like to make it faster, and as a first step need to identify bottlenecks. What is our best option to do so?
I am thinking of sampling the process's CPU, RAM, and disk usage every N seconds. So unless you have other suggestions, my specific questions would be:
How much should N be?
Which tool can provide accurate readings of these stats, with minimal interference or disruption from the fact that the tool itself is running?
Any other tips, nuggets of wisdom, or references to other helpful documents would be appreciated, since this seems to be one of these tasks where you can make a lot of time-consuming mistakes and false-starts as a newbie.
First of all, what you want and what you are asking is completely different.
Monitoring is required when you are running it for first time i.e. when you don't know its resource utilization (CPU, Memory,Disk etc.).
You can follow below procedure to drill down the bottleneck,
Monitor system resources (Generally 10-20 seconds interval should be fine with Munin, ganglia or other tool).
In this you should be able to identify if your hw is bottleneck or not i.e are you running out of resources Ex. 100% cpu util, very low memory, high io etc.
If this your case then probably think about upgrading hw or tuning the existing.
Then you tune your application/utility. Use profilers/loggers to find out which method, process is taking time. Try to tune that process. If you have single threaded codes then probably use parallelism. If DB etc. are involved try to tune your queries, DB params.
Then again run test with monitoring to drill down more :)
I think a graph representation should be helpful for solving your problem and i advice you Munin.
It's a resource monitoring tool with a web interface. By default it monitors disk IO, memory, cpu, load average, network usage... It's light and easy to install. It's also easy to develop your own plugins and set alert thresholds.
http://munin-monitoring.org/
Here is an example of what you can get from Munin : http://demo.munin-monitoring.org/munin-monitoring.org/demo.munin-monitoring.org/
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference. [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me
I am trying to do some performance measurements using Intels RDTSC, and it is quite
odd the variations I get during different testruns. In most cases my benchmark in C
needs 3000000 Mio cycles, however, exactly the same execution can in some cases take
5000000, almost double as much. I tried to have no intense workloads running in parallel
so that I get good performance estimations. Anyone an idea where this huge timing
variations can come from? I know that interrupts and stuff can happening, but I did not expect
such huge variations in timing!
PS.: I am running it on a Pentium processor with Linux running on it.
Thanks for feedback,
John
I think the answer is in:
I tried to have no intense workloads
running in parallel
You have insufficient control over this in a modern OS.
According to this Wikipedia article, the RDTSC (time stamp counter) cannot be used reliably for benchmarking on multi-core systems. There is no promise that all cores have the same value in the time stamp register.
On Linux, it is better to use the POSIX clock_gettime function.
You have to take the cache of most modern processors into account. Maybe another process evicts your program's cache content in the case where you measured the long running time.
As Henk pointed out, lots of stuff happen in a modern OS that you can't control that much.
I want to know if there is an efficient solution to monitor a process resource consumption (cpu, memory, network bandwidth) in Linux. I want to write a daemon in C++ that does this monitoring for some given PIDs. From what I know, the classic solution is to periodically read the information from /proc, but this doesn't seem the most efficient way (it involves many system calls). For example to monitor the memory usage every second for 50 processes, I have to open, read and close 50 files (that means 150 system calls) every second from /proc. Not to mention the parsing involved when reading these files.
Another problem is the network bandwidth consumption: this cannot be easily computed for each process I want to monitor. The solution adopted by NetHogs involves a pretty high overhead in my opinion: it captures and analyzes every packet using libpcap, then for each packet the local port is determined and searched in /proc to find the corresponding process.
Do you know if there are more efficient alternatives to these methods presented or any libraries that deal with this problems?
/usr/src/linux/Documentation/accounting/taskstats.txt
Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.
Taskstats was designed for the following benefits:
efficiently provide statistics during lifetime of a task and on its exit
unified interface for multiple accounting subsystems
extensibility for use by future accounting patches
This interface lets you monitor CPU, memory, and I/O usage by processes of your choosing. You only need to set up and receive messages on a single socket.
This does not differentiate (for example) disk I/O versus network I/O. If that's important to you, you might go with a LD_PRELOAD interception library that tracks socket operations. Assuming that you can control the startup of the programs you wish to observe and that they won't do trickery behind your back, of course.
I can't think of any light-weight solutions if those still fail, but linux-audit can globally trace syscalls, which seems a fair bit more direct than re-capturing and analyzing your own network traffic.
Take a look at the linux trace toolkit (LTTng). It inserts tracepoints into the kernel and has some post processing to get some of the kind of statistics you're asking about. The trace files get large if you capture everything, but you can keep things manageable if you limit the types of events you arm.
http://lttng.org for more info...
Regarding network bandwidth: This Superuser answer describes processing /proc/net/tcp to collect network bandwidth usage.
I know that iptables can be used to do network accounting (see, e.g., LWN's, Linux.com's, or Shorewall's articles), but I don't see any practical way to do accounting that on a per-process basis.
i just came across this as i was looking for answers to the same thing. just a note - when using /proc filesystem, you do not have to close the file after each read. you can keep the file open and each time you do a read you will get new statistics... so, you shouldn't have the overhead of opening and closing each time you want to get the stats... i have this working in javascript on node.js if you want an example...
Reading /proc is ultimately the only way to monitor CPU and memory usage by individual processes without injecting your code into the kernel. If you look at top(1), you'll see reading lots of files in /proc is exactly what it does every second. All user-mode tools and libraries that retrive this sort of information have to get it from /proc.
As with network bandwidth usage, there are several approaches, which all more or less boil down to capturing all network traffic in and out of the box. You can also consider writing a special netfilter (iptables) module that does exactly the type of counting you need without the overhead of traffic capturing.
I'm curious how many cycles it takes to change contexts in Linux. I'm specifically using an E5405 Xeon (x64), but I'd love to see how it compares to other platforms as well.
There`s a free app called LMBench written by Larry McVoy and friends. It provides a bunch of OS & HW benchmarks
One of the tests is called lat_ctx and it measures contex switch latencies.
Google for lmbench and check for yourself on your own HW. Its the only way to get a number meaningful to you.
Gilad
Run vmstat on your machine while doing something that requires heavy context switching. It doesnt tell you how long the actual switch takes, but it will tell you how many switches you do per second.
Then, you have to estimate how much each timeslice spends performing actual code, compared to switching context. Maybe a 100:1 or something? I dont know. 1000:1?
A machine of mine is now doing roughly 3000 switches per second, ie 0.3 ms per timeslice. With a ratio of 100:1 that would mean the actual switch takes 0.003 ms.
But, with multiple cores, threads yielding execution, etc etc, I'm wouldnt draw any conclusion from such a guess :)
I've written code that's able to echo (small) UDP packets at 200k packets per second.
That suggests that it's possible to context switch in not more than 2.5 microseconds, with the actual context switch probably taking somewhat less than that.