Program to collect CPU usage data for performance analysis [closed] - linux

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Improve this question
I wrote a complex program in Go (which uses many concurrency constructs).
I would like to make an accurate analysis of my program's CPU usage but I don't know where to start.
In particular, I would like to obtain useful information on:
Maximum number of goroutines (i.e. concurrent threads) running at the same time;
How much CPU usage changes if I run multiple instances of the same program at the same time?
Stack utilization (it tells me if I use a lot (or a few) of the stack based on how many nested functions I engage);
I work in Linux Ubuntu 18.04.1 LTS.
What should I do to get this information? Is there any program (maybe specific for Golang) that allows to obtain this information?

Well, that's a complex topic so there cannot be a single definitive answer.
Actually, what you came close to, is called "collection of metrics" or "telemetry" in production settings.
In most cases, the collection of metrics uses sampling approach: that is, a snapshot of the system state of interest is collected and sent somewhere.
"Somewhere" is usually some system which allows to persist the values of the metrics somewhere, and also usually provides various ways to analyze them.
In the simplest case, the ananysis is done by glaring at graphs drawn from the collected data in some sort of the UI. More complex cases include alerting when the value of some metric raises above (or drops below) some threshold.
A single metric is some named value of a particular type.
The metrics can be produced from different sources of data.
The sources typical to a reasonably common setups in which programs written in Go run include:
The Go runtime itself.
This includes things like the number of goroutines and the garbage collection stats—the measurements impossible to get outside of the running
Go program for obvious reasons.
The measurements provided by the OS about the running process which executes your program.
This includes things like the CPU time spent in the user and system contexts of the kernel, the memory consumption—as seen by the OS, the number of opened file (and socket) descriptors, number of CPU context switches, disk I/O stats and so on.
The measurements provided by the containerization software running the container containing the program.
On Linux this is usually provided by the cgroup subsystem which is chiefly responsible for controlling of the resource limits imposed onto a process hierarchy.
How exactly to turn the data from these data sources is an open question (and that's why it's unfit for the SO format).
For instance, to collect Go runtime stats you may use the expvar mechanism—as suggested by #Adrian,—and periodically poll the HTTP endpoint provided by it for data.
Or you may run an internal goroutine in your program which periodically grabs these data from the runtime and pushes it somewhere.
Sampling of the OS-level process-related data, again, can be done in different ways. Say, you may collect them from your very program using something like github.com/shirou/gopsutil/process and push them along with the metrics gathered from the runtime stats, or you may use one or more of myriads of tools to collect this data externally.
(The most low-tech but accessible way of gathering the OS-level performance data I know of is using tools like pidstat, iotop, atop, cpustat etc).
The question of persisting and analyzing the collected data is, again, open.
For a start, it may be as simple as dumping everything into a structured file—may be with a timestamp on each record—and process it with anything you like—for instance, pyplot or RRD-tool or R or…whatever.
Or you may reach for a big gun right from the start and send your metrics to graphite or graphana or zabbix or icinga or whatever currently is at the top of its hip curve.

Related

what is the proper way to test NFS performance [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I'm given a project where the only objective is to monitor a network's NFS performance. I know there's a bunch of open source tools out there, but still I would like to get the basic idea behind in order to better tweak those around. So the network consists of some hundred linux systems and some thousand accounts with NFS mounted home dir's; the script can be pushed out to every station, server is also possible, if that way does any good. Afaik, essentially all the script should do is a few dd's and watch the IO rate over NFS. And my question is just what is the proper way of doing so? Do I add a new account to the system solely to run the scripts?Some general thoughts are greatly appreciated :)
Bonnie
A classical performances evaluation tool tests. The main program tests database type access to a single file (or a set of files if you wish to test more than 1G of storage), and it tests creation, reading, and deleting of small files which can simulate the usage of programs such as Squid, INN, or Maildir format email.
Relevance to NFS:: Performance testing, workload
DBench
Dbench was written to allow independent developers to debug and test SAMBA. It is heavily inspired of the original SAMBA tool : NetBench
As NetBench it allow to:
torture the file system
improve the network load independently of the disk IO
Measure performances
But it does not need as much hardware resources as NetBench to run.
Relevance to NFS::
IOZone
Performance tests suite. POSIX and 64 bits compliant. This tests is the file system test from the L.S.E. Main features
POSIX async I/O, Mmap() file I/O, Normal file I/O
Single stream measurement, Multiple stream measurement, Distributed file server measurements (Cluster)
POSIX pthreads, Multi-process measurement
Selectable measurements with fsync, O_SYNC
Latency plots
Relevance to NFS:: Performance testing. Good for exercising a given mount point under various load conditions.
ful detail can be found here . http://wiki.linux-nfs.org/wiki/index.php/Testing_tools

Linux CPU Usage Tools

Background
I've written a tool to capture CPU usage on a per/thread basis. The output of the tools is a binary file, that I can pump into my parsing utility that I wrote. And the output of the parsing utility is a CSV file that I can import into Excel to chart pretty graphs of process/thread CPU usage.
This CPU usage capture tool is running on an embedded ARM platform running a Linux kernel based on 2.6.35.3. That being said, I was concerned about making the tool light weight. I didn't want it to store directly to a CSV file, in order to minimize the processing time and the file size of the captured data.
Question
The tool works, but I'm wondering if I took the long way around the problem? Is there already a tool out there that does this (or something like it)?
You're probably wondering why I care if I already made a tool that works. Well, it's not as light weight as I'd like. It's taking up about 10% of CPU usage. As a benchmark, top only takes up about 1% (max).
Update
I've decided to continue using my tool for now. At least until a better solution becomes available. I was able to shave off a couple percentage points by using open() instead of fopen() on /proc/stat. I'm also using read() instead of fgets().
IBM has a tool called nmon which does the same(for AIX & Linux): According to IBM's documentation, it takes ~2% CPU. You may want to look at that.
Comparing nmon with your tool could give you a fair idea about your program's performance and how you may improve your csv capture.
This might be a bit of a steep learning curve, but you might want look into SystemTap: http://sourceware.org/systemtap/

Accurate way of measuring overhead in kernel space

I recently implemented a security mechanism for Linux which hooks into system calls. Now I have to measure the overhead caused by it. The project requires to compare the execution time of typical Linux apps with and without the mechanism. By typical Linux apps I assume ex. gzipping 1G file, doing 'find /', grepping files. The main goal is to show the overhead in different types of tasks: CPU bound, I/O bound etc.
The question is: how to organise the test so that they will be reliable? The first important thing is the fact that my mechanism works only in kernel space, so it is relevant to compare systime. I can use 'time' command for it, but is it the most accurate way of measuring systime? Another idea is to run those apps in long loops to minimize error. Then the loops should be inside or outside time command? If they are outside I will get many results - should I choose min, max, median, average?
Thanks for any suggestions.
I think you want more to measure a typical application payload (as Ninjajl's comment suggests, the compilation of the kernel could be a good payload). You probably don't want to measure the overhead inside each syscall itself, or even inside the kernel as a whole.
The reason for this is that most applications spend much more time and resource in user-space than in kernel-land (i.e. syscalls), so overhead inside syscalls is a "second-order" effect and probably don't matter as much. Of course, there are probable exceptions.
Perhaps phoronix test suite might be relevant.
You might be interested by oprofile
See also this answer and this question

Buffering on top of VFS

the problem I try to deal with it is the saving of big number (millions) of small files (up to 50KB), which are sent via network. The saving is done sequential: server receives a file or a dir (via network), it saves it on disk; the next one arrives, it's saved etc.
Apparently, the performance is not acceptable, if multiple server processes coexist (let's say I have 5 processes which all read from network and write at the same time), because the I/O scheduler doesn't manage to merge efficiently the I/O writes.
A suggested solution is to implement some sort of buffering: each server process should have a 50MB cache, in which it should write the current file, do a chdir etc; when the buffer is full, it should be synced to disk, therefore obtaining an I/O burst.
My questions to you:
1) I know that already exists a buffer mechanism (disk buffer); do you think that the above scenario is going to add some improvement? (the design is much more complicated and it's not easy to implement a simple test case)
2) do you have any suggestions, where to look if I would implement this?
Many thanks.
You're going to need to do better than
"apparently the performance is not acceptable".
Specifically
How are you measuring it? Do you have an exact, reproducible figure
What is your target?
In order to do optimisation, you need two things- a method of measuring it (a metric) and a target (so you know when to stop, or how useful or useless a particular technique is).
Without either, you're sunk, I'm afraid.
How important are those writes? I have three suggestions (which can be combined), but one of them is a lot of work, and one of them is less safe...
Journaling
I'm guessing you're seeing some poor performance due in part to the journaling common to most modern Linux filesystems. The journaling causes barriers to be inserted into the IO queue when file metadata is written. You can try turning down the safety (and maybe turning up the speed) with mount(8) options barrier=0 and data=writeback.
But if there is a crash, the journal might not be able to prevent a lengthy fsck(8). And there's a chance the fsck(8) will wind up throwing away your data when fixing the problem. On the one hand, it's not a step to take lightly, on the other hand, back in the old days, we ran our ext2 filesystems in async mode without a journal both ways in the snow and we liked it.
IO Scheduler elevator
Another possibility is to swap the IO elevator; see Documentation/block/switching-sched.txt in the Linux kernel source tree. The short version is that deadline, noop, as, and cfq are available. cfq is the kernel default, and probably what your system is using. You can check:
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
The most important parts from the file:
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq
Changing the scheduler might be worthwhile, but depending upon the barriers inserted into the queue by the journaling requirements, there might not be much reordering possible. Still, it is less likely to lose your data, so it might be the first step.
Application changes
Another possibility is to drastically change your application to bundle files itself, and write fewer, larger, files to disk. I know it sounds strange, but (a) the iD development team packaged their maps, textures, objects, etc., into giant zip files that they would read into the program with a few system calls, unpack, and run with, because they found the performance much better than reading a few hundred or few thousand smaller files. Load times between levels was drastically shorter. (b) The Gnome desktop team and KDE desktop teams took different approaches to loading their icons and resource files: the KDE team packages their many small files into larger packages of some sort, and the Gnome team did not. The Gnome team had longer startup delays and were hoping the kernel could make some efforts to improve their startup time. The kernel team kept suggesting the fewer, larger, files approach.
Creating/renaming a file, syncing it, having lots of files in a directory and having lots of files (with tail waste) are some of the slow operations in your scenario. However to avoid them it would only help to write lesser files (for example writing out archives, concatenated file or similiar). I would actually try a (limited) parallel async or sync approach. The IO scheduler and caches are typically quite good.

Using "top" in Linux as semi-permanent instrumentation

I'm trying to find the best way to use 'top' as semi-permanent instrumentation in the development of a box running embedded Linux. (The instrumentation will be removed from the final-test and production releases.)
My first pass is to simply add this to init.d:
top -b -d 15 >/tmp/toploop.out &
This runs top in "batch" mode every 15 seconds. Let's assume that /tmp has plenty of space…
Questions:
Is 15 seconds a good value to choose for general-purpose monitoring?
Other than disk space, how seriously is this perturbing the state of the system?
What other (perhaps better) tools could be used like this?
Look at collectd. It's a very light weight system monitoring framework coded for performance.
We use sysstat to monitor things like this.
You might find that vmstat and iostat with a delay and no repeat counter is a better option.
I suspect 15 seconds would be more than adequate unless you actually want to watch what's happening in real time, but that doesn't appear to be the case here.
As far as load, on an idling PIII 900Mhz w/ 768MB of RAM running Ubuntu (not sure which version, but not more than a year old) I have top updating every 0.5 seconds and it's about 2% CPU utilization. At 15s updates, I'm seeing 0.1% CPU utilization.
depending upon what exactly you want, you could use the output of uptime, free, and ps to get most, if not all, of top's information.
If you are looking for overall load, uptime is probably sufficient. However, if you want specific information about processes, you are adventurous, and have the /proc filessystem enabled, you may want to write your own tools. The primary benefit in this environment is that you can focus on exactly what you want and minimize the load introduced to the system.
The proc file system gives your application read access to the kernel memory that keeps track of many of the interesting variables. Reading from /proc is one of the lightest ways to get this information. Additionally, you may be able to get more information than provided by top. I've done this in the past to get amount of time spent in user and system by this process. Additionally, you can use this to get information about the number of file descriptors open by the process. You might also use this to get detailed information about how the network system is working.
Much of this information is pre-processed by other applications which can be used if you get the information you need. However, it is rather straight-forward to read the raw information. Do a man proc for more information.
Pity you haven't said what you are monitoring for.
You should decide whether 15 seconds is ok or not. Feel free to drop it way lower if you wish (and have a fast HDD)
No worries unless you are running a soft real-time system
Have a look at tools suggested in other answers. I'll add another sugestion: "iotop", for answering a "who is thrashing the HDD" questions.
At work for system monitoring during stress tests we use a tool called nmon.
What I love about nmon is it has the ability to export to XLS and generate beautiful graphs for you.
It generates statistics for:
Memory Usage
CPU Usage
Network Usage
Disk I/O
Good luck :)

Resources