Profiling very long running tasks - linux

How can you profile a very long running script that is spawning lots of other processes?
We have a job that takes a long time to run - 11 or more hours, sometimes more than 17 - so it runs on an Amazon EC2 instance.
(It is doing cufflinks DNA alignment and stuff.)
The job is executing lots of processes, scripts and utilities and such.
How can we profile it and determine which component parts of the job take the longest time?
A simple CPU utilisation per process per second would probably suffice. How can we obtain it?

There are many solutions to your question:
munin is a great monitoring tool that can scan almost everything in your system and make nice graphs of it. It is very easy to install and use.
atop could be a simple solution; it can sample CPU, memory and disk regularly, and you can store all that information in raw files (the -w option), then analyze those files to find the bottleneck.
sar can record just about everything on your system, but it is a little harder to interpret (you'll have to make the graphs yourself, with RRDtool for example).
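If all you really need is "CPU utilisation per process per second", a minimal sketch along these lines may be enough; pidstat ships in the same sysstat package as sar, and the plain-ps loop is an alternative if it is not installed (the log file name is just an example):
pidstat -u 1 > /tmp/job-cpu.log &
# or, with plain ps, a top-20-by-CPU snapshot every second:
while sleep 1; do
    date +%s >> /tmp/job-cpu.log
    ps -eo pid,ppid,%cpu,%mem,etime,comm --sort=-%cpu | head -20 >> /tmp/job-cpu.log
done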

Related

Is it unhealthy for an SSD if I write a 'vital signal' to check that a Python program is running?

A Python program that I'm building used to die for no apparent reason. I couldn't figure out why, so my workaround was to add a few lines that write the time to a 'vitality' file every time a certain line in the program is executed, which happens about every 0.1 seconds.
A separate script reads the 'vitality' file every 1 second, and when the vital sign doesn't update for, say, 10 seconds, the script kills the program and restarts it.
So far this workaround has been working great on the original problem, but now I'm rather concerned if the SSD will degrade by this or not.
Does writing 10 digits of unixtimestamp every 0.1s to a file have negligible effect on SSD health, or would it degrade the SSD fast?
Doing that will degrade the SSD and destroy it over time.
In my last job, the SSD health tool (smartctl) indicated that the 15 SSDs in our cluster product were wearing rapidly and had only months of life left. The team found that a third-party software package (etcd) was syncing small amounts of data to a filesystem on the SSD once per second, and each sync wrote at least an entire 16K block. Luckily, the problem was found early enough that we could patch it in a software update before suffering too many customer returns.
Write the 'vitality' file somewhere else. It could be on a tmpfs like /var/run/user/. Or use a different vitality mechanism; something like supervisord can manage your task, run health checks and restart it on failure.
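A minimal sketch of the tmpfs approach, assuming the program writes its heartbeat to a file under /var/run/user/ (myjob.py, the file name and the 10-second threshold below are all hypothetical):
HEARTBEAT=/var/run/user/$(id -u)/myjob.heartbeat
# in the Python program: write the unix time to $HEARTBEAT every 0.1 s
# watchdog loop: restart the program when the heartbeat goes stale
while sleep 1; do
    last=$(stat -c %Y "$HEARTBEAT" 2>/dev/null || echo 0)    # mtime of the heartbeat
    if [ $(( $(date +%s) - last )) -gt 10 ]; then
        pkill -f myjob.py               # kill the stalled program, if it is still there
        python myjob.py &               # restart it
        touch "$HEARTBEAT"              # reset the timer while it starts up
    fi
done
Because the file lives on a tmpfs, the 0.1-second writes stay in RAM and never touch the SSD.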

Limiting the memory usage of a program in Linux

I'm new to Linux and Terminal (or whatever kind of command prompt it uses), and I want to control the amount of RAM a process can use. I already looked for hours to find an easy-to-use guide. I have a few requirements for limiting it:
Multiple instances of the program will be running, but I only want to limit some of the instances.
I do not want the process to crash once it exceeds the limit. I want it to use HDD page swap.
The program will run under WINE, and is a .exe.
So can somebody please help with the command to limit the RAM usage on a process in Linux?
The fact that you’re using Wine makes no difference in this particular context, which leaves requirements 1 and 2. Requirement 2 –
I do not want the process to crash once it exceeds the limit. I want it to use HDD page swap.
– is known as limiting the resident set size or rss of the process, and it’s actually rather nontrivial to do on Linux, as is demonstrated by a question asked in 2010. You’ll need to set up Linux control groups (cgroups). Fortunately, Justin L.’s answer gives a brief rundown on how to do so. Note that
instead of jlebar, you should use your own Unix user name, and
instead of your/program, you should use wine /path/to/Windows/program.exe.
Using cgroups will also satisfy your other requirements – you can start as many instances of the program as you wish, but only those which you start with cgexec -g memory:limited will be limited.
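A rough sketch of what that setup might look like with the cgroup v1 tools from libcgroup (the group name "limited", the 2G limit and the program path are assumptions, and the memory controller is assumed to be mounted at /sys/fs/cgroup/memory):
sudo cgcreate -g memory:limited                      # create the control group
sudo cgset -r memory.limit_in_bytes=2G limited       # cap its RAM at 2 GiB; excess pages get swapped
cgexec -g memory:limited wine /path/to/Windows/program.exe
Depending on your distribution you may need elevated privileges for cgexec as well, or use cgcreate's -a/-t options to hand ownership of the group to your own user.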

How to monitor a process in Linux CPU, Memory and time

How can I benchmark a process in Linux? I need something like "top" and "time" put together for a particular process name (it is a multi-process program, so many PIDs will be involved).
Moreover, I would like a plot of memory and CPU usage over time for these processes, not just the final numbers.
Any ideas?
I typically throw together a simple script for this type of work.
Take a look at the kernel documentation for the proc filesystem (Google 'linux proc.txt').
The first line of /proc/stat (Section 1.8 in proc.txt) will give you cumulative cpu usage stats (i.e. user, nice, system, idle, ...). For each process, the file /proc/$PID/stat (Table 1-4 in proc.txt) will provide you with both process-specific cpu usage stats and memory usage stats (see rss).
If you google a bit you'll find plenty of detailed info on these files, and pointers to libraries / apps / code snippets that can help you obtain / derive the values you need. With that in mind, I'll focus on the high-level strategy.
For CPU stats, use your favorite scripting language to create an executable that takes a set of process ids for monitoring. At a fixed interval (ex: 1 second) poll / calculate the cumulative totals for each process and the system as a whole. During each poll interval, write all results on a single line to stdout.
For memory stats, write a similar script, but simply log the per-process memory usage. Memory is a bit easier as we directly obtain the instantaneous values.
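A minimal sketch of what such a cpu logger might look like (field numbers follow Table 1-4 in proc.txt: $14 = utime, $15 = stime; a logmem variant would print $24, the rss in pages, instead):
#!/bin/sh
# logcpu - append one line per second: timestamp, system jiffies, then utime+stime per pid
while sleep 1; do
    line="$(date +%s)"
    # whole-system jiffies from the first line of /proc/stat (user+nice+system+idle)
    line="$line $(awk '/^cpu /{print $2+$3+$4+$5}' /proc/stat)"
    for pid in "$@"; do
        # cumulative utime+stime for this pid, or 0 once it has exited
        line="$line $(awk '{print $14+$15}' "/proc/$pid/stat" 2>/dev/null || echo 0)"
    done
    echo "$line"
done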
Run these scripts for the duration of your test, passing the set of process ids that you'd like to monitor and redirecting their output to log files.
./logcpu $(pidof foo) $(pidof bar) > cpustats
./logmem $(pidof foo) $(pidof bar) > memstats
Import the contents of these files into a spreadsheet (for certain applications this is as easy as copy / paste). For CPU, you are after instantaneous values but have cumulative values, so you'll need to do some minor spreadsheet work to derive these values (it's just the delta 't(x + 1) - t(x)'). Of course you could have your cpu logger write the delta, but you'll be spending a bit more time up front on the script.
Finally, use your spreadsheet to generate a nice plot.
Following are some tools for monitoring a Linux system:
System commands like top, free -m, vmstat, iostat, iotop, sar, netstat, etc. Nothing comes near these Linux utilities when you are debugging a problem. These commands give you a clear picture of what is going on inside your server.
SeaLion: An agent executes all the commands mentioned in #1 (also user-defined ones), and the output of these commands can be accessed in a nice web interface. This tool comes in handy when you are debugging across hundreds of servers, as installation is simple. And it's free.
Nagios: The mother of all monitoring/alerting tools. It is very customizable but also quite difficult to set up for beginners. There is a set of tools called Nagios plugins that covers pretty much all the important Linux metrics.
Munin
Server Density: A cloud-based paid service that collects important Linux metrics and gives users the ability to write their own plugins.
New Relic: Another well-known hosted monitoring service.
Zabbix

High %wa CPU load when running PHP as CLI

Sorry for the vague question, but I've just written some PHP code that executes itself as CLI, and I'm pretty sure it's misbehaving. When I run "top" on the command line it shows very little resources given to any individual process, but 40-98% going to iowait time (%wa). I usually have about .7% distributed between %us and %sy, with the remaining resources going to idle processes (somewhere between 20-50% usually).
This server is executing MySQL queries in, easily, 300x the time it takes other servers to run the same query, and it even takes what seems like forever to log on via SSH... so despite there being some idle CPU time left over, it seems clear that something very bad is happening. Whatever scripts are running are updating my MySQL database, but they seem to be getting exponentially slower than when they started.
I need some ideas to serve as launch points for me to diagnose what's going on.
Some things that I would like to know are:
How I can confirm how many scripts are actually running
Is there any way to confirm that these scripts are actually shutting down when they are through, and not just "hanging around" taking up CPU time and memory?
What kind of bottlenecks should I be checking to make sure I don't create too many instances of this script so this doesn't happen again.
I realize this is probably a huge question, but I'm more than willing to follow any links provided and read up on this... I just need to know where to start looking.
High iowait means that your disk bandwidth is saturated. This might be just because you're flooding your MySQL server with too many queries, and it's maxing out the disk trying to load the data to execute them.
Alternatively, you might be running low on physical memory, causing large amounts of disk IO for swapping.
To start diagnosing, run vmstat 60 for 5 minutes and check the output - the si and so columns show swap-in and swap-out, and the bi and bo columns show other IO. (Edit your question and paste the output in for more assistance.)
High iowait may mean you have a slow/defective disk. Try checking it out with a S.M.A.R.T. disk monitor.
http://www.linuxjournal.com/magazine/monitoring-hard-disks-smart
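A quick S.M.A.R.T. check might look like this (assuming the disk is /dev/sda):
sudo smartctl -H /dev/sda          # overall health verdict
sudo smartctl -a /dev/sda          # full attribute list - watch reallocated/pending sector counts
sudo smartctl -t short /dev/sda    # start a short self-test; read the result later with -l selftest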
ps auxww | grep SCRIPTNAME
The same command will also show whether instances are still hanging around after they should have exited.
Why are you running more than one instance of your script to begin with?
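A couple of one-liners in that spirit (myscript.php is a hypothetical name; the bracket trick keeps grep from matching itself):
pgrep -fc myscript.php                                   # count running instances
ps -eo pid,etime,%cpu,%mem,args | grep '[m]yscript.php'  # how long each one has been alive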

Using "top" in Linux as semi-permanent instrumentation

I'm trying to find the best way to use 'top' as semi-permanent instrumentation in the development of a box running embedded Linux. (The instrumentation will be removed from the final-test and production releases.)
My first pass is to simply add this to init.d:
top -b -d 15 >/tmp/toploop.out &
This runs top in "batch" mode every 15 seconds. Let's assume that /tmp has plenty of space…
Questions:
Is 15 seconds a good value to choose for general-purpose monitoring?
Other than disk space, how seriously is this perturbing the state of the system?
What other (perhaps better) tools could be used like this?
Look at collectd. It's a very lightweight system monitoring framework coded for performance.
We use sysstat to monitor things like this.
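For example, sar can record samples to a binary file during a run and replay them afterwards (the file name and interval are just examples):
sar -o /tmp/sar.out 15 240 >/dev/null &    # sample every 15 s, 240 samples
# afterwards:
sar -u -f /tmp/sar.out    # CPU
sar -r -f /tmp/sar.out    # memory
sar -b -f /tmp/sar.out    # I/O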
You might find that vmstat and iostat with a delay and no repeat counter is a better option.
I suspect 15 seconds would be more than adequate unless you actually want to watch what's happening in real time, but that doesn't appear to be the case here.
As far as load, on an idling PIII 900MHz w/ 768MB of RAM running Ubuntu (not sure which version, but not more than a year old) I have top updating every 0.5 seconds and it's about 2% CPU utilization. At 15s updates, I'm seeing 0.1% CPU utilization.
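If you go the vmstat/iostat route, they can be left running in the background much like the top line in the question (the file names are just examples):
vmstat 15 > /tmp/vmstat.out &
iostat -x 15 > /tmp/iostat.out &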
Depending upon what exactly you want, you could use the output of uptime, free, and ps to get most, if not all, of top's information.
If you are looking for overall load, uptime is probably sufficient. However, if you want specific information about processes, you are adventurous, and have the /proc filesystem enabled, you may want to write your own tools. The primary benefit in this environment is that you can focus on exactly what you want and minimize the load introduced to the system.
The proc file system gives your application read access to the kernel memory that keeps track of many of the interesting variables. Reading from /proc is one of the lightest ways to get this information. Additionally, you may be able to get more information than provided by top. I've done this in the past to get amount of time spent in user and system by this process. Additionally, you can use this to get information about the number of file descriptors open by the process. You might also use this to get detailed information about how the network system is working.
Much of this information is pre-processed by other applications, which you can use if they give you the information you need. However, it is rather straightforward to read the raw information. Do a man proc for more information.
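A minimal sketch of pulling a few of those values straight out of /proc for one process (1234 is a hypothetical PID; you need to own the process, or be root, to list its fds):
pid=1234
awk '{print "utime:", $14, "stime:", $15}' /proc/$pid/stat   # clock ticks spent in user/system
grep VmRSS /proc/$pid/status                                  # resident set size
ls /proc/$pid/fd | wc -l                                      # number of open file descriptors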
Pity you haven't said what you are monitoring for.
You should decide whether 15 seconds is ok or not. Feel free to drop it way lower if you wish (and have a fast HDD)
No worries unless you are running a soft real-time system
Have a look at the tools suggested in other answers. I'll add another suggestion: "iotop", for answering the "who is thrashing the HDD" question.
At work for system monitoring during stress tests we use a tool called nmon.
What I love about nmon is it has the ability to export to XLS and generate beautiful graphs for you.
It generates statistics for:
Memory Usage
CPU Usage
Network Usage
Disk I/O
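In capture mode, an nmon run for this kind of instrumentation might look like this (the interval and count are just examples):
nmon -f -s 15 -c 240    # snapshot every 15 s, 240 snapshots, written to a .nmon file for later graphing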
Good luck :)
