Serial Code Experiences Big Differences In Running Time On A GPFS FS - io

I need to measure the wall time of a serial code running on our cluster. In exclusive mode, i.e., when no other user is using my node, the wall time of the code varies quite a lot, ranging from 2:30m to 3:20m. The code does the same thing in every run. I am wondering if the big variance in the wall time is caused by the GPFS file system, since the code reads from and writes to files stored on a GPFS file system. My question is: is there a tool I can use to view the GPFS I/O performance and relate it to the performance of my code?
Thanks.

This is a very big question...we need to narrow it down a bit. So, let me ask some questions.
Let us see the time command output for a simple ls command.
$ time ls
real 0m0.003s
user 0m0.001s
sys 0m0.001s
Wall clock time == real time, which in your case is varying. The next step of debugging is to ask: do the user time and system time also vary? If the GPFS file system work happens inside the kernel and consumes varying amounts of time, you should see the sys time vary. If the sys time remains the same but the real time varies, then the program is spending time sleeping on something. There are deeper ways to find the problem... but can you please clarify your question some more?
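One quick way to gather that evidence is to run the code several times and log all three numbers per run. A minimal sketch, assuming GNU time is installed at /usr/bin/time and that ./mycode is a placeholder for your actual program:
#!/bin/bash
# Run the serial code several times and record real/user/sys for each run,
# so you can see which component actually varies between runs.
for i in 1 2 3 4 5; do
    /usr/bin/time -f "run $i: real=%e user=%U sys=%S" ./mycode
done
If sys and user stay flat while real moves, the extra time is going to waiting (I/O, scheduling) rather than to computation.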

Related

Trying to find out why a Linux server had high cpu usage in the past

One of my Linux servers in the cloud had very high CPU usage yesterday, then the issue somehow disappeared by itself.
Is there a way to find out which process was taking all the CPU power yesterday?
For example, I want to find out which process was using the most CPU yesterday during 10AM~11AM. Is this achievable?
If you were running super-detailed logging, you might have the info recorded. But probably not; that kind of logging would take a lot of space (not just the load average every few minutes, but a full snapshot of top output).
So probably not.
If it was a process that's still running, its total CPU time might be higher than normal, but you don't know that. (Look at the "time" field in top. But that's just the total since the process started, with no record of when it happened.)
If it was something that ran from cron (and then exited), you could look at cron logs.
Or in general look for any log activity from around then; some system processes do end up logging things in the system logs. (journalctl with options to look around that time window.) But that will probably just give you hints about what might have started around then.
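For example, a time-bounded journal query might look like this (a sketch; the date is a placeholder for yesterday's actual date):
# Show journal entries from a one-hour window (adjust to yesterday's date).
journalctl --since "2024-06-11 10:00:00" --until "2024-06-11 11:00:00"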
Another source of possible hints could be mod times on files. If you find some files that were modified around that time, that might remind you of something you'd set up to run occasionally.
find / -xdev -mmin -$((25*60)) -mmin +$((24*60))
would, I think, print all files (on the root filesystem) that are older than 24 hours (24*60 minutes) but younger than 25 hours old. (Relative to now; -daystart changes that, I think.)

PBS walltime: how much was actually used?

How do I figure out how much walltime (mem? vmem?) a PBS job (PBS Pro) actually ended up using, if it's not presented in the stdout/stderr logs?
In Torque, this information is visible in the accounting log and in the qstat -f output for the job. In qstat -f, you want to look at the resources_used information.
This may have diverged somewhat in PBS Pro, but my guess is they have something similar.
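For instance, on a Torque system a check might look something like this (a sketch; the job ID is a placeholder, and the exact field names may differ in PBS Pro):
# Show the per-job accounting fields reported by the server (Torque syntax).
qstat -f 12345 | grep resources_used
# Typical fields: resources_used.cput, resources_used.mem,
#                 resources_used.vmem, resources_used.walltime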
Wall time is always measured outside of the system. That's why it refers to the "clock on the wall".
This is important because it often encompasses elements that some systems fail to measure, or measure poorly. To illustrate: before a system can capture the time, some code must run to allocate the memory that will hold the time, and then some code must run to store the value into that memory. Everything that happens before that point is misreported as not having "cost" any time at all.
While I may have described the essence of wall time, do look to dbeer's excellent answer for capturing a time close to wall clock time (and hopefully solving your metric gathering problem).

Linux time command - real vs user vs system

I am running a jar file in Linux with the time command. Below is the output after execution.
15454.58s real 123464.61s user 6455.55s system
Below is the command executed.
time java -jar -Xmx7168m Batch.jar
But the actual time taken to execute the process was 9270 seconds.
Why are the actual time (wall clock time) and the real time different?
Can anyone explain this? It's running on a multi-core machine (32 cores).
Maybe this explains the deviation you are experiencing. From the time Wikipedia article:
Because a program may fork children whose CPU times (both user and sys) are added to the values reported by the time command, but on a multicore system these tasks are run in parallel, the total CPU time may be greater than the real time.
Apart from that, your understanding of real time conforms with the definition given in time(7):
Real time is defined as time measured from some fixed point, either from a standard point in the past (see the description of the Epoch and calendar time below), or from some point (e.g., the start) in the life of a process (elapsed time).
See also bash(1) (although its documentation on the time command is not overly comprehensive).
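To see the forking/parallelism effect from the quote above directly, here is a rough sketch (using the coreutils timeout command; the four busy loops are arbitrary) that should report a user time roughly four times the real time on a machine with at least four idle cores:
# Run four CPU-bound busy loops in parallel for ~2 seconds each;
# user time accumulates across all cores, real time does not.
time bash -c '
  for i in 1 2 3 4; do
    timeout 2 sh -c "while :; do :; done" &
  done
  wait
'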
If seconds are exact enough for you, this little wrapper can help:
#!/bin/bash
starttime=$( date +"%s" )
# run your program here
endtime=$( date +"%s" )
duration=$(( endtime-starttime ))
echo "Execution took ${duration} s."
Note: If the system time is changed while your program is running, the results will be incorrect.
From what I remember, user time is the time the process spends in user space, system is the time spent running in kernel space (syscalls, for example), and real is also called the wall clock time (the actual time that you could measure with a stop-watch, for example). I don't know exactly how this is calculated on an SMP system.

What may slow down an ATA read-verify command sent to HDD on linux?

I am writing a C program to scan hard drives using the ATA read-verify (0x40) command on Linux, like what MHDD's scan does on DOS.
I issue the command using HDIO_DRIVE_TASK, and measure ioctl's block time using CLOCK_MONOTONIC.
I run the program as root and have its ionice set to real-time, but the readouts are always larger than what MHDD shows. Also, MHDD's results don't change much, while my program's results often vary a lot.
I tried issuing the command twice for each block and measuring the block time of the second run.
This fixes part of the problem, but my results still vary a lot.
What factors may slow down my command? How should I avoid them?
P.S. I have some spare drives of varying health for testing.

Benchmarking two binary files in Linux

I have two binary files. These files were not made by me.
I need to benchmark these files to see how fast and how well they work.
I tried to use the time command, but the big problems are:
running the two files at the same time,
stopping the two running files at the same time,
and using the time command with these two files.
If I use this solution (Benchmarking programs on Linux) and change the order of the files in the time command, the output changes:
0m0.010s file1
0m0.017s file2
When I change the order in the time command:
0m0.002s file2
0m0.013s file1
Thank you. Regards.
There are many ways to do what you want, but probably the simplest one is to run one program in a loop many times (say 1000 or more) so that the total execution time becomes something easy to measure - say, 50 seconds. Then repeat the same for the other one.
This allows you to get much more accurate measurements, and also minimizes inherent jitter between runs.
Having said that, note that with run times as low as the ones you observe, the time to start the process may not be a small fraction of the total measurement you get. So, if you run a loop, be sure to account for the cost of starting a new process 1000 times.
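As a rough sketch of that approach (./file1 and ./file2 stand in for your actual binaries; redirect or drop their output as appropriate):
# Time 1000 back-to-back runs of each binary so per-run jitter and
# process start-up cost are spread over many iterations.
time bash -c 'for i in $(seq 1000); do ./file1 > /dev/null; done'
time bash -c 'for i in $(seq 1000); do ./file2 > /dev/null; done'
Comparing the two totals (and dividing by the iteration count) gives a per-run estimate that is much less sensitive to the ordering effect you observed.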
