What is the difference between perf with an event and without an event? - linux

I want to profile the run time of a program I wrote, along with some other information about it.
I'm using the Linux tool perf to measure the run time.
The command I used is shown below:
$ perf stat -r 10000 ./program > /dev/null
Performance counter stats for './program' (10000 runs):
2.510453 task-clock (msec) # 0.946 CPUs utilized ( +- 0.04% )
0 context-switches # 0.002 K/sec ( +- 15.59% )
0 cpu-migrations # 0.000 K/sec
50 page-faults # 0.020 M/sec ( +- 0.02% )
9237800 cycles # 3.680 GHz ( +- 0.00% )
4011695 instructions # 0.43 insn per cycle ( +- 0.00% )
689371 branches # 274.600 M/sec ( +- 0.00% )
5144 branch-misses # 0.75% of all branches ( +- 0.03% )
0.002653910 seconds time elapsed ( +- 0.04% )
Then I used the -e flag to select the events myself.
$ perf stat -r 10000 -e cache-misses ./program > /dev/null
Performance counter stats for './program' (10000 runs):
941 cache-misses ( +- 0.69% )
0.003506780 seconds time elapsed ( +- 0.25% )
You can see that the run times of the two commands are quite different: the command with the event flag is slower than the one without it.
Why does that happen? And which of the two run times matches what a real user would see?

Related

perf : How to check processes running on particular cpu

Is there any option in perf to look at the processes running on a particular CPU/core, and what percentage of that core each process takes?
Reference links would be helpful.
perf is intended for profiling, which is not a good fit for your case. You could instead sample /proc/sched_debug (if it is compiled into your kernel). For example, you can check which process is currently running on each CPU:
egrep '^R|cpu#' /proc/sched_debug
cpu#0, 917.276 MHz
R egrep 2614 37730.177313 ...
cpu#1, 917.276 MHz
R bash 2023 218715.010833 ...
Using its PID as a key, you can check how much CPU time, in milliseconds, it has consumed:
grep se.sum_exec_runtime /proc/2023/sched
se.sum_exec_runtime : 279346.058986
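The same lookup can be sketched in Python. This is only a sketch: it assumes /proc/<pid>/sched is available and uses the `name : value` layout shown above, and the function names are hypothetical.

```python
def parse_sum_exec_runtime(sched_text):
    """Return se.sum_exec_runtime (milliseconds) from the text of /proc/<pid>/sched."""
    for line in sched_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "se.sum_exec_runtime":
            return float(value)
    raise ValueError("se.sum_exec_runtime not found")


def cpu_time_ms(pid):
    """Read the accumulated CPU time of a live process (assumes the file exists)."""
    with open(f"/proc/{pid}/sched") as f:
        return parse_sum_exec_runtime(f.read())
```

Calling `cpu_time_ms(2023)` twice, some time apart, and subtracting the results gives the CPU time consumed in between.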
However, as @BrenoLeitão mentioned, SystemTap is quite useful for this. Here is a script for your task.
global cputimes;
global cmdline;
global oncpu;
global NS_PER_SEC = 1000000000;
probe scheduler.cpu_on {
oncpu[pid()] = local_clock_ns();
}
probe scheduler.cpu_off {
if(oncpu[pid()] == 0)
next;
cmdline[pid()] = cmdline_str();
cputimes[pid(), cpu()] <<< local_clock_ns() - oncpu[pid()];
delete oncpu[pid()];
}
probe timer.s(1) {
printf("%6s %3s %6s %s\n", "PID", "CPU", "PCT", "CMDLINE");
foreach([pid+, cpu] in cputimes) {
cpupct = @sum(cputimes[pid, cpu]) * 10000 / NS_PER_SEC;
printf("%6d %3d %3d.%02d %s\n", pid, cpu,
cpupct / 100, cpupct % 100, cmdline[pid]);
}
delete cputimes;
}
It traces the moments when a process starts running on a CPU and when it stops executing there (due to migration or sleeping) by attaching to the scheduler.cpu_on and scheduler.cpu_off probes. The second probe calculates the time difference between these events and saves it to the cputimes aggregation, along with the process's command line.
timer.s(1) fires once per second: it walks over the aggregation and calculates the percentages. Here is sample output on CentOS 7 with bash running an infinite loop:
0 0 100.16
30 1 0.00
51 0 0.00
380 0 0.02 /usr/bin/python -Es /usr/sbin/tuned -l -P
2016 0 0.08 sshd: root@pts/0 "" "" "" ""
2023 1 100.11 -bash
2630 0 0.04 /usr/libexec/systemtap/stapio -R stap_3020c9e7ba76838179be68cd2390a10c_2630 -F3
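To make the arithmetic in the script easier to follow, here is a rough Python sketch of the same bookkeeping. It works on a made-up list of (kind, pid, cpu, timestamp) events rather than real scheduler probes, and all names in it are hypothetical.

```python
from collections import defaultdict

NS_PER_SEC = 1_000_000_000


def cpu_percentages(events):
    """events: iterable of (kind, pid, cpu, timestamp_ns), kind being "on" or "off".

    Returns {(pid, cpu): percent_in_hundredths} for one second of events,
    mirroring cpupct = sum * 10000 / NS_PER_SEC in the SystemTap script.
    """
    oncpu = {}
    cputimes = defaultdict(int)
    for kind, pid, cpu, ts in events:
        if kind == "on":
            oncpu[pid] = ts
        elif kind == "off" and pid in oncpu:
            # Accumulate the time slice that just ended.
            cputimes[(pid, cpu)] += ts - oncpu.pop(pid)
    return {key: total * 10000 // NS_PER_SEC for key, total in cputimes.items()}


def format_pct(hundredths):
    """Format hundredths of a percent the way the script's printf does."""
    return f"{hundredths // 100:3d}.{hundredths % 100:02d}"
```

Keeping the percentage as an integer number of hundredths is what lets the script print two decimal places without floating-point arithmetic.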
I understand that perf is not the proper way to do it, although you can limit perf to particular CPUs using perf record -C <cpulist> or even perf stat -C <cpulist>.
The closest you are going to get is the context-switch event, but that will not give you the application names at all.
I think you are going to need something more powerful, such as SystemTap.

time command output on an already running process

I have a process that spawns some other processes,
I want to use the time command on a specific process and get the same output as the time command.
Is that possible and how?
I want to use the time command on a specific process and get the same output as the time command.
Probably it is enough just to use pidstat to get user and sys time:
$ pidstat -p 30122 1 4
Linux 2.6.32-431.el6.x86_64 (hostname) 05/15/2014 _x86_64_ (8 CPU)
04:42:28 PM PID %usr %system %guest %CPU CPU Command
04:42:29 PM 30122 706.00 16.00 0.00 722.00 3 has_serverd
04:42:30 PM 30122 714.00 12.00 0.00 726.00 3 has_serverd
04:42:31 PM 30122 714.00 14.00 0.00 728.00 3 has_serverd
04:42:32 PM 30122 708.00 16.00 0.00 724.00 3 has_serverd
Average: 30122 710.50 14.50 0.00 725.00 - has_serverd
If not, then according to strace, time uses the wait4 system call (http://linux.die.net/man/2/wait4) to get information about a process from the kernel. getrusage returns the same information, but according to its documentation (http://linux.die.net/man/2/getrusage) you cannot call it for an arbitrary process.
So I do not know of any command that gives the same output. However, it is feasible to write a bash script that takes the PID of the specific process and prints something like the output of time.
This script does these steps:
1) Get the number of clock ticks per second
getconf CLK_TCK
I assume it is 100 and 1 tick is equal to 10 milliseconds.
2) Then, in a loop, repeat the same sequence of commands for as long as the directory /proc/YOUR-PID exists:
while [ -e "/proc/YOUR-PID" ];
do
read USER_TIME SYS_TIME REAL_TIME <<< $(cat /proc/YOUR-PID/stat | awk '{print $14, $15, $22;}')
sleep 0.1
done
Some explanation - according to man proc :
user time: ($14) - utime - Amount of time that this process has been scheduled in user mode, measured in clock ticks
sys time: ($15) - stime - Amount of time that this process has been scheduled in kernel mode, measured in clock ticks
starttime ($22) - The time in jiffies the process started after system boot.
3) When the process is finished get finish time
read FINISH_TIME <<< $(cat '/proc/self/stat' | awk '{print $22;}')
And then output:
the real time = ($FINISH_TIME-$REAL_TIME) * 10 - in milliseconds (assuming CLK_TCK is 100)
user time: $USER_TIME * 1000 / $(getconf CLK_TCK) - in milliseconds (ticks * 10 when CLK_TCK is 100)
sys time: $SYS_TIME * 1000 / $(getconf CLK_TCK) - in milliseconds (ticks * 10 when CLK_TCK is 100)
I think it should give roughly the same result as time. One possible problem I see is if the process exists for a very short period of time.
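The field extraction and tick conversion described in the steps above can be sketched in Python as follows. This is a sketch under the same assumption of 100 clock ticks per second unless told otherwise; the helper names are hypothetical.

```python
def parse_stat(stat_line):
    """Return (utime, stime, starttime) in clock ticks from a /proc/<pid>/stat line.

    The command name (field 2) may contain spaces, so split after the last ')'.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return int(rest[14 - 3]), int(rest[15 - 3]), int(rest[22 - 3])


def ticks_to_ms(ticks, clk_tck=100):
    """Convert clock ticks to milliseconds."""
    return ticks * 1000 // clk_tck


def format_like_time(label, ms):
    """Format milliseconds roughly the way the shell's time builtin does."""
    minutes, rem = divmod(ms, 60_000)
    return f"{label}\t{minutes}m{rem // 1000}.{rem % 1000:03d}s"
```

On a real system the tick rate can be read with `os.sysconf("SC_CLK_TCK")` instead of assuming 100.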
This is my implementation of time:
#!/bin/bash
# Uses herestrings
print_res_jeffies()
{
let "TIME_M=$2/60000"
let "TIME_S=($2-$TIME_M*60000)/1000"
let "TIME_MS=$2-$TIME_M*60000-$TIME_S*1000"
printf "%s\t%dm%d.%03dms\n" $1 $TIME_M $TIME_S $TIME_MS
}
print_res_ticks()
{
let "TIME_M=$2/6000"
let "TIME_S=($2-$TIME_M*6000)/100"
let "TIME_MS=($2-$TIME_M*6000-$TIME_S*100)*10"
printf "%s\t%dm%d.%03dms\n" $1 $TIME_M $TIME_S $TIME_MS
}
if [ $(getconf CLK_TCK) != 100 ]; then
exit 1;
fi
if [ $# != 1 ]; then
exit 1;
fi
PROC_DIR="/proc/"$1
if [ ! -e $PROC_DIR ]; then
exit 1
fi
USER_TIME=0
SYS_TIME=0
START_TIME=0
while [ -e $PROC_DIR ]; do
read TMP_USER_TIME TMP_SYS_TIME TMP_START_TIME <<< $(cat $PROC_DIR/stat | awk '{print $14, $15, $22;}')
if [ -e $PROC_DIR ]; then
USER_TIME=$TMP_USER_TIME
SYS_TIME=$TMP_SYS_TIME
START_TIME=$TMP_START_TIME
sleep 0.1
else
break
fi
done
read FINISH_TIME <<< $(cat '/proc/self/stat' | awk '{print $22;}')
let "REAL_TIME=($FINISH_TIME - $START_TIME)*10"
print_res_jeffies 'real' $REAL_TIME
print_res_ticks 'user' $USER_TIME
print_res_ticks 'sys' $SYS_TIME
And this is an example that compares my implementation of time with the real time:
>time ./sys_intensive > /dev/null
Alarm clock
real 0m10.004s
user 0m9.883s
sys 0m0.034s
In another terminal window I run my_time.sh and give it the PID:
>./my_time.sh `pidof sys_intensive`
real 0m10.010ms
user 0m9.780ms
sys 0m0.030ms

for and start commands in a batch for parallel and sequential work

I have an 8 core CPU with 8GB of RAM, and I'm creating a batch file to automate 7-zip CLI in exhausting most parameters and variables to compress the same set of files with the ultimate goal of finding the strongest combination of parameters and variables that result in the smallest archive size possible.
This is very time consuming by nature especially when the set of files to be processed is in gigabytes. I need a way not only to automate but to speed up this whole process.
7-zip works with different compression algorithms, some are single-threaded only, and some are multi-threaded, some do not require much amount of memory, and some require huge amounts of it and could even surpass the 8GB barrier. I've already successfully created an automated batch that works in sequence which exclude combinations requiring more than 8GB of memory.
I've split the different compression algorithms in several batches to simplify the whole process. For example, compression in PPMd as a 7z archive uses 1-thread and up to 1024MB. This is my current batch:
@echo off
echo mem=1m 2m 3m 4m 6m 8m 12m 16m 24m 32m 48m 64m 96m 128m 192m 256m 384m 512m 768m 1024m
echo o=2 3 4 5 6 7 8 10 12 14 16 20 24 28 32
echo s=off 1m 2m 4m 8m 16m 32m 64m 128m 256m 512m 1g 2g 4g 8g 16g 32g 64g on
echo x=1 3 5 7 9
for %%x IN (9) DO for %%d IN (1024m 768m 512m 384m 256m 192m 128m 96m 64m 48m 32m 24m 16m 12m 8m 6m 4m 3m 2m 1m) DO for %%w IN (32 28 24 20 16 14 12 10 8 7 6 5 4 3 2) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.ppmd.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -m0=PPMd:mem=%%d:o=%%w -ms=%%s
exit
x, s, o and mem are parameters, and what's after each of them are the variables which 7z.exe will work with. x and s in this case are of no concern, they mean compression strength and solid block size for the archive.
That batch will work fine, but is limited to running only 1 instance of 7z.exe at a time and now I'm looking for a way to make it run more 7z.exe instances in parallel but without exceeding 8GB of RAM or 8 threads at once, whichever comes first, before proceeding to do the next ones in the sequence.
How can I improve this? I have some ideas but I don't know how to make them work in a batch file. I was thinking of 2 other variables that wouldn't interact with the 7z processes but would control when the next 7z instance starts. One variable would keep track of how many threads are currently in use and another would track how much memory is in use. Could that work?
Edit:
Sorry, I need to add details, I'm new to this posting style. Following this answer - https://stackoverflow.com/a/19481253/2896127 - I mentioned 8 batches were created and that 7z.PPMd batch was one of them. Maybe listing all the batches and how 7z deals with the parameters will give a better insight on the whole issue. I'll start with the simple ones:
7z.PPMd - 1 fully utilized thread and dictionary dependant 32m-1055m memory usage per instance.
7z.BZip2 - 8 fully utilized threads and fixed 109m memory usage per instance.
zip.Bzip2 - 8 partially utilized threads and fixed 336m memory usage per instance.
zip.Deflate - 8 partially utilized threads and fixed 260m memory usage per instance.
zip.PPMd - 8 partially utilized threads and dictionary dependant 280m-2320m memory usage per instance.
What I mean by partially utilized threads is that, while I assign 8 threads to each 7z.exe instance, the algorithm's CPU usage varies unpredictably and is out of my control, but the limit still holds: no more than 8 threads. In the case of 8 fully utilized threads, each instance uses 100% of my 8-core CPU.
The most complex ones - 7z.LZMA, 7z.LZMA2, zip.LZMA - will need to be explained in detail but I am running short on time now. I'll be back to edit the LZMA part whenever I have more free time.
Thanks again.
EDIT: Adding in LZMA part.
7z.LZMA - each instance is n-threaded, ranging from 1 to 2:
1 fully utilized thread, dictionary dependant, 64k to 512m:
64k dictionary uses 32m memory
...
512m dictionary uses 5407m memory
excluded range: 768m to 1024m (above the limit of 8192m memory available)
2 partially utilized threads, dictionary dependant, 64k to 512m:
64k dictionary uses 38m memory
...
512m dictionary uses 5413m memory
excluded range: 768m to 1024m (above the limit of 8192m memory available)
7z.LZMA2 - each instance is n-threaded, ranging from 1 to 8:
1 fully utilized thread, dictionary dependant, 64k to 512m:
64k dictionary uses 32m memory
...
512m dictionary uses 5407m memory
excluded range: 768m to 1024m (above the limit of 8192m memory available)
2 or 3 partially utilized threads, dictionary dependant, 64k to 512m:
64k dictionary uses 38m memory
...
512m dictionary uses 5413m memory
excluded range: 768m to 1024m (above the limit of 8192m memory available)
4 or 5 partially utilized threads, dictionary dependant, 64k to 256m:
64k dictionary uses 51m memory
...
256m dictionary uses 5677m memory
excluded range: 384m to 1024m (above the limit of 8192m memory available)
6 or 7 partially utilized threads, dictionary dependant, 64k to 192m:
64k dictionary uses 62m memory
...
192m dictionary uses 6965m memory
excluded range: 256m to 1024m (above the limit of 8192m memory available)
8 partially utilized threads, dictionary dependant, 64k to 128m:
64k dictionary uses 72m memory
...
128m dictionary uses 6717m memory
excluded range: 192m to 1024m (above the limit of 8192m memory available)
zip.LZMA - each instance is n-threaded, ranging from 1 to 8:
1 fully utilized thread, dictionary dependant, 64k to 512m:
64k dictionary uses 3m memory
...
512m dictionary uses 5378m memory
excluded range: 768m to 1024m (above the limit of 8192m memory available)
2 or 3 partially utilized threads, dictionary dependant, 64k to 512m:
64k dictionary uses 9m memory
...
512m dictionary uses 5384m memory
excluded range: 768m to 1024m (above the limit of 8192m memory available)
4 or 5 partially utilized threads, dictionary dependant, 64k to 256m:
64k dictionary uses 82m memory
...
256m dictionary uses 5456m memory
excluded range: 384m to 1024m (above the limit of 8192m memory available)
6 or 7 partially utilized threads, dictionary dependant, 64k to 256m:
64k dictionary uses 123m memory
...
256m dictionary uses 8184m (very close to the limit though, I may consider excluding it)
excluded range: 384m to 1024m (above the limit of 8192m memory available)
8 partially utilized threads, dictionary dependant, 64k to 128m:
64k dictionary uses 164m memory
...
128m dictionary uses 5536m memory
excluded range: 192m to 1024m (above the limit of 8192m memory available)
I'm trying to understand the behaviour of the commands with nul in them. I don't quite understand what's happening during that part, what those symbols ^ > ^&1 "" are meant to say.
2>nul del %lock%!nextProc!
%= Redirect the lock handle to the lock file. The CMD process will =%
%= maintain an exclusive lock on the lock file until the process ends. =%
start /b "" cmd /c %lockHandle%^>"%lock%!nextProc!" 2^>^&1 !cpu%%N! !cmd!
)
set "launch="
Then later on, at the :wait code:
) 9>>"%lock%%%N"
) 2>nul
if %endCount% lss %startCount% (
1>nul 2>nul ping /n 2 ::1
goto :wait
)
2>nul del %lock%*
EDIT 2 (29-10-2013): Adding the current point of the situation.
After trial and error research, complemented with step by step notes of what's happening, I was able to understand the behaviour above. I simplified the line with start command to this:
start /b /low cmd /c !cmd!>"%lock%!nextProc!"
Though it works, I still don't understand the meaning of 1^>"filename" 2^>^&1 'command'. I know it is related to writing text in the filename what would otherwise be displayed to me. In this case, it would show all of 7z.exe text but written in the file. Until 7z.exe instance finishes its job, nothing is written in the file, but the file already exists, yet at the same time doesn't exist. When 7z.exe actually finishes, the file is finalized and this time it exists for the next part of the script.
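For what it's worth, the pieces of that line can be read as follows: "" is the (empty) window title that start takes as its first quoted argument; ^ escapes > and & so that the redirection is passed through to the child cmd /c instead of being applied to start itself; 1>"filename" sends standard output to the file, and 2>&1 then sends standard error to wherever handle 1 points. A rough Python analogue of the redirection part only (not of the file-locking trick), with a hypothetical helper name:

```python
import subprocess


def run_with_merged_output(argv, log_path):
    """Run argv with stdout written to log_path and stderr merged into it,
    similar in effect to `1>"log_path" 2>&1 command` in cmd."""
    with open(log_path, "w") as log:
        completed = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode
```

The log file is created as soon as the child starts, which matches the observation that the file already exists while 7z.exe is still running.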
Now I can understand the processing behaviour of the suggested script and I'm complementing it with something of my own - I am trying to implement all batches into "one batch do it all" script. In the simplified version, this is it:
echo 8 threads - maxproc=1
for %%x IN (9) DO for %%t IN (8) DO for %%d IN (900k) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.bzip2.%%tt.%%dd.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=BZip2:d=%%d:mt=%%t
for %%x IN (9) DO for %%t IN (8) DO for %%d IN (900k) DO 7z.exe a teste.resultado\%%xx.bzip2.%%tt.%%dd.zip .\teste.original\* -mx=%%x -mm=BZip2:d=%%d -mmt=%%t
for %%x IN (9) DO for %%t IN (8) DO for %%w IN (257 256 192 128 96 64 48 32 24 16 12 8) DO 7z.exe a teste.resultado\%%xx.deflate64.%%tt.%%ww.zip .\teste.original\* -mx=%%x -mm=deflate64:fb=%%w -mmt=%%t
for %%x IN (9) DO for %%t IN (8) DO for %%w IN (258 256 192 128 96 64 48 32 24 16 12 8) DO 7z.exe a teste.resultado\%%xx.deflate.%%tt.%%ww.zip .\teste.original\* -mx=%%x -mm=deflate:fb=%%w -mmt=%%t
for %%x IN (9) DO for %%t IN (8) DO for %%d IN (256m 128m 64m 32m 16m 8m 4m 2m 1m) DO for %%w IN (16 15 14 13 12 11 10 9 8 7 6 5 4 3 2) DO 7z.exe a teste.resultado\%%xx.ppmd.%%tt.%%dd.%%ww.zip .\teste.original\* -mx=%%x -mm=PPMd:mem=%%d:o=%%w -mmt=%%t
echo 4 threads - maxproc=2
for %%x IN (9) DO for %%t IN (4) DO for %%d IN (256m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.lzma2.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=lzma2:d=%%d:fb=%%w -mmt=%%t
echo 2 threads - maxproc=4
for %%x IN (9) DO for %%t IN (2) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.lzma.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=LZMA:d=%%d:fb=%%w -mmt=%%t
for %%x IN (9) DO for %%t IN (2) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.lzma2.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=lzma2:d=%%d:fb=%%w -mmt=%%t
for %%x IN (9) DO for %%t IN (2) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO 7z.exe a teste.resultado\%%xx.lzma.%%tt.%%dd.%%ww.zip .\teste.original\* -mx=%%x -mm=lzma:d=%%d:fb=%%w -mmt=%%t
echo 1 threads - maxproc=8
for %%x IN (9) DO for %%t IN (1) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.lzma.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=LZMA:d=%%d:fb=%%w -mmt=%%t
for %%x IN (9) DO for %%t IN (1) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.lzma2.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=lzma2:d=%%d:fb=%%w -mmt=%%t
for %%x IN (9) DO for %%d IN (1024m 768m 512m 384m 256m 192m 128m 96m 64m 48m 32m 24m 16m 12m 8m 6m 4m 3m 2m 1m) DO for %%w IN (32 28 24 20 16 14 12 10 8 7 6 5 4 3 2) DO for %%s IN (on) DO 7z.exe a teste.resultado\%%xx.ppmd.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -m0=PPMd:mem=%%d:o=%%w -ms=%%s
for %%x IN (9) DO for %%t IN (1) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO 7z.exe a teste.resultado\%%xx.lzma.%%tt.%%dd.%%ww.zip .\teste.original\* -mx=%%x -mm=lzma:d=%%d:fb=%%w -mmt=%%t
In short, I want to process all that in the most efficient manner possible. Doing it by deciding how many processes can run at a time would be a way, but then there's also the memory required for each process, so that the sum of all required memory by those processes won't exceed 8192 MB. I got this part working.
@echo off
setlocal enableDelayedExpansion
set "maxMem=8192"
set "maxThreads=8"
:cycle1
set "cycleCount=4"
set "cycleThreads=1"
set "maxProc="
set /a "maxProc=maxThreads/cycleThreads"
set "cycleFor1=for %%x IN (9) DO for %%t IN (1) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO ("
set "cycleFor2=for %%x IN (9) DO for %%t IN (1) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO ("
set "cycleFor3=for %%x IN (9) DO for %%d IN (1024m 768m 512m 384m 256m 192m 128m 96m 64m 48m 32m 24m 16m 12m 8m 6m 4m 3m 2m 1m) DO for %%w IN (32 28 24 20 16 14 12 10 8 7 6 5 4 3 2) DO for %%s IN (on) DO ("
set "cycleFor4=for %%x IN (9) DO for %%t IN (1) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO ("
set "cycleCmd1=7z.exe a teste.resultado\%%xx.lzma.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=LZMA:d=%%d:fb=%%w -mmt=%%t"
set "cycleCmd2=7z.exe a teste.resultado\%%xx.lzma2.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=lzma2:d=%%d:fb=%%w -mmt=%%t"
set "cycleCmd3=7z.exe a teste.resultado\%%xx.ppmd.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -m0=PPMd:mem=%%d:o=%%w -ms=%%s"
set "cycleCmd4=7z.exe a teste.resultado\%%xx.lzma.%%tt.%%dd.%%ww.zip .\teste.original\* -mx=%%x -mm=lzma:d=%%d:fb=%%w -mmt=%%t"
set "tempMem1=5407"
set "tempMem2=5407"
set "tempMem3=1055"
set "tempMem4=5378"
rem set "tempMem1=5407"
rem set "tempMem2=5407"
rem set "tempMem3=1055 799 543 415 287 223 159 127 95 79 63 55 47 43 39 37 35 34 33 32"
rem set "tempMem4=5378"
set "memSum=0"
if not defined memRem set "memRem=!maxMem!"
for /l %%N in (1 1 %cycleCount%) DO (set "tempProc%%N=")
for /l %%N in (1 1 %cycleCount%) DO (
set memRem
set /a "tempProc%%N=%memRem%/tempMem%%N"
set /a "memSum+=tempMem%%N"
set /a "memRem-=tempMem%%N"
set /a "maxProc=!tempProc%%N!"
call :executeCycle
set /a "memRem+=tempMem%%N"
set /a "memSum-=tempMem%%N"
set /a "maxProc-=!tempProc%%N!"
)
goto :fim
:executeCycle
set "lock=lock_%random%_"
set /a "startCount=0, endCount=0"
for /l %%N in (1 1 %maxProc%) DO set "endProc%%N="
set launch=1
for %%x IN (9) DO for %%t IN (1) DO for %%d IN (512m) DO for %%w IN (273 256 192 128 96 64 48 32 24 16 12 8) DO for %%s IN (on) DO (
set "cmd=7z.exe a teste.resultado\%%xx.lzma.%%tt.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -ms=%%s -m0=LZMA:d=%%d:fb=%%w -mmt=%%t"
if !startCount! lss %maxProc% (
set /a "startCount+=1, nextProc=startCount"
) else (
call :wait
)
set cmd!nextProc!=!cmd!
echo !time! - proc!nextProc!: starting !cmd!
2>nul del %lock%!nextProc!
start /b /low cmd /c !cmd!>"%lock%!nextProc!"
)
set "launch="
:wait
for /l %%N in (1 1 %startCount%) do (
if not defined endProc%%N if exist "%lock%%%N" (
echo !time! - proc%%N: finished !cmd%%N!
if defined launch (
set nextProc=%%N
exit /b
)
set /a "endCount+=1, endProc%%N=1"
) 9>>"%lock%%%N"
) 2>nul
if %endCount% lss %startCount% (
1>nul 2>nul ping /n 2 ::1
goto :wait
)
2>nul del %lock%*
echo ===
echo Thats all folks!
exit /b
:fim
pause
I have trouble with cycleFor1 and cycleCmd1 located in :cycle1 part - they should be replacing the for line and the first cmd variable inside the :executeCycle, to make it work as I intend to. How do I do that?
The other issue I have is about tempMem3. I have logged all memory required when the command cycleCmd3 would be running. It is dictionary dependant. tempMem3 and cycleCmd3 are related like this:
for %%d IN (1024m 768m 512m 384m 256m 192m 128m 96m 64m 48m 32m 24m 16m 12m 8m 6m 4m 3m 2m 1m) DO
set "tempMem3=1055 799 543 415 287 223 159 127 95 79 63 55 47 43 39 37 35 34 33 32"
So 1024m would use 1055, 768m would use 799, and so on till 1m using 32. I don't know how to translate that into the script.
Any help is appreciated.
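The relationship between the two lists above is a positional lookup table. As an illustration only (in Python rather than batch, with hypothetical variable names), pairing them up looks like this:

```python
# Dictionary sizes and the measured memory use (in MB) from the question, in the same order.
DICT_SIZES = ("1024m 768m 512m 384m 256m 192m 128m 96m 64m 48m "
              "32m 24m 16m 12m 8m 6m 4m 3m 2m 1m").split()
MEM_MB = [int(x) for x in
          "1055 799 543 415 287 223 159 127 95 79 63 55 47 43 39 37 35 34 33 32".split()]

# Map each dictionary size to the memory its 7z.exe instance needs.
MEM_FOR_DICT = dict(zip(DICT_SIZES, MEM_MB))
```

In a batch file the same idea would be expressed with one variable per dictionary size, looked up with delayed expansion inside the loop.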
I've already posted a robust batch solution that limits the number of parallel processes at Parallel execution of shell processes. That script uses a list of commands that is embedded within the script. Follow the link to see how it works.
I modified that script to generate the commands using FOR loops as per your question. I also set the limit to 8 simultaneous processes.
Your maximum memory is 1g, and you never have more than 8 processes, so I don't see how you could ever exceed 8g. If you increase the maximum memory per process, then you do have to worry about total memory: you will have to add additional logic to keep track of how much memory is being used, and which CPU IDs are available. Note that batch numbers are limited to ~2g, so I recommend computing memory used in megabytes.
By default, the script hides the output of the commands. If you want to see the output, then run it with the /O option.
@echo off
setlocal enableDelayedExpansion
:: Display the output of each process if the /O option is used
:: else ignore the output of each process
if /i "%~1" equ "/O" (
set "lockHandle=1"
set "showOutput=1"
) else (
set "lockHandle=1^>nul 9"
set "showOutput="
)
:: Define the maximum number of parallel processes to run.
:: Each process number can optionally be assigned to a particular server
:: and/or cpu via psexec specs (untested).
set "maxProc=8"
:: Optional - Define CPU targets in terms of PSEXEC specs
:: (everything but the command)
::
:: If a cpu is not defined for a proc, then it will be run on the local machine.
:: I haven't tested this feature, but it seems like it should work.
::
:: set cpu1=psexec \\server1 ...
:: set cpu2=psexec \\server1 ...
:: set cpu3=psexec \\server2 ...
:: etc.
:: For this demo force all cpu specs to undefined (local machine)
for /l %%N in (1 1 %maxProc%) do set "cpu%%N="
:: Get a unique base lock name for this particular instantiation.
:: Incorporate a timestamp from WMIC if possible, but don't fail if
:: WMIC not available. Also incorporate a random number.
set "lock="
for /f "skip=1 delims=-+ " %%T in ('2^>nul wmic os get localdatetime') do (
set "lock=%%T"
goto :break
)
:break
set "lock=%temp%\lock%lock%_%random%_"
:: Initialize the counters
set /a "startCount=0, endCount=0"
:: Clear any existing end flags
for /l %%N in (1 1 %maxProc%) do set "endProc%%N="
:: Launch the commands in a loop
set launch=1
echo mem=1m 2m 3m 4m 6m 8m 12m 16m 24m 32m 48m 64m 96m 128m 192m 256m 384m 512m 768m 1024m
echo o=2 3 4 5 6 7 8 10 12 14 16 20 24 28 32
echo s=off 1m 2m 4m 8m 16m 32m 64m 128m 256m 512m 1g 2g 4g 8g 16g 32g 64g on
echo x=1 3 5 7 9
for %%x IN (9) DO for %%d IN (1024m 768m 512m 384m 256m 192m 128m 96m 64m 48m 32m 24m 16m 12m 8m 6m 4m 3m 2m 1m) DO for %%w IN (32 28 24 20 16 14 12 10 8 7 6 5 4 3 2) DO for %%s IN (on) DO (
set "cmd=7z.exe a teste.resultado\%%xx.ppmd.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -m0=PPMd:mem=%%d:o=%%w -ms=%%s"
if !startCount! lss %maxProc% (
set /a "startCount+=1, nextProc=startCount"
) else (
call :wait
)
set cmd!nextProc!=!cmd!
if defined showOutput echo -------------------------------------------------------------------------------
echo !time! - proc!nextProc!: starting !cmd!
2>nul del %lock%!nextProc!
%= Redirect the lock handle to the lock file. The CMD process will =%
%= maintain an exclusive lock on the lock file until the process ends. =%
start /b "" cmd /c %lockHandle%^>"%lock%!nextProc!" 2^>^&1 !cpu%%N! !cmd!
)
set "launch="
:wait
:: Wait for procs to finish in a loop
:: If still launching then return as soon as a proc ends
:: else wait for all procs to finish
:: redirect stderr to null to suppress any error message if redirection
:: within the loop fails.
for /l %%N in (1 1 %startCount%) do (
%= Redirect an unused file handle to the lock file. If the process is =%
%= still running then redirection will fail and the IF body will not run =%
if not defined endProc%%N if exist "%lock%%%N" (
%= Made it inside the IF body so the process must have finished =%
if defined showOutput echo ===============================================================================
echo !time! - proc%%N: finished !cmd%%N!
if defined showOutput type "%lock%%%N"
if defined launch (
set nextProc=%%N
exit /b
)
set /a "endCount+=1, endProc%%N=1"
) 9>>"%lock%%%N"
) 2>nul
if %endCount% lss %startCount% (
1>nul 2>nul ping /n 2 ::1
goto :wait
)
2>nul del %lock%*
if defined showOutput echo ===============================================================================
echo Thats all folks!
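The script above only limits the number of processes. The extra bookkeeping mentioned earlier, tracking threads and memory together, could look roughly like the Python sketch below. Note that this is a simpler technique than the slot-refilling loop in the batch script: it plans fixed batches up front instead of starting a new job as soon as one finishes. The function name and job tuples are hypothetical; the memory figures come from the question.

```python
def plan_batches(jobs, max_threads=8, max_mem=8192):
    """Greedily group jobs into batches that fit both budgets.

    jobs: list of (name, threads, mem_mb).
    Returns a list of batches (lists of names). Jobs within one batch may run
    in parallel; batches run one after another.
    """
    batches, current, threads, mem = [], [], 0, 0
    for name, t, m in jobs:
        if t > max_threads or m > max_mem:
            raise ValueError(f"{name} can never fit the budgets")
        if threads + t > max_threads or mem + m > max_mem:
            # Adding this job would exceed a budget: close the current batch.
            batches.append(current)
            current, threads, mem = [], 0, 0
        current.append(name)
        threads += t
        mem += m
    if current:
        batches.append(current)
    return batches
```

Working in megabytes keeps every number well inside the 32-bit range that batch arithmetic is limited to, should the logic be ported back to a batch file.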
To execute at the same time no more than 8 instances of 7z.exe process you could do this:
@Echo OFF & Setlocal EnableDelayedExpansion
Set /A "pCount=0" & REm Process count
For
...
) DO (
Set /A "pCount+=1"
If !pCount! LEQ 8 (
Start /B 7z.exe a teste.resultado\%%xx.ppmd.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -m0=PPMd:mem=%%d:o=%%w -ms=%%s
)
)
...
If you want to run each process in a new parallel CMD window, replace the Start /B line in my code with this:
CMD /C "Start /w 7z.exe a teste.resultado\%%xx.ppmd.%%dd.%%ww.%%ss.7z .\teste.original\* -mx=%%x -m0=PPMd:mem=%%d:o=%%w -ms=%%s"

Writing "fib" to run in parallel: -N2 is slower?

I'm learning Haskell and trying to write code that executes in parallel, but Haskell always runs it sequentially. And when I run it with the -N2 runtime flag, it takes more time than if I omit the flag.
Here is the code:
import Control.Parallel
import Control.Parallel.Strategies
fib :: Int -> Int
fib 1 = 1
fib 0 = 1
fib n = fib (n - 1) + fib (n - 2)
fib2 :: Int -> Int
fib2 n = a `par` (b `pseq` (a+b))
where a = fib n
b = fib n + 1
fib3 :: Int -> Int
fib3 n = runEval $ do
a <- rpar (fib n)
b <- rpar (fib n + 1)
rseq a
rseq b
return (a+b)
main = do putStrLn (show (fib3 40))
What did I do wrong? I tried this sample on Windows 7 on an Intel Core i5 and on Linux on an Atom.
Here is the log from my console session:
ghc -rtsopts -threaded -O2 test.hs
[1 of 1] Compiling Main ( test.hs, test.o )
test +RTS -s
331160283
64,496 bytes allocated in the heap
2,024 bytes copied during GC
42,888 bytes maximum residency (1 sample(s))
22,648 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
Parallel GC work balance: nan (0 / 0, ideal 1)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 6.59s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s)
Task 2 (bound) : 6.33s ( 6.59s) 0.00s ( 0.00s)
SPARKS: 2 (0 converted, 0 pruned)
INIT time 0.00s ( 0.00s elapsed)
MUT time 6.33s ( 6.59s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 6.33s ( 6.59s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 10,191 bytes per MUT second
Productivity 100.0% of total user, 96.0% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync_large_objects: 0
gen[1].sync_large_objects: 0
test +RTS -N2 -s
331160283
72,688 bytes allocated in the heap
5,644 bytes copied during GC
28,300 bytes maximum residency (1 sample(s))
24,948 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 1 collections, 1 parallel, 0.00s, 0.01s elapsed
Parallel GC work balance: 1.51 (937 / 621, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 9.29s) 0.00s ( 0.00s)
Task 1 (worker) : 4.53s ( 9.29s) 0.00s ( 0.00s)
Task 2 (bound) : 5.84s ( 9.29s) 0.00s ( 0.01s)
Task 3 (worker) : 0.00s ( 9.29s) 0.00s ( 0.00s)
SPARKS: 2 (1 converted, 0 pruned)
INIT time 0.00s ( 0.00s elapsed)
MUT time 10.38s ( 9.29s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.38s ( 9.30s elapsed)
%GC time 0.0% (0.1% elapsed)
Alloc rate 7,006 bytes per MUT second
Productivity 100.0% of total user, 111.6% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync_large_objects: 0
gen[1].sync_large_objects: 0
I think the answer is that "GHC will optimise the fib function so that it does no allocation, and computations that do no allocation cause problems for the RTS because the scheduler never gets to run and do load-balancing (which is necessary for parallelism)", as Simon wrote in this discussion group. I also found a good tutorial.

How can I get the current network interface throughput statistics on Linux/UNIX? [closed]

Tools such as MRTG provide network throughput / bandwidth graphs for the current network utilisation on specific interfaces, such as eth0. How can I return that information at the command line on Linux/UNIX?
Preferably this would be without installing anything other than what is available on the system as standard.
iftop does for network usage what top(1) does for CPU usage -- http://www.ex-parrot.com/~pdw/iftop/
I don't know how "standard" iftop is, but I was able to install it with yum install iftop on Fedora.
Got sar? Likely yes if you're using RHEL/CentOS.
No need for priv, dorky binaries, hacky scripts, libpcap, etc. Win.
$ sar -n DEV 1 3
Linux 2.6.18-194.el5 (localhost.localdomain) 10/27/2010
02:40:56 PM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
02:40:57 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:40:57 PM eth0 10700.00 1705.05 15860765.66 124250.51 0.00 0.00 0.00
02:40:57 PM eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:40:57 PM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
02:40:58 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:40:58 PM eth0 8051.00 1438.00 11849206.00 105356.00 0.00 0.00 0.00
02:40:58 PM eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:40:58 PM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
02:40:59 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:40:59 PM eth0 6093.00 1135.00 8970988.00 82942.00 0.00 0.00 0.00
02:40:59 PM eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
Average: lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: eth0 8273.24 1425.08 12214833.44 104115.72 0.00 0.00 0.00
Average: eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
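The rxbyt/s and txbyt/s columns are in bytes per second, while link speeds are usually quoted in bits per second. A minimal Python sketch of the conversion, applied to the eth0 averages from the output above (the helper name bytes_to_mbit is made up for illustration, and decimal megabits are assumed):

```python
def bytes_to_mbit(bytes_per_second):
    """Convert bytes/s to megabits/s (decimal: 1 Mbit = 1,000,000 bits)."""
    return bytes_per_second * 8 / 1_000_000


# eth0 averages copied from the sar output above
rx = bytes_to_mbit(12214833.44)
tx = bytes_to_mbit(104115.72)
print(f"eth0 rx: {rx:.2f} Mbit/s, tx: {tx:.2f} Mbit/s")
```

For the sample above this works out to a little under 98 Mbit/s received.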
I wrote this dumb script a long time ago, it depends on nothing but Perl and Linux≥2.6:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::HiRes qw(gettimeofday usleep);
my $dev = @ARGV ? shift : 'eth0';
my $dir = "/sys/class/net/$dev/statistics";
my %stats = do {
    opendir +(my $dh), $dir;
    local @_ = readdir $dh;
    closedir $dh;
    map +($_, []), grep !/^\.\.?$/, @_;
};
if (-t STDOUT) {
    while (1) {
        print "\033[H\033[J", run();
        my ($time, $us) = gettimeofday();
        my ($sec, $min, $hour) = localtime $time;
        {
            local $| = 1;
            printf '%-31.31s: %02d:%02d:%02d.%06d%8s%8s%8s%8s',
                $dev, $hour, $min, $sec, $us, qw(1s 5s 15s 60s)
        }
        usleep($us ? 1000000 - $us : 1000000);
    }
}
else {print run()}
sub run {
    map {
        chomp (my ($stat) = slurp("$dir/$_"));
        my $line = sprintf '%-31.31s:%16.16s', $_, $stat;
        $line .= sprintf '%8.8s', int (($stat - $stats{$_}->[0]) / 1)
            if @{$stats{$_}} > 0;
        $line .= sprintf '%8.8s', int (($stat - $stats{$_}->[4]) / 5)
            if @{$stats{$_}} > 4;
        $line .= sprintf '%8.8s', int (($stat - $stats{$_}->[14]) / 15)
            if @{$stats{$_}} > 14;
        $line .= sprintf '%8.8s', int (($stat - $stats{$_}->[59]) / 60)
            if @{$stats{$_}} > 59;
        unshift @{$stats{$_}}, $stat;
        pop @{$stats{$_}} if @{$stats{$_}} > 60;
        "$line\n";
    } sort keys %stats;
}
sub slurp {
    local @ARGV = @_;
    local @_ = <>;
    @_;
}
It just reads from /sys/class/net/$dev/statistics every second, and prints out the current numbers and the average rate of change:
$ ./net_stats.pl eth0
rx_bytes : 74457040115259 4369093 4797875 4206554 364088
rx_packets : 91215713193 23120 23502 23234 17616
...
tx_bytes : 90798990376725 8117924 7047762 7472650 319330
tx_packets : 93139479736 23401 22953 23216 23171
...
eth0 : 15:22:09.002216 1s 5s 15s 60s
^ current reading ^-------- averages ---------^
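For readers who do not speak Perl, the windowing idea can be sketched in Python. This is an illustrative re-implementation, not a port of the script: the class name RollingRates is made up, and it assumes update() is called once per second with the latest counter value. As in the script, the rate over an N-second window is the current reading minus the reading taken N samples ago, divided by N.

```python
from collections import deque


class RollingRates:
    """Average rate of change of a counter over several window lengths."""

    def __init__(self, windows=(1, 5, 15, 60)):
        self.windows = windows
        # Newest sample first; samples older than the longest window fall off.
        self.history = deque(maxlen=max(windows))

    def update(self, value):
        """Record a new reading and return {window_seconds: units_per_second}."""
        rates = {}
        for w in self.windows:
            # A window is reported only once enough samples have been seen.
            if len(self.history) >= w:
                rates[w] = (value - self.history[w - 1]) // w
        self.history.appendleft(value)
        return rates


# Feed a counter that grows by 100 per sample.
tracker = RollingRates()
results = [tracker.update(i * 100) for i in range(7)]
print(results[-1])
```

The first call returns an empty dict because there is nothing to compare against yet; the 5s, 15s and 60s columns fill in as the history grows, which matches how the script's output columns appear over time.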
nload is a great tool for monitoring bandwidth in real time and easily installed in Ubuntu or Debian with sudo apt-get install nload.
Device eth0 [10.10.10.5] (1/2):
=====================================================================================
Incoming:
. ...|
# ####|
.. |#| ... #####. .. Curr: 2.07 MBit/s
###.### #### #######|. . ## | Avg: 1.41 MBit/s
########|#########################. ### Min: 1.12 kBit/s
........ ################################### .### Max: 4.49 MBit/s
.##########. |###################################|##### Ttl: 1.94 GByte
Outgoing:
########## ########### ###########################
########## ########### ###########################
##########. ########### .###########################
########### ########### #############################
########### ###########..#############################
############ ##########################################
############ ##########################################
############ ########################################## Curr: 63.88 MBit/s
############ ########################################## Avg: 32.04 MBit/s
############ ########################################## Min: 0.00 Bit/s
############ ########################################## Max: 93.23 MBit/s
############## ########################################## Ttl: 2.49 GByte
Another excellent tool is iftop, also easily apt-get'able:
191Mb 381Mb 572Mb 763Mb 954Mb
└────────────┴──────────┴─────────────────────┴───────────┴──────────────────────
box4.local => box-2.local 91.0Mb 27.0Mb 15.1Mb
<= 1.59Mb 761kb 452kb
box4.local => box.local 560b 26.8kb 27.7kb
<= 880b 31.3kb 32.1kb
box4.local => userify.com 0b 11.4kb 8.01kb
<= 1.17kb 2.39kb 1.75kb
box4.local => b.resolvers.Level3.net 0b 58b 168b
<= 0b 83b 288b
box4.local => stackoverflow.com 0b 42b 21b
<= 0b 42b 21b
box4.local => 224.0.0.251 0b 0b 179b
<= 0b 0b 0b
224.0.0.251 => box-2.local 0b 0b 0b
<= 0b 0b 36b
224.0.0.251 => box.local 0b 0b 0b
<= 0b 0b 35b
─────────────────────────────────────────────────────────────────────────────────
TX: cum: 37.9MB peak: 91.0Mb rates: 91.0Mb 27.1Mb 15.2Mb
RX: 1.19MB 1.89Mb 1.59Mb 795kb 486kb
TOTAL: 39.1MB 92.6Mb 92.6Mb 27.9Mb 15.6Mb
Don't forget about the classic and powerful sar and netstat utilities on older *nix!
You could parse /proc/net/dev.
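A minimal Python sketch of that approach, under the assumption that the file has its usual layout: two header lines, then one line per interface of the form "name: <16 counters>", where the first counter is received bytes and the ninth is transmitted bytes. The function names parse_net_dev and byte_rates are made up for illustration.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)     # also copes with "eth0:1234"
        fields = data.split()
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_rates(before, after, seconds):
    """Per-interface (rx, tx) bytes per second between two snapshots."""
    return {
        name: ((after[name][0] - before[name][0]) / seconds,
               (after[name][1] - before[name][1]) / seconds)
        for name in after
        if name in before
    }
```

To use it, read /proc/net/dev twice with a pause in between (for example with open("/proc/net/dev").read() and time.sleep), parse both snapshots, and pass them to byte_rates together with the length of the pause.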
dstat - Combines vmstat, iostat, ifstat, netstat information and more
iftop - Amazing network bandwidth utility to analyse what is really happening on your eth
netio - Measures the net throughput of a network via TCP/IP
inq - CLI troubleshooting utility that displays info on storage, typically Symmetrix. By default, INQ returns the device name, Symmetrix ID, Symmetrix LUN, and capacity.
send_arp - Sends out an arp broadcast on the specified network device (defaults to eth0), reporting an old and new IP address mapping to a MAC address.
EtherApe - is a graphical network monitor for Unix modeled after etherman. Featuring link layer, IP and TCP modes, it displays network activity graphically.
iptraf - An IP traffic monitor that shows information on the IP traffic passing over your network.
More details:
http://felipeferreira.net/?p=1194
You can parse the output of ifconfig
I got another quick'n'dirty bash script for that:
#!/bin/bash
IF=$1
if [ -z "$IF" ]; then
    IF=`ls -1 /sys/class/net/ | head -1`
fi
RXPREV=-1
TXPREV=-1
echo "Listening $IF..."
while [ 1 == 1 ] ; do
    RX=`cat /sys/class/net/${IF}/statistics/rx_bytes`
    TX=`cat /sys/class/net/${IF}/statistics/tx_bytes`
    if [ $RXPREV -ne -1 ] ; then
        let BWRX=$RX-$RXPREV
        let BWTX=$TX-$TXPREV
        echo "Received: $BWRX B/s Sent: $BWTX B/s"
    fi
    RXPREV=$RX
    TXPREV=$TX
    sleep 1
done
It assumes that sleep 1 lasts exactly one second, which is not quite true, but it is good enough for a rough bandwidth assessment.
Thanks to @ephemient for the /sys/class/net/<interface>! :)
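If the inexact sleep matters, one way around it is to time the interval yourself and divide by the measured duration rather than by the nominal one. A minimal Python sketch of that idea follows; the function names are made up for illustration, and the clock and sleep parameters exist only so the logic can be exercised without waiting.

```python
import time


def read_sysfs_counter(path):
    """Read a single integer counter, e.g. .../statistics/rx_bytes."""
    with open(path) as f:
        return int(f.read())


def measure_rate(read_counter, interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Return the counter's rate of change per second.

    The rate is divided by the time that actually elapsed between the
    two readings, not by the requested interval.
    """
    t0, c0 = clock(), read_counter()
    sleep(interval)
    t1, c1 = clock(), read_counter()
    return (c1 - c0) / (t1 - t0)
```

A call such as measure_rate(lambda: read_sysfs_counter("/sys/class/net/eth0/statistics/rx_bytes")) would then give received bytes per second for eth0, assuming that interface exists on your machine.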
Besides iftop and iptraf, also check:
bwm-ng (Bandwidth Monitor Next Generation)
and/or
cbm (Color Bandwidth Meter)
ref: http://www.powercram.com/2010/01/bandwidth-monitoring-tools-for-ubuntu.html
If you just want to get the value, you can use a simple shell one-liner like this:
S=10; F=/sys/class/net/eth0/statistics/rx_bytes; X=`cat $F`; sleep $S; Y=`cat $F`; BPS="$(((Y-X)/S))"; echo $BPS
It will show you the average "received bytes per second" over a period of 10 seconds (you can change the period by changing the S=10 parameter, and you can measure transmitted BPS instead of received BPS by using tx_bytes instead of rx_bytes). Don't forget to change eth0 to the network device you want to monitor.
Of course, you are not limited to displaying the average rate (as mentioned in other answers, there are other tools that will show you much nicer output), but this solution is easily scriptable to do other things.
For example, the following shell script (split into multiple lines for readability) will run the offlineimap process only when the 5-minute average transmit speed drops below 10 kB/s (presumably, when some other bandwidth-consuming process finishes):
#!/bin/sh
S=300; F=/sys/class/net/eth0/statistics/tx_bytes
BPS=999999
while [ $BPS -gt 10000 ]
do
    X=`cat $F`; sleep $S; Y=`cat $F`; BPS="$(((Y-X)/S))";
    echo BPS is currently $BPS
done
offlineimap
Note that /sys/class/... is Linux-specific (which is OK, as the submitter did choose the linux tag) and needs a non-archaic kernel. The shell code itself is /bin/sh compatible (so not only bash, but also dash and other /bin/sh implementations will work), and /bin/sh is something that is really always installed.
I like iptraf, but you probably have to install it, and it does not seem to be actively maintained anymore.
I find dstat to be quite good. It has to be installed, though, and it gives you way more information than you need. netstat (netstat -s) will give you packet rates, but not bandwidth.
You can use iperf to benchmark network performance (maximum possible throughput).
See following links for details:
http://en.wikipedia.org/wiki/Iperf
https://iperf.fr/
https://code.google.com/p/iperf/
I couldn't get the ifconfig-parsing script to work for me on an AMI, so I got this to work instead. It measures received traffic averaged over 10 seconds:
date && rxstart=`ifconfig eth0 | grep bytes | awk '{print $2}' | cut -d : -f 2` && sleep 10 && rxend=`ifconfig eth0 | grep bytes | awk '{print $2}' | cut -d : -f 2` && difference=`expr $rxend - $rxstart` && echo "Received `expr $difference / 10` bytes per sec"
Sorry, it's ever so cheap and nasty but it worked!
ifconfig -a
ip -d link
ls -l /sys/class/net/ (physical and virtual devices)
route -n
If you want the output of (ifconfig -a) in json format you can use this (python)

Resources