I am creating a shell script that stores the output of the traceroute command, run against a user-entered host, in a file.
I want to extract the latency for each packet at each router and find the min, max, and average times for each packet.
If this is the output of traceroute:
1 176.221.87.1 (176.221.87.1) 1.474 ms 1.444 ms 1.390 ms
2 f126.broadband2.quicknet.se (92.43.37.126) 10.047 ms 19.868 ms 23.156 ms
3 10.5.12.1 (10.5.12.1) 24.098 ms 24.340 ms 25.311 ms
I need to find the max of all latencies for the first packet, which in this case is 24.098 ms. Similarly, the min is 1.474 ms and the average for the first packet is 11.873 ms. I need to do this for each packet.
I want output like:
1 176.221.87.1 (176.221.87.1) 1.474 ms 1.444 ms 1.390 ms
2 f126.broadband2.quicknet.se (92.43.37.126) 10.047 ms 19.868 ms 23.156 ms
3 10.5.12.1 (10.5.12.1) 24.098 ms 24.340 ms 25.311 ms
For the first packet:
Minimum: 1.474 ms
Maximum: 24.098 ms
Average: 11.873 ms
.
.
and so on.
I am not able to come up with an awk statement to do this. Perhaps there is another way?
Any input would be really helpful.
If my understanding is correct, you want something like this:
NR == 1 {
    n = $4              # set min to first value
    u = $5              # keep time unit for later
}
{
    s += $4             # accumulate sum for the average
    c++                 # count lines
    if ($4 > m) {       # update max
        m = $4
    }
    if ($4 < n) {       # update min
        n = $4
    }
}
END {
    print "Minimum: " n, u "\nMaximum: " m, u "\nAverage: " s / c, u
}
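A possible way to run it (a sketch: the script name stats.awk, the host prompt, and the tee step are my assumptions, not from the question). On many systems the "traceroute to ..." banner goes to stderr, so only the hop lines reach awk; if yours prints it to stdout, filter it out first.
# prompt for the host, keep a copy of the trace in trace.out, and summarize it
read -r -p "Host: " host
traceroute -q 3 "$host" | tee trace.out | awk -f stats.awk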
You can extend this easily to all three columns:
NR == 1 {
    split("4,6,8", ix, ",")              # the three latency columns
    for (i in ix) n[ix[i]] = $(ix[i])    # initialize the minima with the first values
    u = $5                               # keep time unit for later
}
{
    c++
    for (i in ix) {
        p = ix[i]
        s[p] += $p                       # accumulate sums for the averages
        if ($p > m[p]) m[p] = $p         # update max
        if ($p < n[p]) n[p] = $p         # update min
    }
}
END {
    mins = "Minimum (" u "):"
    maxs = "Maximum (" u "):"
    avgs = "Average (" u "):"
    # iterate by index so the columns stay in 4,6,8 order
    for (i = 1; i in ix; i++) {
        mins = mins FS n[ix[i]]
        maxs = maxs FS m[ix[i]]
        avgs = avgs FS s[ix[i]] / c
    }
    print mins
    print maxs
    print avgs
}
This should print:
Minimum (ms): 1.474 1.444 1.390
Maximum (ms): 24.098 24.340 25.311
Average (ms): 11.873 15.2173 16.619
I would like to transpose the unformatted input below into the formatted output shown; since it uses multiple delimiters I got stuck on how to proceed and am looking for your suggestions.
Sample_Input.txt
UMTSGSMPLMNCallDataRecord
callForwarding
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57526'D
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57573'D
originalCalledNumber 149212345678'TBCD
redirectingNumber 149387654321'TBCD
!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 0 52'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 45761'D
tariffClass 2'D
timeForStartOfCharge 9 46 58'BCD
calledSubscriberIMSI 21329701412F'TBCD
I searched previous questions and got some relevant input from Mr. Porges's answer:
#!/bin/sh
# split lines on " " and use "," for output field separator
awk 'BEGIN { FS = " "; i = 0; h = 0; ofs = "," }
# empty line - increment item count and skip it
/^\s*$/ { i++; next }
# normal line - add the item to the object and the header to the header list
# and keep track of first-seen order of headers
{
    current[i, $1] = $2
    if (!($1 in headers)) { headers_ordered[h++] = $1 }
    headers[$1]
}
END {
    h--
    # print headers
    for (k = 0; k <= h; k++) {
        printf "%s", headers_ordered[k]
        if (k != h) { printf "%s", ofs }
    }
    print ""
    # print the items for each object
    for (j = 0; j <= i; j++) {
        for (k = 0; k <= h; k++) {
            printf "%s", current[j, headers_ordered[k]]
            if (k != h) { printf "%s", ofs }
        }
        print ""
    }
}' Sample_Input.txt
I am getting the output below:
UMTSGSMPLMNCallDataRecord,callForwarding,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,mSTerminating,originalCalledNumber,redirectingNumber,!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
,,,,,,,,,,,
,,0,09011B'H,57526'D,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,57573'D,,149212345678'TBCD,149387654321'TBCD,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,45761'D,,,,,2'D,9,21329701412F'TBCD
,,,,,,,,,,,
Where it gets stuck:
(a) I need to handle the start of a block: a line like "UMTSGSMPLMNCallDataRecord" (with an empty value field) should become the header of the first column, and the word on the following line (callForwarding/mSTerminating, also with an empty value field) should become that row's value in it.
(b) I need to strip the alphabetic suffixes from the column fields, i.e. 09011B'H should become 09011 and 149212345678'TBCD should become 149212345678.
Expected Output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52,09011,45761,,,2,9 46 58,21329701412
Edit: I have tried it on the same input as in Sample_Input.txt above.
Discussion
This is a complex problem, since the records are non-uniform: some have missing fields. Since each record spans several lines, we can handle it with AWK's multi-line record feature by setting the RS (record separator) and FS (field separator) variables.
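As a minimal illustration of this paragraph mode (an assumption here: the records in Sample_Input.txt are separated by blank lines, which the empty-line handling in the first attempt suggests), the one-liner below should print each record's number together with its second, still untrimmed, line:
awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $2 }' Sample_Input.txt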
Next, we need to deal with collecting the header fields. I don't have a good way to do this, so I hard-code the header line.
Once we establish the order in the header, we need a way to extract a specific field from the record; we accomplish that via the function get_column(). This function also strips off the non-numeric data at the end, per your requirement.
One last thing: we need to trim the whitespace off the first column ($2) using the homemade trim() function.
Command Line
I placed my code in make_csv.awk. To run it:
awk -f make_csv.awk Sample_Input.txt
File make_csv.awk
BEGIN {
    # Next two lines: each record spans multiple lines, and each line is a
    # separate field
    RS = ""
    FS = "\n"
    print "UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI"
}
function get_column(name,    i, f, len, result) {
    # i, f, len and result are "local" variables
    for (i = 1; i <= NF; i++) {
        len = split($i, f, " ")
        if (f[1] == name) {
            result = f[2]
            for (i = 3; i <= len; i++) {
                result = result " " f[i]
            }
            # Remove the trailing non-numeric data
            sub(/[a-zA-Z']+/, "", result)
            return result
        }
    }
    return ""    # column not found, return empty string
}
# Remove leading and trailing spaces
function trim(s) {
    sub(/[ \t]+$/, "", s)
    sub(/^[ \t]+/, "", s)
    return s
}
/UMTSGSMPLMNCallDataRecord/ {
    print trim($2) \
        "," get_column("chargeableDuration") \
        "," get_column("dateForStartOfCharge") \
        "," get_column("recordSequenceNumber") \
        "," get_column("originalCalledNumber") \
        "," get_column("redirectingNumber") \
        "," get_column("tariffClass") \
        "," get_column("timeForStartOfCharge") \
        "," get_column("calledSubscriberIMSI") \
        ""
}
Update
I tried out my AWK script against AVN's latest input and got the following output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52,09011,45761,,,2,9 46 58,21329701412
I have a file on a Linux server that has data like:
a 22
a 10
a 17
a 51
a 33
b 51
b 47
c 33
I want a shell script or Linux commands to find the min, avg, 90th percentile, max, and count for each value in column 1.
Example:
for a min = 10, avg = 26, 90% = 33, max = 51, and count = 5.
Here is a version that also computes the 90th percentile, using gawk.
The definition of percentile used is the one given by Wikipedia, called nearest rank. For example, for group a the sorted values are 10, 17, 22, 33, 51 (n = 5), so the index is round(0.9 * 5 + 0.5) = 5 and the 90th percentile is 51, which is why the output below differs from the 33 in the question.
The function round can be found here.
#!/bin/bash
gawk '
function round(x,    ival, aval, fraction)
{
    ival = int(x)    # integer part, int() truncates
    # see if there is a fractional part
    if (ival == x)   # no fraction
        return ival  # ensure no decimals
    if (x < 0) {
        aval = -x    # absolute value
        ival = int(aval)
        fraction = aval - ival
        if (fraction >= .5)
            return int(x) - 1    # -2.5 --> -3
        else
            return int(x)        # -2.3 --> -2
    } else {
        fraction = x - ival
        if (fraction >= .5)
            return ival + 1
        else
            return ival
    }
}
# the following block processes all the lines
# and populates counters and values
{
    if ($1 in counters) {
        counters[$1]++;
    } else {
        counters[$1] = 1;
    }
    i = counters[$1];
    values[$1, i] = $2;
}
END {
    for (c in counters) {
        delete tmp;
        min = values[c, 1];
        max = values[c, 1];
        sum = values[c, 1];
        tmp[1] = values[c, 1];
        for (i = 2; i <= counters[c]; i++) {
            if (values[c, i] < min) min = values[c, i];
            if (values[c, i] > max) max = values[c, i];
            sum += values[c, i];
            tmp[i] = values[c, i];
        }
        # The following 3 lines compute the percentile.
        n = asort(tmp, tmp_sorted);
        idx = round(0.9 * n + 0.5);    # nearest rank definition
        percentile = tmp_sorted[idx];
        # Output of the statistics for this group.
        printf "for %s min = %d, avg = %f, 90 = %d, max = %d, count = %d\n", c, min, (sum / counters[c]), percentile, max, counters[c];
    }
}'
To run it, execute:
./stats.sh < input.txt
I am assuming that the above script is named stats.sh and your input is saved in input.txt.
The output is:
for a min = 10, avg = 26.600000, 90 = 51, max = 51, count = 5
for b min = 47, avg = 49.000000, 90 = 51, max = 51, count = 2
for c min = 33, avg = 33.000000, 90 = 33, max = 33, count = 1
Here is the explanation:
counters is an associative array: the key is the value in column 1,
and the value is the number of entries found in the input for that
key.
values is a two-dimensional (value_in_column_one, counter_per_value)
array that keeps all the values grouped by the value in column one.
At the end of the script, the outermost loop goes through all the values
found in column 1. The innermost for loop analyses all the values belonging
to a particular value in column 1 and computes all the statistics.
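A minimal sketch of that two-dimensional array idea in isolation (the file name input.txt is an assumption):
awk '{ v[$1, ++n[$1]] = $2 } END { print v["a", 1], v["a", 2], n["a"] }' input.txt
With the sample data this should print 22 10 5: the first two values stored under key a, and their count.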
For lines starting with a, here's an awk script.
$ echo 'a 22
a 10
a 17
a 51
a 33
b 51
b 47
c 33' | awk '/^a/ { n++; s += $2 } END { print n; print s; print s/n }'
5
133
26.6
Using awk (note that min must be initialized from column 2, not column 1):
awk 'NR == 1 { min = $2 } { sum += $2; if (min >= $2) min = $2; if (max < $2) max = $2 }
END { printf("max=%d,min=%d,count=%d,avg=%.2f\n", max, min, NR, (sum/NR)) }' file
max=51,min=10,count=8,avg=33.00
EDIT:
awk '$1 != v {
    if (NR > 1)
        printf("For %s max=%d,min=%d,count=%d,avg=%.2f\n", v, max, min, k, (sum/k));
    v = $1;
    min = $2;
    k = sum = max = 0
}
{
    k++;
    sum += $2;
    if (min > $2)
        min = $2;
    if (max < $2)
        max = $2
}
END {
    printf("For %s max=%d,min=%d,count=%d,avg=%.2f\n", v, max, min, k, (sum/k))
}' < <(sort -n -k1,2 f)    # f is the input file
OUTPUT:
For a max=51,min=10,count=5,avg=26.60
For b max=51,min=47,count=2,avg=49.00
For c max=33,min=33,count=1,avg=33.00
May I introduce you to the problem that destroyed my weekend. I have biological data in 4 columns
#ID:::12345/1 ACGACTACGA text !"#$%vwxyz
#ID:::12345/2 TATGACGACTA text :;<=>?VWXYZ
I would like to use awk to edit the first column, replacing the characters : and / with -.
I would also like to convert the string in the last column to a comma-separated string of decimals corresponding to each individual ASCII character (any character ranging from ASCII 33 to 126).
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
The first part is easy, but I'm stuck on the second. I've tried using awk ordinal functions and sprintf; I can only get the former to work on the first character in the string, and I can only get the latter to convert hexadecimal to decimal, not handle spaces. I also tried this from bash:
$ od -t d1 test3 | awk 'BEGIN{OFS=","}{i = $1; $1 = ""; print $0}'
But I don't know how to call this from within awk.
I would prefer to use awk as I have some downstream manipulations that can also be done in awk.
Many thanks in advance
Using the ordinal functions from the awk manual, you can do it like this:
awk -f ord.awk --source '{
    # replace : and / with - in the first field
    gsub(/[:\/]/, "-", $1)
    # calculate the ordinals by looping over the characters in the fourth field
    res = ord($4)
    for (i = 2; i <= length($4); i++) {
        res = res "," ord(substr($4, i))
    }
    $4 = res
}1' file
Output:
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
Here is ord.awk (taken as is from: http://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html)
# ord.awk --- do ord and chr
# Global identifiers:
#     _ord_:     numerical values indexed by characters
#     _ord_init: function to initialize _ord_
BEGIN { _ord_init() }
function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7)    # BEL is ascii 7
    if (low == "\a") {        # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {                  # ebcdic(!)
        low = 0
        high = 255
    }
    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}
function chr(c)
{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
}
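A quick sanity check of the library (same file name as above): the following should print 65 B.
gawk -f ord.awk --source 'BEGIN { print ord("A"), chr(66) }'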
If you don't want to include the whole of ord.awk, you can do it like this:
awk 'BEGIN { _ord_init() }
function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7)    # BEL is ascii 7
    if (low == "\a") {        # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {                  # ebcdic(!)
        low = 0
        high = 255
    }
    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
{
    # replace : and / with - in the first field
    gsub(/[:\/]/, "-", $1)
    # calculate the ordinals by looping over the characters in the fourth field
    res = _ord_[substr($4, 1, 1)]
    for (i = 2; i <= length($4); i++) {
        res = res "," _ord_[substr($4, i, 1)]
    }
    $4 = res
}1' file
Perl solution:
perl -lnae '$F[0] =~ s%[:/]%-%g; $F[-1] =~ s/(.)/ord($1) . ","/ge; chop $F[-1]; print "@F";' < input
The first substitution replaces : and / in the first field with a dash; the second replaces each character in the last field with its ordinal value followed by a comma; chop removes the trailing comma.
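For comparison, here is a sketch of a variant that builds the list with join instead of chopping a trailing comma (same assumptions about the input file):
perl -lane '$F[0] =~ tr%:/%-%; $F[-1] = join ",", map { ord } split //, $F[-1]; print "@F"' < input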
I am trying to parse some CSV files using awk.
The CSV file I am working with looks like this:
fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600
I need to find the minimum, maximum, and average figures for columns 2 and 3, both across all the data and for each individual function.
I realize you're not looking for a non-awk solution, but I thought I'd share some R code to demonstrate how seamless it is to summarize data.
# read in data
awk <- read.table(textConnection("fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600"), header = TRUE, sep = ",")
# split according to the function
awk.split <- split(awk, awk$fnName)
# for each function, calculate full summary for columns 2 and 3
lapply(X = awk.split, FUN = function(x) {
summary(x[2:3])
})
Result:
$getInfo
minAccessTime maxAccessTime
Min. :200.0 Min. :600
1st Qu.:237.5 1st Qu.:600
Median :265.0 Median :610
Mean :257.5 Mean :630
3rd Qu.:285.0 3rd Qu.:640
Max. :300.0 Max. :700
$getStage
minAccessTime maxAccessTime
Min. :600.0 Min. : 800.0
1st Qu.:600.0 1st Qu.: 800.0
Median :600.0 Median : 800.0
Mean :633.3 Mean : 866.7
3rd Qu.:650.0 3rd Qu.: 900.0
Max. :700.0 Max. :1000.0
This awk script should give you all the skills necessary to get what you want.
It basically runs through all lines in your input file, skipping the first one (the CSV header, whose second field is minAccessTime).
On all other records, it updates the count, minimum-of-minima, maximum-of-minima, minimum-of-maxima, maximum-of-maxima, sum-of-minima, and sum-of-maxima for the overall data plus each individual function name.
The overall figures are stored in count, min_min, max_min, min_max, max_max, sum_min and sum_max. The per-function figures are stored in associative arrays with similar names (with _arr appended).
Then, once all records are read, the END section outputs the information.
NR > 1 {
    count++;
    sum_min += $2;
    sum_max += $3;
    if (count == 1) {
        min_min = $2;
        max_min = $2;
        min_max = $3;
        max_max = $3;
    } else {
        if ($2 < min_min) { min_min = $2; }
        if ($2 > max_min) { max_min = $2; }
        if ($3 < min_max) { min_max = $3; }
        if ($3 > max_max) { max_max = $3; }
    }
    count_arr[$1]++;
    sum_min_arr[$1] += $2;
    sum_max_arr[$1] += $3;
    if (count_arr[$1] == 1) {
        min_min_arr[$1] = $2;
        max_min_arr[$1] = $2;
        min_max_arr[$1] = $3;
        max_max_arr[$1] = $3;
    } else {
        if ($2 < min_min_arr[$1]) { min_min_arr[$1] = $2; }
        if ($2 > max_min_arr[$1]) { max_min_arr[$1] = $2; }
        if ($3 < min_max_arr[$1]) { min_max_arr[$1] = $3; }
        if ($3 > max_max_arr[$1]) { max_max_arr[$1] = $3; }
    }
}
END {
    print "Overall:"
    print " Total records = " count
    print " Sum of minima = " sum_min
    print " Sum of maxima = " sum_max
    if (count > 0) {
        print " Min of minima = " min_min
        print " Max of minima = " max_min
        print " Min of maxima = " min_max
        print " Max of maxima = " max_max
        print " Avg of minima = " sum_min / count
        print " Avg of maxima = " sum_max / count
    }
    for (task in count_arr) {
        print "Function " task ":"
        print " Total records = " count_arr[task]
        print " Sum of minima = " sum_min_arr[task]
        print " Sum of maxima = " sum_max_arr[task]
        print " Min of minima = " min_min_arr[task]
        print " Max of minima = " max_min_arr[task]
        print " Min of maxima = " min_max_arr[task]
        print " Max of maxima = " max_max_arr[task]
        print " Avg of minima = " sum_min_arr[task] / count_arr[task]
        print " Avg of maxima = " sum_max_arr[task] / count_arr[task]
    }
}
Store that script in qq.awk and your sample data in qq.in, then run:
awk -F, -f qq.awk qq.in
This generates the following output, which I'm relatively certain will give you every piece of information you need:
Overall:
Total records = 7
Sum of minima = 2930
Sum of maxima = 5120
Min of minima = 200
Max of minima = 700
Min of maxima = 600
Max of maxima = 1000
Avg of minima = 418.571
Avg of maxima = 731.429
Function getStage:
Total records = 3
Sum of minima = 1900
Sum of maxima = 2600
Min of minima = 600
Max of minima = 700
Min of maxima = 800
Max of maxima = 1000
Avg of minima = 633.333
Avg of maxima = 866.667
Function getInfo:
Total records = 4
Sum of minima = 1030
Sum of maxima = 2520
Min of minima = 200
Max of minima = 300
Min of maxima = 600
Max of maxima = 700
Avg of minima = 257.5
Avg of maxima = 630
If you insist on Awk...
$ awk -F, '
> func newmin(fname, array, value) { if (!(fname in array) || array[fname]>value) array[fname] = value }
> func newmax(fname, array, value) { if (!(fname in array) || array[fname]<value) array[fname] = value }
> NR>1 {
> newmin($1,min2,$2)
> newmin("global",min2,$2)
> newmax($1,max2,$2)
> newmax("global",max2,$2)
> newmin($1,min3,$3)
> newmin("global",min3,$3)
> newmax($1,max3,$3)
> newmax("global",max3,$3)
> }
> END { for (fname in min2) { print fname, min2[fname], max2[fname], min3[fname], max3[fname] } }'
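For the sample data, that END loop should print the min/max of column 2 followed by the min/max of column 3 for each function plus the global pseudo-entry; since for-in order is unspecified, the lines may appear in any order:
getInfo 200 300 600 700
getStage 600 700 800 1000
global 200 700 600 1000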