Transpose column to row into formatted output - linux
I would like to transpose the non-formatted input below into the formatted output shown further down. Since the values use multiple delimiters,
I got stuck and am looking for your suggestions.
Sample_Input.txt
UMTSGSMPLMNCallDataRecord
callForwarding
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57526'D
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57573'D
originalCalledNumber 149212345678'TBCD
redirectingNumber 149387654321'TBCD
!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 0 52'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 45761'D
tariffClass 2'D
timeForStartOfCharge 9 46 58'BCD
calledSubscriberIMSI 21329701412F'TBCD
I searched previous questions and found some relevant input from Mr. Porges' answer:
#!/bin/sh
# split lines on " " and use "," for output field separator
awk 'BEGIN { FS = " "; i = 0; h = 0; ofs = "," }
# empty line - increment item count and skip it
/^\s*$/ { i++ ; next }
# normal line - add the item to the object and the header to the header list
# and keep track of first seen order of headers
{
    current[i, $1] = $2
    if (!($1 in headers)) { headers_ordered[h++] = $1 }
    headers[$1]
}
END {
    h--
    # print headers
    for (k = 0; k <= h; k++) {
        printf "%s", headers_ordered[k]
        if (k != h) { printf "%s", ofs }
    }
    print ""
    # print the items for each object
    for (j = 0; j <= i; j++) {
        for (k = 0; k <= h; k++) {
            printf "%s", current[j, headers_ordered[k]]
            if (k != h) { printf "%s", ofs }
        }
        print ""
    }
}' Sample_Input.txt
I am getting the output below:
UMTSGSMPLMNCallDataRecord,callForwarding,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,mSTerminating,originalCalledNumber,redirectingNumber,!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
,,,,,,,,,,,
,,0,09011B'H,57526'D,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,57573'D,,149212345678'TBCD,149387654321'TBCD,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,45761'D,,,,,2'D,9,21329701412F'TBCD
,,,,,,,,,,,
Where I am stuck:
(a). I need to handle the start of a block: the line "UMTSGSMPLMNCallDataRecord" has no value, and the next line (callForwarding/mSTerminating etc.) also has no value. The first word ("UMTSGSMPLMNCallDataRecord") should become the first column header, and the word on the next line (callForwarding/mSTerminating) should become that row's value.
(b). I need to strip the alphabetic suffixes from the column fields, i.e. 09011B'H becomes 09011, and 149212345678'TBCD becomes 149212345678.
Expected Output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52,09011,45761,,,2,9 46 58,21329701412
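As a side note, requirement (b) above, stripping the trailing alphabetic tag from each value, can be sketched on its own with awk's sub() (an illustration only, separate from the full solution):

```shell
# strip the trailing run of letters/apostrophe from each sample value
printf "%s\n%s\n" "09011B'H" "149212345678'TBCD" |
awk '{ sub(/[a-zA-Z'\'']+$/, "", $1); print $1 }'
# prints:
# 09011
# 149212345678
```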
Edit: I have tried it on the input below:
UMTSGSMPLMNCallDataRecord
callForwarding
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57526'D
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57573'D
originalCalledNumber 149212345678'TBCD
redirectingNumber 149387654321'TBCD
!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 0 52'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 45761'D
tariffClass 2'D
timeForStartOfCharge 9 46 58'BCD
calledSubscriberIMSI 21329701412F'TBCD
Discussion
This is a complex problem since the records are non-uniform: some have missing fields. Because each record spans several lines, we can handle it using AWK's multi-line record feature, by setting the RS (record separator) and FS (field separator) variables.
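As a minimal, self-contained illustration of this multi-line record mode (separate from the full script below): setting RS to the empty string makes each blank-line-separated block one record, and FS = "\n" makes each line of that block one field:

```shell
printf 'a 1\nb 2\n\nc 3\n' |
awk 'BEGIN { RS = ""; FS = "\n" } { print "record " NR ": " NF " fields, first = " $1 }'
# record 1: 2 fields, first = a 1
# record 2: 1 fields, first = c 3
```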
Next, we need to collect the header fields. I don't have a good way to do this, so I hard-code the header line.
Once the header order is established, we need a way to extract a specific field from a record, which we accomplish via the function get_column(). This function also strips off the non-numeric data at the end, per your requirement.
One last thing: we need to trim the whitespace off the first column ($2), using the homemade trim() function.
Command Line
I placed my code in make_csv.awk. To run it:
awk -f make_csv.awk Sample_Input.txt
File make_csv.awk
BEGIN {
    # Next two lines: each record spans multiple lines, and each line
    # within a record is a separate field
    RS = ""
    FS = "\n"
    print "UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI"
}

function get_column(name,    i, f, len, result) {
    # i, f, len and result are "local" variables
    for (i = 1; i <= NF; i++) {
        len = split($i, f, " ")
        if (f[1] == name) {
            result = f[2]
            for (i = 3; i <= len; i++) {
                result = result " " f[i]
            }
            # Remove the trailing non-numeric data
            sub(/[a-zA-Z']+/, "", result)
            return result
        }
    }
    return ""    # column not found, return empty string
}

# Remove leading and trailing spaces
function trim(s) {
    sub(/[ \t]+$/, "", s)
    sub(/^[ \t]+/, "", s)
    return s
}

/UMTSGSMPLMNCallDataRecord/ {
    print trim($2) \
        "," get_column("chargeableDuration") \
        "," get_column("dateForStartOfCharge") \
        "," get_column("recordSequenceNumber") \
        "," get_column("originalCalledNumber") \
        "," get_column("redirectingNumber") \
        "," get_column("tariffClass") \
        "," get_column("timeForStartOfCharge") \
        "," get_column("calledSubscriberIMSI")
}
Update
I tried out my AWK script against AVN's latest input and got the following output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52,09011,45761,,,2,9 46 58,21329701412
Related
How to sort the characters of a word using awk?
I can't seem to find any way of sorting a word based on its characters in awk. For example, if the word is "hello" then its sorted equivalent is "ehllo". How can I achieve this in awk?
With GNU awk, using PROCINFO["sorted_in"] (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning) and splitting with a null separator, which results in an array of chars:

$ echo 'hello' | awk '
    BEGIN { PROCINFO["sorted_in"]="#val_str_asc" }
    {
        split($1,chars,"")
        word = ""
        for (i in chars) {
            word = word chars[i]
        }
        print word
    }
'
ehllo

$ echo 'hello' | awk -v ordr='#val_str_asc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ehllo

$ echo 'hello' | awk -v ordr='#val_str_desc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ollhe
Another option is a Decorate-Sort-Undecorate with sed. Essentially, you use sed to break "hello" into one character per line (decorating each character with a newline '\n') and pipe the result to sort. You then use sed to do the reverse (undecorate each line by removing the '\n') to join the lines back together.

printf "hello" | sed 's/\(.\)/\1\n/g' | sort | sed '{:a N;s/\n//;ta}'
ehllo

There are several approaches you can use, but this one is shell friendly; the behavior requires GNU sed.
This would be more doable with gawk, which includes the asort function to sort an array:

awk 'BEGIN{FS=OFS=ORS=""}{split($0,a);asort(a);for(i in a)print a[i]}'<<<hello

This outputs:

ehllo

Demo: https://ideone.com/ylWQLJ
You need to write a function to sort the letters in a word (see: https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html):

function siw(word, result, arr, arrlen, arridx) {
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    return result
}

And define a comparison sub-function to compare two words (see: https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html):

function compare_by_letters(i1, v1, i2, v2, left, right) {
    left = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}

And use this function with awk's sort function:

asort(array_test, array_test_result, "compare_by_letters")

Then the sample program is:

function siw(word, result, arr, arrlen, arridx) {
    result = hash_word[word]
    if (result != "") {
        return result
    }
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    hash_word[word] = result
    return result
}

function compare_by_letters(i1, v1, i2, v2, left, right) {
    left = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}

{
    array_test[i++] = $0
}

END {
    alen = asort(array_test, array_test_result, "compare_by_letters")
    for (aind = 1; aind <= alen; aind++) {
        print array_test_result[aind]
    }
}

Executed like this:

echo -e "fail\nhello\nborn" | awk -f sort_letter.awk

Output:

fail
born
hello

Note that this siw is already the memoized version: if you have a big input, storing each word's sorted form in hash_word avoids recomputing it on every comparison.
Here's a very unorthodox method for a quick-n-dirty approach, if you really want to sort "hello" into "ehllo":

mawk/mawk2/gawk 'BEGIN {
    FS = "^$"
    # build ref as "AaBbCc..." etc; chr(65) = ascii "A"
    for (x = 65; x < 91; x++) {
        ref = sprintf("%s%c%c", ref, x, x + 32)
    }
}
/^[[:alpha:]]$/ { print }
/[[:alpha:]][[:alpha:]]+/ {
    # for gawk/nawk, feel free to change that to /[[:alpha:]]{2,}/
    # the 2+ condition is to prevent wasting time
    # sorting single-letter words "A" and "I"
    s = ""; x = 1; len = length(inp = $0)
    while (len && (x < 53)) {
        if (inp ~ (ch = substr(ref, x++, 1))) {
            while (sub(ch, "", inp)) {
                s = s ch; len -= 1
            }
        }
    }
    print s
}'

I'm aware it's an extremely inefficient way of doing selection sort. The potential time-savings stem from ending the loop the moment all letters are handled, instead of iterating over all 52 letters every time. The downside is that it doesn't pre-profile the input (e.g. if you detect that a row is only lower-case, you can speed it up with a lowercase-only loop). The upside is that it eliminates the need for custom functions, eliminates any gawk dependencies, and also eliminates the need to split every row into an array (or every character into its own field). Technically one can set FS to the null string, which makes NF the string length, but at times that can be slow if the input is large. If you need Unicode support, a match()-based approach is more desirable. The (x < 53) condition prevents runaway infinite loops in case the input isn't pure ASCII letters.
Printing ten values from each array alternating between them until both are empty
Suppose I have the following script, which iterates over two arrays and prints the values alternating between them. I feed the script the following file:

1 2
1 2
1 2
1 2
1 2
1 2

Awk script:

BEGIN { FS = "\t" }
{ a[o++]=$1; b[c++]=$2 }
END {
    while (++co <= 5) {
        for (k = 1+count; k <= 2+count; k++) {
            if (length(a) >= 0) { print a[k]; delete a[k] }
        }
        for (ki = 1+count; ki <= 2+count; ki++) {
            if (length(b) >= 0) { print b[ki]; delete b[ki] }
        }
        count = k
    }
}

And I expect the output to be:

1
1
2
2
1
1
2
2
1
1
2
2

But what I get is:

1
1
2
2
1
1
2
2

(followed by blank lines, but they are not my issue). So how do I make it run until both arrays are empty and everything is printed? And how do I make it work even if the arrays are of different sizes, e.g. one contains 20 more values than the other? Then just those values should be printed until the array is empty.
wrt how to make it run until both arrays are empty - there's no need to delete the array contents. wrt just these values should be printed until the array is empty - "the array is empty" could mean print just the values at the indices present in both arrays ("the first array is empty") or it could mean print all the values from both arrays ("both arrays are empty"). The script below assumes the latter but is easily tweaked to do the former (change || to &&).

$ cat tst.awk
{ a[++maxA]=$1; b[++maxB]=$2 }
END {
    while ( (prevEnd<maxA) || (prevEnd<maxB) ) {
        prt(a)
        prevEnd = prt(b)
    }
}
function prt(arr, idx) {
    for (idx=prevEnd+1; idx<=prevEnd+2; idx++) {
        if (idx in arr) {
            print arr[idx]
        }
    }
    return (idx-1)
}

$ awk -f tst.awk file
1
1
2
2
1
1
2
2
1
1
2
2
Awk: given list of users with session data, output list of users with specific data
Not sure how to ask this question, thus I don't know how to search for it on Google or SO. Let me just show you the given data. By the way, this is just an awk exercise, it's not homework. I've been trying to solve this off and on for 2 days now. Below is an example:

Mon Sep 15 12:17:46 1997
        User-Name = "wynng"
        NAS-Identifier = 207.238.228.11
        NAS-Port = 20104
        Acct-Status-Type = Start
        Acct-Delay-Time = 0
        Acct-Session-Id = "239736724"
        Acct-Authentic = RADIUS
        Client-Port-DNIS = "3571800"
        Framed-Protocol = PPP
        Framed-Address = 207.238.228.57

Mon Sep 15 12:19:40 1997
        User-Name = "wynng"
        NAS-Identifier = 207.238.228.11
        NAS-Port = 20104
        Acct-Status-Type = Stop
        Acct-Delay-Time = 0
        Acct-Session-Id = "239736724"
        Acct-Authentic = RADIUS
        Acct-Session-Time = 115
        Acct-Input-Octets = 3915
        Acct-Output-Octets = 3315
        Acct-Input-Packets = 83
        Acct-Output-Packets = 66
        Ascend-Disconnect-Cause = 45
        Ascend-Connect-Progress = 60
        Ascend-Data-Rate = 28800
        Ascend-PreSession-Time = 40
        Ascend-Pre-Input-Octets = 395
        Ascend-Pre-Output-Octets = 347
        Ascend-Pre-Input-Packets = 10
        Ascend-Pre-Output-Packets = 11
        Ascend-First-Dest = 207.238.228.255
        Client-Port-DNIS = "3571800"
        Framed-Protocol = PPP
        Framed-Address = 207.238.228.57

So the log file contains the above data for various users. I specifically pasted this to show that this user had a login, Acct-Status-Type = Start, and a logoff, Acct-Status-Type = Stop. This counts as one session. Thus I need to generate the following output:

User: "wynng"
Number of Sessions: 1
Total Connect Time: 115
Input Bandwidth Usage: 83
Output Bandwidth Usage: 66

The problem I have is keeping the info attached to the user. Each entry in the log file has the same field names when the session is in Stop, so I can't just use:

/Acct-Input-Packets/ { inPackets = $3 }
/Acct-Output-Packets/ { outPackets = $3 }

Each iteration through the data would overwrite the past values.
What I want to do is: if I find a User-Name entry and that entry has a Stop, then I want to record the input/output packet values for that user. This is where I get stumped. For the session counts I was thinking of saving the User-Names in an array and then, in the END{} block, counting the duplicates and dividing by 2, flooring the result if odd. I don't necessarily want the answer, but maybe some hints/guidance or perhaps a simple example which I could expand on.
You can check each line for:

- a date pattern: /\w+\s\w+\s[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}\s[0-9]{4}/
- a user name value: /User-Name\s+=\s+\"\w+\"/
- a status value: /Acct-Status-Type\s+=\s+\w+/
- an input packet value: /Acct-Input-Packets\s+=\s[0-9]+/
- an output packet value: /Acct-Output-Packets\s+=\s[0-9]+/
- an empty line: /^$/

Once you have defined what you are looking for (the patterns above), it's just a matter of conditions and storing all those data in some arrays. In the following example, I store each value type above in a dedicated array for each type, with a count index that is incremented when an empty line /^$/ is detected:

awk 'BEGIN{
    count = 1;
    i = 1;
}{
    if ($0 ~ /\w+\s\w+\s[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}\s[0-9]{4}/){
        match($0, /\w+\s(\w+)\s([0-9]{2})\s([0-9]{2}):([0-9]{2}):([0-9]{2})\s([0-9]{4})/, n);
        match("JanFebMarAprMayJunJulAugSepOctNovDec",n[1])
        n[1] = sprintf("%02d",(RSTART+2)/3);
        arr[count]=mktime(n[6] " " n[1] " " n[2] " " n[3] " " n[4] " " n[5]);
        order[i]=count;
        i++;
    } else if ($0 ~ /User-Name\s+=\s+\"\w+\"/){
        match($0, /User-Name\s+=\s+\"(\w+)\"/, n);
        name[count]=n[1];
    } else if ($0 ~ /Acct-Status-Type\s+=\s+\w+/){
        match($0, /Acct-Status-Type\s+=\s+(\w+)/, n);
        status[count]=n[1];
    } else if ($0 ~ /^$/){
        count++;
    } else if ($0 ~ /Acct-Input-Packets\s+=\s[0-9]+/){
        match($0, /Acct-Input-Packets\s+=\s([0-9]+)/, n);
        input[count]=n[1];
    } else if ($0 ~ /Acct-Output-Packets\s+=\s[0-9]+/){
        match($0, /Acct-Output-Packets\s+=\s([0-9]+)/, n);
        output[count]=n[1];
    }
}
END{
    for (i = 1; i <= length(order); i++) {
        val = name[order[i]];
        if (length(user[val]) == 0) {
            valueStart = "0";
            if (status[order[i]] == "Start"){
                valueStart = arr[order[i]];
            }
            user[val]= valueStart "|0|0|0|0";
        } else {
            split(user[val], nameArr, "|");
            if (status[order[i]]=="Stop"){
                nameArr[2]++;
                nameArr[3]+=arr[order[i]]-nameArr[1]
            } else if (status[order[i]] == "Start"){
                # store date start
                nameArr[1] = arr[order[i]];
            }
            nameArr[4]+=input[order[i]];
            nameArr[5]+=output[order[i]];
            user[val]= nameArr[1] "|" nameArr[2] "|" nameArr[3] "|" nameArr[4] "|" nameArr[5];
        }
    }
    for (usr in user) {
        split(user[usr], usrArr, "|");
        print "User: " usr;
        print "Number of Sessions: " usrArr[2];
        print "Total Connect Time: " usrArr[3];
        print "Input Bandwidth Usage: " usrArr[4];
        print "Output Bandwidth Usage: " usrArr[5];
        print "------------------------";
    }
}' test.txt

The values are extracted with the match function, e.g.:

match($0, /User-Name\s+=\s+\"(\w+)\"/, n);

For the date, we have to parse the month string part; I've used the solution in this post to extract it:

match($0, /\w+\s(\w+)\s([0-9]{2})\s([0-9]{2}):([0-9]{2}):([0-9]{2})\s([0-9]{4})/, n);
match("JanFebMarAprMayJunJulAugSepOctNovDec",n[1])
n[1] = sprintf("%02d",(RSTART+2)/3);

All the processing of the collected values is done in the END clause, where we have to group the values. I create a user array with the username as key and, as value, a concatenation of all the different fields delimited by |:

[startDate] "|" [sessionNum] "|" [connectionTime] "|" [inputUsage] "|" [outputUsage]

With this data input (your data extended), it gives:

User: TOTO
Number of Sessions: 1
Total Connect Time: 114
Input Bandwidth Usage: 83
Output Bandwidth Usage: 66
------------------------
User: wynng
Number of Sessions: 2
Total Connect Time: 228
Input Bandwidth Usage: 166
Output Bandwidth Usage: 132
------------------------
Converting string of ASCII characters to string of corresponding decimals
May I introduce you to the problem that destroyed my weekend. I have biological data in 4 columns:

#ID:::12345/1 ACGACTACGA text !"#$%vwxyz
#ID:::12345/2 TATGACGACTA text :;<=>?VWXYZ

I would like to use awk to edit the first column, replacing the characters : and / with -. I would also like to convert the string in the last column to a comma-separated string of decimals that correspond to each individual ASCII character (any character ranging from ASCII 33 - 126):

#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90

The first part is easy, but I'm stuck with the second. I've tried using awk ordinal functions and sprintf; I can only get the former to work on the first char in the string, and I can only get the latter to convert hexadecimal to decimal, and not with spaces. I also tried this bash command:

$ od -t d1 test3 | awk 'BEGIN{OFS=","}{i = $1; $1 = ""; print $0}'

But I don't know how to call this command from within awk. I would prefer to use awk as I have some downstream manipulations that can also be done in awk. Many thanks in advance.
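Regarding the "call it from within awk" part of the question: awk can read the output of an external command with cmd | getline. This sketch (a generic illustration of the mechanism, not a full solution to the question, and unsafe if the field contains shell metacharacters) feeds one field to od and reads the decimal codes back:

```shell
echo 'AB' | awk '{
    # build a shell pipeline and read its first output line back into awk
    cmd = "printf %s " $1 " | od -An -t d1"
    cmd | getline codes
    close(cmd)
    # squeeze the od column padding into commas
    gsub(/ +/, ",", codes); sub(/^,/, "", codes); sub(/,$/, "", codes)
    print codes
}'
# prints 65,66
```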
Using the ordinal functions from the awk manual, you can do it like this:

awk -f ord.awk --source '{
    # replace : with - in the first field
    gsub(/:/,"-",$1)
    # calculate the ordinal by looping over the characters in the fourth field
    res=ord($4)
    for(i=2;i<=length($4);i++) {
        res=res","ord(substr($4,i))
    }
    $4=res
}1' file

Output:

#ID---12345/1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345/2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90

Here is ord.awk (taken as is from: http://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html):

# ord.awk --- do ord and chr

# Global identifiers:
#    _ord_: numerical values indexed by characters
#    _ord_init: function to initialize _ord_

BEGIN { _ord_init() }

function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") {    # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {    # ebcdic(!)
        low = 0
        high = 255
    }

    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}

function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}

function chr(c)
{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
}

If you don't want to include the whole of ord.awk, you can do it like this:

awk 'BEGIN{ _ord_init() }
function _ord_init(    low, high, i, t) {
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") {    # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {    # ebcdic(!)
        low = 0
        high = 255
    }
    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
{
    # replace : with - in the first field
    gsub(/:/,"-",$1)
    # calculate the ordinal by looping over the characters in the fourth field
    res=_ord_[substr($4,1,1)]
    for(i=2;i<=length($4);i++) {
        res=res","_ord_[substr($4,i,1)]
    }
    $4=res
}1' file
Perl solution:

perl -lnae '$F[0] =~ s%[:/]%-%g; $F[-1] =~ s/(.)/ord($1) . ","/ge; chop $F[-1]; print "@F";' < input

The first substitution replaces : and / in the first field with a dash; the second one replaces each character in the last field with its ord followed by a comma; chop removes the last comma.
Column Operations in file Linux Shell
I have a file with entries separated by an empty space. For example:

example.txt

24676 256 218503341 2173
13236272 500 1023073758 5089
2230304 96 15622969 705
0 22 0 526
13277 28 379182 141

I would like to print, on the command line, the outcome of "column 1 / column 3" or similar. I believe it can be done with awk. However, some entries are 0, hence division by 0 gives:

fatal: division by zero attempted

In a more advanced case, I would like to find the median value (or some percentile) of the division.
There are many ways to ignore a row with a zero divisor, including:

awk '$3 != 0 { print $1/$3 }' your-data-file
awk '{ if ($3 != 0) print $1/$3 }' your-data-file

The question changed to ask for 0 to be printed instead. The answer is not much harder:

awk '{ if ($3 != 0) print $1/$3; else print 0 }' your-data-file

Medians and other percentiles are much fiddlier to deal with. It's easiest if the data is in sorted order. So much easier that I'd expect to use a numeric sort and then process the data from there. I dug out an old shell script which computes descriptive statistics (min, max, mode, median, and deciles) of a single numeric column of data:

: "#(#)$Id: dstats.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
# Calculate Descriptive Statistics: min, max, median, mode, deciles

sort -n $* |
awk 'BEGIN { max = -999999999; min = 999999999; }
{
    # Accumulate basic data
    count[$1]++;
    item[++n] = $1;
    if ($1 > max) max = $1;
    if ($1 < min) min = $1;
}
END {
    # Print Descriptive Statistics
    printf("# Count = %d\n", n);
    printf("# Min = %d\n", min);
    decile = 1;
    for (decile = 10; decile < 100; decile += 10) {
        idx = int((decile * n) / 100) + 1;
        printf("# %d%% decile = %d\n", decile, item[idx]);
        if (decile == 50) median = item[idx];
    }
    printf("# Max = %d\n", max);
    printf("# Median = %d\n", median);
    for (i in count) {
        if (count[i] > count[mode]) mode = i;
    }
    printf("# Mode = %d\n", mode);
}'

The initial values of min and max are not exactly scientific; they serve to illustrate a point. (This 1997 version is almost identical to its 1991 predecessor: everything except the version information line is identical, in fact. So, the code is over 20 years old.)
Here's one solution:

awk '
    $3 != 0 { vals[NR] = $1/$3; sum += vals[NR]; print vals[NR] }
    $3 == 0 { vals[NR] = 0; print "skipping division by 0" }
    END {
        # sort vals here first (awk has no built-in sort; see below)
        print "Mean = " sum/NR ", Median ~ " vals[int(NR/2)]
    }
' < your_file

This will calculate, print, and accumulate the quotients if the 3rd column is not zero. When it reaches the end of your file (which should not end with an empty line), it will print the mean and median of all the quotients, assuming 0 for each line in which it would have divided by zero. In awk, $n means the nth field (starting with 1), and NR is the number of records (that is, the number of lines) processed so far. Each quotient is stored in the array vals, enabling us to calculate the median value. In real life, the median is defined as the "middle" item given an odd number of elements, or the mean of the two "middle" items given an even number of elements. And you're on your own when it comes to implementing the sort function!
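Since POSIX awk has no built-in sort (gawk's asort is an extension), one way to fill in that missing sort step is a small insertion sort over the collected quotients. This is a sketch under the same assumptions as the answer above (zero divisors counted as a 0 quotient, reading the question's example.txt):

```shell
awk '
# collect column1/column3, treating a zero divisor as a 0 quotient
{ vals[++n] = ($3 != 0) ? $1 / $3 : 0; sum += vals[n] }
END {
    # insertion sort of vals[1..n]
    for (i = 2; i <= n; i++) {
        v = vals[i]
        for (j = i - 1; j >= 1 && vals[j] > v; j--)
            vals[j + 1] = vals[j]
        vals[j + 1] = v
    }
    # median: middle element, or mean of the two middle elements
    mid = int((n + 1) / 2)
    median = (n % 2) ? vals[mid] : (vals[mid] + vals[mid + 1]) / 2
    printf "Mean = %g, Median = %g\n", sum / n, median
}' example.txt
```

The same index arithmetic extends to other percentiles: for the pth percentile, pick vals[int(p * n / 100) + 1] after the sort.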