Transpose columns to rows into formatted output - linux

I would like to transpose the unformatted input below into the formatted output shown. Since it contains multiple delimiters
I got stuck and am looking for your suggestions.
Sample_Input.txt
UMTSGSMPLMNCallDataRecord
callForwarding
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57526'D
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57573'D
originalCalledNumber 149212345678'TBCD
redirectingNumber 149387654321'TBCD
!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 0 52'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 45761'D
tariffClass 2'D
timeForStartOfCharge 9 46 58'BCD
calledSubscriberIMSI 21329701412F'TBCD
I searched previous questions and found some relevant input from Mr. Porges' answer:
#!/bin/sh
# split lines on " " and use "," for output field separator
awk 'BEGIN { FS = " "; i = 0; h = 0; ofs = "," }
# empty line - increment item count and skip it
/^\s*$/ { i++ ; next }
# normal line - add the item to the object and the header to the header list
# and keep track of first seen order of headers
{
current[i, $1] = $2
if (!($1 in headers)) {headers_ordered[h++] = $1}
headers[$1]
}
END {
h--
# print headers
for (k = 0; k <= h; k++)
{
printf "%s", headers_ordered[k]
if (k != h) {printf "%s", ofs}
}
print ""
# print the items for each object
for (j = 0; j <= i; j++)
{
for (k = 0; k <= h; k++)
{
printf "%s", current[j, headers_ordered[k]]
if (k != h) {printf "%s", ofs}
}
print ""
}
}' Sample_Input.txt
I am getting the output below:
UMTSGSMPLMNCallDataRecord,callForwarding,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,mSTerminating,originalCalledNumber,redirectingNumber,!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
,,,,,,,,,,,
,,0,09011B'H,57526'D,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,57573'D,,149212345678'TBCD,149387654321'TBCD,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,45761'D,,,,,2'D,9,21329701412F'TBCD
,,,,,,,,,,,
Where I am stuck:
(a). I need to handle the case where a block starts with a line like "UMTSGSMPLMNCallDataRecord" (with no value), followed by a line with a word like callForwarding/mSTerminating (also with no value):
the first word should become the row label ("UMTSGSMPLMNCallDataRecord") and the word on the next line should become a column value (callForwarding/mSTerminating).
(b). I need to strip the alphabetic suffix from the column fields, i.e. 09011B'H becomes 09011 and 149212345678'TBCD becomes 149212345678.
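Requirement (b) in isolation amounts to deleting a trailing run of letters and apostrophes. A minimal standalone sketch with awk's sub() (not part of the original script; the \047 escape stands for the apostrophe so the one-liner can live inside single quotes):

```shell
# Strip the trailing alphabetic/quote suffix ('H, 'D, 'TBCD, B'H ...)
# from each value using sub() with an anchored character class.
printf '%s\n' "09011B'H" "149212345678'TBCD" "57526'D" |
awk '{ sub("[A-Za-z\047]+$", ""); print }'
# -> 09011
#    149212345678
#    57526
```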
Expected Output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52,09011,45761,,,2,9 46 58,21329701412
Edit: I have tried it on the input below:
UMTSGSMPLMNCallDataRecord
callForwarding
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57526'D
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57573'D
originalCalledNumber 149212345678'TBCD
redirectingNumber 149387654321'TBCD
!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 0 52'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 45761'D
tariffClass 2'D
timeForStartOfCharge 9 46 58'BCD
calledSubscriberIMSI 21329701412F'TBCD

Discussion
This is a complex problem since the records are non-uniform: some have missing fields. Because each record spans several lines, we can handle it with AWK's multi-line record feature, by setting the RS (record separator) and FS (field separator) variables.
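The multi-line record mechanics can be seen in a tiny standalone sketch (made-up two-record input, not the CDR data):

```shell
# With RS="" awk reads blank-line-separated "paragraphs" as records,
# and FS="\n" makes each line of a record a separate field.
printf 'a 1\nb 2\n\nc 3\n' |
awk 'BEGIN { RS = ""; FS = "\n" } { print "record " NR " has " NF " lines" }'
# -> record 1 has 2 lines
#    record 2 has 1 lines
```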
Next, we need to collect the header fields. I don't have a good way to do this, so I hard-code the header line.
Once we establish the order in the header, we need a way to extract a specific field from the record; we accomplish that via the function get_column(). This function also strips off the non-numeric data at the end, per your requirement.
One last thing: we need to trim the whitespace off the first column ($2) using the homemade trim() function.
Command Line
I placed my code in make_csv.awk. To run it:
awk -f make_csv.awk Sample_Input.txt
File make_csv.awk
BEGIN {
# Next two lines: each record is of multiple lines, each line is a
# separate field
RS = ""
FS = "\n"
print "UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI"
}
function get_column(name, i, f, len) {
# i, f and len are "local" variables
for (i = 1; i <= NF; i++) {
len = split($i, f, " ")
if (f[1] == name) {
result = f[2]
for (i = 3; i <= len; i++) {
result = result " " f[i]
}
# Remove the trailing non-numeric data (e.g. B'H, 'TBCD)
sub(/[a-zA-Z']+$/, "", result)
return result
}
}
return "" # get_column not found, return empty string
}
# Remove leading and trailing spaces
function trim(s) {
sub(/[ \t]+$/, "", s)
sub(/^[ \t]+/, "", s)
return s
}
/UMTSGSMPLMNCallDataRecord/ {
print trim($2) \
"," get_column("chargeableDuration") \
"," get_column("dateForStartOfCharge") \
"," get_column("recordSequenceNumber") \
"," get_column("originalCalledNumber") \
"," get_column("redirectingNumber") \
"," get_column("tariffClass") \
"," get_column("timeForStartOfCharge") \
"," get_column("calledSubscriberIMSI") \
""
}
Update
I tried out my AWK script against AVN's latest input and got the following output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52,09011,45761,,,2,9 46 58,21329701412

Related

How to sort the characters of a word using awk?

I can't seem to find any way of sorting a word based on its characters in awk.
For example, if the word is "hello" then its sorted equivalent is "ehllo". How can I achieve this in awk?
With GNU awk for PROCINFO[], "sorted_in" (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning) and splitting with a null separator resulting in an array of chars:
$ echo 'hello' |
awk '
BEGIN { PROCINFO["sorted_in"]="#val_str_asc" }
{
split($1,chars,"")
word = ""
for (i in chars) {
word = word chars[i]
}
print word
}
'
ehllo
$ echo 'hello' | awk -v ordr='#val_str_asc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ehllo
$ echo 'hello' | awk -v ordr='#val_str_desc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ollhe
Another option is a Decorate-Sort-Undecorate with sed. Essentially, you use sed to break "hello" into one character per-line (decorating each character with a newline '\n') and pipe the result to sort. You then use sed to do the reverse (undecorate each line by removing the '\n') to join the lines back together.
printf "hello" | sed 's/\(.\)/\1\n/g' | sort | sed '{:a N;s/\n//;ta}'
ehllo
There are several approaches you can use; this one is shell-friendly, but the behavior requires GNU sed.
This would be more doable with gawk, which includes the asort function to sort an array:
awk 'BEGIN{FS=OFS=ORS=""}{split($0,a);asort(a);for(i in a)print a[i]}'<<<hello
This outputs:
ehllo
Demo: https://ideone.com/ylWQLJ
You need to write a function to sort letters in a word (see : https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html):
function siw(word, result, arr, arrlen, arridx) {
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
return result
}
And define a sort sub-function to compare two words (see : https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html):
function compare_by_letters(i1, v1, i2, v2, left, right) {
left = siw(v1)
right = siw(v2)
if (left < right)
return -1
else if (left == right)
return 0
else
return 1
}
And use this function with awk sort function:
asort(array_test, array_test_result, "compare_by_letters")
Then, the sample program is:
function siw(word, result, arr, arrlen, arridx) {
result = hash_word[word]
if (result != "") {
return result
}
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
hash_word[word] = result
return result
}
function compare_by_letters(i1, v1, i2, v2, left, right) {
left = siw(v1)
right = siw(v2)
if (left < right)
return -1
else if (left == right)
return 0
else
return 1
}
{
array_test[i++] = $0
}
END {
alen = asort(array_test, array_test_result, "compare_by_letters")
for (aind = 1; aind <= alen; aind++) {
print array_test_result[aind]
}
}
Executed like this:
echo -e "fail\nhello\nborn" | awk -f sort_letter.awk
Output:
fail
born
hello
Of course, if you have a big input, you can adapt the siw function to memoize its result for faster computation:
function siw(word, result, arr, arrlen, arridx) {
result = hash_word[word]
if (result != "") {
return result
}
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
hash_word[word] = result
return result
}
Here's a very unorthodox, quick-n-dirty method, if you really want to sort "hello" into "ehllo":
mawk/mawk2/gawk 'BEGIN { FS="^$"
# to make it AaBbCc… etc; chr(65) = ascii "A"
for (x = 65; x < 91; x++) {
ref = sprintf("%s%c%c",ref, x, x+32)
} } /^[[:alpha:]]$/ { print } /[[:alpha:]][[:alpha:]]+/ {
# for gawk/nawk, feel free to change
# that to /[[:alpha:]]{2,}/
# the >= 2+ condition is to prevent wasting time
# sorting single letter words "A" and "I"
s=""; x=1; len=length(inp=$0);
while ( len && (x<53) ) {
if (inp~(ch = substr(ref,x++,1))) {
while ( sub(ch,"",inp) ) {
s = s ch;
len -= 1 ;
} } }
print s }'
I'm aware this is an extremely inefficient way of doing selection sort. The potential time savings come from ending the loop the moment all letters have been placed, instead of iterating over all 52 letters every time. The downside is that it doesn't pre-profile the input
(e.g. if you detect that a row is lower-case only, you could speed it up with a lower-case-only loop).
The upside is that it eliminates the need for custom functions, any gawk dependencies, and the need to split every row into an array (or every character into its own field).
Yes, technically one can set FS to the null string, so NF automatically becomes the string length. But that can be slow if the input is large, and if you need Unicode support a match()-based approach is more desirable.
I added the (x<53) condition to prevent runaway infinite loops in case the input isn't pure ASCII letters.
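As a portable alternative to the per-character FS="" splitting mentioned above (a gawk extension), characters can be pulled out one at a time with substr(). A minimal sketch:

```shell
# Walk a string character by character with substr(); works in any
# POSIX awk, unlike per-character splitting via FS="".
echo 'hello' |
awk '{ for (i = 1; i <= length($0); i++) printf "%s ", substr($0, i, 1); print "" }'
# -> h e l l o
```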

Printing ten values from each array alternating between them until both are empty

Suppose I have the following script, which iterates over two arrays and prints the values alternating between them. I feed the script the following file:
1 2
1 2
1 2
1 2
1 2
1 2
Awk Script:
BEGIN{ FS = "\t" }
{ a[o++]=$1; b[c++]=$2 }
END{ while (++co <= 5 ) {
for ( k = 1+count; k <= 2+count; k++ ){ if (length(a) >= 0){ print a[k]; delete a[k]} }
for ( ki = 1+count; ki <= 2+count; ki++ ){ if (length(b) >= 0){ print b[ki]; delete b[ki] } }
count = k
}
}
And I expect the output to be
1
1
2
2
1
1
2
2
1
1
2
2
But what I get is:
1
1
2
2
1
1
2
2
(followed by blank lines but they are not my issue)
So how do I make it run until both arrays are empty and everything is printed? And how do I make it work even if the arrays are of different sizes, e.g. one contains 20 more values than the other? Then just these values should be printed until the array is empty.
wrt how to make it run until both arrays are empty - There's no need to delete the array contents.
wrt just these values should be printed until the array is empty - "the array is empty" could mean print just the values at the indices present in both arrays ("the first array is empty") or it could mean print all the values from both arrays ("both arrays are empty"). The script below assumes the latter but is easily tweaked to do the former (change || to &&).
$ cat tst.awk
{ a[++maxA]=$1; b[++maxB]=$2 }
END {
while ( (prevEnd<maxA) || (prevEnd<maxB) ) {
prt(a)
prevEnd = prt(b)
}
}
function prt(arr, idx) {
for (idx=prevEnd+1; idx<=prevEnd+2; idx++) {
if (idx in arr) {
print arr[idx]
}
}
return (idx-1)
}
$ awk -f tst.awk file
1
1
2
2
1
1
2
2
1
1
2
2

Awk: given list of users with session data, output list of users with specific data

I'm not sure how to phrase this question, so I don't know how to search for it on Google or SO. Let me just show you the given data. By the way, this is just an awk exercise, not homework. I've been trying to solve this off and on for two days now. Below is an example:
Mon Sep 15 12:17:46 1997
User-Name = "wynng"
NAS-Identifier = 207.238.228.11
NAS-Port = 20104
Acct-Status-Type = Start
Acct-Delay-Time = 0
Acct-Session-Id = "239736724"
Acct-Authentic = RADIUS
Client-Port-DNIS = "3571800"
Framed-Protocol = PPP
Framed-Address = 207.238.228.57
Mon Sep 15 12:19:40 1997
User-Name = "wynng"
NAS-Identifier = 207.238.228.11
NAS-Port = 20104
Acct-Status-Type = Stop
Acct-Delay-Time = 0
Acct-Session-Id = "239736724"
Acct-Authentic = RADIUS
Acct-Session-Time = 115
Acct-Input-Octets = 3915
Acct-Output-Octets = 3315
Acct-Input-Packets = 83
Acct-Output-Packets = 66
Ascend-Disconnect-Cause = 45
Ascend-Connect-Progress = 60
Ascend-Data-Rate = 28800
Ascend-PreSession-Time = 40
Ascend-Pre-Input-Octets = 395
Ascend-Pre-Output-Octets = 347
Ascend-Pre-Input-Packets = 10
Ascend-Pre-Output-Packets = 11
Ascend-First-Dest = 207.238.228.255
Client-Port-DNIS = "3571800"
Framed-Protocol = PPP
Framed-Address = 207.238.228.57
So the log file contains the above data for various users. I specifically pasted this to show that this user had a login (Acct-Status-Type = Start) and a logoff (Acct-Status-Type = Stop). This counts as one session. Thus I need to generate the following output:
User: "wynng"
Number of Sessions: 1
Total Connect Time: 115
Input Bandwidth Usage: 83
Output Bandwidth Usage: 66
The problem I have is keeping the info attached to the user. Each entry in the log file has the same field names when the session is in Stop, so I can't just use regexes like
/Acct-Input-Packets/{inPackets =$3}
/Acct-Output-Packets/{outPackets = $3}
because each iteration through the data will overwrite the past values. What I want to do is: if I find a User-Name entry and this entry has a Stop, then record the input/output packet values for that user. This is where I get stumped.
For the session counts I was thinking of saving the User-Names in an array and then, in END{}, counting the duplicates and dividing by 2 (flooring the result if the count is odd).
I don't necessarily want the answer, but maybe some hints/guidance, or perhaps a simple example I could expand on.
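A hint-sized sketch of the keyed-accumulation idea, on a made-up three-line input: remember the last User-Name seen, and only commit the packet counts to per-user totals when the record's status is Stop. (The field handling here is deliberately minimal, not a full parser.)

```shell
# Remember the most recent User-Name; only add packet counts to the
# per-user totals once we know the record is a Stop record.
printf '%s\n' 'User-Name = "wynng"' 'Acct-Status-Type = Stop' 'Acct-Input-Packets = 83' |
awk '
/User-Name/          { gsub(/"/, "", $3); user = $3 }
/Acct-Status-Type/   { stop = ($3 == "Stop") }
/Acct-Input-Packets/ { if (stop) in_pkts[user] += $3 }
END { for (u in in_pkts) print u, in_pkts[u] }'
# -> wynng 83
```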
You can check each line for:
a date pattern: /\w+\s\w+\s[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}\s[0-9]{4}/
user name value: /User-Name\s+=\s+\"\w+\"/
status value: /Acct-Status-Type\s+=\s+\w+/
input packet value: /Acct-Input-Packets\s+=\s[0-9]+/
output packet value: /Acct-Output-Packets\s+=\s[0-9]+/
an empty line: /^$/
Once you have defined what you are looking for (above pattern), it's just a matter of conditions and storing all those data in some array.
In the following example, I store each value type above in a dedicated array, with a count index that is incremented whenever an empty line /^$/ is detected:
awk 'BEGIN{
count = 1;
i = 1;
}{
if ($0 ~ /\w+\s\w+\s[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}\s[0-9]{4}/){
match($0, /\w+\s(\w+)\s([0-9]{2})\s([0-9]{2}):([0-9]{2}):([0-9]{2})\s([0-9]{4})/, n);
match("JanFebMarAprMayJunJulAugSepOctNovDec",n[1])
n[1] = sprintf("%02d",(RSTART+2)/3);
arr[count]=mktime(n[6] " " n[1] " " n[2] " " n[3] " " n[4] " " n[5]);
order[i]=count;
i++;
}
else if ($0 ~ /User-Name\s+=\s+\"\w+\"/){
match($0, /User-Name\s+=\s+\"(\w+)\"/, n);
name[count]=n[1];
}
else if ($0 ~ /Acct-Status-Type\s+=\s+\w+/){
match($0, /Acct-Status-Type\s+=\s+(\w+)/, n);
status[count]=n[1];
}
else if ($0 ~ /^$/){
count++;
}
else if ($0 ~ /Acct-Input-Packets\s+=\s[0-9]+/){
match($0, /Acct-Input-Packets\s+=\s([0-9]+)/, n);
input[count]=n[1];
}
else if ($0 ~ /Acct-Output-Packets\s+=\s[0-9]+/){
match($0, /Acct-Output-Packets\s+=\s([0-9]+)/, n);
output[count]=n[1];
}
}
END{
for (i = 1; i <= length(order); i++) {
val = name[order[i]];
if (length(user[val]) == 0) {
valueStart = "0";
if (status[order[i]] == "Start"){
valueStart = arr[order[i]];
}
user[val]= valueStart "|0|0|0|0";
}
else {
split(user[val], nameArr, "|");
if (status[order[i]]=="Stop"){
nameArr[2]++;
nameArr[3]+=arr[order[i]]-nameArr[1]
}
else if (status[order[i]] == "Start"){
# store date start
nameArr[1] = arr[order[i]];
}
nameArr[4]+=input[order[i]];
nameArr[5]+=output[order[i]];
user[val]= nameArr[1] "|" nameArr[2] "|" nameArr[3] "|" nameArr[4] "|" nameArr[5];
}
}
for (usr in user) {
split(user[usr], usrArr, "|");
print "User: " usr;
print "Number of Sessions: " usrArr[2];
print "Total Connect Time: " usrArr[3];
print "Input Bandwidth Usage: " usrArr[4];
print "Output Bandwidth Usage: " usrArr[5];
print "------------------------";
}
}' test.txt
The values are extracted with the three-argument form of the match function (a GNU awk extension), like:
match($0, /User-Name\s+=\s+\"(\w+)\"/, n);
For the date, we have to parse the month-name part; I've used the solution from this post to extract it:
match($0, /\w+\s(\w+)\s([0-9]{2})\s([0-9]{2}):([0-9]{2}):([0-9]{2})\s([0-9]{4})/, n);
match("JanFebMarAprMayJunJulAugSepOctNovDec",n[1])
n[1] = sprintf("%02d",(RSTART+2)/3);
All the processing of the collected values is done in the END clause, where we have to group the values. I create a user array with the username as key and, as value, a concatenation of all the different fields delimited by |:
[startDate] "|" [sessionNum] "|" [connectionTime] "|" [inputUsage] "|" [outputUsage]
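The pack/unpack idiom behind that value can be shown on its own (hypothetical numbers, including a made-up start-date epoch):

```shell
# Pack several per-user counters into one "|"-delimited string and
# recover them later with split() -- the idiom used in the END clause.
awk 'BEGIN {
    user["wynng"] = "874312345" "|" "2" "|" "228" "|" "166" "|" "132"
    split(user["wynng"], f, "|")
    print "sessions: " f[2] ", connect time: " f[3]
}'
# -> sessions: 2, connect time: 228
```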
With this data input (your data extended), it gives:
User: TOTO
Number of Sessions: 1
Total Connect Time: 114
Input Bandwidth Usage: 83
Output Bandwidth Usage: 66
------------------------
User: wynng
Number of Sessions: 2
Total Connect Time: 228
Input Bandwidth Usage: 166
Output Bandwidth Usage: 132
------------------------

Converting string of ASCII characters to string of corresponding decimals

May I introduce you to the problem that destroyed my weekend. I have biological data in 4 columns:
#ID:::12345/1 ACGACTACGA text !"#$%vwxyz
#ID:::12345/2 TATGACGACTA text :;<=>?VWXYZ
I would like to use awk to edit the first column to replace characters : and / with -
I would like to convert the string in the last column with a comma-separated string of decimals that correspond to each individual ASCII character (any character ranging from ASCII 33 - 126).
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
The first part is easy, but I'm stuck on the second. I've tried using awk ordinal functions and sprintf; I can only get the former to work on the first character of the string, and I can only get the latter to convert hexadecimal to decimal, not handle the spaces. I also tried this from bash:
$ od -t d1 test3 | awk 'BEGIN{OFS=","}{i = $1; $1 = ""; print $0}'
But I don't know how to call this from within awk.
I would prefer to use awk as I have some downstream manipulations that can also be done in awk.
Many thanks in advance
Using the ordinal functions from the awk manual, you can do it like this:
awk -f ord.awk --source '{
# replace : with - in the first field
gsub(/:/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=ord($4)
for(i=2;i<=length($4);i++) {
res=res","ord(substr($4,i))
}
$4=res
}1' file
Output:
#ID---12345/1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345/2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
Here is ord.awk (taken as is from: http://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html)
# ord.awk --- do ord and chr
# Global identifiers:
# _ord_: numerical values indexed by characters
# _ord_init: function to initialize _ord_
BEGIN { _ord_init() }
function _ord_init( low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str, c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}
function chr(c)
{
# force c to be numeric by adding 0
return sprintf("%c", c + 0)
}
If you don't want to include the whole of ord.awk, you can do it like this:
awk 'BEGIN{ _ord_init()}
function _ord_init(low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
{
# replace : with - in the first field
gsub(/:/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=_ord_[substr($4,1,1)]
for(i=2;i<=length($4);i++) {
res=res","_ord_[substr($4,i,1)]
}
$4=res
}1' file
Perl solution:
perl -lnae '$F[0] =~ s%[:/]%-%g; $F[-1] =~ s/(.)/ord($1) . ","/ge; chop $F[-1]; print "@F";' < input
The first substitution replaces : and / in the first field with a dash, the second one replaces each character in the last field with its ord and a comma, chop removes the last comma.

Column Operations in file Linux Shell

I have a file with entries separated by an empty space. For example:
example.txt
24676 256 218503341 2173
13236272 500 1023073758 5089
2230304 96 15622969 705
0 22 0 526
13277 28 379182 141
I would like to print, on the command line, the outcome of "column 1 / column 3" or similar. I believe it can be done with awk. However, some entries are 0, hence division by 0 gives:
fatal: division by zero attempted
In a more advanced case, I would like to find the median value (or some percentile) of the division.
There are many ways to ignore the row with a zero divisor, including:
awk '$3 != 0 { print $1/$3 }' your-data-file
awk '{ if ($3 != 0) print $1/$3 }' your-data-file
The question changed to ask for 0 to be printed instead. The answer is not much harder:
awk '{ if ($3 != 0) print $1/$3; else print 0 }' your-data-file
Medians and other percentiles are much fiddlier to deal with. It's easiest if the data is in sorted order. So much easier that I'd expect to use a numeric sort and then process the data from there.
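Following that advice, here is a sketch of the sort-then-pick approach on a small made-up dataset (clean quotients chosen for clarity): print the quotients, sort numerically, then take the middle value, or the mean of the two middle values for an even count.

```shell
# Median of column 1 / column 3, skipping rows where $3 is zero.
printf '%s\n' '4 x 2' '0 x 0' '9 x 3' '8 x 4' |
awk '$3 != 0 { print $1 / $3 }' |
sort -n |
awk '{ v[NR] = $1 }
     END {
         if (NR % 2) print v[(NR + 1) / 2]
         else        print (v[NR/2] + v[NR/2 + 1]) / 2
     }'
# -> 2   (quotients 2, 3, 2; sorted 2, 2, 3; middle value 2)
```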
I dug out an old shell script which computes descriptive statistics - min, max, mode, median, and deciles of a single numeric column of data:
: "#(#)$Id: dstats.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
# Calculate Descriptive Statistics: min, max, median, mode, deciles
sort -n $* |
awk 'BEGIN { max = -999999999; min = 999999999; }
{ # Accumulate basic data
count[$1]++;
item[++n] = $1;
if ($1 > max) max = $1;
if ($1 < min) min = $1;
}
END { # Print Descriptive Statistics
printf("# Count = %d\n", n);
printf("# Min = %d\n", min);
decile = 1;
for (decile = 10; decile < 100; decile += 10)
{
idx = int((decile * n) / 100) + 1;
printf("# %d%% decile = %d\n", decile, item[idx]);
if (decile == 50)
median = item[idx];
}
printf("# Max = %d\n", max);
printf("# Median = %d\n", median);
for (i in count)
{
if (count[i] > count[mode])
mode = i;
}
printf("# Mode = %d\n", mode);
}'
The initial values of min and max are not exactly scientific; they serve to illustrate a point.
(This 1997 version is almost identical to its 1991 predecessor - all except for the version information line is identical, in fact. So, the code is over 20 years old.)
Here's one solution, using GNU awk for asort():
awk '
$3 != 0 { vals[++n] = $1/$3; sum += vals[n]; print vals[n] }
$3 == 0 { vals[++n] = 0; print "skipping division by 0" }
END { asort(vals); print "Mean = " sum/n ", Median ~ " vals[int((n+1)/2)] }
' < your_file
This will calculate, print, and accumulate the quotients if the 3rd column is not zero, assuming 0 for each line where it would have divided by zero. When it reaches the end of your file (which should not have an empty line), it prints the mean and the (approximate) median of all the quotients.
In awk, $n means the nth field, starting with 1, and NR is the number of records (that is, lines) processed so far. Each quotient is stored in the array vals, enabling us to pick out the median value.
In real life, the median is defined as the "middle" item given an odd number of elements, or the mean of the two "middle" items given an even number of elements.
And if you are not using gawk, you're on your own when it comes to implementing the sort!
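Since POSIX awk has no built-in sort, one way to fill that gap is a small insertion sort over the collected values. A sketch with made-up numbers, applying the odd/even median rule:

```shell
# A portable stand-in for the missing sort: insertion-sort the values
# inside END, then take the middle value (or the mean of the two
# middle values when the count is even).
printf '%s\n' 5 1 4 2 |
awk '{ v[NR] = $0 + 0 }
     END {
         for (i = 2; i <= NR; i++) {            # insertion sort
             x = v[i]
             for (j = i - 1; j >= 1 && v[j] > x; j--) v[j + 1] = v[j]
             v[j + 1] = x
         }
         mid = int((NR + 1) / 2)
         median = (NR % 2) ? v[mid] : (v[mid] + v[mid + 1]) / 2
         print "median = " median
     }'
# -> median = 3
```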

Resources