I can't seem to find any way to sort a word by its characters in awk.
For example, if the word is "hello", then its sorted equivalent is "ehllo". How can I achieve this in awk?
With GNU awk for PROCINFO[], "sorted_in" (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning) and splitting with a null separator resulting in an array of chars:
$ echo 'hello' |
awk '
BEGIN { PROCINFO["sorted_in"]="@val_str_asc" }
{
split($1,chars,"")
word = ""
for (i in chars) {
word = word chars[i]
}
print word
}
'
ehllo
$ echo 'hello' | awk -v ordr='@val_str_asc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ehllo
$ echo 'hello' | awk -v ordr='@val_str_desc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ollhe
Another option is a Decorate-Sort-Undecorate with sed. Essentially, you use sed to break "hello" into one character per-line (decorating each character with a newline '\n') and pipe the result to sort. You then use sed to do the reverse (undecorate each line by removing the '\n') to join the lines back together.
printf "hello" | sed 's/\(.\)/\1\n/g' | sort | sed '{:a N;s/\n//;ta}'
ehllo
There are several approaches you can use. This one is shell friendly, but it requires GNU sed (for the '\n' in the replacement text).
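If GNU sed isn't available, much the same split/sort/join pipeline can be sketched with POSIX tools only. This is a rough sketch, not a drop-in replacement: sort_chars is a made-up helper name, and it assumes single-byte (ASCII) characters.

```shell
# sort_chars WORD: print WORD with its characters sorted.
# Hypothetical helper; assumes single-byte (ASCII) input.
sort_chars() {
    # fold -w 1 : put each character on its own line
    # sort      : sort those one-character lines
    # tr -d     : join the lines back together
    printf '%s' "$1" | fold -w 1 | sort | tr -d '\n'
    echo    # terminate the joined result with a newline
}

sort_chars hello
```

The ordering of mixed-case input depends on the locale that sort runs in.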
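This is easier with gawk, which includes the asort function to sort an array. The loop below walks the indices 1..n explicitly, because for (i in a) does not guarantee any particular order:
awk 'BEGIN{FS=""}{n=split($0,a);asort(a);for(i=1;i<=n;i++)printf "%s",a[i];print ""}'<<<hello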
This outputs:
ehllo
Demo: https://ideone.com/ylWQLJ
You need to write a function to sort letters in a word (see : https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html):
function siw(word, result, arr, arrlen, arridx) {
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
return result
}
And define a sort sub-function to compare two words (see : https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html):
function compare_by_letters(i1, v1, i2, v2, left, right) {
left = siw(v1)
right = siw(v2)
if (left < right)
return -1
else if (left == right)
return 0
else
return 1
}
Then pass the name of this function to gawk's asort function:
asort(array_test, array_test_result, "compare_by_letters")
Then, the sample program is:
function siw(word, result, arr, arrlen, arridx) {
result = hash_word[word]
if (result != "") {
return result
}
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
hash_word[word] = result
return result
}
function compare_by_letters(i1, v1, i2, v2, left, right) {
left = siw(v1)
right = siw(v2)
if (left < right)
return -1
else if (left == right)
return 0
else
return 1
}
{
array_test[i++] = $0
}
END {
alen = asort(array_test, array_test_result, "compare_by_letters")
for (aind = 1; aind <= alen; aind++) {
print array_test_result[aind]
}
}
Executed like this:
echo -e "fail\nhello\nborn" | awk -f sort_letter.awk
Output:
fail
born
hello
Of course, if you have a big input, you can adapt the siw function to memoize its results for faster computation:
function siw(word, result, arr, arrlen, arridx) {
result = hash_word[word]
if (result != "") {
return result
}
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
hash_word[word] = result
return result
}
Here's a very unorthodox method for a quick-n-dirty approach, if you really want to sort "hello" into "ehllo":
mawk/mawk2/gawk 'BEGIN { FS="^$"
# to make it AaBbCc… etc; chr(65) = ascii "A"
for (x = 65; x < 91; x++) {
ref = sprintf("%s%c%c",ref, x, x+32)
} } /^[[:alpha:]]$/ { print } /[[:alpha:]][[:alpha:]]+/ {
# for gawk/nawk, feel free to change
# that to /[[:alpha:]]{2,}/
# the >= 2+ condition is to prevent wasting time
# sorting single letter words "A" and "I"
s=""; x=1; len=length(inp=$0);
while ( len && (x<53) ) {
if (inp~(ch = substr(ref,x++,1))) {
while ( sub(ch,"",inp) ) {
s = s ch;
len -= 1 ;
} } }
print s }'
I'm aware it's an extremely inefficient way of doing a selection sort. The potential time savings come from ending the loop the moment all letters have been consumed, instead of iterating over all 52 letters every time. The downside is that it doesn't pre-profile the input
(e.g. if you detect that a row is lower-case only, you can speed it up with a lowercase-only loop instead).
The upside is that it eliminates the need for custom functions, eliminates any gawk dependencies, and also eliminates the need to split every row into an array (or every character into its own field).
Yes, technically one can set FS to the null string, so that NF automatically becomes the string length, but that can be slow when the input is large. If you need Unicode support, then a match()-based approach is more desirable.
I added the (x<53) condition to prevent runaway infinite loops in case the input isn't pure ASCII letters.
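For comparison, here is a rough sketch of the more conventional route in portable awk: pull the characters out with substr() and run a plain insertion sort, with no gawk extensions and no FS tricks. sort_chars_awk is a made-up wrapper name, and the sketch assumes single-byte characters.

```shell
# sort_chars_awk WORD: print WORD with its characters sorted, using only
# POSIX awk features. Hypothetical helper; assumes single-byte characters.
sort_chars_awk() {
    printf '%s\n' "$1" | awk '
    {
        n = length($0)
        # copy each character into c[1..n]
        for (i = 1; i <= n; i++) c[i] = substr($0, i, 1)
        # insertion sort on c[1..n] (string comparison)
        for (i = 2; i <= n; i++) {
            key = c[i]
            for (j = i - 1; j >= 1 && c[j] > key; j--) c[j + 1] = c[j]
            c[j + 1] = key
        }
        # glue the sorted characters back together
        out = ""
        for (i = 1; i <= n; i++) out = out c[i]
        print out
    }'
}

sort_chars_awk hello
```

Insertion sort is quadratic, which is fine for word-sized input but not something to use on very long strings.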
I would like to transpose the unformatted input below into the formatted output shown further down. Since the input uses multiple delimiters,
I got stuck and am looking for your suggestions on how to proceed.
Sample_Input.txt
UMTSGSMPLMNCallDataRecord
callForwarding
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57526'D
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57573'D
originalCalledNumber 149212345678'TBCD
redirectingNumber 149387654321'TBCD
!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 0 52'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 45761'D
tariffClass 2'D
timeForStartOfCharge 9 46 58'BCD
calledSubscriberIMSI 21329701412F'TBCD
I searched previous questions and found some relevant input in Mr. Porges' answer:
#!/bin/sh
# split lines on " " and use "," for output field separator
awk 'BEGIN { FS = " "; i = 0; h = 0; ofs = "," }
# empty line - increment item count and skip it
/^\s*$/ { i++ ; next }
# normal line - add the item to the object and the header to the header list
# and keep track of first seen order of headers
{
current[i, $1] = $2
if (!($1 in headers)) {headers_ordered[h++] = $1}
headers[$1]
}
END {
h--
# print headers
for (k = 0; k <= h; k++)
{
printf "%s", headers_ordered[k]
if (k != h) {printf "%s", ofs}
}
print ""
# print the items for each object
for (j = 0; j <= i; j++)
{
for (k = 0; k <= h; k++)
{
printf "%s", current[j, headers_ordered[k]]
if (k != h) {printf "%s", ofs}
}
print ""
}
}' Sample_Input.txt
I am getting the output below:
UMTSGSMPLMNCallDataRecord,callForwarding,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,mSTerminating,originalCalledNumber,redirectingNumber,!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
,,,,,,,,,,,
,,0,09011B'H,57526'D,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,57573'D,,149212345678'TBCD,149387654321'TBCD,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,,0,09011B'H,45761'D,,,,,2'D,9,21329701412F'TBCD
,,,,,,,,,,,
Where I am stuck:
(a). I need to handle the case where a block starts with "UMTSGSMPLMNCallDataRecord" and an empty field, and the next line contains a word such as callForwarding/mSTerminating and an empty field.
The first word needs to be treated as the row ("UMTSGSMPLMNCallDataRecord") and the word on the next line as the column (callForwarding/mSTerminating).
(b). I need to keep the alphabetic suffix out of the column fields, i.e. 09011B'H should become 09011 and 149212345678'TBCD should become 149212345678.
Expected Output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52, 09011,45761,,,2,9 46 58,21329701412
Edit: I have tried it on the input below:
UMTSGSMPLMNCallDataRecord
callForwarding
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57526'D
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 4 44'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 57573'D
originalCalledNumber 149212345678'TBCD
redirectingNumber 149387654321'TBCD
!!!!!!!!!!!!!!!!!!!!!!!!1164!!!!!!!!!!!!!!!!!!!!!!
UMTSGSMPLMNCallDataRecord
mSTerminating
chargeableDuration 0 0 52'BCD
dateForStartOfCharge 09011B'H
recordSequenceNumber 45761'D
tariffClass 2'D
timeForStartOfCharge 9 46 58'BCD
calledSubscriberIMSI 21329701412F'TBCD
Discussion
This is a complex problem since the records are non-uniform: some with missing fields. Since each record occupies several lines, we can deal with it using AWK's multi-records feature: by setting the RS (record separator) and FS (field separator) variables.
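To make the multi-line record idea concrete, here is a small sketch of what RS = "" and FS = "\n" do. It assumes the records are separated by blank lines; describe_records and the toy input are made up for illustration.

```shell
# describe_records: read blank-line-separated records on stdin and report,
# for each one, how many lines (fields) it has and what its first line is.
# Hypothetical helper, only meant to illustrate RS = "" with FS = "\n".
describe_records() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         { print NR ": " NF " lines, first = " $1 }'
}

printf 'A\nx 1\ny 2\n\nB\nz 3\n' | describe_records
```

Each blank-line-separated block becomes one record, and each line of the block becomes one field ($1, $2, ...).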
Next, we need to deal with collecting the header fields. I don't have a good way to do this, so I hard-code the header line.
Once we establish the order in the header, we need a way to extract a specific field from the record, and we accomplish that via the function get_column(). This function also strips off the non-numeric data at the end, per your requirement.
One last thing: we need to trim the white space off the first column (which is $2, the second line of the record) using the homemade trim() function.
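The suffix stripping can be tried in isolation with a sketch like the one below. strip_suffix is a made-up name, and the pattern assumes the unwanted part is always a trailing run of letters and apostrophes, as in the sample data.

```shell
# strip_suffix VALUE: remove a trailing run of letters and apostrophes,
# e.g. the 'BCD / B'H / 'TBCD markers in the sample input.
# Hypothetical helper; q carries the apostrophe into the awk program.
strip_suffix() {
    printf '%s\n' "$1" |
    awk -v q="'" '{ sub("[A-Za-z" q "]+$", ""); print }'
}

strip_suffix "09011B'H"
strip_suffix "0 4 44'BCD"
```

Anchoring the pattern with $ limits the removal to the end of the value, so letters elsewhere in a value would be left alone.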
Command Line
I placed my code in make_csv.awk. To run it:
awk -f make_csv.awk Sample_Input.txt
File make_csv.awk
BEGIN {
# Next two lines: each record is of multiple lines, each line is a
# separate field
RS = ""
FS = "\n"
print "UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI"
}
function get_column(name, i, j, f, len, result) {
# i, j, f, len and result are "local" variables
for (i = 1; i <= NF; i++) {
len = split($i, f, " ")
if (f[1] == name) {
# Rejoin the value, which may itself contain spaces
result = f[2]
for (j = 3; j <= len; j++) {
result = result " " f[j]
}
# Remove the trailing non-numeric data, such as 'BCD or B'H
sub(/[a-zA-Z']+$/, "", result)
return result
}
}
return "" # column not found, return empty string
}
# Remove leading and trailing spaces
function trim(s) {
sub(/[ \t]+$/, "", s)
sub(/^[ \t]+/, "", s)
return s
}
/UMTSGSMPLMNCallDataRecord/ {
print trim($2) \
"," get_column("chargeableDuration") \
"," get_column("dateForStartOfCharge") \
"," get_column("recordSequenceNumber") \
"," get_column("originalCalledNumber") \
"," get_column("redirectingNumber") \
"," get_column("tariffClass") \
"," get_column("timeForStartOfCharge") \
"," get_column("calledSubscriberIMSI") \
""
}
Update
I tried out my AWK script against AVN's latest input and got the following output:
UMTSGSMPLMNCallDataRecord,chargeableDuration,dateForStartOfCharge,recordSequenceNumber,originalCalledNumber,redirectingNumber,tariffClass,timeForStartOfCharge,calledSubscriberIMSI
callForwarding,0 4 44,09011,57526,,,,,
mSTerminating,0 4 44,09011,57573,149212345678,149387654321,,,
mSTerminating,0 0 52,09011,45761,,,2,9 46 58,21329701412
May I introduce you to the problem that destroyed my weekend. I have biological data in 4 columns
#ID:::12345/1 ACGACTACGA text !"#$%vwxyz
#ID:::12345/2 TATGACGACTA text :;<=>?VWXYZ
I would like to use awk to edit the first column to replace characters : and / with -
I would like to convert the string in the last column into a comma-separated string of decimals that correspond to each individual ASCII character (any character ranging from ASCII 33 - 126).
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
The first part is easy, but I'm stuck on the second. I've tried using the awk ordinal functions and sprintf; I can only get the former to work on the first character of the string, and I can only get the latter to convert hexadecimal to decimal, and not with spaces. I also tried this shell command:
$ od -t d1 test3 | awk 'BEGIN{OFS=","}{i = $1; $1 = ""; print $0}'
But don't know how to call this function within awk.
I would prefer to use awk as I have some downstream manipulations that can also be done in awk.
Many thanks in advance
Using the ordinal functions from the gawk manual, you can do it like this:
awk -f ord.awk --source '{
# replace : and / with - in the first field
gsub(/[:\/]/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=ord($4)
for(i=2;i<=length($4);i++) {
res=res","ord(substr($4,i))
}
$4=res
}1' file
Output:
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
Here is ord.awk (taken as is from: http://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html)
# ord.awk --- do ord and chr
# Global identifiers:
# _ord_: numerical values indexed by characters
# _ord_init: function to initialize _ord_
BEGIN { _ord_init() }
function _ord_init( low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str, c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}
function chr(c)
{
# force c to be numeric by adding 0
return sprintf("%c", c + 0)
}
If you don't want to include the whole of ord.awk, you can do it like this:
awk 'BEGIN{ _ord_init()}
function _ord_init(low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
{
# replace : and / with - in the first field
gsub(/[:\/]/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=_ord_[substr($4,1,1)]
for(i=2;i<=length($4);i++) {
res=res","_ord_[substr($4,i,1)]
}
$4=res
}1' file
Perl solution:
perl -lnae '$F[0] =~ s%[:/]%-%g; $F[-1] =~ s/(.)/ord($1) . ","/ge; chop $F[-1]; print "@F";' < input
The first substitution replaces : and / in the first field with a dash, the second one replaces each character in the last field with its ord and a comma, chop removes the last comma.
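For what it's worth, the od idea from the question can also be made to work for a single string. This is only a sketch: str_to_codes is a made-up helper name, it handles one string rather than a whole file, and it assumes single-byte characters.

```shell
# str_to_codes STRING: print the decimal byte values of STRING, comma-separated.
# Hypothetical helper; assumes single-byte characters.
str_to_codes() {
    # od -A n -t u1 : dump each byte as an unsigned decimal, no address column
    # awk           : join all the numbers (possibly spread over several
    #                 od output lines) with commas
    printf '%s' "$1" | od -A n -t u1 |
    awk '{ for (i = 1; i <= NF; i++) out = (out == "" ? $i : out "," $i) }
         END { print out }'
}

str_to_codes '!"#$%vwxyz'
```

Calling this once per input row spawns extra processes, so for large files the pure awk lookup-table approach above is likely the better fit.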
I have a fairly large text file and am trying to search for a particular term so that I can start a process after that point, but this doesn't seem to be working for me:
fileID = fopen(resfile,'r');
line = 0;
while 1
tline = fgetl(fileID);
line = line + 1;
if ischar(tline)
startRow = strfind(tline, 'OptimetricsResult');
if isfinite(startRow) == 1;
break
end
end
end
The answer I get is 9, but my text file:
$begin '$base_index$'
$begin 'properties'
all_levels=000000000000
time(year=000000002013, month=000000000006, day=000000000020, hour=000000000008, min=000000000033, sec=000000000033)
version=000000000000
$end 'properties'
$begin '$base_index$'
$index$(pos=000000492036, lin=000000009689, lvl=000000000000)
$end '$base_index$'
definitely doesn't have that in the first 9 rows?
If I ctrl+F the file, I know that OptimetricsResult only appears once, and that it's 6792 lines down
Any suggestions?
Thanks
I think your script actually works, and you were just looking at the wrong variable. I assume that the answer you get is startRow = 9 and not line = 9: strfind returns the character position of the match within the line, not a line number. Check the variable line. By the way, note that you're not checking for end-of-file, so your while loop might run indefinitely if the file doesn't contain your search string.
An alternative approach, (which is much simpler in my humble opinion) would be reading all lines at once (each one stored as a separate string) with textscan, and then applying regexp or strfind:
%// Read lines from input file
fid = fopen(filename, 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
%// Search a specific string and find all rows containing matches
C = strfind(C{1}, 'OptimetricsResult');
rows = find(~cellfun('isempty', C));
I can't reproduce your problem.
Are you sure you've properly closed the file before re-running this script? If not, the internal line counter in fgetl does not get reset, so you get false results. Just issue a fclose all on the MATLAB command prompt, and add a fclose(fileID); after the loop, and test again.
In any case, I suggest modifying your infinite-loop (with all sorts of pitfalls) to the following finite loop:
haystack = fopen(resfile,'r');
needle = 'OptimetricsResult';
line = 0;
found = false;
while ~feof(haystack)
tline = fgetl(haystack);
line = line + 1;
if ischar(tline) && ~isempty(strfind(tline, needle))
found = true;
break;
end
end
if ~found
line = NaN;
end
fclose(haystack);
line
You could of course also leave the searching to more specialized tools, which come free with most operating systems:
haystack = 'resfile.txt';
needle = 'OptimetricsResult';
if ispc % Windows
[~,lines] = system(['find /n "' needle '" ' haystack]);
elseif isunix % Mac, Linux
[~,lines] = system(['grep -n "' needle '" ' haystack]);
else
error('Unknown operating system!');
end
You'd have to do a bit more parsing to extract the line number from lines, but I trust this will be no issue.
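On the shell side, that extra parsing can be sketched roughly as follows. It only covers grep -n style output (number, colon, text), not the Windows find /n format, and first_match_line is a made-up name.

```shell
# first_match_line NEEDLE FILE: print the line number of the first line in
# FILE that contains NEEDLE (as fixed text); prints nothing if there is none.
# Hypothetical helper; relies on the "NUMBER:text" format of grep -n.
first_match_line() {
    grep -n -F -- "$1" "$2" | head -n 1 | cut -d: -f1
}

# Example with a throwaway file:
tmpfile=$(mktemp)
printf 'first\nsecond\nOptimetricsResult here\n' > "$tmpfile"
first_match_line OptimetricsResult "$tmpfile"
rm -f "$tmpfile"
```

The -F option makes grep treat the needle as a literal string rather than a regular expression.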
I have a file with entries separated by an empty space. For example:
example.txt
24676 256 218503341 2173
13236272 500 1023073758 5089
2230304 96 15622969 705
0 22 0 526
13277 28 379182 141
I would like to print, in the command line, the outcome of "column 1 / column 3" or similar. I believe it can be done with awk. However, some entries are 0, hence division by 0 gives:
fatal: division by zero attempted
In a more advanced case, I would like to find the median value (or some percentile) of the division.
There are many ways to ignore the row with a zero divisor, including:
awk '$3 != 0 { print $1/$3 }' your-data-file
awk '{ if ($3 != 0) print $1/$3 }' your-data-file
The question then changed to ask for 0 to be printed instead. The answer is not much harder:
awk '{ if ($3 != 0) print $1/$3; else print 0 }' your-data-file
Medians and other percentiles are much fiddlier to deal with. It's easiest if the data is in sorted order. So much easier that I'd expect to use a numeric sort and then process the data from there.
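As a rough sketch of that sort-then-process idea for the question's column 1 / column 3 case: median_ratio is a made-up helper name, and rows with a zero divisor are simply skipped here, which is an assumption rather than something the question settles.

```shell
# median_ratio: read whitespace-separated rows on stdin and print the median
# of column 1 / column 3, skipping rows whose third column is zero.
# Hypothetical helper name.
median_ratio() {
    # 1) compute the quotients in fixed notation so sort -n can order them
    awk '$3 != 0 { printf "%.10f\n", $1 / $3 }' |
    # 2) numeric sort (C locale so "." is the decimal point)
    LC_ALL=C sort -n |
    # 3) pick the middle value, or average the two middle values
    awk '{ v[NR] = $1 + 0 }
         END {
             if (NR == 0) exit
             if (NR % 2) print v[(NR + 1) / 2]
             else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
         }'
}

printf '1 x 2\n3 x 4\n5 x 0\n1 x 1\n' | median_ratio
```

Other percentiles can be read off the same sorted array by changing the index computed in the END block.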
I dug out an old shell script which computes descriptive statistics - min, max, mode, median, and deciles of a single numeric column of data:
: "@(#)$Id: dstats.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
# Calculate Descriptive Statistics: min, max, median, mode, deciles
sort -n $* |
awk 'BEGIN { max = -999999999; min = 999999999; }
{ # Accumulate basic data
count[$1]++;
item[++n] = $1;
if ($1 > max) max = $1;
if ($1 < min) min = $1;
}
END { # Print Descriptive Statistics
printf("# Count = %d\n", n);
printf("# Min = %d\n", min);
decile = 1;
for (decile = 10; decile < 100; decile += 10)
{
idx = int((decile * n) / 100) + 1;
printf("# %d%% decile = %d\n", decile, item[idx]);
if (decile == 50)
median = item[idx];
}
printf("# Max = %d\n", max);
printf("# Median = %d\n", median);
for (i in count)
{
if (count[i] > count[mode])
mode = i;
}
printf("# Mode = %d\n", mode);
}'
The initial values of min and max are not exactly scientific. They serve to illustrate a point.
(This 1997 version is almost identical to its 1991 predecessor - all except for the version information line is identical, in fact. So, the code is over 20 years old.)
Here's one solution (it uses asort(), so it needs GNU awk):
awk '
$3 != 0 { vals[NR] = $1/$3; sum += vals[NR]; print vals[NR] }
$3 == 0 { vals[NR] = 0; print "skipping division by 0" }
END {
n = asort(vals)   # gawk only: sorts the values and reindexes them 1..n
median = (n % 2) ? vals[(n + 1) / 2] : (vals[n / 2] + vals[n / 2 + 1]) / 2
print "Mean = " sum/NR ", Median = " median
}
' < your_file
This will calculate, print, and accumulate the quotients if the 3rd column is not zero. When it reaches the end of your file (which should not have an empty line), it will print the mean and median of all the quotients, assuming 0 for each line in which it would have divided by zero.
In awk, $n means the nth field, starting with 1, and NR (without the $) is the number of records (that is, the number of lines) that have been processed so far. Each quotient is stored in the array vals, enabling us to calculate the median value.
The median is defined as the "middle" item given an odd number of elements, or the mean of the two "middle" items given an even number of elements.
asort() is a gawk extension; with any other awk, you're on your own when it comes to implementing the sort function!
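One thing that is easy to trip over here is the difference between NR and $NR. A tiny sketch with made-up input shows it:

```shell
# NR is the current record (line) number; $NR is the field whose index
# happens to equal that number. The input below is made up for illustration.
show_nr() {
    awk '{ print NR, $NR }'
}

printf 'a b c\nd e f\n' | show_nr
```

On line 1 this prints field 1 ("a"), and on line 2 it prints field 2 ("e"), so $NR is almost never what you want as an array index or a record count.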