I have to run many substitutions on a text file and I need to distinguish a string that has been written in place of something else from the same string if it was originally there.
For instance, say I want to replace a with b, and b with c in the second field of the following file (to get b c c)
a a
a b
b c
if I run awk '$2 == "a" {$2 = "b"}; $2 == "b" {$2 = "c"} 1' file obviously I get
a c
a c
b c
I could pay attention to the order in which I run the substitutions here, but not really in the real case. I'd like a flexible script where I can write the substitutions in any order and not have to worry about values being overwritten. I tried an optimistic awk '$2 == "a" {$2 = b}; $2 == "b" {$2 = c}; b = "b"; c = "c"; 1' file, but it didn't work.
Since you only want to perform the substitution at most once, you're better off with if ... else if ...
awk '{
if ($2 == "a") {$2 = "b"}
else if ($2 == "b") {$2 = "c"}
else if ($2 == "c") {$2 = "a"}
print
}' <<END
a a
a b
b c
END
a b
a c
b a
Format the code to suit your style.
Another approach that may be more elegant:
awk '
BEGIN {repl["a"] = "b"; repl["b"] = "c"; repl["c"] = "a"}
$2 in repl {$2 = repl[$2]}
1
' <<END
a a
a b
b c
END
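If the list of substitutions grows, the map could also be loaded from a separate file instead of being hard-coded in BEGIN. A minimal sketch, assuming a hypothetical two-column file subs.txt with one old/new pair per line:
awk 'NR==FNR {repl[$1] = $2; next} $2 in repl {$2 = repl[$2]} 1' subs.txt file
The NR==FNR block only runs while the first file is being read, so the whole map is built before any record of the data file is touched.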
The general, idiomatic approach to not changing a string that you just changed is to map the old values to strings that cannot appear in the input and then convert those to the new values:
$ cat tst.awk
BEGIN {
old2new["a"] = "b"
old2new["b"] = "c"
}
{
# Step 1 - put an "X" after every "#" so "#<anything else>"
# cannot exist in the input from this point on.
gsub(/#/,"#X",$2)
# Step 2 - map "old"s to intermediate strings that cannot exist
c=0
for (old in old2new) {
gsub(old,"#"c++,$2)
}
# Step 3 - map the intermediate strings to the new strings
c=0
for (old in old2new) {
gsub("#"c++,old2new[old],$2)
}
# Step 4 - restore the "#X"s to "#"s
gsub(/#X/,"#",$2)
# Step 5 - print the record
print
}
$ awk -f tst.awk file
a b
a c
b c
I used gsub()s as that's the most common application of this but feel free to use ifs if that's more appropriate for your case.
Obviously the approach of just concatenating c++ to the end of # only works for up to 10 substitutions; for more than that you'd have to come up with a mapping to other strings (which is trivial, just don't trip over RE metacharacters).
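One way to go past 10 substitutions (my sketch, not part of the original answer) is to wrap the counter in a second "#", so multi-digit placeholders cannot be confused with shorter ones; steps 2 and 3 would then become:
# Step 2 - map "old"s to intermediate strings that cannot exist
c=0
for (old in old2new) {
    gsub(old, "#" c++ "#", $2)
}
# Step 3 - map the intermediate strings to the new strings
c=0
for (old in old2new) {
    gsub("#" c++ "#", old2new[old], $2)
}
With the trailing "#", the placeholder "#1#" can no longer be a prefix of "#10#", and the placeholders still contain no RE metacharacters.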
Hopefully someone out there in the world can help me, and anyone else with a similar problem, find a simple solution to capturing data. I have spent hours trying a one-liner to solve what I thought was a simple problem involving awk, a csv file, and saving the output as a bash variable. In short, here's the nut of it...
The Missions:
1) To output every other column, starting from the LAST COLUMN, with a specific iteration count.
2) To output every other column, starting from NEXT TO LAST COLUMN, with a specific iteration count.
The Data (file.csv):
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
Desired results for Mission 1:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Desired results for Mission 2:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
My Attempts:
The closest I have come to solving any of the above problems is an ugly pipe (which is one way to skin a cat) for Mission 1. However, it doesn't use any declared iteration count (which should be 5). Also, I'm completely lost on solving Mission 2.
Any help simplifying the below and solving Mission 2 will be HELLA appreciated!
outcome=$( awk 'BEGIN {FS = "#"} {for (i = 0; i <= NF; i += 2) printf ("%s%c", $(NF-i), i + 2 <= NF ? "#" : "\n");}' file.csv | sed 's/##.*//g' | awk -F# '{for (i=NF;i>0;i--){printf $i"#"};printf "\n"}' | sed 's/#$//g' | awk -F# '{$1="";print $0}' OFS=# | sed 's/^#//g' );
Also, if a loop with a specific number of iterations is helpful in solving this problem, the magic number is 5. Maybe a solution could be a for loop that counts from right to left, treating every other column as one iteration, with the starting column declared as an awk variable (just a thought; I have no idea how to do that).
Thank you for looking over this problem.
There are certainly more elegant ways to do this, but I am not really an awk person:
Part 1:
awk -F# '{ x = ""; for (f = NF; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Part 2:
awk -F# '{ x = ""; for (f = NF - 1; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
The literal 5 in each of those is your "number of iterations."
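If you'd rather pass the count in than edit the literal, a small variation (same loop, same assumptions about the file layout) is:
awk -F# -v n=5 '{ x = ""; for (f = NF; f > (NF - n * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
For Mission 2, start the loop at f = NF - 1 instead of f = NF.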
Sample data:
$ cat mission.dat
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
One awk solution:
NOTE: OP can add logic to validate the input parameters; see the sketch after the test runs below.
$ cat mission
#!/bin/bash
# format: mission { 1 | 2 } { number_of_fields_to_display }
mission=${1} # assumes user inputs "1" or "2"
offset=$(( mission - 1 )) # subtract one to determine awk/NF offset
iteration_count=${2} # assume for now this is a positive integer
awk -F"#" -v offset=${offset} -v itcnt=${iteration_count} 'BEGIN { OFS=FS }
{ # we will start by counting fields backwards until we run out of fields
# or we hit "itcnt==iteration_count" fields
loopcnt=0
for (i=NF-offset ; i>=1; i-=2) # offset=0 for mission=1; offset=1 for mission=2
{ loopcnt++
if (loopcnt > itcnt)
break
fstart=i # keep track of the field we want to start with
}
# now printing our fields starting with field # "fstart";
# prefix the first printf with an empty string, then each successive
# field is prefixed with OFS=#
pfx = ""
for (i=fstart; i<= NF-offset; i+=2)
{ printf "%s%s",pfx,$i
pfx=OFS
}
# terminate a line of output with a linefeed
printf "\n"
}
' mission.dat
Some test runs:
###### mission #1
# with offset/iteration = 4
$ mission 1 4
2.25#1.5#1#3.25
5.25#4#3#3.25
.2#.5#1#3.75
13.75#13#8.5#6
.2#.5#3#6.5
10.25#10.5#11#12.75
# with offset/iteration = 5
$ mission 1 5
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
# with offset/iteration = 6
$ mission 1 6
12#2#2.25#1.5#1#3.25
7#9#5.25#4#3#3.25
4#4#.2#.5#1#3.75
3#8#13.75#13#8.5#6
10#1#.2#.5#3#6.5
8#7#10.25#10.5#11#12.75
###### mission #2
# with offset/iteration = 4
$ mission 2 4
4#3#1#1
6#5#4#2
1#1#2#3
8#8#6#4
3#2#3#5
7#7#8#6
# with offset/iteration = 5
$ mission 2 5
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
# with offset/iteration = 6;
# notice we pick up field #1 = empty string so output starts with a '#'
$ mission 2 6
#SayWhat#4#3#1#1
#Smarty#6#5#4#2
#IfYouLike#1#1#2#3
#LaughingHard#8#8#6#4
#AtFunny#3#2#3#5
#PunchLines#7#7#8#6
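As one example of the parameter validation mentioned in the note above, the top of the script could check its arguments; this is only a sketch, with hypothetical error messages:
#!/bin/bash
# format: mission { 1 | 2 } { number_of_fields_to_display }
case "${1}" in
    1|2) mission=${1} ;;
    *)   echo "usage: ${0##*/} {1|2} number_of_fields" >&2; exit 1 ;;
esac
if ! [[ "${2}" =~ ^[1-9][0-9]*$ ]]; then
    echo "number_of_fields must be a positive integer" >&2
    exit 1
fi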
This is probably not what you're asking for, but perhaps it will give you an idea.
$ awk -F_ -v skip=4 -v endoff=0 '
BEGIN {OFS=FS}
{offset=(NF-endoff)%skip;
for(i=offset;i<=NF-endoff;i+=skip) printf "%s",$i (i>=(NF-endoff)?ORS:OFS)}' file
112_116_120
122_126_130
132_136_140
142_146_150
You specify the skip between columns and the end offset as input variables. Here, for the last column, the end offset is set to zero and the skip is 4.
For clarity I used the input file
$ cat file
_111_112_113_114_115_116_117_118_119_120
_121_122_123_124_125_126_127_128_129_130
_131_132_133_134_135_136_137_138_139_140
_141_142_143_144_145_146_147_148_149_150
Changing FS for your format should work.
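Adapted to the original # file, a sketch (my adaptation) might look like this; it needs a guard because when (NF - endoff) is a multiple of skip the offset formula yields 0, which would print $0, and note that there is no iteration limit here:
awk -F'#' -v skip=2 -v endoff=0 '
BEGIN {OFS=FS}
{offset=(NF-endoff)%skip; if (offset==0) offset=skip
for(i=offset;i<=NF-endoff;i+=skip) printf "%s",$i (i>=(NF-endoff)?ORS:OFS)}' file.csv
For the first data line this prints 12#2#2.25#1.5#1#3.25, i.e. every other column counting back from the last.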
I want to make pairs of words based on the identifier column (the second field in this example). My file is similar to this example:
A ID.1
B ID.2
C ID.1
D ID.1
E ID.2
F ID.3
The result I want is:
A C ID.1
A D ID.1
B E ID.2
C D ID.1
Note that I don't want to obtain the same word pair in the opposite order. In my real file some words appear more than once with different identifiers.
I tried this code, which works but takes a lot of time (and I don't know if there are redundancies):
counter=2
cat filtered_go_annotation.txt | while read f1 f2; do
tail -n +$counter go_annotation.txt | grep $f2 | awk '{print "'$f1' " $1}';
((counter++))
done > go_network2.txt
The 'tail' is used to drop each line from consideration once it has been read.
Awk solution:
awk '{ a[$2] = ($2 in a? a[$2] FS : "") $1 }
END {
for (k in a) {
len = split(a[k], items);
for (i = 1; i <= len; i++)
for (j = i+1; j <= len; j++)
print items[i], items[j], k
}
}' filtered_go_annotation.txt
The output:
A C ID.1
A D ID.1
C D ID.1
B E ID.2
With GNU awk for sorted_in and true multi-dimensional arrays:
$ cat tst.awk
{ vals[$2][$1] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in vals) {
for (j in vals[i]) {
for (k in vals[i]) {
if (j != k) {
print j, k, i
}
}
delete vals[i][j]
}
}
}
$ awk -f tst.awk file
A C ID.1
A D ID.1
C D ID.1
B E ID.2
I wonder if this would work (in GNU awk):
$ awk '
($2 in a) && !($1 in a[$2]) { # if ID.x is found in a and A not in a[ID.X]
for(i in a[$2]) # loop all existing a[ID.x]
print i,$1,$2 # and output combination of current and all previous matching
}
{
a[$2][$1] # hash to a
}' file
A C ID.1
A D ID.1
C D ID.1
B E ID.2
In two steps: sort on the ID field, then self-join the sorted file on that field and filter out self-pairs and reversed duplicates with awk:
$ sort -k2 file > file.s
$ join -j2 file.s{,} | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}'
A C ID.1
A D ID.1
C D ID.1
B E ID.2
If your input is large, it may be faster to solve it in steps, e.g.:
# Create temporary directory for generated data
mkdir workspace; cd workspace
# Split original file
awk '{ print $1 > $2 }' ../infile
# Find all combinations
perl -MMath::Combinatorics \
-n0777aE \
'
$c=Math::Combinatorics->new(count=>2, data=>[@F]);
while(@C = $c->next_combination) {
say join(" ", @C) . " " . $ARGV
}
' *
Output:
C D ID.1
C A ID.1
D A ID.1
B E ID.2
Perl solution using regex backtracking:
perl -n0777E '/^([^ ]*) (.*)\n(?:.*\n)*?([^ ]*) (\2)\n(?{say"$1 $3 $2"})(?!)/mg' foo.txt
For the flags, see perl -h.
^([^ ]*) (.*)\n : matches a line containing at least one space; the first capturing group is the part left of the first space, the second capturing group the part to its right.
(?:.*\n)*? : matches (without capturing) 0 or more lines, lazily, so that the following pattern is tried before more lines are consumed.
([^ ]*) (\2)\n : similar to the first match, using the backreference \2 to match a line with the same key.
(?{say"$1 $3 $2"}) : code to display the captured groups.
(?!) : forces the match to fail so that the regex backtracks and finds all combinations.
Note that it could be shortened a bit
perl -n0777E '/^(\S+)(.+)[\s\S]*?^((?1))(\2)$(?{say"$1 $3$2"})(?!)/mg' foo.txt
Yet another awk, making use of the redefinition of $0. This makes the solution of RomanPerekhrest a bit shorter:
{a[$2]=a[$2] FS $1}
END { for(i in a) { $0=a[i]; for(j=1;j<NF;j++)for(k=j+1;k<=NF;++k) print $j,$k,i} }
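Saved as, say, pairs.awk (a hypothetical file name), it runs the same way as the other scripts:
awk -f pairs.awk file
The pairs printed are the same as above, although the order of the ID groups depends on awk's unspecified for (i in a) iteration order.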
I have a really difficult file.asv (values separated with #) that contains lines with mismatched numbers of columns.
Example:
name#age#city#lat#long
eric#paris#4.4283333333333331e+01#-1.0550000000000000e+02
dan#43#berlin#3.1366000000000000e+01#-1.0371500000000000e+02
london##2.5250000000000000e+01#1.0538333000000000e+02
Latitude and longitude values are pretty consistent: they have 22 or 23 characters (depending on whether the sign is negative, or positive and therefore absent) and are always in scientific notation. I would like to keep only the latitude and longitude from each line.
Expected output:
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
Headers are not totally necessary, I can add them later. I could also work with separated latitude and longitude outputs, and then paste them together. Any sed or awk command I could use?
Use this awk:
awk 'BEGIN{OFS=FS="#"} {print $(NF-1),$NF}' file
Here,
OFS - Output Field Separator
FS - Input Field Separator
NF - Number of Fields
Assuming that latitude and longitude are always the last two fields, $(NF-1) and $NF will print them.
Test:
$ awk 'BEGIN{OFS=FS="#"} {print $(NF-1),$NF}' file
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
A simple grep would do, assuming the -o option is available:
$ grep -o '[^#]*#[^#]*$' file.asv
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
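Since the question also asks about sed, an equivalent one-liner (my sketch) keeps the last two #-separated fields in the same way:
sed 's/.*#\([^#]*#[^#]*\)$/\1/' file.asv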
Trying to select fields using a regular expression:
$ cat ll.awk
function rep(c, n, ans) { # repeat `c' `n' times
while (n--) ans = ans c
return ans
}
function build_re( d, s, os) { # build a regexp `r'
d = "[0-9]" # a digit
s = "[+-]" # sign
os = s "?" # optional sign
r = os d "[.]" rep(d, 16) "e" s d d # adjust here
r = "^" r "$" # match entire string
}
function process( sep, line, i) {
for (i = 1; i <= NF; i ++ ) {
if (!($i ~ r)) continue # skip fields
line = line sep $i; sep = FS
}
if (length(line)) print line
}
BEGIN {
build_re()
FS = "#"
}
{ # call on every line of input
process()
}
Usage:
$ awk -f ll.awk file.txt
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
One more, in GNU awk, using gensub():
$ awk '{print gensub(/(.+)((#[^#]+){2})$/,"\\2","g",$0)}' file
#lat#long
#4.4283333333333331e+01#-1.0550000000000000e+02
#3.1366000000000000e+01#-1.0371500000000000e+02
#2.5250000000000000e+01#1.0538333000000000e+02
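If the leading # in that output is unwanted, a variant (my sketch, not part of the original answer) can capture just the last two fields without the separator:
awk '{print gensub(/.*#([^#]+#[^#]+)$/,"\\1",1)}' file
which prints lat#long on the header line and the bare coordinate pairs on the rest.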
I have a problem with my bash script on Linux.
My input looks like this:
input
Karydhs y n y y y n n y n n n y n y n
Markopoulos y y n n n y n y n y y n n n y
name3 y n y n n n n n y y n y n y n
etc...
where y=yes and n=no, and these are the results of voting... Now I want, using awk, to display the name and total number of yes votes of each person, and the person that wins (got the most y's). Any ideas?
I tried something like this:
awk '{count=0 for (I=1;i<=15;i++) if (a[I]="y") count++} {print $1,count}' filename
Here is a fast (no sort required, no explicit "for" loop) one-pass solution that takes the possibility of ties into account:
awk 'NF==0{next}
{name=$1; $1=""; gsub(/[^y]/,"",$0); l=length($0);
print name, l;
if (mx=="" || mx < l) { mx=l; tie=""; winner=name; }
else if (mx == l) {
tie = 1; winner = winner", "name;
}
}
END {fmt = tie ? "The winners have won %d votes each:\n" :
"The winner has won %d votes:\n";
printf fmt, mx;
print winner;
}'
Output:
Karydhs 7
Markopoulos 7
name3 6
The winners have won 7 votes each:
Karydhs, Markopoulos
NOTE: The program above is formatted for readability; GNU awk accepts it with the line breaks shown, but certain awks disallow splitting the ternary conditional across lines.
What about this?
awk '{ for (i=2;i<=NF;i++) { if ($i=="y") { a[$1" "$i]++} } } END { print "Yes tally"; l=0; for (i in a) { print i,a[i]; if (l>a[i]) { l=l } else { l=a[i];name=i } } split(name,a," "); print "Winner is ",a[1],"with ",l,"votes" } ' f
Yes tally
name3 y 6
Markopoulos y 7
Karydhs y 7
Winner is Karydhs with 7 votes
Note that there is a tie at 7 votes; since the one-liner tracks only a single winning name, which of the tied candidates it reports depends on awk's array iteration order.
Here's yet another approach.
{ name=$1; $1=""; votes[name]=length(gensub("[^y]","","g")); }
END {asorti(votes,rank); for (r in rank) print rank[r], votes[rank[r]]; }
It is similar to the answer from @mklement0, but it uses asorti()¹ to sort inside of awk.
name=$1 saves the name from token 1
$1=""; clears token 1, which has the side effect of removing it from $0
votes[name] is an array indexed by the candidate's name
gensub("[^y]","","g") removes everything but 'y's from what's left of $0
and length() counts them
asorti(votes,rank) sorts votes by index into rank; at this point the arrays look like this:
votes rank
[name3] = 6 [1] = Karydhs
[Markopoulos] = 7 [2] = Markopoulos
[Karydhs] = 7 [3] = name3
for (r in rank) print rank[r], votes[rank[r]]; prints the results:
Karydhs 7
Markopoulos 7
name3 6
¹ the asorti() function may not be available in some versions of awk
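One caveat (my note, not part of the original answer): a plain for (r in rank) does not guarantee numeric index order in gawk, so to print strictly in rank order the END block could use the element count that asorti() returns:
END { n = asorti(votes, rank); for (r = 1; r <= n; r++) print rank[r], votes[rank[r]] }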
Alternative two-pass awk
$ awk '{print $1; $1=""}1' votes |
awk -Fy 'NR%2{printf "%s ",$0; next} {print NF-1}' |
sort -k2nr
Karydhs 7
Markopoulos 7
name3 6
A simpler - and POSIX-compliant - awk solution, assisted by sort; note that no winner information (which may apply to multiple lines) is explicitly printed, but the sorting by votes in descending order should make the winner(s) obvious.
awk '{
printf "%s", $1
$1=""
yesCount=gsub("y", "")
printf " %s\n", yesCount
}' file |
sort -t ' ' -k2,2nr
printf "%s", $1 prints the name field only, without a trailing newline.
$1="" clears the 1st field, causing $0, the input line, to be rebuilt so that it contains the vote columns only.
yesCount=gsub("y", "") performs a dummy substitution that takes advantage of the fact that Awk's gsub() function returns the count of replacements performed; in effect, the return value is the number of y values on the line.
printf " %s\n", yesCount then prints the number of yes votes as the second output field and terminates the line.
sort -t ' ' -k2,2nr then sorts the resulting lines by the second (-k2,2) space-separated (-t ' ') field, numerically (n), in reverse order (r) so that the highest yes-vote counts appear first.
I have a large txt file ("," as delimiter) with some data and string:
2014:04:29:00:00:58:GMT: subject=BMRA.BM.T_GRIFW-1.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=3,TS=2014:04:29:01:00:00:GMT,VP=4.0,TS=2014:04:29:01:29:00:GMT,VP=4.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
I would like to find lines that contain 'T_GRIFW', then print the $1 field from 'subject' onwards, and only the times and floats from $2 onwards. Furthermore, I want to incorporate an if statement so that if field $4 == 'NP=3', only fields $5, $6, $9 and $10 are printed after the previous fields, and if $4 == 'NP=2', all following fields are printed (times and floats only).
For instance, the result of the two sample lines will be:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I know this is complex and I have tried my best to be thorough in my description. The basic code I have thus far is:
awk 'BEGIN {FS=","}{OFS=","} /T_GRIFW-1.FPN/ {print $1}' tib_messages.2014-04-29
THANKS A MILLION!
Here's an awk executable file that'll create your desired output:
#!/usr/bin/awk -f
# use a more complicated FS => field numbers counted differently
BEGIN { FS="=|,"; OFS="," }
$2 ~ /T_GRIFW/ && $8=="NP" {
str="subject=" $2 OFS
# strip ":GMT" from dates and "}" from everywhere
gsub( /:GMT|[\}]/, "")
# append common fields to str with OFS
for(i=5;i<=13;i+=2) str=str $i OFS
# print the remaining fields and line separator
if($9==3) { print str $19, $21 }
else if($9==2) { print str $15, $17 }
}
Placing that in a file called awko, chmod'ing it, and then running awko data yields:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I've placed comments in the script, but here are some things that could be spelled out better:
Using a more complicated FS means you don't have to re-split on = to work with the field data
I "cheated" and just hard-coded subject (which now falls at the end of $1) for str
:GMT and } appeared to be the only data that needed to be forcibly removed
With this FS, dates and numbers are two fields apart from each other but still loop-able
In either final print call, the str already ends in an OFS, so the comma between it and the next field can be skipped
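To see why the script can hard-code field numbers like $8, $19 and $21, a quick throwaway one-liner (my addition) prints how the first sample line splits under FS="=|,":
awk -F'=|,' 'NR==1{for(i=1;i<=NF;i++) printf "$%d = %s\n", i, $i}' tib_messages.2014-04-29
For the first line that yields $2 = BMRA.BM.T_GRIFW-1.FPN, $8 = NP, $9 = 3, and the final time/value pair in $19 and $21 (the closing } is still attached to $21 until the gsub strips it).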
If I understand your requirements, the following will work:
BEGIN {
FS=","
OFS=","
}
/T_GRIFW/ {
split($1, subject, " ")
result = subject[2] OFS
delete arr
counter = 1
for (i = 2; i <= NF; i++) {
add = 0
if ($4 == "NP=3") {
if (i == 5 || i == 6 || i == 9 || i == 10) {
add = 1
}
}
else if ($4 == "NP=2") {
add = 1
}
if (add) {
counter = counter + 1
split($i, field, "=")
if (match(field[2], /[0-9]*\.[0-9]+|GMT/)) {
arr[counter] = field[2]
}
}
}
for (i in arr) {
gsub(/{|}/,"", arr[i]) # remove curly braces
result = result arr[i] OFS
}
print substr(result, 1, length(result)-1)
}
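Saved as, say, grif.awk (a hypothetical name), it can be run with:
awk -f grif.awk tib_messages.2014-04-29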