Swap two columns depending on a condition in the third column in Linux

I have a file with 3 columns like this
Col1 Col2 Col3
A B <-
C D ->
E F ->
I want to swap the entries of Col1 and Col2 whenever there is <- in the third column. I want my output file to be like
Col1 Col2 Col3
B A ->
C D ->
E F ->

awk '($3=="<-"){$3=$2;$2=$1;$1=$3;$3="->"}1' <file>
Essentially, if $3=="<-", swap the first two columns (using $3 as a temporary) and then redefine $3 as "->". Then print.
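An equivalent variant uses an explicit temporary variable instead of reusing $3 as scratch space (tmp is just an illustrative name):
awk '$3=="<-"{tmp=$1; $1=$2; $2=tmp; $3="->"}1' <file>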

A short awk example is
cat foooo | awk '{if (match($3,"<-")){print $2,$1,$3}else{print $1,$2,$3}}'
where foooo is the file name.
If you also want to change the "<-" to "->", then the code would be
cat foooo | awk '{if (match($3,"<-")){print $2,$1,"->"}else{print $1,$2,$3}}'

Related

Awk: Sum up column values across multiple files with identical column layout

I have a number of files with the same header:
COL1, COL2, COL3, COL4
You can ignore COL1-COL3. COL4 contains a number. Each file contains about 200 rows. I am trying to sum up COL4 across the files, row by row. For example:
File 1
COL1 COL2 COL3 COL4
x y z 3
a b c 4
File 2
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Then a new file is returned:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
Is there a simple way to do this without AWK? I will use AWK if need be; I just thought there might be a simple one-liner that I could run right away. The AWK script I have in mind feels a bit long.
Thanks
Combining paste with awk, as in Kristo Mägi's answer, is your best bet:
paste merges the corresponding lines from the input files,
which sends a single stream of input lines to awk, with each input line containing all fields to sum up.
Assuming a fixed number of input files and columns, Kristo's answer can be simplified to (making processing much more efficient):
paste file1 file2 | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
Note: The above produces space-separated output columns, because awk's default value for OFS, the output-field separator, is a single space.
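If you prefer tab-separated output instead, the same command works with OFS set explicitly (a minor variation, assuming the same two-file, four-column layout):
paste file1 file2 | awk -v OFS='\t' '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'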
Assuming that all files have the same column structure and line count, below is a generalization of the solution, which:
generalizes to more than 2 input files (and more than 2 data rows)
generalizes to any number of fields, as long as the field to sum up is the last one.
#!/bin/bash
files=( file1 file2 )   # array of input files
paste "${files[@]}" | awk -v numFiles=${#files[@]} -v OFS='\t' '
  {
    row = sep = ""
    for (i = 1; i < NF/numFiles; ++i) { row = row sep $i; sep = OFS }
    sum = $(NF/numFiles)   # last header col. / (1st) data col. to sum
    if (NR > 1) { for (i = 2; i <= numFiles; ++i) sum += $(NF/numFiles * i) }   # add other cols.
    printf "%s%s%s\n", row, OFS, sum
  }
'
Note that \t (the tab char.) is used to separate output fields and that, due to relying on awk's default line-splitting into fields, preserving the exact input whitespace between fields is not guaranteed.
If all files have the same header - awk solution:
awk '!f && FNR==1{ f=1; print $0 }FNR>1{ s[FNR]+=$NF; $NF=""; r[FNR]=$0 }
END{ for(i=2;i<=FNR;i++) print r[i],s[i] }' File[12]
The output (for 2 files):
COL1 COL2 COL3 COL4
x y z 8
a b c 14
This approach can be applied to multiple files (in that case you may use a glob such as File* for filename expansion).
One more option.
The command:
paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
The result:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
What it does:
Test files:
$ cat f1.txt
COL1 COL2 COL3 COL4
x y z 3
a b c 4
$ cat f2.txt
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Command: paste f{1,2}.txt
Joins 2 files and gives output:
COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10
Command: sed '1d'
Is meant to remove header temporarily
Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-3 and sums $4 and $8 from paste result.
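For the test files above this intermediate step yields:
x y z 8
a b c 14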
Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back
EDIT:
Following @mklement0's comment: he is right about the header handling, as I forgot the NR==1 part.
So I'll include his updated version here as well:
paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
You state you have "a number of files", i.e. more than 2.
Given these 3 files (it should work with any number):
$ cat f1 f2 f3
COL1 COL2 COL3 COL4
x y z 3
a b c 4
COL1 COL2 COL3 COL4
x y z 5
a b c 10
COL1 COL2 COL3 COL4
x y z 10
a b c 15
You can do:
$ awk 'FNR==1{next}
{sum[$1]+=$4}
END{print "COL1 COL4";
for (e in sum) print e, sum[e]} ' f1 f2 f3
COL1 COL4
x 18
a 29
It is unclear what you intend to do with COL2 or COL3, so I did not add that.
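If you do want to carry COL2 and COL3 along, a sketch that keys the sum on all three columns (otherwise the same approach; as above, the output order of the for-in loop is not guaranteed):
awk 'FNR==1{next}
     {sum[$1" "$2" "$3]+=$4}
     END{print "COL1 COL2 COL3 COL4";
         for (k in sum) print k, sum[k]}' f1 f2 f3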
$ awk '
NR==1 { print }
{ sum[FNR]+=$NF; sub(/[^[:space:]]+[[:space:]]*$/,""); pfx[FNR]=$0 }
END { for(i=2;i<=FNR;i++) print pfx[i] sum[i] }
' file1 file2
COL1 COL2 COL3 COL4
x y z 8
a b c 14
The above will work robustly and efficiently with any awk on any UNIX system, with any number of input files and with any contents of those files. The only potential problem is that it has to retain the equivalent of one of those files in memory, so if each file were absolutely massive you might exhaust available memory.

awk difference between subsequent lines

This is a great example of how to solve the problem if I want to print the differences between subsequent lines of a single column.
awk 'NR>1{print $1-p} {p=$1}' file
But how would I do it if I have an unknown number of columns in the file and I want the differences for all of them, e.g. (note that the number of columns is not necessarily 3; it can be 10 or 15 or more):
col1 col2 col3
---- ---- ----
1 3 2
2 4 10
1 9 -3
. . .
the output would be:
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
. . .
Instead of saving the first column, save the entire line; you would then be able to split it and print the differences using a loop:
awk 'NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i] ; print ""}
{p=split($0, a)}' file
If you need the column title then you can print it using BEGIN.
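For example, a sketch assuming the file contains only the numeric rows (no header lines):
awk 'BEGIN{print "col1 col2 col3"; print "---- ---- ----"}
     NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i]; print ""}
     {split($0, a)}' file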
$ awk 'NR<3; NR>3{for (i=1;i<=NF;i++) printf "%d%s", $i-p[i], (i<NF?OFS:ORS)} {split($0,p)}' file | column -t
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13

AWK script: Finding number of matches that each element in Col2 has in Col1

I want to compare two columns in a file as below using AWK; can someone give me some help, please?
e.g.
Col1 Col2
---- ----
2 A
2 D
3 D
3 D
3 A
7 N
7 M
1 D
1 R
Now I want to use AWK to implement the following algorithm to find matches between those columns:
list1[] <=== Col1
list2[] <=== Col2
NewList[]
for i in col2:
    d = 0
    for j in range(1, len(col2)):
        if i == list2[j]:
            d++
    NewList.append(list1[list2.index[i]])
Expected result:
A ==> 2 // means A matches two times to Col1
D ==> 4 // means D matches four times to Col1
....
So I want to write the above code as an AWK script, but I find it too complicated as I haven't used AWK yet.
Thank you very much for your help
Not all that complicated: keep the count in an array indexed by the character and print the array out at the end:
awk '{cnt[$2]++} END {for(c in cnt) print c, cnt[c]}' test.txt
# A 2
# D 4
# M 1
# N 1
# R 1
{cnt[$2]++} # For each row, get the second column and increase the
# value of the array at that position (ie cnt['A']++)
END {for(c in cnt) print c, cnt[c]}
# When all rows done (END), loop through the keys of the
# array and print key and array[key] (the value)
alternative solution
$ rev file | cut -c1 | sort | uniq -c
2 A
4 D
1 M
1 N
1 R
For the requested formatting, pipe to ... | sed -r 's/(\w) (\w)/\2 ==> \1/'
A ==> 2
D ==> 4
M ==> 1
N ==> 1
R ==> 1
Or, do everything in awk
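For example, a sketch that combines the counting approach above with the requested "==>" output format:
awk '{cnt[$2]++} END {for(c in cnt) printf "%s ==> %d\n", c, cnt[c]}' test.txt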

In a *nix environment, how would I group columns together?

I have the following text file:
A,B,C
A,B,C
A,B,C
Is there a way, using standard *nix tools (cut, grep, awk, sed, etc), to process such a text file and get the following output:
A
A
A
B
B
B
C
C
C
You can do:
tr , \\n
and that will generate
A
B
C
A
B
C
A
B
C
which you could sort.
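For example (assuming the input file is named file.txt):
tr , '\n' < file.txt | sort
which yields the grouped output shown above.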
Unless you want to pull the first column then second then third, in which case you want something like:
awk -F, '{for(i=1;i<=NF;++i) print i, $i}' | sort -sk1 | awk '{print $2}'
To explain this, the first part generates
1 A
2 B
3 C
1 A
2 B
3 C
1 A
2 B
3 C
the second part will stably sort (so the internal order is preserved)
1 A
1 A
1 A
2 B
2 B
2 B
3 C
3 C
3 C
and the third part will strip the numbers
You could use a shell for-loop combined with cut if you know in advance the number of columns. Here is an example using bash syntax:
for i in {1..3}; do
cut -d, -f $i file.txt
done
Try:
awk 'BEGIN {FS=","} /([A-C],)+([A-C])?/ {for (i=1;i<=NF;i++) print $i}' YOURFILE | sort

linux, Comma Separated Cells to Rows Preserve/Aggregate Column

There was a similar question here, but for Excel/VBA: Excel Macro - Comma Separated Cells to Rows Preserve/Aggregate Column.
Because I have a big file (>300 MB) this is not an option, so I am struggling to get it to work in bash.
Based on this data
1 Cat1 a,b,c
2 Cat2 d
3 Cat3 e
4 Cat4 f,g
I would like to convert it to:
1 Cat1 a
1 Cat1 b
1 Cat1 c
2 Cat2 d
3 Cat3 e
4 Cat4 f
4 Cat4 g
cat > data << EOF
1 Cat1 a,b,c
2 Cat2 d
3 Cat3 e
4 Cat4 f,g
EOF
set -f                                  # turn off globbing
IFS=,                                   # prepare for comma-separated data
while IFS=$'\t' read C1 C2 C3; do       # split columns at tabs
    for X in $C3; do                    # split C3 at commas (due to IFS)
        printf '%s\t%s\t%s\n' "$C1" "$C2" "$X"
    done
done < data
This looks like a job for awk or perl.
awk 'BEGIN { FS = OFS = "\t" }
{ split($3, a, ",");
for (i in a) {$3 = a[i]; print} }'
perl -F'\t' -alne 'foreach (split ",", $F[2]) {
$F[2] = $_; print join("\t", @F)
}'
Both programs are based on the same algorithm: split the third column at commas, and iterate over the components, printing the original line with each component in the third column in turn.
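One caveat: awk's for (i in a) does not guarantee that the array elements are visited in their original order. If the order of the comma-separated parts matters, a variant that uses split's return value instead (a sketch, assuming the tab-separated data file shown above):
awk 'BEGIN { FS = OFS = "\t" }
     { n = split($3, a, ",")
       for (i = 1; i <= n; i++) { $3 = a[i]; print } }' data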
