How to loop an awk command on every column of a table and output to a single output file? - linux

I have a multi-column file composed of single units: 1s, 2s and 3s. There are a lot of repeats of a unit in each column, and sometimes the value switches from one unit to another. I want to count how many times this switch happens in every column. For example, in column 1 the value switches from 1 to 2 to 3 to 1, so there are 3 switches and the output should be 3. The second column is 2s the entire way down, so the number of changes is 0 and the output is 0.
My input file has 4000 columns so it is impossible to do it by hand. The file is space separated.
For example:
Input:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 ......
3 2 2 1 2
3 2 2 1 1
1 2 2 1 1
1 2 2 1 2
1 2 2 1 1
Desired output:
3 ## column 1 switch times
0 ## column 2 switch times
3 .....
0
5
I was using:
awk '{print $1}' <inputfile> | uniq | wc -l
awk '{print $2}' <inputfile> | uniq | wc -l
awk '{print $3}' <inputfile> | uniq | wc -l
....
This executes one column at a time. It gives me the output "4" for the first column, and I can then calculate 4 - 1 = 3 to get my desired output. But is there a way to write this awk command in a loop, run it on each column, and send the output to one file?
Thanks!

awk tells you how many fields are in a given row in the variable NF, so you can create two arrays to keep track of the information you need. One array keeps the value seen in the previous row for each column; the other counts the number of switches in each column. You also keep track of the maximum number of columns (and set the counts for new columns to zero so that they are printed in the output at the end even if the number of switches for that column is 0). Finally, you make sure you don't count the transition from an empty string to a non-empty string, which happens when a column is encountered for the first time.
If, in fact, the file uniformly has the same number of columns, that will only affect the first row of data. If subsequent rows have more columns than the first line, the extra columns are added. If a column stops appearing for a while, I've assumed it should resume where it left off (as if the missing entries had the same value as before). You can decide on different rules; that could instead count as two transitions (from number to blank, and from blank back to number), in which case you have to modify the counting code. Or, perhaps more sensibly, you could decide that irregular numbers of columns are simply not allowed, in which case you can bail out early if the number of columns in the current row is not the same as in the previous row (beware blank lines, or are they outlawed too?).
And don't try writing the whole program on one line; it would be incomprehensible, and it really isn't necessary.
awk '{ if (NF > maxNF)
       {
           for (i = maxNF + 1; i <= NF; i++)
               count[i] = 0;
           maxNF = NF;
       }
       for (i = 1; i <= NF; i++)
       {
           if (col[i] != "" && $i != col[i])
               count[i]++;
           col[i] = $i;
       }
     }
     END {
         for (i = 1; i <= maxNF; i++)
             print count[i];
     }' data-file-with-4000-columns
Given your sample data (with the dots removed), the output from the script is as requested:
3
0
3
0
5
This alternative data file with jagged rows:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 1 1 1
3 2 2 1 2 2 1
3 2 2 1 1
1 2 2 1 1 2 2 1
1 2 2 1
1 2 2 1 1 3
produces the output:
3
0
3
0
3
2
1
0
This is correct according to the rules I formulated, but if you decide you want different rules to cover the data, you can end up with different answers.
If you used printf("%d\n", count[i]); in the final loop, you wouldn't need to set the count values to zero in a loop first. You pays your money and takes your pick.
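For example, a minimal sketch of that variation (printf "%d" coerces the unset count[i] entries to 0, so the zero-filling loop disappears):
awk '{ if (NF > maxNF)
           maxNF = NF;
       for (i = 1; i <= NF; i++)
       {
           if (col[i] != "" && $i != col[i])
               count[i]++;
           col[i] = $i;
       }
     }
     END {
         for (i = 1; i <= maxNF; i++)
             printf("%d\n", count[i]);
     }' data-file-with-4000-columns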

Use a loop and keep one array for each column's current value and another array for the corresponding count:
awk '{for(i=0;i<5;i++) if(c[i]!=$(i+1)) {c[i]=$(i+1); t[i]++}} END{for(i=0;i<5;i++)print t[i]-1}' filename
Note that this assumes the columns' values are never zero. If you happen to have zero values, just initialize the array c to some unique value which will not be present in the file.
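Since the question mentions 4000 columns, a hedged variation of the same idea drops the hard-coded 5 and loops over NF instead; this assumes every row has the same number of columns, so that NF in the END block still reflects the full column count (the same zero-value caveat applies):
awk '{for(i=1;i<=NF;i++) if(c[i]!=$i) {c[i]=$i; t[i]++}} END{for(i=1;i<=NF;i++)print t[i]-1}' filename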

Coded out longhand for ease of viewing; SaveColx and CountColx should really be arrays. I'd print the column number itself in the results, at least for checking :-)
BEGIN {
    SaveCol1 = " "
    CountCol1 = 0
    CountCol2 = 0
    CountCol3 = 0
    CountCol4 = 0
    CountCol5 = 0
}
{
    if ( SaveCol1 == " " ) {
        SaveCol1 = $1
        SaveCol2 = $2
        SaveCol3 = $3
        SaveCol4 = $4
        SaveCol5 = $5
        next
    }
    if ( $1 != SaveCol1 ) {
        CountCol1++
        SaveCol1 = $1
    }
    if ( $2 != SaveCol2 ) {
        CountCol2++
        SaveCol2 = $2
    }
    if ( $3 != SaveCol3 ) {
        CountCol3++
        SaveCol3 = $3
    }
    if ( $4 != SaveCol4 ) {
        CountCol4++
        SaveCol4 = $4
    }
    if ( $5 != SaveCol5 ) {
        CountCol5++
        SaveCol5 = $5
    }
}
END {
    print CountCol1
    print CountCol2
    print CountCol3
    print CountCol4
    print CountCol5
}

Related

Find the maximum values in 2nd column for each distinct values in 1st column using Linux

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
What I'm looking for is the maximum value in the 2nd column for each distinct value (1, 2, 3, 4, 5) in the 1st column.
ofile.dat
1 40
2 20
3 34
4 50
5 3
I found that someone has done this using another program, e.g. Get the maximum values of column B per each distinct value of column A.
awk seems a prime candidate for this task. Simply traverse your input file, keeping an array indexed by the first-column values and storing the column-2 value whenever it is larger than the currently stored value. At the end of the traversal, iterate over the array to print the indices and the corresponding values.
awk '{
    if (a[$1] < $2) {
        a[$1] = $2
    }
} END {
    for (i in a) {
        print i, a[i]
    }
}' ifile.dat
Now, the result will not be sorted numerically on the first column, but that should be easy to fix if required.
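For instance, one option (a sketch, assuming a numeric first column) is simply to pipe the output through sort:
awk '{ if (a[$1] < $2) a[$1] = $2 } END { for (i in a) print i, a[i] }' ifile.dat | sort -n -k1,1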
Another way is to use sort: first sort numerically on column 2 in decreasing order, then remove non-unique values of column 1. As a one-liner:
sort -n -r -k 2 ifile.dat| sort -u -n -k 1
The easiest command to find the maximum value in the second column is something like this
sort -nrk2 data.txt | awk 'NR==1{print $2}'
When doing min/max calculations, always seed the min/max variable using the first value read:
$ cat tst.awk
!($1 in max) || $2>max[$1] { max[$1] = $2 }
END {
PROCINFO["sorted_in"] = "#ind_num_asc"
for (key in max) {
print key, max[key]
}
}
$ awk -f tst.awk file
1 40
2 20
3 34
4 50
5 3
The above uses GNU awk 4.* for PROCINFO["sorted_in"] to control output order, see http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal.
Considering that your 1st field starts from 1 (if yes, then try one more solution in awk too):
awk '{a[$1]=$2>a[$1]?$2:(a[$1]?a[$1]:$2);} END{for(j=1;j<=length(a);j++){if(a[j]){print j,a[j]}}}' Input_file
Adding one more way to do the same here.
sort -k1 Input_file | awk 'prev != $1 && prev{print prev, val;val=prev=""} {val=val>$2?val:$2;prev=$1} END{print prev,val}'

Find rows common in more than two files using awk [duplicate]

This question already has answers here:
How to find common rows in multiple files using awk
(2 answers)
Closed 7 years ago.
I have tab-delimited text files in which common rows are to be found, using columns 1 and 2 as the key columns.
Sample files:
file1.txt
aba 0 0
abc 0 1
abd 1 1
xxx 0 0
file2.txt
xyz 0 0
aba 0 0 0 0
xxx 0 0
abc 1 1
file3.txt
xyx 0 0
aba 0 0
aba 0 1 0
xxx 0 0 0 1
abc 1 1
I would like to get the rows common to 2 files or to 3 files, using columns 1 and 2 as the key to search on. For the common rows based on columns 1 and 2, reporting the first occurrence in any file would do the job.
Sample Output for rows common in 2 files:
abc 1 1
Sample output for rows common in 3 files:
aba 0 0
xxx 0 0
In the real scenario I have to specify different values for the number of files. Can anybody suggest a generalized solution where I can pass the number of files in which a row has to be common?
I have this piece of code, which looks for rows common to all files.
awk '
FNR == NR {
    arr[$1,$2] = 1
    line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
    next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
    num_files = ARGC - 1
    for ( key in arr ) {
        if ( arr[key] < num_files ) { continue }
        split( line[ key ], line_arr, SUBSEP )
        for ( i = 1; i <= length( line_arr ); i++ ) {
            printf "%s\n", line_arr[ i ]
        }
    }
}
' *.txt > commoninall.txt
This should work:
cat file[123].txt | sort | awk 'BEGIN{FS="\t"}
{ if (V1==$1 && V2==$2) { b=b+1 }
  else { if (NR>1) print b":"line; line=$0; b=1; V1=$1; V2=$2 } }
END{ print b":"line }' | grep "^2:" | awk 'BEGIN{FS=":"} {print $2}'
I cat all the files into one stream, sort the lines, count how many consecutive lines share the same first two tab-separated columns, and then keep only the groups that occur exactly twice (the grep "^2:" part; use "^3:" for rows present three times).
BTW: I took this nice file[123].txt globbing idea from the comment of William Pursell.
This should work too
I put all the lines into an array (b) keyed by the first two values, and accumulate in a the number of repetitions. If the count is > 1, the line is printed from b, which holds the last line saved for this column1/column2 combination.
cat *.txt | awk -F" " '{a[$1$2]=a[$1$2]+1; b[$1$2]=$0} END{ for (i in a){if(a[i]>1){print b[i]}}}'
Is it ok too?
EDIT
To show all lines in all files, you need just a little more:
cat *.txt | awk -F" " '{a[$1$2]=a[$1$2]+1; c=a[$1$2]; b[$1$2c]=$0} END{ for (i in a){if(a[i]>1){for(c=1; c<=a[i];++c){print b[i c]}}}}'
Many thanks to @PeterPaulKiefer for the cat *.txt idea
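To address the generalization the question asks for (passing the number of files a key must appear in), here is a hedged sketch of one possibility; the variable n, the file names, and the choice of "exactly n files" are my assumptions rather than anything from the answers above:
awk -v n=2 '
FNR == 1 { delete seen }                      # new input file: forget which keys this file has shown
!(($1,$2) in seen) {                          # first time this (col1,col2) key appears in the current file
    seen[$1,$2] = 1
    cnt[$1,$2]++                              # number of files containing the key
    if (!(($1,$2) in line))
        line[$1,$2] = $0                      # remember the first occurrence in any file
}
END {
    for (k in cnt)
        if (cnt[k] == n)                      # change == to >= for "at least n files"
            print line[k]
}' file1.txt file2.txt file3.txt
With -v n=3 the same command should report the rows common to all three sample files (aba 0 0 and xxx 0 0), although the order of the output lines is not guaranteed.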

How to sum column of different file in bash scripting

I have two files:
file-1
1 2 3 4
1 2 3 4
1 2 3 4
file-2
0.5
0.5
0.5
Now I want to add column 1 of file-2 to column 3 of file-1
Output
1 2 3.5 4
1 2 3.5 4
1 2 3.5 4
I've tried this, but it does not work correctly:
awk '{print $1, $2, $3+file-2 }' file-2=$1_of_file-2 file-1 > file-3
I know the awk statement is not right but I want to use something like this; can anyone help me?
Your data isn't very exciting…
awk 'FNR == NR { for (i = 1; i <= NF; i++) { line[NR,i] = $i } fields[NR] = NF }
     FNR != NR { line[FNR,3] += $1
                 pad = ""
                 for (i = 1; i <= fields[FNR]; i++) { printf "%s%s", pad, line[FNR,i]; pad = " " }
                 printf "\n"
     }' file-1 file-2
The first pattern matches the lines in the first file; it saves each field into the pseudo-multidimensional array line, and also records how many fields there are in that line.
The second pattern matches the lines in the second file; it adds the value in column one to column three of the saved data, then prints out all the fields with a space between them, and adds a newline to the end.
Given this (mildly) modified input, the script (saved in file so-25657951.sh) produces the output shown:
$ cat file-1
1 2 3 4
2 3 6 5
3 4 9 6
$ cat file-2
0.1
0.2
0.3
$ bash so-25657951.sh
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
$
Note that because this slurps the whole of the first file into memory before reading anything from the second file, the input files should not be too large (say sub-gigabyte size). If they're bigger than that, you should probably devise an alternative strategy.
For example, there is a getline function (even in POSIX awk) which could be used to read a line from file 2 for each line in file 1, and you could then simply print the data without needing to accumulate anything:
awk '{ getline add < "file-2"; $3 += add; print }' file-1
This works reasonably cleanly for any size of file (as long as the files have the same number of lines — or, more precisely, as long as file-2 has at least as many lines as file-1).
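If you can't guarantee that, a small hedged variation checks getline's return value and simply adds nothing once file-2 runs out of lines (the zero fallback is an arbitrary choice, not part of the original answer):
awk '{ if ((getline add < "file-2") <= 0) add = 0; $3 += add; print }' file-1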
This may work:
cat f1
1 2 3 4
2 3 6 5
3 4 9 6
cat f2
0.1
0.2
0.3
awk 'FNR==NR {a[NR]=$1;next} {$3+=a[FNR]}1' f2 f1
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
After I posted it, I do see that it's the same as Jaypal posted in a comment.

Using an if/else statement in the middle of AWK

I have a 5-column file:
PS 6 15 0 1
PS 1 17 0 1
PS 4 18 0 1
that I would like to get it in this 7-column format:
PS.15 PS 6 N 1 0 1
PS.17 PS 1 P 1 0 1
PS.18 PS 4 N 1 0 1
Creating 6 of the 7 columns requires just grabbing values directly (and sometimes applying small arithmetic) from columns in the original file. However, creating one column (column 4) requires an if-else statement.
Specifically, to create new columns 1, 2, 3, I use:
cat File | awk '{print $1"."$3"\t"$1"\t"$2}'
and to create new columns 5, 6,7, I use:
cat testFileB | awk '{print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
and to create new column 4, I use:
cat testFileB | awk '{if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N";}'
These three statements work fine independently and get me what I want (the correct values for the columns that are all separated by tabs). However, when I try to apply them simultaneously (create all 7 columns at once), I can only do so with unwanted new lines (instead of tabs) before and after column 4 (the if/else statement column):
For instance, my attempt to simultaneously create columns 1, 2, 3, 4:
cat File | awk '{print $1"."$3"\t"$1"\t"$2; if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N";}'
results in unwanted new lines before column 4:
PS.15 PS 6
N
PS.17 PS 1
P
PS.18 PS 4
Similarly, my attempt to simultaneously create columns 4, 5, 6, 7:
cat File | awk '{if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
results in unwanted new lines after column 4:
N
1 0 1
P
1 0 1
N
1 0 1
Is there a solution so that I can create all 7 columns at once, and there are only tabs between them (no new lines)?
If you don't want automatic line feeds, you can just use printf instead of print. I'm not quite sure whether you want a tab separating the P/N from the next column or not, but that's easy enough to adjust:
cat testfile | awk '{printf "%s.%s\t%s\t%s\t",$1,$3,$1,$2; if ($2 == 1 || $2 == 2 || $2 == 3) printf "P"; else printf "N"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
PS.15 PS 6 N1 0 1
PS.17 PS 1 P1 0 1
PS.18 PS 4 N1 0 1
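If you do want the tab there, a minimal adjustment of the same command (a sketch, not tested against the real data) is to make the conditional printf emit the tab as well:
cat testfile | awk '{printf "%s.%s\t%s\t%s\t",$1,$3,$1,$2; if ($2 == 1 || $2 == 2 || $2 == 3) printf "P\t"; else printf "N\t"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'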
Simply set your OFS (instead of repeating a \t all across the line), and use the ternary operator to print P or N:
$ awk -v OFS='\t' '{s=$4+$5;print $1"."$3,$1,$2,($2~/^[123]$/?"P":"N"),s,$4/s,$5/s}' file
PS.15 PS 6 N 1 0 1
PS.17 PS 1 P 1 0 1
PS.18 PS 4 N 1 0 1

How to select rows in which columns two and three are not equal to each other and to 0 or 1? (with awk)

I have a file like this:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474642 0 0
AX-75474643 0.25 0.820513
AX-75448113 1 0
AX-75474641 1 1
and I want to select the rows where columns 2 and 3 are not both equal to each other and equal to 0 or 1. (That is, if columns 2 and 3 are equal but equal to 0.5, or any other number except 0 and 1, I would still like to keep that row.)
so the output would be:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474643 0.25 0.820513
AX-75448113 1 0
I know how to write the command to select the rows where columns 2 and 3 are equal to each other and equal to 0 or 1, which is this:
awk '$2=$3==1 || $2=$3==0' test.txt | wc -l
but I want exactly the opposite: to select every row that is not in the output of the above command!
Thanks, I hope I was able to explain what I want.
It might work for you if I get your requirements correctly.
awk ' $2 != $3 { print; next } $2 == $3 && $2 != 0 && $2 != 1 { print }' INPUTFILE
See it in action at Ideone.com
This might work for you (?):
awk '($2==0 || $2==1) && ($3==0 || $3==1) && $2==$3{next}1' file
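Equivalently, the condition to keep a row can be written directly instead of skipping unwanted rows with next; this is just a restatement of the same logic:
awk '!($2 == $3 && ($2 == 0 || $2 == 1))' test.txt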
