Calculating average in irregular intervals without considering missing values in shell script? - linux

I have a dataset with many missing values as -999. Part of the data is
input.txt
30
-999
10
40
23
44
-999
-999
31
-999
54
-999
-999
-999
-999
-999
-999
10
23
2
5
3
8
8
7
9
6
10
and so on
I would like to calculate the average over a repeating interval of 5, 6, and 6 rows, without considering the missing values.
Desired output is
ofile.txt
25.75 (i.e. consider first 5 rows and take average without considering missing values, so (30+10+40+23)/4)
43 (i.e. consider next 6 rows and take average without considering missing values, so (44+31+54)/3)
-999 (i.e. consider next 6 and take average without considering missing values. Since all are missing, so write as a missing value -999)
8.6 (i.e. consider next 5 rows and take average (10+23+2+5+3)/5)
8 (i.e. consider next 6 rows and take average)
I can do this if it is a regular interval (let's say 5) with
awk '!/\-999/{sum += $1; count++} NR%5==0{print count ? (sum/count) :-999;sum=count=0}' input.txt
I asked a similar question with a regular interval here: Calculating average without considering missing values in shell script? But here I am asking for a solution for irregular intervals.

With AWK
awk -v f="5" 'f&&f--&&$0!=-999{c++;v+=$0} NR%17==0{f=5;r++}
!f&&NR%17!=0{f=6;r++} r&&!c{print -999;r=0} r&&c{print v/c;r=v=c=0}
END{if(c!=0)print v/c}' input.txt
Output
25.75
43
-999
8.6
8
Breakdown
f&&f--&&$0!=-999{c++;v+=$0} #add valid values and increment count
NR%17==0{f=5;r++} #reset to 5,6,6 pattern
!f&&NR%17!=0{f=6;r++} #set 6 if pattern doesn't match
r&&!c{print -999;r=0} #print -999 if no valid values
r&&c{print v/c;r=v=c=0} #print avg
END{
if(c!=0) #print remaining values avg
print v/c
}

$ cat tst.awk
function nextInterval(   intervals) {
    numIntervals = split("5 6 6", intervals)
    intervalsIdx = (intervalsIdx % numIntervals) + 1
    return intervals[intervalsIdx]
}
BEGIN {
    interval = nextInterval()
    noVal = -999
}
$0 != noVal {
    sum += $0
    cnt++
}
++numRows == interval {
    print (cnt ? sum / cnt : noVal)
    interval = nextInterval()
    numRows = sum = cnt = 0
}
$ awk -f tst.awk file
25.75
43
-999
8.6
8
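The 5,6,6 pattern above is hard-coded inside the function. As a minimal sketch of the same idea, the repeating pattern and the missing-value marker can instead be passed as command-line variables (the names pat and noVal are my own choices, not from the answer):

```shell
awk -v pat='5 6 6' -v noVal=-999 '
BEGIN { n = split(pat, iv); idx = 1 }     # iv[1..n] holds the repeating pattern
$0 != noVal { sum += $0; cnt++ }          # accumulate only non-missing values
++rows == iv[idx] {                       # current interval is full
    print (cnt ? sum / cnt : noVal)       # average, or the marker if all missing
    idx = idx % n + 1                     # cycle 5 -> 6 -> 6 -> 5 -> ...
    rows = sum = cnt = 0
}' input.txt
```

Like tst.awk, this ignores a trailing partial group; add an END block if you need it.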

Related

Linux: filter text rows by sum of specific columns

From raw sequencing data I created a count file (.txt) with the counts of unique sequences per sample.
The data looks like this:
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA... 46 0 1 1 8 1 0 1 5
AAAAA... 46 50 1 5 0 2 0 4 0
...
TTTTT... 71 0 0 5 7 5 47 2 2
TTTTT... 81 5 4 1 0 7 0 1 1
I would like to filter the sequences by row sum, so that only rows with a total sum over all samples (sum of S1 to S8) lower than, for example, 100 are removed.
This can probably be done with awk, but I have no experience with this text-processing utility.
Can anyone help?
Give this a try:
awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt
It will skip line 1: NR>1
Then it will sum the items per row starting from field 3 (S1 to S8 in your example):
{sum=0; for (i=3; i<=NF; i++) { sum+= $i }
Then it will only print rows whose sum is greater than 100: if (sum > 100) print}'
You could modify the condition based on the sum, but hopefully this gives you an idea of how to do it with awk.
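A variant of the same idea (my own sketch, not from the answer above) that keeps the header line and appends the computed row sum as an extra column, so you can see which rows survived the threshold:

```shell
awk 'NR == 1 { print; next }              # pass the header through untouched
{
    sum = 0
    for (i = 3; i <= NF; i++) sum += $i   # sum the sample columns (field 3 onward)
    if (sum > 100) print $0, sum          # keep the row and append its sum
}' file.txt
```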
The following awk may help you with the same.
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print sum > "out_file"};sum=""}' Input_file
In case you need a separate out file per qualifying row, the following may help (the file name has to vary per record, which FNR provides):
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print sum > ("out_file" FNR)};sum=""}' Input_file

Average column if value in other column matches and print as additional column

I have a file like this:
Score 1 24 HG 1
Score 2 26 HG 2
Score 5 56 RP 0.5
Score 7 82 RP 1
Score 12 97 GM 5
Score 32 104 LS 3
I would like to average column 5 if column 4 are identical and print the average as column 6 so that it looks like this:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
I have tried a couple of solutions I found on here.
e.g.
awk '{ total[$4] += $5; ++n[$4] } END { for(i in total) print i, total[i] / n[i] }'
but they all end up with this:
HG 1.5
RP 0.75
GM 5
LS 3
Which is undesirable as I lose a lot of information.
You can iterate through your table twice: calculate the averages (as you already do) on the first iteration, and then print them out on the second iteration:
awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' file file
Notice the file twice at the end. While going through the "first" file, NR==FNR, and we sum the appropriate values, keeping them in memory (variables total and n). During "second" file traversal, NR>FNR, and we print out all the original data + averages:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
You can use 1 pass through the file, but you have to store in memory the entire file, so disk i/o vs memory tradeoff:
awk '
BEGIN {FS = OFS = "\t"}
{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}
' file
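Since the sample data in the question looks space-separated rather than tab-separated, here is the same one-pass idea with awk's default whitespace field splitting (drop the FS/OFS assumption above if your real file is not tab-delimited):

```shell
# One pass: remember every line and its group key, then print each
# line with its group's average appended in the END block.
awk '{ total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4 }
END { for (i = 1; i <= NR; i++) print line[i], total[key[i]] / n[key[i]] }' file
```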

Calculate average of 1kb windows

My files looks like the following:
18 1600014 + CAA 0 3
18 1600017 - CTT 0 1
18 1600019 - CTC 0 1
18 1600020 + CAT 0 3
18 1600031 - CAA 0 1
18 1600035 - CAT 0 1
...
I am trying to calculate the average of column 6 in windows that cover 1000 range of column 2. So from 1600001-1601000, 1601001-1602000, etc. My values go from 1600000-1700000. Is there any way to do this is one step? My initial thought was to use grep to sort these values, but that would require many different commands. I am aware you can calculate the average with awk but can you reiterate over each window?
Desire output would be something like this:
1600001-1601000 3.215
1601001-1602000 3.141
1602001-1603000 3.542
You can use GNU awk to gather the counts and sums. If I understand your problem correctly, you might need something like this:
BEGIN {
    mod = 1000
    PROCINFO["sorted_in"] = "#ind_num_asc"
}
{
    k = ($2 - ($2 % mod)) / mod
    sum[k] += $6
    cnt[k]++
}
END {
    for (k in sum)
        printf("%d-%d\t%6.3f\n", k*mod + 1, (k+1)*mod, sum[k] / cnt[k])
}
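PROCINFO["sorted_in"] is GNU-awk-only. As a portable sketch of the same windowing (my own variant, not the answer above), any POSIX awk can do the grouping and sort(1) can order the windows afterwards; note it uses int(($2-1)/mod) so that a boundary position such as 1601000 lands in the 1600001-1601000 window:

```shell
awk '{ k = int(($2 - 1) / 1000)           # index of the 1000-wide window
       sum[k] += $6; cnt[k]++ }           # accumulate column 6 per window
END  { for (k in sum)
           printf("%d-%d\t%.3f\n", k*1000 + 1, (k+1)*1000, sum[k]/cnt[k]) }' input.txt |
sort -n
```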

Calculating average without considering missing values in shell script?

I have a dataset with many missing values as -999. Part of the data is
input.txt
30
-999
10
40
23
44
-999
-999
31
-999
54
-999
-999
-999
-999
-999
-999
-999
and so on
I would like to calculate the average over each 6-row interval without considering the missing values.
Desired output is
ofile.txt
29.4
42.5
-999
When I try this
awk '!/\-999/{sum += $1; count++} NR%6==0{print count ? (sum/count) : count;sum=count=0}' input.txt
it gives
29.4
42.5
0
I'm not entirely sure why, if you're discounting -999 values, you'd think that -999 was a better choice than zero for the average of the third group. In the first two groups, the -999 values contribute to neither the sum nor the count, so an argument could be made that zero is a better choice.
However, it may be that you want -999 to represent a "lack of value" (which would certainly be the case where there were no values in a group). If that's the case, you can just output -999 rather than count in your original code:
awk '!/\-999/{sm+=$1;ct++} NR%6==0{print ct?(sm/ct):-999;sm=ct=0}' input.txt
Even if you decide that zero is a better answer, I'd still make that explicit rather than outputting the count variable itself:
awk '!/\-999/{sm+=$1;ct++} NR%6==0{print ct?(sm/ct):0;sm=ct=0}' input.txt

How to extract every N columns and write into new files?

I've been struggling to write code to extract every N columns from an input file and write them into output files according to their extraction order.
(My real-world case is to extract every 800 columns from a 24005-column file, starting at column 6, so I need a loop.)
In the simpler case below, I extract every 3 columns (fields) from an input file, starting at the 2nd column.
for example, if the input file looks like:
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
and I want the output to look like this:
output_file_1:
1 2 3
1 2 3
1 2 3
1 2 3
output_file_2:
4 5 6
4 5 6
4 5 6
4 5 6
output_file_3:
7 8 9
7 8 9
7 8 9
7 8 9
I tried this, but it doesn't work:
awk 'for(i=2;i<=10;i+a) {{printf "%s ",$i};a=3}' <inputfile>
It gave me syntax error and the more I fix the more problems coming out.
I also tried the Linux command cut, but with large files this seems laborious. And I wonder if cut could do a looped cut of every 3 fields just like the awk.
Can someone please help me with this and give a quick explanation? Thanks in advance.
Actions to be performed by awk on the input data must be enclosed in curly braces, so the reason the awk one-liner you tried results in a syntax error is that the for loop does not respect this rule. A syntactically correct version is:
awk '{for(i=2;i<=10;i+a) {printf "%s ",$i};a=3}' <inputfile>
This is syntactically correct (almost; see the end of this post), but does not do what you think.
To separate the output columns into different files, the best approach is to use awk's redirection operator >. This will give you the desired output, given that your input file always has 10 columns:
awk '{ print $2,$3,$4 > "file_1"; print $5,$6,$7 > "file_2"; print $8,$9,$10 > "file_3"}' <inputfile>
Mind the double quotes (" ") used to specify the filenames.
EDITED: REAL WORLD CASE
If you have to loop along the columns because you have too many of them, you can still use awk (gawk), with two loops: one on the output files and one on the columns per file. This is a possible way:
#!/usr/bin/gawk -f
BEGIN {
    CTOT  = 24005    # total number of columns, you can use NF as well
    DELTA = 800      # columns per file
    START = 6        # first useful column
    d = CTOT/DELTA   # number of output files
}
{
    for (i = 0; i < d; i++) {
        for (j = 0; j < DELTA; j++) {
            printf("%f\t", $(START+j+i*DELTA)) > "file_out_"i
        }
        printf("\n") > "file_out_"i
    }
}
I have tried this on the simple input file in your example. It works if CTOT is divisible by DELTA. I assumed you had floats (%f); just change that to what you need.
Let me know.
P.s. going back to your original one-liner, note that the loop is an infinite one, as i is never incremented: i+a must be replaced by i+=a, and a=3 must go inside the inner braces:
awk '{for(i=2;i<=10;i+=a) {printf "%s ",$i;a=3}}' <inputfile>
this evaluates a=3 at every cycle, which is a bit pointless. A better version would thus be:
awk '{for(i=2;i<=10;i+=3) {printf "%s ",$i}}' <inputfile>
Still, this will just print the 2nd, 5th and 8th column of your file, which is not what you wanted.
awk '{ print $2, $3, $4 >"output_file_1";
print $5, $6, $7 >"output_file_2";
print $8, $9, $10 >"output_file_3";
}' input_file
This makes one pass through the input file, which is preferable to multiple passes. Clearly, the code shown only deals with the fixed number of columns (and therefore a fixed number of output files). It can be modified, if necessary, to deal with variable numbers of columns and generating variable file names, etc.
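As a sketch of that generalization (the variable names start and width are my own, not from the answer): loop across the fields in width-sized chunks and derive each output file name from the chunk index, so the number of files adapts to NF:

```shell
awk -v start=2 -v width=3 '{
    for (i = start; i <= NF; i += width) {
        # chunk index 1, 2, 3, ... becomes the file suffix
        file = "output_file_" (int((i - start) / width) + 1)
        sep = ""
        for (j = i; j < i + width && j <= NF; j++) {
            printf "%s%s", sep, $j > file
            sep = " "
        }
        print "" > file                  # terminate the output line
    }
}' inputfile
```

With the 10-column sample input this writes output_file_1 through output_file_3 holding columns 2-4, 5-7, and 8-10 respectively.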
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In that case, you're correct; you need a loop. In fact, you need two loops:
awk 'BEGIN { gap = 800; start = 6; filebase = "output_file_"; }
{
for (i = start; i < start + gap; i++)
{
file = sprintf("%s%d", filebase, i);
for (j = i; j <= NF; j += gap)
printf("%s ", $j) > file;
printf "\n" > file;
}
}' input_file
I demonstrated this to my satisfaction with an input file with 25 columns (numbers 1-25 in the corresponding columns) and gap set to 8 and start set to 2. The output below is the resulting 8 files pasted horizontally.
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
With GNU awk:
$ awk -v d=3 '{for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3",""); print "----"}' file
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
Just redirect the output to files if desired:
$ awk -v d=3 '{sfx=0; for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3","") > ("output_file_" ++sfx)}' file
The idea is just to tell gensub() to skip the first few (i-1) fields then print the number of fields you want (d = 3) and ignore the rest (.*). If you're not printing exact multiples of the number of fields you'll need to massage how many fields get printed on the last loop iteration. Do the math...
Here's a version that'd work in any awk. It requires 2 loops and modifies the spaces between fields but it's probably easier to understand:
$ awk -v d=3 '{sfx=0; for(i=2;i<=NF;i+=d) {str=fs=""; for(j=i;j<i+d;j++) {str = str fs $j; fs=" "}; print str > ("output_file_" ++sfx)} }' file
I was successful using the following command line. :) It uses a for loop and pipes the awk program into its stdin using -f -. The awk program itself is created using bash arithmetic expansion.
for i in 0 1 2; do
echo "{print \$$((i*3+2)) \" \" \$$((i*3+3)) \" \" \$$((i*3+4))}" \
| awk -f - t.file > "file$((i+1))"
done
Update: After the question was updated, I tried to hack a script that creates the requested 800-column awk script dynamically (a version along the lines of Jonathan Leffler's answer) and pipes it to awk. Although the script's output looks good (to me), it produces an awk syntax error. Is this too much for awk, or am I missing something? I would really appreciate feedback!
Update: Investigated this and found documentation saying awk has a lot of restrictions and recommending gawk (GNU's awk implementation) in these situations. I've done that, but I still get a syntax error. Feedback still appreciated!
#!/bin/bash
# Note! Although the script's output looks ok (for me)
# it produces an awk syntax error. is this just too much for awk?
# open pipe to stdin of awk
exec 3> >(gawk -f - test.file)
# verify output using cat
#exec 3> >(cat)
echo '{' >&3
# write dynamic script to awk
for i in {0..24005..800} ; do
echo -n " print " >&3
for (( j=$i; j <= $((i+800)); j++ )) ; do
echo -n "\$$j " >&3
if [ $j = 24005 ] ; then
break
fi
done
echo "> \"file$((i/800+1))\";" >&3
done
echo "}" >&3  # without >&3 the closing brace never reaches awk, causing the syntax error
