Calculate average of 1kb windows - linux

My file looks like the following:
18 1600014 + CAA 0 3
18 1600017 - CTT 0 1
18 1600019 - CTC 0 1
18 1600020 + CAT 0 3
18 1600031 - CAA 0 1
18 1600035 - CAT 0 1
...
I am trying to calculate the average of column 6 in windows spanning a range of 1000 in column 2, i.e. 1600001-1601000, 1601001-1602000, and so on. My values go from 1600000 to 1700000. Is there any way to do this in one step? My initial thought was to use grep to sort these values into windows, but that would require many different commands. I am aware you can calculate an average with awk, but can you iterate over each window?
Desired output would be something like this:
1600001-1601000 3.215
1601001-1602000 3.141
1602001-1603000 3.542

You can use GNU awk to gather the counts and sums. If I understand your problem correctly, you might need something like this:
BEGIN {
    mod = 1000
    PROCINFO["sorted_in"] = "@ind_num_asc"   # GNU awk: traverse keys in numeric order
}
{
    k = ($2 - ($2 % mod)) / mod
    sum[k] += $6
    cnt[k]++
}
END {
    for (k in sum)
        printf("%d-%d\t%6.3f\n", k*mod + 1, (k+1)*mod, sum[k] / cnt[k])
}
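The PROCINFO["sorted_in"] line requires GNU awk. Assuming the script above is saved under a hypothetical name such as windows.awk, it could be run like this:
gawk -f windows.awk input.txt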

Related

Linux filter text rows by sum of specific columns

From raw sequencing data I created a count file (.txt) with the counts of unique sequences per sample.
The data looks like this:
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA... 46 0 1 1 8 1 0 1 5
AAAAA... 46 50 1 5 0 2 0 4 0
...
TTTTT... 71 0 0 5 7 5 47 2 2
TTTTT... 81 5 4 1 0 7 0 1 1
I would like to filter the sequences by row sum, so that only rows with a total sum over all samples (S1 to S8) lower than, for example, 100 are removed.
This can probably be done with awk, but I have no experience with this text-processing utility.
Can anyone help?
Give this a try:
awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt
It will skip line 1 (the header) with NR>1.
Then it will sum the fields of each row starting from field 3 (S1 to S8 in your example):
{sum=0; for (i=3; i<=NF; i++) { sum+= $i }
Then it will only print rows whose sum is greater than 100: if (sum > 100) print
You could modify/test the condition based on the sum, but I hope this gives you an idea of how to do it with awk.
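As a small variation (my own sketch, not part of the answer above), the cutoff could be passed in as an awk variable instead of being hard-coded:
awk -v min=100 'NR>1 {sum=0; for (i=3; i<=NF; i++) sum+=$i; if (sum > min) print}' file.txt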
The following awk may help you with the same.
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print sum > "out_file"};sum=""}' Input_file
In case you need different output files, then the following may help.
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print sum > "out_file"++i};sum=""}' Input_file

Average column if value in other column matches and print as additional column

I have a file like this:
Score 1 24 HG 1
Score 2 26 HG 2
Score 5 56 RP 0.5
Score 7 82 RP 1
Score 12 97 GM 5
Score 32 104 LS 3
I would like to average column 5 where column 4 is identical and print the average as column 6, so that it looks like this:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
I have tried a couple of solutions I found on here.
e.g.
awk '{ total[$4] += $5; ++n[$4] } END { for(i in total) print i, total[i] / n[i] }'
but they all end up with this:
HG 1.5
RP 0.75
GM 5
LS 3
Which is undesirable as I lose a lot of information.
You can iterate through your table twice: calculate the averages (as you already do) on the first iteration, and then print them out on the second iteration:
awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' file file
Notice the file twice at the end. While going through the "first" file, NR==FNR, and we sum the appropriate values, keeping them in memory (variables total and n). During "second" file traversal, NR>FNR, and we print out all the original data + averages:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
You can make just one pass through the file, but then you have to store the entire file in memory, so it is a disk I/O vs. memory tradeoff:
awk '
BEGIN {FS = OFS = "\t"}
{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}
' file
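Note that the sample rows above look space-separated rather than tab-separated; if that is the case (an assumption on my part), a minimal variation simply drops the BEGIN block so awk falls back to its default whitespace splitting:
awk '
{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}
' file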

Calculating average in irregular intervals without considering missing values in shell script?

I have a dataset with many missing values given as -999. Part of the data is
input.txt
30
-999
10
40
23
44
-999
-999
31
-999
54
-999
-999
-999
-999
-999
-999
10
23
2
5
3
8
8
7
9
6
10
and so on
I would like to calculate the average over repeating intervals of 5, 6 and 6 rows, without considering the missing values.
Desired output is
ofile.txt
25.75 (i.e. consider first 5 rows and take average without considering missing values, so (30+10+40+23)/4)
43 (i.e. consider next 6 rows and take average without considering missing values, so (44+31+54)/3)
-999 (i.e. consider next 6 and take average without considering missing values. Since all are missing, so write as a missing value -999)
8.6 (i.e. consider next 5 rows and take average (10+23+2+5+3)/5)
8 (i.e. consider next 6 rows and take average)
I can do it if the interval is regular (let's say 5) with this
awk '!/\-999/{sum += $1; count++} NR%5==0{print count ? (sum/count) :-999;sum=count=0}' input.txt
I asked a similar question with a regular interval here: Calculating average without considering missing values in shell script? But here I am asking for a solution for irregular intervals.
With AWK
awk -v f="5" 'f&&f--&&$0!=-999{c++;v+=$0} NR%17==0{f=5;r++}
!f&&NR%17!=0{f=6;r++} r&&!c{print -999;r=0} r&&c{print v/c;r=v=c=0}
END{if(c!=0)print v/c}' input.txt
Output
25.75
43
-999
8.6
8
Breakdown
f&&f--&&$0!=-999{c++;v+=$0}  #add valid values and increment count
NR%17==0{f=5;r++}            #reset to the 5,6,6 pattern
!f&&NR%17!=0{f=6;r++}        #set 6 if the pattern doesn't match
r&&!c{print -999;r=0}        #print -999 if there are no valid values
r&&c{print v/c;r=v=c=0}      #print the average
END{
    if(c!=0)                 #print the average of any remaining values
        print v/c
}
$ cat tst.awk
function nextInterval( intervals) {
    numIntervals = split("5 6 6", intervals)
    intervalsIdx = (intervalsIdx % numIntervals) + 1
    return intervals[intervalsIdx]
}
BEGIN {
    interval = nextInterval()
    noVal = -999
}
$0 != noVal {
    sum += $0
    cnt++
}
++numRows == interval {
    print (cnt ? sum / cnt : noVal)
    interval = nextInterval()
    numRows = sum = cnt = 0
}
$ awk -f tst.awk file
25.75
43
-999
8.6
8
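A nice property of this approach is that if the repeating pattern of interval lengths ever changes, only the split() call inside nextInterval() needs editing; for example, a hypothetical repeating 5,6,6,7 pattern would just be:
numIntervals = split("5 6 6 7", intervals)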

Find rows common in more than two files using awk [duplicate]

This question already has answers here:
How to find common rows in multiple files using awk
(2 answers)
Closed 7 years ago.
I have tab-delimited text files in which common rows are to be found, based on columns 1 and 2 as key columns.
Sample files:
file1.txt
aba 0 0
abc 0 1
abd 1 1
xxx 0 0
file2.txt
xyz 0 0
aba 0 0 0 0
xxx 0 0
abc 1 1
file3.txt
xyx 0 0
aba 0 0
aba 0 1 0
xxx 0 0 0 1
abc 1 1
I would like to get rows common to 2 files or 3 files, using columns 1 and 2 as the key to search on. For the common rows based on columns 1 and 2, reporting the first occurrence in any file would do the job.
Sample Output for rows common in 2 files:
abc 1 1
Sample output for rows common in 3 files:
aba 0 0
xxx 0 0
In the real scenario I have to specify different values for the number of files. Can anybody suggest a generalized solution where the number of files in which a row has to be common can be passed in?
I have this piece of code which looks for rows common to all files.
awk '
FNR == NR {
    arr[$1,$2] = 1
    line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
    next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
    num_files = ARGC - 1
    for ( key in arr ) {
        if ( arr[key] < num_files ) { continue }
        split( line[ key ], line_arr, SUBSEP )
        for ( i = 1; i <= length( line_arr ); i++ ) {
            printf "%s\n", line_arr[ i ]
        }
    }
}
' *.txt > commoninall.txt
This should work:
cat file[123].txt | sort | awk 'BEGIN{FS="\t"}
{ if (V1==$1 && V2==$2) { b=b+1 }
  else { if (NR>1) print b":"prev; b=1; V1=$1; V2=$2 }
  prev=$0 }
END{ print b":"prev }' | grep "^2:" | awk 'BEGIN{FS=":"} {print $2}'
I cat all the files into one stream, sort the lines, count how many consecutive lines share the same first two tab-separated columns (printing each group's count together with its last line), and then keep only the lines whose count is exactly 2.
BTW: I took this nice file[123].txt globbing idea from the comment of William Pursell.
This should work too
I put all the lines in an array (b), keyed by the first two values, and accumulate the number of repetitions in a. If the number is > 1, the line is printed from b, which holds the last line saved for that column1/column2 combination.
cat *.txt | awk -F" " '{a[$1$2]=a[$1$2]+1; b[$1$2]=$0} END{ for (i in a){if(a[i]>1){print b[i]}}}'
Is it ok too?
EDIT
To show all lines in all files, you need just a little more:
cat *.txt | awk -F" " '{a[$1$2]=a[$1$2]+1; c=a[$1$2]; b[$1$2c]=$0} END{ for (i in a){if(a[i]>1){for(c=1; c<=a[i];++c){print b[i c]}}}}'
Many thanks to @PeterPaulKiefer for the cat *.txt idea
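Neither one-liner lets you choose the number of files a row must appear in, which the question asks for. As a minimal sketch of one way to generalize this (my own suggestion, not from the answers above), you could count distinct files per ($1,$2) key and pass the required count in a variable such as want:
awk -v want=2 '
{ key = $1 SUBSEP $2 }
!seen[key, FILENAME]++ { files[key]++ }    # count each file at most once per key
!(key in first)        { first[key] = $0 } # remember the first occurrence of the key
END { for (key in files) if (files[key] == want) print first[key] }
' file1.txt file2.txt file3.txt
Change == to >= if "common in at least N files" is what you need.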

How to loop an awk command on every column of a table and output to a single output file?

I have a multi-column file composed of single-unit 1s, 2s and 3s. There are a lot of repeats of a unit in each column, and sometimes it switches from one to another. I want to count how many times this switch happens in every column. For example, in column 1 the value switches from 1 to 2 to 3 to 1, so there are 3 switches and the output should be 3. In the second column there are 2s in the entire column, so there are 0 changes and the output is 0.
My input file has 4000 columns so it is impossible to do this by hand. The file is space-separated.
For example:
Input:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 ......
3 2 2 1 2
3 2 2 1 1
1 2 2 1 1
1 2 2 1 2
1 2 2 1 1
Desired output:
3 ## column 1 switch times
0 ## column 2 switch times
3 .....
0
5
I was using:
awk '{print $1}' <inputfile> | uniq | wc -l
awk '{print $2}' <inputfile> | uniq | wc -l
awk '{print $3}' <inputfile> | uniq | wc -l
....
This executes one column at a time. It gives me the output "4" for the first column, and then I just calculate 4-1=3 to get my desired output. But is there a way to write this awk command in a loop that executes on each column and outputs everything to one file?
Thanks!
awk tells you how many fields are in a given row in the variable NF, so you can create two arrays to keep track of the information you need. One array will keep the value of the last row in the given column. The other will count the number of switches in a given column. You'll also keep a track of the maximum number of columns (and set the counts for new columns to zero so that they are printed appropriately in the output at the end if the number of switches is 0 for that column). You'll also make sure you don't count the transition from an empty string to a non-empty string — which happens when the column is encountered for the first time.
If, in fact, the file uniformly has the same number of columns, that will only affect the first row of data. If subsequent rows actually have more columns than the first line, then it adds them. If a column stops appearing for a bit, I've assumed it should resume where it left off (as if the missing columns had the same value as before). You can decide on different algorithms; that could count as two transitions (from number to blank and from blank to number). If that's the case, you have to modify the counting code. Or, perhaps more sensibly, you could decide that irregular numbers of columns are simply not allowed, in which case you can bail out early if the number of columns in the current row is not the same as in the previous row (beware blank lines, or are they outlawed too?).
And you won't try writing the whole program on one line because it will be incomprehensible and it really isn't necessary.
awk '{ if (NF > maxNF)
       {
           for (i = maxNF + 1; i <= NF; i++)
               count[i] = 0;
           maxNF = NF;
       }
       for (i = 1; i <= NF; i++)
       {
           if (col[i] != "" && $i != col[i])
               count[i]++;
           col[i] = $i;
       }
     }
     END {
         for (i = 1; i <= maxNF; i++)
             print count[i];
     }' data-file-with-4000-columns
Given your sample data (with the dots removed), the output from the script is as requested:
3
0
3
0
5
This alternative data file with jagged rows:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 1 1 1
3 2 2 1 2 2 1
3 2 2 1 1
1 2 2 1 1 2 2 1
1 2 2 1
1 2 2 1 1 3
produces the output:
3
0
3
0
3
2
1
0
Which is correct according to the rules I formulated — but if you decide you want different rules to cover the data, you can end up with different answers.
If you used printf("%d\n", count[i]); in the final loop, you'd not need to set the count values to zero in a loop. You pays your money and takes your pick.
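For reference, a sketch of that printf variant of the END block (an uninitialized count[i] is an empty string, which %d prints as 0):
END {
    for (i = 1; i <= maxNF; i++)
        printf("%d\n", count[i]);
}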
Use a loop and keep one array for each column's current value and another array for the corresponding count:
awk '{for(i=0;i<5;i++) if(c[i]!=$(i+1)) {c[i]=$(i+1); t[i]++}} END{for(i=0;i<5;i++)print t[i]-1}' filename
Note that this assumes that the column values are not zero. If you happen to have zero values, then just initialize the array c to some unique value which will not be present in the file.
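Since the real file has 4000 columns, a minimal variation of the same one-liner (my own tweak, not part of the original answer) replaces the hardcoded 5 with NF, at the cost of assuming every row has the same number of fields; the same zero-value caveat applies:
awk '{for(i=1;i<=NF;i++) if(c[i]!=$i){c[i]=$i; t[i]++}} END{for(i=1;i<=NF;i++)print t[i]-1}' filename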
Coded out longhand for ease of viewing; SaveColx and CountColx should really be arrays. I'd print the column number itself in the results, at least for checking :-)
BEGIN {
    SaveCol1 = " "
    CountCol1 = 0
    CountCol2 = 0
    CountCol3 = 0
    CountCol4 = 0
    CountCol5 = 0
}
{
    if ( SaveCol1 == " " ) {
        SaveCol1 = $1
        SaveCol2 = $2
        SaveCol3 = $3
        SaveCol4 = $4
        SaveCol5 = $5
        next
    }
    if ( $1 != SaveCol1 ) {
        CountCol1++
        SaveCol1 = $1
    }
    if ( $2 != SaveCol2 ) {
        CountCol2++
        SaveCol2 = $2
    }
    if ( $3 != SaveCol3 ) {
        CountCol3++
        SaveCol3 = $3
    }
    if ( $4 != SaveCol4 ) {
        CountCol4++
        SaveCol4 = $4
    }
    if ( $5 != SaveCol5 ) {
        CountCol5++
        SaveCol5 = $5
    }
}
END {
    print CountCol1
    print CountCol2
    print CountCol3
    print CountCol4
    print CountCol5
}
