Calculating average without considering missing values in shell script? - linux

I have a dataset with many missing values as -999. Part of the data is
input.txt
30
-999
10
40
23
44
-999
-999
31
-999
54
-999
-999
-999
-999
-999
-999
-999 and so on
I would like to calculate the average over each 6-row interval without considering the missing values.
The desired output is
ofile.txt
29.4
42.5
-999
While I am trying with this
awk '!/\-999/{sum += $1; count++} NR%6==0{print count ? (sum/count) : count;sum=count=0}' input.txt
it is giving
29.4
42.5
0

I'm not entirely sure why, if you're discounting -999 values, you'd think that -999 was a better choice than zero for the average of the third group. In the first two groups, the -999 values contribute to neither the sum nor the count, so an argument could be made that zero is a better choice.
However, it may be that you want -999 to represent a "lack of value" (which would certainly be the case where there were no values in a group). If that's the case, you can just output -999 rather than count in your original code:
awk '!/\-999/{sm+=$1;ct++} NR%6==0{print ct?(sm/ct):-999;sm=ct=0}' input.txt
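Run against the sample input.txt above, this should print exactly the desired output:
$ awk '!/\-999/{sm+=$1;ct++} NR%6==0{print ct?(sm/ct):-999;sm=ct=0}' input.txt
29.4
42.5
-999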
Even if you decide that zero is a better answer, I'd still make that explicit rather than outputting the count variable itself:
awk '!/\-999/{sm+=$1;ct++} NR%6==0{print ct?(sm/ct):0;sm=ct=0}' input.txt

Related

Adding a number to column [line by line]

I have a text file named text. The rows and columns are:
1 A 18 -180
2 B 19 -180
3 C 20 -150
50 D 21 -100
128 E 22 -130
10 F 23 -0
10 G 23 -0
What I want to do is print out the 4th column, adding a constant number to each line (except lines where it is 0). This is what I have done so far:
#!/bin/bash
FILE="/dir/text"
while IFS= read -r line
do
echo "$line"
done <"$FILE"
I can read the fourth column, but I also want to take an argument $1 that adds a constant number to the fourth column of every line, except lines where the fourth column is 0.
UPDATE:
The desired output would be like this [lines whose fourth column is zero are ignored]:
-160
-160
-130
-80
-110
For example, the program name is example.sh. I want to add a number to the fourth column using an argument. Therefore it would be:
example.sh $1
where $1 could be any number I want to add in the 4th column.
You should use awk here, which will be faster than bash.
awk -v number="100" '$4!=0{$4+=number} 1' Input_file
number is an awk variable where you could set its value as per your need.
Explanation: Adding detailed explanation for above code.
awk -v number="100" ' ##Starting awk program from here and creating a variable number whose value is 100.
$4!=0{ ##Checking condition if 4th column is NOT zero then do following.
$4+=number ##Adding variable number to 4th column here.
}
1 ##Mentioning 1 will print edited/non-edited lines.
' Input_file ##mentioning Input_file name here.
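For instance, with number set to 20 (the value implied by the desired output above) and the sample saved as Input_file, this prints whole records with the fourth column updated; note the question's desired output lists only the fourth column, while this prints every full line:
$ awk -v number="20" '$4!=0{$4+=number} 1' Input_file
1 A 18 -160
2 B 19 -160
3 C 20 -130
50 D 21 -80
128 E 22 -110
10 F 23 -0
10 G 23 -0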
In order to preserve your formatting while adding the values to the 4th field with awk, you can calculate the new value of the 4th field and then use sub to change it in place, without forcing awk to rebuild the record and collapse the whitespace.
For example, with your file stored as text and adding a value of 180 to the 4th field (except where 0), you could do:
awk -v n=180 '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)}1' text
(The -? in the regex is needed so the sign of a negative field is replaced along with the digits.)
Doing so would produce the following output:
$ awk -v n=180 '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)}1' text
1 A 18 0
2 B 19 0
3 C 20 30
50 D 21 80
128 E 22 50
10 F 23 -0
10 G 23 -0
If called within a shell script, you could pass your $1 parameter as:
awk -v n="$1" '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)}1' text
Though I would suggest checking that an argument has been provided to the script with:
[ -z "$1" ] && {
    echo "error: value required as argument"
    exit 1
}
or you can provide a default value -- up to you.
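Putting it together, a minimal example.sh sketch (the /dir/text path comes from the question; the default value of 100 is an assumption):
#!/bin/bash
# usage: example.sh [number]
n="${1:-100}"    # fall back to a default of 100 if no argument is given
awk -v n="$n" '$4!=0 {newval=$4+n; sub(/-?[0-9]+$/,newval)}1' /dir/text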
With bash:
while read -ra a; do [[ ${a[3]} != -0 ]] && ((a[3]+=42)); echo "${a[@]}"; done < file
Output:
1 A 18 -138
2 B 19 -138
3 C 20 -108
50 D 21 -58
128 E 22 -88
10 F 23 -0
10 G 23 -0

How to find a pattern and perform an operation on another field in awk?

I have a file with 4 columns separated by spaces, like this below:
1_86500000 50 1_87500000 19
1_87500000 13 1_89500000 42
1_89500000 25 1_90500000 10
1_90500000 3 1_91500000 11
1_91500000 23 1_92500000 29
1_92500000 34 1_93500000 4
1_93500000 39 1_94500000 49
1_94500000 35 1_95500000 26
2_35500000 1 2_31500000 81
2_31500000 12 2_4150000 50
The first and third columns are not aligned row by row, so I cannot simply divide the value in one by the other.
Since a key may appear in only one of the columns $1 or $3 (or in both), a solution would be to look up each key's value in the other column and divide, or fall back to 0 where there is none, as this expected result shows:
P.S. The second field in this expected result is just illustrative, to show the division.
1_86500000 0/50 0
1_87500000 19/13 1.46154
1_89500000 42/25 1.68
1_90500000 10/3 3.333
1_91500000 11/23 0.47826
1_92500000 29/34 0.85294
1_93500000 4/39 0.10256
1_94500000 49/35 1.4
2_35500000 0/1 0
2_31500000 81/12 6.75
2_4150000 50/0 50
I have not achieved anything by myself beyond the attempts below, so I do not really have a starting point.
I tried separating the fields joined with _ to see if I could match rows by subtracting the coordinates; getting 0 would mean the columns were aligned and correct. But I could not get further.
awk '{if( ($5-$2)==0) print $1,$2,$3,$4,$5,$6}' file
I tried to match both columns, but I only got results for rows that were already aligned:
awk '{if(($1==$3)) print $1,$4/$2}' file
Can you help me?
awk to the rescue!
$ awk '{d[$1]=$2; n[$3]=$4}
       END {for(k in n)
              if(k in d) {print k,n[k]"/"d[k],n[k]/d[k]; delete d[k]}
              else print k,n[k]"/0",n[k];
            for(k in d) print k,"0/"d[k],0}' file | sort
1_86500000 0/50 0
1_87500000 19/13 1.46154
1_89500000 42/25 1.68
1_90500000 10/3 3.33333
1_91500000 11/23 0.478261
1_92500000 29/34 0.852941
1_93500000 4/39 0.102564
1_94500000 49/35 1.4
1_95500000 26/0 26
2_31500000 81/12 6.75
2_35500000 0/1 0
2_4150000 50/0 50
Your division-by-zero result is a little strange, though!
Explanation: keep two arrays, one for numerators and one for denominators. Once the file has been scanned, go over the numerator array, find the corresponding denominator, and do the division. For the denominators that were never used, apply the convention given.
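One design note: for (k in array) visits keys in an unspecified order, which is why the output is piped through sort. With GNU awk you could instead order the traversal inside awk, as in this sketch (gawk-only; keys that appear only as denominators would still print after the rest, so the external sort remains the simplest fully ordered option):
awk 'BEGIN {PROCINFO["sorted_in"]="@ind_str_asc"}    # gawk-only: for-in visits keys in string order
     {d[$1]=$2; n[$3]=$4}
     END {for(k in n)
            if(k in d) {print k, n[k]"/"d[k], n[k]/d[k]; delete d[k]}
            else print k, n[k]"/0", n[k]
          for(k in d) print k, "0/"d[k], 0}' file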

Calculating average in irregular intervals without considering missing values in shell script?

I have a dataset with many missing values as -999. Part of the data is
input.txt
30
-999
10
40
23
44
-999
-999
31
-999
54
-999
-999
-999
-999
-999
-999
10
23
2
5
3
8
8
7
9
6
10
and so on
I would like to calculate the average over intervals of 5, 6, and 6 rows in turn, without considering the missing values.
The desired output is
ofile.txt
25.75 (i.e. consider first 5 rows and take average without considering missing values, so (30+10+40+23)/4)
43 (i.e. consider next 6 rows and take average without considering missing values, so (44+31+54)/3)
-999 (i.e. consider next 6 and take average without considering missing values. Since all are missing, so write as a missing value -999)
8.6 (i.e. consider next 5 rows and take average (10+23+2+5+3)/5)
8 (i.e. consider next 6 rows and take average)
I can do this for a regular interval (let's say 5) with
awk '!/\-999/{sum += $1; count++} NR%5==0{print count ? (sum/count) :-999;sum=count=0}' input.txt
I asked a similar question for a regular interval here: Calculating average without considering missing values in shell script? Here I am asking for a solution for irregular intervals.
With AWK
awk -v f="5" 'f&&f--&&$0!=-999{c++;v+=$0} NR%17==0{f=5;r++}
!f&&NR%17!=0{f=6;r++} r&&!c{print -999;r=0} r&&c{print v/c;r=v=c=0}
END{if(c!=0)print v/c}' input.txt
Output
25.75
43
-999
8.6
8
Breakdown
f&&f--&&$0!=-999{c++;v+=$0} #add valid values and increment count
NR%17==0{f=5;r++} #every 17 rows (5+6+6) restart the 5,6,6 cycle with an interval of 5
!f&&NR%17!=0{f=6;r++} #interval exhausted mid-cycle: the next interval is 6
r&&!c{print -999;r=0} #print -999 if no valid values
r&&c{print v/c;r=v=c=0} #print avg
END{
    if(c!=0)    #print remaining values avg
        print v/c
}
$ cat tst.awk
function nextInterval(      intervals) {    # "intervals" after the gap is a local variable, by awk convention
    numIntervals = split("5 6 6",intervals)
    intervalsIdx = (intervalsIdx % numIntervals) + 1
    return intervals[intervalsIdx]
}
BEGIN {
    interval = nextInterval()
    noVal = -999
}
$0 != noVal {
    sum += $0
    cnt++
}
++numRows == interval {
    print (cnt ? sum / cnt : noVal)
    interval = nextInterval()
    numRows = sum = cnt = 0
}
$ awk -f tst.awk file
25.75
43
-999
8.6
8

Rearrange column with empty values using awk or sed

I want to rearrange the columns of a txt file, but there are empty values, which cause a problem. For example:
testfile:
Name    ID      Count   Date    Other
A       1       10      513     x
        6       15      312     x
        3       18      314     x
B       19      31      942     x
        8       29      722     x
When I tried $ more testfile | awk '{print $2"\t"$1"\t"$3"\t"$4"\t"$5}'
it becomes:
ID      Name    Count   Date    Other
1       A       10      513     x
15      6       312     x
18      3       314     x
19      B       31      942     x
29      8       722     x
which is not what I want. Please help; I want it to be:
ID      Name    Count   Date    Other
1       A       10      513     x
6               15      312     x
3               18      314     x
19      B       31      942     x
8               29      722     x
Moreover, I am not sure which columns might contain empty values, and the column length is not fixed. Thank you.
Assuming your input file is not tab-separated and you have (or can get) GNU awk then I recommend:
$ awk -v FIELDWIDTHS="8 8 8 8 8" -v OFS='\t' '{
    for (i=1; i<=NF; i++) {
        gsub(/^\s+|\s+$/,"",$i)    # trim the whitespace padding from each fixed-width field
    }
    t=$1; $1=$2; $2=t
}1' file
ID      Name    Count   Date    Other
1       A       10      513     x
6               15      312     x
3               18      314     x
19      B       31      942     x
8               29      722     x
If your file is tab-separated then all you need is:
awk 'BEGIN{FS=OFS="\t"} {t=$1; $1=$2; $2=t}1' file
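A quick sanity check with a hypothetical three-field tab-separated line (the output fields stay tab-separated):
$ printf 'x\ty\tz\n' | awk 'BEGIN{FS=OFS="\t"} {t=$1; $1=$2; $2=t}1'
y       x       z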
Another awk alternative is to use the number of fields. If you know your data and the only possible deficit is in the first column, you can try this.
awk -v OFS="\t" 'NF==4{$5=$4;$4=$3;$3=$2;$2=$1;$1=""} {print $2,$1,$3,$4,$5}'
However, the output will be tab-separated instead of fixed-length format. You can achieve the same thing using printf or by changing OFS, but perhaps tab-separated is what you really need for tabular representation.
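A quick check against the testfile above (tab stops rendered here as column alignment):
$ awk -v OFS="\t" 'NF==4{$5=$4;$4=$3;$3=$2;$2=$1;$1=""} {print $2,$1,$3,$4,$5}' testfile
ID      Name    Count   Date    Other
1       A       10      513     x
6               15      312     x
3               18      314     x
19      B       31      942     x
8               29      722     x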
The most natural model for awk to use is columns as defined by the transitions from white-space to non-white-space and back. Since you have columns that may themselves be white-space, the natural model won't work.
However, you can revert to using a model based on column positions rather than transitions, meaning that a file containing only spaces (the presence of tabs will complicate things):
Name    ID      Count   Date    Other
A       1       10      513     x
        6       15      312     x
        3       18      314     x
B       19      31      942     x
        8       29      722     x
can still be rearranged, though not as succinctly as transition-based columns.
The following awk script will do the trick, swapping name and id:
{
    name  = substr($0, 1,7);
    id    = substr($0, 9,7);
    count = substr($0,17,7);
    date  = substr($0,25,7);
    other = substr($0,33);
    print id" "name" "count" "date" "other;
}
If the original file is called pax.in and the awk script is stored in pax.awk, the command awk -f pax.awk pax.in will give you, as desired:
ID      Name    Count   Date    Other
1       A       10      513     x
6               15      312     x
3               18      314     x
19      B       31      942     x
8               29      722     x
Keep in mind I've written that script to be relatively flexible, allowing you to change the order of the columns quite easily. If all you want is to swap the first two columns, you could use:
awk '{print substr($0,9,8)substr($0,1,8)substr($0,17)}' qq.in
or the slightly shorter (if you're allowed to use other tools):
sed -E 's/^(.{8})(.{8})/\2\1/' qq.in
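Run on the space-only file (saved as qq.in, as in the awk example), the sed version produces the same swapped layout:
$ sed -E 's/^(.{8})(.{8})/\2\1/' qq.in
ID      Name    Count   Date    Other
1       A       10      513     x
6               15      312     x
3               18      314     x
19      B       31      942     x
8               29      722     x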

How does Linux store negative numbers in text files?

I made a file using ed and named it numeric. Its content is as follows:
-100
-10
0
99
11
-56
12
Then I executed this command on terminal:
sort numeric
And the result was:
0
-10
-100
11
12
-56
99
And of course this output was not at all expected!
sort wants to be asked to sort numerically (otherwise it defaults to lexicographic sorting):
$ sort -n numbers.dat
-100
-56
-10
0
11
12
99
Watch out for the "-n" flag (see the manual).
Text files are text files; they contain text. Your numbers are sorted alphabetically. If you want sort to sort based on numerical value, use sort -n.
Also, your sort result is strange; when I run the same test I get:
$ sort numeric
-10
-100
-56
0
11
12
99
Sorted alphabetically, as expected.
See https://glot.io/snippets/e555jjumx6
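A likely explanation for the ordering in the question: in many locales, sort's collation ignores punctuation such as the leading minus sign, so -10, -100, -56 compare as 10, 100, 56. If that is what happened, forcing the C locale reproduces the byte-wise order shown above (still alphabetical, not numeric):
$ LC_ALL=C sort numeric
-10
-100
-56
0
11
12
99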
Use sort -n to make sort do numerical sorting instead of alphabetical sorting.
That's because by default, sort is alphabetical, not numeric. sort -n does numbers.
Otherwise you'll get
1
10
2
3
etc.
