I have five different files. Part of each file looks like this:
ifile1.txt ifile2.txt ifile3.txt ifile4.txt ifile5.txt
2 3 2 3 2
1 2 /no value 2 3
/no value 2 4 3 /no value
3 1 0 0 1
/no value /no value /no value /no value /no value
I need to compute the average of these five files without considering missing values, i.e.
ofile.txt
2.4
2.0
3.0
1.0
99999
Here 2.4 = (2+3+2+3+2)/5
2.0 = (1+2+2+3)/4
3.0 = (2+4+3)/3
1.0 = (3+1+0+0+1)/5
99999 = all are missing
I was trying it in the following way, but I don't feel it is the proper way.
paste ifile1.txt ifile2.txt ifile3.txt ifile4.txt ifile5.txt > ofile.txt
tr '\n' ' ' < ofile.txt > ofile1.txt
awk '!/\//{sum += $1; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile2.txt
awk '!/\//{sum += $2; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile3.txt
awk '!/\//{sum += $3; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile4.txt
awk '!/\//{sum += $4; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile5.txt
awk '!/\//{sum += $5; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile6.txt
paste ofile2.txt ofile3.txt ofile4.txt ofile5.txt ofile6.txt > ofile7.txt
tr '\n' ' ' < ofile7.txt > ofile.txt
The following script.awk will deliver what you want:
BEGIN {
gap = -1;
maxidx = -1;
}
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
You call it with:
awk -f script.awk ifile*.txt
and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:
BEGIN {
gap = -1;
maxidx = -1;
}
This BEGIN section runs before any lines are processed and initialises the current gap and maximum index.
The gap is the difference between the overall line number NR and the per-file line number FNR; it is used to detect when you switch files, something that's very handy when processing multiple input files.
The maximum index is used to figure out the largest line count so as to output the correct number of records at the end.
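If the NR/FNR bookkeeping is unfamiliar, a throwaway one-liner (using any two of the sample files) makes it visible; NR - FNR stays at 0 throughout the first file and jumps to the first file's line count as soon as the second file starts:
$ awk '{ print FILENAME, NR, FNR, NR - FNR }' ifile1.txt ifile2.txt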
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
The above code is the meat of the solution, executed per line. The first if statement is used to detect whether you've just moved into a new file and it does this simply so it can aggregate all the associated lines from each file. By that I mean the first line in each input file is used to calculate the average for the first line of the output file.
The second if statement adjusts maxidx if the current line number is beyond any previous line number we've encountered. This is for the case where file one may have seven lines but file two has nine lines (not so in your case but it's worth handling anyway). A previously unencountered line number also means we initialise its sum and count to be zero.
The final if statement simply updates the sum and count if the line contains anything other than /no value.
And then, of course, the line index idx is incremented for the next time through.
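As an aside, the same file-boundary reset can be written more compactly, since FNR restarts at 1 in every input file; a minimal equivalent sketch:
FNR == 1 { idx = 0 }    # runs on the first line of each file, replacing the gap test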
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value), we adjust the sum and count so as to get 99999 instead. Then we just print the average.
So, running that code over your input files gives, as requested:
$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999
Using bash and numaverage (which ignores non-numeric input), plus paste, sed and tr for cleaning, since numaverage needs single-column input and throws an error if the input is 100% text:
paste ifile* | while read x ; do \
numaverage <(tr '\t' '\n' <<< "$x") 2>&1 | \
sed -n '1{s/Emp.*/99999/;p}' ; \
done
Output:
2.4
2
3
1
99999
Sample input data:
Col1, Col2
120000,1261
120000,119879
120000,117737
120000,14051
200000,58411
200000,115292
300000,279892
120000,98572
250000,249598
120000,14051
......
I used Excel with the following steps:
Col3=Col2/Col1.
Format Col3 with percentage
Use countif to group by Col3
How can I do this with awk, or some other way, on the Linux command line?
Expected result:
percent|count
0-20% | 10
21-50% | 5
51-100%| 10
I calculated the percentage, but I'm still looking for a way to group by Col3:
cat input.txt |awk -F"," '$3=100*$2/$1'
awk approach:
awk 'BEGIN {
FS=",";
OFS="|";
}
(NR > 1){
percent = 100 * $2 / $1;
if (percent <= 20) {
a["0-20%"] += 1;
} else if (percent <= 50) {
a["21-50%"] += 1;
} else {
a["51-100%"] += 1;
}
}
END {
print "percent", "count"
for (i in a) {
print i, a[i];
}
}' data
Sample output:
percent|count
0-20%|3
21-50%|1
51-100%|6
A generic, self-documented version. It needs some fine tuning of the group names in the result (depending on whether you want the +1% on the lower edge or not, but that's not the real purpose):
awk -F ',' -v Step='0|20|50|100' '
BEGIN {
# define group
Gn = split( Step, aEdge, "|")
}
NR>1{
# work out the percentage
L = $2 * 100 / ($1>0 ? $1 : 1)
# in which group
for( j=1; ( L < aEdge[j] || L >= aEdge[j+1] ) && j < Gn;) j++
# add to group
G[j]++
}
# print result ordered
END {
print "percent|count"
for( i=1;i<Gn;i++) printf( "%d-%d%%|%d\n", aEdge[i], aEdge[i+1], G[i])
}
' data
Another awk with parametric bins and formatted output:
$ awk -F, -v OFS=\| -v bins='20,50,100' '
BEGIN {n=split(bins,b)}
NR>1 {for(i=1;i<=n;i++)
if($2/$1 <= b[i]/100)
{a[b[i]]++; next}}
END {print "percent","count";
b[0]=-1;
for(i=1;i<=n;i++)
printf "%-7s|%3s\n", b[i-1]+1"-"b[i]"%",a[b[i]]}' file
percent|count
0-20% | 3
21-50% | 1
51-100%| 6
Pure bash:
# arguments are histogram boundaries *in ascending order*
hist () {
local lower=0$(printf '+(val*100>sum*%d)' "$@") val sum count n;
set -- 0 "$@" 100;
read -r
printf '%7s|%5s\n' percent count;
while IFS=, read -r sum val; do echo $((lower)); done |
sort -n | uniq -c |
while read count n; do
printf '%2d-%3d%%|%5d\n' "${@:n+1:2}" $count;
done
}
Example:
$ hist 20 50 < csv.dat
percent|count
0- 20%| 3
20- 50%| 1
50-100%| 6
Potential Issue: Does not print intervals with no values:
$ hist 20 25 45 50 < csv.dat
percent|count
0- 20%| 3
25- 45%| 1
50-100%| 6
Explanation:
lower is set to an arithmetic expression which counts how many of the boundary percentages are below 100*val/sum (see the worked example after this list)
The list of intervals is augmented with 0 and 100 so that the limits print correctly
The header line is ignored
The output header is printed
For each CSV row, read the variables $sum and $val and send the numeric evaluation of $lower (which uses those variables) to...
count the number of instances of each interval count...
and print the interval and count
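To make the arithmetic concrete, here is the expression that hist 20 50 builds, evaluated by hand for one row of the sample data (sum and val are the two CSV fields):
$ sum=200000 val=115292
$ echo $(( 0+(val*100>sum*20)+(val*100>sum*50) ))
2
A result of 2 means two boundaries were exceeded, so that row lands in the 50-100% bucket.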
Another, in GNU awk, using switch and regex to identify the values (since parsing was tagged in OP):
NR>1{
switch(p=$2/$1){
case /0\.[01][0-9]|\.20/:
a["0-20%"]++;
break;
case /\.[2-4][0-9]|\.50/:
a["21-50%"]++;
break;
default:
a["51-100%"]++
}
}
END{ for(i in a)print i, a[i] }
Run it:
$ awk -F, -f program.awk file
21-50% 1
0-20% 3
51-100% 6
I'm wondering if there is a way to select columns by matching the header?
The data looks like this
ID_1 ID_2 ID_3 ID_6 ID_15
value1 0 2 4 7 6
value2 0 4 4 3 8
value3 2 2 3 7 8
I would like to get the columns only on ID_3 & ID_15
ID_3 ID_15
4 6
4 8
3 8
awk can easily extract the columns if I know their positions.
However, I have a very large table and only a list of IDs in hand.
Can I still use awk, or is there an easier way in Linux?
The input format isn't well defined, but there are a few simple approaches: awk, perl and sqlite.
(FNR==1) {
nocol=split(col,ocols,/,/) # col contains the wanted column names
ncols=split("vals " $0,cols) # header line
for (nn=1; nn<=ncols; nn++) colmap[cols[nn]]=nn # map names
OFS="\t" # to align output
for (nn=1; nn<=nocol; nn++) printf("%s%s",ocols[nn],OFS)
printf("\n") # output header line
}
(FNR>1) { # read data
for (nn=1; nn<=nocol; nn++) {
if (nn>1) printf(OFS) # pad
if (ocols[nn] in colmap) { printf("%s",$(colmap[ocols[nn]])) }
else { printf "--" } # named column not in data
}
printf("\n") # wrap line
}
$ nawk -f mycols.awk -v col=ID_3,ID_15 data
ID_3 ID_15
4 6
4 8
3 8
Perl, just a variation on the above with some perl idioms to confuse/entertain:
use strict;
use warnings;
our @ocols=split(/,/,$ENV{cols}); # cols contains named columns
our $nocol=scalar(@ocols);
our ($nn,%colmap);
$,="\t"; # OFS equiv
# while (<>) {...} implicit with perl -an
if ($. == 1) { # FNR equiv
%colmap = map { $F[$_] => $_+1 } 0..$#F ; # create name map hash
$colmap{vals}=0; # name anon 1st col
print @ocols,"\n"; # output header
} else {
for ($nn = 0; $nn < $nocol; $nn++) {
print "\t" if ($nn>0);
if (exists($colmap{$ocols[$nn]})) { printf("%s",$F[$colmap{$ocols[$nn]}]) }
else { printf("--") } # named column not in data
}
printf("\n")
}
$ cols="ID_3,ID_15" perl -an mycols.pl < data
That uses an environment variable to skip effort parsing the command line. It needs the perl options -an which set up field-splitting and an input read loop (much like awk does).
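If you want to see exactly what -a and -n wrap around the script, perl can deparse itself; this is purely diagnostic (the deparsed code goes to stdout, the "syntax OK" notice to stderr), and it shows the implicit read loop and the @F split:
$ perl -MO=Deparse -an mycols.pl 2>/dev/null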
And with sqlite (I used v3.11, v3.8 or later is required for useful .import I believe). This uses an in-memory temporary database (name a file if too large for memory, or for a persistent copy of the parsed data), and automatically creates a table based on the first line. The advantages here are that you might not need any scripting at all, and you can perform multiple queries on your data with just one parse overhead.
You can skip this next step if you have a single hard tab delimiting the columns, in which case replace .mode csv with .mode tabs in the sqlite example below.
Otherwise, to convert your data to a suitable CSV-ish format:
nawk -v OFS="," '(FNR==1){$0="vals " $0} {$1=$1;print}' < data > data.csv
This adds a dummy first column "vals" to the first line, then prints each line comma-separated. It does that with a seemingly pointless assignment to $1, which causes $0 to be recomputed with FS (space/tab) replaced by OFS (comma).
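The $1=$1 trick is easiest to see in isolation; reassigning any field makes awk rebuild $0 using OFS:
$ echo "a b  c" | awk -v OFS="," '{ $1 = $1; print }'
a,b,c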
$ sqlite3
sqlite> .mode csv
sqlite> .import data.csv mytable
sqlite> .schema mytable
CREATE TABLE mytable(
"vals" TEXT,
"ID_1" TEXT,
"ID_2" TEXT,
"ID_3" TEXT,
"ID_6" TEXT,
"ID_15" TEXT
);
sqlite> select ID_3,ID_15 from mytable;
ID_3,ID_15
4,6
4,8
3,8
sqlite> .mode column
sqlite> select ID_3,ID_15 from mytable;
ID_3 ID_15
---------- ----------
4 6
4 8
3 8
Use .once or .output to send output to a file (sqlite docs). Use .headers on or .headers off as required.
sqlite is quite happy to create an unnamed column, so you don't have to add a name to the first column of the header line, but you do need to make sure the number of columns is the same for all input lines and formats.
If you get "expected X columns but found Y" errors during the .import then you'll need to clean up the data format a little for this.
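A quick way to spot the offending rows before re-running .import is to count the fields per line and look for outliers (a sketch, assuming the data.csv produced above):
$ awk -F, '{ print NF }' data.csv | sort -n | uniq -c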
$ cat c.awk
NR == 1 {
for (i=1; i<=NF; ++i) {
if ($i == "ID_3") col_3 = (i + 1)
if ($i == "ID_15") col_15 = (i + 1)
}
print "ID_3", "ID_15"
}
NR > 1 { print $col_3, $col_15 }
$ awk -f c.awk c.txt
ID_3 ID_15
4 6
4 8
3 8
You could go for something like this:
BEGIN {
keys["ID_3"]
keys["ID_15"]
}
NR == 1 {
for (i = 1; i <= NF; ++i)
if ($i in keys) cols[++n] = i
}
{
for (i = 1; i <= n; ++i)
printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS)
}
Save the script to a file and run it like awk -f script.awk file.
Alternatively, as a "one-liner":
awk 'BEGIN { keys["ID_3"]; keys["ID_15"] }
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in keys) cols[++n] = i }
{ for (i = 1; i <= n; ++i) printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) }' file
Before the file is processed, keys are set in the keys array, corresponding to the column headings of interest.
On the first line, record all the column numbers that contain one of the keys in the cols array.
Loop through each of the cols and print them out, followed by either the output field separator OFS or the output record separator ORS, depending on whether it's the last one. $(cols[i]+(NR>1)) handles the fact that rows after the first have an extra field at the start, because NR>1 will be true (1) for those lines and false (0) for the first line.
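If you'd rather not hard-code the headings, a small variation (a sketch along the same lines, with the wanted names passed via -v) reads them from a comma-separated parameter:
awk -v want="ID_3,ID_15" '
BEGIN { m = split(want, w, ","); for (i = 1; i <= m; ++i) keys[w[i]] }
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in keys) cols[++n] = i }
{ for (i = 1; i <= n; ++i) printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) }' file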
Try the script below:
#!/bin/sh
file="$1"; shift
awk -v cols="$*" '
BEGIN{
split(cols,C)
OFS=FS="\t"
getline
split($0,H)
for(c=1; c in C; c++){
for(h in H){
if(C[c]==H[h])F[i++]=h
}
}
}
{ l="";for(f in F){l=l $F[f] OFS}print l }
' "$file"
On the command line, type:
[sumit.gupta@rpm01 ~]$ test.sh filename ID_3 ID_15
I have a data file in the following format:
Program1, Program2, Program3, Program4
0, 1, 1, 0
1, 1, 1, 0
Columns are program names, and rows are features of programs. I need to write an awk loop that will go through every row, check if a value is equal to one, and then return the column names and put them into a "results.csv" file. The desired output should be this:
Program2, Program3
Program1, Program2, Program3
I was trying this code, but it wouldn't work:
awk -F, '{for(i=1; i<=NF; i++) if ($i==1) {FNR==1 print$i>>results}; }'
Help would be very much appreciated!
awk -F', *' '
NR==1 {for(i=1;i<=NF;i++) h[i]=$i; next}
{
sep="";
for(x=1;x<=NF;x++) {
if($x) {
printf "%s%s", sep, h[x];
sep=", ";
}
}
print ""
}' file
outputs:
Program2, Program3
Program1, Program2, Program3
$ cat tst.awk
BEGIN { FS=", *" }
NR==1 { split($0,a); next }
{
out = ""
for (i=1; i<=NF; i++)
out = out ($i ? (out?", ":"") a[i] : "")
print out
}
$ awk -f tst.awk file
Program2, Program3
Program1, Program2, Program3
My take on things is more verbose, but should handle the trailing comma. Not really a one-liner, though.
BEGIN {
# Formatting for the input and output files.
FS = ", *"
OFS = ", "
}
FNR == 1 {
# First line in the file
# Read the headers into a list for later use.
for (i = 1; i <= NF; i++) {
headers[i] = $i
}
}
FNR > 1 {
# Print the header for each column containing a 1.
stop = 0
for (i = 1; i <= NF; i++) {
# Gather the results from this line.
if ($i > 0) {
stop += 1
results[stop] = headers[i]
}
}
if (stop > 0) {
# If this input line had no results, the output line is blank
for (i = 1; i <= stop; i++) {
# Print the appropriate headers for this result.
if (i < stop) {
# Results other than the last
printf("%s%s", results[i], OFS)
} else {
# The last result
printf("%s", results[i])
}
}
}
printf("%s", ORS)
}
Save this as something like script.awk, and then run it as something like:
awk -f script.awk infile.txt > results
I have a large txt file ("," as delimiter) with some data and strings:
2014:04:29:00:00:58:GMT: subject=BMRA.BM.T_GRIFW-1.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=3,TS=2014:04:29:01:00:00:GMT,VP=4.0,TS=2014:04:29:01:29:00:GMT,VP=4.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
I would like to find lines that contain 'T_GRIFW', then print the $1 field from 'subject' onwards and, from $2 onwards, only the times and floats. Furthermore, I want to incorporate an if statement so that if field $4 == 'NP=3', only fields $5, $6, $9 and $10 are printed after the previous fields, and if $4 == 'NP=2', all following fields are printed (times and floats only).
For instance, the result of the two sample lines will be:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I know this is complex and I have tried my best to be thorough in my description. The basic code I have thus far is:
awk 'BEGIN {FS=","}{OFS=","} /T_GRIFW-1.FPN/ {print $1}' tib_messages.2014-04-29
THANKS A MILLION!
Here's an awk executable file that'll create your desired output:
#!/usr/bin/awk -f
# use a more complicated FS => field numbers counted differently
BEGIN { FS="=|,"; OFS="," }
$2 ~ /T_GRIFW/ && $8=="NP" {
str="subject=" $2 OFS
# strip ":GMT" from dates and "}" from everywhere
gsub( /:GMT|[\}]/, "")
# append common fields to str with OFS
for(i=5;i<=13;i+=2) str=str $i OFS
# print the remaining fields and line separator
if($9==3) { print str $19, $21 }
else if($9==2) { print str $15, $17 }
}
Placing that in a file called awko and chmod'ing it then running awko data yields:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I've placed comments in the script, but here are some things that could be spelled out better:
Using a more complicated FS means you don't have to reparse for = to work with the field data (the field dump after this list shows how the numbers line up)
I "cheated" and just hard-coded subject (which now falls at the end of $1) for str
:GMT and } appeared to be the only data that needed to be forcibly removed
With this FS, dates and numbers are two fields apart from each other but still loop-able
In either final print call, str already ends in an OFS, so the comma between it and the next field can be skipped
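If the field numbering with FS="=|," is hard to follow, you can dump it for the first data line as a quick check:
$ awk -F'=|,' 'NR==1 { for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }' data
which shows, for example, the subject value in $2, the SD timestamp in $5, the NP name/value pair in $8/$9, and the TS/VP pairs starting at $10.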
If I understand your requirements, the following will work:
BEGIN {
FS=","
OFS=","
}
/T_GRIFW/ {
split($1, subject, " ")
result = subject[2] OFS
delete arr
counter = 1
for (i = 2; i <= NF; i++) {
add = 0
if ($4 == "NP=3") {
if (i == 5 || i == 6 || i == 9 || i == 10) {
add = 1
}
}
else if ($4 == "NP=2") {
add = 1
}
if (add) {
counter = counter + 1
split($i, field, "=")
if (match(field[2], "[0-9]*\\.[0-9]+|GMT")) {
arr[counter] = field[2]
}
}
}
for (i = 2; i <= counter; i++) {
gsub(/{|}/,"", arr[i]) # remove curly braces
result = result arr[i] OFS
}
print substr(result, 1, length(result)-1)
}
Let's assume I have scientific data, all numbers arranged in a single column but representing an intensities matrix of n (width) by m (height). The column of the input file has in total n * m rows. An input example may look like that:
1
2
3
......
30
The new output should be such that I have n new columns with m rows. Sticking to my example with 30 fields input and n = 3, m = 10, I would need an output file like this (separator does not matter much, could be a blank, a tab etc.):
1 11 21
2 12 22
... ... ...
10 20 30
I use gawk under Windows. Please note that there is no special FS; real-world examples are more like 60 * 60 or bigger.
If you are not limited to awk but have GNU coreutils (Cygwin, native, ...), then the simplest solution is to use pr:
pr -ts" " --columns 3 file
I believe this will do:
awk '
{ split($0,data); }
END {
m = 10;
n = 3;
for( i = 1; i<=m; i++ ) {
for( j = 0; j<n; j++ ) {
printf "%s ", data[j*m + i] # output data plus space in one line
}
# here you might want to start a new line though you did not ask for it:
printf "\n";
}
}' inputfile
I might have the index counting wrong but I am sure you can figure it out. The trick is the split in the first line. It splits your input on whitespace and creates an array data. The END block runs after processing your file and just accesses data by index. Note that split() creates array indices starting at 1.
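A tiny check of what split() produces (and of the 1-based indexing):
$ echo "1 2 3" | awk '{ n = split($0, data); print n, data[1], data[3] }'
3 1 3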
Assumption is all data is in a single line. Your question isn't quite clear on this. If it is on several lines you'd have to read it into the array differently.
Hope this gets you started.
EDIT
I notice you changed your question while I was answering it. So change
{ split($0,data); }
to
{ data[++i] = $1; }
to account for the input being on different lines. Actually, this would give you the option to read it into a two dimensional array in the first place.
EDIT 2
Read two dimensional array
To read as a two dimensional array assuming m and n are known beforehand and not encoded in the input somehow:
awk '
BEGIN {
m = 10;
n = 3;
}
{
# derive the row and column index of this value from the input line number
i = (NR-1) % m;
j = int((NR-1) / m);
data[i,j] = $0;
}
END {
# do something with data
}' inputfile
However, since you only want to reformat your data, you can do the whole job in one pass: collect the values in a flat array while reading and print the matrix in the END block, with m and n passed on the command line:
awk -v m=10 -v n=3 '
{ data[NR] = $0 }
END {
for( i = 1; i<=m; i++ ) {
for( j = 0; j<n; j++ ) {
printf "%s ", data[j*m + i] # output data plus space in one line
}
printf "\n";
}
}' inputfile
Here is a fairly simple solution (in the example I've set n equal to 3; plug in the appropriate value for n):
awk -v n=3 '{ row = row $1 " "; if (NR % n == 0) { print row; row = "" } }' FILE
This works by reading records one line at a time, concatenating each line onto the preceding ones. When n lines have been concatenated, it prints the concatenated result as a single new line. This repeats until there are no more lines left in the input.
You can use the command below:
paste - - - < input.txt
By default the delimiter is a tab; to change the delimiter, use the command below:
paste - - - -d' ' < input.txt