Average of multiple files without considering missing values - linux

I want to calculate the average of 15 files: ifile1.txt, ifile2.txt, ....., ifile15.txt. Every file has the same number of rows and columns, but some entries are missing values. Part of the data looks like this:
ifile1.txt ifile2.txt ifile3.txt
3 ? ? ? . 1 2 1 3 . 4 ? ? ? .
1 ? ? ? . 1 ? ? ? . 5 ? ? ? .
4 6 5 2 . 2 5 5 1 . 3 4 3 1 .
5 5 7 1 . 0 0 1 1 . 4 3 4 0 .
. . . . . . . . . . . . . . .
I would like to produce a new file that shows the average of these 15 files without considering the missing values.
ofile.txt
2.66 2 1 3 . (i.e. average of 3 1 4, average of ? 2 ? and so on)
2.33 ? ? ? .
3 5 4.33 1.33 .
3 2.67 4 0.66 .
. . . . .
This question is similar to my earlier question Average of multiple files in shell where the script was
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++) sum[FNR,i] += $i
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
printf " %f", sum[line,col]/nfiles;
printf "\n"
}
}' ifile*.txt
But I am not able to modify it.

Use this:
paste ifile*.txt | awk '{n=f=0; for(i=1;i<=NF;i++){if($i*1){f++;n+=$i}}; print n/f}'
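paste puts the files side by side, so each output row holds that row from every file.
awk then calculates one average per row:
n=f=0; resets the sum and the counter for each row.
for(i=1;i<=NF;i++) loops through all the fields.
if($i*1) is true only when the field is a nonzero number; "?" evaluates to 0 and is skipped, but so is a genuine 0.
f++;n+=$i increments f (the number of fields counted) and adds the field to the sum n.
print n/f prints the average. Note that this is a single average over all fields in the row, not one per column, and it fails with a division by zero if a row contains no nonzero number.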
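The same paste-based idea can be sketched so that it gives one average per column and keeps genuine zeros. This is only an illustration under stated assumptions: missing values are a literal ?, every file has the same number of columns, and the function name col_avg and its NCOLS argument are hypothetical, not part of the original answer.

```shell
# col_avg NCOLS FILE...: per-cell average across files, skipping "?" entries.
# Assumes every file has NCOLS whitespace-separated columns and the same row count.
col_avg() {
    ncols=$1; shift
    paste "$@" | awk -v ncols="$ncols" '{
        for (c = 1; c <= ncols; c++) {
            sum = cnt = 0
            # after paste, column c of each file sits at fields c, c+ncols, c+2*ncols, ...
            for (i = c; i <= NF; i += ncols)
                if ($i != "?") { sum += $i; cnt++ }
            # print "?" when every file is missing this cell
            printf "%s%s", (cnt ? sum / cnt : "?"), (c < ncols ? OFS : ORS)
        }
    }'
}
```

It would be called as col_avg NCOLS ifile*.txt, with NCOLS replaced by the number of columns in each file. Comparing against "?" rather than testing $i*1 means a cell whose only value is 0 averages to 0 instead of being treated as missing.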

awk '
{
for (i = 1;i <= NF;i++) {
Sum[FNR,i]+=$i
Count[FNR,i]+=$i!="?"
}
}
END {
for( i = 1; i <= FNR; i++){
for( j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
print ""
}
}
' ifile*
This assumes the files are well formed (no trailing blank line, etc.).

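awk 'FNR == 1 { nfiles++; ncols = NF }
{
    # accumulate sum and count per cell, skipping missing values
    for (i = 1; i <= NF; i++)
        if ( $i != "?" ) { sum[FNR,i] += $i ; count[FNR,i]++ ;}
    if (FNR > maxnr) maxnr = FNR
}
END {
    for (line = 1; line <= maxnr; line++)
    {
        for (col = 1; col <= ncols; col++)
            if ( count[line,col] > 0 ) printf " %f", sum[line,col]/count[line,col];
            else printf " ? " ;
        printf "\n" ;
    }
}' ifile*.txt
The only changes to your script are the check for "?" and a per-cell count that replaces nfiles as the divisor. The loops also run up to and including the last column (<= instead of <), so that column is averaged too.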

Related

How can I reverse print the characters of a string in each cell using AWK?

Beth 45 4.00 0 0 .072
Danny 33 3.75 ^0 0 .089
The above is the file I want to operate on.
I want to write an AWK script that can reverse print the characters of a string in every cell.
Here is the code:
BEGIN { OFS = "\t\t" }
function reverse_print(str)
{
s = "";
N = length(str);
for (i = 1; i <= N; i++)
a[i] = substr(str, i, 1);
for (i = N; i >= 1; i--)
s = s a[i];
return s;
}
{
for (i = 1; i <= NF; i++)
$i = reverse_print($i) ;
print;
}
END {}
However, it does not work; the program hangs.
I have found that if I don't use the loop and handle each field one by one, as follows,
BEGIN { OFS = "\t\t" }
function reverse_print(str)
{
s = "";
N = length(str);
for (i = 1; i <= N; i++)
a[i] = substr(str, i, 1);
for (i = N; i >= 1; i--)
s = s a[i];
return s;
}
{
$1 = reverse_print($1) ;
$2 = reverse_print($2) ;
$3 = reverse_print($3) ;
$4 = reverse_print($4) ;
$5 = reverse_print($5) ;
$6 = reverse_print($6) ;
print;
}
END {}
it works well.
Here is my desired output:
hteB 54 00.4 0 0 270.
ynnaD 33 57.3 0^ 0 980.
I have thought hard about it but still cannot figure out what I did wrong with the loop.
Can anyone tell me why?
You're using the same variable i inside and outside the function. awk variables are global unless they are declared as function parameters, so the function's second loop leaves i at 0, the main loop then increments it to 1, and the loop over the fields never gets past the first field. Use a different variable in one of the two places, or change the function definition to reverse_print(str, i) so that the i used within the function is local to it rather than the global variable used by the calling code.
You should also make s and N local to the function:
function reverse_print(str, i, s, N)
but in fact the code should be written as:
$ cat tst.awk
BEGIN { OFS = "\t\t" }
function reverse_print(fwd, rev, i, n)
{
n = length(fwd)
for (i = n; i >= 1; i--)
rev = rev substr(fwd, i, 1);
return rev
}
{
for (i = 1; i <= NF; i++)
$i = reverse_print($i)
print
}
$ awk -f tst.awk file
hteB 54 00.4 0 0 270.
ynnaD 33 57.3 0^ 0 980.
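The scoping rule behind this fix can be seen in isolation. The sketch below is not from the original answer; the function names scope_demo, clobber and safe are hypothetical and exist only to contrast a global loop variable with one declared as an extra parameter.

```shell
# Demonstrates awk scoping: extra function parameters act as local variables.
scope_demo() {
    awk '
    # i is not a parameter here, so the loop overwrites the global i
    function clobber(n) { for (i = 1; i <= n; i++) ; return i }
    # i is declared as an extra parameter, so it is local to the function
    function safe(n,    i) { for (i = 1; i <= n; i++) ; return i }
    BEGIN {
        i = 10; clobber(3); print "after clobber: i = " i
        i = 10; safe(3);    print "after safe: i = " i
    }'
}
scope_demo
```

After clobber(3) the global i has been changed by the function's loop, while after safe(3) it still holds the value 10 assigned by the caller. The extra spaces before i in the parameter list are a common convention for marking locals, not a syntax requirement.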
You could try the following. (It was tested on GNU awk only; as Ed's comment points out, splitting on an empty separator is undefined behavior in POSIX awk.)
awk '
BEGIN{
OFS="\t\t"
}
{
for(i=1;i<=NF;i++){
num=split($i,array,"")
for(j=num;j>0;j--){
val=(j<num?val:"") array[j]
}
printf "%s%s",val,(i<NF?OFS:ORS)}
val=""
}' Input_file
There is a rev command in Linux: rev - reverse lines characterwise.
You can reverse each field by piping it through rev and reading the result back with getline (note that gensub is a GNU awk extension):
#reverse-fields.awk
{
for (i = 1; i <= NF; i = i + 1) {
# command line
cmd = "echo '" $i "' | rev"
# read output into revfield
cmd | getline revfield
# remove leading new line
a = gensub(/^[\n\r]+/, "", "1", revfield)
# print reversed field
printf("%s", a)
# print tab
if (i != NF) printf("\t")
# close command
close(cmd)
}
# print new line
print ""
}
$ awk -f reverse-fields.awk emp.data
0 00.4 hteB
0 57.3 naD
01 00.4 yhtaK
02 00.5 kraM
22 05.5 yraM
81 52.4 eisuS

Compute sum for column 2 and average for all other columns in multiple files without considering missing values

I want to calculate the sum for column 2 and the average for all other columns from 15 files: ifile1.txt, ifile2.txt, ....., ifile15.txt. Every file has the same number of rows and columns, but some entries are missing values. Part of the data looks like this:
ifile1.txt ifile2.txt ifile3.txt
3 ? ? ? . 1 2 1 3 . 4 ? ? ? .
1 ? ? ? . 1 ? ? ? . 5 ? ? ? .
4 6 5 2 . 2 5 5 1 . 3 4 3 1 .
5 5 7 1 . 0 0 1 1 . 4 3 4 0 .
. . . . . . . . . . . . . . .
I would like to produce a new file that shows the sum for column 2 and the average for all other columns from these 15 files without considering the missing values.
ofile.txt
2.66 2 1 3 . (i.e. average of 3 1 4, sum of ? 2 ?, average of ? 1 ?, average of ? 3 ?, and so on)
2.33 ? ? ? .
3 15 4.33 1.33 .
3 8 4 0.66 .
. . . . .
This question is similar to my earlier question Average of multiple files without considering missing values where the script was written for average for all columns.
awk '
{
for (i = 1;i <= NF;i++) {
Sum[FNR,i]+=$i
Count[FNR,i]+=$i!="?"
}
}
END {
for( i = 1; i <= FNR; i++){
for( j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
print ""
}
}
' ifile*
But I am not able to modify it to get my desired output.
Based on your previous awk script, I modified it as follows:
$ cat awk_script
{
for (i = 1;i <= NF;i++) {
Sum[FNR,i]+=$i
Count[FNR,i]+=$i!="?"
}
}
END {
for( i = 1; i <= FNR; i++){
for( j = 1; j <= NF; j++)
if(j==2) { printf "%s\t" ,Count[i,j] != 0 ? Sum[i,j] : "?" }
else {
if (Count[i,j] != 0){
val=Sum[i,j]/Count[i,j]
printf "%s%s\t",int(val),match(val,/\.[0-9]/)!=0 ? "."substr(val,RSTART+1,2):""
} else printf "?\t"
}
print ""
}
}
And the output would be:
$ awk -f awk_script ifile*
2.66 2 1 3 0
2.33 ? ? ? 0
3 15 4.33 1.33 0
3 8 4 0.66 0
0 0 0 0 0
Brief explanation:
if(j==2): print the sum of the values across the files instead of the average.
For the averages, I noticed that the values in the expected output are truncated rather than rounded, so the integer part is taken with int(val) and the first two decimals are extracted with match() and substr(val,RSTART+1,2).
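The difference between truncating and rounding to two decimals can be sketched on its own. The helper names trunc2 and round2 are hypothetical; the first uses the int(100 * val) / 100 idiom, the second uses printf, which rounds and always prints two decimals.

```shell
# trunc2: truncate each input number to two decimals (no rounding).
trunc2() {
    awk '{ print int($1 * 100) / 100 }'
}

# round2: round each input number to two decimals via printf.
round2() {
    awk '{ printf "%.2f\n", $1 }'
}

# The same inputs through both helpers, for comparison.
printf '2.6666667\n0.6666667\n3\n' | trunc2
printf '2.6666667\n0.6666667\n3\n' | round2
```

Truncation reproduces values such as 2.66 and 0.66 from the expected output, whereas rounding would give 2.67 and 0.67. Because of binary floating point, the int() idiom can be off by one hundredth for some inputs, so it is a sketch rather than a general-purpose formatter.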
$ cat tst.awk
BEGIN { dfltVal="?"; OFS="\t" }
{
for (colNr=1; colNr<=NF; colNr++) {
if ($colNr != dfltVal) {
sum[FNR,colNr] += $colNr
cnt[FNR,colNr]++
}
}
}
END {
for (rowNr=1; rowNr<=FNR; rowNr++) {
for (colNr=1; colNr<=NF; colNr++) {
val = dfltVal
if ( cnt[rowNr,colNr] != 0 ) {
val = int(100 * sum[rowNr,colNr] / (colNr==2 ? 1 : cnt[rowNr,colNr])) / 100
}
printf "%s%s", val, (colNr<NF ? OFS : ORS)
}
}
}
$ awk -f tst.awk file1 file2 file3
2.66 2 1 3
2.33 ? ? ?
3 15 4.33 1.33
3 8 4 0.66

Read all the entries of a matrix using shell script

I have a matrix,
A(i,j), i=1,m and j=1,n
I can read it in C and FORTRAN, but I can't read it in a shell script. I know this is a very simple question, but I am very new to shell scripting. I want to read all the entries and do some calculation. For example, I have a matrix:
A= 1 0 1 1
2 1 0 2
1 0 0 3
1 2 3 0
Now I want to compare each 0 with the values above, below, left and right of it. Finally, I want to do some computation (let's say a sum) with these four values around each zero. In the above example the result for the five zeros will be:
1st zero: 3
2nd zero: 4
3rd zero: 4
4th zero: 6
5th zero: 6
In FORTRAN, I can do it by reading all the values like this:
do j=1,n
do i=1,m
if (A(i,j) .eq. 0) then
B(i,j)=A(i-1,j)+A(i+1,j)+A(i,j+1)+A(i,j-1)
endif
enddo
enddo
But I want to do it in a shell script. How can I do that?
Assuming that data are given in "test.dat" (with no "A = "), I tried it anyway...
#!/bin/bash
inpfile="test.dat"
L=100 # some large value
for (( i = 0; i < L; i++ )) {
for (( j = 0; j < L; j++ )) {
A[ L * i + j ]=0
}
}
i=1
while read buf; do
inp=( $buf ); n=${#inp[@]}
if (( L <= n+1 )); then echo "L is too small"; exit -1; fi
for (( j = 1; j <= n; j++ )) {
A[ L * i + j ]=${inp[j-1]}
}
(( i++ ))
done < $inpfile
nzeros=0
for (( i = 1; i <= n; i++ )) {
for (( j = 1; j <= n; j++ )) {
if (( ${A[ L * i + j ]} == 0 )); then
(( nzeros++ ))
B[ nzeros ]=$(( \
${A[ L * (i-1) + j ]} + \
${A[ L * (i+1) + j ]} + \
${A[ L * i + j+1 ]} + \
${A[ L * i + j-1 ]} ))
fi
}
}
for (( k = 1; k <= nzeros; k++ )) {
printf "%dst zero: %d\n" $k ${B[k]}
}
Conclusion: Very painful. Fortran is recommended...(as expected)
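For comparison, the same neighbour sum can be sketched with awk called from the shell, since awk has associative arrays indexed by row and column. This is a minimal sketch, not the answer's method: it assumes a rectangular, purely numeric matrix with no "A=" prefix, treats cells outside the matrix as 0, and the function name zero_neighbours is hypothetical.

```shell
# zero_neighbours FILE: for each 0 in a whitespace-separated matrix, print the
# sum of the entries above, below, left and right of it.
# Cells outside the matrix are unset array elements, which awk evaluates as 0.
zero_neighbours() {
    awk '
    { for (j = 1; j <= NF; j++) a[NR, j] = $j; if (NF > nc) nc = NF }
    END {
        for (i = 1; i <= NR; i++)
            for (j = 1; j <= nc; j++)
                if ((i, j) in a && a[i, j] + 0 == 0)
                    printf "zero %d at (%d,%d): %d\n", ++k, i, j,
                        a[i-1, j] + a[i+1, j] + a[i, j-1] + a[i, j+1]
    }' "$@"
}
```

Calling zero_neighbours test.dat on the 4x4 example visits the zeros row by row, which matches the order of the five results listed in the question.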

Any one give me a solution for SORT

I want to sort data from the shortest line to the longest. The data contains spaces, letters, digits, "-" and ",". I used sort -n, but it did not do the job. Many thanks for any help.
Here is the data:
0086
0086-
0086---
0086-------
0086-1358600966
0086-18868661318
00860
00860-13081022659
00860-131111111
00860-13176880028
00860-13179488252
00860-18951041771
00861
008629-83023520
0086000
0086010-61281306
and the result I want is
0086
0086-
00860
00861
0086000
0086---
0086-------
0086-1358600966
00860-131111111
008629-83023520
0086-18868661318
0086010-61281306
00860-13081022659
00860-13176880028
00860-13179488252
00860-18951041771
I do not care which characters the lines contain, only that they go from short to long. Two lines of the same length may appear in either order; that is not a problem. Many thanks.
Perl one-liner
perl -0777 -ne 'print join("\n", map {$_->[1]} sort {$a->[0] <=> $b->[0]} map {[length, $_]} split /\n/), "\n"' file
Explanation on demand.
With GNU awk, it's very simple:
gawk '
{len[$0] = length($0)}
END {
PROCINFO["sorted_in"] = "@val_num_asc"
for (line in len) print line
}
' file
See https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html#Controlling-Scanning
You could try this:
awk '{ print length($0) " " $0; }' "$file" | sort -n | cut -d ' ' -f 2-
Add the -r option to sort (sort -rn) to order the lines from longest to shortest instead.
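The pipeline above follows a decorate-sort-undecorate pattern: prefix each line with its length, sort on that prefix, then strip it. A variant is sketched below under two assumptions: the local sort supports the -s (stable) option, which GNU and BSD sort do but POSIX does not require, and the function name sort_by_length is hypothetical.

```shell
# sort_by_length [FILE...]: print lines from shortest to longest.
# Lines of equal length keep their input order because of the stable sort.
sort_by_length() {
    awk '{ print length($0) "\t" $0 }' "$@" |   # decorate: length, TAB, line
        sort -s -n -k1,1 |                      # sort on the length only
        cut -f2-                                # undecorate: drop the length
}

printf 'ccc\na\nbb\nab\n' | sort_by_length
```

Using a tab as the separator lets cut -f2- return the original line unchanged even when it contains spaces. With no file arguments the function reads standard input.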
Using awk:
#!/usr/bin/awk -f
(l = length($0)) && !($0 in nextof) {
if (l in start) {
nextof[$0] = start[l]
} else {
if (!max || l > max) max = l
if (!min || l < min) min = l
nextof[$0] = 0
}
start[l] = $0
++count[l]
}
END {
for (i = min; i <= max; ++i) {
if (j = count[i]) {
t = start[i]
print t
while (--j) {
t = nextof[t]
print t
}
}
}
}
Usage:
awk -f script.awk file
Output:
0086
00861
00860
0086-
0086000
0086---
0086-------
008629-83023520
00860-131111111
0086-1358600966
0086010-61281306
0086-18868661318
00860-18951041771
00860-13179488252
00860-13176880028
00860-13081022659
Another Version:
#!/usr/bin/awk -f
(l = length($0)) && !($0 in nextof) {
if (l in start) {
nextof[lastof[l]] = $0
} else {
if (!max || l > max) max = l
if (!min || l < min) min = l
start[l] = $0
}
lastof[l] = $0
++count[l]
}
END {
for (i = min; i <= max; ++i) {
if (j = count[i]) {
t = start[i]
print t
while (--j) {
t = nextof[t]
print t
}
}
}
}
Output:
0086
0086-
00860
00861
0086---
0086000
0086-------
0086-1358600966
00860-131111111
008629-83023520
0086-18868661318
0086010-61281306
00860-13081022659
00860-13176880028
00860-13179488252
00860-18951041771

2d histogram making

I have a data file containing two columns, like
1.1 2.2
3.1 4.5
1.2 4.5
3.2 4.6
1.1 2.3
4.2 4.9
4.2 1.1
I would like to make a 2D histogram from the two columns, i.e. to get this output (with a step size, or bin size, of 0.1 in this case):
1.0 1.0 0
1.0 1.1 0
1.0 1.2 0
...
1.1 1.0 0
1.1 1.1 0
1.1 1.2 0
...
1.1 2.0 0
1.1 2.1 0
1.1 2.2 1
...
...
Can anybody suggest something? It would be nice if I could set the range of values of the columns. In the above case the values of the first column go from 1 to 4, and the same holds for the second column.
EDITED: updated to handle more general input, e.g. floating-point numbers. The step size in the above case is 0.1, but it would be nice if it were tunable, e.g. a bin size of 0.2 or 1.0.
If the step size is, say, 1.0, then 1.1 and 1.8 fall into the same bin and have to be counted together. For example (with a range of 0.0 ... 4.0 for both columns):
1.1 1.8
2.5 2.6
1.4 2.1
1.3 1.5
3.3 4.0
3.8 3.9
4.0 3.2
4.0 4.0
output (if the bin size = 1.0)
1 1 2
1 2 1
1 3 0
1 4 0
2 1 0
2 2 1
2 3 0
2 4 0
3 1 0
3 2 0
3 3 1
3 4 1
4 1 0
4 2 0
4 3 1
4 4 1
awk 'END {
for (i = 0; ++i <= l;) {
for (j = 0; ++j <= l;)
printf "%d %d %d %s\n", i, j, \
b[i, j], (j < l ? x : ORS)
}
}
{
f[NR] = $1; s[NR] = $2
b[$1, $2]++
}' l=4 infile
You may try this (not thoroughly tested):
awk -v l=4 -v bs=0.1 'BEGIN {
if (!bs) {
print "invalid bin size" > "/dev/stderr"
exit
}
split(bs, t, ".")
t[2] || fl++
m = "%." length(t[2]) "f"
}
{
fk = fl ? int($1) : sprintf(m, $1)
sk = fl ? int($2) : sprintf(m, $2)
f[fk]; s[sk]; b[fk, sk]++
}
END {
if (!bs) exit 1
for (i = 1; int(i) <= l; i += bs) {
for (j = 1; int(j) <= l; j += bs) {
if (fl) {
fk = int(i); sk = int(j); m = "%d"
}
else {
fk = sprintf(m, i); sk = sprintf(m, j)
}
printf "%s" m OFS m OFS "%d\n", (i > 1 && fk != p ? ORS : x), fk, sk, b[fk, sk]
p = fk
}
}
}' infile
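A tunable bin size can also be handled by mapping each value to an integer bin index instead of formatting it as a string key. The sketch below is one possible approach, not the answer's method: the function name hist2d and its MIN MAX BINSIZE arguments are hypothetical, both columns are assumed to share the same range, and each bin is labelled by its lower edge.

```shell
# hist2d MIN MAX BINSIZE [FILE...]: count (x, y) pairs per bin.
# Prints "x_edge y_edge count" for every bin, including empty ones.
hist2d() {
    min=$1; max=$2; bs=$3; shift 3
    awk -v min="$min" -v max="$max" -v bs="$bs" '
    # bin index of value v; the small offset guards against float round-off
    function bin(v) { return int((v - min) / bs + 1e-9) }
    NF >= 2 && $1 >= min && $2 >= min { count[bin($1), bin($2)]++ }
    END {
        n = bin(max)
        for (i = 0; i <= n; i++)
            for (j = 0; j <= n; j++)
                print min + i * bs, min + j * bs, count[i, j] + 0
    }' "$@"
}
```

For example, hist2d 1 4 1.0 infile counts with a bin size of 1.0 over bins labelled 1 to 4, and hist2d 1 4 0.1 infile uses a bin size of 0.1. Values below MIN are skipped, and values that fall beyond the last bin are counted internally but never printed.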
You can try this in bash:
for x in {1..4} ; do
for y in {1..4} ; do
echo $x%$y 0
done
done \
| join -1 1 -2 2 - -a1 <(sed 's/ /%/' FILE \
| sort \
| uniq -c \
| sort -k2 ) \
| sed 's/ 0 / /;s/%/ /'
It creates the table with all zeros in the last column, joins it with the real results (classic frequency table sort | uniq -c) and removes the zeros from the lines where a different number should be shown.
One solution in perl (sample output and usage to follow):
#!/usr/bin/perl -W
use strict;
my ($min, $step, $max, $file) = @ARGV
or die "Syntax: $0 <min> <step> <max> <file>\n";
my %seen;
open F, "$file"
or die "Cannot open file $file: $!\n";
my @l = map { chomp; $_} qx/seq $min $step $max/;
foreach my $first (@l) {
foreach my $second (@l) {
$seen{"$first $second"} = 0;
}
}
foreach my $line (<F>) {
chomp $line;
$line or next;
$seen{$line}++;
}
my $len = @l; # size of list
my $i = 0;
foreach my $key (sort keys %seen) {
printf("%s %d\n", $key, $seen{$key});
$i++;
print "\n" unless $i % $len;
}
exit(0);

Resources