Sample input data:
Col1, Col2
120000,1261
120000,119879
120000,117737
120000,14051
200000,58411
200000,115292
300000,279892
120000,98572
250000,249598
120000,14051
......
I used Excel with the following steps:
Col3 = Col2/Col1
Format Col3 as a percentage
Use COUNTIF to group by Col3
How can I do this task with awk, or some other way, on the Linux command line?
Expected result:
percent|count
0-20% | 10
21-50% | 5
51-100%| 10
I calculated the percentage, but I'm still looking for a way to group by Col3:
cat input.txt |awk -F"," '$3=100*$2/$1'
awk approach:
awk 'BEGIN {
FS=",";
OFS="|";
}
(NR > 1){
percent = 100 * $2 / $1;
if (percent <= 20) {
a["0-20%"] += 1;
} else if (percent <= 50) {
a["21-50%"] += 1;
} else {
a["51-100%"] += 1;
}
}
END {
print "percent", "count"
for (i in a) {
print i, a[i];
}
}' data
Sample output:
percent|count
0-20%|3
21-50%|1
51-100%|6
A generic, self-documented approach. It needs some fine-tuning of the group names in the result (whether or not to add 1% to the lower edge, which is not the real point here); a possible tweak is sketched after the code below.
awk -F ',' -v Step='0|20|50|100' '
BEGIN {
# define group
Gn = split( Step, aEdge, "|")
}
NR>1{
# determine the percentage
L = $2 * 100 / ($1>0 ? $1 : 1)
# in which group
for( j=1; ( L < aEdge[j] || L >= aEdge[j+1] ) && j < Gn;) j++
# add to group
G[j]++
}
# print result ordered
END {
print "percent|count"
for( i=1;i<Gn;i++) printf( "%d-%d%%|%d\n", aEdge[i], aEdge[i+1], G[i])
}
' data
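One possible version of that fine-tuning, sketched only (shift the lower edge of every bin after the first by 1, and print 0 for empty bins), would replace the END block with:
END {
  print "percent|count"
  for( i=1;i<Gn;i++) printf( "%d-%d%%|%d\n", aEdge[i] + (i>1 ? 1 : 0), aEdge[i+1], G[i]+0)
}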
Another awk, with parametric bins and formatted output:
$ awk -F, -v OFS=\| -v bins='20,50,100' '
BEGIN {n=split(bins,b)}
NR>1 {for(i=1;i<=n;i++)
if($2/$1 <= b[i]/100)
{a[b[i]]++; next}}
END {print "percent","count";
b[0]=-1;
for(i=1;i<=n;i++)
printf "%-7s|%3s\n", b[i-1]+1"-"b[i]"%",a[b[i]]}' file
percent|count
0-20% | 3
21-50% | 1
51-100%| 6
Pure bash:
# arguments are histogram boundaries *in ascending order*
hist () {
local lower=0$(printf '+(val*100>sum*%d)' "$@") val sum count n;
set -- 0 "$@" 100;
read -r
printf '%7s|%5s\n' percent count;
while IFS=, read -r sum val; do echo $((lower)); done |
sort -n | uniq -c |
while read count n; do
printf '%2d-%3d%%|%5d\n' "${@:n+1:2}" $count;
done
}
Example:
$ hist 20 50 < csv.dat
percent|count
0- 20%| 3
20- 50%| 1
50-100%| 6
Potential Issue: Does not print intervals with no values:
$ hist 20 25 45 50 < csv.dat
percent|count
0- 20%| 3
25- 45%| 1
50-100%| 6
Explanation:
lower is set to an arithmetic expression that counts how many of the given boundaries fall below the row's percentage 100*val/sum
The list of intervals is augmented with 0 and 100 so that the limits print correctly
The header line is ignored
The output header is printed
For each CSV row, read the variables $sum and $val and send the numeric evaluation of $lower (which uses those variables) to...
count the number of occurrences of each interval index...
and print the interval and count
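To make the mechanism concrete, here is roughly how the pieces expand for hist 20 50 (the trace below is illustrative, using a row from the sample input):
lower='0+(val*100>sum*20)+(val*100>sum*50)'   # counts how many boundaries the row's percentage exceeds
set -- 0 20 50 100                            # interval edges, used only for the labels
# the row "120000,14051" gives 14051*100 <= 120000*20, so $((lower)) evaluates to 0
# and the label comes from "${@:0+1:2}", i.e. "0 20"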
Another, in GNU awk, using switch and regex to identify the values (since parsing was tagged in OP):
NR>1{
switch(p=$2/$1){
case /0\.[01][0-9]|\.20/:
a["0-20%"]++;
break;
case /\.[2-4][0-9]|\.50/:
a["21-50%"]++;
break;
default:
a["51-100%"]++
}
}
END{ for(i in a)print i, a[i] }
Run it:
$ awk -F, -f program.awk file
21-50% 1
0-20% 3
51-100% 6
Hopefully someone out there can help me, and anyone else with a similar problem, find a simple solution to capturing data. I have spent hours trying a one-liner to solve something I thought was a simple problem involving awk, a CSV file, and saving the output as a bash variable. In short, here's the nut of it...
The Missions:
1) To output every other column, starting from the LAST COLUMN, with a specific iteration count.
2) To output every other column, starting from NEXT TO LAST COLUMN, with a specific iteration count.
The Data (file.csv):
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
Desired results for Mission 1:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Desired results for Mission 2:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
My Attempts:
The closest I have come to solving any of the above problems is an ugly pipe (which is OK for skinning a cat) for Mission 1. However, it doesn't use any declared iteration count (which should be 5). Also, I'm completely lost on solving Mission 2.
Any help to simplify the below and to solve Mission 2 will be HELLA appreciated!
outcome=$( awk 'BEGIN {FS = "#"} {for (i = 0; i <= NF; i += 2) printf ("%s%c", $(NF-i), i + 2 <= NF ? "#" : "\n");}' file.csv | sed 's/##.*//g' | awk -F# '{for (i=NF;i>0;i--){printf $i"#"};printf "\n"}' | sed 's/#$//g' | awk -F# '{$1="";print $0}' OFS=# | sed 's/^#//g' );
Also, if doing a loop for a specific number of iterations is helpful in solving this problem, then the magic number is 5. Maybe a solution could be a for loop that counts from right to left, treating every other column as one iteration, with the starting column declared as an awk variable (just a thought; I have no idea how to do it).
Thank you for looking over this problem.
There are certainly more elegant ways to do this, but I am not really an awk person:
Part 1:
awk -F# '{ x = ""; for (f = NF; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Part 2:
awk -F# '{ x = ""; for (f = NF - 1; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
The literal 5 in each of those is your "number of iterations."
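If you would rather pass the iteration count than edit the literal, a small sketch (the variable name n is my choice, not part of the original answer):
awk -F# -v n=5 '{ x = ""; for (f = NF; f > (NF - n * 2); f -= 2) { x = x ? $f "#" x : $f } print x }' file.csv
For Mission 2, start the loop at f = NF - 1 as above.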
Sample data:
$ cat mission.dat
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
One awk solution:
NOTE: the OP can add logic to validate the input parameters; a minimal sketch follows the script below.
$ cat mission
#!/bin/bash
# format: mission { 1 | 2 } { number_of_fields_to_display }
mission=${1} # assumes user inputs "1" or "2"
offset=$(( mission - 1 )) # subtract one to determine awk/NF offset
iteration_count=${2} # assume for now this is a positive integer
awk -F"#" -v offset=${offset} -v itcnt=${iteration_count} 'BEGIN { OFS=FS }
{ # we will start by counting fields backwards until we run out of fields
# or we hit "itcnt==iteration_count" fields
loopcnt=0
for (i=NF-offset ; i>=0; i-=2) # offset=0 for mission=1; offset=1 for mission=2
{ loopcnt++
if (loopcnt > itcnt)
break
fstart=i # keep track of the field we want to start with
}
# now printing our fields starting with field # "fstart";
# prefix the first printf with a empty string, then each successive
# field is prefixed with OFS=#
pfx = ""
for (i=fstart; i<= NF-offset; i+=2)
{ printf "%s%s",pfx,$i
pfx=OFS
}
# terminate a line of output with a linefeed
printf "\n"
}
' mission.dat
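A minimal sketch of the validation mentioned in the NOTE above (the checks and messages are illustrative, not part of the original script); it could go right after the parameter assignments:
case "${mission}" in
  1|2) ;;
  *) echo "usage: mission {1|2} {number_of_fields_to_display}" >&2; exit 1 ;;
esac
[[ "${iteration_count}" =~ ^[1-9][0-9]*$ ]] || { echo "iteration count must be a positive integer" >&2; exit 1; }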
Some test runs:
###### mission #1
# with offset/iteration = 4
$ mission 1 4
2.25#1.5#1#3.25
5.25#4#3#3.25
.2#.5#1#3.75
13.75#13#8.5#6
.2#.5#3#6.5
10.25#10.5#11#12.75
#with offset/iteration = 5
$ mission 1 5
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
# with offset/iteration = 6
$ mission 1 6
12#2#2.25#1.5#1#3.25
7#9#5.25#4#3#3.25
4#4#.2#.5#1#3.75
3#8#13.75#13#8.5#6
10#1#.2#.5#3#6.5
8#7#10.25#10.5#11#12.75
###### mission #2
# with offset/iteration = 4
$ mission 2 4
4#3#1#1
6#5#4#2
1#1#2#3
8#8#6#4
3#2#3#5
7#7#8#6
# with offset/iteration = 5
$ mission 2 5
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
# with offset/iteration = 6;
# notice we pick up field #1 = empty string so output starts with a '#'
$ mission 2 6
#SayWhat#4#3#1#1
#Smarty#6#5#4#2
#IfYouLike#1#1#2#3
#LaughingHard#8#8#6#4
#AtFunny#3#2#3#5
#PunchLines#7#7#8#6
This is probably not what you're asking for, but perhaps it will give you an idea.
$ awk -F_ -v skip=4 -v endoff=0 '
BEGIN {OFS=FS}
{offset=(NF-endoff)%skip;
for(i=offset;i<=NF-endoff;i+=skip) printf "%s",$i (i>=(NF-endoff)?ORS:OFS)}' file
112_116_120
122_126_130
132_136_140
142_146_150
You specify the skip between columns and the end offset as input variables. Here, for the last column, the end offset is set to zero and the skip is 4.
For clarity I used the input file
$ cat file
_111_112_113_114_115_116_117_118_119_120
_121_122_123_124_125_126_127_128_129_130
_131_132_133_134_135_136_137_138_139_140
_141_142_143_144_145_146_147_148_149_150
changing FS for your format should work.
Using a bash script (Ubuntu 16.04), I'm trying to compare two lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row from the second file. Here I have each range as two tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.
file1:
1 4
5 7
8 11
12 15
file2:
3 4
8 13
20 24
Desired output:
3 4
8 13
My best attempt has been:
awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next};
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then
{print $1, $2}}}}}' file1 file2 > output.txt
This returns an empty file.
I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.
Any help appreciated!
It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:
declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges
# read ranges in second file, store then in arrays min and max
while read a b; do
min+=( "$a" );
max+=( "$b" );
done < file2
# read ranges in first file
while read a b; do
# loop over indexes of min (and max) array
for i in "${!min[#]}"; do
if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
echo "${min[i]} ${max[i]}" # print range
unset min[i] max[i] # performance optimization
fi
done
done < file1
This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
EDIT 1: improved the range overlap test.
EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.
EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:
awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
{ for (i in a)
if ($1 <= b[i] && a[i] <= $2) {
print a[i], b[i]; delete a[i]; delete b[i];
}
}' file2 file1
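To see why that test works: two ranges [a,b] and [x,y] overlap exactly when neither ends before the other starts, i.e. a <= y and x <= b. Checking against the sample data:
file2 range [8,13]  vs file1 range [8,11]:   8 <= 13 and 8 <= 11  -> overlap, "8 13" is printed
file2 range [20,24] vs file1 range [12,15]:  20 <= 15 is false    -> no overlap, nothing printed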
awk solution:
awk 'NR==FNR{ a[$1]=$2; next }
{ for(i in a)
if (($1>=i+0 && $1<=a[i]) || ($2<=a[i] && $2>=i+0)) {
print i,a[i]; delete a[i];
}
}' file2 file1
The output:
3 4
8 13
awk 'FNR == 1 && NR == 1 { file=1 } FNR == 1 && NR != 1 { file=2 } file ==1 { for (q=1;q<=NF;q++) { nums[$q]=$0} } file == 2 { for ( p=1;p<=NF;p++) { for (i in nums) { if (i == $p) { print $0 } } } }' file1 file2
Break down:
FNR == 1 && NR == 1 {
file=1
}
FNR == 1 && NR != 1 {
file=2
}
file == 1 {
for (q=1;q<=NF;q++) {
nums[$q]=$0
}
}
file == 2 {
for ( p=1;p<=NF;p++) {
for (i in nums) {
if (i == $p) {
print $0
}
}
}
}
Basically we set file = 1 when we are processing the first file and file = 2 when we are processing the second file. When we are in the first file, read the line into an array keyed on each field of the line. When we are in the second file, process the array (nums) and check if there is an entry for each field on the line. If there is, print it.
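For reference, the same two-file detection is often written with the NR==FNR idiom used in other answers here; a rough equivalent of the logic above (valid only when the first file is not empty):
awk 'NR==FNR { for (q=1; q<=NF; q++) nums[$q]=$0; next }
     { for (p=1; p<=NF; p++) if ($p in nums) print $0 }' file1 file2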
For GNU awk as I'm controlling the for scanning order for optimizing time:
$ cat program.awk
BEGIN {
PROCINFO["sorted_in"]="#ind_num_desc"
}
NR==FNR { # hash file1 to a
if((($2 in a)==0) || $1<a[$2]) # avoid collisions: keep the smaller lower bound
a[$2]=$1
next
}
{
for(i in a) { # in desc order
# print "DEBUG: For:",$0 ":", a[i], i # remove # for debug
if(i+0>$1) { # next after
if($1<=i+0 && a[i]<=$2) {
print
next
}
}
else
next
}
}
Test data:
$ cat file1
0 3 # testing for completely overlapping ranges
1 4
5 7
8 11
12 15
$ cat file2
1 2 # testing for completely overlapping ranges
3 4
8 13
20 24
Output:
$ awk -f program.awk file1 file2
1 2
3 4
8 13
and
$ awk -f program.awk file2 file1
0 3
1 4
8 11
12 15
If a Perl solution is preferred, then the one-liner below would work:
/tmp> cat marla1.txt
1 4
5 7
8 11
12 15
/tmp> cat marla2.txt
3 4
8 13
20 24
/tmp> perl -lane ' BEGIN { %kv=map{split(/\s+/)} qx(cat marla2.txt) } { foreach(keys %kv) { if($F[0]==$_ or $F[1]==$kv{$_}) { print "$_ $kv{$_}" }} } ' marla1.txt
3 4
8 13
/tmp>
If the ranges are ordered according to their lower bounds, we can use this to make the algorithms more efficient. The idea is to alternately proceed through the ranges in file1 and file2. More precisely, when we have a certain range R in file2, we take further and further ranges in file1 until we know whether these overlap with R. Once we know this, we switch to the next range in file2.
#!/bin/bash
exec 3< "$1" # file whose ranges are checked for overlap with those ...
exec 4< "$2" # ... from this file, and if so, are written to stdout
l4=-1 # lower bound of current range from file 2
u4=-1 # upper bound
# initialized with -1 so the first range is read on the first iteration
echo "Ranges in $1 that intersect any ranges in $2:"
while read l3 u3; do # read next range from file 1
if (( u4 >= l3 )); then
(( l4 <= u3 )) && echo "$l3 $u3"
else # the upper bound from file 2 is below the lower bound from file 1, so ...
while read l4 u4; do # ... we read further ranges from file 2 until ...
if (( u4 >= l3 )); then # ... their upper bound is high enough
(( l4 <= u3 )) && echo "$l3 $u3"
break
fi
done <&4
fi
done <&3
The script can be called with ./script.sh file2 file1
I'm wondering if there is a way to select columns by matching the header.
The data looks like this
ID_1 ID_2 ID_3 ID_6 ID_15
value1 0 2 4 7 6
value2 0 4 4 3 8
value3 2 2 3 7 8
I would like to get only the columns ID_3 & ID_15:
ID_3 ID_15
4 6
4 8
3 8
awk can easily extract them if I know the positions of the columns.
However, I have a very large table and only a list of IDs in hand.
Can I still use awk, or is there an easier way to do this in Linux?
The input format isn't well defined, but there are a few simple ways: awk, perl and sqlite.
(FNR==1) {
nocol=split(col,ocols,/,/) # col contains the requested column names
ncols=split("vals " $0,cols) # header line
for (nn=1; nn<=ncols; nn++) colmap[cols[nn]]=nn # map names
OFS="\t" # to align output
for (nn=1; nn<=nocol; nn++) printf("%s%s",ocols[nn],OFS)
printf("\n") # output header line
}
(FNR>1) { # read data
for (nn=1; nn<=nocol; nn++) {
if (nn>1) printf(OFS) # pad
if (ocols[nn] in colmap) { printf("%s",$(colmap[ocols[nn]])) }
else { printf "--" } # named column not in data
}
printf("\n") # wrap line
}
$ nawk -f mycols.awk -v col=ID_3,ID_15 data
ID_3 ID_15
4 6
4 8
3 8
Perl, just a variation on the above with some perl idioms to confuse/entertain:
use strict;
use warnings;
our @ocols=split(/,/,$ENV{cols}); # cols env var contains named columns
our $nocol=scalar(@ocols);
our ($nn,%colmap);
$,="\t"; # OFS equiv
# while (<>) {...} implicit with perl -an
if ($. == 1) { # FNR equiv
%colmap = map { $F[$_] => $_+1 } 0..$#F ; # create name map hash
$colmap{vals}=0; # name anon 1st col
print @ocols,"\n"; # output header
} else {
for ($nn = 0; $nn < $nocol; $nn++) {
print "\t" if ($nn>0);
if (exists($colmap{$ocols[$nn]})) { printf("%s",$F[$colmap{$ocols[$nn]}]) }
else { printf("--") } # named column not in data
}
printf("\n")
}
$ cols="ID_3,ID_15" perl -an mycols.pl < data
That uses an environment variable to skip effort parsing the command line. It needs the perl options -an which set up field-splitting and an input read loop (much like awk does).
And with sqlite (I used v3.11, v3.8 or later is required for useful .import I believe). This uses an in-memory temporary database (name a file if too large for memory, or for a persistent copy of the parsed data), and automatically creates a table based on the first line. The advantages here are that you might not need any scripting at all, and you can perform multiple queries on your data with just one parse overhead.
You can skip this next step if you have a single hard-tab delimiting the columns, in which case replace .mode csv with .mode tab in the sqlite example below.
Otherwise, to convert your data to a suitable CSV-ish format:
nawk -v OFS="," '(FNR==1){$0="vals " $0} {$1=$1;print} < data > data.csv
This adds a dummy first column "vals" to the first line, then prints each line comma-separated. It does this with a seemingly pointless assignment to $1, which causes $0 to be recomputed, replacing FS (space/tab) with OFS (comma).
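A tiny illustration of that rebuild trick (the input line is made up):
$ echo 'a b c' | nawk -v OFS="," '{$1=$1; print}'
a,b,c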
$ sqlite3
sqlite> .mode csv
sqlite> .import data.csv mytable
sqlite> .schema mytable
CREATE TABLE mytable(
"vals" TEXT,
"ID_1" TEXT,
"ID_2" TEXT,
"ID_3" TEXT,
"ID_6" TEXT,
"ID_15" TEXT
);
sqlite> select ID_3,ID_15 from mytable;
ID_3,ID_15
4,6
4,8
3,8
sqlite> .mode column
sqlite> select ID_3,ID_15 from mytable;
ID_3 ID_15
---------- ----------
4 6
4 8
3 8
Use .once or .output to send output to a file (sqlite docs). Use .headers on or .headers off as required.
sqlite is quite happy to create an unnamed column, so you don't have to add a name to the first column of the header line, but you do need to make sure the number of columns is the same for all input lines and formats.
If you get "expected X columns but found Y" errors during the .import then you'll need to clean up the data format a little for this.
$ cat c.awk
NR == 1 {
for (i=1; i<=NF; ++i) {
if ($i == "ID_3") col_3 = (i + 1)
if ($i == "ID_15") col_15 = (i + 1)
}
print "ID_3", "ID_15"
}
NR > 1 { print $col_3, $col_15 }
$ awk -f c.awk c.txt
ID_3 ID_15
4 6
4 8
3 8
You could go for something like this:
BEGIN {
keys["ID_3"]
keys["ID_15"]
}
NR == 1 {
for (i = 1; i <= NF; ++i)
if ($i in keys) cols[++n] = i
}
{
for (i = 1; i <= n; ++i)
printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS)
}
Save the script to a file and run it like awk -f script.awk file.
Alternatively, as a "one-liner":
awk 'BEGIN { keys["ID_3"]; keys["ID_15"] }
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in keys) cols[++n] = i }
{ for (i = 1; i <= n; ++i) printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) }' file
Before the file is processed, keys are set in the keys array, corresponding to the column headings of interest.
On the first line, record all the column numbers that contain one of the keys in the cols array.
Loop through each of the cols and print them out, followed by either the output field separator OFS or the output record separator ORS, depending on whether it's the last one. $(cols[i]+(NR>1)) handles the fact that rows after the first have an extra field at the start, because NR>1 will be true (1) for those lines and false (0) for the first line.
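A quick way to see that boolean arithmetic in action (throwaway input):
$ printf 'h\nd1\nd2\n' | awk '{ print NR, (NR>1) }'
1 0
2 1
3 1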
Try the script below:
#!/bin/sh
file="$1"; shift
awk -v cols="$*" '
BEGIN{
split(cols,C)
OFS=FS="\t"
getline
split($0,H)
for(c in C){
for(h in H){
if(C[c]==H[h])F[i++]=h
}
}
}
{ l="";for(f in F){l=l $F[f] OFS}print l }
' "$file"
On the command line, type:
[sumit.gupta@rpm01 ~]$ test.sh filename ID_3 ID_15
I have five different files. Part of each file looks like this:
ifile1.txt ifile2.txt ifile3.txt ifile4.txt ifile5.txt
2 3 2 3 2
1 2 /no value 2 3
/no value 2 4 3 /no value
3 1 0 0 1
/no value /no value /no value /no value /no value
I need to compute the average of these five files, ignoring the missing values, i.e.
ofile.txt
2.4
2.0
3.0
1.0
99999
Here 2.4 = (2+3+2+3+2)/5
2.0 = (1+2+2+3)/4
3.0 = (2+4+3)/3
1.0 = (3+1+0+0+1)/5
99999 = all are missing
I was trying the following way, but I don't feel it is a proper approach.
paste ifile1.txt ifile2.txt ifile3.txt ifile4.txt ifile5.txt > ofile.txt
tr '\n' ' ' < ofile.txt > ofile1.txt
awk '!/\//{sum += $1; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile2.txt
awk '!/\//{sum += $2; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile3.txt
awk '!/\//{sum += $3; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile4.txt
awk '!/\//{sum += $4; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile5.txt
awk '!/\//{sum += $5; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile6.txt
paste ofile2.txt ofile3.txt ofile4.txt ofile5.txt ofile6.txt > ofile7.txt
tr '\n' ' ' < ofile7.txt > ofile.txt
The following script.awk will deliver what you want:
BEGIN {
gap = -1;
maxidx = -1;
}
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
You call it with:
awk -f script.awk ifile*.txt
and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:
BEGIN {
gap = -1;
maxidx = -1;
}
This begin section runs before any lines are processed and it sets the current gap and maximum index accordingly.
The gap is the difference between the overall line number NR and the file line number FNR, used to detect when you switch files, something that's very handy when processing multiple input files.
The maximum index is used to figure out the largest line count so as to output the correct number of records at the end.
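A quick way to watch that gap change (using the same input files): NR keeps counting across all files while FNR restarts at 1 in each file, so NR-FNR jumps exactly when awk moves on to the next file.
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, "gap=" (NR-FNR) }' ifile1.txt ifile2.txt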
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
The above code is the meat of the solution, executed per line. The first if statement is used to detect whether you've just moved into a new file and it does this simply so it can aggregate all the associated lines from each file. By that I mean the first line in each input file is used to calculate the average for the first line of the output file.
The second if statement adjusts maxidx if the current line number is beyond any previous line number we've encountered. This is for the case where file one may have seven lines but file two has nine lines (not so in your case but it's worth handling anyway). A previously unencountered line number also means we initialise its sum and count to be zero.
The final if statement simply updates the sum and count if the line contains anything other than /no value.
And then, of course, you need to adjust the line number for the next time through.
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value), we adjust the sum and count so as to get 99999 instead. Then we just print the average.
So, running that code over your input files gives, as requested:
$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999
Using bash and numaverage (which ignores non-numeric input), plus paste, sed and tr (both for cleaning, since numaverage needs single column input, and throws an error if input is 100% text):
paste ifile* | while read x ; do \
numaverage <(tr '\t' '\n' <<< "$x") 2>&1 | \
sed -n '1{s/Emp.*/99999/;p}' ; \
done
Output:
2.4
2
3
1
99999
I have a file with one column and I want to add the numbers in this column from the bottom of the file and print the running sum on each line. For example, if I have the following numbers:
1
2
3
4
5
6
I expect the result to look like this:
21(6+5+4+3+2+1)
20(6+5+4+3+2)
18(6+5+4+3)
15(6+5+4)
11(6+5)
6 (6)
I could think of the following if I were adding the numbers from top to bottom. I wonder if there is a way to reverse the order of summing using Linux, cat, awk, etc. Any help or suggestion is appreciated.
cat file.txt | gawk ' { sum+=$1; print sum; }' > Final.file
$ tac file | awk ' { sum+=$1; print sum }' | tac
21
20
18
15
11
6
If you actually want to see the equation:
seq 6 |
awk '
{
sum[NR] = $1
eq[NR] = $1
for (i=1; i<NR; i++) {
sum[i] += $1
eq[i] = $1 "+" eq[i]
}
}
END {for (i=1; i<=NR; i++) print sum[i] "(" eq[i] ")"}
'
21(6+5+4+3+2+1)
20(6+5+4+3+2)
18(6+5+4+3)
15(6+5+4)
11(6+5)
6(6)