Finding averages from reading a file using Bash Scripting - linux

I am trying to write a bash script that reads a file 'names.txt' and computes the average of people's grades. For instance, names.txt looks something like this:
900706845 Harry Thompson 70 80 90
900897665 Roy Ludson 90 90 90
The script should read each line and print the person's ID, the average of the three test scores, and the corresponding letter grade, so the output needs to look like this:
900706845 80 B
900897665 90 A
Here's what I have:
#!/bin/bash
cat names.txt | while read x
do
$SUM=0; for i in 'names.txt'; do SUM=$(($SUM + $i));
done;
echo $SUM/3
done
I understand the echo will only print out the averages at this point, but I am trying to at least get it to compute the averages before I attempt the other parts. Baby steps!

Like this maybe:
#!/bin/bash
while read a name1 name2 g1 g2 g3
do
avg=$(echo "($g1+$g2+$g3)/3" | bc)
echo $a $name1 $name2 $avg
done < names.txt
Output:
900706845 Harry Thompson 80
900897665 Roy Ludson 90
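To take the next baby step in the same style, a case statement can map the integer average to a letter. This is only a sketch: grade_line is a made-up helper name, and the 90/80/70/60 thresholds are the ones used in the awk answers.

```shell
#!/bin/bash
# Turn one "ID First Last g1 g2 g3" line into "ID average letter".
grade_line() {
  set -- $1                            # split the line into fields
  local avg=$(( ($4 + $5 + $6) / 3 ))  # integer average of the three scores
  local letter
  case $avg in
    9[0-9]|100) letter=A ;;
    8[0-9])     letter=B ;;
    7[0-9])     letter=C ;;
    6[0-9])     letter=D ;;
    *)          letter=F ;;
  esac
  echo "$1 $avg $letter"               # after set --, $1 is the ID
}

while read -r line; do
  grade_line "$line"
done <<'EOF'
900706845 Harry Thompson 70 80 90
900897665 Roy Ludson 90 90 90
EOF
```

With the sample data this prints 900706845 80 B and 900897665 90 A.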

Customize gradeLetter for your own needs:
#!/bin/sh
awk '
function gradeLetter(g)
{
if (g >= 90) return "A";
if (g >= 80) return "B";
if (g >= 70) return "C";
if (g >= 60) return "D";
return "E"
}
{
avgGrade = ($(NF) + $(NF - 1) + $(NF - 2)) / 3;
print $1, avgGrade, gradeLetter(avgGrade)
}' names.txt

With an awk one-liner:
awk '{ AVG = int( ( $(NF-2) + $(NF-1) + $(NF) ) / 3 ) ; if ( AVG >= 90 ) { GRADE = "A" } else if ( AVG >= 80 ) { GRADE = "B" } else if ( AVG >= 70 ) { GRADE = "C" } else if ( AVG >= 60 ) { GRADE = "D" } else { GRADE = "F" } ; print $1, AVG, GRADE }' file
Let's look at the details:
awk '{
# Calculate average
AVG = int( ( $(NF-2) + $(NF-1) + $(NF) ) / 3 )
# Calculate grade
if ( AVG >= 90 ) { GRADE = "A" }
else if ( AVG >= 80 ) { GRADE = "B" }
else if ( AVG >= 70 ) { GRADE = "C" }
else if ( AVG >= 60 ) { GRADE = "D" }
else { GRADE = "F" }
print $1, AVG, GRADE
}' file
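To sanity-check the one-liner without creating a file, the sample lines can be fed on stdin (a sketch of the same logic):

```shell
printf '%s\n' '900706845 Harry Thompson 70 80 90' '900897665 Roy Ludson 90 90 90' |
awk '{
  AVG = int( ( $(NF-2) + $(NF-1) + $(NF) ) / 3 )
  if      ( AVG >= 90 ) { GRADE = "A" }
  else if ( AVG >= 80 ) { GRADE = "B" }
  else if ( AVG >= 70 ) { GRADE = "C" }
  else if ( AVG >= 60 ) { GRADE = "D" }
  else                  { GRADE = "F" }
  print $1, AVG, GRADE
}'
```

which prints the two expected lines, 900706845 80 B and 900897665 90 A.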

The ID#s and averages can be obtained as follows:
$ awk '{sum=0; for(i=3;i<=NF;i++) sum+=$i ; print $1, sum/3}' names.txt
900706845 80
900897665 90
Guessing at how to compute grades, one can do:
$ awk '{sum=0; for(i=3;i<=NF;i++) sum+=$i ; ave=sum/3; print $1, ave, substr("FFFFFDCBA", ave/10, 1) }' names.txt
900706845 80 B
900897665 90 A
The above solutions work for any number of tests but names are limited to 2 words. If there will always be three tests but names can be any number of words, then use:
$ awk '{ave=($(NF-2)+$(NF-1)+$NF)/3; print $1, ave, substr("FFFFFDCBA", ave/10, 1) }' names.txt
900706845 80 B
900897665 90 A
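The substr() trick is worth a note: ave/10, truncated to an integer, indexes into "FFFFFDCBA", so positions 1-5 yield F, 6 D, 7 C, 8 B, 9 A (and a perfect 100 would index past the end, yielding an empty string). A sketch with the truncation made explicit via int():

```shell
lookup() { awk -v ave="$1" 'BEGIN { print substr("FFFFFDCBA", int(ave / 10), 1) }'; }
lookup 95   # A
lookup 80   # B
lookup 73   # C
lookup 65   # D
lookup 42   # F
```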

Related

Awk command to convert hex to signed decimal

I have a text file which consists of 3 columns with hex numbers (values are variable, these used only as an example):
X Y Z
0a0a 0b0b 0c0c
0a0a 0b0b 0c0c
0a0a 0b0b 0c0c
0a0a 0b0b 0c0c
I want to convert these numbers to signed decimal and print them in the same structure they are in, so I did:
awk '{x="0x"$1;
y="0x"$2;
z="0x"$3;
printf ("%d %d %d\n", x, y, z);}' input_file.txt > output_file.txt
The list that I get as an output consists only of unsigned values.
You can use an awk function to perform the two's-complement conversion:
function hex2int( hexstr, nbits )
{
max = 2 ^ nbits
med = max / 2
num = strtonum( "0x" hexstr )
return ((num < med) ? num : ( (num > med) ? num - max : -med ))
}
4-bit conversion examples:
print hex2int( "7", 4 ) # +7
print hex2int( "2", 4 ) # +2
print hex2int( "1", 4 ) # +1
print hex2int( "0", 4 ) # 0
print hex2int( "f", 4 ) # -1
print hex2int( "d", 4 ) # -3
print hex2int( "9", 4 ) # -7
print hex2int( "8", 4 ) # -8
8-bit conversion examples:
print hex2int( "7f", 8 ) # +127
print hex2int( "40", 8 ) # +64
print hex2int( "01", 8 ) # +1
print hex2int( "00", 8 ) # 0
print hex2int( "ff", 8 ) # -1
print hex2int( "c0", 8 ) # -64
print hex2int( "81", 8 ) # -127
print hex2int( "80", 8 ) # -128
Putting it all together using a 16-bit conversion:
#!/bin/awk -f
function hex2int( hex )
{
num = strtonum( "0x" hex )
return ((num < med) ? num : ( (num > med) ? num - max : -med ))
}
BEGIN {
nbits = 16
max = 2 ^ nbits
med = max / 2
}
{
for( i = 1; i <= NF; i++ )
{
if( NR == 1 )
{
printf "%s%s", $i, OFS
}
else
{
printf "%d%s", hex2int($i), OFS
}
}
printf "%s", ORS
}
# eof #
Input file:
X Y Z
0a0a 0b0b 0c0c
abcd ef01 1234
ffff fafa baba
12ab abca 4321
Testing:
$ awk -f script.awk -- input.txt
Output:
X Y Z
2570 2827 3084
-21555 -4351 4660
-1 -1286 -17734
4779 -21558 17185
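The conversion rule can also be checked without awk, using bash arithmetic; a sketch (hex2int here is a hypothetical shell helper; the num == med case folds into num - max, which equals -med):

```shell
hex2int() {                  # usage: hex2int HEXSTR NBITS
  local num=$(( 16#$1 ))     # parse the hex string
  local max=$(( 1 << $2 ))   # 2^nbits
  local med=$(( max / 2 ))
  echo $(( num < med ? num : num - max ))
}
hex2int 7f 8      # 127
hex2int ff 8      # -1
hex2int 0a0a 16   # 2570
hex2int abca 16   # -21558
```

The last two values match the table above.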
Reference: https://en.wikipedia.org/wiki/Two's_complement
Hope it helps!

Average of multiple files without considering missing values

I want to calculate the average of 15 files: ifile1.txt, ifile2.txt, ..., ifile15.txt. The number of columns and rows is the same in each file, but some of the values are missing. Part of the data looks like:
ifile1.txt ifile2.txt ifile3.txt
3 ? ? ? . 1 2 1 3 . 4 ? ? ? .
1 ? ? ? . 1 ? ? ? . 5 ? ? ? .
4 6 5 2 . 2 5 5 1 . 3 4 3 1 .
5 5 7 1 . 0 0 1 1 . 4 3 4 0 .
. . . . . . . . . . . . . . .
I would like to produce a new file showing the average of these 15 files without considering the missing values.
ofile.txt
2.66 2 1 3 . (i.e. average of 3 1 4, average of ? 2 ? and so on)
2.33 ? ? ? .
3 5 4.33 1.33 .
3 2.67 4 0.66 .
. . . . .
This question is similar to my earlier question Average of multiple files in shell where the script was
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++) sum[FNR,i] += $i
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
printf " %f", sum[line,col]/nfiles;
printf "\n"
}
}' ifile*.txt
But I haven't been able to modify it.
Use this:
paste ifile*.txt | awk '{n=f=0; for(i=1;i<=NF;i++){if($i*1){f++;n+=$i}}; print n/f}'
paste will show all files side by side
awk calculates the averages per line:
n=f=0; set the variables to 0.
for(i=1;i<=NF;i++) loop through all the fields.
if($i*1) if the field is a nonzero number (a ? evaluates to 0, so it is skipped).
f++;n+=$i increment f (the number of numeric fields) and add the field to the sum n.
print n/f print the average n/f.
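One caveat worth knowing: $i*1 treats ? as 0, which is how missing values get skipped, but it would skip a genuine 0 score for the same reason. A sketch:

```shell
# The ? field contributes nothing, so the line averages 3 and 4:
printf '3 ? 4\n' | awk '{n=f=0; for(i=1;i<=NF;i++){if($i*1){f++;n+=$i}}; print n/f}'   # 3.5
# A literal 0 is skipped as well, so this prints 4 rather than 2:
printf '0 4\n' | awk '{n=f=0; for(i=1;i<=NF;i++){if($i*1){f++;n+=$i}}; print n/f}'     # 4
```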
awk '
{
for (i = 1;i <= NF;i++) {
Sum[FNR,i]+=$i
Count[FNR,i]+=$i!="?"
}
}
END {
for( i = 1; i <= FNR; i++){
for( j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
print ""
}
}
' ifile*
assuming the files are correctly formed (no trailing blank lines, ...)
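A quick run of the Sum/Count idea on two tiny files (the file names are made up; note that a ? adds 0 to Sum, so only Count needs to exclude it):

```shell
printf '3 ?\n1 ?\n' > /tmp/ifileA.txt
printf '1 2\n1 ?\n' > /tmp/ifileB.txt
awk '
{ for (i = 1; i <= NF; i++) { Sum[FNR,i] += $i; Count[FNR,i] += ($i != "?") } }
END {
  for (i = 1; i <= FNR; i++) {
    for (j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
    print ""
  }
}' /tmp/ifileA.txt /tmp/ifileB.txt
```

This prints 2 2 and 1 ? — the averages of (3,1) and (2), of (1,1), and of no data at all.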
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++)
if ( $i != "?" ) { sum[FNR,i] += $i ; count[FNR,i]++ ;}
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
if ( count[line,col] > 0 ) printf " %f", sum[line,col]/count[line,col];
else printf " ? " ;
printf "\n" ;
}
}' ifile*.txt
I just check for the '?' ...

Use awk to check a specific combination in other files

I have 3 files
base.txt
12345 6 78
13579 2 46
24680 1 35
123451 266 78
135792 6572 46
246803 12587 35
1stcheck.txt
Some odd stuff
AB 12345/6/78 Fx00
BC 13579/2/47 0xFF
CD 24680/1/35 5x88
AB 123451/266_10/78 Fx00 #10 is mod(266,256)
BC 135792/6572_172/46 0xFF #172 is mod(6572,256)
CD 246803/12587_43/35 5x88 #43 is mod(12587,256)
There may be some other odd stuff
2ndcheck.txt
12345u_6_78.dat
13579u_2_46.dat
24680u_0_35.dat
123451u_10_78.dat #10 is mod(266,256)
135792u_172_46.dat #172 is mod(6572,256)
246803u_43_35.dat #43 is mod(12587,256)
The info in 1stcheck.txt and 2ndcheck.txt is just the content of base.txt with a template/format applied.
I'd like to have
report.txt
12345 6 78 passed passed
| |
(12345/6/78) (12345u_6_78)
13579 2 46 failed passed
24680 1 35 passed failed
123451 266 78 passed passed
135792 6572 46 passed passed
246803 12587 35 passed passed
Please keep performance in mind, since the files are large:
base.txt, 2ndcheck.txt ~ 8-12 MB
1stcheck.txt ~ 70 MB
Many thanks
You'll have to decide if this is memory efficient: it does have to store data from all files in arrays before printing the table.
Requires GNU awk:
gawk '
# base file: store keys (and line numbers for output ordering)
FILENAME == ARGV[1] {key[$0] = FNR; next}
# 1st check: if key appears in base, store result as pass
FILENAME == ARGV[2] {
k = $2
gsub(/\//, " ", k)
if (k in key) pass1[k] = 1
}
# 2nd check: if key appears in base, store result as pass
FILENAME == ARGV[3] {
if ( match($0, /([0-9]+)._([0-9]+)_([0-9]+)\.dat/, m) ) {
k = m[1] " " m[2] " " m[3]
if (k in key) pass2[k] = 1
}
next
}
# print the result table
END {
PROCINFO["sorted_in"] = "#val_num_asc" # traverse array by line number
for (k in key) {
printf "%s\t%s\t%s\n", k \
, (k in pass1 ? "passed" : "failed") \
, (k in pass2 ? "passed" : "failed")
}
}
' base.txt 1stcheck.txt 2ndcheck.txt
12345 6 78 passed passed
13579 2 46 failed passed
24680 1 35 passed failed
Based on @glenn jackman's suggestion, I could solve my problem:
gawk '
# Store key for 1st check
FILENAME == ARGV[1] {
k = $2
gsub(/\//, " ", k)
key_first[k];next
}
# Store key for 2nd check
FILENAME == ARGV[2] {
if ( match($0, /([0-9]+)._([0-9]+)_([0-9]+)\.dat/, m) ) {
k = m[1] " " m[2] " " m[3]
key_second[k];
}
next
}
# base file: do check on both 1st and 2nd check
FILENAME == ARGV[3] {
if($2>256) {
first=$1 " " $2 "_" ($2%256) " " $3
}
else {
first=$1 " " $2 " " $3
}
second=$1 " " $2%256 " " $3
if (first in key_first) pass1[$0] = 1
if (second in key_second) pass2[$0] = 1
key[$0]= FNR; next
}
# print the result table
END {
PROCINFO["sorted_in"] = "#val_num_asc" # traverse array by line number
for (k in key) {
printf "%s\t%s\t%s\n", k \
, (k in pass1 ? "sic_passed" : "sic_failed") \
, (k in pass2 ? "gd_passed" : "gd_failed")
}
}
' 1stcheck.txt 2ndcheck.txt base.txt
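The _10, _172 and _43 suffixes in the sample files are simply the second field taken modulo 256, which the $2%256 key construction relies on; a quick check (mod256 is a hypothetical helper):

```shell
mod256() { echo $(( $1 % 256 )); }
mod256 266     # 10
mod256 6572    # 172
mod256 12587   # 43
```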

Determining the ratio of matches to non-matches of 2 primary strands? [duplicate]

Possible Duplicate:
How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?
I'm trying to write a Perl script that compares two DNA sequences (say 60 characters in length each) in alignment, and then shows the ratio of matches to non-matches between the sequences. But I'm not having much luck. If it helps I can upload my code, but it's no use. Here's an example of what I'm trying to achieve below.
e.g
A T C G T A C
| | | | | | |
T A C G A A C
So the matches of the above example would be 4. and non-matches are: 3. Giving it a ratio of 4.3.
Any help would be much appreciated. thanks.
In general, please do post your code. It does help. In any case, something like this should do what you are asking:
#!/usr/bin/perl -w
use strict;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my @dna1=split(//,$d1);
my @dna2=split(//,$d2);
my $matches=0;
for (my $i=0; $i<=$#dna1; $i++) {
$matches++ if $dna1[$i] eq $dna2[$i];
}
my $mis=scalar(@dna1)-$matches;
print "Matches/Mismatches: $matches/$mis\n";
Bear in mind though that the ratio of 4 to 3 is most certainly not 4.3 but ~1.3. If you post some information on your input file format I will update my answer to include lines for parsing the sequence from your file.
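For comparison, the same position-by-position walk can be sketched in bash (count_matches is a made-up name; the substring expansion ${a:i:1} plays the role of Perl's split-and-index):

```shell
count_matches() {
  local a=$1 b=$2 i m=0
  for (( i = 0; i < ${#a}; i++ )); do              # walk both strings in lockstep
    [ "${a:i:1}" = "${b:i:1}" ] && m=$(( m + 1 ))  # same base at this position?
  done
  echo "$m"
}
count_matches ATCGTAC TACGAAC   # 4 matches out of 7 positions
```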
Normally I'd say "What have you tried" and "upload your code first" because it doesn't seem to be a very difficult problem. But let's give this a shot:
create two arrays, one to hold each sequence:
@sequenceOne = ("A", "T", "C", "G", "T", "A", "C");
@sequenceTwo = ("T", "A", "C", "G", "A", "A", "C");
$myMatch = 0;
$myMissMatch = 0;
for ($i = 0; $i < @sequenceOne; $i++) {
my $output = "Comparing " . $sequenceOne[$i] . " <=> " . $sequenceTwo[$i];
if ($sequenceOne[$i] eq $sequenceTwo[$i]) {
$output .= " MATCH\n";
$myMatch++;
} else {
$myMissMatch++;
$output .= "\n";
}
print $output;
}
print "You have " . $myMatch . " matches.\n";
print "You have " . $myMissMatch . " mismatches\n";
print "The ratio of hits to misses is " . $myMatch . ":" . $myMissMatch . ".\n";
Of course, you'd probably want to read the sequence from something else on the fly instead of hard-coding the array. But you get the idea. With the above code your output will be:
torgis-MacBook-Pro:platform-tools torgis$ ./dna.pl
Comparing A <=> T
Comparing T <=> A
Comparing C <=> C MATCH
Comparing G <=> G MATCH
Comparing T <=> A
Comparing A <=> A MATCH
Comparing C <=> C MATCH
You have 4 matches.
You have 3 mismatches
The ratio of hits to misses is 4:3.
So many ways to do this. Here's one.
use strict;
use warnings;
my $seq1 = "ATCGTAC";
my $seq2 = "TACGAAC";
my $len = length $seq1;
my $matches = 0;
for my $i (0..$len-1) {
$matches++ if substr($seq1, $i, 1) eq substr($seq2, $i, 1);
}
printf "Length: %d Matches: %d Ratio: %5.3f\n", $len, $matches, $matches/$len;
exit 0;
Just grab the length of one of the strings (we're assuming string lengths are equal, right?), and then iterate using substr.
my @strings = ( 'ATCGTAC', 'TACGAAC' );
my $matched;
foreach my $ix ( 0 .. length( $strings[0] ) - 1 ) {
$matched++
if substr( $strings[0], $ix, 1 ) eq substr( $strings[1], $ix, 1 );
}
print "Matches: $matched\n";
print "Mismatches: ", length( $strings[0] ) - $matched, "\n";
I think substr is the way to go, rather than splitting the strings into arrays.
This is probably most convenient if presented as a subroutine:
use strict;
use warnings;
print ratio(qw/ ATCGTAC TACGAAC /);
sub ratio {
my ($aa, $bb) = @_;
my $total = length $aa;
my $matches = 0;
for (0 .. $total-1) {
$matches++ if substr($aa, $_, 1) eq substr($bb, $_, 1);
}
$matches / ($total - $matches);
}
output
1.33333333333333
Bill Ruppert's right that there are many way to do this. Here's another:
use Modern::Perl;
say compDNAseq( 'ATCGTAC', 'TACGAAC' );
sub compDNAseq {
my $total = my $i = 0;
$total += substr( $_[1], $i++, 1 ) eq $1 while $_[0] =~ /(.)/g;
sprintf '%.2f', $total / ( $i - $total );
}
Output:
1.33
Here is an approach which gives a NULL, \0, for each match in an xor comparison.
#!/usr/bin/perl
use strict;
use warnings;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my $len = length $d1; # assumes $d1 and $d2 are the same length
my $matches = () = ($d1 ^ $d2) =~ /\0/g;
printf "ratio of %f", $matches / ($len - $matches);
Output: ratio of 1.333333

2d histogram making

I have a data file containing two columns, like
1.1 2.2
3.1 4.5
1.2 4.5
3.2 4.6
1.1 2.3
4.2 4.9
4.2 1.1
I would like to make a histogram from the two columns, i.e. to get this output (if the step size (or bin size, as we talking about histogramming) equals to 0.1 in this case)
1.0 1.0 0
1.0 1.1 0
1.0 1.2 0
...
1.1 1.0 0
1.1 1.1 0
1.1 1.2 0
...
1.1 2.0 0
1.1 2.1 0
1.1 2.2 1
...
...
Can anybody suggest something? It would be nice if I could set the range of values of the columns. In the above case the 1st column values go from 1 to 4, and the same for the second column.
EDITED: updated in order to handle more general data input, e.g. floating-point numbers. The step size in the above case is 0.1, but it would be nice if it were tunable, i.e. a step size (bin size) of, for example, 0.2 or 1.0.
If the step size is, for example, 1.0, then 1.1 and 1.8 fall into the same bin and must be counted together, for example (with the range being 0.0 ... 4.0 for both columns):
1.1 1.8
2.5 2.6
1.4 2.1
1.3 1.5
3.3 4.0
3.8 3.9
4.0 3.2
4.0 4.0
output (if the bin size = 1.0)
1 1 2
1 2 1
1 3 0
1 4 0
2 1 0
2 2 1
2 3 0
2 4 0
3 1 0
3 2 0
3 3 1
3 4 1
4 1 0
4 2 0
4 3 1
4 4 1
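Whatever tool does the counting, the heart of the problem is mapping each value to its bin label, essentially int(v/bs)*bs. A sketch (the tiny epsilon is an assumption to guard against float truncation, e.g. 2.2/0.1 evaluating to 21.999...):

```shell
bin_of() { awk -v v="$1" -v bs="$2" 'BEGIN { printf "%g\n", int(v / bs + 1e-9) * bs }'; }
bin_of 1.8 1     # 1
bin_of 4.0 1     # 4
bin_of 2.2 0.1   # 2.2
```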
awk 'END {
for (i = 0; ++i <= l;) {
for (j = 0; ++j <= l;)
printf "%d %d %d %s\n", i, j, \
b[i, j], (j < l ? x : ORS)
}
}
{
f[NR] = $1; s[NR] = $2
b[$1, $2]++
}' l=4 infile
You may try this (not thoroughly tested):
awk -v l=4 -v bs=0.1 'BEGIN {
if (!bs) {
print "invalid bin size" > "/dev/stderr"
exit
}
split(bs, t, ".")
t[2] || fl++
m = "%." length(t[2]) "f"
}
{
fk = fl ? int($1) : sprintf(m, $1)
sk = fl ? int($2) : sprintf(m, $2)
f[fk]; s[sk]; b[fk, sk]++
}
END {
if (!bs) exit 1
for (i = 1; int(i) <= l; i += bs) {
for (j = 1; int(j) <= l; j += bs) {
if (fl) {
fk = int(i); sk = int(j); m = "%d"
}
else {
fk = sprintf(m, i); sk = sprintf(m, j)
}
printf "%s" m OFS m OFS "%d\n", (i > 1 && fk != p ? ORS : x), fk, sk, b[fk, sk]
p = fk
}
}
}' infile
You can try this in bash:
for x in {1..4} ; do
for y in {1..4} ; do
echo $x%$y 0
done
done \
| join -1 1 -2 2 - -a1 <(sed 's/ /%/' FILE \
| sort \
| uniq -c \
| sort -k2 ) \
| sed 's/ 0 / /;s/%/ /'
It creates the table with all zeros in the last column, joins it with the real results (the classic frequency table: sort | uniq -c) and removes the zeros from lines where a different number should be shown.
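The frequency-table core of the pipeline can be seen on its own; a sketch that reorders uniq -c's count to the last column, as in the desired output:

```shell
printf '1.1 2.2\n3.1 4.5\n1.1 2.2\n' | sort | uniq -c | awk '{print $2, $3, $1}'
```

which prints 1.1 2.2 2 and 3.1 4.5 1.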
One solution in perl (sample output and usage to follow):
#!/usr/bin/perl -W
use strict;
my ($min, $step, $max, $file) = @ARGV
or die "Syntax: $0 <min> <step> <max> <file>\n";
my %seen;
open F, "$file"
or die "Cannot open file $file: $!\n";
my @l = map { chomp; $_ } qx/seq $min $step $max/;
foreach my $first (@l) {
foreach my $second (@l) {
$seen{"$first $second"} = 0;
}
}
foreach my $line (<F>) {
chomp $line;
$line or next;
$seen{$line}++;
}
my $len = @l; # size of list
my $i = 0;
foreach my $key (sort keys %seen) {
printf("%s %d\n", $key, $seen{$key});
$i++;
print "\n" unless $i % $len;
}
exit(0);
