Use awk to check a specific combination in other files

I have 3 files
base.txt
12345 6 78
13579 2 46
24680 1 35
123451 266 78
135792 6572 46
246803 12587 35
1stcheck.txt
Some odded stuffs
AB 12345/6/78 Fx00
BC 13579/2/47 0xFF
CD 24680/1/35 5x88
AB 123451/266_10/78 Fx00 #10 is mod(266,256)
BC 135792/6572_172/46 0xFF #172 is mod(6572,256)
CD 246803/12587_43/35 5x88 #43 is mod(12587,256)
There may be some other odded stuffs
2ndcheck.txt
12345u_6_78.dat
13579u_2_46.dat
24680u_0_35.dat
123451u_10_78.dat #10 is mod(266,256)
135792u_172_46.dat #172 is mod(6572,256)
246803u_43_35.dat #43 is mod(12587,256)
The entries in 1stcheck.txt and 2ndcheck.txt are just the fields of base.txt rearranged according to some template/format.
I'd like to have
report.txt
12345 6 78 passed passed
13579 2 46 failed passed
24680 1 35 passed failed
123451 266 78 passed passed
135792 6572 46 passed passed
246803 12587 35 passed passed
where the first status column says whether the 1stcheck.txt form (12345/6/78) was found and the second whether the 2ndcheck.txt form (12345u_6_78) was found.
Please take performance into account, since
base.txt and 2ndcheck.txt are ~ 8MB-12MB each
1stcheck.txt is ~ 70MB
Many thanks

You'll have to decide whether this is memory-efficient enough: it has to store data from all three files in arrays before printing the table.
This requires GNU awk (for the three-argument match() and PROCINFO["sorted_in"]).
gawk '
# base file: store keys (and line numbers for output ordering)
FILENAME == ARGV[1] {key[$0] = FNR; next}
# 1st check: if key appears in base, store result as pass
FILENAME == ARGV[2] {
k = $2
gsub(/\//, " ", k)
if (k in key) pass1[k] = 1
}
# 2nd check: if key appears in base, store result as pass
FILENAME == ARGV[3] {
if ( match($0, /([0-9]+)._([0-9]+)_([0-9]+)\.dat/, m) ) {
k = m[1] " " m[2] " " m[3]
if (k in key) pass2[k] = 1
}
next
}
# print the result table
END {
PROCINFO["sorted_in"] = "@val_num_asc" # traverse array by line number
for (k in key) {
printf "%s\t%s\t%s\n", k \
, (k in pass1 ? "passed" : "failed") \
, (k in pass2 ? "passed" : "failed")
}
}
' base.txt 1stcheck.txt 2ndcheck.txt
12345 6 78 passed passed
13579 2 46 failed passed
24680 1 35 passed failed

Based on @glenn jackman's suggestion, I was able to solve my problem:
gawk '
# Store key for 1st check
FILENAME == ARGV[1] {
k = $2
gsub(/\//, " ", k)
key_first[k];next
}
# Store key for 2nd check
FILENAME == ARGV[2] {
if ( match($0, /([0-9]+)._([0-9]+)_([0-9]+)\.dat/, m) ) {
k = m[1] " " m[2] " " m[3]
key_second[k];
}
next
}
# base file: do check on both 1st and 2nd check
FILENAME == ARGV[3] {
if($2>256) {
first=$1 " " $2 "_" ($2%256) " " $3
}
else {
first=$1 " " $2 " " $3
}
second=$1 " " $2%256 " " $3
if (first in key_first) pass1[$0] = 1
if (second in key_second) pass2[$0] = 1
key[$0]= FNR; next
}
# print the result table
END {
PROCINFO["sorted_in"] = "@val_num_asc" # traverse array by line number
for (k in key) {
printf "%s\t%s\t%s\n", k \
, (k in pass1 ? "sic_passed" : "sic_failed") \
, (k in pass2 ? "gd_passed" : "gd_failed")
}
}
' 1stcheck.txt 2ndcheck.txt base.txt
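The key construction in the base-file block can be checked on its own before running it against the large files. The following is only a minimal sketch, not part of the answer above: expected_keys is a made-up helper name, and it prints, for each base.txt line, the two strings that line is expected to appear as in 1stcheck.txt and 2ndcheck.txt. It copies the answer's $2 > 256 test as is and uses only POSIX awk features.

```shell
# expected_keys: hypothetical helper, not from the original answer.
# Input:  lines in base.txt format ("12345 6 78").
# Output: the 1stcheck.txt form and the 2ndcheck.txt form of each line.
expected_keys() {
  awk '{
    m = $2 % 256                                  # e.g. 6572 % 256 = 172
    if ($2 > 256)                                 # same test as the answer
      first = $1 "/" $2 "_" m "/" $3              # 135792/6572_172/46
    else
      first = $1 "/" $2 "/" $3                    # 12345/6/78
    second = $1 "u_" m "_" $3 ".dat"              # 12345u_6_78.dat
    print first, second
  }' "$@"
}

printf '12345 6 78\n135792 6572 46\n' | expected_keys
```

Comparing this output by eye with a few lines of the check files is a cheap way to confirm the template before building the lookup arrays.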

Related

How can I reverse print the characters of a string in each cell using AWK?

Beth 45 4.00 0 0 .072
Danny 33 3.75 ^0 0 .089
The above is the file I want to process.
I want to write an AWK script that reverses the characters of the string in every cell.
Here is the code:
BEGIN { OFS = "\t\t" }
function reverse_print(str)
{
s = "";
N = length(str);
for (i = 1; i <= N; i++)
a[i] = substr(str, i, 1);
for (i = N; i >= 1; i--)
s = s a[i];
return s;
}
{
for (i = 1; i <= NF; i++)
$i = reverse_print($i) ;
print;
}
END {}
However, it does not work: the program hangs.
I have found that if I drop the loop and handle each field one by one, as follows,
BEGIN { OFS = "\t\t" }
function reverse_print(str)
{
s = "";
N = length(str);
for (i = 1; i <= N; i++)
a[i] = substr(str, i, 1);
for (i = N; i >= 1; i--)
s = s a[i];
return s;
}
{
$1 = reverse_print($1) ;
$2 = reverse_print($2) ;
$3 = reverse_print($3) ;
$4 = reverse_print($4) ;
$5 = reverse_print($5) ;
$6 = reverse_print($6) ;
print;
}
END {}
it works.
Here is my desired output:
hteB 54 00.4 0 0 270.
ynnaD 33 57.3 0^ 0 980.
I have thought hard but still cannot figure out what I did wrong with the loop.
Can anyone tell me why?

You're using the same variable i inside and outside the function. awk variables are global unless they appear in the function's parameter list, so the second loop in reverse_print() leaves i at 0, the calling loop's i++ sets it back to 1, and the loop over the fields never terminates. Use a different variable in one of the two places, or change the function definition to reverse_print(str, i) so that the i used within the function is local to it rather than the global variable used by the calling code.
You should also make s and N local to the function:
function reverse_print(str, i, s, N)
but in fact the code should be written as:
$ cat tst.awk
BEGIN { OFS = "\t\t" }
function reverse_print(fwd, rev, i, n)
{
n = length(fwd)
for (i = n; i >= 1; i--)
rev = rev substr(fwd, i, 1);
return rev
}
{
for (i = 1; i <= NF; i++)
$i = reverse_print($i)
print
}
$ awk -f tst.awk file
hteB 54 00.4 0 0 270.
ynnaD 33 57.3 0^ 0 980.
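To see the scoping rule in isolation, here is a small sketch. The names demo_scope, f_global and f_local are made up for this demonstration and do not come from the question. Both functions run the same loop; the only difference is whether i is listed as an extra parameter.

```shell
# demo_scope: hypothetical wrapper so the awk program runs as one command.
demo_scope() {
  awk '
    # i is not in the parameter list, so this loop overwrites the global i
    function f_global(n,    t) { for (i = 1; i <= n; i++) t += i; return t }
    # i is an extra parameter, so it is local to the function
    function f_local(n,    i, t) { for (i = 1; i <= n; i++) t += i; return t }
    BEGIN {
      i = 10; f_global(3); print "after f_global: i = " i
      i = 10; f_local(3);  print "after f_local:  i = " i
    }'
}

demo_scope
```

After f_global(3) the caller's i has been changed to 4, while after f_local(3) it is still 10. The extra spaces before the local names in the parameter list are only a convention to separate real arguments from locals.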

Please try the following. (This program was tested on GNU awk only; as Ed's comment also notes, splitting a string with an empty separator is undefined behavior in POSIX awk.)
awk '
BEGIN{
OFS="\t\t"
}
{
for(i=1;i<=NF;i++){
num=split($i,array,"")
for(j=num;j>0;j--){
val=(j<num?val:"") array[j]
}
printf "%s%s",val,(i<NF?OFS:ORS)}
val=""
}' Input_file

Linux has a rev command: rev - reverse lines characterwise.
You can reverse a field by piping it to rev from awk and reading the result back with getline (gensub() below is a GNU awk extension):
#reverse-fields.awk
{
for (i = 1; i <= NF; i = i + 1) {
# command line
cmd = "echo '" $i "' | rev"
# read output into revfield
cmd | getline revfield
# remove leading new line
a = gensub(/^[\n\r]+/, "", "1", revfield)
# print reversed field
printf("%s", a)
# print tab
if (i != NF) printf("\t")
# close command
close(cmd)
}
# print new line
print ""
}
$ awk -f reverse-fields.awk emp.data
0 00.4 hteB
0 57.3 naD
01 00.4 yhtaK
02 00.5 kraM
22 05.5 yraM
81 52.4 eisuS

Average of multiple files without considering missing values

I want to calculate the average of 15 files: ifile1.txt, ifile2.txt, ..., ifile15.txt. Every file has the same number of rows and columns, but some of the values are missing. Part of the data looks like this:
ifile1.txt    ifile2.txt    ifile3.txt
3 ? ? ? .     1 2 1 3 .     4 ? ? ? .
1 ? ? ? .     1 ? ? ? .     5 ? ? ? .
4 6 5 2 .     2 5 5 1 .     3 4 3 1 .
5 5 7 1 .     0 0 1 1 .     4 3 4 0 .
. . . . .     . . . . .     . . . . .
I would like to produce a new file that shows the average of these 15 files, ignoring the missing values.
ofile.txt
2.66 2 1 3 . (i.e. average of 3 1 4, average of ? 2 ? and so on)
2.33 ? ? ? .
3 5 4.33 1.33 .
3 2.67 4 0.66 .
. . . . .
This question is similar to my earlier question, "Average of multiple files in shell", where the script was
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++) sum[FNR,i] += $i
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
printf " %f", sum[line,col]/nfiles;
printf "\n"
}
}' ifile*.txt
But I am not able to modify it for this case.

Use this:
paste ifile*.txt | awk '{n=f=0; for(i=1;i<=NF;i++){if($i*1){f++;n+=$i}}; print n/f}'
paste prints the files side by side, joining corresponding lines.
awk then computes one average per line, across all the pasted columns:
n=f=0; resets both variables to 0.
for(i=1;i<=NF;i++) loops through all the fields.
if($i*1) is true when the field is a non-zero number; "?" and "." evaluate to 0 and are skipped, but so is a genuine 0.
f++;n+=$i increments f (the count of numeric fields) and adds the field to the sum n.
print n/f prints the average.
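Because the if($i*1) test also drops fields that are exactly 0, the sample rows containing 0 would be averaged over too few values. One possible variation, sketched here under the assumption that values are plain decimal numbers, tests each field against a regular expression instead. The name row_average is made up for illustration; like the one-liner above, it still produces one average per input line.

```shell
# row_average: hypothetical helper; averages the numeric fields on each line.
row_average() {
  awk '{
    n = f = 0
    for (i = 1; i <= NF; i++)
      if ($i ~ /^-?[0-9]+(\.[0-9]+)?$/) {   # a number, including 0
        f++
        n += $i
      }
    print (f ? n / f : "?")                 # no numeric field: print ?
  }' "$@"
}

printf '0 ? 4 .\n? ? ? .\n' | row_average
```

The guard on f also avoids a division by zero on lines that contain no numbers at all.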

awk '
{
for (i = 1;i <= NF;i++) {
Sum[FNR,i]+=$i
Count[FNR,i]+=$i!="?"
}
}
END {
for( i = 1; i <= FNR; i++){
for( j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
print ""
}
}
' ifile*
This assumes the files are well formed (no trailing empty lines, ...).

awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++)
if ( $i != "?" ) { sum[FNR,i] += $i ; count[FNR,i]++ ;}
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
if ( count[line,col] > 0 ) printf " %f", sum[line,col]/count[line,col];
else printf " ? " ;
printf "\n" ;
}
}' ifile*.txt
The only change is the check for '?'.

How to do something like this in linux shell script

I have a shell script that produces some output. Wherever a value in the output is blank, I want to print the string Unknown instead.
I want to check whether f[1], f[2] or f[3] is empty; any empty one should be replaced by an "Unknown" string, and the result should be written to a single file:
if(f[1] == "") printf(fmt, id, f[1], f[2], f[3]) > file
Here f[1] is the first element of the array and fmt is the format string I have defined in the code. How can I replace these empty values with a string?
Any lead is appreciated.
Thanks

Use the conditional operator:
ec2-describe-instances | awk -F'\t' -v of="$out" -v mof="$file" '
function pr() { # Print accumulated data
if(id != "") { # Skip if we do not have any unprinted data.
printf(fmt, id, f[1], f[2], f[3]) > of
if (f[1] == "" || f[2] == "" || f[3] == "") {
printf(fmt, id, f[1]==""?"Unknown":f[1], f[2]==""?"Unknown":f[2], f[3]==""?"Unknown":f[3]) > mof
}
}
# Clear accumulated data.
id = f[1] = f[2] = f[3] = ""
}
BEGIN { # Set the printf() format string for the header and the data lines.
fmt = "%-20s %-40s %-33s %s\n"
# Print the header
headerText="Instance Details"
headerMaxLen=100
padding=(length(headerText) - headerMaxLen) / 2
printf("%" padding "s" "%s" "%" padding "s" "\n\n\n", "", headerText, "") > of
printf(fmt, "Instance id", "Name", "Owner", "Cost.centre") > of
printf("%" padding "s" "%s" "%" padding "s" "\n\n\n", "", headerText, "") > mof
printf(fmt, "Instance id", "Name", "Owner", "Cost.centre") > mof
}
$1 == "TAG" {
# Save the Instance ID.
id = $3
if($4 ~ /[Nn]ame/) fs = 1 # Name found
else if($4 ~ /[Oo]wner/) fs = 2 # Owner found
else if($4 ~ /[Cc]ost.[Cc]ent[er][er]/) fs = 3 # Cost center found
else next # Ignore other TAGs
f[fs] = $5 # Save data for this field.
}
$1 == "RESERVATION" {
# First line of new entry found; print results from previous entry.
pr()
}
END { # EOF found, print results from last entry.
pr()
}'
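The conditional operator used in pr() can be tried on its own. This is only a minimal sketch with made-up sample input (the instance id i-123 and the helper name fill_unknown are assumptions, not part of the script above): every empty tab-separated field is replaced by Unknown.

```shell
# fill_unknown: hypothetical helper; replaces empty tab-separated fields.
fill_unknown() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    for (i = 1; i <= NF; i++)
      $i = ($i == "" ? "Unknown" : $i)   # conditional operator per field
    print
  }' "$@"
}

# Made-up sample line: id, empty name, owner, empty cost centre.
printf 'i-123\t\tbob\t\n' | fill_unknown
```

With a single tab as the field separator, awk keeps empty fields (including a trailing one), which is what makes the per-field test possible.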

Finding averages from reading a file using Bash Scripting

I am trying to write a bash script that reads a file 'names.txt' and computes the average of people's grades. For instance, names.txt looks something like this:
900706845 Harry Thompson 70 80 90
900897665 Roy Ludson 90 90 90
The script should read the line, print out the ID# of the person, the average of the three test scores and the corresponding letter grade. So the output needs to look like this
900706845 80 B
900897665 90 A
Here's what I have:
#!/bin/bash
cat names.txt | while read x
do
$SUM=0; for i in 'names.txt'; do SUM=$(($SUM + $i));
done;
echo $SUM/3
done
I understand that the echo will only print the averages at this point, but I am trying to at least get it to compute the averages before I attempt the other parts. Baby steps!

Like this maybe:
#!/bin/bash
while read a name1 name2 g1 g2 g3
do
avg=$(echo "($g1+$g2+$g3)/3" | bc)
echo $a $name1 $name2 $avg
done < names.txt
Output:
900706845 Harry Thompson 80
900897665 Roy Ludson 90

Customize gradeLetter for your own needs:
#!/bin/sh
awk '
function gradeLetter(g)
{
if (g >= 90) return "A";
if (g >= 80) return "B";
if (g >= 70) return "C";
if (g >= 60) return "D";
return "E"
}
{
avgGrade = ($(NF) + $(NF - 1) + $(NF - 2)) / 3;
print $1, avgGrade, gradeLetter(avgGrade)
}' names.txt

With an awk one-liner:
awk '{ AVG = int( ( $(NF-2) + $(NF-1) + $(NF) ) / 3 ) ; if ( AVG >= 90 ) { GRADE = "A" } else if ( AVG >= 80 ) { GRADE = "B" } else if ( AVG >= 70 ) { GRADE = "C" } else if ( AVG >= 60 ) { GRADE = "D" } else { GRADE = "F" } ; print $1, AVG, GRADE }' file
Let's look at the details:
awk '{
# Calculate average
AVG = int( ( $(NF-2) + $(NF-1) + $(NF) ) / 3 )
# Calculate grade
if ( AVG >= 90 ) { GRADE = "A" }
else if ( AVG >= 80 ) { GRADE = "B" }
else if ( AVG >= 70 ) { GRADE = "C" }
else if ( AVG >= 60 ) { GRADE = "D" }
else { GRADE = "F" }
print $1, AVG, GRADE
}' file

The ID#s and averages can be obtained as follows:
$ awk '{sum=0; for(i=4;i<=NF;i++) sum+=$i ; print $1, sum/(NF-3)}' names.txt
900706845 80
900897665 90
Guessing at how to compute grades, one can do:
$ awk '{sum=0; for(i=4;i<=NF;i++) sum+=$i ; ave=sum/(NF-3); print $1, ave, substr("FFFFFDCBA", ave/10, 1) }' names.txt
900706845 80 B
900897665 90 A
The above solutions work for any number of tests but names are limited to 2 words. If there will always be three tests but names can be any number of words, then use:
$ awk '{ave=($(NF-2)+$(NF-1)+$NF)/3; print $1, ave, substr("FFFFFDCBA", ave/10, 1) }' names.txt
900706845 80 B
900897665 90 A
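The lookup passes ave/10 to substr() as a character position, so the result for a fractional average such as 86.67 may depend on how the awk in use converts 8.667 to a position, and an average of exactly 100 asks for a tenth character that the string does not have. A hedged variation, with the made-up helper name grade_report and the same guessed grade boundaries, makes the truncation explicit and extends the string by one character:

```shell
# grade_report: hypothetical helper; expects "ID name... s1 s2 s3" lines.
grade_report() {
  awk '{
    ave = ($(NF-2) + $(NF-1) + $NF) / 3
    idx = int(ave / 10)              # explicit truncation: 86.67 gives 8
    if (idx < 1) idx = 1             # averages below 10 still map to F
    print $1, ave, substr("FFFFFDCBAA", idx, 1)   # 10th char covers 100
  }' "$@"
}

printf '900706845 Harry Thompson 70 80 90\n900897665 Roy Ludson 90 90 90\n' | grade_report
```

For the two sample lines this prints the same IDs, averages and letters as the commands above; the differences only show up at the edges of the scale.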

Fast Way to Find Difference between Two Strings of Equal Length in Perl

Given pairs of string like this.
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";
# Note the string can be longer than this.
I would like to find the position and character in $s1 where it differs from $s2.
In this case the answer would be:
#String Position 0-based
# First col = Base in S1
# Second col = Base in S2
# Third col = Position in S1 where they differ
C G 1
G - 4
I can achieve that easily with substr(). But it is horribly slow.
Typically I need to compare millions of such pairs.
Is there a fast way to achieve that?

Stringwise ^ is your friend:
use strict;
use warnings;
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";
my $mask = $s1 ^ $s2;
while ($mask =~ /[^\0]/g) {
print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
}
EXPLANATION:
The ^ (exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:
"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
"  "
The useful feature of this here is that a nul character ("\0") occurs when and only when the two strings have the same character at a given position. So ^ can be used to efficiently compare every character of the two strings in one quick operation, and the result can be searched for non-nul characters (indicating a difference). The search can be repeated using the /g regex flag in scalar context, and the position of each character difference found using $-[0], which gives the offset of the beginning of the last successful match.

Use binary bit ops on the complete strings.
Things like $s1 & $s2 or $s1 ^ $s2 run incredibly fast, and work with strings of arbitrary length.

I was bored on Thanksgiving break 2012, so I answered the question and more. The script works on strings of equal length, and it also works if they are not. I added help and option handling just for fun. I thought someone might find it useful.
If you are new to Perl and don't know: do not add any code below the __DATA__ line in the script.
Have fun.
./diftxt -h
usage: diftxt [-v ] string1 string2
-v = Verbose
diftxt [-V|--version]
diftxt [-h|--help] "This help!"
Examples: diftxt test text
diftxt "This is a test" "this is real"
Place Holders: space = "·" , no charater = "ζ"
cat ./diftxt
----------- cut ✂----------
#!/usr/bin/perl -w
use strict;
use warnings;
use Getopt::Std;
my %options=();
getopts("Vhv", \%options);
my $helptxt='
usage: diftxt [-v ] string1 string2
-v = Verbose
diftxt [-V|--version]
diftxt [-h|--help] "This help!"
Examples: diftxt test text
diftxt "This is a test" "this is real"
Place Holders: space = "·" , no charater = "ζ"';
my $Version = "inital-release 1.0 - Quincey Craig 11/21/2012";
print "$helptxt\n\n" if defined $options{h};
print "$Version\n" if defined $options{V};
if (@ARGV == 0 ) {
if (not defined $options{h}) {usage()};
exit;
}
my $s1 = "$ARGV[0]";
my $s2 = "$ARGV[1]";
my $mask = $s1 ^ $s2;
# setup unicode output to STDOUT
binmode DATA, ":utf8";
my $ustring = <DATA>;
binmode STDOUT, ":utf8";
my $_DIFF = '';
my $_CHAR1 = '';
my $_CHAR2 = '';
sub usage
{
print "\n";
print "usage: diftxt [-v ] string1 string2\n";
print " -v = Verbose \n";
print " diftxt [-V|--version]\n";
print " diftxt [-h|--help]\n\n";
exit;
}
sub main
{
print "\nOrig\tDiff\tPos\n----\t----\t----\n" if defined $options{v};
while ($mask =~ /[^\0]/g) {
### redirect stderr to allow for test of empty variable with error message from substr
open STDERR, '>/dev/null';
if (substr($s2,$-[0],1) eq "") {$_CHAR2 = "\x{03B6}";close STDERR;} else {$_CHAR2 = substr($s2,$-[0],1)};
if (substr($s2,$-[0],1) eq " ") {$_CHAR2 = "\x{00B7}"};
$_CHAR1 = substr($s1,$-[0],1);
if ($_CHAR1 eq "") {$_CHAR1 = "\x{03B6}"} else {$_CHAR1 = substr($s1,$-[0],1)};
if ($_CHAR1 eq " ") {$_CHAR1 = "\x{00B7}"};
### Print verbose Data
print $_CHAR1, "\t", $_CHAR2, "\t", $+[0], "\n" if defined $options{v};
### Build difference list
$_DIFF = "$_DIFF$_CHAR2";
### Build mask
substr($s1,"$-[0]",1) = "\x{00B7}";
} ### end loop
print "\n" if defined $options{v};
print "$_DIFF, ";
print "Mask: \"$s1\"\n";
} ### end main
if ($#ARGV == 1) {main()};
__DATA__

This is the easiest form you can get
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";
my @s1 = split //,$s1;
my @s2 = split //,$s2;
my $i = 0;
foreach (@s1) {
if ($_ ne $s2[$i]) {
print "$_, $s2[$i] $i\n";
}
$i++;
}
