Permutation columns without repetition - linux

Can anybody give me some piece of code or algorithm or something else to solve the following problem?
I have several files, each with a different number of columns, like:
$> cat file-1
1 2
$> cat file-2
1 2 3
$> cat file-3
1 2 3 4
For each row, I would like to take the absolute difference of every pair of columns (each pair only once, i.e. combinations without repetition) and divide it by the sum of all values in the row:
in file-1 case I need to get:
0.3333 # because |1-2|/(1+2)
in file-2 case I need to get:
0.1666 0.1666 0.3333 # because |1-2|/(1+2+3) and |2-3|/(1+2+3) and |1-3|/(1+2+3)
in file-3 case I need to get:
0.1 0.2 0.3 0.1 0.2 0.1 # because |1-2|/(1+2+3+4) and |1-3|/(1+2+3+4) and |1-4|/(1+2+3+4) and |2-3|/(1+2+3+4) and |2-4|/(1+2+3+4) and |3-4|/(1+2+3+4)
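(In general a row of n columns yields n(n-1)/2 such pairs, so file-2 gives 3 values and file-3 gives 4*3/2 = 6, matching the six results above.)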

This should work, though I am guessing you have made a minor mistake in your expected output. Based on the pattern of your third example, the second example should read as follows.
Instead of:
in file-2 case I need to get:
0.1666 0.1666 0.3333 # because |1-2|/(1+2+3) and |2-3|/(1+2+3) and |1-3|/(1+2+3)
It should be:
in file-2 case I need to get:
0.1666 0.3333 0.1666 # because |1-2|/(1+2+3) and |1-3|/(1+2+3) and |2-3|/(1+2+3)
Here is the awk one liner:
awk '
NF {
    a = 0
    for (i = 1; i <= NF; i++)       # sum all values in the row
        a += $i
    for (j = 1; j <= NF; j++) {     # every pair ($j, $(k+1)) with k >= j
        for (k = j; k < NF; k++)
            printf("%s ", -($j - $(k+1)) / a)
    }
    print ""
    next
}1' file
Short version:
awk '
NF{for (i=1;i<=NF;i++) a+=$i;
for (j=1;j<=NF;j++){for (k=j;k<NF;k++) printf("%2.4f ",-($j-$(k+1))/a)}
print "";a=0;next;}1' file
Input File:
[jaypal:~/Temp] cat file
1 2
1 2 3
1 2 3 4
Test:
[jaypal:~/Temp] awk '
NF{
a=0;
for(i=1;i<=NF;i++)
a+=$i;
for(j=1;j<=NF;j++)
{
for(k=j;k<NF;k++)
printf("%s ",-($j-$(k+1))/a)
}
print "";
next;
}1' file
0.333333
0.166667 0.333333 0.166667
0.1 0.2 0.3 0.1 0.2 0.1
Test of the shorter version:
[jaypal:~/Temp] awk '
NF{for (i=1;i<=NF;i++) a+=$i;
for (j=1;j<=NF;j++){for (k=j;k<NF;k++) printf("%2.4f ",-($j-$(k+1))/a)}
print "";a=0;next;}1' file
0.3333
0.1667 0.3333 0.1667
0.1000 0.2000 0.3000 0.1000 0.2000 0.1000

@Jaypal just beat me to it! Here's what I had:
awk '{sum=0; for (x=1;x<=NF;x++) sum += $x; for (i=1;i<=NF;i++) for (j=2;j<=NF;j++) if (i < j) printf ("%.1f ",-($i-$j)/sum)} END {print ""}' file.txt
Output:
0.1 0.2 0.3 0.1 0.2 0.1
prints to one decimal place.
@Jaypal, is there a quick way to printf an absolute value? Perhaps something like abs(value)?
EDIT:
@Jaypal, yes I've tried searching too and couldn't find anything simple :-( It seems if ($i < 0) $i = -$i is the way to go. I guess you could use sed to remove any minus signs:
awk '{sum=0; for (x=1;x<=NF;x++) sum += $x; for (i=1;i<=NF;i++) for (j=2;j<=NF;j++) if (i < j) printf ("%.1f ", ($i-$j)/sum)} {print ""}' file.txt | sed "s%-%%g"
Cheers!
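For the record, awk has no built-in abs(), but a small user-defined function avoids the sed pass entirely. A minimal sketch of the same pairwise computation with an explicit abs():
awk 'function abs(v) { return v < 0 ? -v : v }
{
    sum = 0
    for (x = 1; x <= NF; x++) sum += $x           # row total
    for (i = 1; i < NF; i++)                      # each pair once
        for (j = i + 1; j <= NF; j++)
            printf "%.4f ", abs($i - $j) / sum
    print ""
}' file.txt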

As it looks like a homework, I will act accordingly.
To find how many numbers are present in the file, you can use:
cat filename | wc -w
Find the first_number by:
cat filename | cut -d " " -f 1
To find the sum in a file:
cat filename | tr " " "+" | bc
Now that you have the total_nos, use something like:
for i in $(seq 1 1 $total_nos)
do
#Find the numerator by first_number - $i
#Use the sum you got from above to get the desired value.
done
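For what it's worth, here is a rough sketch of how those hints might compose. It only pairs the first number with each of the others and assumes a single-line file, in keeping with the hints above; the tr -d '-' at the end is one crude way to drop the sign:
total_nos=$(wc -w < filename)
sum=$(tr " " "+" < filename | bc)
first_number=$(cut -d " " -f 1 filename)
for i in $(seq 2 "$total_nos"); do
    num=$(cut -d " " -f "$i" filename)
    # numerator: first_number - i-th number, divided by the row sum
    echo "scale=4; ($first_number - $num) / $sum" | bc | tr -d '-'
done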

Related

I want to find some strings/words from column 1 and 2 in file1 that match column 1 in file2 and replace with column 2 strings/words in file2

I'm still learning to code on the Linux platform. I have searched for problems similar to mine, but the ones I found were either too specific or focused only on changing the entire column 1.
Here are example of my files:
File 1
abc Gamma 3.44
bcd abc 5.77
abc Alpha 1.99
beta abc 0.88
bcd Alpha 5.66
File 2
Gamma Bacteria
Alpha Bacteria
Beta Bacteria
Output file3
abc Bacteria 3.44
bcd abc 5.77
abc Bacteria 1.99
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried:
awk:
$ awk 'FNR==NR{a[$1]=$2;next} {if ($1,$2 in a){$1,$2=a[$1,$2]}; print $0}' file2 file1
$ awk 'NR==FNR {a[FNR]=$0; next} /$1|$2/ {$1 $2=a[FNR]} 1' file2 file1
They gave me:
abc Gamma 3.44
abc 5.77
abc Alpha 1.99
Bacteria abc 0.88
bcd Alpha 5.66
They only changed $1 and removed the other text strings in column 1 which are not found in file2's columns.
And this one:
$ awk -F'\t' -v OFS='\t' 'FNR==1 { next }FNR == NR { file2[$1,$2] = $1 FS $2 } FNR != NR { file1[$1,$2,] = $1 FS $2} END { print "Match:"; for (k in file1) if (k in file1) print file2[k] # Or file1[k]}' file2 file1
Didn't work
Then after i tried sed:
$ sed = file2 | sed -r 'N;s/(.*)\n(.*)/\1s|\&$|\2|/' | sed -f - file1
This gave me an error complaining that sed -e was not called properly.
Then, afterwards, I need to keep only the row with the smallest $3 when $1 and $2 (in either order) are the same:
file 4
bcd abc 5.77
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried this code:
$ awk 'NR == $1&$2 || $3 < min {line = $0; min = $3}END{print line}' file3
$ awk '/^$1/{if(h){print h RS m}min=""; h=$0; next}min=="" || $3 < min{min=$3; m=$0}END{print h RS m}' file3
$ awk -F'\t' '$3 != "NF==min"' OFS='\t' file3
$ awk -v a=NODE '{c=a*$3+(1-a)} !($1 in min) || c<min[$1]{min[$1]=c; minLine[$1]=$0} END{for(k in minLine) print minLine[k]}' file3 | column -t
None of these worked. I tried to research what each line means and changed it to fit my problem, but they all failed.
This might work for you (GNU sed):
sed -E 's#(.*) (.*)#/^\1 /Is/\\S+/\2/;/^\\S+ \1 /Is/\\S+/\2/2#' file2 |
sed -Ef - file1
Generate a sed script from file2 which is run against file1 to produce the required format.
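If the generated sed script feels opaque, an awk equivalent of the same idea might look like the sketch below; tolower() is used on the assumption that the beta/Beta match should be case-insensitive, as the /I flags in the sed script imply:
awk 'NR==FNR { repl[tolower($1)] = $2; next }   # build lookup from file2
{
    if (tolower($1) in repl) $1 = repl[tolower($1)]
    if (tolower($2) in repl) $2 = repl[tolower($2)]
    print
}' file2 file1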

awk sum of selected values in column

I want to sum selected values in a column with awk. The second column is a time; I want to add up the 4th-column values within each second.
Input:
1 0.1 2 1 3
2 0.3 2 2 3
4 0.6 2 3 3
2 1.1 2 4 3
5 1.3 2 5 3
6 2.2 2 6 3
7 2.7 2 7 3
8 3.6 2 8 3
9 3.9 2 1 3
10 4.1 2 1 3
Expected output (we have 5 seconds):
6
9
13
9
1
EDIT:
Here is my code, but I have no idea how to make it work dynamically:
awk '$2>x && $2<=y {sum+=$4} END {print sum}' filename
where x is the start time and y the end time. It works only for static values, meaning I can currently obtain the result for only one selected second.
Try the following awk program:
BEGIN {
    total = 0
    secondEnd = 1              # upper bound of the current one-second window
}
{
    if ($2 < secondEnd) {      # still inside the current second
        total += $4
        next
    }
    while ($2 > secondEnd) {   # emit finished seconds, including empty ones
        print(total)
        total = 0
        secondEnd++
    }
    total = $4                 # first record of the new second
}
END {
    print(total)               # flush the last second
}
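The above is just the awk program body; to run it, save it to a file (say sumpersec.awk, a name chosen here for illustration) and invoke it as:
awk -f sumpersec.awk input.txt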
EDIT: As per the OP's request, adding code which will accept any field number provided to it as an awk variable. (Note that -F"[ .]" splits the time into two fields, so the original 4th column is $(NF-1) here.)
awk -v col1="2" -F"[ .]" '$col1 == prev+1{print sum;sum=prev=""} {sum+=$(NF-1);prev=$col1} END{if(prev && sum){print sum}}' Input_file
OR(a non-one liner form of solution here)
awk -v col1="2" -F"[ .]" '
$col1 == prev+1{
print sum;
sum=prev=""
}
{
sum+=$(NF-1);
prev=$col1
}
END{
if(prev && sum){
print sum}
}' Input_file
If you are passing a bash variable to the awk variable, then do the following.
column=2 ##Shell variable
awk -v col1="$column" -F"[ .]" '$col1 == prev+1{print sum;sum=prev=""} {sum+=$(NF-1);prev=$col1} END{if(prev && sum){print sum}}' Input_file
Could you please try the following and let me know if it helps (considering that your actual Input_file is the same as the sample shown here).
awk -F"[ .]" '$2 == prev+1{print sum;sum=prev=""} {sum+=$(NF-1);prev=$2} END{if(prev && sum){print sum}}' Input_file
Adding a non-one liner form of solution too now.
awk -F"[ .]" '
$2 == prev+1{
print sum;
sum=prev=""
}
{
sum+=$(NF-1);
prev=$2
}
END{
if(prev && sum){
print sum}
}' Input_file

How can I print the upper triangle of a matrix

Using an awk command, I tried to print the upper triangle of a matrix:
awk '{for (i=1;i<=NF;i++) if (i>=NR) printf $i FS "\n"}' matrix
but the output loses the matrix layout (each selected value ends up on its own line, since the field value itself is used as the printf format string).
Consider this sample matrix:
$ cat matrix
1 2 3
4 5 6
7 8 9
To print the upper-right triangle:
$ awk '{for (i=1;i<=NF;i++) printf "%s%s",(i>=NR)?$i:" ",FS; print""}' matrix
1 2 3
  5 6
    9
Or:
$ awk '{for (i=1;i<=NF;i++) printf "%2s",(i>=NR)?$i:" "; print""}' matrix
 1 2 3
   5 6
     9
To print the upper-left triangle:
$ awk '{for (i=1;i<=NF+1-NR;i++) printf "%s%s",$i,FS; print""}' matrix
1 2 3
4 5
7
Or:
$ awk '{for (i=1;i<=NF+1-NR;i++) printf "%2s",$i; print""}' matrix
1 2 3
4 5
7
This might work for you (GNU sed):
sed -r ':a;n;H;G;s/\n//;:b;s/^\S+\s*(.*)\n.*/\1/;tb;$!ba' file
Use the hold space as a counter of the lines processed so far; for each current line, remove that many fields from the front of the line.
N.B. The counter is incremented after the current line has been printed, otherwise the first line would lose its first field.
On reflection an alternative/more elegant solution is:
sed -r '1!G;h;:a;s/^\S+\s*(.*)\n.*/\1/;ta' file
And to print the upper-left triangle:
sed -r '1!G;h;:a;s/^([^\n]*)\S+[^\n]*(.*)\n.*/\1\2/;ta' file
$ awk '{for (i=NR;i<=NF;i++) printf "%s%s",$i,(i<NF?FS:RS)}' file
1 2 3
5 6
9
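This last variant simply starts each row's loop at NR, printing fields NR through NF, so each successive row drops one more leading field; it is the shortest form if you do not need the columns padded into place.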

Bash - finding minimum number per line

I am trying to get more familiar with awk statements, especially ones that can be done with just one line. I have a file that looks like this
9 5 0 2
8 7 4 3
4 8 2 1
I want the output to look like
0
3
1
Is there a way I can do this with just a one liner using awk? Thank you.
Using awk:
awk '{min=$1; for (i=2; i<=NF; i++) if ($i < min) min=$i; print min}' file
0
3
1
There are languages with a built-in "min" function:
ruby -ane 'puts $F.min' file
Or available libraries
perl -MList::Util=min -lane 'print min @F' file
Limiting to shell:
min() { printf "%s\n" "$@" | sort -n | head -1; }
while read -a nums; do
    echo $(min "${nums[@]}")
done < file
GNU awk, which you'll find in most Linux distributions, has a built-in sort function, asort.
echo -e "9 5 0 2\n8 7 4 3\n4 8 2 1" |
awk '{ split($0,a); asort(a); print a[1]; }'
0
3
1
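Note that asort() is a GNU awk extension; with mawk or a BSD awk you would fall back on the explicit min loop shown at the top.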

Count occurrences of character per line/field on Unix

Given a file with data like this (i.e. the stores.dat file):
sid|storeNo|latitude|longitude
2tt|1|-28.0372000t0|153.42921670
9|2t|-33tt.85t09t0000|15t1.03274200
What command would return the number of occurrences of the 't' character per line?
e.g. it would return:
count lineNum
4 1
3 2
6 3
Also, what is the command to count occurrences per field and return the following results?
e.g. input of column 2 and character 't':
count lineNum
1 1
0 2
1 3
e.g. input of column 3 and character 't':
count lineNum
2 1
1 2
4 3
To count occurrences of a character per line you can do:
awk -F'|' 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"") "\t" NR}' file
count lineNum
4 1
3 2
6 3
To count occurrences of a character per field/column you can do:
column 2:
awk -F'|' -v fld=2 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"",$fld) "\t" NR}' file
count lineNum
1 1
0 2
1 3
column 3:
awk -F'|' -v fld=3 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"",$fld) "\t" NR}' file
count lineNum
2 1
1 2
4 3
The gsub() function's return value is the number of substitutions made, so we use that to print the count.
NR holds the line number, so we use it to print the line number.
To print occurrences for a particular field, we create a variable fld and put in it the field number we wish to extract counts from.
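You can watch the return value in isolation, for example:
$ echo "totter" | awk '{print gsub(/t/,"")}'
3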
grep -n -o "t" stores.dat | sort -n | uniq -c | cut -d : -f 1
gives almost exactly the output you want:
4 1
3 2
6 3
Thanks to @raghav-bhushan for the grep -o hint, what a useful flag. The -n flag includes the line number as well.
To count occurrences of a character per line:
$ awk -F 't' '{print NF-1, NR}' input.txt
4 1
3 2
6 3
This sets the field separator to the character that needs to be counted, then uses the fact that the number of fields is one greater than the number of separators.
To count occurrences in a particular column, cut out that column first:
$ cut -d '|' -f 2 input.txt | awk -F 't' '{print NF-1, NR}'
1 1
0 2
1 3
$ cut -d '|' -f 3 input.txt | awk -F 't' '{print NF-1, NR}'
2 1
1 2
4 3
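One caveat with the NF-1 trick: on an empty line awk sees NF == 0, so this would print -1 rather than 0; the gsub() approach above does not have that edge case.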
One possible solution using perl:
Content of script.pl:
use warnings;
use strict;

## Check arguments:
## 1.- Input file
## 2.- Char to search.
## 3.- (Optional) field to search. If blank, zero or bigger than number
##     of columns, default to searching the char in the whole line.
(@ARGV == 2 || @ARGV == 3) or die qq(Usage: perl $0 input-file char [column]\n);

my ($char, $column);

## Get values of arguments.
if ( @ARGV == 3 ) {
    ($char, $column) = splice @ARGV, -2;
} else {
    $char   = pop @ARGV;
    $column = 0;
}

## Check that $char is a single non-whitespace character and $column
## only accepts numbers.
die qq[Bad input\n] if $char !~ m/^\S$/ or $column !~ m/^\d+$/;

print qq[count\tlineNum\n];

while ( <> ) {
    ## Remove last '\n'.
    chomp;

    ## Get fields.
    my @f = split /\|/;

    ## If the column is a valid one, select it for the search.
    if ( $column > 0 and $column <= scalar @f ) {
        $_ = $f[ $column - 1 ];
    }

    ## Count.
    my $count = eval qq[tr/$char/$char/];

    ## Print result.
    printf qq[%d\t%d\n], $count, $.;
}
The script accepts three parameters:
Input file
Char to search
Column to search: if the column is not a valid number (zero or out of range), it searches the whole line.
Running the script without arguments:
perl script.pl
Usage: perl script.pl input-file char [column]
With arguments and its output:
Here 0 is out of range, so it searches the whole line.
perl script.pl stores.dat 't' 0
count lineNum
4 1
3 2
6 3
Here it searches in column 1.
perl script.pl stores.dat 't' 1
count lineNum
0 1
2 2
0 3
Here it searches in column 3.
perl script.pl stores.dat 't' 3
count lineNum
2 1
1 2
4 3
'th' is not a single char.
perl script.pl stores.dat 'th' 3
Bad input
No need for awk or perl; bash and the standard Unix utilities suffice:
cat file | tr -c -d "t\n" | cat -n |
{ echo "count lineNum"
while read num data; do
test ${#data} -gt 0 && printf "%4d %5d\n" ${#data} $num
done; }
And for a particular column:
cut -d "|" -f 2 file | tr -c -d "t\n" | cat -n |
{ echo -e "count lineNum"
while read num data; do
test ${#data} -gt 0 && printf "%4d %5d\n" ${#data} $num
done; }
And we can even avoid tr and the cats:
echo "count lineNum"
num=1
while read data; do
new_data=${data//t/}
count=$((${#data}-${#new_data}))
test $count -gt 0 && printf "%4d %5d\n" $count $num
num=$(($num+1))
done < file
and even the cut:
echo "count lineNum"
num=1; OLD_IFS=$IFS; IFS="|"
while read -a array_data; do
data=${array_data[1]}
new_data=${data//t/}
count=$((${#data}-${#new_data}))
test $count -gt 0 && printf "%4d %5d\n" $count $num
num=$(($num+1))
done < file
IFS=$OLD_IFS
awk '{gsub("[^t]",""); print length($0),NR;}' stores.dat
The call to gsub() deletes everything in the line that is not a t, then just print the length of what remains, and the current line number.
Want to do it just for column 2?
awk 'BEGIN{FS="|"} {gsub("[^t]","",$2); print NR,length($2);}' stores.dat
You could also split the line or field on "t" and use the length of the resulting array minus 1 as the count. Set the col variable to 0 for the whole line, or 1 through 3 for individual columns:
awk -F'|' -v col=0 -v OFS=$'\t' 'BEGIN {
print "count", "lineNum"
}{
split($col, a, "t"); print length(a) - 1, NR
}
' stores.dat
$ cat -n test.txt
1 test 1
2 you want
3 void
4 you don't want
5 ttttttttttt
6 t t t t t t
$ awk '{n=split($0,c,"t")-1;if (n!=0) print n,NR}' test.txt
2 1
1 2
2 4
11 5
6 6
cat stores.dat | awk 'BEGIN {FS = "|"}; {print $1}' | awk 'BEGIN {FS = "t"}; {print NF - 1}'
Where $1 would be whichever column you want to count; the second awk splits on the 't' character, so the count is the number of fields minus one.
perl -e 'while(<>) { $count = tr/t//; print "$count ".++$x."\n"; }' stores.dat
Another perl answer, yay! The tr/t// operator returns the number of times the translation matched on that line, in other words the number of times it found the character 't'. ++$x maintains the line number count.
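(As an aside, perl already tracks the current input line number in the special variable $., which could replace the hand-rolled ++$x counter.)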
