Insert a row and a column in a matrix using awk - linux

I have a gridded dataset with 250 rows x 300 columns in matrix form:
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values as the first column and the longitude values as the top row, so that the result looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33
I can do this manually for a small matrix, but I have no idea how to get the output in the desired format for the full dataset. I started writing the following script, but it does not work.
echo 20 > latitude.txt
for i in `seq 1 250`;do
i1=$(( i + 0.33 )) #bash can't recognize fractions
echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`;do
j1=$(( j + 0.33 ))
echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt

$ cat tst.awk
BEGIN {
    lat = 100
    lon = 20
    latWid = lonWid = 6
    latDel = lonDel = 0.33
    latFmt = lonFmt = "%*.2f"
}
NR==1 {
    printf "%*s", latWid, ""
    for (i=1; i<=NF; i++) {
        printf lonFmt, lonWid, lon
        lon += lonDel
    }
    print ""
}
{
    printf latFmt, latWid, lat
    lat += latDel
    for (i=1; i<=NF; i++) {
        printf "%*s", lonWid, $i
    }
    print ""
}
$ awk -f tst.awk file
       20.00 20.33 20.66 20.99 21.32 21.65
100.00     2     3     4     1     2     3
100.33     3     4     5     2     4     6
100.66     2     4     0     5     0     7
100.99     0     0     5     6     3     8
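
As a hedged variation on the script above, each label can be computed as start + index * step instead of by repeated addition, which keeps rounding error from accumulating over 250 rows and 300 columns. This is only a sketch: the function name add_axes and the variable names lat0, lon0 and step are made up for illustration, and tab-separated output is assumed to be acceptable.

```shell
# Hypothetical helper: prepend a longitude header row and a latitude column.
# lat0, lon0 and step are illustrative names, not taken from the original answers.
add_axes() {
  awk -v lat0=100 -v lon0=20 -v step=0.33 '
    NR == 1 {
      # Header row: empty corner cell, then one longitude per data column.
      for (i = 1; i <= NF; i++) printf "\t%.2f", lon0 + (i - 1) * step
      print ""
    }
    {
      # Data row: latitude derived from the row number, then the original fields.
      printf "%.2f", lat0 + (NR - 1) * step
      for (i = 1; i <= NF; i++) printf "\t%s", $i
      print ""
    }
  ' "$@"
}
```

It could be called as add_axes ifile.txt > ofile.txt.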

The following awk command may also help.
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {$1=$1;print col OFS $0;col+=.33}' OFS="\t" Input_file
The same solution in multi-line form:
awk -v col=100 -v row=20 '
FNR==1{
  printf OFS;
  for(i=1;i<=NF;i++){
    printf row OFS;
    row=row+.33;
  };
  print ""
}
{
  $1=$1;
  print col OFS $0;
  col+=.33
}
' OFS="\t" Input_file

Awk solution:
awk 'NR == 1{
long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
NR > 1{ lat += 0.33 }
{
printf "%.2f%6s", lat, "";
for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
}' file

With perl
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
print join "\t", 100+(0.33*$i++), @F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a auto-splits the input on whitespace; the result is saved in the @F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command-line options
if $.==1 restricts the statement to the first line of input
map {20.00+$_*0.33} 0..$#F iterates over the indices of the @F array; each iteration yields the value of the expression inside {}, where $_ is 0, 1, and so on up to the last index of @F
print join "\t", "", map... prints an empty element followed by the results of map, tab-separated
For every line, the contents of the @F array are printed, prefixed with the result of 100+(0.33*$i++), where $i starts as 0 in numeric context. Again, tab is the separator used to join these values
Use sprintf if formatting is needed; $, can also be initialized instead of using join
perl -lane 'BEGIN{$,="\t"; $st=0.33}
print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
print sprintf("%.2f", 100+($st*$i++)), @F' ip.txt

Related

How to sum column values of items with shared substring in first column using bash

I am trying to sum values across rows of a dataframe for rows which have a shared substring in the first column. The data looks like this:
ID Data_1 Data_2 Data_3 Data_4
SRW8002300_T01 1 2 3 4
SRW8002300_T02 1 2 3 4
SRW8002300_T03 1 2 3 4
SRW8004500_T01 1 2 3 4
SRW8004500_T02 1 2 3 4
SRW8006000_T01 1 2 3 4
I want to sum the 2nd to 5th column values when the first part of the ID (the part before the underscore) is shared. So the above would become:
ID Data_1 Data_2 Data_3 Data_4
SRW8002300 3 6 9 12
SRW8004500 2 4 6 8
SRW8006000 1 2 3 4
So far I've got an awk command that can strip the IDs of the string after the underscore:
awk '{print $1}' filename | awk -F'_' '{print $1}'
And another to sum column values if the value in the first column is shared:
awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5} END {for (i in a) print i, a[i], b[i], c[i], d[i]}' filename
However, I am struggling to combine these two commands to create a new dataframe with summed values for the shared IDs.
I usually code in python but am trying to get into the habit of writing bash scripts for these sorts of tasks.
Thank you for any help.
Assuming your key values are contiguous as shown in your sample input:
$ cat tst.awk
NR==1 { print; next }
{
    curr = $1
    sub(/_.*/,"",curr)
    if ( curr != prev ) {
        prt()
    }
    for (i=2; i<=NF; i++) {
        sum[i] += $i
    }
    prev = curr
}
END { prt() }
function prt() {
    if ( prev != "" ) {
        printf "%s%s", prev, OFS
        for (i=2; i<=NF; i++) {
            printf "%d%s", sum[i], (i<NF ? OFS : ORS)
        }
        delete sum
    }
}
$ awk -f tst.awk file
ID Data_1 Data_2 Data_3 Data_4
SRW8002300 3 6 9 12
SRW8004500 2 4 6 8
SRW8006000 1 2 3 4
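
If the rows for one ID are not guaranteed to be contiguous, a possible alternative is to accumulate the sums in an array keyed by the ID prefix and print everything at the end. The following is a minimal sketch under that assumption; the function name sum_by_prefix and the array names seen, order and sum are illustrative, and it holds one set of sums per distinct prefix in memory.

```shell
# Hypothetical helper: sum columns 2..NF per ID prefix (text before "_"),
# keeping the prefixes in order of first appearance.
sum_by_prefix() {
  awk '
    NR == 1 { print; next }                # pass the header through unchanged
    {
      key = $1
      sub(/_.*/, "", key)                  # strip everything from the underscore on
      if (!(key in seen)) { seen[key] = 1; order[++n] = key }
      for (i = 2; i <= NF; i++) sum[key, i] += $i
      nf = NF                              # remember the column count for END
    }
    END {
      for (k = 1; k <= n; k++) {
        printf "%s", order[k]
        for (i = 2; i <= nf; i++) printf "%s%d", OFS, sum[order[k], i]
        printf "%s", ORS
      }
    }
  ' "$@"
}
```

It could be called as sum_by_prefix file.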

awk sum of selected values in column

I want to sum selected values in a column using awk. The second column is time, and I want to sum the values of the 4th column within each second.
Input:
1 0.1 2 1 3
2 0.3 2 2 3
4 0.6 2 3 3
2 1.1 2 4 3
5 1.3 2 5 3
6 2.2 2 6 3
7 2.7 2 7 3
8 3.6 2 8 3
9 3.9 2 1 3
10 4.1 2 1 3
Expected output (we have 5 seconds):
6
9
13
9
1
EDIT:
Here is my code, but I have no idea how to make it work dynamically.
awk '$2>x && $2<=y (sum+=$4) END {print sum}' filename
where x is the start time and y is the end time. It works only for static values, which means I can obtain the result for only one selected second at a time.
Try the following awk program:
BEGIN {
    total = 0
    secondEnd = 1
}
{
    if($2 < secondEnd) {
        total += $4
        next
    }
    # >= so that a time exactly on a second boundary also closes the previous second
    while($2 >= secondEnd) {
        print(total)
        total = 0
        secondEnd++
    }
    total = $4
}
END {
    print(total)
}
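
A possible alternative, sketched here under the assumption that the times are non-negative, is to bucket each row by the integer part of its time and print the buckets at the end. The function name sum_per_second and the names sum and max are illustrative; seconds with no rows are printed as 0.

```shell
# Hypothetical helper: sum column 4 per whole second of the time in column 2.
sum_per_second() {
  awk '
    BEGIN { max = 0 }
    {
      s = int($2)              # 0.1 -> 0, 1.3 -> 1, 4.1 -> 4
      sum[s] += $4
      if (s > max) max = s
    }
    END {
      # "+ 0" forces a numeric 0 for seconds that had no input rows
      for (s = 0; s <= max; s++) print sum[s] + 0
    }
  ' "$@"
}
```

It could be called as sum_per_second filename.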
EDIT: As per the OP's request, adding code that accepts the field to use as an awk variable.
awk -v col1="2" -F"[ .]" '$col1 == prev+1{print sum;sum=prev=""} {sum+=$NF;prev=$col1} END{if(prev && sum){print sum}}' Input_file
Or, in multi-line form:
awk -v col1="2" -F"[ .]" '
$col1 == prev+1{
print sum;
sum=prev=""
}
{
sum+=$NF;
prev=$col1
}
END{
if(prev && sum){
print sum}
}' Input_file
If you are passing a bash variable to the awk variable, do the following.
column=2 ##Shell variable
awk -v col1="$column" -F"[ .]" '$col1 == prev+1{print sum;sum=prev=""} {sum+=$NF;prev=$col1} END{if(prev && sum){print sum}}' Input_file
Could you please try the following and let me know if it helps (assuming your actual Input_file is the same as the sample shown here).
awk -F"[ .]" '$2 == prev+1{print sum;sum=prev=""} {sum+=$NF;prev=$2} END{if(prev && sum){print sum}}' Input_file
The same solution in multi-line form:
awk -F"[ .]" '
$2 == prev+1{
print sum;
sum=prev=""
}
{
sum+=$NF;
prev=$2
}
END{
if(prev && sum){
print sum}
}' Input_file

awk - all rows where half of columns are bigger than x

As the title suggests, I'm trying to find all rows in a large tsv file where at least 50% of the columns have a value bigger than a value x, using awk.
E.g for x=5:
9 6 7 2 3
0 1 2 7 6
1 3 8 9 10
should return
9 6 7 2 3
1 3 8 9 10
awk to the rescue!
$ awk -v t=5 '{c=0; for(i=1;i<=NF;i++) c+=($i>t)} c/NF>=0.5' file
9 6 7 2 3
1 3 8 9 10
Using Perl:
perl -ane '$x = 5; print if @F / 2 <= grep $_ > $x, @F' -- file.tsv
Using an input .tsv file which looks like this:
Num1 Num2 Num3 Num4 Num5
9 6 7 2 3
0 1 2 7 6
1 3 8 9 10
This code does it in an awk script. I've left comments showing
the form of a script so you can adjust it accordingly.
#!/usr/bin/awk -f
# reads from stdin.
# Usage: $ ./bigcols.awk < input1.tsv
# Run at start.
BEGIN {
    # print "Start"
    # print "TSV setting. Field separator set to tab."
    FS = "\t"
    # He wants to find lines with avg greater than var x
    x=5
}
# main. Run for each record. This code uses newlines to denote records.
{
    # Find lines which are of this form: (skip header)
    # #+,
    # ie. start with one or more numbers in column 1.
    if ($1 ~ /^[0-9]+/) {
        the_avg = ($1 + $2 + $3 + $4 + $5)/5
        if (the_avg > x) {
            print $1, $2, $3, $4, $5
        }
    }
}
# run at end
#END { print "Stop" }
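
The script above compares the row average with x. If the goal is the one stated in the question (at least half of the columns exceed x), a count-based variant could look like the sketch below. The function name rows_half_above is made up for illustration, and it assumes tab-separated numeric data with an optional non-numeric header line.

```shell
# Hypothetical helper: print rows where at least half of the fields exceed x.
# Usage sketch: rows_half_above 5 input1.tsv
rows_half_above() {
  x=$1
  shift
  awk -F'\t' -v x="$x" '
    $1 ~ /^[0-9]+/ {                 # skip the header, as the script above does
      c = 0
      for (i = 1; i <= NF; i++) if ($i + 0 > x) c++
      if (2 * c >= NF) print         # "at least 50%" without dividing
    }
  ' "$@"
}
```

Unlike the fixed $1..$5 average, this loops over NF, so it does not depend on the number of columns.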

How to sum column of different file in bash scripting

I have two files:
file-1
1 2 3 4
1 2 3 4
1 2 3 4
file-2
0.5
0.5
0.5
Now I want to add column 1 of file-2 to column 3 of file-1
Output
1 2 3.5 4
1 2 3.5 4
1 2 3.5 4
I've tried this, but it does not work correctly:
awk '{print $1, $2, $3+file-2 }' file-2=$1_of_file-2 file-1 > file-3
I know the awk statement is not right but I want to use something like this; can anyone help me?
Your data isn't very exciting…
awk 'FNR == NR { for (i = 1; i <= NF; i++) { line[NR,i] = $i } fields[NR] = NF }
FNR != NR { line[FNR,3] += $1
pad = ""
for (i = 1; i <= fields[FNR]; i++) { printf "%s%s", pad, line[FNR,i]; pad = " " }
printf "\n"
}' file-1 file-2
The first pattern matches the lines in the first file; it saves each field into the pseudo-multidimensional array line, and also records how many fields there are in that line.
The second pattern matches the lines in the second file; it adds the value in column one to column three of the saved data, then prints out all the fields with a space between them, and adds a newline to the end.
Given this (mildly) modified input, the script (saved in file so-25657951.sh) produces the output shown:
$ cat file-1
1 2 3 4
2 3 6 5
3 4 9 6
$ cat file-2
0.1
0.2
0.3
$ bash so-25657951.sh
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
$
Note that because this slurps the whole of the first file into memory before reading anything from the second file, the input files should not be too large (say sub-gigabyte size). If they're bigger than that, you should probably devise an alternative strategy.
For example, there is a getline function (even in POSIX awk) which could be used to read a line from file 2 for each line in file 1, and you could then simply print the data without needing to accumulate anything:
awk '{ getline add < "file-2"; $3 += add; print }' file-1
This works reasonably cleanly for any size of file (as long as the files have the same number of lines — or, more precisely, as long as file-2 has at least as many lines as file-1).
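
If file-2 might be shorter than file-1, the return value of getline can be checked so that rows without a matching line are printed unchanged. This is a sketch only; the function name add_column and the awk variable addfile are illustrative names, not part of the answer above.

```shell
# Hypothetical helper: add column 1 of the second file to column 3 of the first.
# Usage sketch: add_column file-1 file-2
add_column() {
  awk -v addfile="$2" '
    {
      # getline returns 1 on success, 0 at end of file, -1 on error
      if ((getline add < addfile) > 0) $3 += add
      print
    }
  ' "$1"
}
```

Rows that do get a value are rebuilt with the default output separator (a single space), as in the one-liner above.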
This may work:
cat f1
1 2 3 4
2 3 6 5
3 4 9 6
cat f2
0.1
0.2
0.3
awk 'FNR==NR {a[NR]=$1;next} {$3+=a[FNR]}1' f2 f1
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
After I posted it, I saw that it's the same as what Jaypal posted in a comment.

Using an if/else statement in the middle of AWK

I have a 5-column file:
PS 6 15 0 1
PS 1 17 0 1
PS 4 18 0 1
that I would like to convert to this 7-column format:
PS.15 PS 6 N 1 0 1
PS.17 PS 1 P 1 0 1
PS.18 PS 4 N 1 0 1
To create 6 of the 7 columns requires just grabbing directly (and sometimes applying small arithmetic) from columns in the original file. However, to create one column (column 4) requires an if-else statement.
Specifically, to create new columns 1, 2, 3, I use:
cat File | awk '{print $1"."$3"\t"$1"\t"$2}'
and to create new columns 5, 6,7, I use:
cat testFileB | awk '{print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
and to create new column 4, I use:
cat testFileB | awk '{if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N";}'
These three statements work fine independently and get me what I want (the correct values for the columns that are all separated by tabs). However, when I try to apply them simultaneously (create all 7 columns at once), I can only do so with unwanted new lines (instead of tabs) before and after column 4 (the if/else statement column):
For instance, my attempt to simultaneously create columns 1, 2, 3, 4:
cat File | awk '{print $1"."$3"\t"$1"\t"$2; if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N";}'
results in unwanted new lines before column 4:
PS.15 PS 6
N
PS.17 PS 1
P
PS.18 PS 4
N
Similarly, my attempt to simultaneously create columns 4, 5, 6, 7:
cat File | awk '{if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
results in unwanted new lines after column 4:
N
1 0 1
P
1 0 1
N
1 0 1
Is there a solution so that I can create all 7 columns at once, and there are only tabs between them (no new lines)?
If you don't want automatic line feeds, you can just use printf instead of print. I'm not quite sure if you want a tab separating the N1 or not, but that's easy enough to adjust;
cat testfile | awk '{printf "%s.%s\t%s\t%s\t",$1,$3,$1,$2; if ($2 == 1 || $2 == 2 || $2 == 3) printf "P"; else printf "N"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
PS.15 PS 6 N1 0 1
PS.17 PS 1 P1 0 1
PS.18 PS 4 N1 0 1
Simply set your OFS (instead of repeating a \t all across the line), and use the ternary operator to print P or N:
$ awk -v OFS='\t' '{s=$4+$5;print $1"."$3,$1,$2,($2~/^[123]$/?"P":"N"),s,$4/s,$5/s}' file
PS.15 PS 6 N 1 0 1
PS.17 PS 1 P 1 0 1
PS.18 PS 4 N 1 0 1
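
If the condition grows beyond what fits comfortably in a ternary, the original if/else can be kept by assigning the flag to a variable first and printing once. A minimal sketch, with the function name flag_rows and the variable names flag and s chosen for illustration, and assuming $4 + $5 is never zero, as in the sample data:

```shell
# Hypothetical helper: build the 7-column output with an explicit if/else.
flag_rows() {
  awk -v OFS='\t' '
    {
      if ($2 == 1 || $2 == 2 || $2 == 3) {
        flag = "P"
      } else {
        flag = "N"
      }
      s = $4 + $5                    # assumed nonzero, as in the sample input
      print $1 "." $3, $1, $2, flag, s, $4 / s, $5 / s
    }
  ' "$@"
}
```

It could be called as flag_rows file.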
