Format and then convert txt to csv using shell script and awk - linux

I have a text file:
ifile.txt
x y z t value
1 1 5 01hr01Jan2018 3
1 1 5 02hr01Jan2018 3.1
1 1 5 03hr01Jan2018 3.2
1 3.4 3 01hr01Jan2018 4.1
1 3.4 3 02hr01Jan2018 6.1
1 3.4 3 03hr01Jan2018 1.1
1 4.2 6 01hr01Jan2018 6.33
1 4.2 6 02hr01Jan2018 8.33
1 4.2 6 03hr01Jan2018 5.33
3.4 1 2 01hr01Jan2018 3.5
3.4 1 2 02hr01Jan2018 5.65
3.4 1 2 03hr01Jan2018 3.66
3.4 3.4 4 01hr01Jan2018 6.32
3.4 3.4 4 02hr01Jan2018 9.32
3.4 3.4 4 03hr01Jan2018 12.32
3.4 4.2 8.1 01hr01Jan2018 7.43
3.4 4.2 8.1 02hr01Jan2018 7.93
3.4 4.2 8.1 03hr01Jan2018 5.43
4.2 1 3.4 01hr01Jan2018 6.12
4.2 1 3.4 02hr01Jan2018 7.15
4.2 1 3.4 03hr01Jan2018 9.12
4.2 3.4 5.5 01hr01Jan2018 2.2
4.2 3.4 5.5 02hr01Jan2018 3.42
4.2 3.4 5.5 03hr01Jan2018 3.21
4.2 4.2 6.2 01hr01Jan2018 1.3
4.2 4.2 6.2 02hr01Jan2018 3.4
4.2 4.2 6.2 03hr01Jan2018 1
Explanation: Each coordinate (x,y) has a z-value and three time values. The columns are separated by runs of spaces, not tabs.
I would like to turn the t-column into a header row and then convert the result to a csv file. My expected output is:
ofile.txt
x,y,z,01hr01Jan2018,02hr01Jan2018,03hr01Jan2018
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1
I am trying it in the following way, but I am still not getting the desired output. My script prints some extra commas (,) at the end.
My algorithm and script are:
#Step1:- Split into two files: one with x,y,z (0001.txt) and
# another with t,value (0002.txt).
awk '{n=3; for (i=1;i<=n;i++) printf "%s ", $i; print "";}' ifile.txt > 0001.txt
awk '{n=5; for (i=4;i<=n;i++) printf "%s ", $i; print "";}' ifile.txt > 0002.txt
#Step2:- In 0001.txt: Delete the repeated rows.
awk '!seen[$1,$2,$3]++' 0001.txt > 00011.txt
#Step3:- In 0002.txt: Delete the first row. For each 3 rows in t-column,
# write the value-column as rows. Add the t-row at top
# this is very manual. I am wondering for some command
grep -E "^[0-9].*" 0002.txt > 0003.txt
awk -v n=3 '{ row = row $2 " "; if (NR % n == 0) { print row; row = "" } }' 0003.txt > 0004.txt
(echo "01hr01Jan2018,02hr01Jan2018,03hr01Jan2018";cat 0004.txt) > 00022.txt
#Step4:- Paste output of two and convert to csv.
paste 00011.txt 00022.txt > 0005.txt
cat 0005.txt | tr -s '[:blank:]' ',' > ofile.txt
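The extra commas most likely come from the trailing space that printf "%s " leaves after the last field: tr then turns that space into a comma. A minimal sketch of two ways around it, assuming whitespace-separated input on stdin (join_fields and strip_trailing_comma are hypothetical helper names, not part of the original script):

```shell
# Print fields first..last of every input line, comma-separated.
# The separator is written between fields, never after the last one.
join_fields() {
  awk -v first="$1" -v last="$2" '{
    line = $first
    for (i = first + 1; i <= last; i++) line = line "," $i
    print line
  }'
}

# Alternative repair at the very end of the pipeline:
# drop a single trailing comma from every line.
strip_trailing_comma() {
  sed 's/,$//'
}
```

For example, join_fields 1 3 < ifile.txt would replace the first awk call of Step1, and ... | tr -s '[:blank:]' ',' | strip_trailing_comma would patch Step4 without touching the rest.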

You may use this awk:
awk -v OFS=, '{k=$1 OFS $2 OFS $3}
!($4 in hdr){hn[++h]=$4; hdr[$4]}
k in row{row[k]=row[k] OFS $5; next}
{rn[++n]=k; row[k]=$5}
END {
printf "%s", rn[1]
for(i=2; i<=h; i++) # start at 2: hn[1] is the literal "t" from the input header
printf "%s", OFS hn[i]
print ""
for (i=2; i<=n; i++)
print rn[i], row[rn[i]]
}' file
x,y,z,01hr01Jan2018,02hr01Jan2018,03hr01Jan2018
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1

A single awk program can generate your desired output: using GNU awk
gawk '
BEGIN {SUBSEP = OFS = ","}
NR==1 {next}
{ groups[$4]; value[$1,$2,$3][$4] = $5 }
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
printf "x,y,z"
for (g in groups) printf ",%s", g
printf "\n"
for (a in value) {
printf "%s", a
for (g in groups) printf "%s%s", OFS, 0+value[a][g]
printf "\n"
}
}
' ifile.txt

another similar awk, without the right header
$ awk -v OFS=, '{k=$1 OFS $2 OFS $3}
p!=k {if(p) print line; p=k; line=k}
{line=line OFS $NF}
END {print line}' file
x,y,z,value
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1
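The same streaming idea can also produce the full header if the time labels of the first (x,y,z) group are collected before anything is printed. This is only a sketch under two assumptions: rows of one group are contiguous, and every group lists the same time steps in the same order. pivot_csv is a hypothetical wrapper name.

```shell
# Usage: pivot_csv ifile.txt > ofile.txt
pivot_csv() {
  awk -v OFS=, '
    NR == 1 { head = $1 OFS $2 OFS $3; next }   # keep "x,y,z" from the input header
    { k = $1 OFS $2 OFS $3 }
    k != p {                                    # a new (x,y,z) group starts here
      if (p != "") {
        if (!done) { print head hdr; done = 1 } # header goes out once, before the first row
        print line
      }
      p = k; line = k
    }
    !done { hdr = hdr OFS $4 }                  # time labels come from the first group only
    { line = line OFS $5 }
    END {
      if (p != "") {
        if (!done) print head hdr
        print line
      }
    }
  ' "$1"
}
```

Unlike the array-based answers, this keeps only one output row in memory at a time.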

Related

How to match two different length and different column text file with header using join command in linux

I have two different length text files A.txt and B.txt
A.txt looks like :
ID pos val1 val2 val3
1 2 0.8 0.5 0.6
2 4 0.9 0.6 0.8
3 6 1.0 1.2 1.3
4 8 2.5 2.2 3.4
5 10 3.2 3.4 3.8
B.txt looks like :
pos category
2 A
4 B
6 A
8 C
10 B
I want to match the pos column in both files and get output like this:
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I used the join function join -1 2 -2 1 <(sort -k2 A.txt) <(sort -k1 B.txt) > C.txt
The C.txt comes without a header
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I want the output of join to include a header. Kindly help me out.
Thanks in advance
If you are OK with awk, try the following. Written and tested with the shown samples in GNU awk.
awk 'FNR==NR{a[$1]=$2;next} ($2 in a){$2=a[$2] OFS $2} 1' B.txt A.txt | column -t
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when B.txt is being read.
a[$1]=$2 ##Creating array a with index of 1st field and value is 2nd field of current line.
next ##next will skip all further statements from here.
}
($2 in a){ ##Checking condition if 2nd field is present in array a then do following.
$2=a[$2] OFS $2 ##Adding array a value along with 2nd field in 2nd field as per output.
}
1 ##1 will print current line.
' B.txt A.txt | column -t ##Mentioning Input_file names and passing awk program output to column to make it look better.
As you requested... It is perfectly possible to get the desired output using just GNU join:
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$
The key to getting the correct output is using the sort -g option, and specifying the join output column order using the -o option.
To "pretty print" the output, pipe to column -t
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5 | column -t
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$
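GNU join also has a --header option that passes the first line of each file through without trying to pair it. A sketch of that route, assuming GNU coreutils and the column layout of A.txt and B.txt from the question (join_with_header is a hypothetical helper name). The data lines are sorted on the join field the way join expects, while the header line is kept on top:

```shell
# Usage: join_with_header A.txt B.txt
join_with_header() {
  a_sorted=$(mktemp) || return 1
  b_sorted=$(mktemp) || return 1
  # Keep line 1 in place and sort only the data lines on the join field.
  { head -n 1 "$1"; tail -n +2 "$1" | sort -k 2b,2; } > "$a_sorted"
  { head -n 1 "$2"; tail -n +2 "$2" | sort -k 1b,1; } > "$b_sorted"
  # --header (GNU extension) prints the two header lines using the same -o layout.
  join --header -1 2 -2 1 -o 1.1,2.2,1.2,1.3,1.4,1.5 "$a_sorted" "$b_sorted"
  rm -f "$a_sorted" "$b_sorted"
}
```

The rows come out in the lexical order of pos (so 10 sorts before 2); if numeric order matters, the body can be re-sorted afterwards.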

append the column in another text file

Hi, I have a text file with 3 columns:
2.0 44.8 789.3
3.0 58.4 453.0
4.0 97.2 -489.1
5.2 35.3 458.6
I want to take each column of the above text file and append it to the corresponding block of another text file. The file to which I want to append the columns is given below:
> > > >
10.0 8.5
20.0 8.5
30.0 8.5
40.0 8.5
> > > >
10.0 8.0
20.0 8.0
30.0 8.0
40.0 8.0
> > > >
10.0 9.0
20.0 9.0
30.0 9.0
40.0 9.0
> > > >
and my expected output is
> > > >
10.0 8.5 2.0
20.0 8.5 3.0
30.0 8.5 4.0
40.0 8.5 5.2
> > > >
10.0 8.0 44.8
20.0 8.0 58.4
30.0 8.0 97.2
40.0 8.0 35.3
> > > >
10.0 9.0 789.3
20.0 9.0 453.0
30.0 9.0 -489.1
40.0 9.0 458.6
> > > >
I tried the script below, but I have no idea how to continue from there. I need expert help. Thanks in advance.
#!/bin/sh
for file in inp.txt
do
awk '{print $1}' > colone
done
$ cat tst.awk
NR==FNR {
for (numBlocks=1; numBlocks<=NF; numBlocks++) {
vals[numBlocks,NR] = $numBlocks
}
next
}
/^>/ {
blockNr++
rowNr = 0
print
next
}
{ printf "%s %7s\n", $0, vals[blockNr,++rowNr] }
$ awk -f tst.awk file1 file2
> > > >
10.0 8.5 2.0
20.0 8.5 3.0
30.0 8.5 4.0
40.0 8.5 5.2
> > > >
10.0 8.0 44.8
20.0 8.0 58.4
30.0 8.0 97.2
40.0 8.0 35.3
> > > >
10.0 9.0 789.3
20.0 9.0 453.0
30.0 9.0 -489.1
40.0 9.0 458.6
> > > >
Based on OP's shown samples, try the following. It resets the column counter after the 3rd occurrence of > > in the text file, so that later blocks start again from the 1st column of Input_file.
awk '
FNR==NR{
for(i=1;i<=NF;i++){
value[FNR,i]=$i
}
next
}
/^> >/{
count=0
print
if(col==3){ col=0 }
col++
next
}
{
print $0" "value[++count,col]
}
' Input_file text_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR when Input_file is being read.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
value[FNR,i]=$i ##Creating value with index of FNR,i here with value $i
}
next ##next will skip all further statements from here.
}
/^> >/{ ##Checking condition if line starts from > > then do following.
count=0 ##Resetting count to 0 here.
print ##Printing current line.
if(col==3){ col=0 } ##Checking if col is 3 then make its value 0 here. Why because OP sample has only 3 blocks and
##if there are more than 3 then it will start printing from very 1st values onwards after every 3 blocks.
col++ ##Increasing value of col with 1 here.
next ##next will skip all further statements from here.
}
{
print $0" "value[++count,col] ##Printing current line with space and array value here.
}
' Input_file text_file ##Mentioning Input_file and text_file names here.
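The hard-coded 3 above can be replaced by the number of columns actually found in the values file. A minimal sketch, assuming the values file is non-empty and every block has at most as many rows as the values file (append_block_columns is a hypothetical name):

```shell
# Usage: append_block_columns values.txt blocks.txt
append_block_columns() {
  awk '
    FNR == NR {                       # first file: remember every value by (row, column)
      for (i = 1; i <= NF; i++) value[FNR, i] = $i
      ncols = NF
      next
    }
    /^>/ {                            # separator line: move to the next column, wrapping around
      row = 0
      col = (col % ncols) + 1
      print
      next
    }
    { print $0, value[++row, col] }   # data line: append the matching value
  ' "$1" "$2"
}
```

With the sample files this prints the same rows as the expected output, separated by single spaces.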

awk sum of selected values in column

I want to sum selected values in a column with awk. The second column is time. I want to add up the values of the 4th column for each second.
Input:
1 0.1 2 1 3
2 0.3 2 2 3
4 0.6 2 3 3
2 1.1 2 4 3
5 1.3 2 5 3
6 2.2 2 6 3
7 2.7 2 7 3
8 3.6 2 8 3
9 3.9 2 1 3
10 4.1 2 1 3
Expected output (we have 5 seconds):
6
9
13
9
1
EDIT:
Here is my code, but I have no idea how to make it work dynamically.
awk -v x="$x" -v y="$y" '$2>x && $2<=y {sum+=$4} END {print sum}' filename
where x is the start time and y is the end time. It works only for static values, which means that I can currently obtain the result only for one selected second.
Try the following awk program
BEGIN {
total = 0
secondEnd = 1
}
{
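if($2 <= secondEnd) { # interval (secondEnd-1, secondEnd], as in the question; a time exactly on the boundary no longer overwrites total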
total += $4
next
}
while($2 > secondEnd) {
print(total)
total = 0
secondEnd++
}
total = $4
}
END {
print(total)
}
EDIT: As per OP's request, adding code which accepts the time field as an awk variable. Because -F"[ .]" also splits on the decimal point, the original 4th column is field 5 here.
awk -v col1="2" -F"[ .]" '$col1 == prev+1{print sum;sum=prev=""} {sum+=$5;prev=$col1} END{if(prev && sum){print sum}}' Input_file
OR (a non-one-liner form of the solution):
awk -v col1="2" -F"[ .]" '
$col1 == prev+1{
print sum;
sum=prev=""
}
{
sum+=$5;
prev=$col1
}
END{
if(prev && sum){
print sum}
}' Input_file
In case you are passing a bash variable to an awk variable, do the following.
column=2 ##Shell variable
awk -v col1="$column" -F"[ .]" '$col1 == prev+1{print sum;sum=prev=""} {sum+=$5;prev=$col1} END{if(prev && sum){print sum}}' Input_file
Try the following and let me know if it helps (assuming that your actual Input_file is the same as the shown sample).
awk -F"[ .]" '$2 == prev+1{print sum;sum=prev=""} {sum+=$5;prev=$2} END{if(prev && sum){print sum}}' Input_file
Adding a non-one-liner form of the solution too.
awk -F"[ .]" '
$2 == prev+1{
print sum;
sum=prev=""
}
{
sum+=$5;
prev=$2
}
END{
if(prev && sum){
print sum}
}' Input_file
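Another way to make the per-second sum dynamic is to use the integer part of the time as an array index. This is a sketch, not one of the answers above: it assumes non-negative times, treats second n as the interval [n, n+1), and prints 0 for seconds without data. sum_per_second and its optional column arguments are hypothetical.

```shell
# Usage: sum_per_second file [time_column] [value_column]   (defaults: 2 and 4)
sum_per_second() {
  awk -v t="${2:-2}" -v v="${3:-4}" '
    {
      s = int($t)                      # second this line belongs to
      sum[s] += $v
      if (NR == 1 || s > max) max = s
      if (NR == 1 || s < min) min = s
    }
    END {
      if (NR) for (s = min; s <= max; s++) print sum[s] + 0
    }
  ' "$1"
}
```

Since the sums are kept in an array, the input does not have to be ordered by time.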

Insert a row and a column in a matrix using awk

I have a gridded dataset with 250 rows x 300 columns in matrix form:
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values at the first column and longitude values at the top. Which looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33
I can do it manually for a small matrix, but I have no idea how to get the output in my desired format for the full grid. I was writing a script in the following way, but it is completely useless.
echo 20 > latitude.txt
for i in `seq 1 250`;do
i1=$(( i + 0.33 )) #bash can't recognize fractions
echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`;do
j1=$(( j + 0.33 ))
echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt
$ cat tst.awk
BEGIN {
lat = 100
lon = 20
latWid = lonWid = 6
latDel = lonDel = 0.33
latFmt = lonFmt = "%*.2f"
}
NR==1 {
printf "%*s", latWid, ""
for (i=1; i<=NF; i++) {
printf lonFmt, lonWid, lon
lon += lonDel
}
print ""
}
{
printf latFmt, latWid, lat
lat += latDel
for (i=1; i<=NF; i++) {
printf "%*s", lonWid, $i
}
print ""
}
$ awk -f tst.awk file
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The following awk may also help you with the same.
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {$1=$1;print col OFS $0;col+=.33}' OFS="\t" Input_file
Adding a non-one-liner form of the above solution too:
awk -v col=100 -v row=20 '
FNR==1{
printf OFS;
for(i=1;i<=NF;i++){
printf row OFS;
row=row+.33;
};
print ""
}
{
$1=$1;
print col OFS $0; # print first, then step, so that the first row is labelled 100
col+=.33
}
' OFS="\t" Input_file
Awk solution:
awk 'NR == 1{
long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
NR > 1{ lat += 0.33 }
{
printf "%.2f%6s", lat, "";
for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
}' file
With perl
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
print join "\t", 100+(0.33*$i++), @F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a to auto-split input on whitespace, result saved in the @F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
if $.==1 for the first line of input
map {20.00+$_*0.33} 0..$#F iterates based on the size of the @F array, and for each iteration we get a value from the expression inside {}, where $_ will be 0, 1, etc. up to the last index of the @F array
print join "\t", "", map... uses tab as separator to print an empty element followed by the results of map
For all the lines, print the contents of the @F array prefixed with the result of 100+(0.33*$i++), where $i is initially 0 in numeric context. Again, tab is used as separator while joining these values
Use sprintf if needed for formatting, also $, can be initialized instead of using join
perl -lane 'BEGIN{$,="\t"; $st=0.33}
print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
print sprintf("%.2f", 100+($st*$i++)), @F' ip.txt
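The awk answers above add 0.33 repeatedly, which can accumulate a small floating-point error over 250 or 300 steps. Computing each label as start + index * step, as the perl version does, avoids that. A minimal awk sketch of the same idea, assuming the start values and increment from the question and a fixed column width of 8 (label_grid is a hypothetical name):

```shell
# Usage: label_grid ifile.txt > ofile.txt
label_grid() {
  awk -v lat0=100 -v lon0=20 -v step=0.33 '
    NR == 1 {                                   # longitude header, one label per column
      printf "%8s", ""
      for (i = 1; i <= NF; i++) printf "%8.2f", lon0 + (i - 1) * step
      print ""
    }
    {                                           # latitude label, then the original row
      printf "%8.2f", lat0 + (NR - 1) * step
      for (i = 1; i <= NF; i++) printf "%8s", $i
      print ""
    }
  ' "$1"
}
```

The width of 8 is arbitrary and only has to be at least as large as the widest value in the grid.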

Convert column to matrix format using awk

I have a gridded data file in column format as:
ifile.txt
x y value
20.5 20.5 -4.1
21.5 20.5 -6.2
22.5 20.5 0.0
20.5 21.5 1.2
21.5 21.5 4.3
22.5 21.5 6.0
20.5 22.5 7.0
21.5 22.5 10.4
22.5 22.5 16.7
I would like to convert it to matrix format as:
ofile.txt
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
Where top 20.5 21.5 22.5 indicate y and side values indicate x and the inside values indicate the corresponding grid values.
I found a similar question here Convert a 3 column file to matrix format but the script is not working in my case.
The script is
awk '{ h[$1,$2] = h[$2,$1] = $3 }
END {
for(i=1; i<=$1; i++) {
for(j=1; j<=$2; j++)
printf h[i,j] OFS
printf "\n"
}
}' ifile
The following awk script handles:
any size of matrix
row and column indices that are unrelated to each other, since it keeps track of them separately
missing entries: if a certain row/column combination does not appear, the value defaults to zero
This is done in this way:
awk '
BEGIN{PROCINFO["sorted_in"] = "@ind_num_asc"}
(NR==1){next}
{row[$1]=1;col[$2]=1;val[$1" "$2]=$3}
END { printf "%8s",""; for (j in col) { printf "%8.3f",j }; printf "\n"
for (i in row) {
printf "%8.3f",i; for (j in col) { printf "%8.3f",val[i" "j] }; printf "\n"
}
}' <file>
How does it work:
PROCINFO["sorted_in"] = "@ind_num_asc" states that all arrays are traversed in ascending numeric order of their indices.
(NR==1){next} : skip the first line
{row[$1]=1;col[$2]=1;val[$1" "$2]=$3}, process the line by storing the row and column index and accompanying value.
The end statement does all the printing.
This outputs:
20.500 21.500 22.500
20.500 -4.100 1.200 7.000
21.500 -6.200 4.300 10.400
22.500 0.000 6.000 16.700
note: the usage of PROCINFO is a gawk feature.
However, if you make a couple of assumptions, you can do it much more briefly:
the file contains all possible entries, with no missing values
you do not want the indices of the rows and columns printed out
the indices are sorted in column-major order
Then you can use the following short versions:
sort -g <file> | awk '($1+0!=$1){next}
($1!=o)&&(o!=""){printf "\n"}
{printf "%8.3f",$3; o=$1 }'
which outputs
-4.100 1.200 7.000
-6.200 4.300 10.400
0.000 6.000 16.700
or for the transposed:
awk '(NR==1){next}
($2!=o)&&(NR!=2){printf "\n"}
{printf "%8.3f",$3; o=$2 }' <file>
This outputs
-4.100 -6.200 0.000
1.200 4.300 6.000
7.000 10.400 16.700
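Since PROCINFO["sorted_in"] is gawk-only, a portable variant can remember the row and column labels in order of first appearance instead of sorting them. This is a sketch for any POSIX awk, assuming the input already lists x and y in the desired order (as ifile.txt does). to_matrix is a hypothetical name, and missing cells are printed as NA rather than zero.

```shell
# Usage: to_matrix ifile.txt > ofile.txt
to_matrix() {
  awk '
    NR == 1 { next }                                # skip the "x y value" header
    !($1 in seen_x) { seen_x[$1]; xs[++nx] = $1 }   # x labels in order of first appearance
    !($2 in seen_y) { seen_y[$2]; ys[++ny] = $2 }   # y labels in order of first appearance
    { val[$1 SUBSEP $2] = $3 }
    END {
      line = ""
      for (j = 1; j <= ny; j++) line = line "\t" ys[j]
      print line                                    # header row: empty corner, then y labels
      for (i = 1; i <= nx; i++) {
        line = xs[i]
        for (j = 1; j <= ny; j++) {
          k = xs[i] SUBSEP ys[j]
          line = line "\t" ((k in val) ? val[k] : "NA")
        }
        print line
      }
    }
  ' "$1"
}
```

Values are copied as text, so 0.0 and 7.0 keep the formatting they have in the input.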
Adjusted my old GNU awk solution for your current input data:
matrixize.awk script:
#!/bin/awk -f
BEGIN { PROCINFO["sorted_in"]="@ind_num_asc"; OFS="\t" }
NR==1{ next }
{
b[$1]; # accumulating unique indices
($1 != $2)? a[$1][$2] = $3 : a[$2][$1] = $3; # set `diagonal` relation between different indices
}
END {
h = "";
for (i in b) {
h = h OFS i # form header columns
}
print h; # print header column values
for (i in b) {
row = i; # index column
# iterating through the row values (for each intersection point)
for (j in a[i]) {
row = row OFS a[i][j]
}
print row
}
}
Usage:
awk -f matrixize.awk yourfile
The output:
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
Perl solution:
#!/usr/bin/perl -an
$h{ $F[0] }{ $F[1] } = $F[2] unless 1 == $.;
END {
@s = sort { $a <=> $b } keys %h;
print ' ' x 5;
printf '%5.1f' x @s, @s;
print "\n";
for my $u (@s) {
print "$u ";
printf '%5.1f', $h{$u}{$_} for @s;
print "\n";
}
}
-n reads the input line by line
-a splits each line on whitespace into the @F array
See sort, print, printf, and keys.
awk solution:
sort -n ifile.txt | awk 'BEGIN{header="\t"}NR>1{if((NR-1)%3==1){header=header sprintf("%4.1f\t",$1); matrix=matrix sprintf("%4.1f\t",$1)}matrix= matrix sprintf("%4.1f\t",$3); if((NR-1)%3==0 && NR!=10)matrix=matrix "\n"}END{print header; print matrix}';
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
Explanations:
sort -n ifile.txt sorts the file numerically.
The header variable stores everything needed to create the header line. It is initialised with header="\t" and is extended by header=header sprintf("%4.1f\t",$1) on the lines that satisfy (NR-1)%3==1.
In the same way, the matrix is built in the matrix variable: matrix=matrix sprintf("%4.1f\t",$1) creates the first column and matrix=matrix sprintf("%4.1f\t",$3) fills in the values.
Finally, if((NR-1)%3==0 && NR!=10)matrix=matrix "\n" adds the end of line where needed.
