BASH - conditional sum of columns and rows in csv file - linux

I have a CSV file with some database benchmark results. Here is an example:
Date;dbms;type;description;W;D;S;results;time;id
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;570;265;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;420;215;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;500;365;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;530;255;50
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;870;265;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;620;215;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;700;365;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;530;255;99
I need to process all rows with the same id (the value of the last column) and get this:
Date;dbms;type;description;W;D;S;time;results;results/time
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;sum column 8;sum column 9;(sum column 8 / sum column 9)
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;sum column 8;sum column 9;(sum column 8 / sum column 9)
For now I can only sum column 8, with this awk command:
awk -F";" '{print;sum+=$8 }END{print "sum " sum}' ./file.CSV
Edit:
I need help modifying a script I am already using. Here is the real input data:
Date;dbms;type;description;W;D;time;TotalTransactions;NOTransactions;id
Mon Jun 15 14:53:41 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;272270;117508;50
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;280080;110063;50
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;144170;31815;60
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;137570;33910;60
Mon Jun 15 15:24:04 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;226660;97734;70
Mon Jun 15 15:34:08 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;210420;95113;70
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;288360;119328;80
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;270360;124328;80
I need to sum the values in the time, TotalTransactions and NOTransactions columns, and then add a column whose value is (sum NOTransactions / sum time).
I am using this script:
awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0}
(NR>1 && NF) {sum7[$10]+=$7; sum8[$10]+=$8; sum9[$10]+=$9; other[$10]=$0}
END {for (i in sum8)
{$0=other[i]; $7=sum7[i];$8=sum8[i]; $9=sum9[i]; $10=sprintf("%.0f", sum9[i]/sum7[i]); print}}' ./logsFinal.csv
gives me this output:
;;;;;;;;;results/time
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;20;552350;227571;11379
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;20;281740;65725;3286
Mon Jun 15 15:34:08 CEST 2015;hsql;in-memory;TPC-C test results;2;1;20;437080;192847;9642
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;20;558720;243656;12183
Date;dbms;type;description;W;D;0;0;0;-nan
The values look good (except for the header rows), but I need these results without the id column (I want to delete the id column).
So I need the same values, but instead of identifying the rows to process by matching values in the id column, it must be rows with matching values in the dbms AND W AND D columns.

You can use this awk:
awk 'BEGIN{ FS=OFS=";" }
NR>1 && NF {
    s = ""
    for (i=1; i<=7; i++)
        s = s $i OFS
    a[$NF] = s
    sum8[$NF] += $8
    sum9[$NF] += $9
}
END {
    for (i in a)
        print a[i] sum8[i], sum9[i], (sum9[i] ? sum8[i]/sum9[i] : "NaN")
}' file
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;2020;1100;1.83636
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;2720;1100;2.47273

This awk program will print the modified header and modify the output to contain the sums and their division:
awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0}
(NR>1 && NF) {sum8[$10]+=$8; sum9[$10]+=$9; other[$10]=$0}
END {for (i in sum8)
{$0=other[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}'
which gives:
Date;dbms;type;description;W;D;S;results;time;results/time
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;2020;1100;1.83636
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;2720;1100;2.47273
You don't seem to care for the ID in the result, but if you do, just replace $10= with $11=.
Also, if you need to sum based on the values of more than one column, you can create a temporary variable (a in the example below) that concatenates the two columns, and use it as the index into the arrays, like this:
awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0}
(NR>1 && NF) {a=$5$6; sum8[a]+=$8; sum9[a]+=$9; other[a]=$0}
END {for (i in sum8)
{$0=other[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}'
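For the edited question (grouping on dbms, W and D instead of id), the same pattern works with a composite array key. Using awk's built-in SUBSEP keeps the key parts from colliding, unlike plain concatenation, where W=2,D=11 and W=21,D=1 would produce the same key. A sketch against the edit's real column layout (time=$7, TotalTransactions=$8, NOTransactions=$9; the sample file is recreated inline for illustration):

```shell
# recreate the sample input from the question
cat > logsFinal.csv <<'EOF'
Date;dbms;type;description;W;D;time;TotalTransactions;NOTransactions;id
Mon Jun 15 14:53:41 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;272270;117508;50
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;280080;110063;50
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;144170;31815;60
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;137570;33910;60
Mon Jun 15 15:24:04 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;226660;97734;70
Mon Jun 15 15:34:08 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;210420;95113;70
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;288360;119328;80
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;270360;124328;80
EOF

awk 'BEGIN {FS=OFS=";"}
NR==1 {$10="results/time"; print; next}     # header: replace id with the new label
NF {
    k = $2 SUBSEP $5 SUBSEP $6              # group on dbms, W, D
    sum7[k]+=$7; sum8[k]+=$8; sum9[k]+=$9; other[k]=$0
}
END {
    for (k in sum7) {
        $0=other[k]; $7=sum7[k]; $8=sum8[k]; $9=sum9[k]
        $10 = (sum7[k] ? sprintf("%.0f", sum9[k]/sum7[k]) : "NaN")
        print
    }
}' logsFinal.csv
```

The NF guard skips blank lines, and the next after printing the header keeps it out of the summing arrays; if mangled header rows still appear, check the input for a leading blank line or a repeated header.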

Related

How to prepend the filename to extracted lines via awk and grep

I'll preface this with the fact that I have no knowledge of awk (or maybe it's sed I need?) and fairly basic knowledge of grep and Linux, so apologies if this is a really dumb question. I find the man pages really difficult to decipher, and googling has gotten me quite far in my solution but not far enough to tie the two things I need to do together. Onto the problem...
I have some log files that I'm trying to extract rows from that are on a Linux server, named in the format aYYYYMMDD.log, that are all along the lines of:
Starting Process A
Wed 27 Oct 18:15:39 BST 2021 >>> /dir/task1 start <<<
...
Wed 27 Oct 18:15:40 BST 2021 >>> /dir/task1 end <<<
Wed 27 Oct 18:15:40 BST 2021 >>> /dir/task2 start <<<
...
Wed 27 Oct 18:15:42 BST 2021 >>> /dir/task2 end <<<
...
...
Wed 27 Oct 18:15:53 BST 2021 >>> /dir/taskreporting start <<<
...
Wed 27 Oct 18:15:53 BST 2021 >>> Starting task90 <<<
...
Wed 27 Oct 18:15:54 BST 2021 >>> Finishing task90 <<<
Wed 27 Oct 18:15:54 BST 2021 >>> Starting task91 <<<
...
Wed 27 Oct 18:15:57 BST 2021 >>> Finishing task91 <<<
...
...
Wed 27 Oct 18:16:12 BST 2021 >>> Starting task99 <<<
...
Wed 27 Oct 18:16:27 BST 2021 >>> Finishing task99 <<<
...
Wed 27 Oct 18:16:27 BST 2021 >>> /dir/taskreporting end <<<
...
Ended Process A
(I've excluded the log rows that are irrelevant to my requirement.)
I need to find what tasks were run during the taskreporting task, which I have managed to do with the following command (thanks to this other stackoverflow post):
awk '/taskreporting start/{flag=1;next}/taskreporting end/{flag=0}flag' <specific filename>.log | grep 'Starting task\|Finishing task'
This works well when I run it against a single file and produces output like:
Wed 27 Oct 18:15:53 BST 2021 >>> Starting task90 <<<
Wed 27 Oct 18:15:54 BST 2021 >>> Finishing task90 <<<
Wed 27 Oct 18:15:54 BST 2021 >>> Starting task91 <<<
Wed 27 Oct 18:15:57 BST 2021 >>> Finishing task91 <<<
...
Wed 27 Oct 18:16:12 BST 2021 >>> Starting task99 <<<
Wed 27 Oct 18:16:27 BST 2021 >>> Finishing task99 <<<
which is pretty much what I want to see. However, as I have multiple files to extract (having amended the filename in the above command appropriately, e.g. to *.log), I need to output the filename alongside the rows, so that I know which file the info belongs to, e.g. I'd like to see:
a211027.log Wed 27 Oct 18:15:53 BST 2021 >>> Starting task90 <<<
a211027.log Wed 27 Oct 18:15:54 BST 2021 >>> Finishing task90 <<<
a211027.log Wed 27 Oct 18:15:54 BST 2021 >>> Starting task91 <<<
a211027.log Wed 27 Oct 18:15:57 BST 2021 >>> Finishing task91 <<<
...
a211027.log Wed 27 Oct 18:16:12 BST 2021 >>> Starting task99 <<<
a211027.log Wed 27 Oct 18:16:27 BST 2021 >>> Finishing task99 <<<
I've googled and it seems like {print FILENAME} is what I need, but I couldn't figure out where to add it into my current awk command. How can I amend my awk command to get it to add the filename to the beginning of the rows? Or is there a better way of achieving my aim?
As you have provided most of the answer yourself, all that is needed is {print FILENAME, $0}, which adds the filename in front of the rest of the line ($0):
awk '/taskreporting start/{flag=1;next}/taskreporting end/{flag=0}flag {print FILENAME, $0}' <specific filename>.log
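If you'd rather not pipe through grep at all, the Starting/Finishing filter can live in the same awk script. A sketch, with a trimmed sample log recreated inline for illustration (pass *.log instead of the single filename to cover all files):

```shell
# a trimmed sample log, recreated from the question
cat > a211027.log <<'EOF'
Starting Process A
Wed 27 Oct 18:15:39 BST 2021 >>> /dir/task1 start <<<
Wed 27 Oct 18:15:53 BST 2021 >>> /dir/taskreporting start <<<
Wed 27 Oct 18:15:53 BST 2021 >>> Starting task90 <<<
Wed 27 Oct 18:15:54 BST 2021 >>> Finishing task90 <<<
Wed 27 Oct 18:16:27 BST 2021 >>> /dir/taskreporting end <<<
Ended Process A
EOF

awk '/taskreporting start/ {flag=1; next}
     /taskreporting end/   {flag=0}
     flag && /Starting task|Finishing task/ {print FILENAME, $0}' a211027.log
```

This prints only the Starting/Finishing lines between the taskreporting markers, each prefixed with the file it came from.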

I want to find difference between 2 numbers stored in a file using shell script

Below is the content of the file. I want to find the difference between each successive line's first field.
0.607401 # Tue Mar 27 04:30:01 IST 2018
0.607401 # Tue Mar 27 04:35:02 IST 2018
0.606325 # Tue Mar 27 04:40:02 IST 2018
0.606223 # Tue Mar 27 04:45:01 IST 2018
0.606167 # Tue Mar 27 04:50:02 IST 2018
0.605716 # Tue Mar 27 04:55:01 IST 2018
0.605716 # Tue Mar 27 05:00:01 IST 2018
0.607064 # Tue Mar 27 05:05:01 IST 2018
Desired output:
0
-0.001076
-0.000102
.019944
..
..
.001348
CODE:
awk '{s=$0;getline;print s-$0;next}' a.txt
However, this does not work as expected.
Could you help me, please?
You can use the following awk code:
$ awk 'NR==1{save=$1;next}NR>1{printf "%.6f\n",($1-save);save=$1}' file
0.000000
-0.001076
-0.000102
-0.000056
-0.000451
0.000000
0.001348
and format the output as you want by modifying the printf.
Note that the way you are currently doing it skips lines: each getline consumes an extra input line, so only the pairs 1-2, 3-4, 5-6, ... are compared.
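A sketch of the same save-the-previous-value pattern that also carries each timestamp along with its difference (the input file is recreated inline for illustration; substr/index simply copy everything from the " #" onwards):

```shell
# recreate a few lines of the sample input
cat > a.txt <<'EOF'
0.607401 # Tue Mar 27 04:30:01 IST 2018
0.607401 # Tue Mar 27 04:35:02 IST 2018
0.606325 # Tue Mar 27 04:40:02 IST 2018
0.606223 # Tue Mar 27 04:45:01 IST 2018
EOF

# print each difference, followed by the timestamp of the newer sample
awk 'NR>1 {printf "%.6f%s\n", $1-p, substr($0, index($0, " #"))}
     {p=$1}' a.txt
```

Because p is updated on every line, every adjacent pair is compared and no line is skipped.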

Formatting top's batch mode output for plotting/graphing

I've used top in the following manner to verify how much CPU percentage a certain process is taking up:
while true; do printf "`date` : " >> /var/log/besclient_resourcemonitor.txt; top -bn1 | awk '/BESClient/ {print $9}' >> /var/log/besclient_resourcemonitor.txt ; sleep 20; done
This results in the following output, truncated:
Mon Oct 16 13:37:08 CDT 2017 : 0.0
Mon Oct 16 13:37:29 CDT 2017 : 0.0
Mon Oct 16 13:38:10 CDT 2017 : 1.3
Mon Oct 16 13:38:30 CDT 2017 : 0.0
Mon Oct 16 13:38:51 CDT 2017 : 0.0
Mon Oct 16 13:39:11 CDT 2017 : 1.9
I'd like to try and graph this in gnuplot or Excel, but I'm having difficulty getting the data sorted properly to be able to place Date/Time (Columns 1-6) in X and load (at the last column) as Y. I've tried using cut, sed and awk, but must be missing something. Since there isn't a common delimiter, I believe this is what's confusing cut.
How would you go about skinning this cat? BTW, I like cats, so, no cats were harmed in the making of this output.
I'm not sure how gnuplot works, but if you want this data as is (no further processing) and only the delimiting of the columns you mentioned is a concern, you can use awk again as follows, store the output as CSV, and import it into Excel:
awk 'BEGIN{FS=" : "; OFS=","}{print $1, $2}'
gives:
Mon Oct 16 13:37:08 CDT 2017,0.0
Mon Oct 16 13:37:29 CDT 2017,0.0
Mon Oct 16 13:38:10 CDT 2017,1.3
Mon Oct 16 13:38:30 CDT 2017,0.0
Mon Oct 16 13:38:51 CDT 2017,0.0
Mon Oct 16 13:39:11 CDT 2017,1.9
FS and OFS are the input and output field separators, respectively.

Bash - print all sequential lines from file and ignore non-sequential ones

I need to extract all sequential lines from a text file, based on the sequence in the 4th column. This sequence is the current time, and there is only one entry for each second (so only one line). Sometimes the sequence in the file will break, because something has slowed down the script that creates it and it has skipped a second or two, as in the example below:
Thu Jun 8 14:17:31 CEST 2017 sync:1
Thu Jun 8 14:17:32 CEST 2017 sync:1
Thu Jun 8 14:17:33 CEST 2017 sync:1
Thu Jun 8 14:17:37 CEST 2017 sync:1 <--
Thu Jun 8 14:17:38 CEST 2017 sync:1
Thu Jun 8 14:17:39 CEST 2017 sync:1
Thu Jun 8 14:17:40 CEST 2017 sync:1
I need bash to ignore this line and continue without printing it, but still print everything before and after it. How should I go about that?
If you only care about the seconds field (e.g., 14:17:39 -> 15:22:40 is clearly not sequential, but this code will think it is; if your data is sufficiently simple, this may be fine):
awk 'NR==1 || $6 == (p + 1)%60 ; {p=$6}' FS=':\| *' input
To check the hour and minute, you could simply convert to seconds from midnight or add logic to compare the hours and minutes. Something like:
awk '{s=$4 * 3600 + $5 * 60 + $6} NR==1 || s == (p + 1)%86400 ; {p=s}' FS=':\| *' input
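The same seconds-from-midnight idea can be written with a POSIX bracket expression as the field separator instead of the escaped alternation, which behaves consistently across awk implementations. A sketch, with the sample input (including the non-sequential 14:17:37 line) recreated inline:

```shell
# recreate the sample input from the question
cat > sync.log <<'EOF'
Thu Jun 8 14:17:31 CEST 2017 sync:1
Thu Jun 8 14:17:32 CEST 2017 sync:1
Thu Jun 8 14:17:33 CEST 2017 sync:1
Thu Jun 8 14:17:37 CEST 2017 sync:1
Thu Jun 8 14:17:38 CEST 2017 sync:1
Thu Jun 8 14:17:39 CEST 2017 sync:1
Thu Jun 8 14:17:40 CEST 2017 sync:1
EOF

# split on runs of colons/spaces, so $4=hour, $5=minute, $6=second
awk -F'[: ]+' '{s = $4*3600 + $5*60 + $6}    # seconds since midnight
     NR == 1 || s == (p + 1) % 86400         # print line 1 and any +1s successor
     {p = s}' sync.log
```

Since p is updated on every line, the 14:17:37 line is dropped but 14:17:38 onwards still prints, which is the "ignore and continue" behaviour the question asks for.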

sed place parentheses at the beginning and close on the 4th line

I'm trying to place an open parenthesis at the beginning of the first line and close it at the end of the 4th line. Below is an example of the data, followed by the output I am looking for.
tester1
SERVICE_TICKET_CREATED
Thu Mar 19 23:27:57 UTC 2015
192.168.1.3
tester2
SERVICE_TICKET_CREATED
Fri Mar 20 00:31:59 UTC 2015
192.168.1.2
(tester1
SERVICE_TICKET_CREATED
Thu Mar 19 23:27:57 UTC 2015
192.168.1.3)
(tester2
SERVICE_TICKET_CREATED
Fri Mar 20 00:31:59 UTC 2015
192.168.1.2)
Using awk, you can do it as:
awk 'NR%4==1{print "("$0; next} NR%4==0{print $0")"; next}1'
Test
$ awk 'NR%4==1{print "("$0; next} NR%4==0{print $0")"; next}1' input
(tester1
SERVICE_TICKET_CREATED
Thu Mar 19 23:27:57 UTC 2015
192.168.1.3)
(tester2
SERVICE_TICKET_CREATED
Fri Mar 20 00:31:59 UTC 2015
192.168.1.2)
Shorter version
awk 'NR%4==1{$0="("$0} NR%4==0{$0=$0")"}1'
sed -r 's/^/(/;N;N;N;s/$/)/' input
The N command appends the next line to the pattern space. s/^/(/ puts an opening paren at the beginning, and s/$/)/ puts a closing one at the end of the buffer. (The -r flag isn't actually needed here, since the script uses no extended-regex features.)
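The group size of 4 is hard-coded in all of these (the awk NR%4 tests and the sed script's three N commands); with awk it can instead be passed in with -v, so the same one-liner covers records of any length. A sketch, with the sample data recreated inline:

```shell
# recreate the sample input
cat > records.txt <<'EOF'
tester1
SERVICE_TICKET_CREATED
Thu Mar 19 23:27:57 UTC 2015
192.168.1.3
tester2
SERVICE_TICKET_CREATED
Fri Mar 20 00:31:59 UTC 2015
192.168.1.2
EOF

# n = lines per record; open on the first line of each group, close on the last
awk -v n=4 'NR%n==1{$0="("$0} NR%n==0{$0=$0")"}1' records.txt
```

Changing -v n=4 to another value adapts the grouping without editing the script body.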
