Sorting a list of hasheqs in Racket

I am creating a function 'sort-mail' in Racket that takes in a list of hasheqs and sorts them based on their 'Date key. The input is defined this way:
(define test-dates
  '("Sun, 10 Sep 2017 09:48:44 +0200"
    "Wed, 13 Sep 2017 17:51:05 +0000"
    "Sun, 10 Sep 2017 13:16:19 +0200"
    "Tue, 17 Nov 2009 18:21:38 -0500"
    "Wed, 13 Sep 2017 10:40:47 -0700"
    "Thu, 14 Sep 2017 12:03:35 -0700"
    "Wed, 18 Nov 2009 02:22:12 -0800"
    "Sat, 09 Sep 2017 13:40:18 -0700"
    "Tue, 26 Oct 2010 15:11:06 +0200"
    "Tue, 17 Nov 2009 18:04:31 -0800"
    "Mon, 17 Oct 2011 04:15:12 +0000"
    "Sun, 16 Oct 2011 23:12:02 -0500"
    "Mon, 11 Sep 2017 14:41:12 +0100"))
(define test-hashes (map (lambda (x) (hasheq 'Date x)) test-dates))
I have tried following the answer to this question, but I don't think it's what I'm looking for. So far, I am trying to sort them using the following:
(define (sort-mail test-hashes)
  (sort test-hashes #:key car <))
Unfortunately, I am getting this error:
car: contract violation
  expected: pair?
  given: '#hasheq((Date . "Wed, 13 Sep 2017 17:51:05 +0000"))
I'm pretty confused as to what my sort statement should look like, so any guidance would be great. Thank you!

There are two problems.
First, the reason for the error message is that sort calls car (the #:key function) on each element of the test-hashes list, and each of those elements is a hash table, not a pair. car expects a pair, hence the error.
Your #:key function needs to extract the date from the hash table. hash-ref does that. So here's a first attempt at sort-mail:
(define (sort-mail hash-tables)
  (define (date-of ht) (hash-ref ht 'Date))
  (sort hash-tables #:key date-of string<?))
This brings us to the second problem, which is the function for comparing dates. Notice that the comparison function above is string<? rather than <. That's because the value associated with the 'Date key in each hash table is a string. Calling string<? avoids a run-time type error, but the dates get sorted in the wrong order:
> (sort-mail test-hashes)
'(#hasheq((Date . "Mon, 11 Sep 2017 14:41:12 +0100"))
#hasheq((Date . "Mon, 17 Oct 2011 04:15:12 +0000"))
#hasheq((Date . "Sat, 09 Sep 2017 13:40:18 -0700"))
#hasheq((Date . "Sun, 10 Sep 2017 09:48:44 +0200"))
#hasheq((Date . "Sun, 10 Sep 2017 13:16:19 +0200"))
#hasheq((Date . "Sun, 16 Oct 2011 23:12:02 -0500"))
#hasheq((Date . "Thu, 14 Sep 2017 12:03:35 -0700"))
#hasheq((Date . "Tue, 17 Nov 2009 18:04:31 -0800"))
#hasheq((Date . "Tue, 17 Nov 2009 18:21:38 -0500"))
#hasheq((Date . "Tue, 26 Oct 2010 15:11:06 +0200"))
#hasheq((Date . "Wed, 13 Sep 2017 10:40:47 -0700"))
#hasheq((Date . "Wed, 13 Sep 2017 17:51:05 +0000"))
#hasheq((Date . "Wed, 18 Nov 2009 02:22:12 -0800")))
As you can see, the dates are sorted alphabetically, not by date. Really, then, you need a #:key function that returns the date represented in a way that can easily be compared with other dates.
Your date strings are in the format defined by RFC 2822. I did a quick search of the Racket documentation and didn't find a standard library function to parse RFC 2822 date strings. Some googling turned up this blog post by Tero Hasu, which includes a function to convert RFC 2822 date strings into Unix times. A "Unix time" is a time represented as the number of seconds since January 1, 1970 UTC. That's a number, so you can compare it with <.
Here's the code pasted from Tero Hasu's blog:
(require (prefix-in s. srfi/19))
(define (rfc2822->unix-time s) ;; string -> integer
  (let ((d (s.string->date s "~a, ~d ~b ~Y ~H:~M:~S ~z")))
    (s.time-second (s.date->time-utc d))))
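As a quick sanity check (assuming srfi/19 parses the ~z offset as shown), the Unix epoch itself should map to zero:
> (rfc2822->unix-time "Thu, 01 Jan 1970 00:00:00 +0000")
0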
And finally, here's the corrected sort-mail:
(define (sort-mail hash-tables)
  (define (ht->unix-time ht) (rfc2822->unix-time (hash-ref ht 'Date)))
  (sort hash-tables #:key ht->unix-time <))
> (sort-mail test-hashes)
'(#hasheq((Date . "Tue, 17 Nov 2009 18:21:38 -0500"))
#hasheq((Date . "Tue, 17 Nov 2009 18:04:31 -0800"))
#hasheq((Date . "Wed, 18 Nov 2009 02:22:12 -0800"))
#hasheq((Date . "Tue, 26 Oct 2010 15:11:06 +0200"))
#hasheq((Date . "Sun, 16 Oct 2011 23:12:02 -0500"))
#hasheq((Date . "Mon, 17 Oct 2011 04:15:12 +0000"))
#hasheq((Date . "Sat, 09 Sep 2017 13:40:18 -0700"))
#hasheq((Date . "Sun, 10 Sep 2017 09:48:44 +0200"))
#hasheq((Date . "Sun, 10 Sep 2017 13:16:19 +0200"))
#hasheq((Date . "Mon, 11 Sep 2017 14:41:12 +0100"))
#hasheq((Date . "Wed, 13 Sep 2017 10:40:47 -0700"))
#hasheq((Date . "Wed, 13 Sep 2017 17:51:05 +0000"))
#hasheq((Date . "Thu, 14 Sep 2017 12:03:35 -0700")))

Related

How to prepend the filename to extracted lines via awk and grep

I'll preface this with the fact that I have no knowledge of awk (or maybe it's sed I need?) and fairly basic knowledge of grep and Linux, so apologies if this is a really dumb question. I find the man pages really difficult to decipher, and googling has gotten me quite far in my solution but not far enough to tie the two things I need to do together. Onto the problem...
I have some log files that I'm trying to extract rows from that are on a Linux server, named in the format aYYYYMMDD.log, that are all along the lines of:
Starting Process A
Wed 27 Oct 18:15:39 BST 2021 >>> /dir/task1 start <<<
...
Wed 27 Oct 18:15:40 BST 2021 >>> /dir/task1 end <<<
Wed 27 Oct 18:15:40 BST 2021 >>> /dir/task2 start <<<
...
Wed 27 Oct 18:15:42 BST 2021 >>> /dir/task2 end <<<
...
...
Wed 27 Oct 18:15:53 BST 2021 >>> /dir/taskreporting start <<<
...
Wed 27 Oct 18:15:53 BST 2021 >>> Starting task90 <<<
...
Wed 27 Oct 18:15:54 BST 2021 >>> Finishing task90 <<<
Wed 27 Oct 18:15:54 BST 2021 >>> Starting task91 <<<
...
Wed 27 Oct 18:15:57 BST 2021 >>> Finishing task91 <<<
...
...
Wed 27 Oct 18:16:12 BST 2021 >>> Starting task99 <<<
...
Wed 27 Oct 18:16:27 BST 2021 >>> Finishing task99 <<<
...
Wed 27 Oct 18:16:27 BST 2021 >>> /dir/taskreporting end <<<
...
Ended Process A
(I've excluded the log rows that are irrelevant to my requirement.)
I need to find what tasks were run during the taskreporting task, which I have managed to do with the following command (thanks to this other stackoverflow post):
awk '/taskreporting start/{flag=1;next}/taskreporting end/{flag=0}flag' <specific filename>.log | grep 'Starting task\|Finishing task'
This works well when I run it against a single file and produces output like:
Wed 27 Oct 18:15:53 BST 2021 >>> Starting task90 <<<
Wed 27 Oct 18:15:54 BST 2021 >>> Finishing task90 <<<
Wed 27 Oct 18:15:54 BST 2021 >>> Starting task91 <<<
Wed 27 Oct 18:15:57 BST 2021 >>> Finishing task91 <<<
...
Wed 27 Oct 18:16:12 BST 2021 >>> Starting task99 <<<
Wed 27 Oct 18:16:27 BST 2021 >>> Finishing task99 <<<
which is pretty much what I want to see. However, as I have multiple files to extract from (having amended the filename in the above command appropriately, e.g. to *.log), I need to output the filename alongside the rows so that I know which file the info belongs to. E.g. I'd like to see:
a211027.log Wed 27 Oct 18:15:53 BST 2021 >>> Starting task90 <<<
a211027.log Wed 27 Oct 18:15:54 BST 2021 >>> Finishing task90 <<<
a211027.log Wed 27 Oct 18:15:54 BST 2021 >>> Starting task91 <<<
a211027.log Wed 27 Oct 18:15:57 BST 2021 >>> Finishing task91 <<<
...
a211027.log Wed 27 Oct 18:16:12 BST 2021 >>> Starting task99 <<<
a211027.log Wed 27 Oct 18:16:27 BST 2021 >>> Finishing task99 <<<
I've googled and it seems like {print FILENAME} is what I need, but I couldn't figure out where to add it into my current awk command. How can I amend my awk command to get it to add the filename to the beginning of the rows? Or is there a better way of achieving my aim?
As you have provided most of the answer yourself, all that is needed is {print FILENAME, $0}, which prepends the filename to the rest of the line ($0):
awk '/taskreporting start/{flag=1;next}/taskreporting end/{flag=0}flag {print FILENAME, $0}' <specific filename>.log
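If you also want to drop the separate grep stage, the same filter can live inside awk. A sketch, assuming the two patterns from your pipeline and running over all logs at once:
awk '/taskreporting start/{flag=1;next} /taskreporting end/{flag=0} flag && /Starting task|Finishing task/ {print FILENAME, $0}' *.log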

Why is sorting the timestamp using sort_values not working?

I have a column of timestamps converted to human-readable form.
I have tried sorting by the epoch time as well as after converting. It gives me:
Fri, 08 Feb 2019 17:24:16 IST
Mon, 11 Feb 2019 02:19:40 IST
Sat, 09 Feb 2019 00:22:43 IST
which is not sorted.
I have used sort_values()
each_tracker_df = each_tracker_df.sort_values(["timestamp"],ascending=True)
Why isn't it working?
Since all the times are in IST, you can strip the trailing "IST" and parse what remains with datetime.strptime, then sort the parsed datetimes:
import datetime

times = ['Fri, 10 Feb 2010 17:24:16',
         'Fri, 11 Feb 2010 17:24:16',
         'Fri, 11 Feb 2019 17:24:16']
change_format = []
for time in times:
    change_format.append(datetime.datetime.strptime(time, '%a, %d %b %Y %H:%M:%S'))
change_format.sort()
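In pandas itself the underlying problem is the same: a column of date strings sorts lexicographically, not chronologically. Converting the column with pd.to_datetime before calling sort_values fixes that. A sketch, assuming the DataFrame and column name from the question (each_tracker_df, "timestamp"):
import pandas as pd

# Strip the constant " IST" suffix, parse the remainder, then sort chronologically.
each_tracker_df["timestamp"] = pd.to_datetime(
    each_tracker_df["timestamp"].str.replace(" IST", "", regex=False),
    format="%a, %d %b %Y %H:%M:%S",
)
each_tracker_df = each_tracker_df.sort_values("timestamp", ascending=True)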

I want to find the difference between 2 numbers stored in a file using a shell script

Below is the content of the file. I want to find the difference between the first field of each line and that of the previous line.
0.607401 # Tue Mar 27 04:30:01 IST 2018
0.607401 # Tue Mar 27 04:35:02 IST 2018
0.606325 # Tue Mar 27 04:40:02 IST 2018
0.606223 # Tue Mar 27 04:45:01 IST 2018
0.606167 # Tue Mar 27 04:50:02 IST 2018
0.605716 # Tue Mar 27 04:55:01 IST 2018
0.605716 # Tue Mar 27 05:00:01 IST 2018
0.607064 # Tue Mar 27 05:05:01 IST 2018
Expected output:
0
-0.001076
-0.000102
.019944
..
..
.001348
CODE:
awk '{s=$0;getline;print s-$0;next}' a.txt
However, this does not work as expected...
Could you help me, please?
You can use the following awk code:
$ awk 'NR==1{save=$1;next}NR>1{printf "%.6f\n",($1-save);save=$1}' file
0.000000
-0.001076
-0.000102
-0.000056
-0.000451
0.000000
0.001348
and format the output as you want by modifying the printf.
The way you are currently doing it skips lines: getline consumes the next record, so your version compares lines in non-overlapping pairs (1 and 2, 3 and 4, ...) instead of each line with the one before it.
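For comparison, here is a getline-free sketch of the same idea; it remembers the previous value instead of reading ahead:
awk 'NR>1{printf "%.6f\n", $1-prev} {prev=$1}' a.txt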

BASH - conditional sum of columns and rows in csv file

I have a CSV file with some database benchmark results. Here is an example:
Date;dbms;type;description;W;D;S;results;time;id
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;570;265;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;420;215;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;500;365;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;530;255;50
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;870;265;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;620;215;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;700;365;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;530;255;99
I need to process all rows with the same id (the value of the last column) and get this:
Date;dbms;type;description;W;D;S;time;results;results/time
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;sum column 8;sum column 9;(sum column 8 / sum column 9)
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;sum column 8;sum column 9;(sum column 8 / sum column 9)
For now I can only do the sum of column 8 with this awk command:
awk -F";" '{print;sum+=$8 }END{print "sum " sum}' ./file.CSV
Edit:
I need help with a modification of the script I am already using. Here is the real input data:
Date;dbms;type;description;W;D;time;TotalTransactions;NOTransactions;id
Mon Jun 15 14:53:41 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;272270;117508;50
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;280080;110063;50
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;144170;31815;60
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;137570;33910;60
Mon Jun 15 15:24:04 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;226660;97734;70
Mon Jun 15 15:34:08 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;210420;95113;70
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;288360;119328;80
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;270360;124328;80
I need to sum the values in the time, TotalTransactions and NOTransactions columns, and then add a column with the value (sum NOTransactions / sum time).
I am using this script:
awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0}
(NR>1 && NF) {sum7[$10]+=$7; sum8[$10]+=$8; sum9[$10]+=$9; other[$10]=$0}
END {for (i in sum8)
{$0=other[i]; $7=sum7[i];$8=sum8[i]; $9=sum9[i]; $10=sprintf("%.0f", sum9[i]/sum7[i]); print}}' ./logsFinal.csv
gives me this output:
;;;;;;;;;results/time
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;20;552350;227571;11379
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;20;281740;65725;3286
Mon Jun 15 15:34:08 CEST 2015;hsql;in-memory;TPC-C test results;2;1;20;437080;192847;9642
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;20;558720;243656;12183
Date;dbms;type;description;W;D;0;0;0;-nan
The values look good (except the header row), but I need these results without the id column (I want to delete the id column).
So I need the same values, but instead of identifying the rows to process by the same value in the id column, it must be rows with the same values in the dbms AND W AND D columns.
You can use this awk:
awk 'BEGIN{ FS=OFS=";" }
NR>1 && NF {
s=""
for(i=1; i<=7; i++)
s=s $i OFS;
a[$NF]=s;
sum8[$NF]+=$8
sum9[$NF]+=$9
} END{
for (i in a)
print a[i] sum8[i], sum9[i], (sum9[i]?sum8[i]/sum9[i]:"NaN")
}' file
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;2020;1100;1.83636
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;2720;1100;2.47273
This awk program will print the modified header and modify the output to contain the sums and their division:
awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0}
(NR>1 && NF) {sum8[$10]+=$8; sum9[$10]+=$9; other[$10]=$0}
END {for (i in sum8)
{$0=other[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}'
which gives:
Date;dbms;type;description;W;D;S;results;time;results/time
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;2020;1100;1.83636
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;2720;1100;2.47273
You don't seem to care for the ID in the result, but if you do, just replace $10= with $11=.
Also, if you need to sum things based on values of more than one column, you can create a temporary variable (a in the example below) which is a concatenation of two columns and use it as an index in the arrays, like this:
awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0}
(NR>1 && NF) {a=$5$6; sum8[a]+=$8; sum9[a]+=$9; other[a]=$0}
END {for (i in sum8)
{$0=other[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}'
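One caveat about the a=$5$6 trick: concatenating fields directly can collide (W=1, D=12 and W=11, D=2 both produce "112"), and the question asks to group by dbms as well. Joining the fields with awk's built-in SUBSEP separator and including $2 avoids both issues; a sketch:
awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0}
(NR>1 && NF) {a=$2 SUBSEP $5 SUBSEP $6; sum8[a]+=$8; sum9[a]+=$9; other[a]=$0}
END {for (i in sum8)
{$0=other[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}'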

Accumulate Field Values using Groovy

I have the following columns, KPI_PERIOD and KPI_VALUE, and want to compute a new column named KPI_Output using Groovy.
The logic for KPI_Output is a running total of KPI_VALUE. In other words, for Apr the KPI_Output is the same as KPI_VALUE, as it's the first month. For May the KPI_Output is the sum of the KPI_VALUEs of Apr and May. For Jun it is the sum of Apr, May, and Jun. For Jul it is the sum of Apr, May, Jun, and Jul - and so on.
KPI_PERIOD KPI_VALUE KPI_Output
Apr 33091 33091
May 29685 62776
Jun 31042 93818
Jul 32807 126625
Aug 32782 159407
Sep 34952 194359
Oct 32448 226807
Nov 31515 258322
Dec 24639 282961
Jan 25155 308116
Feb 31320 339436
Mar 33091 372527
How can I achieve this using Groovy?
Here you go:
def input = """KPI_PERIOD KPI_VALUE
Apr 33091
May 29685
Jun 31042
Jul 32807
Aug 32782
Sep 34952
Oct 32448
Nov 31515
Dec 24639
Jan 25155
Feb 31320
Mar 33091
"""
def splitted = input.split('\n')[1..-1]   // drop the header line
sum = 0
def transformed = splitted.collect { it.split(/\s+/)*.trim() }.inject([]) { res, curr ->
    sum += curr[1].toInteger()            // running total of KPI_VALUE
    curr << sum.toString()                // append it as the KPI_Output column
    res << curr
}
println "KPI_PERIOD KPI_VALUE KPI_OUTPUT"
transformed.each {
println "${it[0].padLeft(10)} ${it[1].padLeft(12)} ${it[2].padLeft(12)}"
}
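The same running total can also be printed directly with a single each over the data lines. A compact sketch, assuming the same whitespace-separated input string as above:
println "KPI_PERIOD KPI_VALUE KPI_OUTPUT"
def total = 0
input.split('\n')[1..-1].each { line ->
    def parts = line.trim().split(/\s+/)   // [period, value]
    total += parts[1].toInteger()
    println "${parts[0].padLeft(10)} ${parts[1].padLeft(12)} ${total.toString().padLeft(12)}"
}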
Hope it's all clear, if not feel free to ask.
