Awk/Perl convert textfile to csv with sensible format

I have a historical, autogenerated logfile in the following format that I would like to convert to a CSV file before uploading it to a database:
--------------------------------------
Thu Jul 8 09:34:12 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 312 cps/MBq
--------------------------------------
Thu Jul 8 09:34:55 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 318 cps/MBq
--------------------------------------
Thu Jul 8 10:13:39 BST 2010
RED Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 307 cps/MBq
--------------------------------------
Thu Jul 8 10:14:10 BST 2010
RED Head 1
Duration = 20 s
Activity = 14.9 MBq
Sensitivity = 305 cps/MBq
--------------------------------------
Mon Jul 19 10:11:18 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 326 cps/MBq
--------------------------------------
Mon Jul 19 10:12:09 BST 2010
BLUE Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 333 cps/MBq
--------------------------------------
Mon Jul 19 10:13:57 BST 2010
RED Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 338 cps/MBq
--------------------------------------
Mon Jul 19 10:14:45 BST 2010
RED Head 1
Duration = 20 s
Activity = 12.4 MBq
Sensitivity = 340 cps/MBq
--------------------------------------
I would like to convert the logfile to the following format:
Date,Camera,Head,Duration,Activity
08/07/10,BLUE,1,20,14.9
08/07/10,BLUE,1,20,14.9
08/07/10,RED,1,20,14.9
08/07/10,RED,1,20,14.9
I have used awk to get close to what I want:
awk 'BEGIN {print "Date,Camera,Head,Duration,Activity";RS = "--------------------------------------"; FS="\n";}; {OFS=",";split($3, a, " ");split($4,b, " "); split($5,c," ");print $2,a[1],a[3],b[3],c[3]}' sensitivity.txt > sensitivity.csv
which gives me
Date,Camera,Head,Duration,Activity
,,,,
Thu Jul 8 09:34:12 BST 2010,BLUE,1,20,14.9
Thu Jul 8 09:34:55 BST 2010,BLUE,1,20,14.9
Thu Jul 8 10:13:39 BST 2010,RED,1,20,14.9
Thu Jul 8 10:14:10 BST 2010,RED,1,20,14.9
How can I:
(a) get rid of the line of 4 output field separators (,,,,) at the top of the output?
(b) convert the date format from Thu Jul 8 09:34:12 BST 2010 to DD/MM/YY? Can I do this in pure awk, or by piping to perl?

@sudo_O's answer is fine but here's an alternative:
$ cat tst.awk
BEGIN{ RS="---+\n"; OFS=","; months="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR==1{ print "Date","Camera","Head","Duration","Activity"; next }
{ print sprintf("%04d%02d%02d",$6,(match(months,$2)+2)/3,$3),$7,$9,$12,$16 }
$ gawk -f tst.awk file
Date,Camera,Head,Duration,Activity
20100708,BLUE,1,20,14.9
20100708,BLUE,1,20,14.9
20100708,RED,1,20,14.9
20100708,RED,1,20,14.9
20100719,BLUE,1,20,12.4
20100719,BLUE,1,20,12.4
20100719,RED,1,20,12.4
20100719,RED,1,20,12.4
Note that I used GNU awk above so I could set the RS to more than a single character. With other awks just convert all the "---..."s lines to a blank line or control character or something and set RS accordingly before running the script.
If you don't like my suggested date format, tweak the sprintf() to suit.
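For awks that only accept a single-character RS, the conversion described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the answerer's code: the function name log_to_csv is hypothetical, sed turns every dashes-only line into a blank line, and awk's paragraph mode (RS="") then reads each blank-line-separated block as one record. Paragraph mode ignores leading blank lines, so there is no empty first record and the header moves into BEGIN.

```shell
# Hypothetical helper: reads the log on stdin, writes CSV on stdout.
log_to_csv() {
  # Turn every line made only of dashes into an empty line.
  sed 's/^--*$//' |
  awk '
    BEGIN {
      RS = ""     # paragraph mode: blank lines separate records
      OFS = ","
      months = "JanFebMarAprMayJunJulAugSepOctNovDec"
      print "Date", "Camera", "Head", "Duration", "Activity"
    }
    {
      # Same field positions and date arithmetic as the gawk script above.
      date = sprintf("%04d%02d%02d", $6, (match(months, $2) + 2) / 3, $3)
      print date, $7, $9, $12, $16
    }'
}
```

It would be called as log_to_csv < sensitivity.txt > sensitivity.csv, using the file names from the question.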

This straightforward awk script will do the job:
BEGIN {
n=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",month,"|")
for (i=1;i<=n;i++) {
month_index[month[i]] = i
}
print "Date,Camera,Head,Duration,Activity"
}
/^-*$/{
i=0
next
}
{
i++
}
i==1{
printf "%02d/%02d/%02d,",$3,month_index[$2],substr($6,3)
}
i==2{
printf "%s,%d,",$1,$3
}
i==3{
printf "%d,",$3
}
i==4{
printf "%.1f\n",$3
}
Outputs:
$ awk -f script.awk file
Date,Camera,Head,Duration,Activity
08/07/10,BLUE,1,20,14.9
08/07/10,BLUE,1,20,14.9
08/07/10,RED,1,20,14.9
08/07/10,RED,1,20,14.9
19/07/10,BLUE,1,20,12.4
19/07/10,BLUE,1,20,12.4
19/07/10,RED,1,20,12.4
19/07/10,RED,1,20,12.4

I figured I would show how to actually parse the input, rather than just performing string transformations.
#! /usr/bin/env perl
use strict;
use warnings;
use Date::Parse;
use Date::Format;
use Text::CSV;
sub convert_date{
my $time = str2time($_[0]);
# iso 8601 style:
return time2str('%Y-%m-%d',$time); # YYYY-MM-DD
# or the outdated style output you wanted
return time2str('%d/%m/%y',$time); # DD/MM/YY
}
my %multiply_table = (
s => 1,
m => 60,
h => 60 * 60,
d => 60 * 60 * 24,
);
sub convert_duration{
my($d,$s) = $_[0] =~ /^ \s* (\d+) \s* (\w) \s* $/x;
die "Invalid duration '$_[0]'" unless $d && $s;
return $d * $multiply_table{$s};
}
my @field_list = qw'Date Camera Head Duration Activity';
my $csv = Text::CSV->new( { eol => "\n" } );
# print header
$csv->print( \*STDOUT, \@field_list );
# set record separator
local $/ = ('-' x 38) . "\n";
# parse data
while(<>){
chomp; # remove record separator
next unless $_; # skip empty section
my($time,$camdat,@fields) = split m/\n/; # split up the fields
my %data;
# split camera and head fields
@data{qw(Camera Head)} = split /\s+Head\s+/, $camdat;
# parse lines like:
# Duration = 20 s
# Activity = 14.9 MBq
# Sensitivity = 305 cps/MBq
for(@fields){
my($key,$value) = /(\w+) \s* = \s* (.*) /x;
$data{$key} = $value;
}
# at this point we start reducing precision
$data{Date} = convert_date( $time );
# remove measurement units
$data{Duration} = convert_duration($data{Duration}); # safe
$data{Activity} =~ s/[^\d]*$//; # unsafe
$csv->print(\*STDOUT, [@data{@field_list}]);
}

Related

conditional statement with awk

I'm new to Linux.
I'm trying to get the log entries between two dates with gawk.
This is my log:
Oct 07 11:00:33 abcd
Oct 08 12:00:33 abcd
Oct 09 14:00:33 abcd
Oct 10 21:00:33 abcd
I can do it when both the start and end dates are passed,
but I have a problem when the start date, the end date, or both are not passed,
and I don't know how to check for that.
I've written the code below, but it has a syntax error.
sudo gawk -v year='2022' -v start='' -v end='2022:10:08 21:00:34' '
BEGIN{ gsub(/[:-]/," ", start); gsub(/[:-]/," ", end) }
{ dt=year" "$1" "$2" "$3; gsub(/[:-]/," ", dt) }
if(start && end){mktime(dt)>=mktime(start) && mktime(dt)<=mktime(end)}
else if(end){mktime(dt)<=mktime(end)}
else if(start){mktime(dt)>=mktime(start)} ' log.txt
How can I fix this code?

I'd write:
gawk -v end="Oct 10 12:00:00" '
function to_epoch(timestamp, n, a) {
n = split(timestamp, a, /[ :]/)
return mktime(strftime("%Y", systime()) " " month[a[1]] " " a[2] " " a[3] " " a[4] " " a[5])
}
BEGIN {
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m)
for (i=1; i<=12; i++) month[m[i]]=i
if (start) {_start = to_epoch(start)} else {_start = 0}
if (end) {_end = to_epoch(end)} else {_end = 2**31}
}
{ ts = to_epoch($0) }
_start <= ts && ts <= _end
' log.txt
You'll pass the start and/or end variables with the same datetime format as appears in the log file.
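Where gawk's mktime() is not available, the same optional-bounds logic can be sketched with a sortable string key instead of epoch seconds. This is an alternative technique, not the script above, and it rests on two assumptions: all log entries fall within the same year, and an empty start or end argument means "no bound". The helper name filter_log is hypothetical.

```shell
# Hypothetical helper: filter_log START END < log.txt
# START and END use the log's own format (e.g. "Oct 08 21:00:34"); either may be empty.
filter_log() {
  awk -v start="$1" -v end="$2" '
    # Build a fixed-width MMDDhhmmss key so plain string comparison orders by time.
    function key(ts,   a) {
      split(ts, a, /[ :]/)
      return sprintf("%02d%02d%02d%02d%02d", month[a[1]], a[2], a[3], a[4], a[5])
    }
    BEGIN {
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
      for (i = 1; i <= n; i++) month[m[i]] = i
      # A missing bound falls back to the smallest or largest possible key.
      lo = (start != "") ? key(start) : "0000000000"
      hi = (end   != "") ? key(end)   : "9999999999"
    }
    { k = key($0) }
    lo <= k && k <= hi
  '
}
```

For example, filter_log '' 'Oct 08 21:00:34' < log.txt keeps everything up to the end bound, and filter_log '' '' < log.txt prints the whole log.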

This would be easier with dateutils, e.g.:
<infile dategrep -i '%b %d %H:%M:%S' '>Oct 08 00:00:00' |
dategrep -i '%b %d %H:%M:%S' '<Oct 09 23:59:59'
Output:
Oct 08 12:00:33 abcd
Oct 09 14:00:33 abcd

how to skip blank space if value is not there and print proper row and column

I have one details.txt file which has below data
size=190000
date=1603278566981
repo-name=testupload
repo-path=/home/test/testupload
size=140000
date=1603278566981
repo-name=testupload2
repo-path=/home/test/testupload2
size=170000
date=1603278566981
repo-name=testupload3
repo-path=/home/test/testupload3
and the awk script below processes it:
#!/bin/bash
awk -vOFS='\t' '
BEGIN{ FS="=" }
/^size/{
if(++count1==1){ header=$1"," }
sizeArr[++count]=$NF
next
}
/^@repo-name/{
if(++count2==1){ header=header OFS $1"," }
repoNameArr[count]=$NF
next
}
/^date/{
if(++count3==1){ header=header OFS $1"," }
dateArr[count]=$NF
next
}
/^@blob-name/{
if(++count4==1){ header=header OFS $1"," }
repopathArr[count]=$NF
next
}
END{
print header
for(i=1;i<=count;i++){
printf("%s,%s,%s,%s,%s\n",sizeArr[i],repoNameArr[i],dateArr[i],repopathArr[i])
}
}
' details.txt | tr -d @ |awk -F, '{$3=substr($3,0,10)}1' OFS=,|sed 's/date/creationTime/g'
which prints the values as expected (because every record has a repo-name):
size " repo-name" " creationTime" " blob-name"
10496000 testupload Fri 11 Dec 2020 07:35:56 AM CET testfile.tar11.gz
10496000 testupload Thu 10 Dec 2020 02:44:04 PM CET testfile.tar.gz
9602303 testupload Fri 11 Dec 2020 07:38:58 AM CET apache-maven-3.6.3-bin/apache-maven-3.6.3-bin.zip
But when a field is missing from the file, the output format goes wrong (here the repo-name header jumps to the last column because the first few records have no repo-name value):
size " creationTimeime" " blob-name" " " repo-name"
261304 Thu 13 Feb 2020 08:50:02 AM CET temp 8963d25231b
29639 Thu 13 Feb 2020 08:50:00 AM CET temp 3780c72cab5
93699 Thu 13 Feb 2020 08:50:00 AM CET temp 209276c91ba
The column headers get printed in the wrong order, although the data is printed correctly. Is there any way to check whether a field is missing, skip it, and still print the rest in the proper format?
If data is not available, the headers should stay the same; their sequence should not change.
My requirement:
If details.txt is missing any field, the script should print a blank for it and keep the output aligned with the headers.
The headers get disturbed if the repo-name field is missing, but the rest of the output is correct, so the headers need to stay intact even when a field is missing.
Wrong:
size " creationTimeime" " blob-name" " " repo-name"
261304 Thu 13 Feb 2020 08:50:02 AM CET temp 8963d25231b
29639 Thu 13 Feb 2020 08:50:00 AM CET temp 3780c72cab5
93699 Thu 13 Feb 2020 08:50:00 AM CET temp 209276c91ba
Right:
size " repo-name" " creationTime" " blob-name"
10496000 testupload Fri 11 Dec 2020 07:35:56 AM CET testfile.tar11.gz
10496000 testupload Thu 10 Dec 2020 02:44:04 PM CET testfile.tar.gz
9602303 testupload Fri 11 Dec 2020 07:38:58 AM CET apache-maven-3.6.3-bin/apache-maven-3.6.3-bin.zip
Thanks
samurai

You may try this GNU awk solution:
awk -F= -v OFS='\t' 'function prt(ind, name, s) {s=map[ind][name]; return (s==""?" ":s);} {map[NR][$1] = $2} END {print "Size", "Repo Name", "CreationTime", "Repo Path"; for (i=1; i<=NR; i+=4) print prt(i, "size"), prt(i+2, "repo-name"), prt(i+1, "date"), prt(i+3, "repo-path")}' file
Size Repo Name CreationTime Repo Path
190000 testupload 1603278566981 /home/test/testupload
140000 testupload2 1603278566981 /home/test/testupload2
170000 testupload3 1603278566981 /home/test/testupload3
To make it readable:
awk -F= -v OFS='\t' 'function prt(ind, name, s) {
s = map[ind][name]
return (s==""?" ":s)
}
{
map[NR][$1] = $2
}
END {
print "Size", "Repo Name", "CreationTime", "Repo Path"
for (i=1; i<=NR; i+=4)
print prt(i, "size"), prt(i+2, "repo-name"), prt(i+1, "date"), prt(i+3, "repo-path")
}' file
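Note that the script above steps through the input four lines at a time, so it relies on every record having all four lines. A minimal sketch for input where a key may be absent, assuming (this is not stated in the question) that every record starts with a size= line and that values contain no "=" character, could look like this. The helper name details_to_table is hypothetical, and only POSIX awk features are used.

```shell
# Hypothetical helper: details_to_table < details.txt
details_to_table() {
  awk -F= -v OFS='\t' '
    # Print the collected record in fixed column order; unset keys print as empty.
    function emit() {
      if (have) print rec["size"], rec["repo-name"], rec["date"], rec["repo-path"]
      split("", rec)   # portable way to clear the array
      have = 0
    }
    BEGIN { print "Size", "Repo Name", "CreationTime", "Repo Path" }
    $1 == "size" { emit() }              # a size= line starts a new record
    NF >= 2      { rec[$1] = $2; have = 1 }
    END          { emit() }
  '
}
```

Because the header is printed once in BEGIN with a fixed column order, a missing repo-name leaves an empty column instead of shifting the headers.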

Convert a text into time format using bash script

I am new to shell scripting. I have a tab-separated file, e.g.:
0018803 01 1710 2050 002571
0018951 01 1934 2525 003277
0019362 02 2404 2415 002829
0019392 01 2621 2820 001924
0019542 01 2208 2413 003434
0019583 01 1815 2134 002971
Here, the 3rd and 4th columns represent the start time and end time.
I want to interpret these two columns as times so that I can add a 6th column holding the exact difference between column 4 and column 3 in hours and minutes.
The column 6 results will be 3:40, 5:51, 00:11, 1:59, 2:05, 3:19.

One way with awk:
$ cat test.awk
# create a function to split hour and minute
function f(h, x) {
h[0] = substr(x,1,2)+0
h[1] = substr(x,3,2)+0
}
{
f(start, $3);
f(end, $4);
span = end[1] - start[1] >= 0 \
? sprintf("%d:%02d", end[0]-start[0], end[1]-start[1]) \
: sprintf("%d:%02d", end[0]-start[0]-1, 60+end[1]-start[1]);
print $0 OFS span
}
Then run the awk file as follows:
$ awk -f test.awk input_file
Edit: per @glenn jackman's suggestion, the code can be simplified (refer to @Kamil Cuk's method):
function g(x) {
return substr(x,1,2)*60 + substr(x,3,2)
}
{
span = g($4) - g($3)
printf("%s%s%d:%02d\n", $0, OFS, int(span/60), span%60)
}

A simple bash solution using arithmetic expansion:
while IFS='' read -r l; do
IFS=' ' read -r _ _ st et _ <<<"$l"
d=$(( (10#${et:0:2} * 60 + 10#${et:2:2}) - (10#${st:0:2} * 60 + 10#${st:2:2}) ))
printf "%s %02d:%02d\n" "$l" "$((d/60))" "$((d%60))"
done < input_file_path
will output:
0018803 01 1710 2050 002571 03:40
0018951 01 1934 2525 003277 05:51
0019362 02 2404 2415 002829 00:11
0019392 01 2621 2820 001924 01:59
0019542 01 2208 2413 003434 02:05
0019583 01 1815 2134 002971 03:19

Here is one in GNU awk using time functions, mktime to convert to epoch time and strftime to convert the time to desired format HH:MM:
$ awk -v OFS="\t" '{
dt3="1970 01 01 " substr($3,1,2) " " substr($3,3,2) " 00"
dt4="1970 01 01 " substr($4,1,2) " " substr($4,3,2) " 00"
print $0,strftime("%H:%M",mktime(dt4)-mktime(dt3),1) # thanks @glennjackman,1 :)
}' file
Output ($6 only):
03:40
05:51
00:11
01:59
02:05
03:19

multiple search using awk - pattern + arithmetic condition

I am trying to use awk to fetch records from a log file that match both of the following conditions:
the line contains the pattern EXEC_TIME
the last column, i.e. the EXEC_TIME value, is greater than 5000 ms.
I tried the command below, but it's not giving me the correct output, and I'm not sure whether it can be used this way.
I am just learning awk, so any help will be appreciated.
awk -F ':' '/EXEC_TIME/&&$15>="5000"{print $2,$15}' TransactionInfoLogs.log
MP170420.0548.T00003[SERV] 9065 ms
OC170420.0655.T00001[SERV] 708 ms
Below is a sample log file:
[TXN_ID]:MP170420.0548.T00003[SERV][SERV]:BLKSRVREQ[MSISDN]:8028359017[SV_CHRG_ID]:37152[RESP_CODE]:200[START]:Thu Apr 20 12:44:23 WAT 2017 [END]:Thu Apr 20 12:44:23 WAT 2017[EXEC_TIME]:9065 ms
[TXNID]:XX170420.1244.C01465[TYPE]:SERVICE_CHARGE_PAYER_PAYEE[AMT]:0[PR_MSISDN]:8028359017[PR_MFS]:101[PR_W_TYPE]:12[PR_PREBAL]:0[PR_BAL]:0[PY_MSISDN]:IND03[PY_MFS]:101[PY_W_TYPE]:null[PY_PRE
BAL]:2782239[PY_BAL]:2782239
[2017-04-20 12:44:29,552][http-bio-172.24.87.5-7890-exec-7365]-
[TXN_ID]:XX170420.1244.C01467[SERV]:null[MSISDN]:8080967233[RESP_CODE]:00066[START]:Thu Apr 20 12:44:29 WAT 2017 [END]:Thu Apr 20 12:44:29 WAT 2017[EXEC_TIME]:9 ms
[2017-04-20 12:44:36,634][http-bio-172.24.87.5-7890-exec-7364]-
[TXN_ID]:OC170420.0655.T00001[SERV]:null[MSISDN]:7016532415[RESP_CODE]:00066[START]:Thu Apr 20 12:44:36 WAT 2017 [END]:Thu Apr 20 12:44:36 WAT 2017[EXEC_TIME]:708 ms
[2017-04-20 12:44:45,820][http-bio-172.24.87.5-7890-exec-7359]-
[TXN_ID]:XX170420.1244.C01471[SERV]:null[MSISDN]:8026136275[RESP_CODE]:00066[START]:Thu Apr 20 12:44:45 WAT 2017 [END]:Thu Apr 20 12:44:45 WAT 2017[EXEC_TIME]:39 ms
[2017-04-20 12:44:46,010][http-bio-172.24.87.5-7890-exec-7366]-
[TXN_ID]:XX170420.1244.C01473[SERV]:BLKSRVREQ[MSISDN]:8127459541[SV_CHRG_ID]:37152[RESP_CODE]:200[START]:Thu Apr 20 12:44:45 WAT 2017 [END]:Thu Apr 20 12:44:46 WAT 2017[EXEC_TIME]:221 ms
[TXNID]:XX170420.1244.C01473[TYPE]:SERVICE_CHARGE_PAYER_PAYEE[AMT]:0[PR_MSISDN]:8127459541[PR_MFS]:101[PR_W_TYPE]:12[PR_PREBAL]:0[PR_BAL]:0[PY_MSISDN]:IND03[PY_MFS]:101[PY_W_TYPE]:null[PY_PRE
BAL]:2853870[PY_BAL]:2853870
[2017-04-20 12:44:49,989][http-bio-172.24.87.5-7890-exec-7371]-
[TXN_ID]:XX170420.1244.C01475[SERV]:BLKSRVREQ[MSISDN]:8089138902[SV_CHRG_ID]:37152[RESP_CODE]:200[START]:Thu Apr 20 12:44:49 WAT 2017 [END]:Thu Apr 20 12:44:49 WAT 2017[EXEC_TIME]:57 ms
[TXNID]:XX170420.1244.C01475[TYPE]:SERVICE_CHARGE_PAYER_PAYEE[AMT]:0[PR_MSISDN]:8089138902[PR_MFS]:101[PR_W_TYPE]:12[PR_PREBAL]:0[PR_BAL]:0[PY_MSISDN]:IND03[PY_MFS]:101[PY_W_TYPE]:null[PY_PRE
BAL]:3071459[PY_BAL]:3071459

Whenever you have name->value mappings in an input file it's a good idea to first create an array of that mapping (n2v[] below) and then you can just reference each field by its name rather than its position, e.g.:
$ cat tst.awk
{
delete n2v
while ( match($0,/\[[^]]+]:/) ) {
if ( name != "" ) {
value = substr($0,1,RSTART-1)
sub(/\[.*/,"",value)
n2v[name] = value
}
name = substr($0,RSTART+1,RLENGTH-3)
$0 = substr($0,RSTART+RLENGTH)
}
value = $0
n2v[name] = value
for (name in n2v) {
value = n2v[name]
print name, "->", value
}
}
$ head -1 file | awk -f tst.awk
EXEC_TIME -> 9065 ms
START -> Thu Apr 20 12:44:23 WAT 2017
RESP_CODE -> 200
SV_CHRG_ID -> 37152
TXN_ID -> MP170420.0548.T00003
END -> Thu Apr 20 12:44:23 WAT 2017
MSISDN -> 8028359017
SERV -> BLKSRVREQ
You can then tweak the above to do whatever you want:
$ cat tst.awk
{
delete n2v
while ( match($0,/\[[^]]+]:/) ) {
if ( name != "" ) {
value = substr($0,1,RSTART-1)
sub(/\[.*/,"",value)
n2v[name] = value
}
name = substr($0,RSTART+1,RLENGTH-3)
$0 = substr($0,RSTART+RLENGTH)
}
value = $0
n2v[name] = value
}
n2v["EXEC_TIME"]+0 > 5000 { print n2v["TXN_ID"], n2v["EXEC_TIME"] }
$ awk -f tst.awk file
MP170420.0548.T00003 9065 ms
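If only the EXEC_TIME test is needed, a shorter sketch is possible. It is not the script above: it pulls the digits that follow [EXEC_TIME]: out of the line and adds 0 to force a numeric comparison, since comparing a field against the quoted constant "5000", as the command in the question does, makes awk compare the two as strings. The helper name slow_txns and its threshold argument are hypothetical.

```shell
# Hypothetical helper: slow_txns THRESHOLD_MS < TransactionInfoLogs.log
slow_txns() {
  awk -v limit="$1" '
    # Only look at lines that carry an [EXEC_TIME]:<digits> tag.
    match($0, /\[EXEC_TIME]:[0-9]+/) {
      # "[EXEC_TIME]:" is 12 characters long; adding 0 makes the value numeric.
      ms = substr($0, RSTART + 12, RLENGTH - 12) + 0
      if (ms > limit + 0) {
        id = $0
        sub(/^.*\[TXN_ID]:/, "", id)   # drop everything up to the transaction id
        sub(/\[.*/, "", id)            # drop the next [NAME] tag and the rest
        print id, ms " ms"
      }
    }
  '
}
```

Calling slow_txns 5000 < TransactionInfoLogs.log would then list only the transactions that took longer than 5000 ms.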

The method parse_datetime from Perl's DateTime::Format::Strptime can't parse timezone name

I have a laptop running Ubuntu 12.04.
Running the date command at the console gives:
$ date
Thu May 8 15:28:12 WIB 2014
The Perl script below runs fine.
#!/usr/bin/perl
use DateTime::Format::Strptime;
$parser = DateTime::Format::Strptime->new( pattern => "%a %b %d %H:%M:%S %Y %Z");
$date = "Fri Sep 20 08:22:42 2013 WIB";
$dateimap = $parser->parse_datetime($date);
$date = $dateimap->strftime("%d-%b-%Y %H:%M:%S %z");
print "$date\n";
$date = "Fri Jan 8 16:49:34 2010 WIT";
$dateimap = $parser->parse_datetime($date);
$date = $dateimap->strftime("%d-%b-%Y %H:%M:%S %z");
print "$date\n";
The result is
20-Sep-2013 08:22:42 +0700
08-Jan-2010 16:49:34 +0900
But why is the timezone name "WIT" converted to "+0900"?
AFAIK, WIT is Western Indonesian Time. IMHO it should have the timezone "+0700", not "+0900".
The other computer runs CentOS 5.9.
Running the date command on CentOS gives:
$ date
Thu May 8 15:38:24 WIT 2014
But running the Perl script above gives:
20-Sep-2013 08:22:42 +0700
Can't call method "strftime" on an undefined value at strptime.pl line 14.
So the parse_datetime method can't parse a date that contains the "WIT" timezone.
The returned value $dateimap is empty or undef.
The CentOS machine's local time is set to Asia/Jakarta.
$ ls -l /etc/localtime
lrwxrwxrwx 1 root root 32 Sep 23 2013 /etc/localtime -> /usr/share/zoneinfo/Asia/Jakarta
Any suggestions?
Thank you.

Actually, the problem happens because the version of the DateTime::Format::Strptime module on CentOS 5.9 is 1.2000, while on Ubuntu 12.04 it is 1.54.
There is another problem when using the older version of DateTime::Format::Strptime:
$ perl -e '
> use DateTime::Format::Strptime;
> $parser = DateTime::Format::Strptime->new( pattern => "%a %b %d %H:%M:%S %Y");
> $datembox = "Wed Jan  1 06:42:18 2014 WIT";
> $date = $parser->parse_datetime($datembox);
> print "$date\n";'
$
If we collapse the double spaces in the variable $datembox, parsing succeeds:
$ perl -e '
> use DateTime::Format::Strptime;
> $parser = DateTime::Format::Strptime->new( pattern => "%a %b %d %H:%M:%S %Y");
> $datembox = "Wed Jan  1 06:42:18 2014 WIT";
> $datembox =~ s/[\s]+/ /g;
> $date = $parser->parse_datetime($datembox);
> print "$date\n";'
2014-01-01T06:42:18
$
