analyzing time tracking data in linux

I have a log file containing a time series of events. Now, I want to analyze the data to count the number of events in different intervals. Each entry shows that an event occurred at that timestamp. For example, here is part of the log file:
09:00:00
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
09:06:46
09:07:42
09:08:55
I need to count the events for 5 minutes intervals. The result should be like:
09:00 5 //which means 5 events from time 09:00:00 until 09:04:59
09:05 5 //which means 5 events from time 09:05:00 until 09:09:59
and so on.
Do you know any trick in bash, shell, awk, ...?
Any help is appreciated.

awk to the rescue.
awk -v FS="" '{min=$5<5?0:5; a[$1$2$4min]++} END{for (i in a) print i, a[i]}' file
Explanation
With an empty field separator, every character of the line becomes its own field, so the script looks at the 1st, 2nd, 4th and 5th characters: the hour digits, the tens digit of the minutes, and the ones digit of the minutes. To group the ones digit into the 0-4 and 5-9 ranges, it creates the variable min, which is 0 in the first case and 5 in the second, and it counts how many times each hour+minute-bucket key appears.
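If it helps to see the character splitting in isolation, here is a quick check (a small sketch relying on the same GNU awk treatment of an empty FS):
$ echo "09:03:04" | awk -v FS="" '{print $1, $2, $4, $5}'
0 9 0 3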
Sample
With your input,
$ awk -v FS="" '{min=$5<5?0:5; a[$1$2$4min]++} END{for (i in a) print i, a[i]}' a
0900 5
0905 5
With another sample input,
$ cat a
09:00:00
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
09:06:46
09:07:42
09:08:55
09:18:55
09:19:55
10:09:55
10:19:55
$ awk -v FS="" '{min=$5<5?0:5; a[$1$2$4min]++} END{for (i in a) print i, a[i]}' a
0900 5
0905 5
0915 2
1005 1
1015 1

another way with awk
awk -F : '{t=sprintf ("%02d",int($2/5)*5);a[$1 FS t]++}END{for (i in a) print i,a[i]}' file |sort -t: -k1n -k2n
09:00 5
09:05 5
explanation:
use : as the field separator
int($2/5)*5 is used to group the minutes into 5-minute buckets (00, 05, 10, 15, ...)
a[$1 FS t]++ counts the entries per hour:bucket key.
the final sort command outputs the results in time order.
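To see the bucketing arithmetic on its own (a quick sketch):
$ echo "09:18:55" | awk -F: '{printf "%02d\n", int($2/5)*5}'
15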

Perl with output piped through uniq just for fun:
$ cat file
09:00:00
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
09:06:46
09:07:42
09:08:55
09:18:55
09:19:55
10:09:55
10:19:55
11:21:00
Command:
perl -F: -lane 'print $F[0].sprintf(":%02d",int($F[1]/5)*5);' file | uniq -c
Output:
5 09:00
5 09:05
2 09:15
1 10:05
1 10:15
1 11:20
Or just perl:
perl -F: -lane '$t=$F[0].sprintf(":%02d",int($F[1]/5)*5); $c{$t}++; END { print join(" ", $_, $c{$_}) for sort keys %c }' file
Output:
09:00 5
09:05 5
09:15 2
10:05 1
10:15 1
11:20 1
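One caveat worth noting: the uniq -c variant only works because the log is already in time order (uniq collapses adjacent duplicates only), while the hash-based version does not care about input order. The bucket arithmetic itself is easy to check in isolation (a sketch):
$ perl -e 'printf "%02d\n", int(18/5)*5'
15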

I realize this is an old question, but when I stumbled onto it I couldn't resist poking at it from another direction...
sed -e 's/:/ /' -e 's/[0-4]:.*$/0/' -e 's/[5-9]:.*$/5/' | uniq -c
In this form it reads the data from standard input; alternatively, add the filename as the final sed argument, before the pipe.
It's not unlike Michal's initial approach, but if you happen to need a quick and dirty analysis of a huge log, sed is a lightweight and capable tool.
The assumption is that the data truly is in a regular format - any hiccups will appear in the result.
As a breakdown - given the input
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
and applying each edit clause individually, the intermediate results are as follows:
1) Replace the first colon with a space.
-e 's/:/ /'
09 00:35
09 01:20
09 02:51
09 03:04
09 05:12
09 06:08
2) Transform the ones digit of minutes 0 through 4 to 0, dropping everything after it.
-e 's/[0-4]:.*$/0/'
09 00
09 00
09 00
09 00
09 05:12
09 06:08
3) Transform the ones digit of minutes 5 through 9 to 5:
-e 's/[5-9]:.*$/5/'
09 00
09 00
09 00
09 00
09 05
09 05
Steps 2 and 3 also delete all trailing content from the lines (the seconds), which would otherwise make the lines non-unique (and hence 'uniq -c' would fail to produce the desired counts).
Perhaps the biggest strength of using sed as the front end is that you can select on lines of interest, for example, if root logged in remotely:
sed -e '/sshd.*: Accepted .* for root from/!d' -e 's/:/ /' ... /var/log/secure
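The same !d selection works on the timestamp data itself; for instance, to count only the 09 hour (a small sketch built on the pipeline above, with file standing in for the log file name):
sed -e '/^09:/!d' -e 's/:/ /' -e 's/[0-4]:.*$/0/' -e 's/[5-9]:.*$/5/' file | uniq -c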

Related

awk: get log data part by part

the log file is
Oct 01 [time] a
Oct 02 [time] b
Oct 03 [time] c
.
.
.
Oct 04 [time] d
Oct 05 [time] e
Oct 06 [time] f
.
.
.
Oct 28 [time] g
Oct 29 [time] h
Oct 30 [time] i
and it is really big (millions of lines)
I want to get the logs between Oct 01 and Oct 30
I can do it with gawk
gawk 'some conditions' filter.log
and it works correctly.
but it returns millions of log lines, which is not good,
because I want to get it part by part,
something like this:
gawk 'some conditions' -limit 100 -offset 200 filter.log
so that every time I change limit and offset
I get another part of it.
How can I do that?
awk solution
I would harness GNU AWK for this task the following way; let file.txt content be
1
2
3
4
5
6
7
8
9
and say I want to print the lines whose 1st field is odd, in the part starting at the 3rd line and ending at the 7th line (inclusive); then I can use GNU AWK the following way
awk 'NR<3{next}$1%2{print}NR>=7{exit}' file.txt
which will give
3
5
7
Explanation: NR is a built-in variable which holds the number of the current row. When processing rows before the 3rd, just go to the next row without doing anything; when the remainder of dividing the 1st field by 2 is non-zero, print the line; when processing the 7th or a later row, just exit. Using exit might give a noticeable boost in performance if you are processing a relatively small part of the file. Observe the order of the 3 pattern-action pairs in the code above: next is first, then whatever you want to do, and exit is last. If you want to know more about NR, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in GNU Awk 5.0.1)
linux solution
If you prefer working with offset and limit, then you might exploit a tail-head combination, e.g. for the above file.txt
tail -n +5 file.txt | head -3
gives output
5
6
7
observe that the offset goes first, with a + before its value, and then the limit, with a - before its value.
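To combine the two ideas for the OP's log, you can filter first and then page through the matches; here /^Oct/ is only a hypothetical placeholder for the real conditions (a sketch):
gawk '/^Oct/' filter.log | tail -n +201 | head -n 100
tail -n +201 skips the first 200 matching lines (the offset) and head -n 100 keeps the next 100 (the limit).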
Using OP's pseudo code mixed with some actual awk code:
gawk -v limit=100 -v offset=200 '
some conditions {                    # "some conditions" stands for the OP's filter
    matches++                        # track number of matches
    if (matches >= offset && limit > 0) {
        print                        # print current line
        limit--                      # decrement limit
    }
    if (limit == 0) exit             # optional: abort processing once "limit" matches have been printed
}
' filter.log

All but the last part of a pipeline is ignored, using sed file | sed file | head file

I am trying to use a pipe in UNIX to substitute some words (sed) and then use the head command to show me the first 9 lines of the file. However, the head command ignores the sed output and just does its own thing on the file. Here is what I am trying to do:
$ sed 's/[n]d/nd STREET/g' street | sed 's/[r]d/rd STREET/g' street | head -n 9 street
01 m motzart amadeous 25 2nd 94233
02 m guthrie woody 23 2nd 94223
03 f simone nina 27 2nd 94112
04 m lennon john 29 2nd 94221
05 f harris emmylou 20 2nd 94222
06 m marley bob 22 2nd 94112
07 f marley rita 26 2nd 94212
08 f warwick dione 26 2nd 94222
09 m prine john 35 3rd 94321
sed and head only read from stdin when they aren't given a filename to read from instead. Therefore, when you give head the name street, it ignores its standard input (which is where the output from sed is).
Provide the filename only once, at the front of the pipeline.
$ <street sed -e 's/[n]d/nd STREET/g' -e 's/[r]d/rd STREET/g' | head -n 9
By the way, you could also write this to use only one sed operation to handle both nd and rd:
$ <street sed -e 's/\([rn]d\)/\1 STREET/g' | head -n 9
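For example, running the combined substitution over the first sample line above should give:
$ printf '01 m motzart amadeous 25 2nd 94233\n' | sed 's/\([rn]d\)/\1 STREET/g'
01 m motzart amadeous 25 2nd STREET 94233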

converting 4 digit year to 2 digit in shell script

I have file as:
$cat file.txt
1981080512 14 15
2019050612 17 18
2020040912 19 95
Here the 1st column represents dates as YYYYMMDDHH
I would like to write the dates as YYMMDDHH. So the desired output is:
81080512 14 15
19050612 17 18
20040912 19 95
My script:
while read -r x;do
yy=$(echo $x | awk '{print substr($0,3,2)}')
mm=$(echo $x | awk '{print substr($0,5,2)}')
dd=$(echo $x | awk '{print substr($0,7,2)}')
hh=$(echo $x | awk '{print substr($0,9,2)}')
awk '{printf "%10s%4s%4s\n",'$yy$mm$dd$hh',$2,$3}'
done < file.txt
It is printing
81080512 14 15
81080512 17 18
Any help please. Thank you.
Please don't kill me for this simple answer, but what about this:
cut -c 3- file.txt
You simply drop the first two digits by keeping character 3 till the end of every line (the -c switch tells cut to select characters rather than bytes or fields).
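With the sample file above, this gives:
$ cut -c 3- file.txt
81080512 14 15
19050612 17 18
20040912 19 95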
You can do it using a single GNU AWK substr as follows; let file.txt content be
1981080512 14 15
2019050612 17 18
2020040912 19 95
then
awk '{$1=substr($1,3);print}' file.txt
output
81080512 14 15
19050612 17 18
20040912 19 95
Explanation: I used the substr function to get the 3rd and subsequent characters of the 1st column and assign them back to that column, then print the changed line.
(tested in gawk 4.2.1)
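A quick way to see the substr call in isolation (a sketch on the first sample line):
$ echo "1981080512 14 15" | awk '{print substr($1,3)}'
81080512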

Filter log lines within a 10 minute interval

I have the below lines in my dummy.text file. I would like to filter this data using a bash script or awk.
Jul 28 15:05:47 * aaa has joined
Jul 28 15:07:47 * bbb has joined
Jul 28 15:08:41 * ccc has joined
Jul 28 15:13:32 * ddd has joined
Jul 28 15:14:40 * eee has joined
For example, let's say aaa has joined the session at time 15:05:47 and ccc joined the session at 15:08:41. I want to get the lines for everyone who joined at or after 15:00:00 and before 15:10:00. The expected result would be:
Jul 28 15:05:47 * aaa has joined
Jul 28 15:07:47 * bbb has joined
Jul 28 15:08:41 * ccc has joined
Side note: after getting the expected output, I'm looking to write a cron job in which this data will be forwarded by mail.
One way:
awk -F'[ :]' '$3 == 15 && $4 >= 0 && $4 < 10' file.txt
If you specify the 10-minute interval as 15:00:00 up to but not including 15:10:00, then:
awk -v start=15:00:00 -v end=15:10:00 '$3 >= start && $3 < end'
If you decide you want to omit the final :00 for the seconds from the times, then:
awk -v start=15:00 -v end=15:10 '$3 >= start ":00" && $3 < end ":00"'
Both of these will report on entries in the time interval on any day. If you want to restrict the date, then you can apply further conditions (on $1 and $2).
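For example, restricting to Jul 28 as well might look like this (a sketch following the same pattern):
awk -v start=15:00 -v end=15:10 '$1 == "Jul" && $2 == 28 && $3 >= start ":00" && $3 < end ":00"' dummy.text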
If you calculate the start and end values in shell variables, then:
start=$(…) # Calculate start time hh:mm
end=$(…) # Calculate end time hh:mm
awk -v start="$start" -v end="$end" '$3 >= start ":00" && $3 < end ":00"'
awk '$3 >= "15:05:47" && $3 <= "15:08:47"' dummy.text

How to keep only those rows which are unique in a tab-delimited file in unix

Here, two rows are considered redundant if their second value is the same.
Is there any unix/linux command that can achieve the following?
1 aa
2 aa
1 ss
3 dd
4 dd
Result
1 aa
1 ss
3 dd
I generally use the following command but it does not achieve what I want here.
sort -k2 /Users/fahim/Desktop/delnow2.csv | uniq
Edit:
My file had roughly 25 million lines:
Time when using the solution suggested by @steve: 33 seconds.
$date; awk -F '\t' '!a[$2]++' myfile.txt > outfile.txt; date
Wed Nov 27 18:00:16 EST 2013
Wed Nov 27 18:00:49 EST 2013
The sort and uniq approach was taking too much time; I quit after waiting for 5 minutes.
Perhaps this is what you're looking for (it prints a line only the first time its second field has been seen):
awk -F "\t" '!a[$2]++' file
Results:
1 aa
1 ss
3 dd
I understand that you want a unique sorted file by the second field.
You need to add -u to sort to achieve this.
sort -u -k2 /Users/fahim/Desktop/delnow2.csv
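One caveat with -k2: the sort key runs from field 2 to the end of the line, and the output is ordered by that key rather than kept in the original order. If you want uniqueness decided by the second column alone, restricting the key may be closer to what you want (a sketch):
sort -u -k2,2 /Users/fahim/Desktop/delnow2.csv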
