awk: get log data part by part

awk: get log data part by part - linux

the log file is
Oct 01 [time] a
Oct 02 [time] b
Oct 03 [time] c
.
.
.
Oct 04 [time] d
Oct 05 [time] e
Oct 06 [time] f
.
.
.
Oct 28 [time] g
Oct 29 [time] h
Oct 30 [time] i
and it is really big ( millions of lines )
I wanna to get logs between Oct 01 and Oct 30
I can do it with gawk
gawk 'some conditions' filter.log
and it works correctly.
and it return millions of log lines that is not good
because I wanna to get it part by part
some thing like this
gawk 'some conditions' -limit 100 -offset 200 filter.log
and every time when I change limit and offset
I can get another part of that.
How can I do that ?

awk solution
I would harness GNU AWK for this task following way, let file.txt content be
1
2
3
4
5
6
7
8
9
and say I want to print such lines that 1st field is odd in part starting at 3th line and ending at 7th line (inclusive), then I can use GNU AWK following way
awk 'NR<3{next}$1%2{print}NR>=7{exit}' file.txt
which will give
3
5
7
Explanation: NR is built-in variable, which hold number of row, when processing lines before 3 just go to next row without doing anything, when remainder from division by 2 is non-zero do print line, when processing 7th or further row just exit. Using exit might give notice boost in performance if you are processing relatively small part of file. Observe order of 3 pattern-action pairs in code above: next is first, then whatever you do want do, exit is last. If you want to know more about NR read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in GNU Awk 5.0.1)
linux solution
If you prefer working with offset limit, then you might exploit tail-head combination e.g. for above file.txt
tail -n +5 file.txt | head -3
gives output
5
6
7
observe that offset goest first and with + before value then limit with - before value.

Using OP's pseudo code mixed with some actual awk code:
gawk -v limit=100 -v offset=200 '
some conditions { matches++ # track number of matches
if (matches >= offset and limit > 0) {
print # print current line
limit-- # decrement limit
}
if (limit == 0) exit # optional: abort processing if we found "limit" number of matches
}
' filter.log

Related

Illogical number priority in file names in BASH [duplicate]

This question already has answers here:
How to loop over files in natural order in Bash?
(7 answers)
Closed 1 year ago.
It so happens that I wrote a script in BASH, part of which is supposed to take files from a specified directory in numerical order. Obviously, files in that directory are named as follows: 1, 2, 3, 4, 5, etc. The thing is, I discovered that while running this script with 10 files in the directory, something that appears quite illogical to me, occurs, as the script takes files in strange order: 10, 1, 2, 3, etc.
How do I make it run from minimum value of name of a file to maximum in decimals?
Also, I am using the following line of code to define loop and path:
for file in /dir/*
Don't know if it matters, but I'm using Fedora 33 as OS.

Directories are sorted by alphabetical order. So "10" is before "2".
If I list 20 files whose names correspond to the 20 first integers, I get:
1 10 11 12 13 14 15 16 17 18 19 2 20 3 4 5 6 7 8 9
I can call the function 'sort -n' so I'll sort them numerically rather than alphabetically. The following command:
for i in $(ls | sort -n) ; do echo $i ; done
produces the following output:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
i.e. your command:
for file in /dir/*
should be rewritten:
for file in "dir/"$(ls /dir/* | sort -n)

If you have GNU sort then use the -V flag.
for file in /dir/* ; do echo "$file" ; done | sort -V
Or store the data in an array.
files=(/dir/*); printf '%s\n' "${files[#]}" | sort -V

As an aside, if you have the option and work once ahead of time is preferable to sorting every time, you could also format the names of your directories with leading zeroes. This is frequently a better design when possible.
I made both for some comparisons.
$: echo [0-9][0-9]/ # perfect list based on default string sort
00/ 01/ 02/ 03/ 04/ 05/ 06/ 07/ 08/ 09/ 10/ 11/ 12/ 13/ 14/ 15/ 16/ 17/ 18/ 19/ 20/
That also filters out any non-numeric names, and any non-directories.
$: for d in [0-9][0-9]/; do echo "${d%/}"; done
00
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
If I show both single- and double-digit versions (I made both)
$: shopt -s extglob
$: echo #(?|??)
0 00 01 02 03 04 05 06 07 08 09 1 10 11 12 13 14 15 16 17 18 19 2 20 3 4 5 6 7 8 9
Only the single-digit versions without leading zeroes get out of order.

The shell sorts the names by the locale order (not necessarily the byte value) of each individual character. Anything that starts with 1 will go before anything that starts with 2, and so on.
There's two main ways to tackle your problem:
sort -n (numeric sort) the file list, and iterate that.
Rename or recreate the target files (if you can), so all numbers are the same length (in bytes/characters). Left pad shorter numbers with 0 (eg. 01). Then they'll expand like you want.
Using sort (properly):
mapfile -td '' myfiles <(printf '%s\0' * | sort -zn)
for file in "${myfiles[#]}"; do
# what you were going to do
sort -z for zero/null terminated lines is common but not posix. It makes processing paths/data that contains new lines safe. Without -z:
mapfile -t myfiles <(printf '%s\n' * | sort -n)
# Rest is the same.
Rename the target files:
#!/bin/bash
cd /path/to/the/number/files || exit 1
# Gets length of the highest number. Or you can just hardcode it.
length=$(printf '%s\n' * | sort -n | tail -n 1)
length=${#length}
for i in *; do
mv -n "$i" "$(printf "%.${length}d" "$i")"
done
Examples for making new files with zero padded numbers for names:
touch {000..100} # Or
for i in {000..100}; do
> "$i"
done
If it's your script that made the target files, something like $(printf %.Nd [file]) can be used to left pad the names before you write to them. But you need to know the length in characters of the highest number first (N).

Adding a number to column [line by line]

I have a text file named text: The row and columns are:
1 A 18 -180
2 B 19 -180
3 C 20 -150
50 D 21 -100
128 E 22 -130
10 F 23 -0
10 G 23 -0
What I want to do is to print out the 4th column with adding a constant number to each of the lines (except ==0). To do this is what I have done.
#!/bin/bash
FILE="/dir/text"
while IFS= read -r line
do
echo "$line"
done <"$FILE"
I can read the fourth column, but at the same time I want to put an argument $1 which will add a constant number to all of the lines in the fourth column except any line of the fourth column has ==0.
UPDATE:
The Desired output would be like: [the line has zeros are ignored]
-160
-160
-130
-80
-110
For example, the program name is example.sh. I want to add a number to the fourth column using an argument. Therefore it would be:
example.sh $1
where $1 could be any number I want to add in the 4th column.

You should awk here which will be faster than bash.
awk -v number="100" '$4!=0{$4+=number} 1' Input_file
number is an awk variable where you could set its value as per your need.
Explanation: Adding detailed explanation for above code.
awk -v number="100" ' ##Starting awk program from here and creating a variable number whose value is 100.
$4!=0{ ##Checking condition if 4th column is NOT zero then do following.
$4+=number ##Adding variable number to 4th column here.
}
1 ##Mentioning 1 will print edited/non-edited lines.
' Input_file ##mentioning Input_file name here.

In order to preserve your formatting using awk while adding the values to the 4th field, you can calculate the new value of the 4th field and then use sub to change the value without forcing awk to recalculate the fields and removing the whitespace.
For example, with your file stored as text and adding a value of 180 to the 4th field (except where 0), you could do:
awk -v n=180 '$4!=0 {newval=$4+n; sub(/[0-9]+$/,newval)}1' text
Doing so would produce the following output:
$ awk -v n=180 '$4!=0 {newval=$4+n; sub(/[0-9]+$/,newval)}1' text
1 A 18 0
2 B 19 0
3 C 20 30
50 D 21 80
128 E 22 50
10 F 23 -0
10 G 23 -0
If called withing a shell script, you could pass your $1 parameter as:
awk -v n="$1" '$4!=0 {newval=$4+n; sub(/[0-9]+$/,newval)}1' text
Though I would suggest checking that an argument has been provided to the script with:
[ -z "$1" ] && {
echo "error: value require as argument"
exit 1
}
or you can provide a default value -- up to you.

With bash:
while read -ra a; do [[ ${a[3]} != -0 ]] && ((a[3]+=42)); echo "${a[#]}"; done < file
Output:
1 A 18 -138
2 B 19 -138
3 C 20 -108
50 D 21 -58
128 E 22 -88
10 F 23 -0
10 G 23 -0

How to make awk print one of 2 different fields based on which of it matches the condition

I have a ascii table in Linux which would look like this:
Oct Dec Hex Char Oct Dec Hex Char
-------------------------------------------------------------
056 46 2E . 156 110 6E n
I want to build a one liner in awk, which would match the 3rd and 7th field to corresponding hex character , say "2E". If 3rd field matches then print 4th field, i.e ".". Else if 7th field matches to "2E", then print corresponding 8th field.
I have written something like this:
man ascii | awk '$3 == "2E"{print $4};$7 == "2E"{print $8}'
Output:
.
But the above works only if the match happens in 3rd field. If it happens in 7th field it prints nothing. For example for this case:
man ascii | awk '$3 == "6E"{print $4};$7 == "6E"{print $8}'
Expected output:
n
Output I'm getting:
nothing

analyzing time tracking data in linux

I have a log file containing a time series of events. Now, I want to analyze the data to count the number of event for different intervals. Each entry shows that an event has occured in this timestamp. For example here is a part of log file
09:00:00
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
09:06:46
09:07:42
09:08:55
I need to count the events for 5 minutes intervals. The result should be like:
09:00 4 //which means 4 events from time 09:00:00 until 09:04:59<br>
09:05 5 //which means 4 events from time 09:00:05 until 09:09:59<br>
and so on.
Do you know any trick in bash, shell, awk, ...?
Any help is appreciated.

awk to the rescue.
awk -v FS="" '{min=$5<5?0:5; a[$1$2$4min]++} END{for (i in a) print i, a[i]}' file
Explanation
It gets the values of the 1st, 2nd, 4th and 5th characters in every line and keeps track of how many times they have appeared. To group in 0-4 and 5-9 range, it creates the var min that is 0 in the first case and 5 in the second.
Sample
With your input,
$ awk -v FS="" '{min=$5<5?0:5; a[$1$2$4min]++} END{for (i in a) print i, a[i]}' a
0900 5
0905 5
With another sample input,
$ cat a
09:00:00
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
09:06:46
09:07:42
09:08:55
09:18:55
09:19:55
10:09:55
10:19:55
$ awk -v FS="" '{min=$5<5?0:5; a[$1$2$4min]++} END{for (i in a) print i, a[i]}' a
0900 5
0905 5
0915 2
1005 1
1015 1

another way with awk
awk -F : '{t=sprintf ("%02d",int($2/5)*5);a[$1 FS t]++}END{for (i in a) print i,a[i]}' file |sort -t: -k1n -k2n
09:00 5
09:05 5
explanation:
use : as field seperator
int($2/5)*5 is used to group the minutes into every 5 minute (00,05,10,15...)
a[$1 FS t]++ count the numbers.
the last sort command will output the sorted time.

Perl with output piped through uniq just for fun:
$ cat file
09:00:00
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
09:06:46
09:07:42
09:08:55
09:18:55
09:19:55
10:09:55
10:19:55
11:21:00
Command:
perl -F: -lane 'print $F[0].sprintf(":%02d",int($F[1]/5)*5);' file | uniq -c
Output:
5 09:00
5 09:05
2 09:15
1 10:05
1 10:15
1 11:20
1 11:00
Or just perl:
perl -F: -lane '$t=$F[0].sprintf(":%02d",int($F[1]/5)*5); $c{$t}++; END { print join(" ", $_, $c{$_}) for sort keys %c }' file
Output:
09:00 5
09:05 5
09:15 2
10:05 1
10:15 1
11:00 1
11:20 1

I realize this is an old question, but when I stumbled onto it I couldn't resist poking at it from another direction...
sed -e 's/:/ /' -e 's/[0-4]:.*$/0/' -e 's/[5-9]:.*$/5/' | uniq -c
In this form it assumes the data is from standard input, or add the filename as the final argument before the pipe.
It's not unlike Michal's initial approach, but if you happen to need a quick and dirty analysis of a huge log, sed is a lightweight and capable tool.
The assumption is that the data truly is in a regular format - any hiccups will appear in the result.
As a breakdown - given the input
09:00:35
09:01:20
09:02:51
09:03:04
09:05:12
09:06:08
and applying each edit clause individually, the intermediate results are as follows:
1) Eliminate the first colon.
-e 's/:/ /'
09 00:35
09 01:20
09 02:51
09 03:04
09 05:12
2) Transform minutes 0 through 4 to 0.
-e 's/[0-4]:.*$/0/'
09 00
09 00
09 00
09 00
09 05:12
09 06:08
3) Transform minutes 5-9 to 5:
-e 's/[5-9]:.*$/5/'
09 00
09 00
09 00
09 00
09 05
09 05
2 and 3 also delete all trailing content from the lines, which would make the lines non-unique (and hence 'uniq -c' would fail to produce the desired results).
Perhaps the biggest strength of using sed as the front end is that you can select on lines of interest, for example, if root logged in remotely:
sed -e '/sshd.*: Accepted .* for root from/!d' -e 's/:/ /' ... /var/log/secure

How to extract every N columns and write into new files?

I've been struggling to write a code for extracting every N columns from an input file and write them into output files according to their extracting order.
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In a simpler case below, extracting every 3 columns(fields) from an input file with a start point of the 2nd column.
for example, if the input file looks like:
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
and I want the output to look like this:
output_file_1:
1 2 3
1 2 3
1 2 3
1 2 3
output_file_2:
4 5 6
4 5 6
4 5 6
4 5 6
output_file_3:
7 8 9
7 8 9
7 8 9
7 8 9
I tried this, but it doesn't work:
awk 'for(i=2;i<=10;i+a) {{printf "%s ",$i};a=3}' <inputfile>
It gave me syntax error and the more I fix the more problems coming out.
I also tried the linux command cut but while I was dealing with large files this seems effortless. And I wonder if cut would do a loop cut of every 3 fields just like the awk.
Can someone please help me with this and give a quick explanation? Thanks in advance.

Actions to be performed by awk on the input data must be included in curled braces, so the reason the awk one-liner you tried results in a syntax error is that the for cycle does not respect this rule. A syntactically correct version will be:
awk '{for(i=2;i<=10;i+a) {printf "%s ",$i};a=3}' <inputfile>
This is syntactically correct (almost, see end of this post.), but does not do what you think.
To separate the output by columns on different files, the best thing is to use awk redirection operator >. This will give you the desired output, given that your input files always has 10 columns:
awk '{ print $2,$3,$4 > "file_1"; print $5,$6,$7 > "file_2"; print $8,$9,$10 > "file_3"}' <inputfile>
mind the " " to specify the filenames.
EDITED: REAL WORLD CASE
If you have to loop along the columns because you have too many of them, you can still use awk (gawk), with two loops: one on the output files and one on the columns per file. This is a possible way:
#!/usr/bin/gawk -f
BEGIN{
CTOT = 24005 # total number of columns, you can use NF as well
DELTA = 800 # columns per file
START = 6 # first useful column
d = CTOT/DELTA # number of output files.
}
{
for ( i = 0 ; i < d ; i++)
{
for ( j = 0 ; j < DELTA ; j++)
{
printf("%f\t",$(START+j+i*DELTA)) > "file_out_"i
}
printf("\n") > "file_out_"i
}
}
I have tried this on the simple input files in your example. It works if CTOT can be divided by DELTA. I assumed you had floats (%f) just change that with what you need.
Let me know.
P.s. going back to your original one-liner, note that the loop is an infinite one, as i is not incremented: i+a must be substituted by i+=a, and a=3 must be inside the inner braces:
awk '{for(i=2;i<=10;i+=a) {printf "%s ",$i;a=3}}' <inputfile>
this evaluates a=3 at every cycle, which is a bit pointless. A better version would thus be:
awk '{for(i=2;i<=10;i+=3) {printf "%s ",$i}}' <inputfile>
Still, this will just print the 2nd, 5th and 8th column of your file, which is not what you wanted.

awk '{ print $2, $3, $4 >"output_file_1";
print $5, $6, $7 >"output_file_2";
print $8, $9, $10 >"output_file_3";
}' input_file
This makes one pass through the input file, which is preferable to multiple passes. Clearly, the code shown only deals with the fixed number of columns (and therefore a fixed number of output files). It can be modified, if necessary, to deal with variable numbers of columns and generating variable file names, etc.
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In that case, you're correct; you need a loop. In fact, you need two loops:
awk 'BEGIN { gap = 800; start = 6; filebase = "output_file_"; }
{
for (i = start; i < start + gap; i++)
{
file = sprintf("%s%d", filebase, i);
for (j = i; j <= NF; j += gap)
printf("%s ", $j) > file;
printf "\n" > file;
}
}' input_file
I demonstrated this to my satisfaction with an input file with 25 columns (numbers 1-25 in the corresponding columns) and gap set to 8 and start set to 2. The output below is the resulting 8 files pasted horizontally.
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25

With GNU awk:
$ awk -v d=3 '{for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3",""); print "----"}' file
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
Just redirect the output to files if desired:
$ awk -v d=3 '{sfx=0; for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3","") > ("output_file_" ++sfx)}' file
The idea is just to tell gensub() to skip the first few (i-1) fields then print the number of fields you want (d = 3) and ignore the rest (.*). If you're not printing exact multiples of the number of fields you'll need to massage how many fields get printed on the last loop iteration. Do the math...
Here's a version that'd work in any awk. It requires 2 loops and modifies the spaces between fields but it's probably easier to understand:
$ awk -v d=3 '{sfx=0; for(i=2;i<=NF;i+=d) {str=fs=""; for(j=i;j<i+d;j++) {str = str fs $j; fs=" "}; print str > ("output_file_" ++sfx)} }' file

I was successful using the following command line. :) It uses a for loop and pipes the awk program into it's stdin using -f -. The awk program itself is created using bash variable math.
for i in 0 1 2; do
echo "{print \$$((i*3+2)) \" \" \$$((i*3+3)) \" \" \$$((i*3+4))}" \
| awk -f - t.file > "file$((i+1))"
done
Update: After the question has updated I tried to hack a script that creates the requested 800-cols-awk script dynamically ( a version according to Jonathan Lefflers answer) and pipe that to awk. Although the scripts looks good (for me ) it produces an awk syntax error. The question is, is this too much for awk or am I missing something? Would really appreciate feedback!
Update: Investigated this and found documentation that says awk has a lot af restrictions. They told to use gawk in this situations. (GNU's awk implementation). I've done that. But still I'll get an syntax error. Still feedback appreciated!
#!/bin/bash
# Note! Although the script's output looks ok (for me)
# it produces an awk syntax error. is this just too much for awk?
# open pipe to stdin of awk
exec 3> >(gawk -f - test.file)
# verify output using cat
#exec 3> >(cat)
echo '{' >&3
# write dynamic script to awk
for i in {0..24005..800} ; do
echo -n " print " >&3
for (( j=$i; j <= $((i+800)); j++ )) ; do
echo -n "\$$j " >&3
if [ $j = 24005 ] ; then
break
fi
done
echo "> \"file$((i/800+1))\";" >&3
done
echo "}"

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

awk: get log data part by part - linux

Related

Illogical number priority in file names in BASH [duplicate]

Adding a number to column [line by line]

How to make awk print one of 2 different fields based on which of it matches the condition

analyzing time tracking data in linux

How to extract every N columns and write into new files?

Categories

Resources