I have some complex log files that I need to write tools to process. I have been playing with awk, but I am not sure whether awk is the right tool for this.
My log files are printouts of OSPF protocol decodes: a text log of the various protocol pkts and their contents, with each protocol field identified along with its value. I want to process these files and print out only certain lines of the log that pertain to specific pkts. Each pkt's log entry can consist of a varying number of lines.
awk seems to be able to process a single line that matches a pattern. I can locate the desired pkt but then I need to match patterns in the lines that follow in order to determine if it is a pkt I want to print out.
Another way to look at this is that I would want to isolate several lines in the log file and print out those lines that are the details of a particular pkt based on pattern matches on several lines.
Since awk seems to be line-based, I am not sure if that would be the best tool to use.
If awk can do this, how is it done? If not, any suggestions on which tool to use for this?
Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.
Consider this input:
how
second half #1
now
first half
second half #2
brown
second half #3
cow
As you have seen, it's easy to recognize a single pattern. Now, we can write an awk program that recognizes second half only when it is directly preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)
/second half/ {
    if(lastLine == "first half") {
        print
    }
}
{ lastLine = $0 }
If you run this you will see:
second half #2
Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and transition from state-to-state depending on both the existing state and the current input. But you may not need that much control mechanism.
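For comparison, here is a minimal sketch of the same example written with an explicit state variable, as described above. The state names "idle" and "armed" and the function name find_pairs are made up for illustration; the input is the sample data from this answer.

```shell
# Explicit-state version of the "first half" / "second half" example (a sketch).
# state is "idle" until a "first half" line is seen, then "armed" for exactly one line.
find_pairs() {
    awk '
    BEGIN { state = "idle" }
    # A "second half" line completes the sequence only while we are armed.
    state == "armed" && /second half/ { print }
    # Transition on every line, based on the current input.
    { state = ($0 == "first half") ? "armed" : "idle" }
    '
}

result=$(printf '%s\n' how 'second half #1' now 'first half' \
    'second half #2' brown 'second half #3' cow | find_pairs)
printf '%s\n' "$result"
```

With more states (for example one per line of a packet header you need to see in order), the transition rule grows into a small table, but the shape stays the same.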
awk is able to process from a start pattern until an end pattern:
/start-pattern/,/end-pattern/ {
    print
}
I was looking for how to match
* Implements hook_entity_info_alter().
*/
function file_test_entity_type_alter(&$entity_types) {
so created
/\* Implements hook_/,/function / {
    print
}
which printed the content I needed. A more complex example skips lines and scrubs off the unwanted parts of words. Note that awk is a record (line) and word (split by whitespace) tool.
# start,end pattern match using comma
# (awk regexes have no lazy quantifiers such as .*?, so the patterns are kept simple)
/ \* Implements hook_/,/function / {
    # skip PHP multi line comment end
    if ($0 ~ / \*\//) next
    # Only print 3rd word
    if ($0 ~ /Implements/) {
        hook=$3
        # scrub off opening parenthesis and following.
        sub(/\(.*$/, "", hook)
        print hook
    }
    # Only print function name without parenthesis
    if ($0 ~ /function/) {
        name=$2
        # scrub off opening parenthesis and following.
        sub(/\(.*$/, "", name)
        print name
        print ""
    }
}
Hope this helps too.
See also GAWK ranges for more info.
Awk is really record-based. By default it thinks of a line as a record, but you can alter that with the RS (record separator) variable.
One way to approach this would be to do a first pass using sed (you could do this with awk, too, if you prefer), to separate the records with a different character like a form-feed. Then you can write your awk script where it will treat the group of lines as a single record.
For example, if this is your data:
animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
To separate the records with form-feeds:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|'
Now we'll take that and pass it through awk. Here's an example of conditionally printing a record:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' | awk '
BEGIN { RS="\f" }
/type: cat/ { print }'
outputs:
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
Edit: as a bonus, here's how to do it with awk-ward ruby (-014 means use form-feed (octal code 014) as the record separator):
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' |
ruby -014 -ne 'print if /type: cat/'
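The sed pre-pass is optional. As a rough sketch of the "you could do this with awk, too" remark, the script below buffers each "animal" block itself and prints it only if a line in the block matched. The names cats_only, emit, block and keep are made up for illustration, and it assumes every record starts with a line beginning with "animal".

```shell
# Print every "animal" block that contains a "type: cat" line (a sketch).
cats_only() {
    awk '
    # Print the buffered block if it was flagged, then reset the buffer.
    function emit() {
        if (keep) printf "%s", block
        block = ""
        keep = 0
    }
    /^animal/   { emit() }                  # a header line starts a new block
                { block = block $0 "\n" }   # buffer every line of the block
    /type: cat/ { keep = 1 }                # remember that this block matched
    END         { emit() }                  # the last block has no following header
    '
}

result=$(printf '%s\n' 'animal 0' 'name: joe' 'type: dog' \
    'animal 1' 'name: bill' 'type: cat' \
    'animal 2' 'name: ed' 'type: cat' | cats_only)
printf '%s\n' "$result"
```

The trade-off is that the whole block is held in memory until its end is known, which is usually fine for log entries of a few dozen lines.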
I do this sort of thing with sendmail logs, from time to time.
Given:
Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www@web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www@web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3@nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc@europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
I use a script something like this:
#!/usr/bin/awk -f
BEGIN {
    search=ARGV[1]; # Grab the first command line option
    delete ARGV[1]; # Delete it so it won't be considered a file
}
# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
    line[$6]=sprintf("%s\n%s", line[$6], $0);
}
# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
    show[$6];
}
# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
    for(qid in show) {
        print line[qid];
    }
}
to get the following output:
$ mqsearch airtel /var/log/maillog
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
The idea here is that I'm printing all lines that match the Sendmail Queue ID of the string I want to search for. The structure of the code is of course a product of the structure of the log file, so you'll need to customize your solution for the data you're trying to analyse and extract.
awk '/pattern-start/,/pattern-end/'
`pcregrep -M` works pretty well for this.
From pcregrep(1):
-M, --multiline
Allow patterns to match more than one line. When this option is given,
patterns may usefully contain literal newline characters and internal
occurrences of ^ and $ characters. The output for a successful match
may consist of more than one line, the last of which is the one in which the match ended. If
the matched string ends with a newline sequence the output ends at the
end of that line.
When this option is set, the PCRE library is called in “multiline”
mode. There is a limit to the number of lines that can be matched,
imposed by the way that pcregrep buffers the input file as it scans
it. However, pcregrep ensures that at least 8K characters or the rest
of the document (whichever is the shorter) are available for forward
matching, and similarly the previous 8K characters (or all the
previous characters, if fewer than 8K) are guaranteed to be available
for lookbehind assertions. This option does not work when input is
read line by line (see --line-buffered.)
Is there a way to remove both duplicates and redundant substrings from a list, using shell tools? By "redundant", I mean a string that is contained within another string, so "foo" is redundant with "foobar" and "barfoo".
For example, take this list:
abcd
abc
abd
abcd
bcd
and return:
abcd
abd
uniq, sort -u and awk '!seen[$0]++' remove duplicates effectively but not redundant strings:
How to delete duplicate lines in a file without sorting it in Unix?
Remove duplicate lines without sorting
I can loop through each line recursively with grep but this is quite slow for large files. (I have about 10^8 lines to process.)
There's an approach using a loop in Python here: Remove redundant strings based on partial strings and Bash here: How to check if a string contains a substring in Bash but I'm trying to avoid loops. Edit: I mean nested loops here, thanks for the clarification @shellter
Is there a way to use awk's match() function with an array index? This approach builds the array progressively so never has to search the whole file, so should be faster for large files. Or am I missing some other simple solution?
An ideal solution would allow matching of a specified column, as for the methods above.
EDIT
Both of the answers below work, thanks very much for the help. Currently testing for performance on a real dataset, will update with results and accept an answer. I tested both approaches on the same input file, which has 430,000 lines, of which 417,000 are non-redundant. For reference, my original looped grep approach took 7h30m with this file.
Update:
James Brown's original solution took 3h15m and Ed Morton's took 8h59m. On a smaller dataset, James's updated version was 7m versus the original's 20m. Thank you both, this is really helpful.
The data I'm working with are around 110 characters per string, with typically hundreds of thousands of lines per file. The way in which these strings (which are antibody protein sequences) are created can lead to characters from one or both ends of the string getting lost. Hence, "bcd" is likely to be a fragment of "abcde".
An awk that on first run extracts and stores all substrings and strings to two arrays subs and strs and checks on second run:
$ awk '
NR==FNR {                                # first run
    if(($0 in strs)||($0 in subs))       # process only unseen strings
        next
    len=length()-1                       # initial substring length
    strs[$0]                             # hash the complete strings
    while(len>=1) {
        for(i=1;i+len-1<=length();i++) { # get all substrings of current len
            asub=substr($0,i,len)        # sub was already reserved :(
            if(asub in strs)             # if substring is in strs
                delete strs[asub]        # we do not want it there
            subs[asub]                   # hash all substrings too
        }
        len--
    }
    next
}
($0 in strs)&&++strs[$0]==1' file file
Output:
abcd
abd
I tested the script with about 30 M records of 1-20 char ACGT strings. The script ran 3m27s and used about 20 % of my 16 GBs. Using strings of length 1-100 I OOM'd in a few mins (tried it again with about 400k records of length 50-100 and it uses about 200 GBs and runs about an hour). (20 M records of 1-30 chars ran 7m10s and used 80 % of the mem)
So if your data records are short or you have unlimited memory, my solution is fast but in the opposite case it's going to crash running out of memory.
Edit:
Another version that tries to preserve memory. On the first go it checks the min and max lengths of strings and on the second run won't store substrings shorter than global min. For about 400 k records of length 50-100 it used around 40 GBs and ran 7 mins. My random data didn't have any redundancy so input==output. It did remove redundancy with other datasets (2 M records of 1-20 char strings):
$ awk '
BEGIN {
    while((getline < ARGV[1])>0)             # 1st run, check min and max lengths
        if(length()<min||min=="")            # TODO: test for length()>0, too
            min=length()
        else if(length()>max||max=="")
            max=length()
    # print min,max > "/dev/stderr"          # debug
    close(ARGV[1])
    while((getline < ARGV[1])>0) {           # 2nd run, hash strings and substrings
        # if(++nr%10000==0)                  # debug
        #     print nr > "/dev/stderr"       # debug
        if(($0 in strs)||($0 in subs))
            continue
        len=length()-1
        strs[$0]
        while(len>=min) {
            for(i=1;i+len-1<=length();i++) {
                asub=substr($0,i,len)
                if(asub in strs)
                    delete strs[asub]
                subs[asub]
            }
            len--
        }
    }
    close(ARGV[1])
    while((getline < ARGV[1])>0)             # 3rd run, output
        if(($0 in strs)&&!strs[$0]++)
            print
}' file
$ awk '{print length($0), $0}' file |
sort -k1,1rn -k2 -u |
awk '!index(str,$2){str = str FS $2; print $2}'
abcd
abd
The above assumes the set of unique values will fit in memory.
EDIT
This won't work. Sorry.
@Ed's solution is the best idea I can imagine without some explicit looping, and even that is implicitly scanning over the near-entire growing history of data on every record. It has to.
Can your existing resources hold that whole column in memory, plus a delimiter per record? If not, then you're going to be stuck with either very complex optimization algorithms, or VERY slow redundant searches.
Original post left for reference in case it gives someone else an inspiration.
That's a lot of data.
Given the input file as-is,
while read next
do  [[ "$last" == "$next" ]] && continue                  # throw out repeats
    [[ "$last" =~ $next ]] && continue                    # throw out substrings
    [[ "$next" =~ $last ]] && { last="$next"; continue; } # upgrade if last a substring of next
    echo $last                                            # distinct string
    last="$next"                                          # set new key
done < file
yields
abcd
abd
With a file of that size I wouldn't trust that sort order, though. Sorting is going to be very slow and take a lot of resources, but will give you more trustworthy results. If you can sort the file once and use that output as the input file, great. If not, replace that last line with done < <( sort -u file ) or something to that effect.
Reworking this logic in awk will be faster.
$: sort -u file | awk '1==NR{last=$0} last~$0{next} $0~last{last=$0;next} {print last;last=$0}'
Aside from the sort this uses trivial memory and should be very fast and efficient, for some value of "fast" on a file with 10^8 lines.
I need help with awk/grep/sed or whatever you think can do the job.
I have a log file and need to continuously monitor it and get some data out of the new lines as they are written to it.
The new lines are very long and not structured but they will contain the following pattern UserName=SOMEUSERNAME, NetworkDevice=SOMENETWORKDEVICE, Calling-Station-ID=SOMEMACADDRESS.
Example:
May 15 03:59:16 MTN-LAB-ISE-B1 CISE_Passed_Authentications 0000043297 1 0 2017-05-15 03:59:16.979 +00:00 0013123384 5200 NOTICE Passed-Authentication: Authentication succeeded, ConfigVersionId=170, Device IP Address=10.97.31.130, DestinationIPAddress=10.62.56.152, DestinationPort=1812, UserName=abcd\testuser, Protocol=Radius, RequestLatency=313, NetworkDeviceName=SHROCLUSW-WLAN-LAB, User-Name=d4d748fefe96, NAS-IP-Address=10.97.31.130, NAS-Port=50005, Service-Type=Call Check, Framed-IP-Address=10.97.109.64, Framed-MTU=1500, Called-Station-ID=64-E9-50-B6-DE-05, Calling-Station-ID=D4-D7-48-FE-FE-96, NAS-Port-Type=Ethernet, NAS-Port-Id=GigabitEthernet0/5, EAP-Key-Name=,
I was thinking of using tail -f to monitor the log file and pipe it to grep/sed/awk to extract the needed data.
I only need SOMEUSERNAME, SOMENETWORKDEVICE and SOMEMACADDRESS, not the surrounding pattern.
And of course, to make this even more complicated, after the extraction is done I need to pipe it to postgres.
Can someone give me a hint on how to do the matching/extraction part and maybe the pipe to postgres?
This might be done with grep/sed as well but I personally prefer awk.
I did this short script filter.awk:
{
    # find info in line
    userName = gensub(/^.*UserName=([^,\r\n]+).*$/, "\\1", 1, $0)
    networkDevice = gensub(/^.*NetworkDeviceName=([^,\r\n]+).*$/, "\\1", 1, $0)
    callingStationId = gensub(/^.*Calling-Station-ID=([^,\r\n]+).*$/, "\\1", 1, $0)
    # gensub() returns its input unchanged when the pattern does not match,
    # so reset such fields to the empty string
    userName = (userName == $0) ? "" : userName
    networkDevice = (networkDevice == $0) ? "" : networkDevice
    callingStationId = (callingStationId == $0) ? "" : callingStationId
    # print filtered info (if any of patterns matched)
    if (userName != "" || networkDevice != "" || callingStationId != "") {
        print "INSERT INTO logs (username, networkdevice, calling_station_id) VALUES ('"userName"', '"networkDevice"', '"callingStationId"');"
    }
    # If "all patterns" is required instead of "any pattern"
    # the "||" operators have to be replaced with "&&".
}
I tested it with GNU awk on bash in cygwin (Windows 10):
$ cat >filter.txt <<EOF
> May 15 03:59:16 MTN-LAB-ISE-B1 CISE_Passed_Authentications 0000043297 1 0 2017-05-15 03:59:16.979 +00:00 0013123384 5200 NOTICE Passed-Authentication: Authentication succeeded, ConfigVersionId=170, Device IP Address=10.97.31.130, DestinationIPAddress=10.62.56.152, DestinationPort=1812, UserName=abcd\testuser, Protocol=Radius, RequestLatency=313, NetworkDeviceName=SHROCLUSW-WLAN-LAB, User-Name=d4d748fefe96, NAS-IP-Address=10.97.31.130, NAS-Port=50005, Service-Type=Call Check, Framed-IP-Address=10.97.109.64, Framed-MTU=1500, Called-Station-ID=64-E9-50-B6-DE-05, Calling-Station-ID=D4-D7-48-FE-FE-96, NAS-Port-Type=Ethernet, NAS-Port-Id=GigabitEthernet0/5, EAP-Key-Name=,
> EOF
$ awk -f filter.awk filter.txt
INSERT INTO logs (username, networkdevice, calling_station_id) VALUES ('abcd\testuser', 'SHROCLUSW-WLAN-LAB', 'D4-D7-48-FE-FE-96');
$
Notes:
The NetworkDevice= pattern doesn't seem to be sufficient for me. I replaced it with NetworkDeviceName=. (It should be easy to replace this if I'm wrong.)
I do not know how to format output correctly for postgres nor do I know the database structure of the questioner. Thus, the print statement probably has to be adjusted. (There is only one print statement in script.) However, the print statement outputs to standard output channel (what you already might have expected). Thus, it can be piped into any other input consuming process easily.
It is unclear whether it is required that all patterns must match or (instead) at least one.
I implemented "at least one".
To implement "all", the || operators in the if statement have to be replaced by && operators. (There is only one if statement in the script.)
Unfortunately, the gensub() function is available in GNU awk only. For non-GNU awk, another solution could be done using gsub() instead. However, the gensub() function is much more convenient to use. Thus, I prefer it as long as a non-GNU awk solution is not explicitly required.
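As a rough sketch of what a non-GNU variant could look like, the version below uses only match(), RSTART, RLENGTH and substr(), which are part of POSIX awk. The function names extract_fields and value_of and the "|" output separator are made up for illustration, and the input is a shortened form of the sample line. Unlike the gensub() version it takes the first occurrence of each key rather than the last, and it assumes values never contain a comma.

```shell
# POSIX-awk sketch: pull key=value fields out of a line with match() and substr().
extract_fields() {
    awk '
    # Return the value following "key=" up to the next comma, or "" if absent.
    function value_of(line, key) {
        if (!match(line, key "=[^,]+"))
            return ""
        return substr(line, RSTART + length(key) + 1, RLENGTH - length(key) - 1)
    }
    {
        user = value_of($0, "UserName")
        dev  = value_of($0, "NetworkDeviceName")
        mac  = value_of($0, "Calling-Station-ID")
        # Print only when at least one key was found ("any pattern" semantics).
        if (user != "" || dev != "" || mac != "")
            print user "|" dev "|" mac
    }'
}

# Shortened version of the sample log line from the question.
line='UserName=abcd\testuser, Protocol=Radius, NetworkDeviceName=SHROCLUSW-WLAN-LAB, User-Name=d4d748fefe96, Calling-Station-ID=D4-D7-48-FE-FE-96, NAS-Port-Type=Ethernet'
result=$(printf '%s\n' "$line" | extract_fields)
printf '%s\n' "$result"
```

Replacing the print statement with the INSERT statement from filter.awk gives the same output format as the gensub() version.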
I use a postfix relay to send our mail from our email servers.
I want to first get the serial number value of the emails from certain users then using that value get the email address they are sending to. Below are examples of the from and to log entries. I would think you have to use a combination of awk and sed.
The sender line
Nov 4 14:29:53 server postfix/qmgr[2089]: 42UE78JD7JE: from=<sender@domain.som>, size=1182, nrcpt=1 (queue active)
The receiver line
Nov 4 14:29:54 server postfix/smtp[10544]: 42UE78JD7JE: to=<user@gmail.com>, relay=gmail-smtp-in.l.google.com[74.125.22.27]:25, delay=1, delays=0.02/0.
In the above example I want to extract 42UE78JD7JE from the first line and then find it in the second line. I am thinking you would use awk to get the serial number value from the line with 'sender@domain.com', then use that value to search for lines containing that value and 'to='.
awk '
$7 ~ /^from=/ {from[$6]=$7}
$7 ~ /^to=/ {to[$6]=$7}
END {for (key in from) if (key in to) print from[key], to[key]}
' file
from=<sender@domain.som>, to=<user@gmail.com>,
using awk
awk 'NR==FNR{a[$6]=$6;b[$6]=$7}NR>FNR{if($6 in a){print substr($7,0,length($7)),substr(b[$6],0,length(b[$6]));}}' sender_file reciver_file
output:
to=<user@gmail.com> from=<sender@domain.som>
This is what I think you want. If you give us the expected output, it can be improved.
I'm trying to compute some news article popularity based on twitter data. However, while retrieving the tweets I forgot to escape the characters ending up with an unusable file.
Here is a line from the file:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
The '$,$' pattern occurs not only as a field delimiter but also in the tweet, from where I want to remove it.
A correct line would be:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80000 up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
I tried to use cut and sed but I'm not getting the results I want. What would be a good strategy to solve this?
If we can assume that there are never extra separators in the time, id, retweets, username, and link fields, then you could take the middle part and remove all $,$ from it, for example like this:
perl -ne 'chomp; @a=split(/\$,\$/); $_ = join("", @a[4..($#a-1)]); print join("\$,\$", @a[0..3], $_, $a[$#a]), "\n"' < data.txt
What this does:
splits the line using $,$ as delimiter
takes the middle part = fields[4] .. fields[N-1]
joins again by $,$ the first 4 fields, the fixed middle part, and the last field (the link)
This works with your example, but I don't know what other corner cases you might have.
A good way to validate the result is to check that every line splits into exactly 6 fields on $,$. You can do that by piping the result to this:
... | perl -ne 'print scalar split(/\$,\$/), "\n"' | sort -u
(should output a single line, with "6")
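The same split-and-rejoin idea can be sketched in awk, under the same assumption that only the tweet text contains stray separators. The function name fix_tweets and the variable names are made up for illustration, and the sample line is an abbreviated form of the one in the question. The field separator is written as [$],[$] so that the dollar signs are matched literally.

```shell
# Rejoin the tweet text (fields 5..NF-1) without the stray "$,$" separators (a sketch).
fix_tweets() {
    awk -F '[$],[$]' '
    BEGIN { sep = "$,$" }
    NF >= 6 {
        text = ""
        for (i = 5; i < NF; i++)    # every field between the 4th and the last
            text = text $i          # belongs to the tweet
        print $1 sep $2 sep $3 sep $4 sep text sep $NF
        next
    }
    { print }                       # lines with fewer than 6 fields pass through unchanged
    '
}

# Abbreviated version of the broken line from the question.
line='1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009$,$http://t.co/etHHMUFpoo'
result=$(printf '%s\n' "$line" | fix_tweets)
printf '%s\n' "$result"
```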