How to extract data from live log and pipe it to postgres - linux

I need help with awk/grep/sed or whatever you think can do the job.
I have a log file and need to continuously monitor it and get some data out of the new lines as they are written to it.
The new lines are very long and not structured, but they will contain the following pattern: UserName=SOMEUSERNAME, NetworkDevice=SOMENETWORKDEVICE, Calling-Station-ID=SOMEMACADDRESS.
Example:
May 15 03:59:16 MTN-LAB-ISE-B1 CISE_Passed_Authentications 0000043297 1 0 2017-05-15 03:59:16.979 +00:00 0013123384 5200 NOTICE Passed-Authentication: Authentication succeeded, ConfigVersionId=170, Device IP Address=10.97.31.130, DestinationIPAddress=10.62.56.152, DestinationPort=1812, UserName=abcd\testuser, Protocol=Radius, RequestLatency=313, NetworkDeviceName=SHROCLUSW-WLAN-LAB, User-Name=d4d748fefe96, NAS-IP-Address=10.97.31.130, NAS-Port=50005, Service-Type=Call Check, Framed-IP-Address=10.97.109.64, Framed-MTU=1500, Called-Station-ID=64-E9-50-B6-DE-05, Calling-Station-ID=D4-D7-48-FE-FE-96, NAS-Port-Type=Ethernet, NAS-Port-Id=GigabitEthernet0/5, EAP-Key-Name=,
I was thinking using tail -f to monitor the log file and pipe it to grep/sed/awk to extract the needed data.
I only need SOMEUSERNAME, SOMENETWORKDEVICE and SOMEMACADDRESS, not the pattern labels themselves.
And of course, to make this even more complicated, after the extraction is done I need to pipe the results into Postgres.
Can someone give me a hint on how to do the matching/extraction part, and maybe the pipe to Postgres?

This could be done with grep or sed as well, but I personally prefer awk.
I wrote this short script, filter.awk:
{
    # find info in line
    userName         = gensub(/^.*UserName=([^,\r\n]+).*$/, "\\1", 1, $0)
    networkDevice    = gensub(/^.*NetworkDeviceName=([^,\r\n]+).*$/, "\\1", 1, $0)
    callingStationId = gensub(/^.*Calling-Station-ID=([^,\r\n]+).*$/, "\\1", 1, $0)
    # print filtered info (if any of the patterns matched)
    if (userName != "" || networkDevice != "" || callingStationId != "") {
        print "INSERT INTO logs (username, networkdevice, calling_station_id) VALUES ('"userName"', '"networkDevice"', '"callingStationId"');"
    }
    # If "all patterns" is required instead of "any pattern",
    # the "||" operators have to be replaced with "&&".
}
I tested it with GNU awk on bash in Cygwin (Windows 10):
$ cat >filter.txt <<EOF
> May 15 03:59:16 MTN-LAB-ISE-B1 CISE_Passed_Authentications 0000043297 1 0 2017-05-15 03:59:16.979 +00:00 0013123384 5200 NOTICE Passed-Authentication: Authentication succeeded, ConfigVersionId=170, Device IP Address=10.97.31.130, DestinationIPAddress=10.62.56.152, DestinationPort=1812, UserName=abcd\testuser, Protocol=Radius, RequestLatency=313, NetworkDeviceName=SHROCLUSW-WLAN-LAB, User-Name=d4d748fefe96, NAS-IP-Address=10.97.31.130, NAS-Port=50005, Service-Type=Call Check, Framed-IP-Address=10.97.109.64, Framed-MTU=1500, Called-Station-ID=64-E9-50-B6-DE-05, Calling-Station-ID=D4-D7-48-FE-FE-96, NAS-Port-Type=Ethernet, NAS-Port-Id=GigabitEthernet0/5, EAP-Key-Name=,
> EOF
$ awk -f filter.awk filter.txt
INSERT INTO logs (username, networkdevice, calling_station_id) VALUES ('abcd\testuser', 'SHROCLUSW-WLAN-LAB', 'D4-D7-48-FE-FE-96');
$
Notes:
The NetworkDevice= pattern does not appear in the sample line, so I replaced it with NetworkDeviceName=. (It is easy to change this back if I'm wrong.)
I do not know how to format the output correctly for Postgres, nor do I know the questioner's database structure, so the print statement probably has to be adjusted. (There is only one print statement in the script.) The print statement writes to standard output, as you might expect, so it can easily be piped into any other process that consumes input.
It is unclear whether all patterns must match or (instead) at least one.
I implemented "at least one".
To implement "all", the || operators in the if statement have to be replaced by && operators. (There is only one if statement in the script.)
Unfortunately, the gensub() function is available in GNU awk only. For non-GNU awk, a similar solution could be built using gsub() instead, but gensub() is much more convenient to use, so I prefer it as long as a non-GNU awk solution is not explicitly required.
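To wire everything together, a minimal sketch of the full pipeline could look like this (the log path, database name and user are hypothetical placeholders, and psql simply executes the SQL it reads on standard input). One practical point: awk usually block-buffers its output when writing into a pipe, so adding a GNU awk fflush() right after the print, or running the pipeline under stdbuf -oL, keeps the inserts flowing in near real time:
tail -F /var/log/ise/passed_auth.log |   # hypothetical path; -F keeps following across rotation/truncation
  awk -f filter.awk |                    # the script above, emitting one INSERT per matching line
  psql -U loguser -d radiuslogs          # hypothetical user/database; psql executes SQL read from stdin
Another caveat (my addition, not part of the original answer): the INSERT is built by plain string concatenation, so a single quote inside a captured value would break the statement; doubling single quotes first, for example gsub(/'/, "''", userName) before the print, is a cheap safeguard.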

Related

AWK how to process multiple files and comparing them IN CONTROL FILE! (not command line one-liner)

I read all of the answers to similar problems, but they do not work for me because my files are not uniform: they contain several control headers, and in such a case it is safer to create a script than a one-liner, yet all the answers focus on one-liners. In theory one-liner commands should be convertible to a script, but I am struggling to achieve:
printing the control headers
printing only the records starting with 16 in <file 1> whose column 2 value does NOT exist in column 2 of <file 2>
I ended up with this:
BEGIN {
    FS="\x01";
    OFS="\x01";
    RS="\x02\n";
    ORS="\x02\n";
    file1=ARGV[1];
    file2=ARGV[2];
    count=0;
}
/^#/ {
    print;
    count++;
}
# reset counters after control headers
NR=1;
FNR=1;
# Below gives syntax error
/^16/ AND NR==FNR {
    a[$2];next; 'FNR==1 || !$2 in a' file1 file2
}
END {
}
Googling only gives me results for command-line processing, and the documentation is also silent in this regard. Does that mean it cannot be done?
Perhaps try:
script.awk:
BEGIN {
    OFS = FS = "\x01"
    ORS = RS = "\x02\n"
}
NR==FNR {
    if (/^16/) a[$2]
    next
}
/^16/ && !($2 in a) || /^#/
Note the parentheses: !$2 in a would be parsed as (!$2) in a (see the quick demonstration below).
Invoke with:
awk -f script.awk FILE2 FILE1
Note order of FILE1 / FILE2 is reversed; FILE2 must be read first to pre-populate the lookup table.
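As a quick illustration of that precedence note (a throwaway one-liner on assumed toy input, not part of the original answer), the two forms really do evaluate differently:
$ echo 'x 2' | awk '{ a[1]; v1 = !$2 in a; v2 = !($2 in a); print v1, v2 }'
0 1
Here !$2 in a first negates $2 (giving 0) and then asks whether the key 0 is in the array, while !($2 in a) asks whether the key 2 is absent.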
First of all, the short answer to my question should be "NOT POSSIBLE"; for anyone who read the question carefully and knows AWK in full, that is the obvious answer. I wish I had known it sooner instead of wasting a few days trying to write the script.
Also, there is no such thing as a minimal reproducible example here (this was always a constant pain on the TeX groups) - I need a full working example; if it works on 1 row there is no guarantee it works on 2 rows, and my number of rows is ~127 million.
If you read the code carefully you would know what is not working - I marked in a comment what gives the syntax error. Anyway, as @Daweo suggested, there is no way to use AND as a logical operator in the pattern section. So, because we don't need any printing for the first file, the whole trick is to do the conditional inside the second action block:
awk -F, 'BEGIN{} NR==FNR{a[$1];next} !($1 in a) { if (/^16/) print $0} ' set1.txt set2.txt
assuming, in the above example, that the separator is a comma. I don't know where the assumption that a multi-character RS is supported only in GNU awk came from; on macOS BSD awk it works exactly the same, and in any case RS="\x02\n" is a single separator, not two separators.

Remove redundant strings without looping

Is there a way to remove both duplicates and redundant substrings from a list, using shell tools? By "redundant", I mean a string that is contained within another string, so "foo" is redundant with "foobar" and "barfoo".
For example, take this list:
abcd
abc
abd
abcd
bcd
and return:
abcd
abd
uniq, sort -u and awk '!seen[$0]++' remove duplicates effectively but not redundant strings:
How to delete duplicate lines in a file without sorting it in Unix?
Remove duplicate lines without sorting
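For instance, running the usual de-duplication idiom over the sample list above (saved here as a hypothetical list.txt) drops the repeated abcd but keeps abc and bcd, even though both are contained in abcd, which is exactly the problem:
$ awk '!seen[$0]++' list.txt
abcd
abc
abd
bcd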
I can loop through each line recursively with grep, but this is quite slow for large files. (I have about 10^8 lines to process.)
There's an approach using a loop in Python here (Remove redundant strings based on partial strings) and in Bash here (How to check if a string contains a substring in Bash), but I'm trying to avoid loops. Edit: I mean nested loops here, thanks for the clarification @shellter.
Is there a way to use awk's match() function with an array index? That approach would build the array progressively, so it never has to search the whole file and should be faster for large files. Or am I missing some other simple solution?
An ideal solution would allow matching of a specified column, as for the methods above.
EDIT
Both of the answers below work, thanks very much for the help. Currently testing for performance on a real dataset, will update with results and accept an answer. I tested both approaches on the same input file, which has 430,000 lines, of which 417,000 are non-redundant. For reference, my original looped grep approach took 7h30m with this file.
Update:
James Brown's original solution took 3h15m and Ed Morton's took 8h59m. On a smaller dataset, James's updated version was 7m versus the original's 20m. Thank you both, this is really helpful.
The data I'm working with are around 110 characters per string, with typically hundreds of thousands of lines per file. The way in which these strings (which are antibody protein sequences) are created can lead to characters from one or both ends of the string getting lost. Hence, "bcd" is likely to be a fragment of "abcde".
An awk script that, on the first pass, extracts and stores all strings and their substrings in two arrays, strs and subs, and checks on the second pass:
$ awk '
NR==FNR {                                    # first run
    if(($0 in strs)||($0 in subs))           # process only unseen strings
        next
    len=length()-1                           # initial substring length
    strs[$0]                                 # hash the complete strings
    while(len>=1) {
        for(i=1;i+len-1<=length();i++) {     # get all substrings of current len
            asub=substr($0,i,len)            # "sub" was already reserved :(
            if(asub in strs)                 # if substring is in strs
                delete strs[asub]            # we do not want it there
            subs[asub]                       # hash all substrings too
        }
        len--
    }
    next
}
($0 in strs)&&++strs[$0]==1' file file
Output:
abcd
abd
I tested the script with about 30 M records of 1-20 char ACGT strings. The script ran in 3m27s and used about 20% of my 16 GB. Using strings of length 1-100 I OOM'd in a few minutes (I tried it again with about 400k records of length 50-100 and it used about 200 GB and ran for about an hour). (20 M records of 1-30 chars ran in 7m10s and used 80% of the memory.)
So if your data records are short or you have unlimited memory, my solution is fast, but in the opposite case it's going to crash by running out of memory.
Edit:
Another version that tries to conserve memory. On the first pass it finds the min and max lengths of the strings, and on the second pass it won't store substrings shorter than the global min. For about 400k records of length 50-100 it used around 40 GB and ran in 7 minutes. My random data didn't have any redundancy, so input == output; it did remove redundancy with other datasets (2 M records of 1-20 char strings):
$ awk '
BEGIN {
    while((getline < ARGV[1])>0)             # 1st run, check min and max lengths
        if(length()<min||min=="")            # TODO: test for length()>0, too
            min=length()
        else if(length()>max||max=="")
            max=length()
    # print min,max > "/dev/stderr"          # debug
    close(ARGV[1])
    while((getline < ARGV[1])>0) {           # 2nd run, hash strings and substrings
        # if(++nr%10000==0)                  # debug
        #     print nr > "/dev/stderr"       # debug
        if(($0 in strs)||($0 in subs))
            continue
        len=length()-1
        strs[$0]
        while(len>=min) {
            for(i=1;i+len-1<=length();i++) {
                asub=substr($0,i,len)
                if(asub in strs)
                    delete strs[asub]
                subs[asub]
            }
            len--
        }
    }
    close(ARGV[1])
    while((getline < ARGV[1])>0)             # 3rd run, output
        if(($0 in strs)&&!strs[$0]++)
            print
}' file
$ awk '{print length($0), $0}' file |                 # prefix each line with its length
    sort -k1,1rn -k2 -u |                             # sort longest first; -u drops exact duplicates
    awk '!index(str,$2){str = str FS $2; print $2}'   # keep only strings not already contained in a kept string
abcd
abd
The above assumes the set of unique values will fit in memory.
EDIT
This won't work. Sorry. @Ed's solution above is the best idea I can imagine without some explicit looping, and even that is implicitly scanning over the near-entire growing history of data on every record. It has to.
Can your existing resources hold that whole column in memory, plus a delimiter per record? If not, then you're going to be stuck with either very complex optimization algorithms, or VERY slow redundant searches.
My original post is left below for reference, in case it gives someone else some inspiration.
That's a lot of data.
Given the input file as-is,
while read next
do  [[ "$last" == "$next" ]] && continue                  # throw out repeats
    [[ "$last" =~ $next ]] && continue                    # throw out substrings
    [[ "$next" =~ $last ]] && { last="$next"; continue; } # upgrade if last is a substring of next
    echo $last                                            # distinct string
    last="$next"                                          # set new key
done < file
yields
abcd
abd
With a file of that size I wouldn't trust that sort order, though. Sorting is going to be very slow and take a lot of resources, but will give you more trustworthy results. If you can sort the file once and use that output as the input file, great. If not, replace that last line with done < <( sort -u file ) or something to that effect.
Reworking this logic in awk will be faster.
$: sort -u file | awk '1==NR{last=$0} last~$0{next} $0~last{last=$0;next} {print last;last=$0}'
Aside from the sort this uses trivial memory and should be very fast and efficient, for some value of "fast" on a file with 10^8 lines.

Filter a very large, numerically sorted CSV file based on a minimum/maximum value using Linux?

I'm trying to output lines of a CSV file which is quite large. In the past I have tried different things and ultimately come to find that Linux's command line interface (sed, awk, grep, etc) is the fastest way to handle these types of files.
I have a CSV file like this:
1,rand1,rand2
4,randx,randy,
6,randz,randq,
...
1001,randy,randi,
1030,rando,randn,
1030,randz,randc,
1036,randp,randu
...
1230994,randm,randn,
1230995,randz,randl,
1231869,rande,randf
Although the first column is numerically increasing, the space between each number varies randomly. I need to be able to output all lines that have a value between X and Y in their first column.
Something like:
sed ./csv -min --col1 1000 -max --col1 1400
which would output all the lines that have a first column value between 1000 and 1400.
The lines are different enough that in a >5 GB file there might only be ~5 duplicates, so it wouldn't be a big deal if it counted the duplicates only once -- but it would be a big deal if it threw an error due to a duplicate line.
I may not know whether particular line values exist (e.g. 1000 is a rough estimate and should not be assumed to exist as a first column value).
Optimizations matter when it comes to large files; the following awk command:
is parameterized (uses variables to define the range boundaries)
performs only a single comparison for records that come before the range.
exits as soon as the last record of interest has been found.
awk -F, -v from=1000 -v to=1400 '$1 < from { next } $1 > to { exit } 1' ./csv
Because awk performs numerical comparison (with input fields that look like numbers), the range boundaries needn't match field values precisely.
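For example, run against the sample rows shown in the question, the 1000-1400 range keeps just the lines whose first field falls inside it:
$ awk -F, -v from=1000 -v to=1400 '$1 < from { next } $1 > to { exit } 1' ./csv
1001,randy,randi,
1030,rando,randn,
1030,randz,randc,
1036,randp,randu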
You can easily do this with awk, though it won't take full advantage of the file being sorted:
awk -F , '$1 > 1400 { exit(0); } $1 >= 1000 { print }' file.csv
If you know that the numbers are increasing and unique, you can use addresses like this:
sed '/^1000,/,/^1400,/!d' infile.csv
which does not print any line that is outside of the lines between the one that matches /^1000,/ and the one that matches /^1400,/.
Notice that this doesn't work as intended if 1000 or 1400 don't actually exist as first-column values; in particular, if the start value is missing it wouldn't print anything at all.
In any case, as demonstrated by the answers by mklement0 and that other guy, awk is the better choice here.
Here's a bash-version of the script:
#! /bin/bash
fname="$1"
start_nr="$2"
end_nr="$3"
while IFS=, read -r nr rest || [[ -n $nr && -n $rest ]]; do
    if (( $nr < $start_nr )); then continue;
    elif (( $nr > $end_nr )); then break; fi
    printf "%s,%s\n" "$nr" "$rest"
done < "$fname"
You would then call it as script.sh foo.csv 1000 2000
The script will start printing when the number is large enough and then immediately stops when the number gets above the limit.

Can awk patterns match multiple lines?

I have some complex log files that I need to write some tools to process. I have been playing with awk, but I am not sure if awk is the right tool for this.
My log files are printouts of OSPF protocol decodes, which contain a text log of the various protocol pkts and their contents, with the various protocol fields identified along with their values. I want to process these files and print out only certain lines of the log that pertain to specific pkts. Each pkt's log entry can consist of a varying number of lines.
awk seems to be able to process a single line that matches a pattern. I can locate the desired pkt but then I need to match patterns in the lines that follow in order to determine if it is a pkt I want to print out.
Another way to look at this is that I would want to isolate several lines in the log file and print out those lines that are the details of a particular pkt based on pattern matches on several lines.
Since awk seems to be line-based, I am not sure if that would be the best tool to use.
If awk can do this, how it is done? If not, any suggestions on which tool to use for this?
Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.
Consider this input:
how
second half #1
now
first half
second half #2
brown
second half #3
cow
As you have seen, it's easy to recognize a single pattern. Now, we can write an awk program that recognizes second half only when it is directly preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)
/second half/ {
    if (lastLine == "first half") {
        print
    }
}
{ lastLine = $0 }
If you run this you will see:
second half #2
Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and transition from state-to-state depending on both the existing state and the current input. But you may not need that much control mechanism.
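As a rough sketch of that more explicit style (my addition, reusing the first half / second half patterns from the toy input above), the program can carry a state variable instead of remembering the previous line:
# state 0: waiting for "first half"; state 1: the previous line was "first half"
state == 1 && /second half/ { print }    # a "second half" line only counts in state 1
{ state = /first half/ ? 1 : 0 }         # transition depends only on the current line
Run on the same input, it prints second half #2, exactly like the lastLine version.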
awk is able to process from a start pattern until an end pattern:
/start-pattern/,/end-pattern/ {
    print
}
I was looking for how to match
* Implements hook_entity_info_alter().
*/
function file_test_entity_type_alter(&$entity_types) {
so I created
/\* Implements hook_/,/function / {
    print
}
which matched the content I needed. A more complex example is to skip lines and scrub off non-space parts. Note that awk is a record (line) and word (split by space) tool.
# start,end pattern match using comma
/ \* Implements hook_(.*?)\./,/function (.\S*?)/ {
    # skip the PHP multi-line comment end
    if ($0 ~ / \*\//) next
    # Only print 3rd word
    if ($0 ~ /Implements/) {
        hook=$3
        # scrub off opening parenthesis and everything following
        sub(/\(.*$/, "", hook)
        print hook
    }
    # Only print function name without parenthesis
    if ($0 ~ /function/) {
        name=$2
        # scrub off opening parenthesis and everything following
        sub(/\(.*$/, "", name)
        print name
        print ""
    }
}
Hope this helps too.
See also GAWK ranges for more info.
Awk is really record-based. By default it thinks of a line as a record, but you can alter that with the RS (record separator) variable.
One way to approach this would be to do a first pass using sed (you could do this with awk, too, if you prefer), to separate the records with a different character like a form-feed. Then you can write your awk script where it will treat the group of lines as a single record.
For example, if this is your data:
animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
To separate the records with form-feeds:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|'
Now we'll take that and pass it through awk. Here's an example of conditionally printing a record:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' | awk '
BEGIN { RS="\f" }
/type: cat/ { print }'
outputs:
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
Edit: as a bonus, here's how to do it with awk-ward ruby (-014 means use form-feed (octal code 014) as the record separator):
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' |
ruby -014 -ne 'print if /type: cat/'
I do this sort of thing with sendmail logs, from time to time.
Given:
Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www#web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www#web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3#nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc#europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas#javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
I use a script something like this:
#!/usr/bin/awk -f
BEGIN {
    search=ARGV[1];   # Grab the first command line option
    delete ARGV[1];   # Delete it so it won't be considered a file
}

# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
    line[$6]=sprintf("%s\n%s", line[$6], $0);
}

# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
    show[$6];
}

# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
    for(qid in show) {
        print line[qid];
    }
}
to get the following output:
$ mqsearch airtel /var/log/maillog
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas#javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
The idea here is that I'm printing all lines that match the Sendmail Queue ID of the string I want to search for. The structure of the code is of course a product of the structure of the log file, so you'll need to customize your solution for the data you're trying to analyse and extract.
awk '/pattern-start/,/pattern-end/'
`pcregrep -M` works pretty well for this.
From pcregrep(1):
-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence, the output ends at the end of that line.
When this option is set, the PCRE library is called in “multiline” mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered).
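For instance, against the animal data used earlier in this thread, a multi-line pattern can pull out whole cat records in one go (a sketch; the exact pattern depends on your real log layout):
$ pcregrep -M 'animal \d+\nname: .*\ntype: cat' data
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat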

Resources