Trailers option in git --pretty option - linux

I was trying to extract a summary of contributions from git log, condense it, and turn it into an Excel/CSV file for reporting.
I did try
git log --after="2020-12-10" --pretty=format:'"%h","%an","%ae","%aD","%s","(trailers:key="Reviewed By")"'
and the CSV output looks like this, with a blank column at the end:
...
"7c87963cc","XYZ","xyz#abc.com","Tue Dec 8 17:40:13 2020 +0000","[TTI] Add support for target hook in compiler.", ""
...
and the git log looks something like
commit 7c87963cc
Author: XYZ <xyz#abc.com>
Date: Tue Dec 8 17:40:13 2020 +0000
[TTI] Add support for target hook in compiler.
This adds some code in the TabeleGen ...
This is my body of commit.
Reviewed By: Sushant
Differential Revision: https://codereviews.com/DD8822
What I couldn't manage was extracting the Differential Revision string using the (trailers:key="Reviewed By") placeholder.
I couldn't find much on how to get this working.
I checked the git manual and I did try what it explains.
Is there something I might be missing in this command?
The expected output should have the text https://codereviews.com/DD8822 in the last position of the above CSV output.

I'm not sure but:
trailer keys cannot contain whitespace (therefore Reviewed By -> Reviewed-By, and Differential Revision -> Differential-Revision);
trailers should not be separated from each other by blank lines, but should form a block of their own, separated from the commit message body (therefore Reviewed By from your question is not considered a trailer).
I would also recommend TSV rather than CSV: git output is not aware of CSV syntax (escaping of semicolons and commas), so the generated document may end up unparsable.
If your commit messages looked like this (hyphens instead of spaces, no blank-line delimiters):
commit 7c87963cc
Author: XYZ <xyz#abc.com>
Date: Tue Dec 8 17:40:13 2020 +0000
[TTI] Add support for target hook in compiler.
This adds some code in the TabeleGen ...
This is my body of commit.
Reviewed-By: Sushant
Differential-Revision: https://codereviews.com/DD8822
Then the following command would work for you:
git log --pretty=format:'%h%x09%an%x09%ae%x09%aD%x09%s%x09%(trailers:key=Reviewed-By,separator=%x20,valueonly)%x09%(trailers:key=Differential-Revision,separator=%x20,valueonly)'
producing the short commit id, author name, author email, date, commit subject, the Reviewed-By trailer, and the Differential-Revision trailer as tab-separated values.
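Since the original goal is an Excel-friendly report, the same TSV output can simply be redirected to a file and opened in a spreadsheet application (the date filter is taken from the question; the file name is just an example):
git log --after="2020-12-10" --pretty=format:'%h%x09%an%x09%ae%x09%aD%x09%s%x09%(trailers:key=Reviewed-By,separator=%x20,valueonly)%x09%(trailers:key=Differential-Revision,separator=%x20,valueonly)' > contributions.tsv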
If you may not change the old commit messages because your history is not safe to rewrite (it's published, pulled by peers, your tools are bound to the published commit hashes), then you have to process the git log output with sed, awk, perl, or any other text-transforming tool to generate your report. For example, process something like git log --pretty=format:'%x02%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%n%B', where the lines between ^B (STX) and EOF should be analyzed somehow (filtered for the trailers you are interested in), then joined to their group lines starting with ^B, and finally character-replaced so that the field and entry separators become \t and nothing, respectively.
But again, if you can edit the history by fixing the commit message trailers (not sure how much that would affect you), I'd recommend doing exactly that: simply fix the commit messages and drop the idea of extra scripts that process trailers not recognized by git-interpret-trailers.
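If rewriting history is acceptable, a minimal sketch of fixing the trailers on the most recent commits could look like this (N is the number of affected commits; this rewrites hashes, so the branch has to be force-pushed and peers have to re-fetch):
git rebase -i HEAD~N
# mark each affected commit as "reword", then in the editor change the trailer
# lines at the end of each message, for example:
#   Reviewed By: Sushant            ->  Reviewed-By: Sushant
#   Differential Revision: <url>    ->  Differential-Revision: <url>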
Edit 1 (text tools)
If rewriting the history is not an option, then implementing some scripts may help you out. I'm pretty weak at writing powerful sed/awk/perl scripts, but let me try.
git log --pretty=format:'%x02%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%n%B' \
| gawk -f trailers.awk \
| sed '$!N;s/\n/\x1F/' \
| sed 's/[\x02\x1E]//g' \
| sed 's/\x1F/\x09/g'
How it works:
git generates a log made of data delimited with standard C0/C1 control codes (STX, RS and US), assuming there are no such characters in your commit messages (I don't really know if this is a good place to use them like that, or whether I apply them semantically correctly);
gawk filters the log output trying to parse STX-started groups and extract the trailers, generating "two-rowed" output (each odd line for regular data, each even line for comma-joined trailer values even for missing trailers);
sed joins odd and even lines by pairs (credits go to Karoly Horvath);
sed removes STX and RS;
sed replaces US to TAB.
Here is the trailers.awk (again, I'm not an awk guy and have no idea how idiomatic the following script is, but it seems to work):
#!/usr/bin/awk -f
BEGIN {
FIRST = 1
delete TRAILERS
}
function print_joined_array(array) {
if ( !length(array) ) {
return
}
for ( i in array ) {
if ( i > 0 ) {
printf(",")
}
printf("%s", array[i])
}
printf("\x1F")
}
function print_trailers() {
if ( FIRST ) {
FIRST = 0
return
}
print_joined_array(TRAILERS["Reviewed By"])
print_joined_array(TRAILERS["Differential Revision"])
print ""
}
/^\x02/ {
print_trailers()
print $0
delete TRAILERS
}
match($0, /^([-_ A-Za-z0-9]+):\s+(.*)\s*/, M) {
TRAILERS[M[1]][length(TRAILERS[M[1]])] = M[2]
}
END {
print_trailers()
}
A couple of words on how the awk script works:
it assumes that lines that do not require trailer extraction start with STX;
it tries to grep each non-STX line for a Key Name: Value pattern and saves the result into a temporary array TRAILERS (which actually serves as a multimap, like Map<String, List<String>> in Java) for each record;
each record is written as is, but trailers are written either before detecting a new record or at EOF.
Edit 2 (better awk)
Well, I'm really weak at awk, so once I read more about awk's internal variables, I figured out that the awk script can be reimplemented entirely to produce ready-to-use TSV output itself, without any post-processing with sed or perl. The shorter and improved version of the script is:
#!/bin/bash
git log --pretty=format:'%x1E%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%B%x1E' \
| gawk -f trailers.awk
#!/usr/bin/awk -f
BEGIN {
RS = "\x1E"
FS = "\x1F"
OFS = "\x09"
}
function extract(array, trailer_key, __buffer) {
for ( i in array ) {
# match only lines that start with the trailer key
if ( index(array[i], trailer_key) == 1 ) {
if ( length(__buffer) > 0 ) {
__buffer = __buffer ","
}
# append the value only, skipping the "Key: " prefix
__buffer = __buffer substr(array[i], length(trailer_key) + 1)
}
}
return __buffer
}
NF > 1 {
split($6, array, "\n")
print $1, $2, $3, $4, $5, extract(array, "Reviewed By: "), extract(array, "Differential Revision: ")
}
Much more concise, easier to read, understand and maintain.
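For the sample commit from the question, this version should produce a single line per commit with the fields separated by tabs, roughly:
7c87963cc	XYZ	xyz#abc.com	Tue Dec 8 17:40:13 2020 +0000	[TTI] Add support for target hook in compiler.	Sushant	https://codereviews.com/DD8822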

Related

Is there a linux command that can cut and pick columns that match string patterns?

I need to analyze logs, and my end user has to be able to see them in a formatted way, as mentioned below. The nature of my logs is that the key variables may appear in different positions rather than at fixed columns, because these log formats come from various applications.
"thread":"t1","key1":"value1","key2":"value2",......"key15":"value15"
I have a way to split and cut this to analyze only particular keys, using the following,
cat file.txt | grep 'value1' | cut -d',' -f2,7,8-
This is the command I have so far. The requirement is that I need to grep all logs which have 'key1' as 'value1'; this value1 will most likely be unique, so I am using grep directly (if required, I can grep on the key together with the value string). The main problem I am facing is the part after cut: I want to pick only key2, key7, and key8 from these lines, but they might not appear in the same column numbers or in this order; key2 might even be at column 3 or 4, or after key7/key8. So I want to pick based on the key name and get exactly
"key2":"value2", "key7":"value7", "key8":"value8"
The end user is not particularly picky about the order in which they appear; they only need these keys from each line to be displayed.
Can someone help me? I tried piping into awk/grep again, but they still match the entire line, not the columns alone.
My input is
{"#timestamp":"2021-08-05T06:38:48.084Z","level":"INFO","thread":"main","logger":"className1","message":"Message 1"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"main","logger":"className2","message":"Message 2"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"thead1","logger":"className2","message":"Message 2"}
I basically want my output to be more like this: find only the "thread":"main" lines and print only the keys and values of "logger" and "message" for each matching line, since the other keys and values are irrelevant to me. There are more than 15 or 16 keys in my file, and the key positions can be swapped; "message" could appear first and "logger" second in some log files. Of course, these keys are just an example; the real keys I am trying to find are not "logger" and "message" alone.
There are log analysis tools, but this is a pretty old system, and the logs are not real-time; I am analyzing and displaying files that are years old.
Not sure I really understand your specification but the following awk script could be a starting point:
$ cat foo.awk
BEGIN {
k["\"key1\""] = 1; k["\"key7\""] = 1; k["\"key8\""] = 1;
}
/"key1":"value1"/ {
s = "";
for(i = 1; i <= NF; i+=2)
if($i in k)
s = s (s ? "," : "") $i ":" $(i+1);
print s;
}
$ awk -F',|:' -f foo.awk foo.txt
"key1":"value1","key7":"value7","key8":"value8"
Explanation:
awk is called with the -F',|:' option such that the field separator in each record is the comma or the colon.
In the BEGIN section we declare an associative array (k) of the selected keys, including the surrounding double quotes.
The rest of the awk script applies to each record containing "key1":"value1".
Variable s is used to prepare the output string; it is initialized to "".
For each odd field (the keys) we check if it is in k. If it is, we concatenate to s:
a comma if s is not empty,
the key field,
a colon,
the following even field (the value).
We print s.
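Adapted to the JSON-style sample input above, a rough sketch (departing from the script above by splitting on commas only, and assuming the values themselves contain no embedded commas; bar.awk is just an example name) could be:
$ cat bar.awk
/"thread":"main"/ {
s = "";
for(i = 1; i <= NF; i++)
if($i ~ /"(logger|message)":/)
s = s (s ? "," : "") $i;
sub(/}$/, "", s); # drop the closing brace carried by the last field
print s;
}
$ awk -F',' -f bar.awk file.txt
"logger":"className1","message":"Message 1"
"logger":"className2","message":"Message 2"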

How to extract data from live log and pipe it to postgres

I need help with awk/grep/sed or whatever you think can do the job.
I have a log file and need to continuously monitor it and get some data out of the new lines as they are written to it.
The new lines are very long and not structured but they will contain the following pattern UserName=SOMEUSRNAME, NetworkDevice=SOMENETWORKDEVICE, Calling-Station-ID=SOMEMACADDRESS.
Example:
May 15 03:59:16 MTN-LAB-ISE-B1 CISE_Passed_Authentications 0000043297 1 0 2017-05-15 03:59:16.979 +00:00 0013123384 5200 NOTICE Passed-Authentication: Authentication succeeded, ConfigVersionId=170, Device IP Address=10.97.31.130, DestinationIPAddress=10.62.56.152, DestinationPort=1812, UserName=abcd\testuser, Protocol=Radius, RequestLatency=313, NetworkDeviceName=SHROCLUSW-WLAN-LAB, User-Name=d4d748fefe96, NAS-IP-Address=10.97.31.130, NAS-Port=50005, Service-Type=Call Check, Framed-IP-Address=10.97.109.64, Framed-MTU=1500, Called-Station-ID=64-E9-50-B6-DE-05, Calling-Station-ID=D4-D7-48-FE-FE-96, NAS-Port-Type=Ethernet, NAS-Port-Id=GigabitEthernet0/5, EAP-Key-Name=,
I was thinking using tail -f to monitor the log file and pipe it to grep/sed/awk to extract the needed data.
I only need the SOMEUSERNAME, SOMENETWORKDEVICE, and SOMEMACADDRESS values, not the surrounding pattern.
And of course, to make this even more complicated, after the extraction is done I need to pipe it to postgres.
Can someone give me a hint on how to do matching/extraction part and maybe the pipe to postgres?
This might be done with grep/sed as well but I personally prefer awk.
I did this short script filter.awk:
{
# find info in line
# (gensub() returns its input unchanged when the pattern does not match,
# so reset any non-matching result to an empty string)
userName = gensub(/^.*UserName=([^,\r\n]+).*$/, "\\1", 1, $0)
if (userName == $0) userName = ""
networkDevice = gensub(/^.*NetworkDeviceName=([^,\r\n]+).*$/, "\\1", 1, $0)
if (networkDevice == $0) networkDevice = ""
callingStationId = gensub(/^.*Calling-Station-ID=([^,\r\n]+).*$/, "\\1", 1, $0)
if (callingStationId == $0) callingStationId = ""
# print filtered info (if any of the patterns matched)
if (userName != "" || networkDevice != "" || callingStationId != "") {
print "INSERT INTO logs (username, networkdevice, calling_station_id) VALUES ('"userName"', '"networkDevice"', '"callingStationId"');"
}
# If "all patterns" is required instead of "any pattern",
# the "||" operators have to be replaced with "&&".
}
I tested it with GNU awk on bash in Cygwin (Windows 10):
$ cat >filter.txt <<EOF
> May 15 03:59:16 MTN-LAB-ISE-B1 CISE_Passed_Authentications 0000043297 1 0 2017-05-15 03:59:16.979 +00:00 0013123384 5200 NOTICE Passed-Authentication: Authentication succeeded, ConfigVersionId=170, Device IP Address=10.97.31.130, DestinationIPAddress=10.62.56.152, DestinationPort=1812, UserName=abcd\testuser, Protocol=Radius, RequestLatency=313, NetworkDeviceName=SHROCLUSW-WLAN-LAB, User-Name=d4d748fefe96, NAS-IP-Address=10.97.31.130, NAS-Port=50005, Service-Type=Call Check, Framed-IP-Address=10.97.109.64, Framed-MTU=1500, Called-Station-ID=64-E9-50-B6-DE-05, Calling-Station-ID=D4-D7-48-FE-FE-96, NAS-Port-Type=Ethernet, NAS-Port-Id=GigabitEthernet0/5, EAP-Key-Name=,
> EOF
$ awk -f filter.awk filter.txt
INSERT INTO logs (username, networkdevice, calling_station_id) VALUES ('abcd\testuser', 'SHROCLUSW-WLAN-LAB', 'D4-D7-48-FE-FE-96');
$
Notes:
The NetworkDevice= pattern doesn't seem to be sufficient for me. I replaced it with NetworkDeviceName=. (It should be easy to replace this if I'm wrong.)
I do not know how to format the output correctly for postgres, nor do I know the questioner's database structure. Thus, the print statement probably has to be adjusted. (There is only one print statement in the script.) However, the print statement writes to the standard output channel (as you might already have expected), so it can easily be piped into any other input-consuming process.
It is unclear whether it is required that all patterns must match or (instead) at least one.
I implemented "at least one".
To implement "all", the || operators in the if statement would have to be replaced by && operators. (There is only one if statement in the script.)
Unfortunately, the gensub() function is available in GNU awk only. For non-GNU awk, another solution could be done using gsub() instead. However, the gensub() function is much more convenient to use. Thus, I prefer it as long as a non-GNU awk solution is not explicitly required.
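To feed the generated INSERT statements into PostgreSQL, the script's standard output can be piped straight into psql, roughly like this (log path, database name and user are just examples; the logs table must already exist, and note that the awk script does not SQL-escape the values):
tail -f /path/to/ise.log | gawk -f filter.awk | psql -U loguser -d logsdb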

diff line-format: Show delete, new, and changed lines?

backup.txt
user1:password:17002:0:99:7:::
user2:password:17003:0:99:7:::
user3:password:17004:0:99:7:::
"main.txt" is same with "backup.txt". If I rename "user1", add a new user, and remove "user2" in "main.txt". "main.txt" seems like:
username1:password:17002:0:99:7:::
user3:password:17004:0:99:7:::
newUser:password:17005:0:99:7:::
After that I use the following command to compare the two files:
diff --unchanged-line-format="" --old-line-format=":%dn: %L" --new-line-format=":%dn: %L" backup.txt main.txt
...with the actual output:
:1: user1:password:17002:0:99:7:::
:2: user2:password:17003:0:99:7:::
:1: username1:password:17002:0:99:7:::
:3: newUser:password:17005:0:99:7:::
However, my intended output was:
:1c: user1:password:17002:0:99:7:::
:2d: user2:password:17003:0:99:7:::
:1c: username1:password:17002:0:99:7:::
:3a: newUser:password:17005:0:99:7:::
like this. These characters (a, c, d) are what the default "diff" output uses. How can I enable these characters for line formatting? Is it possible?
The LTYPEs offered by both BSD and GNU diff are "old", "new", and "unchanged". You thus can't distinguish between "new" and "changed".
That said, to get some distinctions in your format strings, you need to fill them out correctly. In %dn, both the d and the n are consumed (the former specifying a decimal value, the n specifying that it refers to the line number, or the number of lines modified, depending on context). Thus, if you want any extra characters (such as a c, d or a), you need to add those characters after that substitution is complete.
# declaring functions to allow testing without creating files on-disk
backup () { printf '%s\n' user1:password:17002:0:99:7::: user2:password:17002:0:99:7::: user3:password:17002:0:99:7:::; }
main () { printf '%s\n' username1:password:17002:0:99:7::: user3:password:17004:0:99:7::: newUser:password:17005:0:99:7:::; }
diff \
--unchanged-line-format=":%dnu: %L" \
--old-line-format=":%dnd: %L" \
--new-line-format=":%dnn: %L" \
<(backup) <(main)

Can awk patterns match multiple lines?

I have some complex log files that I need to write some tools to process.
My log files are print outs of OSPF protocol decodes which contain a text log of the various protocol pkts and their contents with their various protocol fields identified with their values. I want to process these files and print out only certain lines of the log that pertain to specific pkts. Each pkt log can consist of a varying number of lines for that pkt's entry.
awk seems to be able to process a single line that matches a pattern. I can locate the desired pkt but then I need to match patterns in the lines that follow in order to determine if it is a pkt I want to print out.
Another way to look at this is that I would want to isolate several lines in the log file and print out those lines that are the details of a particular pkt based on pattern matches on several lines.
Since awk seems to be line-based, I am not sure if that would be the best tool to use.
If awk can do this, how is it done? If not, any suggestions on which tool to use for this?
Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.
Consider this input:
how
second half #1
now
first half
second half #2
brown
second half #3
cow
As you have seen, it's easy to recognize a single pattern. Now, we can write an awk program that recognizes second half only when it is directly preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)
/second half/ {
if(lastLine == "first half") {
print
}
}
{ lastLine = $0 }
If you run this you will see:
second half #2
Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and transition from state-to-state depending on both the existing state and the current input. But you may not need that much control mechanism.
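As a rough illustration of that more explicit style (the state names here are made up, and patterns are matched instead of comparing whole lines), the same "first half then second half" check could be written with a state variable:
BEGIN { state = "WAIT_FIRST" }
state == "WAIT_SECOND" && /second half/ { print }
/first half/ { state = "WAIT_SECOND"; next }
{ state = "WAIT_FIRST" }
Run against the sample input above, this also prints only second half #2.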
awk is able to process from a start pattern until an end pattern:
/start-pattern/,/end-pattern/ {
print
}
I was looking for how to match
* Implements hook_entity_info_alter().
*/
function file_test_entity_type_alter(&$entity_types) {
so I created
/\* Implements hook_/,/function / {
print
}
which printed the content I needed. A more complex example is to skip lines and scrub off non-space parts. Note that awk is a record- (line-) and word- (split-by-space) oriented tool.
# start,end pattern match using comma
/ \* Implements hook_(.*?)\./,/function (.\S*?)/ {
# skip PHP multi line comment end
if ($0 ~ / \*\//) next
# Only print 3rd word
if ($0 ~ /Implements/) {
hook=$3
# scrub off opening parenthesis and everything following.
sub(/\(.*$/, "", hook)
print hook
}
# Only print function name without parenthesis
if ($0 ~ /function/) {
name=$2
# scrub off opening parenthesis and everything following.
sub(/\(.*$/, "", name)
print name
print ""
}
}
Hope this helps too.
See also GAWK ranges for more info.
Awk is really record-based. By default it thinks of a line as a record, but you can alter that with the RS (record separator) variable.
One way to approach this would be to do a first pass using sed (you could do this with awk, too, if you prefer), to separate the records with a different character like a form-feed. Then you can write your awk script where it will treat the group of lines as a single record.
For example, if this is your data:
animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
To separate the records with form-feeds:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|'
Now we'll take that and pass it through awk. Here's an example of conditionally printing a record:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' | awk '
BEGIN { RS="\f" }
/type: cat/ { print }'
outputs:
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
Edit: as a bonus, here's how to do it with awk-ward ruby (-014 means use form-feed (octal code 014) as the record separator):
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' |
ruby -014 -ne 'print if /type: cat/'
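With GNU awk this can also be done in one pass, without the sed pre-processing step, by using a multi-character record separator (a sketch that assumes the literal text "animal " only ever appears at the start of a record):
$ gawk 'BEGIN { RS = "animal " } /type: cat/ { printf "animal %s", $0 }' data
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat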
I do this sort of thing with sendmail logs, from time to time.
Given:
Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www#web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www#web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3#nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc#europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas#javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
I use a script something like this:
#!/usr/bin/awk -f
BEGIN {
search=ARGV[1]; # Grab the first command line option
delete ARGV[1]; # Delete it so it won't be considered a file
}
# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
line[$6]=sprintf("%s\n%s", line[$6], $0);
}
# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
show[$6];
}
# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
for(qid in show) {
print line[qid];
}
}
to get the following output:
$ mqsearch airtel /var/log/maillog
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas#javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
The idea here is that I'm printing all lines that match the Sendmail Queue ID of the string I want to search for. The structure of the code is of course a product of the structure of the log file, so you'll need to customize your solution for the data you're trying to analyse and extract.
awk '/pattern-start/,/pattern-end/'
ref
`pcregrep -M` works pretty well for this.
From pcregrep(1):
-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line.
When this option is set, the PCRE library is called in “multiline” mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered).
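For instance, with the animal data from the earlier answer, a multi-line pattern spanning a whole record might look like this (the exact regex is only a sketch):
$ pcregrep -M 'animal \d+\nname: .+\ntype: cat' data
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat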

Resources