procmail recipe to remove footer - linux

I've encountered a problem writing a procmail recipe.
Here is what I have so far:
:0
* ^X-Loop: myemail@gmail\.com
/dev/null
:0
# filtering email by number 60
* ^Subject:.*(60)
{
:0c:
${DEFAULT}
# Trying to take out input from the body
:0fb
| head -10
# Deliver it to the other folder
:0
mytest/
}
The problem occurs when procmail reads the body of the email. It shows output like this:
+96szV6aBDlD/F7vuiK8fUYVknMQPfPmPNikB+fdYLvbwsv9duz6HQaDuwhGn6dh9w2U
1sABcykpdyfWqWhLt5RzCqppYr5I4yCmB1CNOKwhlzI/w8Sx1QTzGT32G/ERTlbr91BM VmNQ==
MIME-Version: 1.0
Received: by 10.52.97.41 with SMTP id dx9mr14500007vdb.89.1337845760664; Thu,
24 May 2012 00:49:20 -0700 (PDT)
Received: by 10.52.34.75 with HTTP; Thu, 24 May 2012 00:49:20 -0700 (PDT)
Date: Thu, 24 May 2012 15:49:20 +0800
Message-ID: <CAE1Fe-r4Lid+YSgFTQdpsniE_wzeGjETWLLJJxat+HK94u1=AQ@mail.gmail.com>
Subject: 60136379500
From: my email <my email@gmail.com>
To: your email <your email@gmail.com>
Content-Type: multipart/alternative; boundary=20cf307f380654240604c0c37d07
--20cf307f380654240604c0c37d07
Content-Type: text/plain; charset=ISO-8859-1
hi
there
how
are
you
--20cf307f380654240604c0c37d07
+96szV6aBDlD/F7vuiK8fUYVknMQPfPmPNikB+fdYLvbwsv9duz6HQaDuwhGn6dh9w2U
1sABcykpdyfWqWhLt5RzCqppYr5I4yCmB1CNOKwhlzI/w8Sx1QTzGT32G/ERTlbr91BM VmNQ==
I have managed to get the output, but it does not work if the sender sends fewer than 3 lines, as the output then prints the footer of the email as well (because the footer falls within the range of head -10).
I only want the body of the email to be filtered (printed out to a text file) by procmail.
Is it possible? Can anyone show me the way? I'm at my wits' end. Thanks.

Attempting to treat a MIME multipart as just a lump of text is fraught with peril. In order to properly process the body, you should use a MIME-aware tool. But if you just want to assume that the first part is a text part and drop all other parts, you can create something fairly simple and robust.
# Truncate everything after first body part:
# Change second occurrence of --$MATCH to --$MATCH--
# and trim anything after it
:0fb
* ^Content-type: multipart/[a-z]+; boundary="\/[^"]+
| sed -e "1,/^--$MATCH$/b" -e "/^--$MATCH$/!b" -e 's//&--/' -e q
For elegance points, you might be able to develop the script to implement your 10-line body truncation at the same time, but at least this should hopefully get you started. (I would switch to awk or Perl at this point.)
:0fb
* ^Content-type: multipart/[a-z]+; boundary="\/[^"]+
| awk -v "b=--$MATCH" ' \
($0 == b || $0 == b "--") && seen++ { printf "%s--\n", $0; exit } \
!seen || p++ < 10'
Properly, the MIME part's headers should not count towards the line count.
This is slightly speculative; I assume by "footer" you mean the ugly base64-encoded attachment after the first body part, and of course, this recipe will do nothing at all for single-part messages. Maybe you want to fall back to your original recipe for those.
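For instance, a minimal sketch of such a fallback, assuming non-multipart messages can simply be truncated outright:
# Hedged sketch: if the message is not multipart, just truncate the body
:0fb
* !^Content-type: multipart/
| head -10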

I recently had a similar issue and solved it with this (adapted to the OP):
#trying to take out input from the body
:0fb
| sed -n '/^Content-Type/,/^--/ { /^Content-Type/b; /^--/b; p }'
Explanation, in general form:
sed -n '/begin/,/end/ { /begin/b; /end/b; p }'
-n --> turns automatic printing off
/begin/ --> start of the sed pattern range (the remaining commands only apply inside the range)
,/end/ --> end of the sed pattern range
{ /begin/b; --> the b (branch) command makes lines matching /begin/ skip the remaining commands
/end/b; --> (same as above), these lines skip the upcoming (p)rint command
p }' --> prints the lines inside the range that made it to this command
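A toy demonstration of the general form, with hypothetical BEGIN and END markers standing in for the real patterns:
$ printf 'junk\nBEGIN\nbody 1\nbody 2\nEND\nfooter\n' \
| sed -n '/BEGIN/,/END/ { /BEGIN/b; /END/b; p }'
body 1
body 2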

Related

Trailers option in git --pretty option

I was trying to extract a summary of contributions from git log and build a concise Excel/CSV report out of it.
I did try
git log --after="2020-12-10" --pretty=format:'"%h","%an","%ae","%aD","%s","(trailers:key="Reviewed By")"'
and the resulting CSV looks like this, with a blank column at the end:
...
"7c87963cc","XYZ","xyz#abc.com","Tue Dec 8 17:40:13 2020 +0000","[TTI] Add support for target hook in compiler.", ""
...
and the git log looks something like
commit 7c87963cc
Author: XYZ <xyz@abc.com>
Date: Tue Dec 8 17:40:13 2020 +0000
[TTI] Add support for target hook in compiler.
This adds some code in the TabeleGen ...
This is my body of commit.
Reviewed By: Sushant
Differential Revision: https://codereviews.com/DD8822
What I couldn't manage was extracting the Differential Revision string using the (trailers:key="Reviewed By") placeholder.
I couldn't find much on how to get this working.
I checked the git manual and I did try what it explains.
Is there something I might be missing in this command?
The expected output should have the text
https://codereviews.com/DD8822 at the last position in the above CSV output.
I'm not sure, but:
trailer keys cannot contain whitespace (therefore Reviewed By -> Reviewed-By, and Differential Revision -> Differential-Revision);
trailers must not be separated from each other by blank lines, only from the commit message body (therefore Reviewed By from your question is not considered a trailer).
I would also not recommend using CSV, but TSV instead: git's output is not aware of CSV syntax (escaping of semicolons and commas), so the generated document may be unparsable.
If your commit messages looked like this (hyphens instead of spaces, no blank-line delimiters):
commit 7c87963cc
Author: XYZ <xyz@abc.com>
Date: Tue Dec 8 17:40:13 2020 +0000
[TTI] Add support for target hook in compiler.
This adds some code in the TabeleGen ...
This is my body of commit.
Reviewed-By: Sushant
Differential-Revision: https://codereviews.com/DD8822
Then the following command would work for you:
git log --pretty=format:'%h%x09%an%x09%ae%x09%aD%x09%s%x09%(trailers:key=Reviewed-By,separator=%x20,valueonly)%x09%(trailers:key=Differential-Revision,separator=%x20,valueonly)'
producing short commit id, author name, author email, date, commit message, trailer Reviewed-By, and trailer Differential-Revision to your tab-separated values output.
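For the sample commit above, that should produce roughly this row (fields separated by tabs):
7c87963cc	XYZ	xyz@abc.com	Tue Dec 8 17:40:13 2020 +0000	[TTI] Add support for target hook in compiler.	Sushant	https://codereviews.com/DD8822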
If you may not change the old commit messages because your history is not safe to rewrite (it's published, pulled by peers, your tools are bound to the published commit hashes), then you have to process the git log output with sed, awk, perl, or any other text-transforming tool to generate your report. Say, process something like git log --pretty=format:'%x02%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%n%B', where the lines between ^B (STX) and EOF should be analyzed somehow (filtered for the trailers you are interested in), then joined to their group lines starting with ^B, and then the field and entry separator characters replaced with \t and nothing respectively.
But again, if you may edit the history by fixing the commit message trailers (not sure how much it may affect things), I'd recommend you do that: reject the idea of extra scripts processing trailers that git-interpret-trailers does not recognize, and simply fix the commit messages.
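If you do go that route, here is a hedged sketch of one way to normalize the trailer keys across the whole history (git filter-branch rewrites commit hashes and is slow on large repositories, so treat it as illustrative only):
# Rewrite every commit message, hyphenating the two trailer keys
git filter-branch -f --msg-filter '
    sed -e "s/^Reviewed By:/Reviewed-By:/" \
        -e "s/^Differential Revision:/Differential-Revision:/"
' -- --all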
Edit 1 (text tools)
If rewriting the history is not an option, then implementing some scripts may help you out. I'm pretty weak at writing powerful sed/awk/perl scripts, but let me try.
git log --pretty=format:'%x02%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%n%B' \
| gawk -f trailers.awk \
| sed '$!N;s/\n/\x1F/' \
| sed 's/[\x02\x1E]//g' \
| sed 's/\x1F/\x09/g'
How it works:
git generates a log made of data delimited with standard C0 control codes, assuming no such characters appear in your commit messages (STX, RS and US; I don't really know if this is a good place to use them like that, or whether I apply them semantically correctly);
gawk filters the log output, parsing the STX-started groups and extracting the trailers, generating "two-rowed" output (each odd line holds the regular data, each even line the comma-joined trailer values, even for missing trailers);
sed joins the odd and even lines in pairs (credits go to Karoly Horvath);
sed removes STX and RS;
sed replaces US with TAB.
Here is trailers.awk (again, I'm not an awk guy and have no idea how idiomatic the following script is, but it seems to work):
#!/usr/bin/awk -f
BEGIN {
    FIRST = 1
    delete TRAILERS
}

function print_joined_array(array) {
    if ( !length(array) ) {
        return
    }
    for ( i in array ) {
        if ( i > 0 ) {
            printf(",")
        }
        printf("%s", array[i])
    }
    printf("\x1F")
}

function print_trailers() {
    if ( FIRST ) {
        FIRST = 0
        return
    }
    print_joined_array(TRAILERS["Reviewed By"])
    print_joined_array(TRAILERS["Differential Revision"])
    print ""
}

/^\x02/ {
    print_trailers()
    print $0
    delete TRAILERS
}

match($0, /^([-_ A-Za-z0-9]+):\s+(.*)\s*/, M) {
    TRAILERS[M[1]][length(TRAILERS[M[1]])] = M[2]
}

END {
    print_trailers()
}
A couple of words on how the awk script works:
it assumes that records that do not require processing start with STX;
it greps each non-STX line for a Key Name: Value pattern and saves the results into a temporary array TRAILERS (which actually serves as a multimap, like Map<String, List<String>> in Java) for each record;
each record is written as is, but the trailers are written either just before a new record is detected or at EOF.
Edit 2 (better awk)
Well, I'm really weak at awk, so once I read more about awk's internal variables, I figured out the script could be reimplemented entirely to produce ready-to-use TSV output itself, without any post-processing with sed or perl. The shorter and improved version is:
#!/bin/bash
git log --pretty=format:'%x1E%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%B%x1E' \
| gawk -f trailers.awk
#!/usr/bin/awk -f
BEGIN {
    RS = "\x1E"
    FS = "\x1F"
    OFS = "\x09"
}

function extract(array, trailer_key,    __buffer) {
    for ( i in array ) {
        # Only accept the key at the very start of the line
        if ( index(array[i], trailer_key) == 1 ) {
            if ( length(__buffer) > 0 ) {
                __buffer = __buffer ","
            }
            # substr() is 1-based, so skip past the key itself
            __buffer = __buffer substr(array[i], length(trailer_key) + 1)
        }
    }
    return __buffer
}

NF > 1 {
    split($6, array, "\n")
    print $1, $2, $3, $4, $5, extract(array, "Reviewed By: "), extract(array, "Differential Revision: ")
}
Much more concise, easier to read, understand and maintain.
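Hypothetical usage, assuming the shell wrapper above is saved as report.sh next to trailers.awk:
$ ./report.sh > report.tsv    # import the TSV into your spreadsheet tool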

grep search two lines in a text file

I have some complex log files that I need to write some tools to process. I have been playing with awk, but I am not sure if awk is the right tool for this.
My log files are printouts of OSPF protocol decodes, which contain a text log of the various protocol pkts and their contents, with their various protocol fields identified along with their values. I want to process these files and print out only certain lines of the log that pertain to specific pkts. Each pkt's log entry can consist of a varying number of lines.
awk seems to be able to process a single line that matches a pattern. I can locate the desired pkt, but then I need to match patterns in the lines that follow in order to determine whether it is a pkt I want to print out.
Another way to look at this is that I would want to isolate several lines in the log file and print out those lines that are the details of a particular pkt, based on pattern matches on several lines.
Since awk seems to be line-based, I am not sure it is the best tool to use.
If awk can do this, how is it done? If not, any suggestions on which tool to use?
Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.
Consider this input:
how
second half #1
now
first half
second half #2
brown
second half #3
cow
As you have seen, it's easy to recognize a single pattern. Now, we can write an awk program that recognizes second half only when it is directly preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)
/second half/ {
    if (lastLine == "first half") {
        print
    }
}
{ lastLine = $0 }
If you run this you will see:
second half #2
Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and transition from state-to-state depending on both the existing state and the current input. But you may not need that much control mechanism.
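For illustration, here is a minimal sketch of that more explicit style, run against the same input as above (input.txt is a hypothetical file name):
# state 0 = waiting for "first half"; state 1 = the previous line was it
awk '
    state == 1 && /second half/ { print }
    { state = (/first half/ ? 1 : 0) }
' input.txt
which again prints only second half #2.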
awk is able to process from a start pattern until an end pattern:
/start-pattern/,/end-pattern/ {
print
}
I was looking for how to match
* Implements hook_entity_info_alter().
*/
function file_test_entity_type_alter(&$entity_types) {
so I created
/\* Implements hook_/,/function / {
print
}
which got the content I needed. A more complex example is to skip lines and scrub off the non-space parts. Note that awk is a record- (line-) and word- (split-by-space) oriented tool.
# start,end pattern match using comma
/ \* Implements hook_(.*?)\./,/function (.\S*?)/ {
    # skip the PHP multi-line comment end
    if ($0 ~ / \*\//) next
    # Only print the 3rd word
    if ($0 ~ /Implements/) {
        hook=$3
        # scrub off the opening parenthesis and everything after it
        sub(/\(.*$/, "", hook)
        print hook
    }
    # Only print the function name, without parentheses
    if ($0 ~ /function/) {
        name=$2
        # scrub off the opening parenthesis and everything after it
        sub(/\(.*$/, "", name)
        print name
        print ""
    }
}
Hope this helps too.
See also GAWK ranges for more info.
Awk is really record-based. By default it thinks of a line as a record, but you can alter that with the RS (record separator) variable.
One way to approach this would be to do a first pass using sed (you could do this with awk, too, if you prefer) to separate the records with a different character, like a form-feed. Then you can write your awk script so that it treats the group of lines as a single record.
For example, if this is your data:
animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
To separate the records with form-feeds:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|'
Now we'll take that and pass it through awk. Here's an example of conditionally printing a record:
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' | awk '
BEGIN { RS="\f" }
/type: cat/ { print }'
outputs:
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
Edit: as a bonus, here's how to do it with awk-ward ruby (-014 means use form-feed (octal code 014) as the record separator):
$ cat data | sed $'s|^\(animal.*\)|\f\\1|' |
ruby -014 -ne 'print if /type: cat/'
I do this sort of thing with sendmail logs, from time to time.
Given:
Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www@web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www@web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3@nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc@europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
I use a script something like this:
#!/usr/bin/awk -f
BEGIN {
    search=ARGV[1]; # Grab the first command line option
    delete ARGV[1]; # Delete it so it won't be considered a file
}

# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
    line[$6]=sprintf("%s\n%s", line[$6], $0);
}

# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
    show[$6];
}

# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
    for(qid in show) {
        print line[qid];
    }
}
to get the following output:
$ mqsearch airtel /var/log/maillog
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
The idea here is that I'm printing all lines that match the Sendmail Queue ID of the string I want to search for. The structure of the code is of course a product of the structure of the log file, so you'll need to customize your solution for the data you're trying to analyse and extract.
awk '/pattern-start/,/pattern-end/'
`pcregrep -M` works pretty well for this.
From pcregrep(1):
-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence, the output ends at the end of that line.
When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered).
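For instance, against the first half / second half input from the earlier answer (a hedged example; input.txt is hypothetical):
$ pcregrep -M 'first half\nsecond half' input.txt
first half
second half #2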

searching and storing specific part of file

I have a problem with searching for and storing a specific part of a file into a variable in the bash shell.
Here is one sample of my files:
From root@machine2.com Mon Jan 7 16:56:50 2013
Return-Path: <root@machine2.com>
X-Original-To: smsto+9121403571@machine2.com
Delivered-To: smsto+9121403571@machine2.com
Received: by machine2.com (Postfix, from userid 0)
id 43C191A1ECE; Mon, 7 Jan 2013 16:56:50 +0330 (IRST)
Date: Mon, 07 Jan 2013 16:56:50 +0330
To: smsto+9121403571@machine2.com
Subject: =?us-ascii?Q?Testing\=08?=
User-Agent: Heirloom mailx 12.5 7/5/10
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <20130107132650.43C191A1ECE@machine2.com>
From: root@machine2.com (root)
My note ..
blah blah ...
What I need to do is store some of these fields into variables (parameters like FROM, SUBJECT and EMAIL BODY).
For the FROM and SUBJECT fields it was easy to search for and get the data.
But for EMAIL BODY, as you can see, there is no label to search for... so I was thinking one possible way to get the email body would be to search for the FROM label and then use its line number to get the EMAIL BODY from the next line to the end of the file.
Unfortunately, I'm not that familiar with Linux commands to do such a thing.
Please help me.
You can use sed to print from the blank line to the end of the file:
$ sed -n '/^\s*$/,$p' file
My note ..
blah blah ...
# Command substitution to store into a variable
$ body=$(sed -n '/^\s*$/,$p' file)
$ echo $body
My note .. blah blah ...
# Remember to quote variables to respect newlines
$ echo "$body"
My note ..
blah blah ...
If you don't want to include the first blank line use:
$ sed -n '/^\s*$/,$ {/^.*[^ ]\+.*/,$p}' file
Or strip all blank lines in the body:
$ sed -n '/^\s*$/,$ {/^.*[^ ]\+.*/p}' file
Another way to approach the problem is to look for the first empty line (which occurs right after the 'From:' line you talk about) and print everything after that. You can do this using awk and setting a null record separator. For example:
BODY=$(awk 'NR>1' RS= file)
However, the advantage/problem of the above is that blank lines will be discarded. If this is undesirable, here's a method that should satisfy:
BODY=$(awk 'i==1; /^$/ { i=1 }' file)
Then:
echo "$BODY"
Results:
My note ..
blah blah ...
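For the FROM and SUBJECT parts, a hedged sketch using formail, which ships with procmail (mail.txt is a hypothetical file name), alongside the body extraction above:
# formail -x extracts the named header's value; -z trims leading whitespace
FROM=$(formail -zxFrom: < mail.txt)
SUBJECT=$(formail -zxSubject: < mail.txt)
BODY=$(sed -n '/^\s*$/,$p' mail.txt)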

How to extract file data from an HTTP MIME-encoded message in Linux?

I have a program that accepts HTTP POSTs of files and writes the whole POST result into a file. I want to write a script to delete the HTTP headers and leave only the binary file data. How do I do it?
The file content is below (the data between Content-Type: application/octet-stream and ------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3 is what I want):
POST /?user_name=vvvvvvvv&size=837&file_name=logo.gif& HTTP/1.1^M
Accept: text/*^M
Content-Type: multipart/form-data; boundary=----------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
User-Agent: Shockwave Flash^M
Host: 192.168.0.198:9998^M
Content-Length: 1251^M
Connection: Keep-Alive^M
Cache-Control: no-cache^M
Cookie: cb_fullname=ddddddd; cb_user_name=cdc^M
^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
Content-Disposition: form-data; name="Filename"^M
^M
logo.gif^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
Content-Disposition: form-data; name="Filedata"; filename="logo.gif"^M
Content-Type: application/octet-stream^M
^M
GIF89an^#I^^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3^M
Content-Disposition: form-data; name="Upload"^M
^M
Submit Query^M
------------KM7cH2GI3cH2Ef1Ij5gL6GI3Ij5GI3-
Do you want to do this as the file is coming over, or is this something you want to do after the file has arrived?
Almost any scripting language should work. My AWK is a bit rusty, but...
awk '/^Content-Type: application\/octet-stream/,/^--------/'
That should print everything between the application/octet-stream line and the ---------- line. It might also include both of those lines, which means you'd have to do something a bit more complex:
BEGIN {state = 0}
{
    if ($0 ~ /^------------/) {
        state = 0;
    }
    if (state == 1) {
        print $0
    }
    if ($0 ~ /^Content-Type: application\/octet-stream/) {
        state = 1;
    }
}
The application\/octet-stream test comes after the print statement because you want to set state to 1 only after you have seen the application/octet-stream line (so that header line itself is not printed).
Of course, being Unix, you could pipe the output of your program through awk and then save the file.
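For instance (hypothetical program and file names, with the state-machine script above saved as extract.awk):
$ ./http_capture | awk -f extract.awk > logo.gif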
If you use Python, email.parser.Parser will allow you to parse a multipart MIME document.
This may be a crazy idea, but I would try stripping the headers with procmail.
Look at the MIME::Tools suite for Perl. It has a rich set of classes; I'm sure you could put something together in just a few lines.
This probably contains some typos or something, but bear with me anyway. First, determine the boundary (input is the file containing the data; pipe if necessary):
boundary=`grep '^Content-Type: multipart/form-data; boundary=' input|sed 's/.*boundary=//'`
Then filter the Filedata part:
fd='Content-Disposition: form-data; name="Filedata"'
sed -n "/$fd/,/$boundary/p"
The last part is to filter out a few extra lines (the header lines up to and including the empty line, and the boundary itself), so change the last line of the previous step to:
sed -n "/$fd/,/$boundary/p" | sed '1,/^$/d' | sed '$d'
sed -n "/$fd/,/$boundary/p" filters the lines between the Filedata header and the boundary (inclusive),
sed '1,/^$/d' is deleting everything up to and including the first line (so removes the headers) and
sed '$d' removes the last line (the boundary).
After this, you wait for Dennis (see the comments) to optimize it, and you get this:
sed "1,/$fd/d;/^$/d;/$boundary/,$d"
Now that you've come this far, scratch all this and do what Ignacio suggested. Reason: this probably won't work reliably here, as GIF is binary data.
Ah, it was a good exercise! Anyway, for the lovers of sed, here's an excellent page:
http://sed.sourceforge.net/sed1line.txt
Outstanding information.
