Multiple greps in pipeline not terminating after completion


I seem to be having a problem with a simple grep statement not finishing/terminating after it has completed.
For example:
grep -v -E 'syslogd [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}: restart' |
grep -v 'printStats: total reads from cache:' /var/log/customlog.log >\
/tmp/filtered_log.tmp
The above statement strips out the matching lines and saves the rest into a temp file. However, after grep finishes processing the entire file, the shell script hangs and cannot proceed. The same happens when I run the command manually on the command line. Essentially, combining multiple grep statements causes pager-like behavior (as with more/less).
Does anyone have any suggestions to overcome this limitation? Ideally I wouldn't want to do the following, given that the customlog.log file might get huge at times.
cat /var/log/customlog.log |
grep -v -E 'syslogd [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}: restart' |
grep -v 'printStats: total reads from cache:' > /tmp/filtered_log.tmp
Thanks,
Tony

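The first grep has no file argument, so it reads from standard input (your terminal) and waits there, while the second grep ignores the pipe because it was given a file name. Move the file name to the first grep:
grep -v -E \
'syslogd [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}: restart' /var/log/customlog.log |
grep -v 'printStats: total reads from cache:' > /tmp/filtered_log.tmp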
But you can also combine the two greps:
grep -v -E \
-e 'syslogd [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}: restart' \
-e 'printStats: total reads from cache:' /var/log/customlog.log > \
/tmp/filtered_log.tmp
This saves a bit of CPU and fixes your error at the same time.
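To try the combined form without touching a real log, here is a small self-contained sketch. The sample log lines and the temporary paths are invented for illustration; only the grep invocation mirrors the command above.

```shell
#!/bin/bash
# Build a throwaway sample log (contents are invented for this demo).
sample_log=$(mktemp)
filtered_log=$(mktemp)
cat > "$sample_log" <<'EOF'
Jan  1 00:00:01 host syslogd 1.4.1: restart
Jan  1 00:00:02 host app: printStats: total reads from cache: 42
Jan  1 00:00:03 host app: something worth keeping
EOF

# One grep, two -e patterns. The file name is an argument to this grep,
# so nothing waits on the terminal for input.
grep -v -E \
  -e 'syslogd [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}: restart' \
  -e 'printStats: total reads from cache:' "$sample_log" > "$filtered_log"

cat "$filtered_log"
```

Only the third sample line should survive the filter, and the command returns to the prompt immediately.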
BTW, another possible issue: What if two instances of this script are run at the same time? Both will be using the same temp file. This probably isn't an issue in this particular case, but you might as well get used to developing scripts for that situation. I recommend that you use $$ to put the process ID in your temporary file:
tempFileName="/tmp/filtered_log.$$.tmp"
grep -v -E -e [blah... blah.. blah] /var/log/customlog.log > "$tempFileName"
Now, if two different people are running this process, you won't get them using the same temp file.
Appended
As pointed out by uwe-kleine-konig, you're actually better off using mktemp:
tempFileName=$(mktemp filtered_log.XXXXX)
grep -v -E -e [blah... blah.. blah] /var/log/customlog.log > "$tempFileName"
Thanks for the suggestion.
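A minimal sketch of the mktemp approach with cleanup, assuming a mktemp that accepts a template (GNU coreutils does). The filter_log function name, the trap line, and the sample input are additions for illustration and are not part of the original script.

```shell
#!/bin/bash
# The body in ( ) runs in a subshell, so the EXIT trap fires when the function ends.
filter_log() (
  tempFileName=$(mktemp /tmp/filtered_log.XXXXXX) || exit 1
  trap 'rm -f "$tempFileName"' EXIT   # clean up however the subshell exits
  # Read the log from standard input and keep only the non-matching lines.
  grep -v 'printStats: total reads from cache:' > "$tempFileName"
  cat "$tempFileName"
)

# Two invented lines stand in for the real log.
result=$(printf '%s\n' 'keep this line' 'printStats: total reads from cache: 7' | filter_log)
printf '%s\n' "$result"
```

Each run gets its own uniquely named file, so two simultaneous runs cannot collide, and the trap avoids leaving stale files in /tmp.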

Related

Loop to filter out lines from apache log files

I have several apache access files that I would like to clean up a bit before I analyze them. I am trying to use grep in the following way:
grep -v term_to_grep apache_access_log
I have several terms that I want to grep, so I am piping every grep action as follow:
grep -v term_to_grep_1 apache_access_log | grep -v term_to_grep_2 | grep -v term_to_grep_3 | grep -v term_to_grep_n > apache_access_log_cleaned
Until here my rudimentary script works as expected! But I have many apache access logs, and I don't want to do that for every file. I have started to write a bash script but so far I couldn't make it work. This is my try:
for logs in ./access_logs/*;
do
cat $logs | grep -v term_to_grep | grep -v term_to_grep_2 | grep -v term_to_grep_3 | grep -v term_to_grep_n > $logs_clean
done;
Could anyone point me out what I am doing wrong?

If you have a variable and you append _clean to its name, that's a new variable, and not the value of the old one with _clean appended. To fix that, use curly braces:
$ var=file.log
$ echo "<$var>"
<file.log>
$ echo "<$var_clean>"
<>
$ echo "<${var}_clean>"
<file.log_clean>
Without the braces, your pipeline tries to redirect to the empty string, which results in an error. Note that "$var"_clean would also work.
As for your pipeline, you could combine that into a single grep command:
grep -Ev 'term_to_grep|term_to_grep_2|term_to_grep_3|term_to_grep_n' "$logs" > "${logs}_clean"
No cat needed, only a single invocation of grep.
Or you could stick all your terms into a file:
$ cat excludes
term_to_grep_1
term_to_grep_2
term_to_grep_3
term_to_grep_n
and then use the -f option:
grep -vf excludes "$logs" > "${logs}_clean"
If your terms are strings and not regular expressions, you might be able to speed this up by using -F ("fixed strings"):
grep -vFf excludes "$logs" > "${logs}_clean"
I think GNU grep checks that for you on its own, though.
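Putting the pieces together, the loop from the question could look roughly like the sketch below. The directory layout, log contents, and exclude terms are invented so that the example runs on its own.

```shell
#!/bin/bash
# Throwaway directory with two tiny "logs" and an excludes file (all invented).
workdir=$(mktemp -d)
mkdir "$workdir/access_logs"
printf '%s\n' 'GET /index.html' 'GET /favicon.ico' > "$workdir/access_logs/a.log"
printf '%s\n' 'GET /robots.txt' 'GET /about.html'  > "$workdir/access_logs/b.log"
printf '%s\n' 'favicon.ico' 'robots.txt' > "$workdir/excludes"

for logs in "$workdir"/access_logs/*.log; do
  # -F: fixed strings, -v: invert the match, -f: read patterns from a file.
  # The braces keep _clean from being read as part of the variable name.
  grep -vFf "$workdir/excludes" "$logs" > "${logs}_clean"
done

cat "$workdir"/access_logs/*_clean
```

Each input file gets its own cleaned copy next to it, named after the original with _clean appended.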

In your loop, $logs_clean refers to the same (unset) variable on every iteration, so you never get a separate result file per log.
If a single combined output file is fine, you don't need a loop; use this instead:
grep -hEv 'term_to_grep|term_to_grep_2|term_to_grep_3' ./access_logs/* > "$logs_clean"
Here -h stops grep from prefixing each line with its file name when several files are given, and logs_clean must be set to the output file name beforehand.
Note, it is always helpful to start a Bash script with set -eEuCo pipefail. This catches most common errors: -u would have stopped the script as soon as it expanded the unset $logs_clean variable, and -C refuses to overwrite an existing file with >.

Problems with tail -f and awk? [duplicate]

Is it possible to use grep on a continuous stream?
What I mean is sort of a tail -f <file> command, but with grep on the output in order to keep only the lines that interest me.
I've tried tail -f <file> | grep pattern but it seems that grep only runs once tail finishes, that is to say never.

Turn on grep's line buffering mode when using BSD grep (FreeBSD, Mac OS X etc.)
tail -f file | grep --line-buffered my_pattern
It looks like a while ago --line-buffered didn't matter for GNU grep (used on pretty much any Linux) as it flushed by default (YMMV for other Unix-likes such as SmartOS, AIX or QNX). However, as of November 2020, --line-buffered is needed (at least with GNU grep 3.5 in openSUSE, but it seems generally needed based on comments below).

I use the tail -f <file> | grep <pattern> all the time.
It will wait till grep flushes, not till it finishes (I'm using Ubuntu).

I think that your problem is that grep uses some output buffering. Try
tail -f file | stdbuf -o0 grep my_pattern
This sets grep's output buffering mode to unbuffered.

If you want to find matches in the entire file (not just the tail), and you want it to sit and wait for any new matches, this works nicely:
tail -c +0 -f <file> | grep --line-buffered <pattern>
The -c +0 flag says that the output should start 0 bytes (-c) from the beginning (+) of the file.

In most cases, you can tail -f /var/log/some.log |grep foo and it will work just fine.
If you need to use multiple greps on a running log file and you find that you get no output, you may need to stick the --line-buffered switch into your middle grep(s), like so:
tail -f /var/log/some.log | grep --line-buffered foo | grep bar
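The chained form can be tried on a throwaway file with the sketch below. It assumes GNU coreutils: timeout is used here only to make the example finite, since tail -f would otherwise never exit, and the log contents are invented.

```shell
#!/bin/bash
log=$(mktemp)
printf '%s\n' 'foo bar 1' 'foo only' 'bar only' 'foo bar 2' > "$log"

# timeout stops tail after 2 seconds so that the demo terminates;
# -n +1 makes tail start from the first line of the file.
# The middle grep writes to a pipe, so it gets --line-buffered.
matches=$(timeout 2 tail -n +1 -f "$log" | grep --line-buffered foo | grep bar)
printf '%s\n' "$matches"
```

Only the lines containing both foo and bar come out. In a live session you would drop timeout and the command substitution and watch the matches appear as the log grows.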

Consider this answer an enhancement. I usually use
tail -F <fileName> | grep --line-buffered <pattern> -A 3 -B 5
-F is better when the file gets rotated (-f will not follow the new file after rotation).
-A and -B show the lines just after and before each match; these blocks appear between dashed separator lines.
But personally I prefer the following:
tail -F <file> | less
This is very useful if you want to search inside streamed logs, that is, go back and forth and look more closely.

Didn't see anyone offer my usual go-to for this:
less +F <file>
ctrl + c
/<search term>
<enter>
shift + f
I prefer this, because you can use ctrl + c to stop and navigate through the file whenever, and then just hit shift + f to return to the live, streaming search.

sed (the stream editor) would be a better choice:
tail -n0 -f <file> | sed -n '/search string/p'
and then if you wanted the tail command to exit once you found a particular string:
tail --pid=$(($BASHPID+1)) -n0 -f <file> | sed -n '/search string/{p; q}'
Obviously a bashism: $BASHPID will be the process ID of the tail command. This assumes that sed, being next after tail in the pipe, gets process ID $BASHPID+1. PIDs are not guaranteed to be sequential, so treat this as a fragile trick.
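Because consecutive process IDs are not guaranteed, a more conservative sketch is to let sed quit on the first match and bound the wait with timeout (GNU coreutils). The background writer, the log contents, and the 10-second limit are invented for the demo.

```shell
#!/bin/bash
log=$(mktemp)

# Background writer standing in for whatever appends to the real log.
(
  sleep 1
  echo 'unrelated line' >> "$log"
  echo 'found the search string' >> "$log"
  sleep 1
  echo 'later line' >> "$log"
) &

# sed prints the first match and quits. tail exits once it notices the closed
# pipe (at the latest on its next write), and timeout caps the wait at 10 s.
first_match=$(timeout 10 tail -n0 -f "$log" | sed -n '/search string/{p;q}')
wait
printf '%s\n' "$first_match"
```

The trade-off is that tail may linger until the next line is written or the timeout expires, but nothing depends on guessing a PID.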

Yes, this will actually work just fine. Grep and most Unix commands operate on streams one line at a time. Each line that comes out of tail will be analyzed and passed on if it matches.

This command works for me (SUSE):
mail-srv:/var/log # tail -f /var/log/mail.info |grep --line-buffered LOGIN >> logins_to_mail
It collects logins to the mail service.

Coming a bit late to this question; since I consider this kind of work an important part of monitoring, here is my (not so short) answer...
Following logs using bash
1. Command tail
This command is a little more powerful than the already published answers suggest.
Difference between the follow options tail -f and tail -F, from the man page:
-f, --follow[={name|descriptor}]
output appended data as the file grows;
...
-F same as --follow=name --retry
...
--retry
keep trying to open a file if it is inaccessible
This means: by using -F instead of -f, tail will re-open the file(s) when they are removed (on log rotation, for example).
This is useful for watching a log file over many days.
Ability to follow more than one file simultaneously
I've already used:
tail -F /var/www/clients/client*/web*/log/{error,access}.log /var/log/{mail,auth}.log \
/var/log/apache2/{,ssl_,other_vhosts_}access.log \
/var/log/pure-ftpd/transfer.log
to follow events across hundreds of files... (read the rest of this answer to see how to make that readable... ;)
Using the -n switch (don't use -c for line buffering!). By default tail shows the last 10 lines. This can be tuned:
tail -n 0 -F file
This follows the file, but only new lines are printed.
tail -n +0 -F file
This prints the whole file before following its progress.
2. Buffer issues when piping:
If you plan to filter outputs, consider buffering! See the -u option for sed, --line-buffered for grep, or the stdbuf command:
tail -F /some/files | sed -une '/Regular Expression/p'
This is (a lot more efficient than using grep and) a lot more reactive than if you don't use the -u switch in the sed command.
tail -F /some/files |
sed -une '/Regular Expression/p' |
stdbuf -i0 -o0 tee /some/resultfile
3. Recent journaling system
On recent systems, instead of tail -f /var/log/syslog you have to run journalctl -xf, in nearly the same way...
journalctl -axf | sed -une '/Regular Expression/p'
But read the man page; this tool was built for log analysis!
4. Integrating this in a bash script
Colored output of two files (or more)
Here is a sample script that watches many files, coloring the output of the 1st file differently from the others:
#!/bin/bash
tail -F "$@" |
sed -une "
/^==> /{h;};
//!{
G;
s/^\\(.*\\)\\n==>.*${1//\//\\\/}.*<==/\\o33[47m\\1\\o33[0m/;
s/^\\(.*\\)\\n==> .* <==/\\o33[47;31m\\1\\o33[0m/;
p;}"
It works fine on my host, running:
sudo ./myColoredTail /var/log/{kern.,sys}log
Interactive script
You may be watching logs in order to react to events?
Here is a little script that plays a sound when a USB device appears or disappears, but the same script could send mail, or do any other interaction, like powering on a coffee machine...
#!/bin/bash
exec {tailF}< <(tail -F /var/log/kern.log)
tailPid=$!
while :;do
read -rsn 1 -t .3 keyboard
[ "${keyboard,}" = "q" ] && break
if read -ru $tailF -t 0 _ ;then
read -ru $tailF line
case $line in
*New\ USB\ device\ found* ) play /some/sound.ogg ;;
*USB\ disconnect* ) play /some/othersound.ogg ;;
esac
printf "\r%s\e[K" "$line"
fi
done
echo
exec {tailF}<&-
kill $tailPid
You can quit by pressing the Q key.

you certainly won't succeed with
tail -f /var/log/foo.log |grep --line-buffered string2search
when you use "colortail" as an alias for tail, eg. in bash
alias tail='colortail -n 30'
you can check by
type tail
if this outputs something like
tail is an alias of colortail -n 30.
then you have your culprit :)
Solution:
remove the alias with
unalias tail
ensure that you're using the 'real' tail binary by this command
type tail
which should output something like:
tail is /usr/bin/tail
and then you can run your command
tail -f foo.log |grep --line-buffered something
Good luck.

Use awk (another great Unix utility) instead of grep where you don't have the line-buffered option! It will continuously stream your data from tail.
This is how you use grep:
tail -f <file> | grep pattern
This is how you would use awk
tail -f <file> | awk '/pattern/{print $0}'
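One caveat worth a sketch: most awk implementations buffer their output when stdout is a pipe or a file, so if awk sits in the middle of a pipeline, an explicit fflush() after each print keeps lines moving. In the example below, printf is a finite stand-in for tail -f and the input lines are invented.

```shell
#!/bin/bash
# printf stands in for `tail -f <file>` so that the example terminates.
result=$(printf '%s\n' 'pattern one' 'noise' 'pattern two' |
  awk '/pattern/ { print; fflush() }')
printf '%s\n' "$result"
```

With a real tail -f in front and another command behind, the fflush() call plays the same role as grep's --line-buffered.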

xargs bash -c unexpected token

I'm experiencing an issue calling xargs inside a bash script to parallelize the launch of a function.
I have this line:
grep -Ev '^#|^$' "$listOfTables" | xargs -d '\n' -l1 -I args -P"$parallels" bash -c "doSqoop 'args'"
that launches the function doSqoop that I previously exported.
I am passing to xargs and then to bash -c a single, very long line, containing fields that I split and handle inside the function.
It is something like schema|tab|dest|desttab|query|splits|.... that I read from a file, via the grep command above. I am fine with this solution, I know xargs can split the line on | but I'm ok this way.
It used to work well until I had to add another field at the end, which contains this kind of value:
field1='varchar(12)',field2='varchar(4)',field3='timestamp',....
Now I have this error:
bash: -c: line 0: syntax error near unexpected token '('
I tried to escape the parentheses and single quotes, without success.
It appears to me that bash -c is interpreting the arguments.

Use GNU parallel, which can call exported functions and also has an easier syntax and many more capabilities.
Your sample command could be replaced with
grep -Ev '^#|^$' file | parallel doSqoop
Test with below script:
#!/bin/bash
doSqoop() {
printf "%s\n" "$@"
}
export -f doSqoop
grep -Ev '^#|^$' file | parallel doSqoop
You can also set the number of processes with the -P option, otherwise it matches the number of cores in your system:
grep -Ev '^#|^$' file | parallel -P "$num" doSqoop
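If GNU parallel is not available, a possible alternative is to keep xargs but hand the line to the inner bash as a positional parameter instead of pasting it into the command string. This is a sketch, assuming GNU xargs (for -d); the doSqoop body and the sample input are stand-ins for the real ones.

```shell
#!/bin/bash
# Stand-in for the real function: just echo the line it was given.
doSqoop() {
  printf 'got: %s\n' "$1"
}
export -f doSqoop

# Sample input with the quotes and parentheses that broke the original command.
listOfTables=$(mktemp)
cat > "$listOfTables" <<'EOF'
# comment line, skipped by grep
schema|tab|dest|field1='varchar(12)',field2='varchar(4)'
EOF

# The line is passed as $1 to the inner bash instead of being pasted into
# the command string, so bash never parses its contents as shell syntax.
# The trailing _ fills $0 of the inner shell.
output=$(grep -Ev '^#|^$' "$listOfTables" |
  xargs -d '\n' -n 1 -P 2 bash -c 'doSqoop "$1"' _)
printf '%s\n' "$output"
```

With the original -I args form, the text of each line became part of the script that bash -c parsed, which is why the parentheses triggered a syntax error.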

Why `read -t` is not timing out in bash on RHEL?

Why doesn't read -t time out when reading from a pipe on RHEL5 or RHEL6?
Here is my example, which doesn't time out on my RHEL boxes while reading from the pipe:
tail -f logfile.log | grep 'something' | read -t 3 variable
If I'm correct, read -t 3 should time out after 3 seconds?
Many thanks in advance.
Chris
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

The solution given by chepner should work.
An explanation why your version doesn't is simple: When you construct a pipe like yours, the data flows through the pipe from the left to the right. When your read times out however, the programs on the left side will keep running until they notice that the pipe is broken, and that happens only when they try to write to the pipe.
A simple example is this:
cat | sleep 5
After five seconds the pipe will be broken because sleep will exit, but cat will nevertheless keep running until you press return.
In your case that means, until grep produces a result, your command will keep running despite the timeout.

While not a direct answer to your specific question, you will need to run something like
read -t 3 variable < <( tail -f logfile.log | grep "something" )
in order for the newly set value of variable to be visible after the pipeline completes. See if this times out as expected.
Since you are simply using read as a way of exiting the pipeline after a fixed amount of time, you don't have to worry about the scope of variable. However, grep may find a match without printing it within your timeout due to its own internal buffering. You can disable that (with GNU grep, at least), using the --line-buffered option:
tail -f logfile.log | grep --line-buffered "something" | read -t 3
Another option, if available, is the timeout command as a replacement for the read:
timeout 3 tail -f logfile.log | grep -q --line-buffered "something"
Here, we kill tail after 3 seconds, and use the exit status of grep in the usual way.
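To check that read -t itself times out, independent of tail and grep, a small sketch like the following can help. The silent producer (sleep 3) is an invented stand-in for a log that produces no matching line in time.

```shell
#!/bin/bash
# The producer stays silent for 3 seconds, longer than the 1-second timeout.
if read -r -t 1 line < <(sleep 3; echo 'too late'); then
  status=0
else
  status=$?   # exit status of the read that just failed
fi
echo "read exit status: $status"
```

Bash documents an exit status greater than 128 when the timeout is exceeded, so the script can tell a timeout apart from a successful read.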

I don't have a RHEL server to test your script right now, but I would bet that read is exiting on timeout and working as it should. Try running:
grep 'something' | strace bash -c "read -t 3 variable"
and you can confirm that.

How do I grep multiple lines (output from another command) at the same time?

I have a Linux driver running in the background that is able to return the current system data/stats. I view the data by running a console utility (let's call it dump-data) in a console. All data is dumped every time I run dump-data. The output of the utility is like below
Output:
- A=reading1
- B=reading2
- C=reading3
- D=reading4
- E=reading5
...
- variableX=readingX
...
The list of readings returned by the utility can be really long. Depending on the scenario, certain readings would be useful while everything else would be useless.
I need a way to grep only the useful readings whose names might have nothing in common (via a bash script). I.e. sometimes I'll need to collect A,D,E; and other times I'll need C,D,E.
I'm attempting to graph the readings over time to look for trends, so I can't run something like this:
# forgive my pseudocode
Loop
dump-data | grep A
dump-data | grep D
dump-data | grep E
End Loop
to collect A,D,E, as that would give me readings from 3 separate calls of dump-data, which would not be accurate.

If you want to save all results of grep in the same file, you can just join all the expressions into one:
grep -E 'expr1|expr2|expr3'
But if you want to have results (for expr1, expr2 and expr3) in separate files, things get more interesting.
You can do this using tee >(command).
For example, here I process the same pipe with three different commands:
$ echo abc | tee >(sed s/a/_a_/ > file1) | tee >(sed s/b/_b_/ > file2) | sed s/c/_c_/ > file3
$ grep "" file[123]
file1:_a_bc
file2:a_b_c
file3:ab_c_
But that command seems too complex.
I would rather save the dump-data results to a file and then grep it.
TEMP=$(mktemp /tmp/dump-data-XXXXXXXX)
dump-data > "$TEMP"
grep A "$TEMP"
grep B "$TEMP"
grep C "$TEMP"
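If a temporary file is not wanted, the same idea works with a shell variable: call the utility once, then filter the captured snapshot. In this sketch dump_data is a hypothetical stand-in for the real dump-data utility, and its output format is copied from the question.

```shell
#!/bin/bash
# Hypothetical stand-in for the real dump-data utility.
dump_data() {
  printf -- '- %s\n' 'A=reading1' 'B=reading2' 'C=reading3' 'D=reading4' 'E=reading5'
}

# Call the utility once and keep the whole snapshot in memory.
snapshot=$(dump_data)

# Every filter now sees the same readings. Names are anchored so that
# "A" does not also match a longer name such as "AB".
wanted=$(grep -E '^- (A|D|E)=' <<< "$snapshot")
printf '%s\n' "$wanted"
```

Because all filters read from the same snapshot, the selected readings are guaranteed to come from a single call.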

You can use dump-data | grep -E "A|D|E". Note the -E option of grep. Alternatively you could use egrep without the -E option.

you can simply use:
dump-data | grep -E 'A|D|E'

awk '/MY PATTERN/{print > "matches-"FILENAME;}' myfile{1,3}
thx Guru at Stack Exchange
