Shell - How to find a word at a certain point in a message - linux

I want to change my command:
anzahl=`cat $1 | grep -i "error" | wc -l`
This command also counts messages which are like this:
2017-07-15 03:07:02,746 [INFO] blabla:123 #blabla:123 - rhsmd started. Error.
But that line is tagged [INFO], so I don't want it to be counted.
I just want messages like this:
2017-07-15 06:12:45,362 [ERROR] blabla:123 #blabla:123- Either the consumer is not registered or the certificates are corrupted. Certificate update using daemon failed.
Any tips on how I can do this?

Generally you want:
anzahl=$(grep -c '\[ERROR\]' "$1")
This searches for the literal string [ERROR] in the log file; -c prints the number of matching lines, which makes wc -l superfluous.
However, this still matches [ERROR] at any position in a line. While that should be good enough in most cases, this awk command is more precise:
anzahl=$(awk '$3=="[ERROR]"{c++}END{print c+0}' "$1")
This command checks whether [ERROR] is exactly the third whitespace-separated field of a line and counts those lines. At the end of input it prints the count (c+0 prints 0 instead of an empty line when nothing matched).
Btw, German variable names are not a good fit for an international audience such as the one on Stack Overflow. I recommend using English variable names, e.g. count.
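To see how the counting approaches above differ, here is a minimal sketch on a small sample log. The first two lines come from the question; the third line and the variable names loose, tagged and strict are made up for illustration.

```shell
# Hypothetical sample log: lines 1 and 2 are from the question,
# line 3 is invented to show an [ERROR] tag inside an INFO message.
log=$(mktemp)
cat > "$log" <<'EOF'
2017-07-15 03:07:02,746 [INFO] blabla:123 #blabla:123 - rhsmd started. Error.
2017-07-15 06:12:45,362 [ERROR] blabla:123 #blabla:123- Either the consumer is not registered or the certificates are corrupted. Certificate update using daemon failed.
2017-07-15 07:00:00,000 [INFO] blabla:123 #blabla:123 - retrying after [ERROR] response
EOF

# Case-insensitive substring match: counts every line mentioning "error".
loose=$(grep -ci 'error' "$log")
# Literal [ERROR] anywhere in the line.
tagged=$(grep -c '\[ERROR\]' "$log")
# [ERROR] only when it is the third whitespace-separated field.
strict=$(awk '$3 == "[ERROR]" { c++ } END { print c + 0 }' "$log")

echo "loose=$loose tagged=$tagged strict=$strict"
rm -f "$log"
```

On this sample only the awk variant ignores the INFO line that merely quotes the tag.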

If you don't actually want a regular expression but really just want to count a string, there are grep options for that:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by new-
lines, any of which is to be matched.
So your command should be:
anzahl=$(grep -c -F '[ERROR]' "$1")
Of course, even that string might appear some place other than the third whitespace-delimited field of the line. If you want to stick with grep rather than switching to a tool like awk for your counting, you can do so by going back to what is perhaps an awkward regular expression:
anzahl=$(grep -c -E '^[^ ]+ [^ ]+ [[]ERROR[]]' "$1")
This uses grep's -E option to specify that you're using an extended regular expression. The expression consists of two runs of non-space characters, each followed by a space, followed by your error tag; [[] and []] are bracket expressions that match a literal [ and ].
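A quick way to check the difference between the fixed-string count and the anchored expression is to run both on a tiny file. This is only a sketch; the two log lines and the variable names fixed and anchored are assumptions for the demo.

```shell
# Hypothetical sample: the second line mentions [ERROR] only in the message text.
log=$(mktemp)
cat > "$log" <<'EOF'
2017-07-15 06:12:45,362 [ERROR] blabla:123 #blabla:123- Certificate update using daemon failed.
2017-07-15 06:13:00,000 [INFO] blabla:123 #blabla:123 - previous status was [ERROR]
EOF

# -F treats the pattern as a plain string, so no escaping is needed.
fixed=$(grep -c -F '[ERROR]' "$log")
# Anchored ERE: two non-space fields, then the tag as the third field.
anchored=$(grep -c -E '^[^ ]+ [^ ]+ [[]ERROR[]]' "$log")

echo "fixed=$fixed anchored=$anchored"
rm -f "$log"
```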

Related

grep and cut a specific pattern [duplicate]

Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like:
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?
Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
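To make the effect of the two options concrete, here is a small sketch that recreates the three example files from the question in a scratch directory. It assumes a grep that supports -o and \w (GNU grep does); the variable names are made up.

```shell
# Recreate the three example files from the question in a scratch directory.
dir=$(mktemp -d)
printf 'the cat sat on the mat\n'             > "$dir/some-text-file"
printf 'the quick brown fox\n'                > "$dir/some-other-text-file"
printf 'i hope this explains it thoroughly\n' > "$dir/yet-another-text-file"

# -o prints each match on its own line; -h drops the "filename:" prefix.
words=$(cd "$dir" && grep -oh '\w*th\w*' *)
# Without -h every match is prefixed with the file it came from.
with_names=$(cd "$dir" && grep -o '\w*th\w*' *)

printf '%s\n' "$words"
printf '%s\n' "$with_names"
rm -rf "$dir"
```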
Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using an older version of grep (like 2.4.2) that does not include the -o option, use the above. Otherwise use the simpler version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh outputs only the text matched by the regular expression (without the filename), just as you would expect a regular expression match to behave in vim and similar tools. Which word or regular expression you search for is up to you, as long as you stick to POSIX syntax rather than Perl syntax (see below).
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
The availability of \w varies from platform to platform, as it's an extension borrowed from Perl syntax. As such, grep installations that are limited to POSIX character classes need [[:alpha:]] rather than \w (note that \w also matches digits and the underscore, so the two are not exactly equivalent). See the Wikipedia page on regular expressions for more.
Ultimately, the POSIX answer above will be a lot more reliable for grep regardless of platform (POSIX being the original syntax).
As for supporting grep without the -o option: the first grep outputs the relevant lines, tr turns the spaces into newlines, and the final grep keeps only the resulting lines that match.
(PS: I know most platforms by now would have been patched for \w.... but there are always those that lag behind)
Credit for the "-o" workaround goes to @AdamRosenfield's answer.
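As a rough check that the tr-based fallback and the -o version agree, the following sketch runs both on the same input. The sample text and variable names are assumptions; the pattern is the POSIX one from the answer.

```shell
# Small made-up input: one line with matches, one without.
file=$(mktemp)
printf 'i hope this explains it thoroughly\nno match here\n' > "$file"

pattern='[[:alpha:]]*th[[:alpha:]]*'

# Version for greps that support -o.
with_o=$(grep -oh "$pattern" "$file")
# Fallback: keep matching lines, split on spaces, keep matching tokens.
fallback=$(grep -h "$pattern" "$file" | tr ' ' '\n' | grep -h "$pattern")

printf '%s\n' "$with_o"
rm -f "$file"
```

The two only agree when words are separated by single spaces and carry no punctuation: the fallback prints whole space-separated tokens, so a token like "this," would keep its comma.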
It's simpler than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where,
egrep: grep with extended regular expressions (same as grep -E).
w : Match only whole words instead of substrings.
o : Display only the matched part instead of the whole line.
i : Ignore case.
You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th
Just awk; no need for a combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly
grep with -o (only matching) and -P (Perl-compatible regex):
grep -o -P 'th.*? ' filename
I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010
cat *-text-file | grep -Eio "th[a-z]+"
You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple
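The listing above can be reproduced with a short sketch. The hyphen is not a word character, which is why -w still accepts apple-, apple-fruit and fruit-apple. The -x option (match whole lines only) is not mentioned in the original answer; it is added here as one possible way to get just the exact line.

```shell
# Same fruit list as in the Wikipedia example.
list=$(mktemp)
printf '%s\n' apple apples pineapple apple- apple-fruit fruit-apple > "$list"

# -w: "apple" must be delimited by non-word characters (a hyphen qualifies).
word_matches=$(grep -c -w apple "$list")
# -x: the whole line must be exactly "apple".
line_matches=$(grep -x apple "$list")

echo "word_matches=$word_matches line_matches=$line_matches"
rm -f "$list"
```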
I had a similar problem: I was looking for a grep regex that gives only the matched pattern as output.
In the end I used egrep with the -o option (the same regex with grep -e or -G didn't give me the same result as egrep),
so I think it could be something similar to this (I'm NOT a regex master):
egrep -o "the*|this{1}|thoroughly{1}" filename
To search for all words that start with "icon-", the following command works perfectly. I am using ack here, which is similar to grep but with better options and nicer formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq
You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'
grep --color -o -E "Begin.{0,}?End" file.txt
? - Match as few as possible until the End
Tested on macos terminal
$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.
ripgrep
Here is an example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It matches all words containing th.

How to grep for a matching word, not the surrounding line, with a wildcard?

Maybe an odd question, but I'm attempting to grep the output of a command to select just the matching word and not the line. This word also has a wildcard in it.
git log --format=%aD <file> | tail -1 | grep -oh 201
The first and second sections of the command check the git log for a file and grabs the line pertaining to the date and time of creation. I'm attempting to write a bash script that does something with the year it was created, so I need to grab just that one word (the year).
Looking at the grep documentation, -o specifically prints the matching word (and -h suppresses filenames). I can't find anything that allows for matching the rest of the word that it's matching, though (I could just be spacing).
So the output of that previous command is:
201
And I need it to be (as an example):
2017
Help would be much appreciated!
You can use . as a wildcard character:
$ echo 'before2017after' | grep -o '201.'
2017
Or, better yet, specify that the fourth character be a digit:
$ echo 'before2017after' | grep -o '201[[:digit:]]'
2017
Notes:
Since you are getting input from stdin, there are no filenames. Consequently, in this case, -h changes nothing.
[[:digit:]] is a unicode-safe way of specifying a digit.
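Since the goal is to use the year in a script, here is a minimal sketch of capturing it in a variable. The date string is a hypothetical stand-in for the output of the git log pipeline, and the second variant (any four-digit run, first hit only) is an assumption about what a more general version could look like.

```shell
# Hypothetical stand-in for: git log --format=%aD <file> | tail -1
date_line='Sat, 15 Jul 2017 06:12:45 +0200'

# Year in the 2010s, as in the answer.
year=$(printf '%s\n' "$date_line" | grep -o '201[[:digit:]]')

# More general: first run of four digits. head is needed because the
# timezone offset (+0200) is also four digits long.
any_year=$(printf '%s\n' "$date_line" | grep -o '[[:digit:]]\{4\}' | head -n 1)

echo "created in $year"
```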

Do not print unmatched text with sed

I want to print only matched lines and strip unmatched ones, but with following:
$ echo test12 test | sed -n 's/^.*12/**/p'
I always get:
** test
instead of:
**
What am I doing wrong?
[edit1]
I provide more information of what I need - and actually I should start with it. So, I have a command which produced lots of lines of output, I want to grab only parts of the lines - the ones that matches, and strip the result. So in the above example 12 was meant to find end of matched part of the line, and instead of ** I should have put & which represents matched string. So the full example is:
echo test12 test | sed -n 's/^.*12/&/p'
which produces exactly the same output as input:
test12 test
the expected output is:
test12
As suggested I started to find a grep alternative and the following looks promising:
$ echo test12 test | grep -Eo "^.*12"
but I don't see how to format the matched part; this only strips the unmatched text.
EDIT: In some cases the -E flag might be needed for sed; in that case the parentheses no longer need to be escaped. Check your sed's man page.
I think what you are looking for is this:
echo test12 test | sed -n 's/^\(.*12\).*$/\1/p'
If you want to discard the rest of the line, you have to match it as well but leave it out of the replacement. The \( and \) delimit a group that is then referenced by \1.
Good luck :)
Additional information on sed:
sed works on lines, and the ampersand character represents the part of the line that was matched by the given regular expression, not the whole line. Text outside the match is left untouched, so if the regex is "open" at the end (i.e. it doesn't extend to the end of the line with .*$), the rest of the line is printed unchanged. That is why the remainder has to be matched explicitly in order to discard it.
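The capture-group substitution can be tried out with a short sketch like the one below. It shows the basic-regex form from the answer, the same thing with -E (assuming a sed that supports that flag, as GNU and BSD sed do), and a variant that decorates the match, since the question also asked about formatting it. The variable names are made up.

```shell
input='test12 test'

# Basic regex: escaped parentheses form the group, \1 refers back to it.
matched=$(printf '%s\n' "$input" | sed -n 's/^\(.*12\).*$/\1/p')

# Extended regex (-E): the parentheses are written without backslashes.
matched_ere=$(printf '%s\n' "$input" | sed -n -E 's/^(.*12).*$/\1/p')

# The replacement can add text around the group to format the result.
decorated=$(printf '%s\n' "$input" | sed -n 's/^\(.*12\).*$/**\1**/p')

printf '%s\n' "$matched" "$matched_ere" "$decorated"
```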
Try:
echo test12 test | sed -n 's/^.*/**/p'
You don't need to match the number 12, since that is already being done in your regex.
Your regular expression matches everything from the beginning of the line up to '12'. The whole match is replaced with '**', which is why you get '** test'. If you only want the match, I recommend using grep.

sed explanation so I can recreate a bit of code?

Can someone please explain the following sed command?
title=$(wget -q -O - https://twitter.com/intent/user?user_id=$ID | sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')
printf "%s\n" "$title"
I tried (and failed terribly) to recreate it because I thought I understood what was going on in the code. So I wrote (well, more modded) it to be the following:
data-user-id=$(wget -q -O - https://twitter.com/$Username | sed -n 's/^.*"data-user-id">\([^<]*\)<.*$/\1/p')
printf "%s\n" "$data-user-id"
Obviously it errored because the syntax is wrong or something. But I'm trying to understand what is going on so I can make my own variant of it.
P.S. I can't just use the API for this due to how everything needs to be configured.
Give a try to this:
wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {s/^.*data-screen-name=.'"${Username}"'".*data-user-id="\([0-9]*\)".*$/\1/Ip;q}'
128700677
data-user-id is present in several lines, so a line where data-screen-name matches the username has to be selected.
sed uses regular expressions; there are two good tutorials to start with:
Regular Expressions
Sed - An Introduction and Tutorial by Bruce Barnett
A different sed script with a different output:
Username="StackOverflow"
wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {p;q}'
data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677"
-n instructs sed to not print anything, except when p command is used.
. means any char.
* applies to the previous char in the regex and it means zero or any number of this char.
.* means zero or any number of any char.
/data-screen-name=.'"${Username}"'".*data-user-id=/ select lines which contains data-screen-name= and any one char (.) and StackOverflow and " char and zero or any number of any char (.*) and data-user-id=.
/I means ignore case.
{p;q} are commands executed when above regex is true.
p prints the current line.
q exits the sed script.
The first sed script at the top contains an additional s/regex/replacement/ to clean up the line.
The additional elements used:
^ means the start of the line.
\( ... \) are used to define a group.
"\([0-9]*\)" is a group made of only digits, surrounded by two " characters which are not part of the group. It is the first group found in the regex, so it can be referenced in the replacement part with \1.
Assuming the title of the page is "foo on Twitter", it extracts "foo" from it.
But use XMLStarlet instead, since it allows you to specify XPath to extract the data instead of having to poke around with regular expressions.
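To experiment with the sed expressions from this thread without hitting the network, they can be run against local strings. The HTML fragments below are made up (only the attribute layout is taken from the sample output above), and the GNU-specific I flag is left out to keep the sketch portable.

```shell
# Made-up HTML fragments standing in for the downloaded pages.
title_html='<html><head><title>foo on Twitter</title></head></html>'
profile_html='<div data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677">'
Username="StackOverflow"

# Keep only what sits between <title> and " on Twitter</title>".
title=$(printf '%s\n' "$title_html" |
  sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')

# Select the line for this user, print the captured digits, then quit.
user_id=$(printf '%s\n' "$profile_html" |
  sed -n '/data-screen-name=.'"$Username"'".*data-user-id=/ {
    s/^.*data-user-id="\([0-9]*\)".*$/\1/p
    q
  }')

printf '%s\n' "$title" "$user_id"
```

The result is stored in user_id rather than data-user-id because hyphens are not allowed in shell variable names, which is one reason the attempt in the question fails.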

Line numbering in Grep

I have this command using grep:
cat nastava.html | grep '<td>[A-Z a-z]*</td><td>[0-9/]*</td>' | sed 's/[ \t]*<td>\([A-Z a-z]*\)<\/td><td>\([0-9]\{1,3\}\)\/[0-9]\{2\}\([0-9]\{2\}\)<\/td>.*/\1 mi\3\2 /' |
sort | grep -n ".*" | sed -r 's/(.*):(.*)/\1. \2/' > studenti.txt
I don't understand the second line. sort is OK, and grep -n numbers the sorted list, but why do we use ".*" here? It won't work without it, and I don't understand why.
The grep is used purely for the side effect of the line numbering with the -n option here, so the main thing is really to use a regular expression which matches all the input lines. As such, .* is not very elegant -- ^ would work without scanning every line, and $ trivially matches every line as well. Since you know the input lines are not empty, thus contain at least one character, the simple regular expression . would work perfectly, too.
However, as the end goal is to perform line numbering, a better solution is to use a dedicated tool for this purpose.
... | sort | nl -ba -s '. '
The -ba option specifies to number all lines (the default is to only add a line number to non-empty lines; we know there are no empty lines, so it's not strictly necessary here, but it's good to know) and the -s option specifies the separator string to put after the number.
A possible minor complication is that the line number format is whitespace-padded, so in the end, this solution may not work for you if you specifically want unpadded numbers. (But a sed postprocessor to fix that up is a lot simpler than the postprocessor for grep you have now -- just sed 's/^ *//' will remove leading whitespace).
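Here is a small sketch comparing the nl approach with the grep -n trick on two made-up input lines. The sed 's/:/. /' step is a simplified stand-in for the postprocessor in the question, not the original command.

```shell
# Two invented lines in the format the pipeline produces.
lines=$(printf '%s\n' 'Marko mi15123' 'Ana mi1612')

# nl numbers all lines (-ba), uses ". " as separator, sed strips the padding.
numbered=$(printf '%s\n' "$lines" | sort | nl -ba -s '. ' | sed 's/^ *//')

# grep -n with a pattern that matches every line, then "N:" becomes "N. ".
grep_numbered=$(printf '%s\n' "$lines" | sort | grep -n '^' | sed 's/:/. /')

printf '%s\n' "$numbered"
```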
... As an aside, the ugly cat | grep | sed pipeline can be abbreviated to just
sed -n 's%[ \t]*<td>\([A-Z a-z]*\)</td><td>\([0-9]\{1,3\}\)/[0-9]\{2\}\([0-9]\{2\}\)</td>.*%\1 mi\3\2 %p' nastava.html
The cat was never necessary in the first place, and the sed script can easily be refactored to only print when a substitution was performed (your grep regular expression was not exactly equivalent to the one you have in the sed script but I assume that was the intent). Also, using a different separator avoids having to backslash the slashes.
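The single sed command can be checked against a couple of table rows. The rows below are invented (the real nastava.html is not shown), so this only illustrates how -n together with the p flag replaces the separate grep.

```shell
# Invented rows: a header row that should be skipped and one student row.
rows=$(mktemp)
cat > "$rows" <<'EOF'
<td>Ime</td><td>Indeks</td>
  <td>Ana Anic</td><td>12/2016</td>
EOF

# -n suppresses default output; p prints only lines where s succeeded.
converted=$(sed -n 's%[ \t]*<td>\([A-Z a-z]*\)</td><td>\([0-9]\{1,3\}\)/[0-9]\{2\}\([0-9]\{2\}\)</td>.*%\1 mi\3\2 %p' "$rows")

printf '[%s]\n' "$converted"
rm -f "$rows"
```

Note the trailing space in the result, which comes from the replacement text itself.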
... And of course, if nastava.html is your own web page, the whole process is umop apisdn. You should have the students' results in a machine-readable form, and generate a web page from that, rather than the other way around.
grep needs a regular expression to match. You can't run grep with no expression at all. If you want to number all the lines, just specify an expression that matches anything. I'd probably use ^ instead of .*.
