Extracting key word from a log line - linux

I have a log which got like this :
.....client connection.....remote=/xxx.xxx.xxx.xxx]].......
I need to extract all lines in the log which contain the above,and print just the ip after remote=.. This would be something in the pattern :
grep "client connection" xxx.log | sed -e ....

Using grep:
grep -oP '(?<=remote=/)[^\]]+' file
o is to extract only the pattern, instead of entire line.
P is to match perl like regex. In this case, we are using "negative look behind". It will try to match set of characters which is not "]" which is preceeded by remote=/

grep -oP 'client connection.*remote=/\K.*?(?=])' input
Prints anything between remote=/ and closest ] on the lines which contain client connection.
Or by using sed back referencing: Here the line is divided into three parts/groups which are later referred by \1 \2 or \3. Each group is enclosed by ( and ). Here IP address belongs to 2nd group, so whole line is replaced by 2nd group which is IP address.
sed -r '/client connection/ s_(^.*remote=/)(.*?)]](.*)_\2_g' input
Or using awk :
awk -F'/|]]' '/client connection/{print $2}' input

Try this:
grep 'client connection' test.txt | awk -F'[/\\]]' '{print $2}'
Test case
test.txt
---------
abcd
.....client connection.....remote=/10.20.30.40]].......
abcs
.....client connection.....remote=/11.20.30.40]].......
.....client connection.....remote=/12.20.30.40]].......
Result
10.20.30.40
11.20.30.40
12.20.30.40
Explanation
grep will shortlist the results to only lines matching client connection. awk uses -F flag for delimiter to split text. We ask awk to use / and ] delimiters to split text. In order to use more than one delimiter, we place the delimiters in [ and ]. For example, to split text by = and :, we'd do [=:].
However, in our case, one of the delimiters is ] since my intent is to extract IP specifically from /x.x.x.x] by spitting the text with / and ]. So we escape it ]. The IP is the 2nd item from the splitting.

A more robust way, improved over this answer would be to also use GNU grep in PCRE mode with -P for perl style regEx match, but matching both the patterns as suggested in the question.
grep -oP "client connection.*remote=/\K(\d{1,3}\.){3}\d{1,3}" file
10.20.30.40
11.20.30.40
12.20.30.40
Here, client connection.*remote matches both the patterns in the lines and extracts IP from the file. The \K is a PCRE syntax to ignore strings up to that point and print only the capture group following it.
(\d{1,3}\.){3}\d{1,3}
To match the IP i.e. 3 groups of digits separated by dots of length from 1 to 3 followed by 4th octet.

Related

How can I find the number of 8 letter words that do not contain the letter "e", using the grep command?

I want to find the number of 8 letter words that do not contain the letter "e" in a number of text files (*.txt). In the process I ran into two issues: my lack of understanding in quantifiers and how to exclude characters.
I'm quite new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i ".*[^e].*"
I need to include the cat command because it otherwise includes the names of the text files in the pipe. The second pipe is to have all the words in a list, and it works, but the last pipe was meant to find all the words that do not have the letter "e" in them, but doesn't seem to work. (I thought "." for no or any number of any character, followed by a character that is not an "e", and followed by another "." for no or any number of any character.)
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]"
This command works to find the words that contain 8 characters, but it is quite ineffective, because I have to repeat "[a-z]" 8 times. I thought it could also be "[a-z]{8}", but that doesn't seem to work.
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]" | grep -i ".*[^e].*"
So finally, this would be my best guess, however, the third pipe is ineffective and the last pipe doesn't work.
You may use this grep:
grep -hEiwo '[a-df-z]{8}' *.txt
Here:
[a-df-z]{8}: Matches all letters except e
-h: Don't print filename in output
-i: Ignore case search
-o: Print matches only
-w: Match complete words
In case you are ok with GNU awk and assuming that you want to print only the exact words and could be multiple matches in a line if this is the case one could try following.
awk -v IGNORECASE="1" '{for(i=1;i<=NF;i++){if($i~/^[a-df-z]{8}$/){print $i}}}' *.txt
OR without the use of IGNORCASE one could try:
awk '{for(i=1;i<=NF;i++){if(tolower($i)~/^[a-df-z]{8}$/){print $i}}}' *.txt
NOTE: Considering that you want exact matches of 8 letters only in lines. 8 letter words followed by a punctuation mark will be excluded.
Here is a crazy thought with GNU awk:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{c+=NF}END{print c}' file
Or if you want to make it work only on a select set of characters:
awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' file
What this does is, it defines the fields, to be a set of 8 characters (\w as a word-constituent or [a-df-z] as a selected set) which is enclosed by word-boundaries (\< and \>). This is done with FPAT (note the Gory details about escaping).
Sometimes you might also have words which contain diatrics, so you have to expand. Then this might be the best solution:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /e/) c++}END{print c}' file

How to use sed to replace a string that contains the slash?

I have a text file that contain a lot of mess text.
I used grep to get all the text that contains the string prod like this
cat textfile | grep "<host>prod*"
The result
<host>prod-reverse-proxy01</host>
<host>prod-reverse-proxy01</host>
<host>prod-reverse-proxy01</host>
Continually, i used sed with the intention to remove all the "host" part
cat textfile | grep "<host>prod*" | sed "s/<host>//g"; "s/</host>//g"
But only the first "host" was removed.
prod-reverse-proxy01</host>
prod-reverse-proxy01</host>
prod-reverse-proxy01</host>
How can i remove the other "/host" part?
sed -n -e "s/^<host>\(.*\)<\/host>/\1/p" textfile
sed can process your file directly. No need to grep or cat.
-n is there to suppress any lines that do not match. Last 'p' in the script will print all matching files.
Script dissection:
s/.../.../...
is the search/replace form. The bit between the first and the second '/' is what you search for. The bit between the second and third is what you replace it with. The last part is any commands you want to apply to the replacement.
Search:
^<host>\(.*\)<\/host>
finds all lines beginning with <host> followed by any text (.*) followed by </host>. Any text between <host> and </host> is stored into internal variable '1' using '(' and ')'. Note that (, ) and / (in </host>) have to be escaped.
Replace:
\1
Replace found text with contents of variable 1 (1 has to be escaped, otherwise, everything is replaced by character '1'.
Commands:
p
Print resulting line (after replacement).
Note: Your search involves removing two similar but not identical strings (<host> and </host>).
I think this sed is enough
sed 's/<[/]*host>//g' infile

How to extract a string from between two patterns in bash

Im begining with bash and I want to find my ip in a .txt file analyzing it.
This is an example of part of the file:
"Direc. inet:192.****** Difus.:"
The path I think on is searching all the text between "inet:" and " ". My biggest approach until now is getting the entire line with "grep inet:" but I can't figure out how to get just the ip not the entire line with the ip.
Thank you!
Perl to the rescue:
perl -ne 'print $1, "\n" if /inet:([^ ]+)/'
-n reads the input line by line;
[^ ] matches a character that isn't a space
+ means the character must be present one or more times
(...) creates a capture group, the first capture group is referenced as $1
Since you're on Linux, you can take advantage of GNU grep's -o and -P options:
grep -oP '(?<= inet:)[^ ]+' file.txt
Example:
$ grep -oP '(?<= inet:)[^ ]+' <<<'Direc. inet:192.****** Difus.:'
192.******
-o tells grep to only output the matching part(s) of each line.
-P activates support for PCREs (Perl-compatible regular expressions), which support look-behind assertions such as (?<= inet:), which allow a sub-expression (inet:, in this case) to participate in matching, without being captured (returned) as part of the matched string.
[^ ]+ then simply captures everything after inet: up to the first space char. (character set [^ ] matches any char. that is not (^) a space, and + matches this set 1 or more times).
Try combination of awk and grep. Below solution may help Link 1 .
Lin 2

Filter out only matched values from a text file in each line

I have a file "test.txt" with the lines below and also lot bunch of extra stuff after the "version"
soainfra_metrics{metric_group="sca_composite",partition="test",is_active="true",state="on",is_default="true",composite="test123"} map:stats version:1.0
soainfra_metrics{metric_group="sca_composite",partition="gello",is_active="true",state="on",is_default="true",composite="test234"} map:stats version:1.8
soainfra_metrics{metric_group="sca_composite",partition="bolo",is_active="true",state="on",is_default="true",composite="3415"} map:stats version:3.1
soainfra_metrics{metric_group="sca_composite",partition="solo",is_active="true",state="on",is_default="true",composite="hji"} map:stats version:1.1
I tried:
egrep -r 'partition|is_active|state|is_default|composite' test.txt
It's displaying every line, but I need only specific mentioned fields like this below,ignoring rest of the data/stuff or lines
in a nut shell, i want to display only these fields from a line not the rest
partition="test",is_active="true",state="on",is_default="true",composite="test123"
partition="gello",is_active="true",state="on",is_default="true",composite="test234"
partition="bolo",is_active="true",state="on",is_default="true",composite="3415"
partition="solo",is_active="true",state="on",is_default="true",composite="hji"
If your version of grep supports Perl-style regular expressions, then I'd use this:
grep -oP '.*?,\K[^}]+' file
It removes everything up to the first comma (\K kills any previous output) and prints everything up to the }.
Alternatively, using awk:
awk -F'}' '{ sub(/[^,]+,/, ""); print $1 }' file
This sets the field separator to } so the part you're interested in is the first field. It then uses sub to remove the part up to the first comma.
For completeness, you could also use sed:
sed 's/[^,]*,\([^}]*\).*/\1/' file
This captures the part after the first , up to the } and replaces the content of the line with it.
After the grep to pick out the lines you want, use sed to edit the lines:
sed 's/.*\(partition[^}]*\)} map.*/\1/'
This means: "whenever you see anything .*, followed by partition and
any number of non-}, then } map and anything else, grab the part
from partition up to but not including the brace \(...\) as group 1.
The replacement text is just group 1 \1.
Use a pipe | to connect the output of egrep to the input of sed:
egrep ... | sed ...
As far as i understood your file might have more lines you don't want to see, so i would use:
sed -n 's/.*\(partition.*\)}.*/\1/p' file
we use -n p to show only lines where we made substitution. The substitution part just gets the part of the line you need substituting the whole line with the pattern.
This might work for you (GNU sed):
sed -r 's/(partition|is_active|state|is_default|composite)="[^"]*"/\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1,/g;s/,$//' file
Treat the problem as if it were a "decomposed club sandwich". Identify the fillings, remove the bread and tidy up.

find words in two quotes unix

I would like to display the last word in these lines I tried to look for example the word value but no answer, so I thought to look for the words between quotes but my file contains other words between quotes that I have I need not actually want to display the values ​​of the select tag knowing that my html file is.
grep '*' hosts.html | awk '{print $NF}'
For example:
value='www.visit-tunisia.com'>www.visit-tunisia.com
value='www.watania1.tn'>www.watania1.tn
value='www.watania2.tn'>www.watania2.tn
I would have
www.visit-tunisia.com
www.watania1.tn
www.watania2.tn
You need to set the field separator to > you do this with the -F option:
$ awk -F'>' '{print $NF}' hosts.html
www.visit-tunisia.com
www.watania1.tn
www.watania2.tn
Note: I'm not sure what you are trying to achieve by grep '*' hosts.html?
Interpreting the comment liberally, you have input lines which might contain:
value='www.visit-tunisia.com'>www.visit-tunisia.com
value='www.watania1.tn'>www.watania1.tn
value='www.watania2.tn'>www.watania2.tn
and you would like the names which are repeated on a line as the output:
www.visit-tunisia.com
www.watania1.tn
www.watania2.tn
This can be done using sed and capturing parentheses.
sed -n -e "s/.*'\([^']*\)'.*\1.*/\1/p"
The -n says "don't print unless I say to do so". The s///p command prints if the substitute works. The pattern looks for a stream of 'anything' (.*), a single quote, captures what's inside up to the next single quote ('\([^']*\)') followed by any text, the captured text (the first \1), and anything. The replacement text is what was captured (the second \1).
Example:
$ cat data
www and wotnot
value='www.visit-tunisia.com'>www.visit-tunisia.com
blah
value='www.watania1.tn'>www.watania1.tn
hooplah
value='www.watania2.tn'>www.watania2.tn
if 'nothing' is required, nothing will be done.
$ sed -n -e "s/.*'\([^']*\)'.*\1.*/\1/p" data
www.visit-tunisia.com
www.watania1.tn
www.watania2.tn
nothing
$
Clearly, you can refine the [^']* part of the match if you want to. I used double quotes around the expression since the pattern matches on single quotes. Life is trickier if you need to allow both single and double quotes; at that point, I'd put the script into a file and run sed -f script data to make life easier.
sed 's/.*>\(.*\)/\1/g' your_file

Resources