Grep and replace text with special characters - Linux

I have 10K+ XML files, about half of which contain the following line I'd like to replace:
<process_review><![CDATA[<p>The Expert Review Panel has not reviewed this measure yet.</p>]]></process_review>
with this:
<process_review><![CDATA[<p>Not applicable.</p>]]></process_review>
I've been able to successfully grep for the affected files:
grep -rl '<process\_review><\!\[CDATA\[<p>The Expert Review Panel has not reviewed this measure yet\.<\/p>\]\]><\/process\_review>' ./
but I can't seem to replace the text with sed:
grep -rl '<process\_review><\!\[CDATA\[<p>The Expert Review Panel has not reviewed this measure yet\.<\/p>\]\]><\/process\_review>' ./ | xargs sed -i 's/<process\_review><\!\[CDATA\[<p>The Expert Review Panel has not reviewed this measure yet\.<\/p>\]\]><\/process\_review>/<process\_review><\!\[CDATA\[<p>Not applicable\.<\/p>\]\]><\/process\_review>/g'
Appreciate any help in advance.
edit: These XMLs are in a git repo. Is there any risk of corrupting the repo?

Hmm, according to my man page, sed's -i option expects to be followed by an extension (possibly zero-length), and a command should be introduced with -e if it is not the only command parameter.
So here I would use:
grep -rl '<process\_review><\!\[CDATA\[<p>The Expert Review Panel has not reviewed this measure yet\.<\/p>\]\]><\/process\_review>' ./ | xargs sed -i '' -e 's/<process\_review><\!\[CDATA\[<p>The Expert Review Panel has not reviewed this measure yet\.<\/p>\]\]><\/process\_review>/<process\_review><\!\[CDATA\[<p>Not applicable\.<\/p>\]\]><\/process\_review>/g'
Beware, I have not reviewed the (long...) sed s/.../.../ command...
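For what it's worth, the -i behaviour differs between the two common sed families, which is usually what bites here. A minimal sketch (old/new stand in for the long patterns above, and file.xml is just a placeholder name):

# BSD/macOS sed: -i takes a separate, possibly empty, backup-suffix argument
sed -i '' -e 's/old/new/g' file.xml
# GNU sed (usual on Linux): the suffix must be attached to -i, or left off entirely
sed -i 's/old/new/g' file.xml
sed -i.bak 's/old/new/g' file.xml   # keeps file.xml.bak as a backup

Since the pattern is full of slashes, using another s-command delimiter such as | would also remove the need for all the \/ escaping. As for the git edit: sed -i only rewrites files in the working tree and never touches anything under .git, so the repository itself cannot be corrupted; git diff lets you review the result and git checkout lets you undo it.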

Use grep and get text after the pattern

I need to get the IP from a log. I have to grep for the true-client-ip field, e.g. true-client-ip=[191.168.171.15], and get just the IP.
2019.02.14-08:26:06:713,asd:1234:chan,0.000,asd,S,request-begin-site,POST,{remoteHost=1.2.3.4,remoteAddr=1.2.3.4,requestType=POST,serverName=api=[text/html],accept-charset=[iso-12345-15, utf-8;q=0.5, *;q=0.5],accept-encoding=[gzip],server-origin=[5],cache-control=[no-cache, max-age=0],pragma=[no-cache],program-header=[true],te=[chunked;q=1.0],true-client-ip=[191.168.171.15],true-host=[www.server.com]
I was trying grep -o "true-client-ip=[^ ]*," but it gives me:
true-client-ip=[191.168.171.15],true-host=[www.server.com]
I need just true-client-ip=[191.168.171.15] so that I can afterwards extract the IP with a cut, like true-client-ip=[191.168.171.15] | cut -d= -f2.
Using the grep -P flag, if available:
grep -oP 'true-client-ip=\[\K[^]]*'
Perl's \K meta-character discards everything matched before it from the displayed result, so the expression still matches the "true-client-ip=[" part but only the IP is printed.
If grep -P isn't available, I would use sed:
sed -nE 's/.*true-client-ip=\[([^]]*).*/\1/p'
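For instance, against the sample log line above (saved here in a hypothetical file named access.log), both commands print just the IP:

$ grep -oP 'true-client-ip=\[\K[^]]*' access.log
191.168.171.15
$ sed -nE 's/.*true-client-ip=\[([^]]*).*/\1/p' access.log
191.168.171.15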
If you have GNU grep, you can do it like this:
$ grep -oP "(?<=true-client-ip=\[)[^\]]*" file
191.168.171.15
The (?<=...) construct is called a positive lookbehind: it requires "true-client-ip=[" to appear immediately before the match without including it in the output.
The backslash in [^\]] is actually unnecessary; I just felt like adding it to make the character class more intuitive and less misleading :-).

Optimizing search in Linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on the number of times certain things are logged.
I need to find the number of times StringA, StringB, and StringC each appear.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process, and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized or not. Is it possible to run one grep command and find out how many times StringA, StringB, and StringC have each been called?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
grep can't report separate counts for several strings in a single run. You can use awk:
out=$(awk '/StringA/{a++} /StringB/{b++} /StringC/{c++} END{print a+0, b+0, c+0}' server.log)
Then you can extract each count with a simple bash array (the +0 forces counters that were never incremented to print as 0, which keeps the array indices aligned):
arr=($out)
echo "StringA="${arr[0]}
echo "StringB="${arr[1]}
echo "StringC="${arr[2]}
This (grep -c instead of piping to wc) is certainly going to be faster, and the awk solution is possibly faster still, but I haven't measured either.
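Note that grep -c and the awk script above both count matching lines, not occurrences. If a line can contain a string several times and you need the total number of hits, a one-pass awk variant (a sketch, relying on the fact that gsub returns how many replacements it made) would be:

awk '{a += gsub(/StringA/, "&"); b += gsub(/StringB/, "&"); c += gsub(/StringC/, "&")} END {print "StringA=" a, "StringB=" b, "StringC=" c}' server.log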
Certainly this approach could be optimized, since grep doesn't perform any text indexing; I would use a text indexing engine. Also, you may consider using journald from systemd, which stores logs in a structured and indexed format so lookups are more efficient.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as an awk in large file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions, since fgrep is optimized for fixed strings.
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows a | (vertical bar/pipe) character between two or more separate search strings, so you can find multiple strings in one statement. You can use grep -E to do the same thing.
Full documentation is in the grep man page, and information about the Extended Regular Expressions that egrep uses is in man 7 re_format.
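Combining this with the grep -c suggestion from the first answer drops the wc processes as well; the same temp-file idea, just counting directly:

egrep "StringA|StringB|StringC" server.log > tmp.log
grep -c "StringA" tmp.log
grep -c "StringB" tmp.log
grep -c "StringC" tmp.log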

grep 2 words in if statements in Bash

I am trying to see if my nohup file contains the words I am looking for. If it does, then I need to put those lines into a tmp file.
So I am currently using:
if grep -q "Started|missing" $DIR3/$dirName/nohup.out
then
grep -E "Started|missing" "$DIR3/$dirName/nohup.out" > tmp
fi
But it never goes into the if statement even if there are words that I am looking for.
How can I fix this?
Since basic grep uses BRE, the regex alternation operator is written \|, while a plain | matches a literal | symbol. And you don't need to touch the | in the grep -E line, which uses ERE.
if grep -q "Started\|missing" $DIR3/$dirName/nohup.out
You should use egrep instead of grep (Avinash Raj has explained that in other words already in his answer).
I would generally recommend using egrep as a default for everyday use (even though many expressions only require basic regular expression syntax). From a practical point of view, standard grep is mainly interesting for performance reasons.
Details about the advantages of grep vs. egrep can be found in a related Super User question.
Since you only put the grep results into the tmp file, you do not want to grep the file twice.
You cannot use
egrep "Started|missing" $DIR3/$dirName/nohup.out > tmp
since that would create an empty tmp file when nothing is found.
You can remove an empty file with a test like if [ ! -s tmp ], or use another solution:
Redirecting the grep results without grepping again can be done with
rm -f tmp 2>/dev/null
egrep "Started|missing" $DIR3/$dirName/nohup.out | while read -r strange_line; do
echo "${strange_line}" >> tmp
done
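A more compact sketch of the same idea uses grep's exit status directly, so the file is only scanned once and an empty tmp never lingers:

if grep -E "Started|missing" "$DIR3/$dirName/nohup.out" > tmp; then
    : # tmp now holds the matching lines
else
    rm -f tmp   # no match: the redirection created an empty tmp, remove it
fi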

Extract multiple substrings in bash

I have a page exported from a wiki and I would like to find all the links on that page using bash. All the links on that page are in the form [wiki:<page_name>]. I have a script that does:
...
# First search for the links to the pages
search=`grep '\[wiki:' pages/*`
# Check is our search turned up anything
if [ -n "$search" ]; then
# Now, we want to cut out the page name and find unique listings
uniquePages=`echo "$search" | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d':' -f2 | cut -d' ' -f 1 | sort -u`
....
However, when presented with a grep result that has multiple [wiki: entries in it, it only pulls the last one and not any of the others. For example, if $search is:
Before starting the configuration, all the required libraries must be installed to be detected by Cmake. If you have missed this step, see the [wiki:CT/Checklist/Libraries "Libr By pressing [t] you can switch to advanced mode screen with more details. The 5 pages are available [wiki:CT/Checklist/Cmake/advanced_mode here]. To obtain information about ea - '''Installation of Cantera''': If Cantera has not been correctly installed or if you do not have sourced the setup file '''~/setup_cantera''' you should receive the following message. Refer to the [wiki:CT/FormulationCantera "Cantera installation"] page to fix this problem. You can set the Cantera options to OFF if you plan to use built-in transport, thermodynamics and chemistry.
then it only returns CT/FormulationCantera and doesn't give me any of the other links. I know this is due to using cut, so I need a replacement for the $uniquePages line.
Does anybody have any suggestions in bash? It can use sed or perl if needed, but I'm hoping for a one-liner to extract a list of page names if at all possible.
egrep -oh '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//' | sort -u
(-h stops grep from prefixing each match with its file name, which it would otherwise do once pages/* expands to more than one file)
Update: to remove everything after a space, without cut:
egrep -oh '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u
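Based on the sample text as shown in the question (assuming it lives in a file under pages/), the second pipeline boils the three links down to:

$ egrep -oh '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u
CT/Checklist/Cmake/advanced_mode
CT/Checklist/Libraries
CT/FormulationCantera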

Using alias to shorten my command?

I am used to searching specific keywords under Linux.
For example, I may search "TAIWAN" under home by
grep -i -r TAIWAN ./ | grep -v ".svn"
However, this feels a little redundant; I want to use an alias so that I can type
grep TAIWAN
and then the alias will expand my command into
grep -i -r TAIWAN ./ | grep -v ".svn"
How could I achieve this?
You won't be able to do it with an alias, but you can create a bash function:
mygrep () { grep -i -r "$@" ./ | grep -v '\.svn'; }
I don't believe an alias can accomplish what you want because there is no way to reorder the arguments. It is simpler to make a small shell command. Rather than replacing grep, and thus possibly messing up programs which expect grep to behave in a certain way, I'd give it a new name such as rgrep.
#!/bin/sh
grep -i -r "$#" | grep -v .svn
Put that somewhere in your PATH such as ~/bin and make it executable with chmod +x ~/bin/rgrep. Then you can rgrep TAIWAN ..
Unfortunately, this will ignore lines which contain .svn as well as files.
You could try to fix up the grep -v pattern match to only match the file part of the grep output, or you could whip up a more complicated command using find instead of grep -r... or you can use ack.
Ack is a better grep and will avoid version control directories and other common files and directories you don't care about. It will also automatically use a pager and color the output.
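Alternatively, a reasonably recent GNU grep can exclude directories by itself, which avoids both the second grep and the line-content false positives mentioned above:

grep -i -r --exclude-dir=.svn TAIWAN .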
