Replacing html content using sed and regex - linux

I am trying to replace some HTML content using sed in a bash script. For some reason nothing gets replaced, and the regex part seems to be the problem.
The HTML I want to replace:
<h3 class="indicate-hover css-5fzt5q">For the Most Complex Heroines Animation
<h3 class="indicate-hover css-1pvrrwb">The Psychology Behind Sibling
to
head For the Most Complex Heroines Animation
head The Psychology Behind Sibling
I used:
sed -e 's/<h3 class="indicate-hover css-([a-b0-9]+)">/head/g'
The ([a-b0-9]+) part is what fails, so I must be missing something. I also want the match to be more specific: I have "<p class="summary-class css-1azn4ub">How many words can", which I want to replace with 'tail ', and there are many other tags besides. The regex part is what gives me trouble.

Using sed
$ sed 's/.*-[[:alnum:]]\+">/head /' input_file
Output
head For the Most Complex Heroines Animation
head The Psychology Behind Sibling

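You need to use \+ unless you run sed -E:
\+ is the one-or-more quantifier in GNU sed's (default) basic regular expressions
+ is the one-or-more quantifier in extended regular expressions
Also note that the range a-b matches only the letters a and b; [a-z0-9] or [[:alnum:]] covers the class suffixes in your sample.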
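To keep the match specific to each tag, as the question asks, one option is a separate expression per tag with sed -E. This is a minimal sketch that assumes the class suffix is always lowercase letters and digits, as in the sample lines; the variable names are only for illustration.

```shell
# Sample input copied from the question (two <h3> lines and the <p> line).
input='<h3 class="indicate-hover css-5fzt5q">For the Most Complex Heroines Animation
<h3 class="indicate-hover css-1pvrrwb">The Psychology Behind Sibling
<p class="summary-class css-1azn4ub">How many words can'

# -E switches to extended regular expressions, so + needs no backslash.
# Each -e handles one tag, so unrelated tags are left untouched.
result=$(printf '%s\n' "$input" | sed -E \
  -e 's/<h3 class="indicate-hover css-[a-z0-9]+">/head /' \
  -e 's/<p class="summary-class css-[a-z0-9]+">/tail /')

printf '%s\n' "$result"
```

Further tags can be handled by adding one more -e expression per tag.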

Related

How to capitalize and replace characters in shell script in one echo

I am trying to find a way to capitalize and replace dashes of a string in one echo. I do not have the ability to use multiple lines for reassigning the string value.
For example:
string='test-e2e-uber' needs to echo $string as TEST_E2E_UBER
I currently can do one or the other by utilizing
${string^^} for capitalization
${string//-/_} for replacement
However, when I try to combine them it does not appear to work (bad substitution error).
Is there a correct syntax to achieve this?
echo ${string^^//-/_}
This does not directly answer your question, but the following script achieves what you want:
declare -u string='test-e2e-uber'
echo ${string//-/_}
You can do that directly with the 'tr' command, in just one 'echo'
echo "$string" | tr "-" "_" | tr "[:lower:]" "[:upper:]"
TEST_E2E_UBER
I piped two 'tr' calls here, one per conversion; 'tr' can also do both in a single call when the two sets line up, e.g. tr 'a-z-' 'A-Z_'.
Or you could do something similar with 'awk':
echo "$string" | awk '{gsub("-","_",$0)} {print toupper($0)}'
TEST_E2E_UBER
In this case, 'gsub' replaces the hyphens, then the whole record is printed in uppercase.
Why do you dislike having two successive assignment statements so much? If you really hate it, you will have to resort to an external program to do the task for you, such as
string=$(tr a-z- A-Z_ <<<"$string")
but I would consider it a waste of resources to create a child process for such a simple operation.
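Since bash parameter expansions cannot be nested, another way to keep the call site to a single echo is to hide the two expansions in a small function. This is only a sketch: the function name to_upper_snake is made up for illustration, and ${var^^} assumes bash 4 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace dashes with underscores, then uppercase.
# Requires bash 4+ for the ^^ expansion.
to_upper_snake() {
  local s=${1//-/_}        # first expansion: every - becomes _
  printf '%s\n' "${s^^}"   # second expansion: uppercase the result
}

string='test-e2e-uber'
result=$(to_upper_snake "$string")
echo "$result"
```

The command substitution still creates a subshell, but no external program is started.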

Finding and replacing text within a file

I have a large taxonomy file that I need to edit. There is an issue with the file as "Candida" is listed as both Candida and [Candida]. What I want to do is change every case of [Candida] to Candida within the file.
I have tried doing this several ways but never get the output I am after. These are the first few lines of the taxonomy file:
Penicillium;marneffei;NW_002197112.1
Penicillium;marneffei;NW_002197111.1
Penicillium;marneffei;NW_002197110.1
Penicillium;marneffei;NW_002197109.1
Penicillium;marneffei;NW_002197108.1
Using sed gives me this output:
$ sed -i -e 's/[Candida]/Candida/g' Full_HMS_Taxonomy.txt
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197112.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197111.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197110.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197109.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197108.1
Using awk gives me this output:
$ awk '{gsub(/[Candida]/,"Candida")}1' Full_HMS_Taxonomy.txt
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197112.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197111.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197110.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197109.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197108.1
In both cases it is adding Candida to multiple places and multiple lines, instead of just replacing each instance of [Candida]. Any ideas on what I am doing wrong?
Square brackets are special characters in regular expressions, so you need to escape them like this:
's/\[Candida\]/Candida/g'
Brackets are treated specially by regular expression parsers: a bracket expression matches any single character listed inside it. So [Candida] matches any one of the characters C, a, n, d or i, wherever it occurs. That's why you get so many substitutions.
You need to tell those utilities that you want literal brackets by escaping them with backslashes, e.g. with sed:
sed -i 's/\[Candida\]/Candida/g' Full_HMS_Taxonomy.txt
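Before editing the real file in place, it can help to try the escaped pattern on a few lines first. In the sketch below the two Candida lines and their EXAMPLE_ identifiers are made up for illustration; only the Penicillium line comes from the question.

```shell
# Made-up sample: the first two lines are hypothetical, the third is from the question.
sample='[Candida];albicans;EXAMPLE_1
Candida;glabrata;EXAMPLE_2
Penicillium;marneffei;NW_002197112.1'

# \[ and \] match literal brackets instead of opening a bracket expression.
result=$(printf '%s\n' "$sample" | sed 's/\[Candida\]/Candida/g')

printf '%s\n' "$result"
```

Only the bracketed form changes; the other lines pass through unmodified.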

Using sed - how to replace two HTML tags or patterns with unknown content in-between?

I want to leave the unknown content between tags intact, but want to match all tags that use:
<div class="section1-title">arbitrary content here</div>
and replace the surrounding tags with:
<h2>arbitrary content here</h2>
I've come up with the following, but obviously it's not working as in the second part it's literally substituting "].*[<]/h2[>]" for each match found.
sed -i 's/[<]div class=\"section1-title\"[>].*[<]\/div[>]/<h2[>].*[<]\/h2[>]/g'
Specifically, I'd like to know how to leave that middle content intact, no matter what is in there, and just replace the surrounding tags. There are quite a few other elements that end with </div>, so I can't just search and replace the opening and closing tags separately. The first part of the sed statement does seem to match the right content as far as I can tell; it's mostly the second part that I'm unsure of.
What you need is a backref.
bash-3.2$ sed 's/<div class=\"section1-title\">\(.*\)<\/div>/<h2>\1<\/h2>/g' <<< '<div class="section1-title">arbitrary content here</div>'
<h2>arbitrary content here</h2>
The parentheses around your content - \(.*\) - allow it to be referenced later as is with the \1.
See: https://www.regular-expressions.info/backref.html
See also the question ".bash_profile sed: \1 not defined in the RE" for an explanation of why the parentheses must be escaped in your regex.
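One thing to watch with \(.*\) is that .* is greedy, so two such divs on the same line would be captured as a single match. A possible variation, assuming the content between the tags never contains a < character, is to capture [^<]* instead. The sample line here is invented for illustration.

```shell
# Invented sample with two matching divs on one line.
html='<div class="section1-title">First</div><div class="section1-title">Second</div>'

# -E: unescaped ( ) form the capture group; # as delimiter avoids escaping the / in </div>.
# [^<]* stops at the next tag, so each div is matched on its own.
result=$(printf '%s\n' "$html" \
  | sed -E 's#<div class="section1-title">([^<]*)</div>#<h2>\1</h2>#g')

printf '%s\n' "$result"
```

If the titles can contain nested tags, this pattern will skip them, and an HTML-aware tool is likely a better fit than sed.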

I am trying to replace a text for example

Example:
"word" -nothing
To
word" - nothing
in gvim.
I tried
:%s/^.*\"/
But what I get is: -nothing
Well I am new to scripting so I would like to know if it can be done in any other way like using gvim or awk or sed.
In Vim: capture \(word character + quote + space + hyphen\) as the first group, followed directly by another \(word character\) as the second group, and replace the match with the first group + space + second group. The g flag lets the substitution happen multiple times on a line.
:%s/\(\w" -\)\(\w\)/\1 \2/g
Note that I left out the leading quote: the quoted text might contain spaces, and I think this form copes better with that. The really nice thing about the *nix tools is that they all use similar (or the same) regular expression language, so the exact same pattern works in sed (using : as the delimiter for clarity).
sed 's:\(\w" -\)\(\w\):\1 \2:g'
Standard awk's sub and gsub don't support backreferences in the replacement (GNU awk's gensub does), so while it can be done, it is less convenient.
Could you try the following and let me know if it helps?
awk '{sub(/^"/,"");sub(/-/,"- ")} 1' Input_file
Second solution, with sed:
sed 's/^"//;s/-/- /' Input_file
Since you also tagged grep: GNU grep has the -P switch for PCRE (Perl-compatible regular expressions), which supports \K. Whatever was matched to the left of \K is kept out of the reported match, so:
$ echo \"word\" | grep -oP "\"\Kword\""
word"
If I understand your question correctly, you want to replace the first " in each line with an empty string. In sed that is just:
sed 's/"//'
Without g flag it will replace only first occurrence in each line.
EDIT:
The same way it will work in Vim (unless you have 'gdefault' option set), so in Vim you can:
:%s/"//
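Try this (in Vim's default "magic" mode the capture group must be written \( \), and the quotes need no escaping):
:%s/"\(.*\)"/\1"/gc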
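For a scripted version outside the editor, both changes (dropping the leading quote and adding a space after the hyphen) can be combined in one sed substitution. This sketch assumes each line starts with a quoted word followed by a space and a hyphen, as in the example.

```shell
line='"word" -nothing'

# ^" matches the leading quote, which is dropped.
# ([^"]*") captures the word plus its closing quote as \1.
# The literal " -" is then rewritten as " - ".
result=$(printf '%s\n' "$line" | sed -E 's/^"([^"]*") -/\1 - /')

printf '%s\n' "$result"
```

Lines that do not start with a double quote are printed unchanged.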

How do I substring piped output from grep in Linux?

I'm trying to write a script to login to a Drupal website automagically to put it into maintenance mode. Here's what I have so far, and the grep gives me back the line I want.
curl http://www.drupalwebsite.org/?q=user | grep '<input type="hidden" name="form_build_id" id="form-[a-zA-Z0-9]*" value="form-[a-zA-Z0-9]*" />'
Now I'm kind of a Linux newbie, and I'm using Cygwin with BASH. How would I then pipe the output and use a command to get the value of the id attribute from the output that grep generated? I'll be using this substring later to do another curl request to actually submit the login.
I was looking at using expr, but I don't really understand how I would tell expr "oh hey, this is the stdin data I want you to manipulate in this way". It seems like the only way to do it would be to save the grep output in a variable and then feed the variable to expr.
Use sed to trim the result you get from grep, i.e.:
Edit: added the myID variable; use any name you like.
myID=$(
curl 'http://www.drupalwebsite.org/?q=user' \
| grep '<input type="hidden" name="form_build_id" id="form-[a-zA-Z0-9]*" value="form-[a-zA-Z0-9]*" />' \
| sed 's/^.* id="//;s/" value=.*$//'
)
# use ${myID} later in the script
printf 'myID=%s\n' "${myID}"
The first part removes the front of the string, everything up to and including id=", while the second part removes everything from " value= to the end of the line.
Note that you can chain together multiple sub-replace actions in sed by separating them with the ';'.
edit2
Also, once you're using sed, there's no reason to use grep, try this:
myID=$(
curl 'http://www.drupalwebsite.org/?q=user' \
| sed -n '\#<input type="hidden" name="form_build_id" id="form-[a-zA-Z0-9]*" value="form-[a-zA-Z0-9]*" />#{
s#^.* id="##
s#" value=.*$##p
}'
)
(It's a good habit to remove unnecessary processes. It may not matter in this case, but if you end up writing code that is executed thousands of times an hour, an extra grep you don't need creates thousands of processes that never had to exist.)
The < and > characters are literal in sed's basic regular expressions, so they need no escaping. In GNU sed, \< and \> are word-boundary anchors, so adding backslashes would change the meaning. If in doubt, [<] and [>] always match the literal characters.
I'm using '#' as the delimiter now to avoid having to escape any '/' chars in the search target, and I keep using it throughout the example for consistency. When a custom delimiter opens an address, as in the first line of the sed script, it must be introduced with a backslash, hence the leading \#. The s command accepts any delimiter directly, so s# needs no backslash.
The -n means "don't default print each line of input", and because of that, we have to add the 'p' at the end, which means print the current buffer.
Finally, I'm not sure about your regular expression, particularly -[a-zA-Z0-9]*: the * means zero or more of the previous character (or character class, in this case). People who want at least one alphanumeric typically write -[a-zA-Z0-9][a-zA-Z0-9]* or [[:alnum:]][[:alnum:]]*, but I don't know your data well enough to say for sure.
I hope this helps.
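The same idea can be tried offline with a capture group instead of two separate deletions. This is a sketch under assumptions: the page content and the form id form-abc123 are invented stand-ins for what curl would return, and \{1,\} is used to require at least one alphanumeric character.

```shell
# Invented stand-in for the downloaded page; form-abc123 is a hypothetical id.
page='<html>
<input type="hidden" name="form_build_id" id="form-abc123" value="form-abc123" />
</html>'

# -n suppresses default output; the p flag prints only lines where the substitution succeeded.
# \( \) captures the id value and \1 replaces the whole line with it.
form_id=$(printf '%s\n' "$page" \
  | sed -n 's#^.*<input type="hidden" name="form_build_id" id="\(form-[a-zA-Z0-9]\{1,\}\)".*$#\1#p')

printf 'form_id=%s\n' "$form_id"
```

In the real script, the printf that feeds sed would be replaced by the curl call.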
You could use grep again with the -o option. Possibly two consecutive greps to also filter out the surrounding id="..." part.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
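The two consecutive greps could look roughly like this. The input line and the id form-abc123 are made up for illustration; in the real script the line would come from curl.

```shell
# Made-up input line; form-abc123 is a hypothetical id.
line='<input type="hidden" name="form_build_id" id="form-abc123" value="form-abc123" />'

# First grep -o keeps only the id="..." attribute,
# second grep -o strips the surrounding id="" and keeps the value.
form_id=$(printf '%s\n' "$line" \
  | grep -o 'id="form-[a-zA-Z0-9]*"' \
  | grep -o 'form-[a-zA-Z0-9]*')

printf '%s\n' "$form_id"
```

Matching on id="form- first keeps the value attribute, which holds the same string, out of the result.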
