How to use re.sub in Python - python-3.x

Please help me replace a particular string with re.sub().
'<a href="/abcd-10063/" target="_blank">'
needs to become
'<a href="./abcd-10063.html" target="_blank">'
I wrote the script below:
import re
test = '<a href="/abcd-10063/" target="_blank">'
print(re.sub(r'/abcd-[0-9]','./abcd-[0-9].html', test))
which returns
<a href="./abcd-[0-9].html0063/" target="_blank">

First of all, your regular expression is incorrect: [0-9] matches a single digit, so the pattern matches only /abcd-1. The replacement string is also inserted literally, which is why [0-9] shows up in your output.
Change the regex to /abcd-[0-9]+. The + matches one or more digits, so it covers the whole number. To match the trailing / as well, add it to the regex.
So the final regex will be /abcd-[0-9]+/.
To reuse the matched content in the substitution, you need to create a group in the regex. Since we want to reuse just /abcd-[0-9]+ and not the trailing /, put /abcd-[0-9]+ in a group, like this: (/abcd-[0-9]+)/.
Now \1 in the replacement refers to the matched group, where 1 is the group number. If you wanted to use a second group, you would use \2. The replacement is written as a raw string (r'...') so that \1 is not treated as an escape sequence.
So your final code will be:
import re
test = '<a href="/abcd-10063/" target="_blank">'
print(re.sub(r'(/abcd-[0-9]+)/', r'.\1.html', test))
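A side note on the \1 syntax: if the text right after the group reference in the replacement started with a digit, \1 would become ambiguous, and Python's \g<1> form avoids that. Below is a minimal sketch of the same substitution using \g<1>; the helper name to_relative_html is made up for illustration and is not part of the question.

```python
import re

# Hypothetical helper (name invented for this sketch): rewrites
# /abcd-<digits>/ links into ./abcd-<digits>.html links.
def to_relative_html(html: str) -> str:
    # \g<1> is equivalent to \1 here, but it stays unambiguous even when
    # the replacement text directly after the group begins with a digit.
    return re.sub(r'(/abcd-[0-9]+)/', r'.\g<1>.html', html)

test = '<a href="/abcd-10063/" target="_blank">'
print(to_relative_html(test))  # <a href="./abcd-10063.html" target="_blank">
```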

Related

Groovy replace using Regex

I have a variable that contains raw data, as shown below.
I want to remove the commas inside the double quotes.
I used replaceAll(',',''), but that replaces all the other commas too.
So I need a regex that identifies patterns like "123,456,775" and removes only the commas inside them.
var = '
2/5/2023,25,"717,990","18,132,406"
2/4/2023,27,"725,674","19,403,116"
2/3/2023,35,"728,501","25,578,008"
1/31/2023,37,"716,580","26,358,186"
2/1/2023,37,"720,466","26,494,010"
1/30/2023,37,"715,685","26,517,878"
2/2/2023,37,"723,545","26,603,765" '
I tried replaceAll, but it did not work.
If you just want to replace "," with "", you have to escape the quotes. This will do:
var.replaceAll(/\",\"/, /\"\"/)
If you want to remove the commas inside the number strings, turning "725,674" into "725674", you will have to use a regex with capture groups, like this:
var.replaceAll(/(\"\d+),(\d+\")/, /$1$2/)
This pattern handles a single comma only. For values with two commas, like "18,132,406", you will have to use three capture groups.
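One way to avoid counting capture groups is to match the whole quoted number and strip its commas in a callback. The sketch below shows that idea in Python rather than Groovy, so it illustrates the technique and is not a drop-in replacement for the call above; the function name strip_commas_in_quotes is made up for illustration.

```python
import re

# Matches a double-quoted run of digits and commas, e.g. "18,132,406".
QUOTED_NUMBER = re.compile(r'"[\d,]+"')

def strip_commas_in_quotes(text: str) -> str:
    # For each quoted number, drop the commas but keep the quotes.
    # Commas outside the quotes are never part of a match, so they stay.
    return QUOTED_NUMBER.sub(lambda m: m.group(0).replace(',', ''), text)

sample = '2/5/2023,25,"717,990","18,132,406"'
print(strip_commas_in_quotes(sample))  # 2/5/2023,25,"717990","18132406"
```

The callback receives each match object, so the same code covers one, two, or more commas per number.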

remove color codes from console text output

I have a string which consists of color codes for some of the words. For example:
[38;2;139;0;0mHello [38;2;255;255;255m[38;2;128;128;128mWorld [38;2;255;255;255m
I need a way to remove these codes. The color code values are dynamic and keep changing, but they follow the pattern [38;2;r;g;bm. The regular expression, applied to the string above, should return 'Hello World'.
I tried the regular expression ^.*$ so far, but it did not work.
I would like to do this in Python, not Perl, sed or Bash.
Any suggestions on how this can be done? Or a valid regex to replace with ''?
You can create a regular expression for those color codes then use the Pattern.sub() method, passing an empty string as the first argument, to remove those unwanted parts.
A regular expression which matches the pattern provided ([38;2;r;g;bm):
\[\d{,3}(;\d{,3}){4}m
Using a stricter variant of this expression, with the 38;2 prefix spelled out, together with the sub() method and the test string provided:
>>> import re
>>> regex = re.compile(r"\[38;2(;\d{,3}){3}m")
>>> regex.sub("", "[38;2;139;0;0mHello [38;2;255;255;255m[38;2;128;128;128mWorld [38;2;255;255;255m")
'Hello World '
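The same idea can be wrapped in a small reusable function. This is a sketch under one assumption that is not in the question: terminal color codes are often preceded by an ESC character (\x1b), so the pattern below accepts an optional one. The names COLOR_CODE and strip_color_codes are made up for illustration.

```python
import re

# Assumption: an ESC character (\x1b) may precede each code. The string in
# the question does not show one, so the prefix is optional here.
COLOR_CODE = re.compile(r"\x1b?\[38;2(?:;\d{1,3}){3}m")

def strip_color_codes(text: str) -> str:
    # Replace every color code with the empty string.
    return COLOR_CODE.sub("", text)

raw = "[38;2;139;0;0mHello [38;2;255;255;255m[38;2;128;128;128mWorld [38;2;255;255;255m"
print(repr(strip_color_codes(raw)))  # 'Hello World '
```

The trailing space remains because it is part of the text, not of a color code; call .strip() on the result if it is unwanted.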

Separating an HTML Element String into Multiple Strings

I am webscraping using puppeteer and I am trying to extract the innerText of this h4 element.
<h4 class="loss">
(NA)
<br>
<span class="team-name">TEAMNAME</span>
<br>
<span class="win spoiler-wrap">0</span>
</h4>
I am able to get this element using:
const teamName = await matches.$eval('h4', (h4) => h4.innerHTML);
This will set teamName to:
(NA)<br><span class="team-name">TEAMNAME</span><br><span class="win spoiler-wrap">0</span>
I am trying to get only the inner text of each element.
I can get the (NA) using const s = teamName.substr(0, teamName.indexOf('<'));
But I cannot seem to figure out how to get "TEAMNAME" or "0" out of this string. I have thoughts of using regex, but I am not sure how I would accomplish this.
PS the inner text will not always be the same so I can't look for specific words.
With regex, you can do it like this:
teamName.match(/<span class="team-name">(.*?)<\/span>/)[1]
match returns an array, where the first element is the match of the whole regex, the second element is the match of the first regex group, the third element would be the match of the second regex group (there is none in this case), etc.
The /.../ delimits a regex literal. . in a regex matches any character except a line break. * specifies that any number of occurrences of the character is matched, including 0 occurrences; it is greedy by default, so the ? after it makes the match stop at the first </span> instead of the last one. (...) is a regex group, which is used by match. \ is an escape character, needed because / is the special character that starts and ends a regex.
I very much recommend reading the Mozilla docs on match and on regexes for details. You will often find them useful.
However, in the case of puppeteer there probably also is a way of directly matching the selector h4 span, which would be more straightforward than using regexes. I don't know enough about puppeteer to tell you the exact way of doing that. :/
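The difference between a greedy and a lazy quantifier matters for this particular string, because it contains two closing </span> tags. The sketch below demonstrates it in Python (not the thread's JavaScript, though the quantifiers behave the same way), using the innerHTML value quoted in the question.

```python
import re

html = ('(NA)<br><span class="team-name">TEAMNAME</span><br>'
        '<span class="win spoiler-wrap">0</span>')

# Greedy: .* runs on to the LAST </span> in the string.
greedy = re.search(r'<span class="team-name">(.*)</span>', html).group(1)

# Lazy: .*? stops at the FIRST </span> after the opening tag.
lazy = re.search(r'<span class="team-name">(.*?)</span>', html).group(1)

# A lazy group over any span collects the text of every span in order.
all_spans = re.findall(r'<span[^>]*>(.*?)</span>', html)

print(greedy)     # TEAMNAME</span><br><span class="win spoiler-wrap">0
print(lazy)       # TEAMNAME
print(all_spans)  # ['TEAMNAME', '0']
```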
With a bit more thinking, I was able to solve my issue.
Here is a solution:
const teamName = await matches.$eval('h4', (h4) => h4.innerHTML);
const openSpanGT = teamName.indexOf('>', 20);
const closeSpanLT = teamName.indexOf('<', openSpanGT);
const teamTitle = teamName.substr(openSpanGT + 1, closeSpanLT - openSpanGT - 1);
console.log(teamTitle);
This will output "TEAMNAME" no matter how long the team name is, as long as index 20 still falls inside the opening <span class="team-name"> tag.

Find and replace text and wrap in "href"

I am trying to find specific words in a div (id="test") that start with "a04" (case-insensitive). I can find and replace the words found, but I am unable to correctly use the found word in an "href" link.
The following code correctly identifies my search criteria and works as expected, but I do not know how to use the found word as the URL id.
var test = document.getElementById("test").innerHTML
function replacetxt(){
var str_rep = document.getElementById("test").innerHTML.replace(/a04(\w)+/g,'TEST');
var temp = str_rep;
//alert(temp);
document.getElementById("test").innerHTML = temp;
}
I would like to wrap the found word in an href, but I do not know how to use the found word as the URL id (url.com?id=found word).
Can someone help point out how to reference the found word please?
Thanks
If you want to use your pattern with the capturing group, you could move the quantifier + inside the group or else you would only get the value of the last iteration.
\ba04(\w+)
\b word boundary to prevent the match being part of a longer word
a04 Match literally
(\w+) Capture group 1, match 1+ times a word character
Then you could use the first capturing group in the replacement by referring to it with $1
If the string is a04word, you would capture word in group 1.
Your code might look like this (swap 'TEST' for a replacement string that uses $1):
function replacetxt(){
var elm = document.getElementById("test");
if (elm) {
elm.innerHTML = elm.innerHTML.replace(/\ba04(\w+)/g,'TEST');
}
}
replacetxt();
<div id="test">This is text a04word more text here</div>
Note that you don't have to create extra variables like var temp = str_rep;
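To make the group reference concrete, here is a sketch of wrapping each found word in a link. It is written in Python rather than JavaScript, so the replacement tokens differ: Python's \g<0> (the whole match) corresponds to $& in a JavaScript replacement string, and \1 corresponds to $1. The url.com?id= prefix is taken from the question's wording, and the function name link_a04_words is made up for illustration.

```python
import re

def link_a04_words(html: str) -> str:
    # \g<0> is the whole matched word (e.g. a04word); \1 would be only the
    # part captured after the a04 prefix (e.g. word).
    # re.IGNORECASE reflects the question's "no case" requirement.
    return re.sub(
        r'\ba04(\w+)',
        r'<a href="url.com?id=\g<0>">\g<0></a>',
        html,
        flags=re.IGNORECASE,
    )

text = 'This is text a04word more text here'
print(link_a04_words(text))
# This is text <a href="url.com?id=a04word">a04word</a> more text here
```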

Reading from a string using sscanf in Matlab

I'm trying to read a string in a specific format:
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>
This is one example of the string, and what I want to extract is the name of the team.
I've tried something like this,
houseteam = sscanf(str, '%s');
but it does not work, why?
You can use regexprep, as you did in your post above. Although your post asks about sscanf, the comments indicate that you'd like to see this done with regexprep. It takes two nested regexprep calls to retrieve the team name (i.e. RealSociedad), given that str is in the format that you have provided:
str = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>';
houseteam = regexprep(regexprep(str, '^<a(.*)">', ''), '</a>$', '')
This looks very intimidating, but let's break this up. First, look at this statement:
regexprep(str, '^<a(.*)">', '')
How regexprep works is you specify the string you want to analyze, the pattern you are searching for, then what you want to replace this pattern with. The pattern we are looking for is:
^<a(.*)">
This says you are looking for patterns where the string starts with <a. After this, the (.*)"> performs a greedy match: we want the longest sequence of characters up to the characters ">. As such, what the regular expression will match is the following string:
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">
We then replace this with a blank string. As such, the output of the first regexprep call will be this:
RealSociedad</a>
We want to get rid of the </a> string, and so we would make another regexprep call where we look for the </a> at the end of the string, then replace this with the blank string yet again. The pattern you are looking for is thus:
</a>$
The dollar sign ($) symbolizes that this pattern should appear at the end of the string. If we find such a pattern, we will replace it with the blank string. Therefore, what we get in the end is:
RealSociedad
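The two-step replacement can be traced in Python as well, since the ^, $ and greedy .* constructs used here behave the same way in its re module. This is a sketch of the same idea and not MATLAB code; the input string is reconstructed from the pieces quoted in the answer above.

```python
import re

# Input reconstructed from the matched prefix and the remainder quoted above.
s = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>'

# Step 1: remove everything from the leading <a up to the last "> .
without_open = re.sub(r'^<a(.*)">', '', s)

# Step 2: remove the closing </a> anchored at the end of the string.
houseteam = re.sub(r'</a>$', '', without_open)

print(without_open)  # RealSociedad</a>
print(houseteam)     # RealSociedad
```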
Found a solution. So, %s stops when it finds a space.
str = regexprep(str, '<', ' <');
str = regexprep(str, '>', '> ');
houseteam = sscanf(str, '%*s %s %*s');
This puts spaces around my desired string, so the middle %s reads it while the two %*s skip the surrounding tags.

Resources