Python regex get (any) last word in string - python-3.x

My goal is to get the last word of a string, no matter what the word is.
After a lot of trial and error I got kinda lucky with the following code: instead of \w+ I tried \W+ and got a result I could work with.
But my actual code (the one you don't see here) is kinda messy, so my question is: what is the right regex to compile to get the last word, or the last two words?
Thanks in advance!
import re
var = ' hello my name is eddie '
r = re.compile(r'\S+\W+$')
r2 = r.findall(var)
print(r2)
#result ['eddie ']

Use
import re
var = ' hello my name is eddie '
r_last_word = re.compile(r'\S+(?=\s*$)')
r_last_but_one = re.compile(r'\S+(?=\s+\S+\s*$)')
print(r_last_word.findall(var))
print(r_last_but_one.findall(var))
Results:
['eddie']
['is']
The lookahead (?=...) checks the text that follows without including it in the match.
\S+(?=\s*$) - one or more non-whitespace characters, followed only by optional whitespace up to the end of the string.
\S+(?=\s+\S+\s*$) - one or more non-whitespace characters, followed by one or more whitespace characters, one more run of non-whitespace characters (the last word), and then optional whitespace up to the end of the string.
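If both words are wanted in one pass, a single pattern with two capture groups is one option. This is a minimal sketch rather than part of the answer above; the names last_two and words are made up for illustration, and str.split() is shown only as a regex-free comparison.

```python
import re

var = ' hello my name is eddie '

# Capture the last two whitespace-separated words in one search.
m = re.search(r'(\S+)\s+(\S+)\s*$', var)
last_two = m.groups() if m else ()
print(last_two)

# Regex-free comparison: split on whitespace and slice from the end.
words = var.split()
print(words[-1], words[-2:])
```

The guard on m matters when the string has fewer than two words, because re.search then returns None.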

Related

Get number from string in Python

I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am only getting an empty list, [].
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
url = "www.mylocalurl.com/edit/1987"
num = re.sub(r'\D', '', url)  # '1987'
You aren't getting anything because, by default, the .split() method splits a string on whitespace. Your URL contains no whitespace, so the result is a single item, the whole URL, and .isdigit() is False for it. What you can do instead is capture the digits with a regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex defined as r'(\d+)', which means "capture one or more consecutive digits", search through the url. Then captured holds the first captured group, which is '1987'.
If you don't want to use a regex, you can keep your .split() approach but split on / as the separator, for example url.split('/').
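As a rough sketch of that split-based route, assuming the number is always the last path segment (the names last_segment and record_id are illustrative only):

```python
url = "www.mylocalurl.com/edit/1987"

# Take everything after the last "/" (maxsplit=1 splits once, from the right).
last_segment = url.rsplit('/', 1)[-1]

# Convert only if the segment is purely digits; otherwise fall back to None.
record_id = int(last_segment) if last_segment.isdigit() else None
print(record_id)
```

If the URL could end with a trailing slash or a query string, this assumption no longer holds and the regex approach is probably the safer choice.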

How to replace hyphen and newline in string in Python

I am working with a text in which several words are hyphenated across line breaks.
A typical string looks something like this:
"this good pe-
riod has"
I tried:
my_string.replace('-'+"\r","")
However, it is not working.
I would like to get
"this good period has"
Have you tried this?
import re
text = """this good pe-
riod has"""
print(re.sub(r"-\s+", '', text))
# this good period has
After you match -, you should match the newline \n :
my_string = """this good pe-
riod has"""
print(my_string.replace("-\n",""))
# this good period has
Depending on the line endings in your string, you could also use my_string.replace('-\r\n', ''), or match an optional carriage return with re.sub and -(?:\r?\n|\r)
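A small sketch of the re.sub variant, wrapped in a helper so it can be tried against the three common line endings. The function name join_hyphenated is hypothetical.

```python
import re

def join_hyphenated(text):
    # Remove a hyphen that is immediately followed by \n, \r\n, or a lone \r.
    return re.sub(r'-(?:\r?\n|\r)', '', text)

print(join_hyphenated("this good pe-\nriod has"))
print(join_hyphenated("this good pe-\r\nriod has"))
print(join_hyphenated("this good pe-\rriod has"))
```

A hyphen inside a line, as in "well-known", is left alone because no line break follows it.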
If there has to be a word character before and after, instead of removing all the hyphens at the end of the line, you could use lookarounds:
(?<=\w)-\r?\n(?=\w)
For example
import re
regex = r"(?<=\w)-\r?\n(?=\w)"
my_string = """this good pe-
riod has"""
print (re.sub(regex, "", my_string))
Output
this good period has

How to get the content after a string using regex in python

I am having a string as follows:
A5697[2:10] = {ravi, rageev, raghav, smith};
I want the content after "A5697[2:10] =". So, my output should be:
{ravi, rageev, raghav, smith};
This is my code:
print(re.search(r'(?<=A\d+\[.*\] =\s).*', line).group())
But, this is giving error:
sre_constants.error: look-behind requires fixed-width pattern
Can anyone help to solve this issue? I would prefer to use regex.
You can try re.sub, like below. Since you have given only one data point, I am assuming all the other data points follow a similar pattern.
import re
text = "A5697[2:10] = {ravi, rageev, raghav, smith}"
re.sub(r'(A\d+\[\d+:\d+\]\s+=\s+)(.+)', r'\2', text)
returns,
'{ravi, rageev, raghav, smith}'
re.sub replaces the entire match with the contents of the 2nd capturing group. The second capturing group captures everything after '= '.
Simply replace the bits you don't want:
print(re.sub(r'A\d[^=]*= *', '', line))
See demo here: https://rextester.com/NSG17655
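The original error comes from the variable-width look-behind, which Python's re module does not allow. One way to sidestep it while still using re.search is to match the prefix normally and put the wanted part in a capture group. This is only a sketch; the prefix pattern assumes the A<digits>[...] = shape from the question.

```python
import re

line = "A5697[2:10] = {ravi, rageev, raghav, smith};"

# Match the prefix outside the group, then capture the rest of the line.
m = re.search(r'A\d+\[[^\]]*\]\s*=\s*(.*)', line)
content = m.group(1) if m else None
print(content)
```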

Adding a space character (" ") 2 character spaces after a period character (".")?

We have a legacy system that is exporting reports as .txt files, but in almost all instances when a date is supplied, it is after a currency denomination, and looks like this example:
25.0002/14/18 (25 bucks on feb 14th) or 287.4312/08/17.
Is there an easy way to parse for . and add a space character two spaces to the right to separate the string in Python? Any help is greatly appreciated!
The code below adds a space between the currency amount and the date in a given string.
import re
my_file_text = "This is some text 287.4312/08/17"
new_text = re.sub(r"(\d+\.\d{2})(\d{2}/\d{2}/\d{2})", r"\1 \2", my_file_text)
print(new_text)
OUTPUT
This is some text 287.43 12/08/17
REGEX
(\d+\.\d{2}): This part of the regex captures the currency in its own group. It assumes one or more digits before the . and exactly two digits after it, so something like (1000.25) would be captured correctly, while (1000.205) and (.25) won't.
(\d{2}/\d{2}/\d{2}): This part captures the date, it assumes that the day, month and year portion of the dates will always be represented using two digits each and separated by a /.
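To check those assumptions against both examples from the question, the same pattern can be compiled once and applied to several strings. A minimal sketch; the helper name separate_amount_and_date and the sample list are made up for illustration.

```python
import re

# Group 1: amount with exactly two decimals. Group 2: a two-digit MM/DD/YY date.
AMOUNT_DATE = re.compile(r"(\d+\.\d{2})(\d{2}/\d{2}/\d{2})")

def separate_amount_and_date(text):
    # Re-emit both groups with a single space between them.
    return AMOUNT_DATE.sub(r"\1 \2", text)

samples = ["25.0002/14/18", "287.4312/08/17", "total 25.00"]
for sample in samples:
    print(separate_amount_and_date(sample))
```

The last sample has no date fused onto the amount, so the pattern does not match and the string comes back unchanged.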
There may be more efficient methods, but an easy way could be:
def fix(string):
    if '.' in string:
        # Split on the first period only, so a later period can't break the unpacking
        part_1, part_2 = string.split('.', 1)
        part_2_fixed = part_2[:2] + ' ' + part_2[2:]
        string = part_1 + '.' + part_2_fixed
    return string
In [1]: string = '25.0002/14/18'
In [2]: fix(string)
Out[2]: '25.00 02/14/18'

regex - Making all letters in a text lowercase using re.sub in python but exclude specific string?

I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I currently have is shown below. However, it does not produce the desired output. For instance, when I input "#Activities: SBR", the output given is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re
def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is
"#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
is using a negated character class, meaning it matches every single character that is not in the bracketed set and replaces each one with line.lower() (which is "#activities: sbr"). Inside [...], the | characters and the words have no special meaning; the set is just the individual characters listed.
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
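This can be confirmed with re.findall, which lists exactly the characters the class matches. The snippet below is just a quick check of the behaviour described above; the variable names are illustrative.

```python
import re

line = "#Activities: SBR"
char_class = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'

# Every character of the line that is NOT in the bracketed set.
matched = re.findall(char_class, line)
print(matched)

# Each matched character is replaced by the whole lowercased line.
broken = re.sub(char_class, line.lower(), line)
print(broken)
```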
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply substitution on the other words on that same line. Instead, you can split the line first into individual words, then apply re.sub on it using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    line = line.strip()
    words = re.split(r"\s+", line)
    result = []
    for word in words:
        PATTERN = r"^(?!TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment).*$"
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
re.sub only matches a word that does not start with one of the excluded strings, and replaces it with its lowercase form. If the word starts with an excluded string, the pattern does not match and re.sub returns the word unchanged.
Each word is then stored in a list, and the list is joined at the end to rebuild the line.
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, apparently there is a tab between : and the next character. Since the code splits the string on whitespace (\s+), the tab can't be preserved, but it can be restored by replacing ": " with ":\t" in the final result.
    return " ".join(result).replace(": ", ":\t")
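An alternative that keeps the original whitespace, tabs included, is to leave the line unsplit and let re.sub call a function for each run of non-whitespace characters. This is a sketch under a few assumptions that are not in the answer above: the names EXCLUDED and lowercase_except are hypothetical, a token counts as excluded only when it equals a listed string exactly, and trailing punctuation such as ":" or "." is ignored for that comparison.

```python
import re

# Strings to leave untouched (taken from the list in the question).
EXCLUDED = {
    "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants",
    "#Media", "#Transcriber", "#Activities", "SBR", "#Comment",
}

def lowercase_except(line, excluded=EXCLUDED):
    def repl(match):
        token = match.group(0)
        # Assumption: ignore trailing punctuation when comparing.
        core = token.rstrip(":.,;!?")
        return token if core in excluded else token.lower()
    # \S+ matches each token; the whitespace between tokens is never touched.
    return re.sub(r"\S+", repl, line)

print(lowercase_except("#Activities:\tSBR"))
print(lowercase_except("I HOPe thIS heLPS."))
print(lowercase_except("PLACE #Begin AT THE MIDDLE."))
```

Because the comparison is exact, a word such as "INTO" is lowercased even though it starts with the excluded string "I".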
