Remove newlines from a regex matched string - python-3.x

I have a string as below:
Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?Smoking status: Never Smoker\n?
I want to first match the substring/sentence of interest (i.e. the sentence beginning with "Food insecurity" and ending with "\n?"), then remove all the newlines in that sentence apart from the last one, i.e. the one before the question mark.
I have been able to match the sentence without its last newline and question mark with the regex (Food insecurity:).*?(?=\\n\?), but I struggle to remove the first two newlines of the matched sentence and return the whole preprocessed string. Any advice?

You could use re.sub with a callback function:
inp = "Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?Smoking status: Never Smoker\n?"
output = re.sub(r'Food insecurity:\nWorry: No\nInability: No(?=\n\?)', lambda m: m.group().replace('\n', ''), inp)
print(output)
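If you would rather not hard-code the middle of the sentence, the same callback idea works with a lazy match. A minimal sketch, assuming every sentence of interest runs from "Food insecurity:" up to the next "\n?" (re.DOTALL lets .*? cross the inner newlines):
import re

inp = "Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?Smoking status: Never Smoker\n?"

# Match lazily from "Food insecurity:" up to, but not including, the "\n?" that
# ends the sentence, then strip the newlines inside the match only.
pattern = r'Food insecurity:.*?(?=\n\?)'
output = re.sub(pattern, lambda m: m.group().replace('\n', ''), inp, flags=re.DOTALL)
print(output)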

Related

How can I find all the strings that contain "/1" and remove them from a file using Python?

I have a file that contains these kinds of strings, e.g. "1405079/1"; the only thing they have in common is the "/1" at the end. I want to be able to find those strings and remove them. Below is my sample code, but it's not doing anything.
with open("jobstat.txt","r") as jobstat:
with open("runjob_output.txt", "w") as runjob_output:
for line in jobstat:
string_to_replace = ' */1'
line = line.replace(string_to_replace, " ")
with open("jobstat.txt","r") as jobstat:
with open("runjob_output.txt", "w") as runjob_output:
for line in jobstat:
string_to_replace ='/1'
line =line.rstrip(string_to_replace)
print(line)
Anytime you have a "pattern" you want to match against, use a regular expression. The pattern here, given the information you've provided, is a string with an arbitrary number of digits followed by /1.
You can use re.sub to match against that pattern, and replace instances of it with another string.
import re
original_string= "some random text with 123456/1, and midd42142/1le of words"
pattern = r"\d*\/1"
replacement = ""
re.sub(pattern, replacement, original_string)
Output:
'some random text with , and middle of words'
Replacing instances of the pattern with something else:
>>> re.sub(pattern, "foo", original_string)
'some random text with foo, and middfoole of words'
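Applied to the files from the question, a minimal sketch (assuming jobstat.txt is read line by line and the cleaned lines go to runjob_output.txt, as in the original attempt) could look like:
import re

pattern = r"\d*/1"  # any run of digits followed by /1
with open("jobstat.txt", "r") as jobstat, open("runjob_output.txt", "w") as runjob_output:
    for line in jobstat:
        # Remove every "<digits>/1" token from the line and write the cleaned line out.
        runjob_output.write(re.sub(pattern, "", line))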

Remove all words after a specific word, including that specific word

I have a dataframe that consists of sentences. I want to delete part of each sentence in the dataframe, starting from a specific match:
df['data']=["First: This is the sentence good mode:one line","Second: This sentence is also good mode:one line","Third: this sentence is too long mode:two lines"]
I would like to remove the words from mode to the end of the string, including mode itself.
Expected result
df['data']=["First: This is the sentence good","Second: This sentence is also good","Third: this sentence is too long"]
This is what I tried
unwanted_list=["mode: one line"]
df['data'].str.replace(unwanted_list, '', regex=True)
The result removes "one line" but "mode" is still there; I would like to remove "mode: one line" entirely.
IIUC, use str.replace with the \s*\bmode:.* regex.
df['data'] = df['data'].str.replace(r'\s*\bmode:.*', '', regex=True)
output:
data
0 First: This is the sentence good
1 Second: This sentence is also good
2 Third: this sentence is too long
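A self-contained sketch of the same idea, assuming the three example sentences from the question:
import pandas as pd

df = pd.DataFrame({'data': [
    "First: This is the sentence good mode:one line",
    "Second: This sentence is also good mode:one line",
    "Third: this sentence is too long mode:two lines",
]})

# Strip everything from "mode" (plus any whitespace before it) to the end of each string.
df['data'] = df['data'].str.replace(r'\s*\bmode:.*', '', regex=True)
print(df['data'].tolist())
# ['First: This is the sentence good', 'Second: This sentence is also good', 'Third: this sentence is too long']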

How to extract the exact word from a string while reducing false positive discovery

I want to extract an exact word from a string. My code produces false positives because it also matches the search item as a substring of a longer token. Here is the code:
import re

text = "Hello I am not react-dom"
item_search = ['react', 'react-dom']
Found_item = []
for i in range(0, len(item_search)):
    Q = re.findall(r'\b%s\b' % item_search[i], text, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    Found_item.append(Q)
print(Found_item)
The output is: [['react'], ['react-dom']]. In the result, I don't want to see 'react' as an item.
The expected output is: [[''], ['react-dom']]
\b marks the boundary between character types, e.g. between word characters and punctuation, so there is a \b between the t of react and the following -. Since we need whole (whitespace-delimited) words here, we instead use a lookbehind and a lookahead to ensure that there is no non-space character on either side of the match (which is not the same as requiring a space on either side). Thus you could use:
re.findall(rf"(?<!\S)({'|'.join(item_search)})(?!\S)", text)
['react-dom']
Edit:
If the sentence can also contain other non-word characters such as periods (see DYZ's comment), you could use:
(?<!\S)({'|'.join(item_search)})\W*(?!\S)

regex - Making all letters in a text lowercase using re.sub in python but exclude specific string?

I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I have is currently shown below. However, it does not produce the desired output. For instance, when I input "#Activities: SBR", the output is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re

def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is
"#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
is doing negated character class matching, meaning it will match every character that is not inside the class and replace each one with line.lower() (which is "#activities: sbr"). You can see which characters are matched by trying the pattern in a regex tester.
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply substitution on the other words on that same line. Instead, you can split the line first into individual words, then apply re.sub on it using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn't match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    line = line.strip()
    words = re.split(r"\s+", line)
    result = []
    for word in words:
        PATTERN = r"^(?!TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment).*$"
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
The re.sub will only match words that are not excluded by PATTERN and replace each with its lowercase value. If a word starts with one of the excluded prefixes, the pattern does not match and re.sub returns the word unchanged.
Each word is then stored in a list, which is joined at the end to rebuild the line.
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, apparently there is a tab in between : and the next character. Since the code splits the string using \s, the tab can't be preserved, but it can be restored by replacing : with :\t in the final result.
return " ".join(result).replace(":", ":\t")

Removing whitespace after splitting a string in python

1) Prompt the user for a string that contains two strings separated by a comma.
2) Report an error if the input string does not contain a comma. Continue to prompt until a valid string is entered. Note: if the input contains a comma, then assume that the input also contains two strings.
3) Using string splitting, extract the two words from the input string and then remove any spaces. Output the two words.
4) Using a loop, extend the program to handle multiple lines of input. Continue until the user enters q to quit.
I wrote a program following these instructions, but I cannot work out how to remove extra spaces that may be attached to the output words. For example, if you enter "Billy, Bob" it works fine, but if you enter "Billy,Bob" you get an IndexError: list index out of range, and if you enter "Billy , Bob", Billy is output with an extra space attached to the string. Here is my code.
usrIn = 0
while usrIn != 'q':
    usrIn = input("Enter input string: \n")
    if "," in usrIn:
        tokens = usrIn.split(", ")
        print("First word:", tokens[0])
        print("Second word:", tokens[1])
        print('')
        print('')
    else:
        print("Error: No comma in string.")
How do I remove the spaces from the outputs so that I can just use usrIn.split(",")?
You can use the .strip() method, which removes leading and trailing whitespace from a string, e.g. usrIn.strip().split(","). To clean up the individual words, call .strip() on each token after the split as well. Note that Python's str.split does not accept a regular expression; to split on runs of whitespace you would use re.split(r"\s+", usrIn), where \s matches a whitespace character and + matches one or more repetitions.
Hope this helps :)
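Putting that into the loop from the question, a minimal sketch might look like this (splitting on the comma only, then stripping each piece):
usrIn = ''
while usrIn != 'q':
    usrIn = input("Enter input string: \n")
    if usrIn == 'q':
        break
    if "," in usrIn:
        # Split on the comma itself, then strip surrounding spaces from each word.
        first, second = [part.strip() for part in usrIn.split(",", 1)]
        print("First word:", first)
        print("Second word:", second)
        print('')
    else:
        print("Error: No comma in string.")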
