I have a string as:
s="(2021-07-29 01:00:00 AM BST)
---
peter.j.matthew has joined the conversation
(2021-07-29 01:00:00 AM BST)
---
john cheung has joined the conversation
(2021-07-29 01:11:19 AM BST)
---
allen.p.jonas
Hi, james
(2021-07-30 12:51:16 AM BST)
---
karren wenda
how're you ?
---
* * *"
I want to extract the names as:
names_list= ['allen.p.jonas','karren wenda']
what I have tried:
names_list=re.findall(r'--- [\S\n](\D+ [\S\n])',s)
This answer assumes that you want to find names on lines that do not end with the text "has joined the conversation":
names = re.findall(r'\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---\s+\r?\n((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n', s)
print(names) # ['allen.p.jonas', 'karren wenda']
The salient portion of the regex is this:
((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n
This captures a name without matching has joined the conversation by using a tempered dot trick. It matches one character at a time on the line containing the name, making sure that the conversation text does not appear anywhere, until reaching the CR?LF at the end of the line.
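To see the tempered dot on its own, here is a minimal sketch that applies it to individual lines rather than to the whole string. The list `lines` and the names `tempered` and `kept` are made up for illustration, and the pattern is anchored with ^ and $ only so that each line is tested as a whole.

```python
import re

# Tempered dot: consume one character at a time, but only while the
# forbidden phrase does not start at the current position.
tempered = re.compile(r'^((?:(?!has joined the conversation).)+)$')

# Hypothetical input lines, taken from the sample in the question.
lines = [
    "allen.p.jonas",
    "peter.j.matthew has joined the conversation",
    "karren wenda",
]

# A line is kept only if the tempered dot can reach the end of the line.
kept = [line for line in lines if tempered.match(line)]
print(kept)  # ['allen.p.jonas', 'karren wenda']
```

On the join line the lookahead fails at the h of "has", so the group can never reach the end of the line and the match is rejected.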
If you only want to match ['allen.p.jonas','karren wenda'], you can require a non-whitespace char at the start of the next line:
^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S
The pattern matches:
^ Start of a line (with re.M, as used in the example below; without that flag, start of the string)
--- Match ---
[^\S\n]*\n Match optional spaces and a newline
(\S.*?) Capture group 1 (returned by re.findall): match a non-whitespace char followed by as few chars as possible
[^\S\r\n]* Match optional whitespace chars without a newline
\n\S Match a newline and a non-whitespace char
For example
print(re.findall(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", s, re.M))
Output
['allen.p.jonas', 'karren wenda']
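This pattern relies on the line after the name starting with visible text, so it depends on how the entries are separated. The sketch below is a self-contained check under an assumption: each entry in `sample` is followed by a blank line, which a flattened listing would not show. The names `sample`, `flat` and `name_after_rule` are illustrative only.

```python
import re

# Assumed layout: a blank line follows every entry.
sample = (
    "(2021-07-29 01:00:00 AM BST)\n"
    "---\n"
    "peter.j.matthew has joined the conversation\n"
    "\n"
    "(2021-07-29 01:11:19 AM BST)\n"
    "---\n"
    "allen.p.jonas\n"
    "Hi, james\n"
    "\n"
    "(2021-07-30 12:51:16 AM BST)\n"
    "---\n"
    "karren wenda\n"
    "how're you ?\n"
)

# The same pattern, written with re.X so each part can carry a comment.
name_after_rule = re.compile(
    r"""
    ^---[^\S\n]*\n   # a line holding only ---, optional trailing blanks
    (\S.*?)          # group 1: the candidate name line
    [^\S\r\n]*\n     # optional trailing blanks, then the line break
    \S               # the following line must start with visible text
    """,
    re.M | re.X,
)

names = name_after_rule.findall(sample)
print(names)  # ['allen.p.jonas', 'karren wenda']

# Without the blank lines the join line is followed by "(" and matches too.
flat = sample.replace("\n\n", "\n")
flat_names = name_after_rule.findall(flat)
print(flat_names)
# ['peter.j.matthew has joined the conversation', 'allen.p.jonas', 'karren wenda']
```

If the real data has no blank lines between entries, the lookahead variant that rejects "has joined the conversation" explicitly is the safer choice.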
To explicitly exclude lines that contain has joined the conversation you can use a negative lookahead:
^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$
For example:
print(re.findall(r"^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$", s, re.M))
Output
['allen.p.jonas', 'karren wenda']
Supposing you want to match names that are not followed by "has joined the conversation":
name_pattern = re.compile(r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n', re.MULTILINE)
print(re.findall(name_pattern, s))
Explanation:
---\s*\n matches the dashes, possibly followed by whitespace, and a required newline
Then comes our capturing group, composed of:
\w starts with a 'word' character (a-z, A-Z, 0-9 or _)
(?:[\w\. ](?!has joined the conversation))*? a non-capturing group that repeats \w, . or a space, each not immediately followed by "has joined the conversation". Matching continues up to the optional trailing whitespace and the newline that end the line. (*? makes the expression lazy instead of greedy)
Output:
['allen.p.jonas', 'karren wenda']
Related
I want to match a pattern that starts with $ and ends with either dot(.) or double quote(").
I tried with this
re.findall(r"\$(.+?)\.",query1)
The above works for a pattern starting with $ and ending with .
How do I add an OR for the ending character, so that it matches a pattern ending with either . or "?
Any suggestions ?
The regex pattern you want is:
\$(.+?)[."]
Your updated script:
import re
query1 = "The item cost $100. It also is $not good\""
matches = re.findall(r"\$(.+?)[.\"]", query1)
print(matches) # ['100', 'not good']
To match either a dot or a double quote, you can use a character class [."]
Because the captured part should not itself contain either of those characters, you can match it with a negated character class [^".]
\$([^".]+)[."]
Example
import re
query1 = 'With a dollar sign $abc. or $123"'
print(re.findall(r'\$([^".]+)[."]', query1))
Output
['abc', '123']
Note: as the negated character class can also match newlines, you can exclude that using:
\$([^".\n]+)[."]
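The difference between the three patterns only shows up when the text contains a line break between the $ and the closing character. A small sketch, using a made-up `text` value, compares them side by side:

```python
import re

# Hypothetical input: the second $ is closed by a quote on the next line.
text = 'Price $100.\nCode $abc\ndef"'

# Lazy dot: "." does not cross a newline, so the second $ finds no match.
lazy = re.findall(r'\$(.+?)[."]', text)

# Negated class: a newline is "not a dot or quote", so it is consumed.
negated = re.findall(r'\$([^".]+)[."]', text)

# Negated class that also excludes the newline: single-line matches only.
negated_one_line = re.findall(r'\$([^".\n]+)[."]', text)

print(lazy)              # ['100']
print(negated)           # ['100', 'abc\ndef']
print(negated_one_line)  # ['100']
```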
I have a string as below:
Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?Smoking status: Never Smoker\n?
I want to first match the substring/sentence of interest (I.e. the sentence beginning with "Food insecurity" and ending with "\n?") then remove all the newlines in this sentence apart from the last one i.e. the one before the question mark.
I have been able to match the sentence w/o its last newline and question mark with regex (Food insecurity:).*?(?=\\n\?) but I struggle to remove the first 2 newlines of the matched sentence and return the whole preprocessed string. Any advice?
You could use re.sub with a callback function:
import re
inp = "Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?Smoking status: Never Smoker\n?"
output = re.sub(r'Food insecurity:\nWorry: No\nInability: No(?=\n\?)', lambda m: m.group().replace('\n', ''), inp)
print(output)
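The pattern above spells out the whole sentence. If the lines between the heading and the closing "\n?" are not known in advance, one possible variation is to match lazily up to the first "\n?" instead. The helper name `flatten_section` and its parameters are hypothetical and not part of the original answer.

```python
import re

inp = ("Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n"
       "?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n"
       "?Smoking status: Never Smoker\n?")

def flatten_section(text, heading):
    """Remove the newlines inside the section that starts with `heading`.

    The section is assumed to end just before the next newline + "?".
    """
    # re.S lets "." cross line breaks; the lazy ".*?" stops at the first
    # position where the lookahead sees a newline followed by "?".
    pattern = re.compile(re.escape(heading) + r'.*?(?=\n\?)', re.S)
    return pattern.sub(lambda m: m.group().replace('\n', ''), text)

output = flatten_section(inp, "Food insecurity:")
print(repr(output))
```

The lookahead keeps the final newline and question mark outside the match, so the callback never touches them.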
I want to extract the exact word from a string. My code causes false discovery by considering the search item as a substring. Here is the code:
import re
text="Hello I am not react-dom"
item_search=['react', 'react-dom']
Found_item=[]
for i in range(0, len(item_search)):
    Q=re.findall(r'\b%s\b'%item_search[i], text, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    Found_item.append(Q)
print(Found_item)
The output is: [['react'], ['react-dom']]. In the result, I don't want to see react as an item.
The expected Output is: [[''], ['react-dom']]
\b marks the boundary between a word character and a non-word character, e.g. between a letter and punctuation, so there is a \b between the t of react and the -. Since we need whole words here, use a lookbehind and a lookahead to ensure that no non-space character sits directly before or after the match (this is not the same as requiring a space on both sides, because the start and end of the string also qualify). Thus you could use:
re.findall(rf"(?<!\S)({'|'.join(item_search)})(?!\S)", text)
['react-dom']
Edit:
If the sentence may contain other non-word characters such as periods (see DYZ's comment), then you could use:
(?<!\S)({'|'.join(item_search)})\W*(?!\S)
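To get one result list per search item, as in the question's loop, the same lookarounds can be applied item by item. This is a sketch rather than the answerer's code: wrapping each item in re.escape is an added precaution so that characters such as . or + inside an item are matched literally, and `found` is an illustrative name.

```python
import re

text = "Hello I am not react-dom"
item_search = ['react', 'react-dom']

found = []
for item in item_search:
    # (?<!\S) and (?!\S): no non-space character directly before or after.
    pattern = rf"(?<!\S){re.escape(item)}(?!\S)"
    found.append(re.findall(pattern, text, flags=re.IGNORECASE))

print(found)  # [[], ['react-dom']]
```

An item that is not found yields an empty list rather than [''].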
I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I currently have is shown below. However, it does not provide the desired outputs. For instance, when I input "#Activities: SBR", the output given is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re
def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is
"#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
is a negated character class: it matches every single character that is not one of the characters listed between the brackets (inside the brackets, | is a literal character, not alternation) and replaces each match with line.lower() (which is "#activities: sbr").
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
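This can be checked directly by listing what the pattern matches before doing the substitution. The snippet below reuses the question's PATTERN unchanged; only the variable name `matched` is new.

```python
import re

PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
line = "#Activities: SBR"

# Every character of the line except ":" and " " appears inside the brackets.
matched = re.findall(PATTERN, line)
print(matched)  # [':', ' ']

# Each of the two matches is replaced by the whole lowercased line.
result = re.sub(PATTERN, line.lower(), line)
print(result)  # #Activities#activities: sbr#activities: sbrSBR
```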
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply substitution on the other words on that same line. Instead, you can split the line first into individual words, then apply re.sub on it using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    line = line.strip()
    # Raw string, so that \s reaches the regex engine unchanged.
    words = re.split(r"\s+", line)
    result = []
    for word in words:
        PATTERN = r"^(?!TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment).*$"
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
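re.sub only matches a word that does not start with one of the excluded strings, and replaces it with its lowercase form. If the word starts with an excluded string, the pattern does not match and re.sub returns the word unchanged. Because the lookahead only inspects the start of the word, a word such as "IS" (which begins with "I") is also left unchanged.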
Each word is then stored in a list, then joined later to form the line back.
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, apparently there is a tab in between : and the next character. Since the code splits the string using \s, the tab can't be preserved, but it can be restored by replacing : with :\t in the final result.
return " ".join(result).replace(":", ":\t")
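An alternative that keeps tabs and other whitespace exactly as they were is to skip the split and let re.sub visit each non-whitespace token with a callback. This is a sketch under assumptions, not the code from the answer: the names `KEEP`, `lowercase_except` and `fix` are hypothetical, the set holds only the strings listed in the question, and trailing punctuation is ignored when a token is compared against the set.

```python
import re

# Strings to leave untouched, as listed in the question.
KEEP = {"TEA", "CHI", "I", "ENG", "SBR", "#Begin", "#Language",
        "#Participants", "#Media", "#Transcriber", "#Activities", "#Comment"}

def lowercase_except(line, keep=KEEP):
    def fix(match):
        token = match.group()
        # Compare without trailing punctuation so "#Activities:" is kept.
        core = token.rstrip(".,:;!?")
        return token if core in keep else token.lower()
    # \S+ matches each run of non-whitespace characters, so the tabs and
    # spaces between tokens are never part of a match and stay as they are.
    return re.sub(r"\S+", fix, line)

print(repr(lowercase_except("#Activities:\tSBR")))  # '#Activities:\tSBR'
print(lowercase_except("I THINK thIS IS FINE."))    # I think this is fine.
```

Because tokens are compared as whole strings against the set, a word that merely begins with an excluded string, such as "IS", is lowercased.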
1)Prompt the user for a string that contains two strings separated by a comma.
2)Report an error if the input string does not contain a comma. Continue to prompt until a valid string is entered. Note: If the input contains a comma, then assume that the input also contains two strings.
3)Using string splitting, extract the two words from the input string and then remove any spaces. Output the two words.
4)Using a loop, extend the program to handle multiple lines of input. Continue until the user enters q to quit.
I wrote a program with these instructions although I cannot work out how to remove extra spaces that may be attached to the outputted words. For example if you enter "Billy, Bob" it works fine, but if you enter "Billy,Bob" you will get an IndexError: list index out of range, or if you enter "Billy , Bob" Billy will be outputted with an extra space attached to the string. Here is my code.
usrIn=0
while usrIn!='q':
    usrIn = input("Enter input string: \n")
    if "," in usrIn:
        tokens = usrIn.split(", ")
        print("First word:",tokens[0])
        print("Second word:",tokens[1])
        print('')
        print('')
    else:
        print("Error: No comma in string.")
How do I remove spaces from the outputs so I can just use a usrIn.split(",") ?
You can use the .strip() method, which removes leading and trailing whitespace.
Split on the comma alone with usrIn.split(","), then call .strip() on each token. If you also need to remove whitespace inside a token, re.sub(r"\s+", "", token) deletes it.
The \s matches a whitespace character, while the + quantifier matches one or more repetitions.
Hope this helps :)
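Put together, the parsing step might look like the sketch below. The function names `split_pair` and `run` are hypothetical, and `run` takes a list of lines instead of calling input() so that the example can be executed without a prompt.

```python
def split_pair(line):
    """Return the two comma-separated words, stripped, or None if no comma."""
    if "," not in line:
        return None
    # Split on the first comma only; per the assignment note, a comma
    # implies that two words are present.
    first, second = line.split(",", 1)
    return first.strip(), second.strip()

def run(lines):
    for line in lines:
        if line.strip() == "q":
            break  # quit before the comma check, so "q" is not an error
        pair = split_pair(line)
        if pair is None:
            print("Error: No comma in string.")
        else:
            print("First word:", pair[0])
            print("Second word:", pair[1])

run(["Billy,Bob", "Billy , Bob", "Billy Bob", "q"])
```

For interactive use, the list could be replaced by values read with input() in a loop, as in the original program.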