Regex pattern to match a string that ends with one of multiple characters - python-regex

I want to match a pattern that starts with $ and ends with either a dot (.) or a double quote (").
I tried this:
re.findall(r"\$(.+?)\.",query1)
This works for text starting with $ and ending with .
How can I add an OR to the ending character so that it matches a pattern ending with either . or "?
Any suggestions?

The regex pattern you want is:
\$(.+?)[."]
Your updated script:
import re
query1 = "The item cost $100. It also is $not good\""
matches = re.findall(r"\$(.+?)[.\"]", query1)
print(matches) # ['100', 'not good']

To match either a dot or a double quote, you can use a character class [."]
Because the text in between should not contain either of those characters, you can replace the lazy .+? with a negated character class that excludes both of them [^".]
\$([^".]+)[."]
Example
import re
query1 = 'With a dollar sign $abc. or $123"'
print(re.findall(r'\$([^".]+)[."]', query1))
Output
['abc', '123']
Note: as the negated character class can also match newlines, you can exclude that using:
\$([^".\n]+)[."]
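To make the difference between the three patterns concrete, here is a small sketch that runs them side by side. The two-line sample text and the variable names are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample: the first $-token is followed by a newline before any dot.
text = 'price $12\nnext line. and $name" end'

# Lazy dot: '.' does not cross the newline, so "$12" never reaches a delimiter.
lazy = re.findall(r'\$(.+?)[."]', text)

# Negated class: it also matches the newline, so the first match spans two lines.
negated = re.findall(r'\$([^".]+)[."]', text)

# Negated class that also excludes the newline: matches stay on one line.
single_line = re.findall(r'\$([^".\n]+)[."]', text)

print(lazy)         # ['name']
print(negated)      # ['12\nnext line', 'name']
print(single_line)  # ['name']
```

Which variant is appropriate depends on whether a match is allowed to continue onto the next line.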

Related

Remove newlines from a regex matched string

I have a string as below:
Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?Smoking status: Never Smoker\n?
I want to first match the substring/sentence of interest (i.e. the sentence beginning with "Food insecurity" and ending with "\n?") and then remove all the newlines in this sentence apart from the last one, i.e. the one before the question mark.
I have been able to match the sentence without its last newline and question mark with the regex (Food insecurity:).*?(?=\\n\?), but I struggle to remove the first two newlines of the matched sentence and return the whole preprocessed string. Any advice?
You could use re.sub with a callback function:
import re
inp = "Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?Smoking status: Never Smoker\n?"
output = re.sub(r'Food insecurity:\nWorry: No\nInability: No(?=\n\?)', lambda m: m.group().replace('\n', ''), inp)
print(output)
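The pattern in the answer spells out the whole sentence literally. As a possible generalization, the sketch below matches from a heading up to the next "\n?" marker without hard-coding the lines in between. The helper name flatten_section and its parameters are hypothetical, not part of the original answer.

```python
import re

inp = ("Financial strain: No\n?Food insecurity:\nWorry: No\nInability: No\n?"
       "Transportation needs:\nMedical: No\nNon-medical: No\nTobacco Use\n?"
       "Smoking status: Never Smoker\n?")

def flatten_section(text, heading):
    # Match from the heading up to (but not including) the next "\n?" marker.
    pattern = re.escape(heading) + r".*?(?=\n\?)"
    # re.DOTALL lets '.' cross the inner newlines; the lookahead leaves the
    # final newline before '?' untouched.
    return re.sub(pattern, lambda m: m.group().replace("\n", ""),
                  text, flags=re.DOTALL)

output = flatten_section(inp, "Food insecurity:")
print(output)
```

As in the answer, the newlines are replaced with an empty string, so the joined lines run together; pass a space to replace() instead if a separator is wanted.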

Extract names from string

I have a string as:
s = """(2021-07-29 01:00:00 AM BST)
---
peter.j.matthew has joined the conversation
(2021-07-29 01:00:00 AM BST)
---
john cheung has joined the conversation
(2021-07-29 01:11:19 AM BST)
---
allen.p.jonas
Hi, james
(2021-07-30 12:51:16 AM BST)
---
karren wenda
how're you ?
---
* * *"""
I want to extract the names as:
names_list= ['allen.p.jonas','karren wenda']
what I have tried:
names_list=re.findall(r'--- [\S\n](\D+ [\S\n])',s)
This answer assumes that you want to find names on lines that do not end with the text has joined the conversation:
names = re.findall(r'\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---[^\S\r\n]*\r?\n((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n', s)
print(names) # ['allen.p.jonas', 'karren wenda']
The salient portion of the regex is this:
((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n
This captures a name without matching has joined the conversation by using a tempered dot trick. It matches one character at a time on the line containing the name, making sure that the conversation text does not appear anywhere, until reaching the CR?LF at the end of the line.
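To isolate the tempered dot from the rest of the pattern, here is a minimal sketch on a made-up three-line sample. The sample lines and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample: one "joined" line and two plain name lines.
lines = "alice has joined the conversation\nbob\ncarol smith\n"

# (?:(?!X).)+? consumes one character at a time, but only where X does not
# start at the current position, so the capture can never run into X.
pattern = re.compile(r'^((?:(?!\bhas joined the conversation).)+?)[ ]*$', re.M)

names = pattern.findall(lines)
print(names)  # ['bob', 'carol smith']
```

On the first line the capture gets stuck in front of "has joined the conversation" and can never reach the end of the line, so that line produces no match.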
If you only want to match ['allen.p.jonas','karren wenda'], you can require a non-whitespace character at the start of the line that follows the name:
^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S
The pattern matches:
^ Start of a line (with re.M)
--- Match ---
[^\S\n]*\n Match optional spaces and a newline
(\S.*?) Capture group 1 (returned by re.findall): match a non-whitespace char followed by as few chars as possible
[^\S\r\n]* Match optional whitespace chars without a newline
\n\S Match a newline and a non whitespace char
For example
print(re.findall(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", s, re.M))
Output
['allen.p.jonas', 'karren wenda']
To explicitly exclude lines that contain has joined the conversation you can use a negative lookahead:
^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$
For example:
print(re.findall(r"^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$", s, re.M))
Output
['allen.p.jonas', 'karren wenda']
Supposing you want to match names that are not followed by "has joined the conversation":
name_pattern = re.compile(r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n', re.MULTILINE)
print(re.findall(name_pattern, s))
Explanation:
---\s*\n matches the dashes possibly followed by whitespaces and a required new line
Then comes our matching group composed of:
\w starts with a 'word' character (a-z, A-Z, 0-9 or _)
(?:[\w\. ](?!has joined the conversation))*? a non-capturing group that repeats \w, . or a space, each of which must not be followed by "has joined the conversation". The capture goes on until the optional trailing whitespace and the newline at the end of the line. (*? makes the expression lazy instead of greedy)
Output:
['allen.p.jonas', 'karren wenda']

How to extract the exact word from a string while reducing false positive discovery

I want to extract the exact word from a string. My code causes false discovery by considering the search item as a substring. Here is the code:
import re
text="Hello I am not react-dom"
item_search=['react', 'react-dom']
Found_item=[]
for i in range(0, len(item_search)):
    Q=re.findall(r'\b%s\b'%item_search[i], text, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    Found_item.append(Q)
print(Found_item)
The output is: [['react'], ['react-dom']]. In the result, I don't want to see react as an item.
The expected Output is: [[''], ['react-dom']]
\b marks the boundary between a word character and a non-word character, e.g. between a word and punctuation. So there is a \b between the t of react and the -, which is why react matches inside react-dom. Since we need whole words here, we use a lookbehind and a lookahead to ensure that there is no non-space character on either side of the match (this is not the same as requiring a space on both sides). Thus you could use:
re.findall(rf"(?<!\S)({'|'.join(item_search)})(?!\S)", text)
['react-dom']
Edit:
If the sentence can contain other non-word characters such as periods (see @DYZ's comment), you could use:
(?<!\S)({'|'.join(item_search)})\W*(?!\S)
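The patterns above interpolate the search items directly into the regex. A slightly more defensive sketch is shown below: it escapes each item with re.escape and tries longer items first. The function name find_whole_items and the extended sample sentence are assumptions for illustration, not part of the original answer.

```python
import re

item_search = ["react", "react-dom"]

def find_whole_items(items, text):
    # Escape each item so characters such as '-' or '.' are taken literally,
    # and try longer items first so 'react-dom' is preferred over 'react'.
    ordered = sorted(items, key=len, reverse=True)
    alternation = "|".join(re.escape(item) for item in ordered)
    # (?<!\S) and (?!\S): no non-space character directly before or after;
    # \W* tolerates trailing punctuation such as '.' or ','.
    pattern = re.compile(rf"(?<!\S)({alternation})\W*(?!\S)", re.IGNORECASE)
    return pattern.findall(text)

print(find_whole_items(item_search, "Hello I am not react-dom"))
# ['react-dom']
print(find_whole_items(item_search, "Hello I am not react-dom. Use react, please."))
# ['react-dom', 'react']
```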

How to find maximal matching pattern in data in Python

I have a format in a file called file.txt which has lines like:
babies:n:baby
flies:n:fly
ladies:n:lady
sheep:n:sheep
furniture:n:furniture
luggages:n:luggage
etc.
Now I need to extract only the common pattern between the first and third fields (f1 and f3) and write it in the format below.
Example: babies
Here, babies and baby share the common pattern 'bab', and 'ies' is the added ending; the same applies to the following words.
Format:<e lm="babies"><i>bab</i><par n="bab"/></e>
Your question is not entirely clear; it would help if you could explain more.
That said, I think you want to use a regex (regular expression).
Here is a nice website for experimenting with regexes: https://regex101.com/
In Python you can use the re module (import re).
If you have a string like "babies:n:baby", you can extract the common prefix with the regex: (\w+).*:n:(\1).*
which means:
(\w+) - capture a sequence of word characters (backtracking shortens it until the rest of the pattern matches)
.*:n: - skip the rest of the first word and then find :n:
(\1) - and then the same text that was captured by the first ()
Python sample:
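For one search:
import re
pattern = r"(\w+).*:n:(\1).*"
word = "babies:n:baby"
result = re.search(pattern, word)
print(result.group())   # babies:n:baby (the whole match)
print(result.group(1))  # bab (the common prefix)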
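And for many searches:
import re
pattern = r"(\w+).*:n:(\1).*"
text = "babies:n:baby\nflies:n:fly\nladies:n:lady"
result = re.findall(pattern, text)
print(result)  # [('bab', 'bab'), ('fl', 'fl'), ('lad', 'lad')]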
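If the goal is only the shared prefix of the first and third fields, a regex is not strictly needed. The sketch below uses os.path.commonprefix, which compares strings character by character, and fills in the output format quoted in the question. The function name common_prefix_entry is hypothetical, and using the prefix for the par n attribute is an assumption based on the single example given.

```python
import os.path

def common_prefix_entry(line):
    # Split "babies:n:baby" into surface form, tag and lemma.
    surface, _tag, lemma = line.strip().split(":")
    # commonprefix works on any list of strings, not only on paths.
    prefix = os.path.commonprefix([surface, lemma])
    # Assumption: the 'n' attribute of <par> repeats the prefix, as in the
    # one example from the question.
    return f'<e lm="{surface}"><i>{prefix}</i><par n="{prefix}"/></e>'

entry = common_prefix_entry("babies:n:baby")
print(entry)  # <e lm="babies"><i>bab</i><par n="bab"/></e>
```

To process the whole file, the function could be applied to each line read from file.txt.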

regex - Making all letters in a text lowercase using re.sub in python but exclude specific string?

I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I currently have is shown below. However, it does not produce the desired output. For instance, when I input "#Activities: SBR", the output is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re
def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is
"#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
is doing negated character class matching, meaning it matches every single character that is not listed inside the brackets and replaces each one with line.lower() (which is "#activities: sbr").
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
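The behaviour described above can be reproduced with a few lines. This is only a sketch for inspection; the variable names matched and result are chosen here for illustration.

```python
import re

PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
line = "#Activities: SBR"

# Inside [...] the words and the | are just a set of single characters,
# so only ':' and ' ' fall outside the set.
matched = re.findall(PATTERN, line)
print(matched)  # [':', ' ']

# Each of those two characters is replaced by the whole lowercased line.
result = re.sub(PATTERN, line.lower(), line)
print(result)   # #Activities#activities: sbr#activities: sbrSBR
```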
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply substitution on the other words on that same line. Instead, you can split the line first into individual words, then apply re.sub on it using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    # The \b after the group keeps "I" from also excluding words such as "IS".
    PATTERN = r"^(?!(?:TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment)\b).*$"
    line = line.strip()
    words = re.split(r"\s+", line)
    result = []
    for word in words:
        # Only words that pass the negative lookahead are matched and lowercased.
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
re.sub only matches words that are not excluded by the lookahead in PATTERN and replaces each of them with its lowercase form. If a word is excluded, the pattern does not match and re.sub returns the word unchanged.
Each word is stored in a list, and the list is joined at the end to rebuild the line.
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
Output:
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, apparently there is a tab between : and the next character. Since the code splits the string on \s+, the tab can't be preserved, but it can be restored by replacing ": " with ":\t" in the final result.
return " ".join(result).replace(": ", ":\t")
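An alternative that avoids splitting and rejoining, and therefore keeps tabs and repeated spaces exactly as they were, is to let re.sub visit each run of non-whitespace characters with a callback. This is a sketch under assumptions: the names KEEP and lowercase_except are hypothetical, and stripping trailing punctuation before the comparison is a design choice made here, not something stated in the question.

```python
import re

# Strings that must keep their original case (taken from the question).
KEEP = {"TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants",
        "#Media", "#Transcriber", "#Activities", "SBR", "#Comment"}

def lowercase_except(line, keep=KEEP):
    def fix(match):
        token = match.group()
        # Ignore trailing punctuation such as ':' when comparing.
        core = token.rstrip(":.,;!?")
        return token if core in keep else token.lower()
    # \S+ matches each token; the whitespace between tokens is never touched.
    return re.sub(r"\S+", fix, line)

print(lowercase_except("#Activities:\tSOME OTHER TEXT SBR"))
print(lowercase_except("I HOPe thIS heLPS."))  # I hope this helps.
```

Because the comparison is against whole tokens, a word like "IS" is lowercased even though it starts with the excluded string "I".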

Resources