string = "My QUIZZING codes is GREATLY bad so quizzing number is the integer 94.4; I don't like any other BuzzcuT except 1.\n"
From this string of gibberish, I want to pull out the words QUIZZING, GREATLY and BuzzcuT, leaving their capitalization (or lack thereof) as is.
caps = re.findall(r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)', string)
print(caps)
This code results in ['QUIZZING', 'GREATLY'], but I am hoping to get ['QUIZZING', 'GREATLY', 'BuzzcuT'].
Although it's gibberish, the point is the various alphanumeric combinations that make it a challenge.
The regex below finds all 3 words in your example string.
import re
string = "My QUIZZING codes is GREATLY bad so quizzing number is the integer 94.4; I don't like any other BuzzcuT except 1.\n"
# The regex contains 2 patterns
# \b[A-Z]{3,}\S*\b -- will match QUIZZING and GREATLY
# \b[A-Z]{1}[a-z]\S*[A-Z]\b -- will match BuzzcuT
#
# You could use a single pattern -- [A-Z]{1,}\S*[A-Z]
# to match all 3 words
#
word_pattern = re.compile(r'\b[A-Z]{3,}\S*\b|\b[A-Z]{1}[a-z]\S*[A-Z]\b')
find_words = re.findall(word_pattern, string)
if find_words:
    print(find_words)
# output
['QUIZZING', 'GREATLY', 'BuzzcuT']
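The single-pattern alternative mentioned in the comments can be checked the same way:

```python
import re

string = "My QUIZZING codes is GREATLY bad so quizzing number is the integer 94.4; I don't like any other BuzzcuT except 1.\n"

# one or more uppercase letters, then any non-space run that must
# end in an uppercase letter
single = re.findall(r'[A-Z]{1,}\S*[A-Z]', string)
print(single)  # ['QUIZZING', 'GREATLY', 'BuzzcuT']
```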
I have a string, and I have to get only the digits from it.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting only an empty list [].
You can use regex to get only the digits in a list:
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
url = "www.mylocalurl.com/edit/1987"
num = re.sub(r'\D', '', url)  # '1987'
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do instead is use a capture group in a regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, here is what the code is basically saying: using the regex defined as r'(\d+)', which means "capture any run of digits", search through the url. Then captured holds the first captured group, which is 1987.
If you don't want to use this, then you can use your .split() method, but this time provide '/' as the separator, e.g. `url.split('/')`.
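A minimal sketch of the split approach, assuming the id is always the final path component:

```python
url = "www.mylocalurl.com/edit/1987"

# split on '/' and take the last segment -- this assumes the id is
# always the final path component
last_segment = url.split('/')[-1]
print(last_segment)  # 1987
```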
I have a list of websites which unfortunately look like "rs--google.com--plain". How do I remove 'rs--' and '--plain' from the url? I tried strip() but it didn't remove anything.
The way to remove "rs--" and "--plain" from that url (which is a string most likely) is to use some basic regex on it:
import re
url = 'rs--google.com--plain'
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
print(cleaned_url)
Which prints out:
google.com
What is done here is use re's search function to check if anything exists between "rs--" and "--plain"; if it does, it is matched into group 1. We then retrieve group 1 with .group(1) and set our entire "cleaned url" to it:
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
And now we have only "google.com" in our cleaned_url.
This assumes "rs--" and "--plain" are always in the url.
Updated to handle any letters on either side of --:
import re
url = 'po--google.com--plain'
cleaned_url = re.search('[A-Za-z]+--(.*)--[A-Za-z]+', url).group(1)
print(cleaned_url)
This will handle anything that has letters before the first -- and after the second --, and extract only the url in the middle, regardless of how many letters are on either side.
A great resource for working on regex is regex101
You can use the replace function in python.
>>> val = "rs--google.com--plain"
>>> newval =val.replace("rs--","").replace("--plain","")
>>> newval
'google.com'
I'm starting with Lark and got stuck on an issue with parsing special characters.
I have expressions given by a grammar. For example, these are valid expressions: Car{_}, Apple3{3+}, Dog{a_7}, r2d2{A3*}, A{+}... More formally, they have form: name{feature} where
name: CNAME
feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+
The definition of constants can be found here.
The problem is that the special characters are not present in the produced tree (see example below). I have seen this answer, but it did not help me. I tried placing ! before the special characters and escaping them. I also enabled keep_all_tokens, but this is not desired because then the characters { and } are also present in the tree. Any ideas how to solve this problem? Thank you.
from lark import Lark
grammar = r"""
start: object
object : name "{" feature "}" | name
feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+
name: CNAME
%import common.LETTER
%import common.DIGIT
%import common.CNAME
%import common.WS
%ignore WS
"""
parser = Lark(grammar, parser='lalr',
              lexer='standard',
              propagate_positions=False,
              maybe_placeholders=False
              )
def test():
    test_str = '''
    Apple_3{3+}
    '''
    j = parser.parse(test_str)
    print(j.pretty())

if __name__ == '__main__':
    test()
The output looks like this:
start
  object
    name	Apple_3
    feature	3
instead of
start
  object
    name	Apple_3
    feature
      3
      +
You said you tried placing ! before special characters. As I understand the question you linked, the ! has to be placed before the rule:
!feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+
This produces your expected result for me:
start
  object
    name	Apple_3
    feature
      3
      +
I want to extract sentences that containing a drug and gene name from 10,000 articles.
and my code is
import re
import glob
import fnmatch
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
flist = glob.glob("C:/Users/Emma Belladona/Desktop/drug working/*.txt")
print(flist)

for txt in flist:
    #print (txt)
    fr = open(txt, "r")
    tmp = fr.read().strip()
    a = (sent_tokenize(tmp))
    b = (word_tokenize(tmp))
    for c, value in enumerate(a, 1):
        if value.find("SLC22A1") != -1 and value.find("Metformin"):
            print("Result", value)
            re.findall("\w+\s?[gene]+", a)
        else:
            if value.find("Metformin") != -1 and value.find("SLC22A1"):
                print("Results", value)
            if value.find("SLC29B2") != -1 and value.find("Metformin"):
                print("Result", value)
I want to extract sentences that have a gene and drug name from the whole body of the article. For example "Metformin decreased logarithmically converted SLC22A1 excretion (from 1.58±0.47 to 1.00±0.52, p=0.001)." "In conclusion, we could not demonstrate striking associations of the studied polymorphisms of SLC22A1, ACE, AGTR1, and ADD1 with antidiabetic responses to metformin in this well-controlled study."
This code returns a lot of sentences, i.e. if just one of the above words appears in a sentence, it gets printed out. Help me fix the code for this.
You don't show your real code, but the code you have now has at least one mistake that would lead to lots of spurious output. It's on this line:
re.findall("\w+\s?[gene]+", a)
This regexp does not match strings containing gene, as you clearly intended. It matches (almost) any string containing one of the letters g, e or n.
This cannot be your real code, since a is a list and you would get an error on this line-- plus you ignore the results of the findall()! Sort out your question so it reflects reality. If your problem is still not solved, edit your question and include at least one sentence that is part of the output but you do NOT want to be seeing.
When you do this:
if value.find("SLC22A1") != -1 and value.find("Metformin"):
You're testing for "SLC22A1" being in the string and "Metformin" not being at the start of the string (the second part is probably not what you want, since find returns 0, which is falsy, when the substring is at position 0).
You probably wanted this:
if value.find("SLC22A1") != -1 and value.find("Metformin") != -1:
This find method is error-prone because of its return value, and since you don't care about the position, you'd be better off using in.
To test for 2 words in a sentence (possibly case-insensitive for the 2nd occurrence) do like this:
if "SLC22A1" in value and "metformin" in value.lower():
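A quick sketch with a made-up sentence (hypothetical data), just to illustrate the membership test:

```python
# hypothetical sample sentence
value = "Metformin decreased SLC22A1 excretion in this study."

# 'in' instead of the error-prone find(); lower() makes the
# drug-name check case-insensitive
found = "SLC22A1" in value and "metformin" in value.lower()
print(found)  # True
```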
I'd take a different approach:
Read in the text file
Split the text file into sentences. Check out https://stackoverflow.com/a/28093215/223543 for a hand-rolled approach to do this. Or you could use the nltk.tokenize.punkt module. (Edited after Alexis pointed me in the right direction in the comments below.)
Check if I find your key terms in each sentence and print if I do.
As long as your text files are well formatted, this should work.
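A minimal sketch of that approach, using a naive regex sentence splitter instead of nltk; the text here is a placeholder for what you would read from each article file:

```python
import re

# placeholder text; in practice this comes from each article file
text = ("Metformin decreased SLC22A1 excretion. "
        "This sentence mentions neither term.")

# naive hand-rolled split: break on ., ! or ? followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# keep only sentences containing both the gene and the drug name
hits = [s for s in sentences
        if "SLC22A1" in s and "metformin" in s.lower()]
print(hits)
```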
I am trying to create a function which will take 2 parameters. A word with wildcards in it like "*arn*val" and a file name containing a dictionary. It returns a list of all words that match the word like ["carnival"].
My code works fine for anything with only one "*" in it, however any more and I'm stumped as to how to do it.
Just searching for the wildcard string in the file was returning nothing.
Here is my code:
dictionary_file = open(dictionary_filename, 'r')
dictionary = dictionary_file.read()
dictionary_file.close()
dictionary = dictionary.split()
alphabet = ["a","b","c","d","e","f","g","h","i",
            "j","k","l","m","n","o","p","q","r",
            "s","t","u","v","w","x","y","z"]
new_list = []
for letter in alphabet:
    if wildcard.replace("*", letter) in dictionary:
        new_list += [wildcard.replace("*", letter)]
return new_list
The parameters: first is the wildcard string (wildcard), and second is the dictionary file name (dictionary_filename).
Most answers on this site were about Regex, which I have no knowledge of.
Your particular error is that .replace replaces all occurrences, e.g. "*arn*val" -> "carncval" or "iarnival", but you want different letters in each position here. You could use a second nested loop over the alphabet (or use itertools.product() to generate all possible letter pairs) to fix it, but a simpler way is to use regular expressions:
import re
# each `*` corresponds to an ascii lowercase letter
pattern = re.escape(wildcard).replace("\\*", "[a-z]")
matches = list(filter(re.compile(pattern+"$").match, known_words))
Note: it doesn't support escaping * in the wildcard.
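For example, with a small in-memory word list as a stand-in for the dictionary file:

```python
import re

# stand-in for words loaded from the dictionary file
known_words = ["carnival", "cardinal", "arrival"]
wildcard = "*arn*val"

# escape regex metacharacters, then turn each escaped * into a
# single-lowercase-letter character class
pattern = re.escape(wildcard).replace("\\*", "[a-z]")
matches = list(filter(re.compile(pattern + "$").match, known_words))
print(matches)  # ['carnival']
```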
If input wildcards are file patterns then you could use fnmatch module to filter words:
import fnmatch
matches = fnmatch.filter(known_words, wildcard)
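A quick check with the same stand-in word list; note that fnmatch's * matches any run of characters, not just a single letter:

```python
import fnmatch

# stand-in for words loaded from the dictionary file
known_words = ["carnival", "cardinal", "arrival"]

# fnmatch's * matches any run of characters, including the empty run
matches = fnmatch.filter(known_words, "*arn*val")
print(matches)  # ['carnival']
```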