I want to extract sentences that contain a drug and gene name from 10,000 articles - python-3.x

I want to extract sentences that contain a drug and a gene name from 10,000 articles, and my code is:
import re
import glob
import fnmatch
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

flist = glob.glob("C:/Users/Emma Belladona/Desktop/drug working/*.txt")
print(flist)
for txt in flist:
    #print (txt)
    fr = open(txt, "r")
    tmp = fr.read().strip()
    a = (sent_tokenize(tmp))
    b = (word_tokenize(tmp))
    for c, value in enumerate(a, 1):
        if value.find("SLC22A1") != -1 and value.find("Metformin"):
            print("Result", value)
            re.findall("\w+\s?[gene]+", a)
        else:
            if value.find("Metformin") != -1 and value.find("SLC22A1"):
                print("Results", value)
            if value.find("SLC29B2") != -1 and value.find("Metformin"):
                print("Result", value)
I want to extract sentences that have a gene and a drug name from the whole body of an article. For example: "Metformin decreased logarithmically converted SLC22A1 excretion (from 1.58±0.47 to 1.00±0.52, p=0.001)." "In conclusion, we could not demonstrate striking associations of the studied polymorphisms of SLC22A1, ACE, AGTR1, and ADD1 with antidiabetic responses to metformin in this well-controlled study."
This code returns a lot of sentences, i.e. if just one of the above words appears in a sentence it gets printed out...!
Help me fix the code for this.

You don't show your real code, but the code you have now has at least one mistake that would lead to lots of spurious output. It's on this line:
re.findall("\w+\s?[gene]+", a)
This regexp does not match strings containing gene, as you clearly intended. It matches (almost) any string that contains one of the letters g, e or n.
This cannot be your real code, since a is a list and you would get an error on this line, plus you ignore the result of the findall()! Sort out your question so it reflects reality. If your problem is still not solved, edit your question and include at least one sentence that is part of the output but that you do NOT want to be seeing.
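To see the difference, here is a small illustrative comparison (the sample sentence is made up, not taken from the articles):
import re

sample = "The SLC22A1 gene and Metformin appeared in the same sentence."
# [gene]+ only matches runs of the letters g, e and n, so the pattern pulls
# fragments out of almost every word:
print(re.findall(r"\w+\s?[gene]+", sample))
# to match the literal word "gene", spell it out instead:
print(re.findall(r"\bgene\b", sample, flags=re.IGNORECASE))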

When you do this:
if value.find("SLC22A1") != -1 and value.find("Metformin"):
You're testing for "SLC22A1" in the string and for "Metformin" not being at the start of the string (the second part is probably not what you want).
You probably wanted this:
if value.find("SLC22A1") != -1 and value.find("Metformin") != -1:
The find method is error-prone due to its return value, and since you don't care about the position, you'd be better off with in.
To test for the 2 words in a sentence (case-insensitively for the 2nd one), do this:
if "SLC22A1" in value and "metformin" in value.lower():

I'd take a different approach:
Read in the text file
Split the text file into sentences. Check out https://stackoverflow.com/a/28093215/223543 for a hand-rolled approach to do this. Or you could use the nltk.tokenize.punkt module. (Edited after Alexis pointed me in the right direction in the comments below.)
Check if I find your key terms in each sentence and print if I do.
As long as your text files are well formatted, this should work; a sketch of the idea follows.
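A minimal sketch of that approach with nltk; the folder path and the search terms are just placeholders copied from the question, and the punkt data must be downloaded once (nltk.download('punkt')):
import glob
from nltk.tokenize import sent_tokenize

DRUG = "metformin"
GENES = ["slc22a1", "slc29b2"]

for path in glob.glob("C:/Users/Emma Belladona/Desktop/drug working/*.txt"):
    with open(path, "r") as fh:
        text = fh.read().strip()
    for sentence in sent_tokenize(text):
        low = sentence.lower()
        # keep only sentences that mention the drug AND at least one gene
        if DRUG in low and any(gene in low for gene in GENES):
            print(path, "->", sentence)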

Related

Count up misspelled words in a sentence of variable size

As part of a large project, I need a function that will check for any misspelt words in a sentence; the sentence can be one word, 30 words, or really any size.
It needs to be fast. If possible I would like to use TextBlob or pyspellchecker, as python_language_tool has problems installing on my computer.
My code so far (non-working):
def spell2():
    from textblob import TextBlob
    count = 0
    sentence = "Tish soulhd al be corrrectt"
    split_sen = sentence.split(" ")
    for thing in split_sen:
        thing = Word(thing)
        thing.spellcheck()
        # if thing is not spelt correctly add to count, if it is go to
        # next word
spell2()
this gives me this error:
thing = Word(thing)
NameError: name 'Word' is not defined
Any suggestions appreciated:)
def spell3():
    from spellchecker import SpellChecker
    s = "Tish soulhd al be corrrectt, riiiigghtttt?"
    wordlist = s.split()
    spell = SpellChecker()
    amount_miss = len(list(spell.unknown(wordlist)))
    print("Possible amount of misspelled words in the text:", amount_miss)
spell3()
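For comparison, the TextBlob route from the question also works once Word is imported from textblob; a rough sketch, which counts a word as misspelt when the top spellcheck() suggestion differs from it (that rule is an approximation, not TextBlob's only option):
from textblob import Word

def spell2(sentence="Tish soulhd al be corrrectt"):
    count = 0
    for token in sentence.split():
        word = Word(token.strip(",.?!").lower())
        # spellcheck() returns (suggestion, confidence) pairs, best first
        best, _confidence = word.spellcheck()[0]
        if best != word:
            count += 1
    print("Possible amount of misspelled words in the text:", count)

spell2()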

How to filter only text in a line?

I have many lines like these:
_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û
I want to get something like this:
Immediate Transformation With Vee
The Real Pernell Stacks
I tried this:
for t in test:
    t.isalpha()
but characters like Ó count as well.
I also thought I could create a list of English words, spaces and punctuation marks and delete every element of the line that is not in this list, but I don't think that is the right option, since the line can contain non-English words as well and that's fine.
Using Regex.
Ex:
import re
data = """_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û"""
for line in data.splitlines(keepends=False):
    print(re.sub(r"[^A-Za-z\s]", "", line))
Output:
Immediate Transformation With Vee
The Real Pernell Stacks
Or use re.split (filtering out the empty strings that consecutive non-letters produce):
result = ' '.join(filter(None, re.split(r'[^A-Za-z]', s)))
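If the lines may also contain digits, punctuation or other characters you want to keep (the question notes that non-English words are fine), one option is simply to widen the character class; which characters to keep is a judgement call, so treat this as a sketch:
import re

line = "_ÙÓ´Immediate Transformation With Vee_ÙÓ´"
# keep letters, digits, whitespace and a few punctuation marks;
# extend the class with whatever else should survive the cleanup
print(re.sub(r"[^A-Za-z0-9\s.,!?'-]", "", line).strip())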

How to remove/delete characters from the end of a string that match the end of another string

I have thousands of strings (not in English) that are in this format:
['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
I want to return the following:
['MyWordMyWordSuffix', 'SameVocabularyItem']
Because strings are immutable and I want to start the matching from the end, I keep confusing myself about how to approach it.
My best guess is some kind of loop that starts from the end of the strings and keeps checking for a match.
However, since I have so many of these to process, it seems like there should be a built-in way that is faster than looping through all the characters, but as I'm still learning Python I don't know of one (yet).
The nearest example I could find already on SO can be found here but it isn't really what I'm looking for.
Thank you for helping me!
You can use commonprefix from os.path to find the common suffix between them:
from os.path import commonprefix
def getCommonSuffix(words):
    # get common suffix by reversing both words and finding the common prefix
    prefix = commonprefix([word[::-1] for word in words])
    return prefix[::-1]
which you can then use to slice out the suffix from the second string of the list:
word_list = ['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
suffix = getCommonSuffix(word_list)
if suffix:
    print("Found common suffix:", suffix)
    # filter out suffix from second word in the list
    word_list[1] = word_list[1][0:-len(suffix)]
    print("Filtered word list:", word_list)
else:
    print("No common suffix found")
Output:
Found common suffix: MyWordSuffix
Filtered word list: ['MyWordMyWordSuffix', 'SameVocabularyItem']
Demo: https://repl.it/#glhr/55705902-common-suffix
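On Python 3.9 or later, str.removesuffix offers an alternative to the manual slicing; it only strips the suffix if it is actually there and leaves the string untouched when the suffix is empty, so the if-check becomes optional:
from os.path import commonprefix

word_list = ['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
# the common suffix is the reversed common prefix of the reversed words
suffix = commonprefix([word[::-1] for word in word_list])[::-1]

word_list[1] = word_list[1].removesuffix(suffix)
print(word_list)  # ['MyWordMyWordSuffix', 'SameVocabularyItem']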

How to decode a text file by extracting alphabet characters and listing them into a message?

So we were given an assignment to create code that would sort through a long message filled with special characters (i.e. [, {, %, $, *) with only a few alphabet characters throughout the entire thing, to make a special message.
I've been searching on this site for a while and haven't found anything specific enough that would work.
I put the text file into a pastebin if you want to see it
https://pastebin.com/48BTWB3B
Anywho, this is what I've come up with for code so far
code = open('code.txt', 'r')
lettersList = code.readlines()
lettersList.sort()
for letters in lettersList:
    print(letters)
It prints the code.txt out but into short lists, essentially cutting it into smaller pieces. I want it to find and sort out the alphabet characters into a list and print the decoded message.
This is something you can do pretty easily with regex.
import re
with open('code.txt', 'r') as filehandle:
    contents = filehandle.read()
letters = re.findall("[a-zA-Z]+", contents)
If you want to condense the list into a single string, you can use a join:
single_str = ''.join(letters)
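An equivalent without regex, assuming the same code.txt file: keep only alphabetic characters and join them straight into the decoded message:
with open('code.txt', 'r') as filehandle:
    contents = filehandle.read()

# isalpha() is True only for letters, so every special character is dropped
decoded = ''.join(ch for ch in contents if ch.isalpha())
print(decoded)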

Dictionary within a list - how to find values

I have a dictionary of around 4000 Latin words and their English meanings. I've opened it within Python and added it to a list and it looks something like this:
[{"rex":"king"},{"ego":"I"},{"a, ab":"away from"}...]
I want the user to be able to input the Latin word they're looking for and then the program prints out the English meaning. Any ideas as to how I do this?
You shouldn't put them in a list. You can use one main dictionary to hold all the items, then simply access the relevant meanings by indexing.
If you get the tiny dictionaries from an iterable object, you can create your main_dict like the following:
main_dict = {}
for dictionary in iterable_of_dict:
    main_dict.update(dictionary)

word = None
while word != "exit":
    word = input("Please enter your word (exit for exit): ")
    print(main_dict.get(word, "Sorry your word doesn't exist in dictionary!"))
Well you could just cycle through your 4000 items...
listr = [{"rex": "king"}, {"ego": "I"}, {"a, ab": "away from"}]
out = "no match found"
get = input("input please ")
for c in range(0, len(listr)):
    try:
        out = listr[c][get]
        break
    except:
        pass
print(out)
Maybe you should split the existing list into multiple lists, ordered alphabetically, to make the search shorter.
Also, if the entry does not exactly match the Latin word in the dictionary, it won't find anything.
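A rough sketch of that alphabetical grouping idea, assuming the same list-of-dicts layout as the question (bucketing by the first letter of the Latin key is only an illustration):
listr = [{"rex": "king"}, {"ego": "I"}, {"a, ab": "away from"}]

# group entries by the first letter of the Latin key so a lookup only has
# to scan one small bucket instead of all 4000 entries
buckets = {}
for entry in listr:
    for latin in entry:
        buckets.setdefault(latin[0].lower(), []).append(entry)

get = input("input please ")
out = "no match found"
for entry in buckets.get(get[:1].lower(), []):
    if get in entry:
        out = entry[get]
        break
print(out)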
