Right now, I have amino acid string.
The amino acid mutation column looks like this A59M, T133G, K2*, G1927? and ? only.
So, I tried to use re to separate one column into three columns and remove those ? only but keep G1297?.
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w+)(\d+)(\S+)$',AA_mut)
But, I got
(A5,9,M; T13,3,M;....)
Please give me some advise.
Thanks
\w matches letters and digits in perl. It looks to me like it's doing the same thing in python.
You might try being more explicit. Is that a single, capital letter on the front? If so maybe you want something like
^([A-Z])(\d+)(\D+)$
In perl:
print join ("<>", m/^([A-Z])(\d+)(\D+)$/) while <DATA>;
__DATA__
A59M
T133G
K2*
G1927?
?
prints
A<>59<>M
T<>133<>G
K<>2<>*
G<>1927<>?
Assuming you have:
data = ["A59M", "T133G", "K2*", "G1927?", "?"]
You can extract it using:
out = [(s[0], s[1:-1], s[-1]) for s in data if len(s) > 2]
This gives me:
out == [('A', '59', 'M'), ('T', '133', 'G'),
('K', '2', '*'), ('G', '1927', '?')]
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w)(\d+)(\S+)$',AA_mut)
I use this one to solve my problem. The original \w+ leaves one digit for \d+ and one alphabet for \S+. Once I removed the "+". It takes only first alphabet and leaves other parts.
Related
I have a string containing thousands of lines of this data without line break (only a few lines shown for readability with line break)
5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital
7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital
Format is
(entry number)(district)(patient number)(age)(gender)(case of)(symptoms)(comorbidity)(date of death)(place of death)
without spaces, or brackets.
Problem : The data i want to collect is age.
However i cant seem to find a way to single out the age since its clouded by a lot of other numbers in the data. I have tried various iterations of count, limiting it to 1 to 99, separating the data etc, and failed.
My Idea : Since the gender is always either 'M'/'F', and the two numbers before the gender is the age. Isolating the two numbers before the gender seems like an ideal solution.
xxM
xxF
My Goal : I would like to collect all the xx numbers irrespective of gender and store them in a list. How do i go about this?
import re
input_str = '5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'
ages = [found[-3:-1] for found in re.findall('[0-9]+[M,F]', input_str, re.I)]
print(ages)
# ['62', '65']
This works fine with the sample but if there are districts starting with 'M/F' then entry number will be collected as well.
A workaround is to match exactly seven digits (if the patient number is always 5 digits and and the age is generally 2 digits).
ages = [found[-3:-1] for found in re.findall(r'\d{7}[M,F]', input_str, re.I)]
With the structure you gave I've built a dict of reg expressions to match components. Then put this back into a dict
There are ways I can imagine this will not work
if age < 10, only 1 digit so you will pick up a digit of patient number
there maybe strings that don't match the re expressions which will mean odd results
It's the most structured way I can think to go....
import re
data = "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
md = {
"entrynum": "([0-9]+)",
"district": "([A-Z,a-z]+)",
"patnum_age": "([0-9]+)",
"sex": "([M,F])",
"remainder": "(.*)$"
}
data_dict = {list(md.keys())[i]:tk
for i, tk in
enumerate([tk for tk in re.split("".join(md.values()), data) if tk!=""])
}
print(f"Assumed age:{data_dict['patnum_age'][-2:]}\nparsed:{data_dict}\n")
output
Assumed age:62
parsed:{'entrynum': '5', 'district': 'BengaluruUrban', 'patnum_age': '4598962', 'sex': 'M', 'remainder': 'SARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'}
I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code that should strip each tweet of its unwanted characters.
from nltk.corpus import stopwords
import nltk
import json
with open("/Users/titus/Desktop/trumptweets.json",'r', encoding='utf8') as f:
data = json.loads(f.readline())
tweets = []
for sentence in data:
tokens = nltk.wordpunct_tokenize(sentence['text'])
type(tokens)
text = nltk.Text(tokens)
type(text)
words = [w.lower() for w in text if w.isalpha() and w not in
stopwords.words('english') and w is not 'the']
s = " "
useful_sentence = s.join(words)
tweets.append(useful_sentence)
print(tweets)
I'm trying to remove words like "I" and "the", but for some reason I can't figure out how. If I look at the tweets after they've gone through the loop, the word "the" still occurs.
Question: How is it possible that there are still occurences of "the" and "I" in the tweets? How should I fix this?
Beware of the processing order.
Here are two test strings for you:
THIS THE REMAINS.
this the is removed
Because "THE" is not "the". You lowercase after filtering, but you should first lowercase then filter.
The bad news for you: k-means works horribly bad on noisy short text like twitter. Because it is sensitive to noise, and the TFIDF vectors need very long texts to be reliable. So carefully verify your results, they probably are not as good as they may seem in the first enthusiasm.
Have you tried lowering w in check?
words = [w.lower() for w in text if w.isalpha() and w.lower() not in
stopwords.words('english') and w.lower() is not 'the']
is (and is not) is the (reference) identity check. It compares if two variable names point to the same object in memory. Typically this is only used to compare with None, or for some other speical cases.
In your case, use the != operator or the negation of == to compare with the string "the".
See also: Is there a difference between `==` and `is` in Python?
Is there any way to include punctuation in keras tokenizer?
I would like to have a transformation...
FROM
Tomorrow will be cold.
TO
Index-tomorrow, Index-will,...,Index-point
How can I achieve that?
This is possible if you do some pre-processing on the text.
First you want to make sure that the punctuation is not filtered out by the Tokenizer. You can see from the documentation that the Tokenizer takes a filter argument on initialization. You can replace the default value with the set of characters you would like to filter, and exclude the ones you want to have in your index.
The second part is making sure that the punctuation is recognized as its own token. If you tokenize the example sentence the result would take "cold." as a token instead of "cold" and ".". What you need is a seperator between the word and the punctuation. A naive approach is to replace the punctuation in the text with a space + punctuation.
Following code does what you ask:
from keras.preprocessing.text import Tokenizer
t = Tokenizer(filters='!"#$%&()*+,-/:;<=>?#[\\]^_`{|}~\t\n') # all without .
text = "Tomorrow will be cold."
text = text.replace(".", " .")
t.fit_on_texts([text])
print(t.word_index)
-> prints: {'will': 2, 'be': 3, 'cold': 4, 'tomorrow': 1, '.': 5}
The replace logic can be done in a smarter way (eg. with regex if you want to capture all punctuation), but you get the gist.
A general solutions, inspired by the one proposed by lmartens, using Regex expressions to replace a set of punctuation marks. Here the code:
from keras.preprocessing.text import Tokenizer
import re
to_exclude = '!"#$%&()*+-/:;<=>#[\\]^_`{|}~\t\n'
to_tokenize = '.,:;!?'
t = Tokenizer(filters=to_exclude) # all without .
text = "Tomorrow, will be. cold?"
text = re.sub(r'(['+to_tokenize+'])', r' \1 ', text)
t.fit_on_texts([text])
print(t.word_index) # {'tomorrow': 1, ',': 2, 'will': 3, 'be': 4, '.': 5, 'cold': 6, '?': 7}
There is a pool of letters (chosen randomly), and you want to make a word with these letters. I found some codes that can help me with this, but then if the word has for example 2 L's and the pool only 1, I'd like the program to know when this happens.
If I understand this correctly, you will also need a list of all valid words in whichever language you are using.
Assuming you have this, then one strategy for solving this problem could be to generate a key for every word in the dictionary that is a sorted list of the letters in that word. You could then group all words in the dictionary by these keys.
Then the task of finding out if a valid word can be constructed from a given list of random characters would be easy and fast.
Here is a simple implementation of what I am suggesting:
list_of_all_valid_words = ['this', 'pot', 'is', 'not', 'on', 'top']
def make_key(word):
return "".join(sorted(word))
lookup_dictionary = {}
for word in list_of_all_valid_words:
key = make_key(word)
lookup_dictionary[key] = lookup_dictionary.get(key, set()).union(set([word]))
def words_from_chars(s):
return list(lookup_dictionary.get(make_key(s), set()))
print words_from_chars('xyz')
print words_from_chars('htsi')
print words_from_chars('otp')
Output:
[]
['this']
['pot', 'top']
I have a table of strings (about 100,000) in following format:
pattern , string
e.g. -
*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry
To simplify, assume a * denotes a vowel, a consonant stands unchanged and ] denotes either a 'y' or a 'w' (say, for instance, semi-vowels/round-vowels in phonology).
Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is defined as a string having each of its consecutive two-letter substrings, that were not specified in the pattern, inside the data-set.
e.g. -
h*ll* --> hallo, hello, holla ...
'hallo' is sensible because 'ha', 'al', 'lo' can be seen in the data-set as with the words 'have', 'also', 'low'. The two letters 'll' is not considered because it was specified in the pattern.
What are the simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?
I've no specific language in mind but prefer to use java for this program.
This is particularly well suited to Python itertools, set and re operations:
import re
import itertools
VOWELS = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET = '/usr/share/dict/words'
SENSIBLES = set()
def digraphs(word, digraph=r'..'):
'''
>>> digraphs('bar')
set(['ar', 'ba'])
'''
base = re.findall(digraph, word)
base.extend(re.findall(digraph, word[1:]))
return set(base)
def expand(pattern, wildcard, elements):
'''
>>> expand('h?', '?', 'aeiou')
['ha', 'he', 'hi', 'ho', 'hu']
'''
tokens = re.split(re.escape(wildcard), pattern)
results = set()
for perm in itertools.permutations(elements, len(tokens)):
results.add(''.join([l for p in zip(tokens, perm) for l in p][:-1]))
return sorted(results)
def enum(pattern):
not_sensible = digraphs(pattern, r'[^*\]]{2}')
for p in expand(pattern, '*', VOWELS):
for q in expand(p, ']', SEMI_VOWELS):
if (digraphs(q) - not_sensible).issubset(SENSIBLES):
print q
## Init the data-set (may be long...)
## you may want to pre-compute this
## and adapt it to your data-set.
for word in open(DATASET, 'r').readlines():
for digraph in digraphs(word.rstrip()):
SENSIBLES.add(digraph)
enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
As there aren't many possibilites for two-letter substrings, you can go through your dataset and generate a table that contains the count for every two-letter substring, so the table will look something like this:
ee 1024 times
su 567 times
...
xy 45 times
xz 0 times
The table will be small as you'll only have about 26*26 = 676 values to store.
You have to do this only once for your dataset (or update the table every time it changes if the dataset is dynamic) and can use the table for evaluating possible strings. F.e., for your example, add the values for 'ha', 'al' and 'lo' to get a "score" for the string 'hallo'. After that, choose the string(s) with the highest score(s).
Note that the scoring can be improved by checking longer substrings, f.e. three letters, but this will also result in larger tables.