Delete characters that are not letters, numbers, or whitespace? - python-3.x

Community,
I need to clean a string so that it contains only letters, numbers, and whitespace.
The string currently consists of several sentences.
I tried:
for entry in s:
    if not isalpha() or isdigit() or isspace:
        del (entry)
    else: s.append(entry)  # the wanted characters should be kept in the string, the rest deleted
I am using Python 3.4.0.

You can use this:
clean_string = ''.join(c for c in s if c.isalnum() or c.isspace())
It iterates through each character, leaving you only with the ones that satisfy at least one of the two criteria, then joins them all back together. I am using isalnum() to check for alphanumeric characters, rather than both isalpha() and isdigit() separately.
You can achieve the same thing using filter (note that in Python 3, filter returns an iterator, so you still need join to get a string back):
clean_string = ''.join(filter(lambda c: c.isalnum() or c.isspace(), s))
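A quick check of both approaches on a made-up sample string (illustrative only):
s = "Hello, world! 42."
print(''.join(c for c in s if c.isalnum() or c.isspace()))        # Hello world 42
print(''.join(filter(lambda c: c.isalnum() or c.isspace(), s)))   # Hello world 42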

The or does not combine the checks the way it does in English; each condition needs its own method call on entry. Instead, you should do:
new_s = ''
for entry in s:
    if entry.isalpha() or entry.isdigit() or entry.isspace():
        new_s += entry
print(new_s)
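A related pitfall worth noting: in the original attempt, isspace has no parentheses, and a method referenced without calling it is always truthy, so an or chain built that way never filters anything. Illustrative only:
entry = '!'
print(bool(entry.isalpha() or entry.isdigit() or entry.isspace))   # True  (isspace without () is a truthy method object)
print(entry.isalpha() or entry.isdigit() or entry.isspace())       # False (each check actually called)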

Related

Removing Characters With Regular Expression in List Comprehension in Python

I am learning Python and trying to do some text preprocessing, reading and borrowing ideas from Stack Overflow. I came up with the formulations below, but they don't appear to do what I was expecting, and they don't throw any errors either, so I'm stumped.
First, in a Pandas dataframe column, I am trying to remove the third consecutive character in a word; it's kind of like running a spell check on words that are supposed to have two consecutive characters instead of three:
buttter = butter
bettter = better
ladder = ladder
The code I used is below:
import re
docs['Comments'] = [c for c in docs['Comments'] if re.sub(r'(\w)\1{2,}', r'\1', c)]
In the second instance, I just want to replace multiple punctuation marks with the last one.
????? = ?
..... = .
!!!!! = !
---- = -
***** = *
And the code I have for that is:
docs['Comments'] = [i for i in docs['Comments'] if re.sub(r'[\?\.\!\*]+(?=[\?\.\!\*])', '', i)]
It looks like you want to use
docs['Comments'] = (docs['Comments']
    .str.replace(r'(\w)\1{2,}', r'\1\1', regex=True)
    .str.replace(r'([^\w\s]|_)(\1)+', r'\2', regex=True))
The r'(\w)\1{2,}' regex finds a word char repeated three or more times, and \1\1 replaces the run with just two occurrences.
The r'([^\w\s]|_)(\1)+' regex matches repeated punctuation chars and captures the last repetition into Group 2, so \2 replaces the whole match with the last punctuation char.
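A minimal check on a toy DataFrame (the column name follows the question; the sample rows are made up):
import pandas as pd

docs = pd.DataFrame({'Comments': ['buttter is bettter!!!', 'a ladder????? here.....']})
docs['Comments'] = (docs['Comments']
    .str.replace(r'(\w)\1{2,}', r'\1\1', regex=True)
    .str.replace(r'([^\w\s]|_)(\1)+', r'\2', regex=True))
print(docs['Comments'].tolist())   # ['butter is better!', 'a ladder? here.']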

Python - Replacing repeated consonants with other values in a string

I want to write a function that, given a string, returns a new string in which every occurrence of a sequence of 2 or more of the same consonant is replaced with the same sequence, except that the first consonant is replaced with the character 'm'.
The explanation was probably very confusing, so here are some examples:
"hello world" should return "hemlo world"
"Hannibal" should return "Hamnibal"
"error" should return "emror"
"although" should return "although" (returns the same string because none of the characters are repeated in a sequence)
"bbb" should return "mbb"
I looked into using regex but wasn't able to achieve what I wanted. Any help is appreciated.
Thank you in advance!
Regex is probably the best tool for the job here. The expression, with a quick test:
import re

test = """
hello world
Hannibal
error
although
bbb
"""
output = re.sub(r'(.)\1+', lambda g: f'm{g.group(0)[1:]}', test)
# '''
# hemlo world
# Hamnibal
# emror
# although
# mbb
# '''
The only really complicated part of this is the lambda that we give as an argument. re.sub() can accept a function as the replacement: it gets passed a match object (which we call .group(0) on to get the full match, i.e. all of the repeated letters) and should return the string to substitute for whatever was matched. Here, we use it to output the character 'm' followed by the second character onwards of the match, in an f-string.
The regex itself is pretty straightforward as well: any character (.), then the same character (\1) again one or more times (+). If you wanted just word characters (i.e. not to replace duplicate whitespace), you could use (\w) instead of (.).
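If you want to restrict the replacement strictly to consonants, as the question asks, a character class works. A minimal sketch, assuming ASCII letters only (the function name is just for illustration):
import re

def replace_repeated_consonants(s):
    # A run of 2 or more of the same consonant: replace the first one with 'm'.
    return re.sub(r'([bcdfghjklmnpqrstvwxyz])\1+',
                  lambda m: 'm' + m.group(0)[1:],
                  s, flags=re.IGNORECASE)

print(replace_repeated_consonants("hello world"))  # hemlo world
print(replace_repeated_consonants("Hannibal"))     # Hamnibal
print(replace_repeated_consonants("bbb"))          # mbb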

re.sub replacing string using original sub-string

I have a text file. I would like to remove all decimal points and their trailing digits, unless the number is preceded by text.
e.g. 12.29,14.6,8967.334 should be replaced with 12,14,8967
e.g. happypants2.3#email.com should not be modified.
My code is:
import re
txt1 = "9.9,8.8,22.2,88.7,morris1.43#email.com,chat22.3#email.com,123.6,6.54"
txt1 = re.sub(r',\d+[.]\d+', r'\d+',txt1)
print(txt1)
Unless there is an easier way of doing this, how do I modify r'\d+' so it keeps just the number without the decimal part?
You need to make use of groups in your regex. You put the digits before the '.' into parentheses, and then you can use '\1' to refer to them later:
txt1 = re.sub(r',(\d+)[.]\d+', r',\1',txt1)
Note that in your attempted replacement you forgot to put the comma back, so your numbers would have run together. This still isn't perfect, though; the first number isn't processed, since it isn't preceded by a comma.
Instead of checking for a comma, a better way is to check for word boundaries, which can be done using \b. So the solution is:
import re
txt1 = "9.9,8.8,22.2,88.7,morris1.43#email.com,chat22.3#email.com,123.6,6.54"
txt1 = re.sub(r'\b(\d+)[.]\d+\b', r'\1',txt1)
print(txt1)
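For the sample txt1 above this prints:
9,8,22,88,morris1.43#email.com,chat22.3#email.com,123,6
The email addresses are left untouched because there is no word boundary between a letter and a digit.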
Considering these are the only two types of string present in your file, you can check for these conditions explicitly.
This may not be the most efficient way, but what I have done is split the string on commas and check whether each piece contains #email.com. If it does, I just append it to a new list unchanged. To satisfy your first condition, we can convert the piece to an int, which drops the decimal part.
If you want everything back in a single str variable, you can use .join().
Code:
txt1 = "9.9,8.8,22.2,88.7,morris1.43#email.com,chat22.3#email.com,123.6,6.54"
txt_list = []
for i in (txt1.split(',')):
if '#email.com' in i:
txt_list.append(i)
else:
txt_list.append(str(int(float(i))))
txt_new = ",".join(txt_list)
txt_new
Output:
'9,8,22,88,morris1.43#email.com,chat22.3#email.com,123,6'
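Note that int(float(i)) truncates toward zero (88.7 becomes 88, matching the question's examples); if you wanted rounding instead, you would use round(float(i)).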

How to replace several characters in a string using Julia

I'm essentially trying to solve this problem: http://rosalind.info/problems/revc/
I want to replace all occurrences of A, C, G, T with their complements T, G, C, A; in other words, all A's will be replaced with T's, all C's with G's, and so on.
I had previously used the replace() function to replace all occurrences of 'T' with 'U', and I was hoping that replace would take a list of characters to replace with another list of characters, but I haven't been able to make that work, so it might not have that functionality.
I know I could solve this easily using the BioJulia package and have done so using the following:
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
using Bio.Seq
s = dna"AAAACCCGGT"
t = reverse(complement(s))
println("$t")
But I'd like to not have to rely on the package.
Here's the code I have so far, if someone could steer me in the right direction that'd be great.
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
s = open("nt.txt") # open file containing sequence
t = reverse(s) # reverse the sequence
final = replace(t, r'[ACGT]', '[TGCA]') # this is probably incorrect
# replace characters ACGT with TGCA
println("$final")
It seems that replace doesn't yet do translations quite like, say, tr in Bash. So here are a couple of approaches using a dictionary mapping instead (the BioJulia package also appears to make similar use of dictionaries):
complements = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
Then if str = "AAAACCCGGT", you could use join like this:
julia> join([complements[c] for c in str])
"TTTTGGGCCA"
Another approach could be to use a function and map:
function translate(c)
    complements[c]
end
Then:
julia> map(translate, str)
"TTTTGGGCCA"
Strings are iterable objects in Julia; each of these approaches reads one character at a time, c, and passes it to the dictionary to get back the complementary character. A new string is built up from these complementary characters.
Julia's strings are also immutable: you can't swap characters around in place, rather you need to build a new string.

Python Challenge #2: removing characters from a string

I have the code:
theory = """}#)$[]_+(^_#^][]_)*^*+_!{&$##]((](}}{[!$#_{&{){
*_{^}$#!+]{[^&++*#!]*)]%$!{#^&%(%^*}#^+__])_$#_^#[{{})}$*]#%]{}{][#^!#)_[}{())%)
())&##*[#}+#^}#%!![#&*}^{^(({+#*[!{!}){(!*#!+#[_(*^+*]$]+#+*_##)&)^(#$^]e#][#&)(
%%{})+^$))[{))}&$(^+{&(#%*#&*(^&{}+!}_!^($}!(}_##++$)(%}{!{_]%}$!){%^%%#^%&#([+[
_+%){{}(#_}&{&++!#_)(_+}%_#+]&^)+]_[#]+$!+{#}$^!&)#%#^&+$#[+&+{^{*[#]#!{_*[)(#[[
]*!*}}*_(+&%{&#$&+*_]#+#]!&*#}$%)!})#&)*}#(#}!^(]^#}]#&%)![^!$*)&_]^%{{}(!)_&{_{
+[_*+}]$_[##_^]*^*##{&%})*{&**}}}!_!+{&^)__)#_#$#%{+)^!{}^#[$+^}&(%%)&!+^_^#}^({
*%]&#{]++}#$$)}#]{)!+#[^)!#[%#^!!"""
#theory = open("temp.txt")
key = "##!$%+{}[]_-&*()*^#/"
new2 =""
print()
for letter in theory:
    if letter not in key:
        new2 += letter
print(new2)
This is a test piece of code to solve the python challenge #2: http://www.pythonchallenge.com/pc/def/ocr.html
The only trouble is, the code I wrote seems to leave lots of whitespace, and I'm not sure why.
Any ideas on how to remove the unnecessary whitespace? In other words, I want the code to return "e", not " e ".
The challenge is to find a rare character. You could use collections.Counter for that:
from collections import Counter
c = Counter(theory)
print(c.most_common()[-1])
Output
('e', 1)
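most_common() with no argument returns (character, count) pairs sorted from most to least frequent, so indexing with [-1] picks the rarest character.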
The unnecessary whitespace could be removed using .strip():
new2.strip()
Adding '\n' to the key works too.
The best would be to use the regular expression library, like so:
import re
characters = re.findall("[a-zA-Z]", theory)
print("".join(characters))
The resulting string will contain ONLY alphabetic characters.
If you look at the distribution of characters (using collections.Counter), you get:
6000+ each of )#(]#_%[}!+$&{*^ (which you are correctly excluding from the output)
1220 newlines (which you are not excluding from the output)
1 each of — no, I'm not going to give away the answer
Just add \n to your key variable to exclude the unwanted newlines. This will leave you with just the rare (i.e., 1 occurrence only) characters you need.
P.S., it's highly inefficient to concatenate strings in a loop. Instead of:
new2 = ''
for letter in theory:
    if letter not in key:
        new2 += letter
write:
new2 = ''.join(letter for letter in theory if letter not in key)
The theory string contains several newlines. They get printed by your code. You can either get rid of the newline, like this:
theory = "}#)$[]_+(^_#^][]_)*^*+_!{&$##]((](}}{[!$#_{&{){" \
"*_{^}$#!+]{[^&++*#!]*)]%$!{#^&%(%^*}#^+__])_$#_^#[{{})}$*]#%]{}{][#^!#)_[}{())%)" \
"())&##*[#}+#^}#%!![#&*}^{^(({+#*[!{!}){(!*#!+#[_(*^+*]$]+#+*_##)&)^(#$^]e#][#&)(" \
"%%{})+^$))[{))}&$(^+{&(#%*#&*(^&{}+!}_!^($}!(}_##++$)(%}{!{_]%}$!){%^%%#^%&#([+[" \
"_+%){{}(#_}&{&++!#_)(_+}%_#+]&^)+]_[#]+$!+{#}$^!&)#%#^&+$#[+&+{^{*[#]#!{_*[)(#[[" \
"]*!*}}*_(+&%{&#$&+*_]#+#]!&*#}$%)!})#&)*}#(#}!^(]^#}]#&%)![^!$*)&_]^%{{}(!)_&{_{" \
"+[_*+}]$_[##_^]*^*##{&%})*{&**}}}!_!+{&^)__)#_#$#%{+)^!{}^#[$+^}&(%%)&!+^_^#}^({" \
"*%]&#{]++}#$$)}#]{)!+#[^)!#[%#^!!"
or your can filter them out, like this:
key = "##!$%+{}[]_-&*()*^#/\n"
Both work fine (yes, I tested).
A simpler way to output the answer is:
print(''.join([c for c in theory if c not in key]))
and in your case you might want to add the newline character to key to also filter it out:
key += "\n"
You could also work in reverse and keep only the letters, something like this:
out = []
for i in theory:
    a = ord(i)
    # keep ASCII letters: a-z is 97-122, A-Z is 65-90
    if (97 <= a <= 122) or (65 <= a <= 90):
        out.append(chr(a))
print(''.join(out))
Or better, use a regexp.
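Or simply keep only alphabetic characters with str.isalpha (a one-line sketch, equivalent here since theory is ASCII):
print(''.join(c for c in theory if c.isalpha()))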
