Python3: counting and extracting letters out of a text - python-3.x

I'm doing some processing tasks on a medium-sized (1.7 MB) Persian text corpus. I want to make lists of three sets of characters in the text:
letters of the alphabet,
whitespace (including newline, tab, space, no-break space, etc.), and
punctuation.
I wrote this:
# -*- coding: utf8 -*-
TextObj = open ('text.txt', 'r', encoding = 'UTF8')
import string
LCh = LSpc = LPunct = []
TotalCh = TotalPunct = TotalSpc = 0
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
#TempSet variable holds alphabets of Persian language.
ReadObj = TextObj.read ()
for Char in ReadObj:
    if Char in TempSet: #This's supposed to count & extract alphabets only.
        TotalCh += 1
        LCh.append (Char)
    elif Char in string.punctuation: #This's supposed to count puncts.
        TotalPunct += 1
        LPunct.append (Char)
    elif Char in ('', '\n', '\t'): #This counts & extracts spacey things.
        TotalSpc += 1
        LSpc.append (Char)
    else: #This'll ignore anything else.
        continue
But when I try:
print (LPunct)
print (LSpc)
the result is not what I expected at all. I tried this code on both Linux and Windows 7, and on both of them the punctuation list and the whitespace list contain Persian letters.
Another question:
How can I improve the condition elif Char in ('', '\n', '\t'): so that it covers every kind of whitespace?

In the line that creates the lists, you've bound all three names to the same list object, so every append ends up in that one list!
Don't do this:
LCh = LSpc = LPunct = []
Do this:
LCh = []
LSpc = []
LPunct = []
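To see why the chained assignment causes the mix-up, here is a minimal sketch (the variable names are made up for illustration): every name created by a chained assignment refers to one and the same list, so an append through any of them is visible through all of them.

```python
# Chained assignment binds every name to the SAME list object.
a = b = c = []
a.append('x')
print(b, c)      # ['x'] ['x']
print(a is b)    # True

# Separate list literals create independent lists.
d, e = [], []
d.append('x')
print(e)         # []
```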
The string module has a whitespace constant built in:
elif Char in string.whitespace:
    TotalSpc += 1
    LSpc.append (Char)
In your example the first item of the tuple is '' (an empty string) rather than ' ' (a space), so plain spaces are never matched. Shouldn't this be ' '?
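One caveat, since the question also mentions no-break spaces: string.whitespace only holds the ASCII whitespace characters. As a rough sketch of an alternative, str.isspace() also recognises Unicode whitespace such as U+00A0. The sample characters below are chosen purely for illustration.

```python
import string

# space, tab, newline, no-break space, em space, and a letter for contrast
samples = [' ', '\t', '\n', '\u00a0', '\u2003', 'a']

for ch in samples:
    # columns: character, found in string.whitespace?, str.isspace()?
    print(repr(ch), ch in string.whitespace, ch.isspace())

# Whitespace characters that string.whitespace does not cover
unicode_only = [ch for ch in samples if ch.isspace() and ch not in string.whitespace]
print(unicode_only)
```

With this in mind, the branch in the loop could be written as elif Char.isspace(): instead of testing membership in string.whitespace.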
Also, take the other answer here into account: this code is not very Pythonic.
I'd write it like this:
# -*- coding: utf8 -*-
import fileinput
import string
persian_chars = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
filename = 'text.txt'
persian_list = []
punctuation_list = []
whitespace_list = []
ignored_list = []
# hook_encoded makes fileinput decode the file as UTF-8 regardless of the platform default
for line in fileinput.input(filename, openhook=fileinput.hook_encoded('utf-8')):
    for ch in line:
        if ch in persian_chars:
            persian_list.append(ch)
        elif ch in string.punctuation:
            punctuation_list.append(ch)
        elif ch in string.whitespace:
            whitespace_list.append(ch)
        else:
            ignored_list.append(ch)
total_persian, total_punctuation, total_whitespace = \
    map(len, [persian_list, punctuation_list, whitespace_list])

First of all, a more Pythonic way to deal with files is to open them with a with statement, which closes the file automatically at the end of the block.
Secondly, since you want to count the special characters in your text and keep them in separate groups, you can use a dictionary with the list names as keys and the lists of matching characters as values, then use the len function to get each count.
And finally, to check for whitespace you can test membership in the string.whitespace constant.
import string
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
# Create the lists up front so that append() has something to append to.
result_dict = {'TempSet': [], 'LPunct': [], 'LSpc': []}
with open('text.txt', 'r', encoding='UTF8') as TextObj:
    ReadObj = TextObj.read()
for ch in ReadObj:
    if ch in TempSet:
        result_dict['TempSet'].append(ch)
    elif ch in string.punctuation:
        result_dict['LPunct'].append(ch)
    elif ch in string.whitespace:
        result_dict['LSpc'].append(ch)
TotalCh = len(result_dict['TempSet'])
TotalPunct = len(result_dict['LPunct'])
TotalSpc = len(result_dict['LSpc'])
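Note that string.punctuation contains only ASCII punctuation, so Persian marks such as the Arabic comma (U+060C) would fall through every branch. The following is only a sketch of one possible way around that, using Unicode categories from the standard unicodedata module. The function name classify, the group names and the sample text are assumptions made for illustration, not part of the original answers.

```python
import unicodedata
from collections import defaultdict

persian_letters = set('ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی')

def classify(text):
    """Group the characters of text into letters, punctuation and whitespace."""
    groups = defaultdict(list)
    for ch in text:
        if ch in persian_letters:
            groups['letters'].append(ch)
        elif unicodedata.category(ch).startswith('P'):
            # Every Unicode punctuation category starts with 'P' (Po, Pd, Ps, ...)
            groups['punctuation'].append(ch)
        elif ch.isspace():
            groups['whitespace'].append(ch)
    return groups

# Made-up sample: two words, an Arabic comma (U+060C), a space, '!' and a newline
sample = 'سلام\u060c بابا!\n'
groups = classify(sample)
print({name: len(chars) for name, chars in groups.items()})
```

Using a set for the letters keeps each membership test cheap, which starts to matter on a corpus of a couple of megabytes.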

Related

Having Issues Concatenating Strings into list without \n - Python3

I am currently having some issues trying to append strings into a new list. However, when I get to the end, my list looks like this:
['MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH\nIIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT\nCGLANMFAYHTLGYEDLDELQKEPQPLVFVIELLQVDAPSDYQRETWNLSNHEKMKAVPV\nLHGEGNRLFKLGRYEEASSKYQEAIICLRNLQTKEKPWEVQWLKLEKMINTLILNYCQCL\nLKKEEYYEVLEHTSDILRHHPGIVKAYYVRARAHAEVWNEAEAKADLQKVLELEPSMQKA\nVRRELRLLENRMAEKQEEERLRCRNMLSQGATQPPAEPPTEPPAQSSTEPPAEPPTAPSA\nELSAGPPAEPATEPPPSPGHSLQH\n']
I'd like to remove the newlines somehow. I looked at other questions on here and most suggest using .rstrip; however, after adding that to my code I get the same output. What am I missing here? Apologies if this question has been asked before.
My input looks like this (first 3 lines shown), followed by my code:
>sp|Q9NZN9|AIPL1_HUMAN Aryl-hydrocarbon-interacting protein-like 1 OS=Homo sapiens OX=9606 GN=AIPL1 PE=1 SV=2
MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH
IIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT
from sys import argv
protein = argv[1] #fasta file
sequence = '' #string linker
get_line = False #False = not the sequence
Uniprot_ID = []
sequence_list =[]
with open(protein) as pn:
    for line in pn:
        line.rstrip("\n")
        if line.startswith(">") and get_line == False:
            sp, u_id, name = line.strip().split('|')
            Uniprot_ID.append(u_id)
            get_line = True
            continue
        if line.startswith(">") and get_line == True:
            sequence.rstrip('\n')
            sequence_list.append(sequence) #add the amino acids onto the list
            sequence = '' #resets the str
        if line != ">" and get_line == True: #if the first line is not a fasta ID and is it a sequence?
            sequence += line
print(sequence_list)
Per documentation, rstrip removes trailing characters – the ones at the end. You probably misunderstood others' use of it to remove \ns because typically those would only appear at the end.
To replace a character with something else in an entire string, use replace instead.
These commands do not modify your string! They return a new string, so if you want to change something 'in' a current string variable, assign the result back to the original variable:
>>> line = 'ab\ncd\n'
>>> line.rstrip('\n')
'ab\ncd' # note: this is the immediate result, which is not assigned back to line
>>> line = line.replace('\n', '')
>>> line
'abcd'
When I asked this question I hadn't taken the time to read the documentation and understand my code. After looking, I realized two things:
My code wasn't actually extracting what I was interested in.
For the specific question I asked, I could have simply used line.strip() to remove the '\n'.
from sys import argv
protein = argv[1] #fasta file
Uniprot_ID = []
sequence_list = []
sequence = '' #string linker
get_line = False #False = not the sequence
uni_seq = {}
# This block of code takes a UniProt FASTA file and creates a
# dictionary with the key as the UniProt ID and the value as a sequence.
with open(protein) as pn:
    for line in pn:
        if line.startswith(">"):
            if get_line == False:
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                get_line = True
            else:
                # a new header: store the sequence collected for the previous ID
                uni_seq[u_id] = sequence
                sequence_list.append(sequence)
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                sequence = ''
        else:
            if get_line == True:
                sequence += line.strip() # removes the trailing newline
# store the last record, which has no following header to trigger the save
uni_seq[u_id] = sequence
sequence_list.append(sequence)
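The same idea can be sketched as a small function that collects the stripped lines per ID and joins them at the end. This is only an illustrative variant, not the asker's code: the function name parse_fasta, the shortened sequences and the second record (ID X00000) are made up, and io.StringIO stands in for the real file.

```python
import io

def parse_fasta(lines):
    """Return {uniprot_id: sequence} from an iterable of FASTA lines."""
    sequences = {}
    current_id = None
    for line in lines:
        line = line.strip()          # strip() returns a new string, so assign it back
        if not line:
            continue
        if line.startswith('>'):
            # Header looks like '>sp|Q9NZN9|AIPL1_HUMAN ...'; the ID is the second field
            current_id = line.split('|')[1]
            sequences[current_id] = []
        elif current_id is not None:
            sequences[current_id].append(line)
    return {uid: ''.join(parts) for uid, parts in sequences.items()}

# Shortened, partly made-up input standing in for the real file
demo = io.StringIO(
    ">sp|Q9NZN9|AIPL1_HUMAN example header\n"
    "MDAALLLNVEGVKK\n"
    "TILHGGTGELPNFI\n"
    ">sp|X00000|FAKE_HUMAN made-up second record\n"
    "MKT\n"
)
records = parse_fasta(demo)
print(records)
```

Joining a list of pieces once at the end avoids repeated string concatenation, and no special handling is needed for the last record.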

Scrambling a string in Python without using random.shuffle()

I'm trying to scramble a string, "string", without using random.shuffle(), but my code keeps producing output that has missing and repeating characters, e.g. gtrgtg, gnrtnn, etc. I'm not sure what I'm doing wrong.
import random
s = "string"
new_s=[]
for c in s:
    if random.choice(s) not in new_s:
        new_s.append(random.choice(s))
print(''.join(new_s))
In its current state, your program checks whether a randomly chosen character is already in new_s. If it is, the program does nothing and simply moves on to the next iteration, which is why characters go missing. Also, since you don't assign random.choice(s) to a variable, the character you append comes from a second, independent call and may differ from the one you checked, which is why characters repeat.
A working version would be:
import random
s = "string"
new_s = []
for c in s:
    char = random.choice(s) # assign it to a variable
    while char in new_s: # until a new character comes, repeat the procedure
        char = random.choice(s)
    new_s.append(char)
print(''.join(new_s))
This generates strings like ngtsri, gsrnit, etc. Note that this won't work if the original string contains duplicate characters: once every distinct character has been used, the inner while loop never finds a new one and never terminates.
The above code is highly inefficient. I only gave the correction assuming this was for learning purposes. Normally, if you want to repeatedly check if something is in a collection, that collection should be a set or a dictionary.
random.choice chooses a random character out of the string s but doesn't remove it, so it's possible for the same character to be chosen multiple times and for some characters not to be chosen at all.
import random
s = 'string'
new_s = []
# rather than choosing a character, choose an index, use it and slice it out
while s:
    i = random.randint(0, len(s)-1)
    new_s.append(s[i])
    s = s[:i] + s[i+1:]
print(''.join(new_s))
# this is more elegant with lists:
s = list('string') # start again from the original string; the loop above emptied s
new_s = []
while s:
    i = random.randint(0, len(s)-1)
    new_s.append(s.pop(i))
print(''.join(new_s))
Neither option is very efficient... but for efficiency, use random.shuffle. :)
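If the only constraint is avoiding random.shuffle itself, one further option (not mentioned in the answers above, so treat it as a suggestion) is random.sample, which draws without replacement and therefore also copes with duplicate characters. A minimal sketch:

```python
import random

s = 'string'
# Sampling len(s) items without replacement yields a random permutation of s
scrambled = ''.join(random.sample(s, len(s)))
print(scrambled)
```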
Using while, you could keep looping until the length of new_s matches that of s; the resulting string has no repeated characters. (Like the first answer, this assumes the original string has no duplicate characters, otherwise the loop never ends.)
import random
s = "string"
new_s = '' # So you will not need ''.join() when you print this result
while len(new_s) != len(s):
    char = random.choice(s)
    if char not in new_s:
        new_s += char
print(new_s)
Example output:
rntigs
Try this:
from random import randint
def shuffle(sr):
    n = len(sr)
    s = list(sr)
    for i in range(n):
        cur, idx = s[i], randint(0, n - 1)
        s[i], s[idx] = s[idx], cur
    return ''.join(s)
print(shuffle("hello"))
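The function above swaps each position with an index drawn from the whole string, which always returns a valid permutation but does not make every permutation equally likely. As a sketch of a variant, the Fisher-Yates shuffle restricts the random index to the part of the list that has not been fixed yet. The name fisher_yates is chosen here for illustration.

```python
from random import randint

def fisher_yates(text):
    chars = list(text)
    # Walk from the end; swap each position with a random one at or before it
    for i in range(len(chars) - 1, 0, -1):
        j = randint(0, i)  # randint is inclusive on both ends
        chars[i], chars[j] = chars[j], chars[i]
    return ''.join(chars)

print(fisher_yates("hello"))
```

Because characters are only moved and never tested for membership, duplicates such as the two l's in "hello" are handled without special cases.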

How would I reverse each word individually rather than the whole string as a whole

I'm trying to reverse the words in a string individually, so that the words stay in order but each one is reversed: for "hi my name is" the output should be "ih ym eman si". However, the whole string gets flipped instead.
r = 0
def readReverse(): #creates the function
    start = default_timer() #initiates a timer
    r = len(n.split()) #n is the users input
    if len(n) == 0:
        return n
    else:
        return n[0] + readReverse(n[::-1])
    duration = default_timer() - start
    print(str(r) + " with a runtime of " + str(duration))
print(readReverse(n))
First split the string into words, punctuation and whitespace with a regular expression similar to the one in the code that follows. Then use a generator expression to reverse each piece individually, and finally join the pieces back together with str.join.
import re
text = "Hello, I'm a string!"
split_text = re.findall(r"[\w']+|[^\w]", text)
reversed_text = ''.join(word[::-1] for word in split_text)
print(reversed_text)
Output:
olleH, m'I a gnirts!
If you want to ignore the punctuation you can omit the regular expression and just split the string:
text = "Hello, I'm a string!"
reversed_text = ' '.join(word[::-1] for word in text.split())
However, the commas, exclamation marks, etc. will then be a part of the words.
,olleH m'I a !gnirts
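Note that text.split() followed by ' '.join() collapses runs of spaces, tabs and newlines into single spaces. If the original spacing should survive, one possible sketch is to let re.sub reverse each run of non-whitespace characters in place. The function name reverse_words is an assumption for illustration.

```python
import re

def reverse_words(text):
    # Reverse every run of non-whitespace characters; whitespace is left untouched
    return re.sub(r'\S+', lambda m: m.group(0)[::-1], text)

print(reverse_words("hi my name is"))   # ih ym eman si
print(reverse_words("hi   my\tname"))   # spacing between the words is preserved
```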
Here's the recursive version:
def read_reverse(text):
    idx = text.find(' ') # Find index of next space character.
    if idx == -1: # No more spaces left.
        return text[::-1]
    else: # Split off the first word and reverse it and recurse.
        return text[:idx][::-1] + ' ' + read_reverse(text[idx+1:])

How to remove all characters after first instance of punctuation/blank space?

I have short strings (tweets) from which I must extract all mentions and return them as a list, including repeats.
extract_mentions('.#AndreaTantaros-supersleuth! You are a true journalistic professional. Keep up the great work! #MakeAmericaGreatAgain')
[AndreaTantaros]
How do I make it so that I remove all text after the first instance of punctuation after '#'? (In this case it would be '-') Note, punctuation can be varied. Please no use of regex.
I have used the following:
def extract_mentions(tweet):
    tweet_list = tweet.split()
    mention_list = []
    for word in tweet_list:
        if '#' in word:
            x = word.index('#')
            y = word[x+1:len(word)]
            if y.isalnum() == False:
                y = word[x+1:-1]
                mention_list.append(y)
            else:
                mention_list.append(y)
    return mention_list
This only works when the mention is followed by exactly one extra character.
import string
def extract_mentions(s, delimeters = string.punctuation + string.whitespace):
    mentions = []
    begin = s.find('#')
    while begin >= 0:
        # advance end to the first delimiter after the '#', or to the end of the string
        end = begin + 1
        while end < len(s) and s[end] not in delimeters:
            end += 1
        mentions.append(s[begin+1:end])
        begin = s.find('#', end)
    return mentions
>>> print(extract_mentions('.#AndreaTantaros-supersleuth! You are a true journalistic professional. Keep up the great work! #MakeAmericaGreatAgain'))
['AndreaTantaros', 'MakeAmericaGreatAgain']
Use the string.punctuation constant to get all ASCII punctuation characters.
Skip the leading characters while they are punctuation (otherwise the answer would always be an empty string), then find the next punctuation character.
This uses two loops with opposite conditions and a set for faster lookups. It extracts only the first mention.
import string
z = ".#AndreaTantaros-supersleuth! You are a true journalistic professional. Keep up the great work! #MakeAmericaGreatAgain"
# skip leading punctuation: find position of first non-punctuation
spun = set(string.punctuation) # faster if searched from a set
start_pos = 0
while start_pos < len(z) and z[start_pos] in spun:
    start_pos += 1
# then advance to the next punctuation character (or the end of the string)
end_pos = start_pos
while end_pos < len(z) and z[end_pos] not in spun:
    end_pos += 1
print(z[start_pos:end_pos])
Just use regexp to match and extract part of the text.

Python Spell Checker Linear Search

I'm learning Python and one of the labs requires me to import a list of words to serve as a dictionary, then compare that list of words to some text that is also imported. This isn't for a class, I'm just learning this on my own, or I'd ask the teacher. I've been hung up on how to convert that imported text to uppercase before making the comparison.
Here is the URL to the lab: http://programarcadegames.com/index.php?chapter=lab_spell_check
I've looked at the posts/answers below and some youtube videos and I still can't figure out how to do this. Any help would be appreciated.
Convert a Python list with strings all to lowercase or uppercase
How to convert upper case letters to lower case
Here is the code I have so far:
# Chapter 16 Lab 11
import re
# This function takes in a line of text and returns
# a list of words in the line.
def split_line(line):
    return re.findall('[A-Za-z]+(?:\'[A-Za-z]+)?',line)
dfile = open("dictionary.txt")
dictfile = []
for line in dfile:
    line = line.strip()
    dictfile.append(line)
dfile.close()
print ("--- Linear Search ---")
afile = open("AliceInWonderLand200.txt")
for line in afile:
    words = []
    line = split_line(line)
    words.append(line)
    for word in words:
        lineNumber = 0
        lineNumber += 1
        if word != (dictfile):
            print ("Line ",(lineNumber)," possible misspelled word: ",(word))
afile.close()
Like the lab says, you use .upper():
dictfile = []
for line in dfile:
    line = line.strip()
    dictfile.append(line.upper()) # <- here.
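The dictionary is only half of the comparison: each word from the text also has to be upper-cased and then searched for in the list. The following is a rough sketch of the whole linear search, with small in-memory lists standing in for the two files. The function name find_misspelled and the sample data are made up for illustration, and the dictionary entries are assumed to be upper case already.

```python
import re

def split_line(line):
    # Same idea as the lab's helper: words, optionally with an apostrophe part
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", line)

def find_misspelled(text_lines, dictionary_words):
    """Return (line_number, word) pairs for words not found in dictionary_words."""
    misspelled = []
    for line_number, line in enumerate(text_lines, start=1):
        for word in split_line(line):
            # Linear search: scan the list until a match is found
            found = False
            for entry in dictionary_words:
                if entry == word.upper():
                    found = True
                    break
            if not found:
                misspelled.append((line_number, word))
    return misspelled

dictionary_words = ["ALICE", "WAS", "BEGINNING", "TO", "GET", "VERY", "TIRED"]
text_lines = ["Alice was beginning", "to get verry tired"]
for line_number, word in find_misspelled(text_lines, dictionary_words):
    print("Line", line_number, "possible misspelled word:", word)
```

Using enumerate for the line number avoids the counter being reset inside the loop, which is what happens in the question's code.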
