Remove Punctuation from Strings in File - python-3.x

I am having trouble building a list of tuples, sorted in ascending order, that excludes special characters. So far I can split the text into individual characters, count them, and print the counts in ascending order. However, I have been unsuccessful in removing special characters from the strings, so they are included in my count. I understand I need to use the .translate method with string.punctuation, but I haven't been able to place it in the right spot. Here is my code.
import string
def Tuplelist():
    fname = input('Please enter file name to process: ')
    try:
        fopen = open(fname)
    except:
        print('file name', fname, "doesn't exist.")
        return
    counts = {}
    for line in fopen:
        words = line.strip('\n')
        word2 = words.lower()
        wordy = word2.split()
        for word in wordy:
            for letter in word:
                counts[letter] = counts.get(letter,0) + 1
    listletter = []
    for key, val in counts.items():
        listletter.append((key, val))
    print( sorted ( [ (v,k) for k,v in counts.items() ] ) )
I am still unable to strip the special characters with string.punctuation; my attempt fails with the error: 'dict' object has no attribute 'translate'.
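The error message suggests .translate was called on a dictionary; it is a string method, so it has to be applied to the text before the characters are counted. Below is a minimal sketch of that idea, not a drop-in fix: count_letters and STRIP_PUNCT are names made up for illustration, the function takes any iterable of lines (an open file works), and string.punctuation only covers ASCII punctuation.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
# (STRIP_PUNCT and count_letters are illustrative names, not from the post.)
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def count_letters(lines):
    """Return (count, character) tuples sorted in ascending order."""
    counts = Counter()
    for line in lines:
        # translate() is a str method, so apply it to the text, not the dict
        cleaned = line.lower().translate(STRIP_PUNCT)
        for word in cleaned.split():
            counts.update(word)  # adds 1 for every character in the word
    return sorted((n, ch) for ch, n in counts.items())

print(count_letters(["Hi, hi!"]))  # [(2, 'h'), (2, 'i')]
```

With a file, this could be called as count_letters(open(fname)), keeping the original try/except around the open call.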

Related

How can I print the line index of a specific word in a text file?

I was trying to find a way to print the biggest word from a txt file, its size, and its line index. I managed to get the first two done but can't quite figure out how to print the line index. Can anyone help me?
def BiggestWord():
    list_words = []
    with open('song.txt', 'r') as infile:
        lines = infile.read().split()
        for i in lines:
            words = i.split()
            list_words.append(max(words, key=len))
    biggest_word = str(max(list_words, key=len))
    print biggest_word
    print len(biggest_word)
    FindWord(biggest_word)
def FindWord(biggest_word):
You don't need to do another loop through your list of largest words from each line. Every for-loop increases function time and complexity, and it's better to avoid unnecessary ones when possible.
As one of the options, you can use Python's built-in function enumerate to get an index for each line from the list of lines, and instead of adding each line maximum to the list, you can compare it to the current max word.
def get_largest_word():
    # Setting initial variable values
    current_max_word = ''
    current_max_word_length = 0
    current_max_word_line = None
    with open('song.txt', 'r') as infile:
        lines = infile.read().splitlines()
    for line_index, line in enumerate(lines):
        words = line.split()
        if not words:
            continue  # skip blank lines: max() raises ValueError on an empty list
        max_word_in_line = max(words, key=len)
        max_word_in_line_length = len(max_word_in_line)
        if max_word_in_line_length > current_max_word_length:
            # updating the largest word value with a new maximum word
            current_max_word = max_word_in_line
            current_max_word_length = max_word_in_line_length
            current_max_word_line = line_index + 1  # line number starting from 1
    print(current_max_word)
    print(current_max_word_length)
    print(current_max_word_line)
    return current_max_word, current_max_word_length, current_max_word_line
P.S.: This function does not deal with ties. When several words share the maximum length, it simply keeps the first one it finds; if you need a different rule for choosing the overall maximum, adjust the code accordingly.
P.P.S.: This example is in Python 3, so change the snippet to work in Python 2.7 if needed.
With a limited amount of info I'm working with, this is the best solution I could think of. Assuming that each line is separated by a new line, such as '\n', you could do:
def FindWord(largest_word):
    with open('song.txt', 'r') as infile:
        lines = infile.read().splitlines()
    linecounter = 1
    for i in lines:
        # test the words of the current line, not the whole list of lines
        if largest_word in i.split():
            return linecounter
        linecounter += 1
You can use enumerate in your for to get the current line and sorted with a lambda to get the longest word:
def longest_word_from_file(filename):
    list_words = []
    with open(filename, 'r') as input_file:
        for index, line in enumerate(input_file):
            words = line.split()
            if words:  # skip blank lines, where max() would raise ValueError
                list_words.append((max(words, key=len), index))
    sorted_words = sorted(list_words, key=lambda x: -len(x[0]))
    longest_word, line_index = sorted_words[0]
    return longest_word, line_index
Are you aware that there can be:
many 'largest' words with the same length
several lines that contain word(s) of the biggest length
Here is the code that finds ONE largest word and returns a LIST of numbers of lines that contain the word:
# build a dictionary:
# line_num: largest_word_in_this_line
# line_num: largest_word_in_this_line
# etc...
# !!! actually, a line can contain several largest words
list_words = {}
with open('song.txt', 'r') as infile:
    for i, line in enumerate(infile.read().splitlines()):
        # default='' keeps max() from failing on blank lines
        list_words[i] = max(line.split(), key=len, default='')
# get the largest word from values of the dictionary
# !!! there can be several different 'largest' words with the same length
largest_word = max(list_words.values(), key=len)
# get a list of numbers of lines (keys of the dictionary) that contain the largest word
lines = list(filter(lambda key: list_words[key] == largest_word, list_words))
print(lines)
If you want to get all lines that have words with the same biggest length you need to modify the last two lines in my code this way:
lines = list(filter(lambda key: len(list_words[key]) == len(largest_word), list_words))
print(lines)
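To make the tie handling discussed above concrete, here is a small sketch that collects every word of maximum length together with its 1-based line number in a single pass. It is only an illustration: longest_words_with_lines is a hypothetical name, and it takes an iterable of lines so it can be tried without a song.txt file.

```python
def longest_words_with_lines(lines):
    """Return (max_length, [(line_number, word), ...]) including all ties."""
    best_len = 0
    hits = []  # every (line_number, word) pair whose word has length best_len
    for line_number, line in enumerate(lines, start=1):
        for word in line.split():  # blank lines simply yield no words
            if len(word) > best_len:
                # strictly longer word: discard earlier candidates
                best_len = len(word)
                hits = [(line_number, word)]
            elif len(word) == best_len:
                # same length as the current maximum: record the tie
                hits.append((line_number, word))
    return best_len, hits

length, found = longest_words_with_lines(["la la land", "", "sing song"])
print(length, found)
```

For a file, the same function could be called as longest_words_with_lines(open('song.txt')).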

Is there any way to remove non-alphanumeric elements from a list?

I have a huge text document with 100k sentences, and I want to count how many of each letter there are. I was thinking of putting every character from the text into a list, having a function remove all the characters I don't want to count, and then counting the rest with a dictionary. But I am unsure (1) whether you can even do that with a list, and (2) whether this is the best way to write this program.
listy_mc_listface = []
with open("\\Users\\saksa\\python_courses\\1DV501\\assign3\\eng_news_100K-sentences.txt", "r", encoding= 'utf-8') as f:
    string_name=f
    for line in string_name:
        for item in line:
            listy_mc_listface.append(item)
You can use a negated character class in a regex to remove symbols.
import re
a_count = 0
listy_mc_listface = []
with open("textfile.txt", "r", encoding= 'utf-8') as f:
    string_name=f
    for line in string_name:
        line=re.sub('[^0-9a-zA-Z]+', '', line)
        for item in line:
            listy_mc_listface.append(item)
            if item == 'a':
                a_count = a_count + 1
print(listy_mc_listface)
print(a_count)
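If the goal is only the per-character totals, the intermediate list is probably unnecessary. The sketch below counts directly into a collections.Counter (a dict subclass) and keeps the same characters as the [^0-9a-zA-Z] pattern above. count_alnum is a made-up name, and str.isascii assumes Python 3.7 or newer.

```python
from collections import Counter

def count_alnum(lines):
    """Count ASCII letters and digits across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # keep only the characters that [0-9a-zA-Z] would match
        counts.update(ch for ch in line if ch.isascii() and ch.isalnum())
    return counts

totals = count_alnum(["ab, a!", "B2\n"])
print(totals['a'])  # 2
```

Because the file object is iterated line by line, this never holds the whole 100k-sentence file in memory at once.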

Append string based on condition python

I just want to append strings based on a condition. For example, strings starting with http should not be appended, but every other string that has a length of 40 should be.
import glob
words = []
store1 = []
disregard = ["http","gen"]
for all in glob.glob(r'MYDIR'):
    with open(all, "r",encoding="utf-16") as f:
        text = f.read()
        lines = text.split("\n")
        for each in lines:
            words += each.split()
    for each in words:
        if len(each) == 40 and each not in disregard:
            store1.append(each)
Update:
if disregard[0] not in each:
works, but how can I compare each string against every entry in my list? Using disregard on its own doesn't work.
Here is my input text file :
http://1234ashajkhdajkhdajkhdjkaaaaaaad1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
genp://1234ashajkhdajkhdajkhdjkaaaaaaad1
a\a
The only thing that will append will be "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
I think the answers should depend on the number of words you want to disregard.
It's important to define what word means. If the word ends with spaces, should they all be stripped?
One solution could be to create a regular expression from all your words and use that to match the line.
import glob
import re
disregard = ["http","gen"]
pattern = "|".join([re.escape(w) for w in disregard])
for all in glob.glob(r'MYDIR/*'):
    with open(all, "r", encoding="utf-16") as f:
        matched_words = []
        for line in f:
            line = line.rstrip("\n")
            if len(line) == 40 and not re.match(pattern, line):
                matched_words.append(line)
        print(matched_words)
The basic structure looks OK; the problem is in the conditionals. You say you want to check whether each line starts with one of the supplied strings, but you split each line and then test whether the whole word is an element of the list. Use .startswith() instead. That also means "http" is caught even when it is not followed by a space.
Also, either the conditional testing should be placed after the loop that builds the words list, or else the words list should be reset at the start of each loop so you're not re-testing words you've already checked.
import glob
# adjusted some variable names for clarity
words = []
output = []
disregard = ["http","gen"]
for fname in glob.glob(r'MYDIR'):
    with open(fname, "r", encoding="utf-16") as f:
        text = f.read()
        lines = text.split("\n")
        for line in lines:
            words += line.split()
for word in words:
    if len(word) == 40 and not any([word.startswith(dis) for dis in disregard]):
        output.append(word)
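As a side note, str.startswith also accepts a tuple of prefixes, which removes the need for any() altogether. A minimal sketch with the file handling left out; keep_words and its length parameter are hypothetical names chosen for illustration:

```python
def keep_words(words, prefixes, length=40):
    """Keep words of exactly `length` characters that start with none of `prefixes`."""
    prefixes = tuple(prefixes)  # startswith() needs a str or a tuple, not a list
    return [w for w in words if len(w) == length and not w.startswith(prefixes)]

sample = ["a" * 40, "http" + "x" * 36, "gen" + "y" * 37, "short"]
print(keep_words(sample, ["http", "gen"]))
```

Only the first sample string survives here: the next two have the right length but a disregarded prefix, and the last one is too short.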

Having Issues Concatenating Strings into list without \n - Python3

I am currently having some issues trying to append strings into a new list. However, when I get to the end, my list looks like this:
['MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH\nIIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT\nCGLANMFAYHTLGYEDLDELQKEPQPLVFVIELLQVDAPSDYQRETWNLSNHEKMKAVPV\nLHGEGNRLFKLGRYEEASSKYQEAIICLRNLQTKEKPWEVQWLKLEKMINTLILNYCQCL\nLKKEEYYEVLEHTSDILRHHPGIVKAYYVRARAHAEVWNEAEAKADLQKVLELEPSMQKA\nVRRELRLLENRMAEKQEEERLRCRNMLSQGATQPPAEPPTEPPAQSSTEPPAEPPTAPSA\nELSAGPPAEPATEPPPSPGHSLQH\n']
I'd like to remove the newlines somehow. I looked at other questions on here, and most suggest using .rstrip; however, after adding that to my code I get the same output. What am I missing here? Apologies if this question has been asked.
My input looks like this (first 3 lines shown):
sp|Q9NZN9|AIPL1_HUMAN Aryl-hydrocarbon-interacting protein-like 1 OS=Homo sapiens OX=9606 GN=AIPL1 PE=1 SV=2
MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH
IIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT
from sys import argv
protein = argv[1] #fasta file
sequence = '' #string linker
get_line = False #False = not the sequence
Uniprot_ID = []
sequence_list =[]
with open(protein) as pn:
    for line in pn:
        line.rstrip("\n")
        if line.startswith(">") and get_line == False:
            sp, u_id, name = line.strip().split('|')
            Uniprot_ID.append(u_id)
            get_line = True
            continue
        if line.startswith(">") and get_line == True:
            sequence.rstrip('\n')
            sequence_list.append(sequence) #add the amino acids onto the list
            sequence = '' #resets the str
        if line != ">" and get_line == True: #if the first line is not a fasta ID and is it a sequence?
            sequence += line
print(sequence_list)
Per documentation, rstrip removes trailing characters – the ones at the end. You probably misunderstood others' use of it to remove \ns because typically those would only appear at the end.
To replace a character with something else in an entire string, use replace instead.
These commands do not modify your string! They return a new string, so if you want to change something 'in' a current string variable, assign the result back to the original variable:
>>> line = 'ab\ncd\n'
>>> line.rstrip('\n')
'ab\ncd' # note: this is the immediate result, which is not assigned back to line
>>> line = line.replace('\n', '')
>>> line
'abcd'
When I asked this question I didn't take the time to read the documentation and understand my code. After doing so, I realized two things:
My code wasn't actually getting what I was interested in.
For the specific question I asked, I could have simply used line.strip() to remove the '\n'.
from sys import argv
protein = argv[1] #fasta file
sequence = '' #string linker
get_line = False #False = not the sequence
Uniprot_ID = []
sequence_list = []
uni_seq = {}
"""this block of code takes a uniprot FASTA file and creates a
dictionary with the key as the uniprot id and the value as a sequence"""
with open (protein) as pn:
    for line in pn:
        if line.startswith(">"):
            if get_line == False:
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                get_line = True
            else:
                # a new header: store the sequence collected for the previous ID
                uni_seq[u_id] = sequence
                sequence_list.append(sequence)
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                sequence = ''
        else:
            if get_line == True:
                sequence += line.strip() # removes the newline space
# store the last record, which has no following header to trigger it
uni_seq[u_id] = sequence
sequence_list.append(sequence)
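For comparison, the same ID-to-sequence mapping can be sketched as a small function that works on any iterable of lines, which makes it easy to try without a file. This is an illustration under assumptions, not a full FASTA parser: read_fasta is a hypothetical name, and it assumes every header starts with ">" and has the UniProt ID as its second |-separated field.

```python
def read_fasta(lines):
    """Map each UniProt ID to its sequence, joined without newlines."""
    records = {}
    current_id = None
    chunks = []  # sequence lines collected for the current record
    for line in lines:
        line = line.strip()  # strip() returns a new string, so assign it back
        if not line:
            continue
        if line.startswith('>'):
            if current_id is not None:
                records[current_id] = ''.join(chunks)
            # assumed header layout: >sp|ID|NAME description
            current_id = line.split('|')[1]
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = ''.join(chunks)  # flush the final record
    return records

demo = [">sp|P1|NAME_A desc\n", "ABC\n", "DEF\n", ">sp|P2|NAME_B\n", "GHI\n"]
print(read_fasta(demo))  # {'P1': 'ABCDEF', 'P2': 'GHI'}
```

Collecting the pieces in a list and joining once avoids building the sequence by repeated string concatenation.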

Get filename from user and convert the number into list

So far, I have this:
def main():
    bad_filename = True
    l =[]
    while bad_filename == True:
        try:
            filename = input("Enter the filename: ")
            fp = open(filename, "r")
            for f_line in fp:
                a=(f_line)
                b=(f_line.strip('\n'))
                l.append(b)
                print (l)
            bad_filename = False
        except IOError:
            print("Error: The file was not found: ", filename)
main()
This is my program, and when I run it this is what I get:
['1,2,3,4,5']
['1,2,3,4,5', '6,7,8,9,0']
['1,2,3,4,5', '6,7,8,9,0', '1.10,2.20,3.30,0.10,0.30']
but instead I need to get
[1,2,3,4,5]
[6,7,8,9,0.00]
[1.10,2.20,3.30,0.10,0.30]
Each line of the file is a series of numbers separated by commas, but to Python they are just characters. You need one more conversion step to turn each string into a list. First split on commas to create a list of strings, each of which represents a number. Then use what is called a "list comprehension" (or a for loop) to convert each string into a number:
b = f_line.strip('\n').split(',')
c = [float(v) for v in b]
l.append(c)
If you really want to reset the list each time through the loop (your desired output shows only the last line) then instead of appending, just assign the numerical list to l:
b = f_line.strip('\n').split(',')
l = [float(v) for v in b]
List comprehension is a shorthand way of saying:
l = []
for v in b:
    l.append(float(v))
You don't need the variable a, or the extra parentheses in the assignments to a and b.
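Putting those pieces together, the conversion could be sketched as one function that turns an iterable of comma-separated lines into a list of float lists. parse_number_lines is a made-up name; the sketch assumes every non-blank line contains only valid numbers, otherwise float() raises ValueError.

```python
def parse_number_lines(lines):
    """Convert lines such as '1,2,3' into lists of floats, one list per line."""
    rows = []
    for f_line in lines:
        f_line = f_line.strip()
        if not f_line:
            continue  # ignore blank lines, e.g. a trailing newline at end of file
        rows.append([float(v) for v in f_line.split(',')])
    return rows

for row in parse_number_lines(["1,2,3,4,5\n", "1.10,2.20,3.30,0.10,0.30\n"]):
    print(row)
```

Note that float() turns '1' into 1.0, so the printed rows show 1.0 rather than 1; formatting the output is a separate step.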
