I am currently having some issues trying to append strings into a new list. However, when I get to the end, my list looks like this:
['MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH\nIIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT\nCGLANMFAYHTLGYEDLDELQKEPQPLVFVIELLQVDAPSDYQRETWNLSNHEKMKAVPV\nLHGEGNRLFKLGRYEEASSKYQEAIICLRNLQTKEKPWEVQWLKLEKMINTLILNYCQCL\nLKKEEYYEVLEHTSDILRHHPGIVKAYYVRARAHAEVWNEAEAKADLQKVLELEPSMQKA\nVRRELRLLENRMAEKQEEERLRCRNMLSQGATQPPAEPPTEPPAQSSTEPPAEPPTAPSA\nELSAGPPAEPATEPPPSPGHSLQH\n']
I'd like to remove the newlines somehow. I looked at other questions on here and most suggest using .rstrip. However, after adding that to my code, I get the same output. What am I missing here? Apologies if this question has been asked before.
My input looks like this (I took the first 3 lines):
sp|Q9NZN9|AIPL1_HUMAN Aryl-hydrocarbon-interacting protein-like 1 OS=Homo sapiens OX=9606 GN=AIPL1 PE=1 SV=2
MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH
IIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT
from sys import argv
protein = argv[1] #fasta file
sequence = '' #string linker
get_line = False #False = not the sequence
Uniprot_ID = []
sequence_list =[]
with open(protein) as pn:
    for line in pn:
        line.rstrip("\n")
        if line.startswith(">") and get_line == False:
            sp, u_id, name = line.strip().split('|')
            Uniprot_ID.append(u_id)
            get_line = True
            continue
        if line.startswith(">") and get_line == True:
            sequence.rstrip('\n')
            sequence_list.append(sequence) #add the amino acids onto the list
            sequence = '' #resets the str
        if line != ">" and get_line == True: #if the first line is not a fasta ID and is it a sequence?
            sequence += line
print(sequence_list)
Per the documentation, rstrip removes trailing characters only, that is, the ones at the end of the string. You probably misread others' use of it to remove \ns, because typically those only appear at the end of a line.
To replace a character everywhere in a string, use replace instead.
Neither method modifies your string! Each returns a new string, so if you want to change the contents of an existing variable, assign the result back to it:
>>> line = 'ab\ncd\n'
>>> line.rstrip('\n')
'ab\ncd' # note: this is the immediate result, which is not assigned back to line
>>> line = line.replace('\n', '')
>>> line
'abcd'
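Applied to the question's situation, the difference can be sketched as follows. The three short lines are made-up stand-ins for the wrapped sequence lines a FASTA file would yield; the variable names are only for illustration.

```python
# Hypothetical sample: three wrapped lines of one sequence, as read from a file.
lines = ["MDAALLLNVEG\n", "IIIGNMFKLEV\n", "CGLANMFAYHT\n"]

# Plain concatenation keeps every newline inside the string.
joined_raw = "".join(lines)

# rstrip only trims the end, so the interior newlines survive.
only_trailing_removed = joined_raw.rstrip("\n")

# Stripping each line before joining removes all of them.
joined_clean = "".join(line.rstrip("\n") for line in lines)

print(repr(only_trailing_removed))
print(repr(joined_clean))
```

Stripping each line as it is read avoids building the newlines into the string in the first place, which is usually simpler than cleaning them out afterwards.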
When I asked this question, I hadn't taken the time to read the documentation or to understand my own code. After doing so, I realized two things:
my code wasn't actually extracting what I was interested in.
For the specific question I asked, I could simply have used line.strip() to remove the '\n'.
from sys import argv
protein = argv[1] #fasta file
sequence = '' #string linker
get_line = False #False = not the sequence
Uniprot_ID = []
sequence_list = []
uni_seq = {}
"""this block of code takes a uniprot FASTA file and creates a
dictionary with the key as the uniprot id and the value as a sequence"""
with open (protein) as pn:
    for line in pn:
        if line.startswith(">"):
            if get_line == False:
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                get_line = True
            else:
                # a new header: store the sequence collected for the previous ID
                uni_seq[u_id] = sequence
                sequence_list.append(sequence)
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                sequence = ''
        else:
            if get_line == True:
                sequence += line.strip() # removes the newline space
# store the last record, which has no header after it
uni_seq[u_id] = sequence
sequence_list.append(sequence)
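The same idea can be written as a small function, which makes it easier to test without a file on disk. This is only a sketch: read_fasta is a hypothetical helper name, the header layout ">sp|ID|name" is assumed from the question's sample input, and the second record is invented purely for the demonstration.

```python
import io

def read_fasta(handle):
    """Map each ID to its full sequence (hypothetical helper, not a library API)."""
    parts_by_id = {}
    current_id = None
    for line in handle:
        line = line.strip()              # drops the trailing newline
        if not line:
            continue                     # skip blank lines
        if line.startswith(">"):
            # Header assumed to look like ">sp|Q9NZN9|AIPL1_HUMAN ..."
            current_id = line.split("|")[1]
            parts_by_id[current_id] = []
        elif current_id is not None:
            parts_by_id[current_id].append(line)
    # Join the collected lines once per record.
    return {uid: "".join(parts) for uid, parts in parts_by_id.items()}

sample = io.StringIO(
    ">sp|Q9NZN9|AIPL1_HUMAN Aryl-hydrocarbon-interacting protein-like 1\n"
    "MDAALLLNVEG\n"
    "IIIGNMFKLEV\n"
    ">sp|X00000|FAKE_HUMAN made-up second record\n"
    "ACDEFGHIK\n"
)
print(read_fasta(sample))
```

Collecting the pieces in a list and joining once per record also avoids the flag variable and the separate "flush the last record" step.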
I do not understand why, when I open a document in bytes mode with the open function and decode it to text, Python says it differs from a variable that contains exactly the same text. This only happens when the decoded text of the document has line breaks.
Example:
o = open('New.py','rb')
t = o.read().decode()
x = '''this is a
message for test'''
if t == x:
    print('true')
else:
    print('false')
Although the decoded text t and the text of x look exactly the same, Python treats them as different and prints false.
I have tried to find the difference in many ways, but I still don't understand how they differ or how I can convert t so that it equals x.
It's because the line breaks are still part of the string (represented as \n) even if you don't see it.
import binascii
o = open('new.py','rb')
t = o.read().decode()
print(binascii.hexlify(t.encode()))
# b'7468697320697320610a6d65737361676520666f7220746573740a'
x = '''this is a
message for test'''
print(binascii.hexlify(x.encode()))
# b'7468697320697320610a6d65737361676520666f722074657374'
Here, the 0x0a at the end of t is the byte for the newline character; x has no such trailing newline.
To make them equal, strip the leading and trailing whitespace (which includes newlines) from t. Assuming new.py looks like this (same as the value for x):
this is a
message for test
Then just do this:
o = open('new.py','rb')
t = o.read().decode().strip()
x = '''this is a
message for test'''
if t == x:
    print('true')
else:
    print('false')
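A quick way to see such invisible differences without hex-dumping is repr(). The sketch below simulates the file's contents with a string literal (an assumption, so it runs without new.py) and also shows rstrip("\n"), which removes only the trailing newline in case other surrounding whitespace should be kept.

```python
# Simulated file contents: the same text as x plus the trailing newline
# that most editors add at the end of a file.
t = "this is a\nmessage for test\n"
x = "this is a\nmessage for test"

print(repr(t))   # the trailing \n is visible here
print(repr(x))

same_before = (t == x)
same_after = (t.rstrip("\n") == x)
print(same_before, same_after)
```

Note that a file read in 'rb' mode gets no newline translation, so a file saved with Windows line endings would also contain \r characters after decoding; opening it in text mode avoids that.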
I just want to append strings based on a condition. For example, strings starting with http should not be appended, but every other string with a length of 40 should be.
words = []
store1 = []
disregard = ["http","gen"]
for all in glob.glob(r'MYDIR'):
    with open(all, "r",encoding="utf-16") as f:
        text = f.read()
        lines = text.split("\n")
        for each in lines:
            words += each.split()
        for each in words:
            if len(each) == 40 and each not in disregard:
                store1.append(each)
Update:
if disregard[0] not in each:
works, but how can I compare against every entry in my list? Using disregard on its own doesn't work.
Here is my input text file :
http://1234ashajkhdajkhdajkhdjkaaaaaaad1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
genp://1234ashajkhdajkhdajkhdjkaaaaaaad1
a\a
The only thing that will append will be "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
I think the answers should depend on the number of words you want to disregard.
It's important to define what word means. If the word ends with spaces, should they all be stripped?
One solution could be to create a regular expression from all your words and use that to match the line.
import glob
import re
disregard = ["http","gen"]
pattern = "|".join([re.escape(w) for w in disregard])
for all in glob.glob(r'MYDIR/*'):
    with open(all, "r", encoding="utf-16") as f:
        matched_words = []
        for line in f:
            line = line.rstrip("\n")
            if len(line) == 40 and not re.match(pattern, line):
                matched_words.append(line)
        print(matched_words)
The basic structure looks OK; the problem is in the conditionals. You say you want to check whether each line starts with one of the supplied strings, but your code splits each line into words and then checks whether a word is equal to one of those strings. Use .startswith() instead. That way "http" is caught even when it isn't followed by a space.
Also, either run the filtering loop after the loop that builds the words list, or reset the words list for each file, so you don't re-test words you've already checked.
# adjusted some variable names for clarity
import glob

words = []
output = []
disregard = ["http","gen"]
for fname in glob.glob(r'MYDIR'):
    with open(fname, "r", encoding="utf-16") as f:
        text = f.read()
    lines = text.split("\n")
    for line in lines:
        words += line.split()

# filter once, after all files have been read
for word in words:
    if len(word) == 40 and not any(word.startswith(dis) for dis in disregard):
        output.append(word)
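As a possible shortcut, str.startswith also accepts a tuple of prefixes, so the any(...) expression can be dropped. A minimal sketch using the four sample lines from the question in place of the files (the list named words below is a stand-in for the real input):

```python
disregard = ("http", "gen")  # startswith needs a tuple here, not a list

# Stand-in for the words read from the input files.
words = [
    "http://1234ashajkhdajkhdajkhdjkaaaaaaad1",
    "a" * 40,
    "genp://1234ashajkhdajkhdajkhdjkaaaaaaad1",
    "a\\a",
]

# Keep 40-character words that start with none of the prefixes.
kept = [w for w in words if len(w) == 40 and not w.startswith(disregard)]
print(kept)
```

If disregard has to stay a list elsewhere in the program, tuple(disregard) converts it at the call site.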
Input:
to-camel-case
to_camel_case
Desired output:
toCamelCase
My code:
def to_camel_case(text):
    lst =['_', '-']
    if text is None:
        return ''
    else:
        for char in text:
            if text in lst:
                text = text.replace(char, '').title()
                return text
Issues:
1) The input could be an empty string; the code above then returns None rather than ''.
2) I am not sure the title() method can give me the desired output (only the first letter of each word after a '-' or '_' capitalized, except for the first word).
I would prefer not to use regex if possible.
A better way to do this is to split the string and rebuild it with a comprehension. Replacing characters one at a time inside a for loop makes it hard to capitalize the letter that follows a _ or -, because once the separator is gone you have no context about what came before or after it. Note also that the condition text in lst compares the whole string, not the current character, so it is never true for the sample inputs.
def to_camel_case(text):
    # Split also removes the characters
    # Start by converting - to _, then splitting on _
    l = text.replace('-','_').split('_')
    # No text left after splitting
    if not len(l):
        return ""
    # Break the list into two parts
    first = l[0]
    rest = l[1:]
    return first + ''.join(word.capitalize() for word in rest)
And our result:
print(to_camel_case("hello-world"))
Gives helloWorld
This method is quite flexible, and can even handle cases like "hello_world-how_are--you--", which could be difficult using regex if you're new to it.
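To check the edge cases raised in the question, here is a compact restatement of the same split-and-capitalize approach, run against the sample inputs, the empty string, and the messy example mentioned above. It is a sketch of the answer's idea rather than a different algorithm.

```python
def to_camel_case(text):
    # Treat '-' and '_' alike, then split on the single separator.
    parts = text.replace("-", "_").split("_")
    # The first word is kept as-is; every later word is capitalized.
    return parts[0] + "".join(word.capitalize() for word in parts[1:])

for sample in ["to-camel-case", "to_camel_case", "", "hello_world-how_are--you--"]:
    print(repr(sample), "->", repr(to_camel_case(sample)))
```

One caveat: capitalize() lowercases the rest of each word, so an input such as "to_camelCase" comes out as "toCamelcase". If existing capitals must be preserved, word[:1].upper() + word[1:] is an alternative.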
I'm doing some processing tasks on a medium-sized (1.7 MB) Persian text corpus. I want to make lists of three sets of characters in the text:
letters of the alphabet,
whitespace (including newline, tab, space, no-break space, etc.), and
punctuation.
I wrote this:
# -*- coding: utf8 -*-
TextObj = open ('text.txt', 'r', encoding = 'UTF8')
import string
LCh = LSpc = LPunct = []
TotalCh = TotalPunct = TotalSpc = 0
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
#TempSet variable holds alphabets of Persian language.
ReadObj = TextObj.read ()
for Char in ReadObj:
    if Char in TempSet: #This's supposed to count & extract alphabets only.
        TotalCh += 1
        LCh.append (Char)
    elif Char in string.punctuation: #This's supposed to count puncts.
        TotalPunct += 1
        LPunct.append (Char)
    elif Char in ('', '\n', '\t'): #This counts & extracts spacey things.
        TotalSpc += 1
        LSpc.append (Char)
    else: #This'll ignore anything else.
        continue
But when I try:
print (LPunct)
print (LSpc)
I tried this code on both Linux and Windows 7. On both, the result is not what I expected at all: the punctuation and whitespace lists both contain Persian letters.
Another question:
How can I improve the condition elif Char in ('', '\n', '\t'): so that it covers every kind of whitespace?
You've bound all three names to the same list object!
Don't do this:
LCh = LSpc = LPunct = []
Do this:
LCh = []
LSpc = []
LPunct = []
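The effect of chained assignment is easy to reproduce in isolation. In this small sketch the names a, b, c, and d are placeholders for the question's lists:

```python
a = b = []       # one list object, two names
a.append("x")
print(b)         # the append is visible through b as well
print(a is b)

c, d = [], []    # two independent list objects
c.append("x")
print(d)         # d is unaffected
print(c is d)
```

Chained assignment is harmless for immutable values such as the integer counters in the question, because += on an int rebinds the name instead of mutating a shared object.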
The string module has a whitespace constant built in.
elif Char in string.whitespace:
    TotalSpc += 1
    LSpc.append (Char)
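One caveat worth checking: string.whitespace contains only ASCII whitespace, so it does not cover the no-break space the question mentions. str.isspace() follows the Unicode character properties instead. The sketch below compares the two on a few characters; the zero-width non-joiner, common in Persian text, is included as an assumption about what the corpus may contain, and neither test treats it as whitespace.

```python
import string

candidates = {
    "space": " ",
    "tab": "\t",
    "newline": "\n",
    "no-break space": "\u00a0",
    "zero-width non-joiner": "\u200c",
}

# For each character: (is it in string.whitespace?, does isspace() accept it?)
results = {name: (ch in string.whitespace, ch.isspace())
           for name, ch in candidates.items()}

for name, (in_constant, is_space) in results.items():
    print(name, in_constant, is_space)
```

So elif Char.isspace(): may be the closer match for "all kinds of space", with any extra characters of interest tested explicitly.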
In your example there is no space between the quotes in '', which may also be causing it to fail. Shouldn't this be ' '?
Also, take the other answer here into account: this code is not very Pythonic.
I'd write it like this:
# -*- coding: utf8 -*-
import fileinput
import string
persian_chars = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
filename = 'text.txt'
persian_list = []
punctuation_list = []
whitespace_list = []
ignored_list = []
for line in fileinput.input(filename):
    for ch in line:
        if ch in persian_chars:
            persian_list.append(ch)
        elif ch in string.punctuation:
            punctuation_list.append(ch)
        elif ch in string.whitespace:
            whitespace_list.append(ch)
        else:
            ignored_list.append(ch)
total_persian, total_punctuation, total_whitespace = \
    map(len, [persian_list, punctuation_list, whitespace_list])
First of all, a more Pythonic way of dealing with files is to open them in a with statement, which closes the file at the end of the block.
Secondly, since you want to count the special characters in your text and keep them separately, you can use a dictionary with the list names as keys and lists of the matching characters as values. Then use the len function to get each count.
And finally, to check membership in the whitespace characters you can use the string.whitespace constant.
import string
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
# one (initially empty) list per category
result_dict = {'TempSet': [], 'LPunct': [], 'LSpc': []}
with open ('text.txt', 'r', encoding = 'UTF8') as TextObj :
    ReadObj = TextObj.read ()
for ch in ReadObj :
    if ch in TempSet:
        result_dict['TempSet'].append(ch)
    elif ch in string.punctuation:
        result_dict['LPunct'].append(ch)
    elif ch in string.whitespace:
        result_dict['LSpc'].append(ch)
TotalSpc = len(result_dict['LSpc'])
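A variant of the dictionary approach uses collections.defaultdict, which creates each list on first use so the keys need not be set up in advance. This is a sketch under assumptions: classify and the key names are hypothetical, and the sample text is assembled from the letter set itself rather than read from text.txt.

```python
from collections import defaultdict
import string

persian_letters = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'

def classify(text):
    """Group characters by category (hypothetical helper)."""
    groups = defaultdict(list)       # a missing key starts as an empty list
    for ch in text:
        if ch in persian_letters:
            groups['letters'].append(ch)
        elif ch in string.punctuation:
            groups['punctuation'].append(ch)
        elif ch.isspace():
            groups['whitespace'].append(ch)
    return groups

# Sample built from the letter set: three letters, '!', a space,
# a newline, and one more letter.
sample = persian_letters[:3] + "! \n" + persian_letters[-1]
groups = classify(sample)
print({name: len(chars) for name, chars in groups.items()})
```

Keep in mind that string.punctuation holds ASCII punctuation only, so Persian marks such as the Arabic comma or question mark fall through to the ignored characters unless they are added to the test.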
I'm trying to use the str.find() and it keeps raising an error, what am I doing wrong?
import codecs
def countLOC(inFile):
    """ Receives a file and then returns the amount
    of actual lines of code by not counting commented
    or blank lines """
    LOC = 0
    for line in inFile:
        if line.isspace():
            continue
        comment = line.find('#')
        if comment > 0:
            for letter in range(comment):
                if not letter.whitespace:
                    LOC += 1
                    break
    return LOC
if __name__ == "__main__":
    while True:
        file_loc = input("Enter the file name: ").strip()
        try:
            source = codecs.open(file_loc)
        except:
            print ("**Invalid filename**")
        else:
            break
    LOC_count = countLOC(source)
    print ("\nThere were {0} lines of code in {1}".format(LOC_count,source.name))
Error
  File "C:\Users\Justen-san\Documents\Eclipse Workspace\countLOC\src\root\nested\linesOfCode.py", line 12, in countLOC
    comment = line.find('#')
TypeError: expected an object with the buffer interface
Use the built-in function open() instead of codecs.open().
You're running afoul of the difference between non-Unicode (Python 3 bytes, Python 2 str) and Unicode (Python 3 str, Python 2 unicode) string types. Python 3 won't convert automatically between non-Unicode and Unicode like Python 2 will. Using codecs.open() without an encoding parameter returns an object which yields bytes when you read from it.
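The mismatch can be reproduced without any file. In this sketch a bytes literal stands in for a line read from a binary-mode file; the exact wording of the TypeError message varies between Python 3 versions, so only the exception type is relied on.

```python
raw = b"x = 1  # set x\n"     # what a binary-mode file yields
text = raw.decode("utf-8")    # what a text-mode file yields

str_on_str = text.find("#")        # str pattern on str: works
bytes_on_bytes = raw.find(b"#")    # bytes pattern on bytes: works

try:
    raw.find("#")                  # str pattern on bytes: rejected
    mixed_types_failed = False
except TypeError as exc:
    mixed_types_failed = True
    print("TypeError:", exc)

print(str_on_str, bytes_on_bytes, mixed_types_failed)
```

So besides switching to open(), decoding each line or searching for b'#' would also avoid the error, although text mode is the more natural fit for source files.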
Also, your countLOC function won't work:
for letter in range(comment):
    if not letter.whitespace:
        LOC += 1
        break
That for loop will iterate over the numbers from zero to one less than the position of '#' in the string (letter = 0, 1, 2...); whitespace isn't a method of integers, and even if it were, you're not calling it.
Also, you're never incrementing LOC if the line doesn't contain #.
A "fixed" but otherwise faithful (and inefficient) version of your countLOC:
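def countLOC(inFile):
    LOC = 0
    for line in inFile:
        if line.isspace():
            continue
        comment = line.find('#')
        if comment > 0:
            # count the line only if there is code before the '#'
            for letter in line[:comment]:
                if not letter.isspace():
                    LOC += 1
                    break
        elif comment < 0:
            # no '#' at all; a line starting with '#' (comment == 0) is skipped
            LOC += 1
    return LOC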
How I might write the function:
def count_LOC(in_file):
    loc = 0
    for line in in_file:
        line = line.lstrip()
        if len(line) > 0 and not line.startswith('#'):
            loc += 1
    return loc
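To try the shorter function without a file on disk, io.StringIO can stand in for an open text file. The function is repeated here so the sketch runs on its own, and the sample source lines are invented for the demonstration.

```python
import io

def count_LOC(in_file):
    loc = 0
    for line in in_file:
        line = line.lstrip()                      # drop leading indentation
        if len(line) > 0 and not line.startswith('#'):
            loc += 1                              # neither blank nor pure comment
    return loc

sample = io.StringIO(
    "# a full-line comment\n"
    "\n"
    "x = 1  # trailing comment\n"
    "    # indented comment\n"
    "print(x)\n"
)
loc_count = count_LOC(sample)
print(loc_count)
```

Only the two lines that contain code are counted. A '#' inside a string literal on a code line does not matter here, since the check looks only at the first non-blank character.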
Are you actually passing an open file to the function? Maybe try printing type(file) and type(line), as there's something fishy here -- with an open file as the argument, I just can't reproduce your problem! (There are other bugs in your code but none that would cause that exception). Oh btw, as best practice, DON'T use names of builtins, such as file, for your own purposes -- that causes incredible amounts of confusion!