Create dictionary from Fasta file - python-3.x

I recently picked up a program I started some time ago (sorting and listing of gene code). Since I am a beginner and couldn't find anything about this specific problem online, I need some help.
I want to create a dictionary from a FASTA (.fna) file that looks like this
(actual file uploaded, since the editor distorts everything I copy and paste; I hope that is allowed):
http://www.filedropper.com/firstreads
http://pastebin.com/NNXV09A7 <- smallReads
I know how to make a dictionary manually, but I have no idea how to combine that with reading the dictionary entries from a file.
I appreciate any help.
edit: manual dict from example above:
dict= {'ATTC': 'T', 'CATT': 'C'}
or using the constructor:
dict([('ATTC', 'T'), ('CATT', 'C')])
edit2:
Thanks to Byte Commander I was able to make the function by adding the parameters:
def makeSuffixDict (inputfile="smallReads.fna", n=15):
my_dict = {}
with open(inputfile) as file:
for line in file:
word = line.split()[-1]
my_dict[word[:-n]] = word[-n:]
return()
if __name__ == "__main__":
makeSuffixDict
for keys,values in my_dict.items():
print(keys)
print(values)
print('\n')
However, when I try to change the suffix length, I get the same result. How can I make the suffix length variable?
There is also one small issue: the "'': '5'" entry at the beginning. Why is it there, and can I make it not show up?
{'': '5', 'ATTG': 'T', 'GCCC': 'T', 'TTTT': 'T', 'AGTC': 'C'}
Similarly, when I try another file with 30000 reads instead of 5, numbers pop up every now and then and I have no clue where they come from.
Example:
CAAGATCTAATATGAATTACAGAGAGCTGTTCAGCAAATACTTGTTGCATCAATGGAATTACAGCAGTAACACATATATTGACCTGGAACCAGAATCATGTTCTGAATGCAGAAGTACGTACTTTCTTTTTCTTTCTTGAGAACGCTGGATCTTTTTTAAAATGTTAATTTGCAGTTTGAAGCTGTTTAGGTTAAAAAAAAAATACAAGAAGCAGCAGCAAAAGAGACC : A
2407 : 9
ATTCTTTCATACCATTAAATATTTATTTTTCAAAACTGATCTTAGTAGAGGCCTAGTACTGTCTCATATAAATATAGGATAATATATATAATAAATCCCCTGACATCAGACATTAAGGTTACTCCCAATTACTTATTATCTTTATATATATGTTAAAAATATGTGTGTATAATATGTAAGTAAACAATTTGCATAGTTTATATGTGGTAATATATGGTTAATATATAGG : C

# create an empty dictionary:
my_dict = {}
# open the file:
with open("file.fna") as file:
    # read the file line by line:
    for line in file:
        # split at whitespace and take the last word:
        word = line.split()[-1]
        # add entry to dictionary: all but the last character -> key; last character -> value
        my_dict[word[:-1]] = word[-1]
See this code running on ideone.com (reading from STDIN instead of file though...)
Update:
If you want variable length suffixes, replace the last line above with this, where n is the length of the suffix which becomes the dictionary value:
my_dict[word[:-n]] = word[-n:]
See this code running on ideone.com (reading from STDIN instead of file though...)
Update 2:
Your code as stated in the question has some problems with the indentation. Also, there are no parentheses after makeSuffixDict in the main block, but you need them to call a function. You also need to return the dictionary created in the function (return my_dict) so you can work with it outside.
I also now parse the 3rd whitespace-separated word in each row instead of the last one. Lines with not exactly 3 words are ignored.
Here's my version:
def makeSuffixDict (inputfile="smallReads.fna", n=15):
    my_dict = {}
    with open(inputfile) as file:
        for line in file:
            words = line.split()
            if len(words) == 3: # <-- replaced "last word" with "3rd word"
                word = words[2]
                my_dict[word[:-n]] = word[-n:]
    return my_dict
if __name__ == "__main__":
    my_dict = makeSuffixDict()
    for key,value in my_dict.items():
        print(key, value)
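The stray '': '5' entry from the question can be traced with a small in-memory example. The sketch below assumes (as a guess at the layout, not something taken from the uploaded file) that some lines hold only a number while read lines have three whitespace-separated fields. The name make_suffix_dict and the sample data are made up for illustration.

```python
from io import StringIO

# Hypothetical sample: one number-only line followed by two three-field lines.
SAMPLE = "5\n1 5 ATTCT\n2 5 CATTC\n"


def make_suffix_dict(lines, n=1):
    """Map each read without its last n characters to those n characters."""
    result = {}
    for line in lines:
        words = line.split()
        if len(words) == 3:  # skip lines that do not have exactly three fields
            read = words[2]
            result[read[:-n]] = read[-n:]
    return result


# Taking the last word of a number-only line is what yields the '' -> '5' entry.
last_word = "5\n".split()[-1]
print(repr(last_word[:-1]), repr(last_word[-1]))

# The suffix length changes once n is passed in the call.
print(make_suffix_dict(StringIO(SAMPLE), n=1))
print(make_suffix_dict(StringIO(SAMPLE), n=2))
```

Note that n should be at least 1: with n=0 the slice word[:-0] is the empty string, so every read would end up under the same empty key.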

Related

Problem with reading text then put the text to the list and sort them in the proper way

Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
This is the question. My problem is that I cannot write proper code that gathers the right data: my code always gives me 4 different lists, one for each row!
This is my code:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line=line.rstrip()
    line =line.split()
    if line in last:
        print(true)
    else:
        lst.append(line)
print(lst)
The text is here; please copy and paste it into a text editor:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
You are not checking the presence of individual words in the list, but rather the presence of the entire list of words in that line.
With some modifications, you can achieve what you are trying to do this way:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
print(lst)
However, a few things I would like to point out looking at your code:
Why are you using rstrip() instead of strip()?
It is better to write lst = [] than lst = list(): the literal is shorter, faster and more Pythonic. A more descriptive name such as words would also be clearer than lst; just do not call it list, because that would shadow the built-in type.
You may want to remove punctuation marks attached to words, e.g. ,.: which do not get removed by split().
If you want a loop body to not do anything, use pass. Why are you printing true? Also, in Python, it's True and not true.
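The points above can be combined into one short sketch. It swaps the list for a set (so no membership test is needed) and strips punctuation with string.punctuation; the function name unique_sorted_words and the in-memory sample are assumptions made for illustration, not part of the original exercise.

```python
import string
from io import StringIO


def unique_sorted_words(lines):
    """Collect each distinct word once and return the words sorted."""
    words = set()
    for line in lines:
        for word in line.split():
            # Remove punctuation attached to either end of the word.
            word = word.strip(string.punctuation)
            if word:
                words.add(word)
    return sorted(words)


# Two lines of the text from the question, held in memory instead of a file.
text = StringIO(
    "But soft what light through yonder window breaks\n"
    "It is the east and Juliet is the sun\n"
)
result = unique_sorted_words(text)
print(result)
```

Because sorted compares strings by code point, capitalized words come before lowercase ones in the result.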

How to split strings from .txt file into a list, sorted from A-Z without duplicates?

For instance, the .txt file includes 2 lines, separated by commas:
John, George, Tom
Mark, James, Tom,
Output should be:
[George, James, John, Mark, Tom]
The following will create the list and store each item as a string.
def test(path):
    filename = path
    with open(filename) as f:
        f = f.read()
    f_list = f.split('\n')
    # drop empty lines (build a new list rather than removing while iterating)
    f_list = [i for i in f_list if i != '']
    res1 = []
    for i in f_list:
        res1.append(i.split(', '))
    res2 = []
    for i in res1:
        res2 += i
    res3 = [i.strip(',') for i in res2]
    # remove duplicates, then sort
    res3 = sorted(set(res3))
    return res3
print(test('location/of/file.txt'))
Output:
['George', 'James', 'John', 'Mark', 'Tom']
Your file opening is fine, although the 'r' is redundant since that's the default. You claim it's not, but it is. Read the documentation.
You have not described what task is so I have no idea what's going on there. I will assume that it is correct.
Rather than populating a list and doing a membership test on every iteration - which is O(n^2) in time - can you think of a different data structure that guarantees uniqueness? Google will be your friend here. Once you discover this data structure, you will not have to perform membership checks at all. You seem to be struggling with this concept; the answer is a set.
The input data format is not rigorously defined. Separators may be commas or commas with trailing spaces, and may appear (or not) at the end of the line. Consider making an appropriate regular expression and using its splitting feature to split individual lines, though normal splitting and stripping may be easier to start.
In the following example code, I've:
ignored task since you've said that that's fine;
separated actual parsing of file content from parsing of in-memory content to demonstrate the function without a file;
used a set comprehension to store unique results of all split lines; and
passed sorted a generator expression that drops empty strings.
from io import StringIO
from typing import TextIO, List
def parse(f: TextIO) -> List[str]:
    words = {
        word.strip()
        for line in f
        for word in line.split(',')
    }
    return sorted(
        word for word in words if word != ''
    )
def parse_file(filename: str) -> List[str]:
    with open(filename) as f:
        return parse(f)
def test():
    f = StringIO('John, George , Tom\nMark, James, Tom, ')
    words = parse(f)
    assert words == [
        'George', 'James', 'John', 'Mark', 'Tom',
    ]
    f = StringIO(' Han Solo, Boba Fet \n')
    words = parse(f)
    assert words == [
        'Boba Fet', 'Han Solo',
    ]
if __name__ == '__main__':
    test()
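The regular-expression splitting mentioned earlier in this answer could look roughly like the following. This is only a sketch: the helper name split_names is made up, and the pattern assumes the separators are commas with optional surrounding whitespace.

```python
import re


def split_names(line):
    """Split one line on commas, ignoring whitespace around each comma."""
    parts = re.split(r'\s*,\s*', line.strip())
    # A trailing comma leaves an empty string at the end; drop it.
    return [part for part in parts if part]


first = split_names("John, George , Tom\n")
second = split_names("Mark, James, Tom,")
print(first)
print(second)
```

The set comprehension above could call this helper in place of line.split(',') and word.strip().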
I came up with a very simple solution, in case anyone needs it (x is the open file object, and the lines below belong inside a function):
lines = x.read().split()
lines.sort()
new_list = []
[new_list.append(word) for word in lines if word not in new_list]
return new_list
with open("text.txt", "r") as fl:
    list_ = set()
    for line in fl.readlines():
        line = line.strip("\n")
        line = line.split(",")
        [list_.add(_) for _ in line if _ != '']
print(list_)
I think that you missed a comma after Jim in the first line.
You can avoid the loop by using the split method:
content=file.read()
my_list=content.split(",")
To remove duplicates from your list, you can convert it to a set:
my_list=list(set(my_list))
Then you can sort it using sorted.
So the final code is:
with open("file.txt", "r") as file:
    content=file.read()
# turn line breaks into commas so that names on adjacent lines are not glued together
my_list=content.replace("\n",",").replace(" ", "").split(",")
# drop the empty strings left by trailing commas, then sort
result=sorted(set(my_list) - {""})
You can also pass a key to sorted.
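As a small illustration of the key argument, the sketch below sorts case-insensitively with key=str.lower. The sample names are made up, with mixed capitalization chosen only to make the difference visible.

```python
# Hypothetical names with mixed capitalization.
names = ["tom", "George", "james", "Mark"]

# Default ordering compares code points, so uppercase sorts before lowercase.
default_order = sorted(names)

# key=str.lower compares the lowercased form but returns the original strings.
ignore_case = sorted(names, key=str.lower)

print(default_order)
print(ignore_case)
```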

python: How to read a file and store each line using map function?

I'm trying to rewrite a program of mine, getting rid of all for loops.
The original code reads a file with thousands of lines that are structured like:
Ex. 2 lines of a file:
As you can see, the first line starts with LPPD;LEMD and the second line starts with DAAE;LFML. I'm only interested in the very first and second element of each line.
The original code I wrote is:
# Libraries
import sys
from collections import Counter
import collections
from itertools import chain
from collections import defaultdict
import time
# START
# #time=0
start = time.time()
# Defining default program argument
if len(sys.argv)==1:
    fileName = "file.txt"
else:
    fileName = sys.argv[1]
takeOffAirport = []
landingAirport = []
# Reading file
lines = 0 # Counter for file lines
try:
    with open(fileName) as file:
        for line in file:
            words = line.split(';')
            # Relevant data, item1 and item2 from each file line
            origin = words[0]
            destination = words[1]
            # Populating lists
            landingAirport.append(destination)
            takeOffAirport.append(origin)
            lines += 1
except IOError:
    print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
airports_dict = defaultdict(list)
# Merge lists into a dictionary key:value
for key, value in chain(Counter(takeOffAirport).items(),
                        Counter(landingAirport).items()):
    # 'AIRPOT_NAME':[num_takeOffs, num_landings]
    airports_dict[key].append(value)
# Sum key values and add it as another value
for key, value in airports_dict.items():
    #'AIRPOT_NAME':[num_totalMovements, num_takeOffs, num_landings]
    airports_dict[key] = [sum(value),value]
# Sort dictionary by the top 10 total movements
airports_dict = sorted(airports_dict.items(),
                       key=lambda kv:kv[1], reverse=True)[:10]
airports_dict = collections.OrderedDict(airports_dict)
# Print results
print("\nAIRPORT"+ "\t\t#TOTAL_MOVEMENTS"+ "\t#TAKEOFFS"+ "\t#LANDINGS")
for k in airports_dict:
    print(k,"\t\t", airports_dict[k][0],
          "\t\t\t", airports_dict[k][1][1],
          "\t\t", airports_dict[k][1][0])
# #time=1
end = time.time()- start
print("\nAlgorithm execution time: %0.5f" % end)
print("Total number of lines read in the file: %u\n" % lines)
airports_dict.clear()
takeOffAirport.clear()
landingAirport.clear()
My goal is to simplify the program using map, reduce and filter. So far I have sorted out the creation of the two independent lists, one with the first element of each file line and another with the second element of each file line, by using:
# Creates two independent lists with the first and second element from each line
takeOff_Airport = list(map(lambda sub: (sub[0].split(';')[0]), lines))
landing_Airport = list(map(lambda sub: (sub[0].split(';')[1]), lines))
I was hoping to find a way to open the file and achieve the exact same result as the original code by opening the file through a map() function, so that I could pass each list to the maps defined above, takeOff_Airport and landing_Airport.
So if we have a file as such
line 1
line 2
line 3
line 4
and we do like this
open(file_name).read().split('\n')
we get this
['line 1', 'line 2', 'line 3', 'line 4', '']
Is this what you wanted?
Edit 1
I feel this is somewhat redundant, but since map applies a function to each element of an iterable, we have to put our file name in a list, and of course define our function:
def open_read(file_name):
    return open(file_name).read().split('\n')
print(list(map(open_read, ['test.txt'])))
This gets us
>>> [['line 1', 'line 2', 'line 3', 'line 4', '']]
So first off, calling split('\n') on each line is silly; the line is guaranteed to have at most one newline, at the end, and nothing after it, so you'd end up with a bunch of ['all of line', ''] lists. To avoid the empty string, just strip the newline. This won't leave each line wrapped in a list, but frankly, I can't imagine why you'd want a list of one-element lists containing a single string each.
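The difference is easy to see on a single line. This is only a minimal illustration with a made-up sample string:

```python
# One line as it would come out of iterating over a text file.
line = "line 1\n"

# Splitting on the newline leaves an empty string after the separator.
split_result = line.split('\n')

# Stripping removes the newline and keeps a plain string.
strip_result = line.strip('\n')

print(split_result)
print(strip_result)
```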
So I'm just going to demonstrate using map+strip to get rid of the newlines, using operator.methodcaller to perform the strip on each line:
from operator import methodcaller
def readFile(fileName):
    try:
        with open(fileName) as file:
            return list(map(methodcaller('strip', '\n'), file))
    except IOError:
        print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
Sadly, since your file is context managed (a good thing, just inconvenient here), you do have to listify the result; map is lazy, and if you didn't listify before the return, the with statement would close the file, and pulling data from the map object would die with an exception.
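The failure described here can be reproduced with a temporary file. In this sketch the helper names read_lazy and read_eager are hypothetical, and the file contents are made up; the point is only that the lazy map outlives the with block.

```python
import os
import tempfile
from operator import methodcaller


def read_lazy(path):
    # Returns the map object unevaluated; the file is closed on return.
    with open(path) as f:
        return map(methodcaller('strip', '\n'), f)


def read_eager(path):
    # list() consumes the map while the file is still open.
    with open(path) as f:
        return list(map(methodcaller('strip', '\n'), f))


# Write a small throwaway file to read back.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line 1\nline 2\n')
path = tmp.name

try:
    next(read_lazy(path))
    lazy_error = None
except ValueError as exc:  # reading from a closed file raises ValueError
    lazy_error = exc

eager = read_eager(path)
os.remove(path)

print(type(lazy_error).__name__)
print(eager)
```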
To get around that, you can implement it as a trivial generator function, so the generator context keeps the file open until the generator is exhausted (or explicitly closed, or garbage collected):
def readFile(fileName):
    try:
        with open(fileName) as file:
            yield from map(methodcaller('strip', '\n'), file)
    except IOError:
        print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
yield from will introduce a tiny amount of overhead over directly iterating the map, but not much, and now you don't have to slurp the whole file if you don't want to; the caller can just iterate the result and get a stripped line on each iteration without pulling the whole file into memory. It does have the slight weakness that opening the file will be done lazily, so you won't see the exception (if there is any) until you begin iterating. This can be worked around, but it's not worth the trouble if you don't really need it.
I'd generally recommend the latter implementation as it gives the caller flexibility. If they want a list anyway, they just wrap the call in list and get the list result (with a tiny amount of overhead). If they don't, they can begin processing faster, and have much lower memory demands.
Mind you, this whole function is fairly odd; replacing IOErrors with prints and (implicitly) returning None is hostile to API consumers (they now have to check return values, and can't actually tell what went wrong). In real code, I'd probably just skip the function and insert:
with open(fileName) as file:
    for line in map(methodcaller('strip', '\n'), file):
        # do stuff with line (with newline pre-stripped)
inline in the caller; maybe define strip_newline = methodcaller('strip', '\n') globally to use a friendlier name. It's not that much code, and I can't imagine that this specific behavior is needed in that many independent parts of your file, and inlining it removes the concerns about when the file is opened and closed.
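Coming back to the goal stated in the question, the counting itself can also be written without explicit for loops. The following is a sketch rather than a drop-in replacement: the sample lines reuse the LPPD;LEMD and DAAE;LFML prefixes from the question, but the remaining fields and the third line are invented, and summing two Counter objects is a swapped-in technique instead of the question's defaultdict merge.

```python
from collections import Counter
from io import StringIO
from operator import itemgetter

# Hypothetical sample; a real run would iterate over an open file instead.
sample = StringIO("LPPD;LEMD;a\nDAAE;LFML;b\nLPPD;LFML;c\n")

# Split every line into its ';'-separated fields.
rows = list(map(lambda line: line.rstrip('\n').split(';'), sample))

# First field is the take-off airport, second field the landing airport.
takeoffs = Counter(map(itemgetter(0), rows))
landings = Counter(map(itemgetter(1), rows))

# Adding two Counters sums the counts per airport.
totals = takeoffs + landings
top10 = totals.most_common(10)

print(top10)
```

rows is materialized as a list because it is consumed twice, once per Counter.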

Find items in a text file that is a concatenated string of capitalized words that begin with a certain capital letter in python

I am trying to pull a string of input names that get saved to a text file. I need to pull them by a capital letter that the user inputs. For example, the saved text file contains the names DanielDanClark, and I need to pull the names that begin with D. I am stuck at this part:
for i in range(num):
    print("Name",i+1," >> Enter the name:")
    n=input("")
    names+=n
file=open("names.txt","w")
file.write(names)
lookUp=input("Did you want to look up any names?(Y/N)")
x= ord(lookUp)
if x == 110 or x == 78:
    quit()
else:
    letter=input("Enter the first letter of the names you want to look up in uppercase:")
    file=open("names.txt","r")
    fileNames=[]
    file.list()
    for letter in file:
        fileNames.index(letter)
    fileNames.close()
I know that the last 4 lines are probably way wrong. It is what I tried in my last failed attempt
Let's break down your code block by block:
num = 5
names = ""
for i in range(num):
    print("Name",i+1," >> Enter the name:")
    n=input("")
    names+=n
I took the liberty of giving num a value of 5 and names a value of "", just so the code will run. This block has no problems; it creates a string called names containing all the input taken. You might consider adding a delimiter, which makes it easier to read your data back. A suggestion would be to use \n, which is a line break, so that when you get to writing the file you actually have one name on each line. Example:
num = 5
names = ""
for i in range(num):
    print("Name",i+1," >> Enter the name:")
    n = input()
    names += n + "\n"
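If a file has already been written without a delimiter, as with the DanielDanClark example in the question, the names can still be recovered under one assumption: every name starts with a single capital letter followed only by lowercase letters. The sketch below relies on that assumption; the function name names_starting_with is made up for illustration.

```python
import re


def names_starting_with(text, letter):
    """Split concatenated capitalized names and keep those starting with letter."""
    # Each match is one capital letter followed by any lowercase letters.
    names = re.findall(r'[A-Z][a-z]*', text)
    return [name for name in names if name.startswith(letter.upper())]


matches = names_starting_with("DanielDanClark", "D")
print(matches)
```

Names such as McDonald would be split in two by this pattern, which is one more reason to store a delimiter.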
Now you are going to write the file:
file=open("names.txt","w")
file.write(names)
In this block you forget to close the file, and a better way is to fully specify the pathname of the file, example:
file = open(r"c:\somedir\somesubdir\names.txt","w")
file.write(names)
file.close()
or even better using with:
with open(r"c:\somedir\somesubdir\names.txt","w") as openfile:
    openfile.write(names)
The following block you are asking if the user want to lookup a name, and then exit:
lookUp=input("Did you want to look up any names?(Y/N)")
x= ord(lookUp)
if x == 110 or x == 78:
    quit()
First, you are using quit(), which should not be used in production code; you really should use sys.exit(), which means you need to import the sys module. You then get the numeric value of the answer, either N or n, and check it in an if statement. You do not need ord(); you can use a string comparison directly in your if statement. Example:
lookup = input("Did you want to look up any names?(Y/N)")
if lookup.lower() == "n":
    sys.exit()
Then you proceed to lookup the requested data, in the else: block of previous if statement:
else:
    letter=input("Enter the first letter of the names you want to look up in uppercase:")
    file=open("names.txt","r")
    fileNames=[]
    file.list()
    for letter in file:
        fileNames.index(letter)
    fileNames.close()
This does not really work either, and this is where the \n delimiter comes in handy. When a text file is opened, you can use a for line in file block to go through the file line by line, and with the \n delimiter added in your first block, each line is a name. The for letter in file block also does not do what you think it does: it iterates over the lines of the file and overwrites letter with each line, regardless of what you typed in the input earlier. Here is a working example:
letter = input("Enter the first letter of the names you want to look up in uppercase:")
result = []
with open(r"c:\somedir\somesubdir\names.txt", "r") as openfile:
    for line in openfile: ## loop thru the file line by line
        line = line.strip('\n') ## get rid of the delimiter
        if line[0].lower() == letter.lower(): ## compare the first (zero) character of the line
            result.append(line) ## append to result
print(result) ## do something with the result
Putting it all together:
import sys
num = 5
names = ""
for i in range(num):
    print("Name",i+1," >> Enter the name:")
    n = input("")
    names += n + "\n"
with open(r"c:\somedir\somesubdir\names.txt","w") as openfile:
    openfile.write(names)
lookup = input("Did you want to look up any names?(Y/N)")
if lookup.lower() == "n":
    sys.exit()
letter = input("Enter the first letter of the names you want to look up in uppercase:")
result = []
with open(r"c:\somedir\somesubdir\names.txt", "r") as openfile:
    for line in openfile:
        line = line.strip('\n')
        if line[0].lower() == letter.lower():
            result.append(line)
print(result)
One caveat I would like to point out: when you create the file, you open it in w mode, which creates a new file every time and therefore overwrites any previous file. If you would like to append to a file, open it in a mode instead, which appends to an existing file or creates a new one when the file does not exist.
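The difference between the two modes can be checked with a throwaway file. This is a minimal sketch; the temporary directory and the sample names are only there for illustration.

```python
import os
import tempfile

# Work in a fresh temporary directory so no real file is touched.
path = os.path.join(tempfile.mkdtemp(), "names.txt")

with open(path, "w") as f:   # creates the file
    f.write("Daniel\n")

with open(path, "w") as f:   # "w" truncates: Daniel is gone
    f.write("Dan\n")

with open(path, "a") as f:   # "a" keeps existing content and adds to the end
    f.write("Clark\n")

with open(path) as f:
    content = f.read()

print(repr(content))
```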

Python-Is it possible to index values in a specific line of a file?

I've opened a file for reading and got all the lines printed out.
6,78,84,78,100
146,90,100,90,90
149,91,134,95,80
641,79,115,70,111
643,100,120,100,90
I need to grab the first number in each line to create a dictionary key. The rest of the numbers are the values for the dictionary. Is there a way to use indexing with a for loop to grab each thing from the line?
I've tried using readlines() but that has gotten overly complicated in other ways that I won't go into detail. I would prefer to keep the lines as is and iterate over them if possible.
I tried:
fo=open('tester.csv','r')
def search(fo):
    for line in fo:
        key=line[0]
        value= (line[1],line[2],line[3],line[4])
I want my final output to be a dictionary= {6: (78,84,78,100)}
Are you trying to get an output like this?
OUTPUT:
['6', '1', '1', '6', '6']
Then,
f = open('data.csv')
result = []
for line in f:
    if line != '\n':
        result.append(line[0])
print(result)
t = open("tester.csv", "r")
tstuff = t.readlines()
outdict = {}
tstufflength = len(tstuff)
for i in tstuff:
    thing1, thing2 = i.split(",", 1)
    realthing2 = thing2.strip("\n")
    outdict[thing1]=realthing2
print(outdict)
Will only work if the lines are all on, well, different lines.
OUTPUT:
{'6': '78,84,78,100', '149': '91,134,95,80', '643': '100,120,100,90', '146': '90,100,90,90', '641': '79,115,70,111'}
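To get the integer key and tuple of integers that the question asks for, the lines have to be split on commas and converted. One way to sketch this uses the standard csv module; the function name rows_to_dict is made up, and the sample reuses two rows from the question in memory instead of reading tester.csv.

```python
import csv
from io import StringIO


def rows_to_dict(lines):
    """Map the first number of each row to a tuple of the remaining numbers."""
    result = {}
    for row in csv.reader(lines):
        if not row:  # csv.reader yields an empty list for a blank line
            continue
        key, *values = (int(cell) for cell in row)
        result[key] = tuple(values)
    return result


sample = StringIO("6,78,84,78,100\n\n146,90,100,90,90\n")
table = rows_to_dict(sample)
print(table)
```

Indexing the raw line, as in line[0], picks single characters, which is why the keys 146 and 149 both come out as '1' in the first attempt above.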

Resources