I am trying to write a function that returns a generator over every starting position of a k-length window in a DNA sequence. For each starting position, the generator should yield the nucleotide frequencies in that window as a dictionary.
def sliding(s,k):
    d = {}
    for i in range(len(s)-3):
        chunk = ''.join([s[i],s[i+(k-3)],s[i+(k-2)],s[i+(k-1)]])
        for j in chunk:
            if j not in d:
                d[j] = 1
            else:
                d[j] += 1
        yield d
seq = "ACGTTGCA"
for d in sliding(seq,4):
    print(d)
Output:
{'A': 1, 'C': 1, 'G': 1, 'T': 1}
{'A': 1, 'C': 2, 'G': 2, 'T': 3}
{'A': 1, 'C': 2, 'G': 4, 'T': 5}
{'A': 1, 'C': 3, 'G': 5, 'T': 7}
{'A': 2, 'C': 4, 'G': 6, 'T': 8}
Expected Output:
{'T': 1, 'C': 1, 'A': 1, 'G': 1}
{'T': 2, 'C': 1, 'A': 0, 'G': 1}
{'T': 2, 'C': 0, 'A': 0, 'G': 2}
{'T': 2, 'C': 1, 'A': 0, 'G': 1}
{'T': 1, 'C': 1, 'A': 1, 'G': 1}
However, as the output shows, my function reuses the same dictionary for every window, so the nucleotide counts accumulate under the same keys across iterations. Each window (chunk) should get its own dictionary.
You should initialize d inside the loop instead so that it starts with a new dict for each iteration:
for i in range(len(s) - 3):
    d = {}
    ...
If you want the dicts in the output to always have the same keys even if their values are 0, as suggested by your expected output, you can initialize a dict with all of the distinct letters as keys, and copy the dict to d for each iteration:
initialized_dict = dict.fromkeys(s, 0)
for i in range(len(s) - 3):
    d = initialized_dict.copy()
    ...
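Putting both suggestions together, a minimal sketch of the whole generator might look like the following. It assumes each window is the slice s[i:i + k], so the loop bound becomes len(s) - k + 1 rather than the hard-coded len(s) - 3. The names template and counts are illustrative and not from the original post.

```python
def sliding(s, k):
    """Yield a fresh nucleotide-count dict for every k-length window of s."""
    # Assumption: every distinct letter of s should appear as a key,
    # with 0 for letters that are absent from a given window.
    template = dict.fromkeys(sorted(set(s)), 0)
    for i in range(len(s) - k + 1):
        counts = template.copy()  # new dict per window, nothing carried over
        for base in s[i:i + k]:
            counts[base] += 1
        yield counts


windows = list(sliding("ACGTTGCA", 4))
for counts in windows:
    print(counts)
```

With seq = "ACGTTGCA" and k = 4 this yields five dictionaries whose counts match the expected output in the question, although the key order may differ.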
To sort the characters of a string by how often they appear, in descending order, I've developed the following algorithm.
First I convert the string to a dictionary, using each char as a key and its frequency of appearance as the value. Afterwards I convert the dictionary to a nested list sorted in descending order of frequency.
I'd like to know how to improve the algorithm. Was it a good approach? Can it be done differently? All proposals are welcome.
#Libraries
from operator import itemgetter
# START
# Function
# String to Dict. Value as freq.
# of appearance and char as key.
def frequencyChar(string):
    #string = string.lower() # Optional
    freq = 0
    thisDict = {}
    for char in string:
        if char.isalpha(): # just chars
            freq = string.count(char)
            thisDict[char] = freq # {key:value}
    return(thisDict)
str2Dict = frequencyChar("Would you like to travel with me?")
#print(str2Dict)
# Dictionary to list
list_key_value = [[k,v] for k, v in str2Dict.items()]
# Descending sorted list
list_key_value = sorted(list_key_value, key=itemgetter(1), reverse=True)
print("\n", list_key_value, "\n")
#END
You're doing way too much work. collections.Counter counts things for you automatically, and its repr (like its most_common() method) lists the entries in descending order of frequency:
from collections import Counter
s = "Would you like to travel with me?"
freq = Counter(s)
# Counter({' ': 6, 'o': 3, 'l': 3, 'e': 3, 't': 3, 'u': 2, 'i': 2, 'W': 1, 'd': 1, 'y': 1, 'k': 1, 'r': 1, 'a': 1, 'v': 1, 'w': 1, 'h': 1, 'm': 1, '?': 1})
If you want to remove the spaces from the count:
del freq[' ']
# Counter({'o': 3, 'l': 3, 'e': 3, 't': 3, 'u': 2, 'i': 2, 'W': 1, 'd': 1, 'y': 1, 'k': 1, 'r': 1, 'a': 1, 'v': 1, 'w': 1, 'h': 1, 'm': 1, '?': 1})
Also just in general, your algorithm is doing too much work. string.count iterates over the whole string for each character you're trying to count. Instead, you can iterate over the string just once, and for every letter increment the count stored under that letter (initializing it to 1 if it's a letter you haven't seen before). That's essentially what Counter is doing for you.
Spelling it out:
count = {}
for letter in the_string:
    if not letter.isalpha():
        continue
    if letter not in count:
        count[letter] = 1
    else:
        count[letter] += 1
And then to sort it you don't need to convert to a list first, you can just do it directly:
ordered = sorted(count.items(), key=itemgetter(1), reverse=True)
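For comparison, here is a short sketch of the same letters-only count done with Counter and a generator expression, using most_common() for the descending order. The variable names text and letter_counts are illustrative.

```python
from collections import Counter

text = "Would you like to travel with me?"

# Count only alphabetic characters in a single pass over the string.
letter_counts = Counter(ch for ch in text if ch.isalpha())

# most_common() returns (char, count) pairs, highest count first.
ordered = letter_counts.most_common()
print(ordered)
```

Letters with equal counts keep the order in which they were first seen, so the exact tie order depends on the input string.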
d = {'U': 4, '_': 2, 'C': 2, 'K': 1, 'D': 4, 'T': 6, 'Q': 1, 'V': 2, 'A': 9, 'F': 2, 'O': 8, 'J': 1, 'I': 9, 'N': 6, 'P': 2, 'S': 4, 'M': 2, 'W': 2, 'E': 12, 'Z': 1, 'G': 3, 'Y': 2, 'B': 2, 'L': 4, 'R': 6, 'X': 1, 'H': 2}
def __str__(self):
    omgekeerd = {}
    for sleutel, waarde in self.inhoud.items():
        letters = omgekeerd.get(waarde, '')
        letters += sleutel
        omgekeerd[waarde] = letters
    for aantal in sorted(omgekeerd):
        return '{}: {}'.format(aantal, ''.join(sorted(omgekeerd[aantal])))
I need to return each value, followed by a ':' and then by every letter that has that value.
The problem is that when I use return, it only returns one value instead of every value on a new line.
I can't use print() because that is not supported by the method __str__(self).
The return statement ends function execution and specifies a value to be returned to the function caller.
I believe your code terminates too early because the return statement sits inside the loop.
What you could do is store what you would like to return in a separate list/dictionary and then, when everything is done, return the new dict/list that holds the results.
If I understood you correctly, this is what you might be looking for (Python 2 syntax):
def someFunc():
    d = {'U': 4, '_': 2, 'C': 2, 'K': 1, 'D': 4, 'T': 6, 'Q': 1, 'V': 2, 'A': 9,
         'F': 2, 'O': 8, 'J': 1, 'I': 9, 'N': 6, 'P': 2, 'S': 4, 'M': 2, 'W': 2, 'E': 12,
         'Z': 1, 'G': 3, 'Y': 2, 'B': 2, 'L': 4, 'R': 6, 'X': 1, 'H': 2}
    result = {}
    for key, value in d.iteritems():
        result[value] = [k for k,v in d.iteritems() if v == value]
    return result
# call function and iterate over given dictionary
for key, value in someFunc().iteritems():
    print key, value
Result:
1 ['K', 'J', 'Q', 'X', 'Z']
2 ['C', 'B', 'F', 'H', 'M', 'P', 'W', 'V', 'Y', '_']
3 ['G']
4 ['D', 'L', 'S', 'U']
6 ['N', 'R', 'T']
8 ['O']
9 ['A', 'I']
12 ['E']
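Applied to the __str__ method from the question, the same idea can be sketched in Python 3 as follows: collect one line per count in a list and return them joined with newlines. The class name LetterCounts and its constructor are hypothetical; only the inhoud attribute comes from the question, and the sample data is a small subset of the dictionary shown there.

```python
class LetterCounts:
    """Hypothetical container; only the `inhoud` attribute comes from the question."""

    def __init__(self, inhoud):
        self.inhoud = inhoud  # maps letter -> count

    def __str__(self):
        # Group the letters by their count.
        omgekeerd = {}
        for sleutel, waarde in self.inhoud.items():
            omgekeerd.setdefault(waarde, []).append(sleutel)
        # Build one line per count, then return all lines in a single string.
        regels = [
            '{}: {}'.format(aantal, ''.join(sorted(omgekeerd[aantal])))
            for aantal in sorted(omgekeerd)
        ]
        return '\n'.join(regels)


text = str(LetterCounts({'U': 4, 'K': 1, 'D': 4, 'Q': 1, 'E': 12}))
print(text)
```

Because the method returns only once, after the loop has finished, print() on the object shows every count on its own line.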
The list is this:
List1 = ['a','b','c','d','e','f','g','h','h','i','j','k','l','m','n']
I am hoping for an outcome where each item is paired with the number of times it appears in the list, e.g.:
List1 = ['a:1']
without importing Counter from the collections module.
You could pass this generator expression to dict():
dict((x, List1.count(x)) for x in set(List1))
Example output:
{'d': 1, 'f': 1, 'l': 1, 'c': 1, 'j': 1, 'e': 1, 'i': 1, 'a': 1, 'h': 2, 'b': 1, 'm': 1, 'n': 1, 'k': 1, 'g': 1}
(Edited to match edited question.)
Use a dictionary comprehension and count.
>>> List1 = ['a','b','c','d','e','f','g','h','h','i','j','k','l','m','n']
>>> mapping = {v: List1.count(v) for v in List1}
>>> mapping
{'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1,
'g': 1, 'h': 2, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1}
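Both answers above call List1.count() once per element, which rescans the list every time. As an alternative sketch that still avoids Counter, a single pass with dict.get also works, and the 'item:count' strings from the question can then be built from the result. The names counts and labelled are illustrative.

```python
List1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'h', 'i', 'j', 'k', 'l', 'm', 'n']

# One pass over the list: start each item at 0 and add 1 per occurrence.
counts = {}
for item in List1:
    counts[item] = counts.get(item, 0) + 1

# Format as 'item:count' strings, as in the question's example.
labelled = ['{}:{}'.format(item, n) for item, n in counts.items()]
print(labelled)
```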
Let's say I have multiple lists of lists. I'll include a shortened version of three of them in this example.
list1=[['name', '1A5ZA'], ['length', 83], ['A', 28], ['V', 31], ['I', 24]]
list2=[['name', '1AJ8A'], ['length', 49], ['A', 18], ['V', 11], ['I', 20]]
list3=[['name', '1AORA'], ['length', 96], ['A', 32], ['V', 49], ['I', 15]]
All of the lists are in the same format: they have the same number of nested lists, with the same labels.
I generate each of these lists with the following function:
def GetResCount(sequence):
    residues=[['A',0],['V',0],['I',0],['L',0],['M',0],['F',0],['Y',0],['W',0],
              ['S',0],['T',0],['N',0],['Q',0],['C',0],['U',0],['G',0],['P',0],['R',0],
              ['H',0],['K',0],['D',0],['E',0]]
    name=sequence[0:5]
    AAseq=sequence[27:]
    for AA in AAseq:
        for n in range(len(residues)):
            if residues[n][0] == AA:
                residues[n][1]=residues[n][1]+1
    length=len(AAseq)
    nameLsit=(['name', name])
    lengthList=(['length', length])
    residues.insert(0,lengthList)
    residues.insert(0,nameLsit)
    return residues
The script takes a sequence such as this
1A5ZA:A|PDBID|CHAIN|SQUENCEMKIGIVGLGRVGSSTAFAL
and creates a list similar to the ones mentioned above.
As each individual list is generated, I would like to append it to a final form, such that all of them combined look like this:
final=[['name', '1A5ZA', '1AJ8A', '1AORA'], ['length', 83, 49, 96], ['A', 28, 18, 32], ['V', 31, 11, 49], ['I', 24, 20, 15]]
Maybe the final form of the data isn't in the right format. I am open to suggestions on how to format it better.
To summarize, the script should take a sequence of letters with the name of the sequence at the beginning, count the occurrences of each letter within the sequence as well as the overall sequence length, and output the name, length, and letter frequencies to a list. Then it should combine the info from each sequence into a larger list (maybe a dictionary?).
At the very end all of this info will go into a spreadsheet that will look like this:
name   length  A   V   I
1A5ZA  83      28  31  24
1AJ8A  49      18  11  20
1AORA  96      32  49  15
I'm including this last bit because maybe I'm not starting in the right way to end up with what I want.
Anyway, I hope you made it here, and thanks for the help!
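For the final layout described in the question, one possible sketch is to zip the per-sequence lists together, which relies on the stated guarantee that every list has the same labels in the same order. The helper name combine is hypothetical.

```python
list1 = [['name', '1A5ZA'], ['length', 83], ['A', 28], ['V', 31], ['I', 24]]
list2 = [['name', '1AJ8A'], ['length', 49], ['A', 18], ['V', 11], ['I', 20]]
list3 = [['name', '1AORA'], ['length', 96], ['A', 32], ['V', 49], ['I', 15]]


def combine(*per_sequence):
    """Merge same-shaped [label, value] lists into [label, v1, v2, ...] rows."""
    final = []
    # zip(*...) walks the lists in parallel, one label at a time.
    for rows in zip(*per_sequence):
        label = rows[0][0]
        final.append([label] + [value for _, value in rows])
    return final


final = combine(list1, list2, list3)
print(final)
```

This reproduces the nested-list form from the question, though it does not check that the labels actually agree across the lists.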
So if you are looking for a table then a dict might be a better approach. (Note: collections.Counter does the same as your counting), e.g.:
from collections import Counter
def GetResCount(sequence):
    name, AAseq = sequence[0:5], sequence[27:]
    residuals = {'name': name, 'length': len(AAseq), 'A': 0, 'V': 0, 'I': 0, 'L': 0,
                 'M': 0, 'F': 0, 'Y': 0, 'W': 0, 'S': 0, 'T': 0, 'N': 0, 'Q': 0, 'C': 0,
                 'U': 0, 'G': 0, 'P': 0, 'R': 0, 'H': 0, 'K': 0, 'D': 0, 'E': 0}
    residuals.update(Counter(AAseq))
    return residuals
In []:
GetResCount('1A5ZA:A|PDBID|CHAIN|SQUENCEMKIGIVGLGRVGSSTAFAL')
Out[]:
{'name': '1A5ZA', 'length': 19, 'A': 2, 'V': 2, 'I': 2, 'L': 2, 'M': 1, 'F': 1, 'Y': 0,
'W': 0, 'S': 2, 'T': 1, 'N': 0, 'Q': 0, 'C': 0, 'U': 0, 'G': 4, 'P': 0, 'R': 1,
'H': 0, 'K': 1, 'D': 0, 'E': 0}
Note: the keys may only come out in the order you are looking for on Py3.6+ (where dicts keep insertion order), but we can fix that later when we create the table if necessary.
Then you can create a list of the dicts, e.g. (assuming you are reading these lines from a file):
with open(<file>) as file:
    data = [GetResCount(line.strip()) for line in file]
Then you can load it directly into pandas, e.g.:
In []:
import pandas as pd
columns = ['name', 'length', 'A', 'V', 'I', ...] # columns = list(data[0].keys()) - Py3.6+
df = pd.DataFrame(data, columns=columns)
print(df)
Out[]:
    name  length   A   V   I  ...
0  1A5ZA      83  28  31  24  ...
1  1AJ8A      49  18  11  20  ...
2  1AORA      96  32  49  15  ...
...
You could also just dump it out to a file with csv.DictWriter():
from csv import DictWriter
fieldnames = ['name', 'length', 'A', 'V', 'I', ...]
with open(<output>, 'w', newline='') as file:
    writer = DictWriter(file, fieldnames)
    writer.writeheader()  # write the header row before the data rows
    writer.writerows(data)
Which would output something like:
name,length,A,V,I,...
1A5ZA,83,28,31,24,...
1AJ8A,49,18,11,20,...
1AORA,96,32,49,15,...
...
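As a self-contained way to try the DictWriter approach without touching the file system, the sketch below writes to an in-memory io.StringIO buffer instead of a real file. The three rows reuse the shortened values from the question, and limiting the columns to name, length, A, V and I is an assumption made for brevity.

```python
import csv
import io

# Shortened rows taken from the question; real rows would carry all residues.
data = [
    {'name': '1A5ZA', 'length': 83, 'A': 28, 'V': 31, 'I': 24},
    {'name': '1AJ8A', 'length': 49, 'A': 18, 'V': 11, 'I': 20},
    {'name': '1AORA', 'length': 96, 'A': 32, 'V': 49, 'I': 15},
]
fieldnames = ['name', 'length', 'A', 'V', 'I']

buffer = io.StringIO()  # stands in for an open output file
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()    # header row with the column names
writer.writerows(data)  # one CSV line per dict

csv_text = buffer.getvalue()
print(csv_text)
```

If the dicts contain keys that are not listed in fieldnames, DictWriter raises a ValueError by default; passing extrasaction='ignore' is one way to drop the extra columns instead.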