Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function that counts how many times each sequence ID occurs in this file (a sample BLAST output file).
The picture above is the input file I'm dealing with.
def count_seq(input):
    dic1={}
    count=0
    for line in input:
        if line.startswith('#'):
            continue
        if line.find('hits found'):
            line=line.split('\t')
            if line[1] in dic1:
                dic1[line]+=1
            else:
                dic1[line]=1
    return dic1
Above is my code, which just returns empty brackets {} when called.
I'm trying to count how many times each of the sequence IDs (the second element of the last 13 lines) occurs, e.g. FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!

Maybe this is what you're after:
def count_seq(input_file):
    dic1 = {}
    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            # skip blank lines and '#' comment lines
            if line and not line.startswith('#'):
                fields = line.split()
                seq_id = fields[1]  # the sequence ID is the second column
                if seq_id not in dic1:
                    dic1[seq_id] = 1
                else:
                    dic1[seq_id] += 1
    return dic1
print(count_seq("blast_file"))
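As a side note on why the version in the question misbehaves, here is a minimal sketch of two pitfalls in it, assuming the records are tab-separated. The sample line is made up for illustration; only the ID FO203510.1 comes from the question.

```python
# Made-up tab-separated record; only the ID FO203510.1 appears in the question.
line = "query1\tFO203510.1\t98.5"

# str.find returns an index, or -1 when the substring is absent.
# -1 is truthy, so "if line.find(...)" is true for lines WITHOUT the text.
found_index = line.find("hits found")
is_present = "hits found" in line  # a membership test says what was meant

fields = line.split("\t")
counts = {}
try:
    counts[fields] = 1  # the whole list as a key: lists are unhashable
except TypeError as exc:
    error_message = str(exc)

# Keying on the ID string (second field) works.
counts[fields[1]] = counts.get(fields[1], 0) + 1
print(found_index, is_present, counts)
```

So the check is better written with `in` (or `not in`), and the dictionary key should be `line[1]` rather than the split list itself.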

This is a good fit for collections.defaultdict. Let f be the open file object. Assuming the sequence IDs are in the second column, it takes only a few lines of code:
from collections import defaultdict
d = defaultdict(int)
# second column of every non-blank line that is not a '#' comment
seqs = (line.split()[1] for line in f if line.strip() and not line.strip().startswith("#"))
for seq in seqs:
    d[seq] += 1
See if it works!
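The same idea can also be written with collections.Counter. The following is only a sketch: the helper name count_seq_ids and the sample lines (including the ID XX000000.1) are hypothetical, and it assumes the ID is the second whitespace-separated field.

```python
from collections import Counter


def count_seq_ids(lines):
    """Count the second whitespace-separated field of every data line."""
    ids = (
        line.split()[1]
        for line in lines
        # skip blank lines and '#' comment lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    return Counter(ids)


# Hypothetical sample data standing in for the BLAST file.
sample = [
    "# header comment\n",
    "# 3 hits found\n",
    "query1\tFO203510.1\t99.0\n",
    "query1\tFO203510.1\t97.5\n",
    "query1\tXX000000.1\t90.1\n",
    "\n",
]
counts = count_seq_ids(sample)
print(counts.most_common())
```

Because any iterable of lines works, an open file object can be passed in place of the list.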

Related

TypeError: string indices must be integers --> Python

I want to write a Python function that reads each
character of a text file and then counts and displays
the occurrences of the letters E and T separately (including
lowercase e and t).
def test():
    f = open("poem.txt",'r')
    count = 0
    count1 =0
    try:
        line = f.readlines()
        for i in line:
            for x in line:
                if (i[x] in 'Ee'):
                    count+=1
                else:
                    if (i[x] in 'Tt'):
                        count1+=1
        print("E or e",count)
        print("T or t",count1)
    except EOFError:
        f.close()
test()
This is what I tried, and it gave:
File "/Users/ansusinha/Desktop/Tution/Untitled15.py", line 23, in test
if (i[x] in 'Ee'):
TypeError: string indices must be integers
What am I missing here?

You are missing the fact that Python strings come with a .count() method.
You can read the entire file with
file_as_string = f.read()
and then count occurrences of any substring with
amount_of_E = file_as_string.count('E')
Check out str.count in Python documentation.
With
amount_of_Ee = file_as_string.lower().count('e')
you count occurrences of both E and e and with
amount_of_Tt = file_as_string.lower().count('t')
you are done with counting using two lines of code.
In your own code you try to index a string with another string, but string indices must be integers.
With for x in line: you actually wanted for x in i:, so that x is a single character of the line i, which you could then test directly with if x in 'eE':.
But there is no need for the loops at all, since Python strings come with the .count() method, so just use it.
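Put together, the str.count approach might look like the sketch below. The function name count_letters and the sample text are made up, and a temporary file stands in for poem.txt so the example runs on its own.

```python
import os
import tempfile


def count_letters(path, letters="et"):
    """Return {letter: count} for each letter, ignoring case."""
    with open(path, "r") as f:
        text = f.read().lower()
    return {letter: text.count(letter) for letter in letters.lower()}


# Write a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("The tree\nEats Ten\n")
    sample_path = tmp.name

result = count_letters(sample_path)
os.remove(sample_path)
print(result)  # {'e': 5, 't': 4}
```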

Because f.readlines() does not read a single line; it reads all lines and returns them as a list.
The code should look like this:
def test():
    count = 0
    count1 = 0
    # the with block closes the file even if reading fails
    with open("poem.txt", 'r') as f:
        lines = f.readlines()
    for line in lines:
        for char_in_line in line:
            if char_in_line in 'Ee':
                count += 1
            elif char_in_line in 'Tt':
                count1 += 1
    print("E or e", count)
    print("T or t", count1)
test()
If your poem.txt is this,
LaLaLa
I'm shoes.
Who are you?
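then in your original code i[x] is evaluated as something like "LaLaLa\n"["LaLaLa\n"], that is, a string indexed with another string, which raises the TypeError.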
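A small sketch of what happens, using the first line of the sample poem above. The variable names are illustrative only.

```python
line = "LaLaLa\n"

# Iterating over a string yields one-character strings, not positions.
chars = [ch for ch in line]

try:
    line[line]  # same shape as i[x] when x is itself a string
except TypeError as exc:
    message = str(exc)

# If positions are really needed, enumerate supplies integer indices.
positions_of_a = [idx for idx, ch in enumerate(line) if ch in "Aa"]
print(chars, message, positions_of_a)
```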

Count frequency of words under given index in a file

I am trying to count the occurrences of the words at a specific column index in my file and print the result as a dictionary.
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as file:
        content_of_file = file.readlines()
        dict_of_fruit_count = {}
        for line in content_of_file:
            line = line[0:-1]
            line = line.split("\t")
            for fruit in line:
                fruit = line[1]
                dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
        return dict_of_fruit_count
print(count_by_fruit())
Output: {'apple': 6, 'banana': 6, 'orange': 3}
I get this output, but it doesn't count the frequency of the words correctly. After searching around, I couldn't find a proper solution. Could anyone help me identify my mistake?
My file has the following content (the data is separated by tabs; I wrote "\t" in the example because Stack Overflow alters the formatting):
I am line one with \t apple \t from 2018
I am line two with \t orange \t from 2017
I am line three with \t apple \t from 2016
I am line four with \t banana \t from 2010
I am line five with \t banana \t from 1999

You are looping too many times over the same line. Notice that the results you are getting are all 3 times what you are expecting.
Also, in Python you do not need to read the entire file into memory first. Just iterate over the file object line by line.
Try:
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as f_in:
        dict_of_fruit_count = {}
        for line in f_in:
            fruit = line.split("\t")[1]
            dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
    return dict_of_fruit_count
Which can be further simplified to:
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name) as f_in:
        dict_of_fruit_count = {}
        for fruit in (line.split('\t')[1] for line in f_in):
            dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
    return dict_of_fruit_count
Or, if you can use Counter:
from collections import Counter
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name) as f_in:
        return dict(Counter(line.split('\t')[1] for line in f_in))

The problem is for fruit in line:. Splitting the lines on the tabs is going to split them into three parts. If you loop over those three parts every time, adding one to the count for each, then your counts are going to be 3 times as large as the actual data.
Below is how I would write this function, using generator expressions and Counter.
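from collections import Counter
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as file:
        # strip only the trailing newline, so a last line without one is not truncated
        lines = (line.rstrip('\n') for line in file)
        fruit = (line.split('\t')[1] for line in lines)
        # the generators are lazy, so consume them while the file is still open
        return Counter(fruit)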
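To see the tripling concretely, here is a small sketch that runs both versions on the five sample rows from the question. It assumes the fields are separated by bare tabs, and the two function names are made up for the comparison.

```python
from collections import Counter

# The five sample rows from the question, with real tab separators.
rows = [
    "I am line one with\tapple\tfrom 2018",
    "I am line two with\torange\tfrom 2017",
    "I am line three with\tapple\tfrom 2016",
    "I am line four with\tbanana\tfrom 2010",
    "I am line five with\tbanana\tfrom 1999",
]


def count_looping_over_fields(rows):
    """Reproduces the bug: one increment per field, so three per row."""
    counts = {}
    for row in rows:
        fields = row.split("\t")
        for _ in fields:
            counts[fields[1]] = counts.get(fields[1], 0) + 1
    return counts


def count_once_per_row(rows):
    """One increment per row, keyed on the second field."""
    return dict(Counter(row.split("\t")[1] for row in rows))


buggy = count_looping_over_fields(rows)
fixed = count_once_per_row(rows)
print(buggy)  # {'apple': 6, 'orange': 3, 'banana': 6}
print(fixed)  # {'apple': 2, 'orange': 1, 'banana': 2}
```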

Iterate N items at a time on a generator with single yield

How do I do that?
islice() returns n items at a time, but I can't figure out how to iterate with it.
Right now I do something like this:
# -*- coding: utf-8 -*-
'''
print 3 lines at a time.
'''
def myread(filename):
    with open(filename,'r',encoding='utf-8-sig') as f:
        for line in f:
            yield line.strip()
filename = 'test.txt'
temp = []
for res_a in myread(filename):
    temp.append(res_a)
    if len(temp)==3:
        print(temp)
        temp = []
print(temp)
Note that I don't know how big my text file is.

You can use itertools.islice and the two-argument form of iter, e.g.:
from itertools import islice
with open('file') as fin:
    # generator expression yielding stripped lines
    lines = (line.strip() for line in fin)
    # create a list of at most 3 lines from the file's current position
    # and use an empty list as the sentinel that signals when to stop (no more lines)
    for three in iter(lambda: list(islice(lines, 3)), []):
        print(three)
As a function:
def myread(filename):
    with open(filename) as fin:
        lines = (line.strip() for line in fin)
        yield from iter(lambda: list(islice(lines, 3)), [])

islice(itr, n) will only return an iterator that runs until it reaches the nth element of itr. You would have to keep rebuilding the islice iterator for every group of n elements you want to return. You might want to try the grouper recipe from the itertools documentation, which avoids this rebuilding:
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip_longest(*args, fillvalue=fillvalue)
To complete the example, you can filter out the fillvalues added to the output groups to get it to replicate the code provided by the OP:
for grp in grouper(myread(filename), 3):
    trimmed_grp = [line for line in grp if line is not None]
    print(trimmed_grp)
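If padding and then filtering the fill values feels roundabout, a small generator built on islice is another option. This is only a sketch: chunks is a hypothetical helper name, and numbers stands in for any generator that yields one item at a time, such as myread.

```python
from itertools import islice


def chunks(iterable, n):
    """Yield lists of up to n items; the last list may be shorter."""
    iterator = iter(iterable)  # one shared iterator, so each islice resumes
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


def numbers():
    """Stand-in for a generator that yields one item at a time."""
    for i in range(7):
        yield i


result = list(chunks(numbers(), 3))
print(result)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Only one chunk is held in memory at a time, so the size of the input does not need to be known in advance.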

Never resets list

I am trying to create a calorie counter. It is run with standard input redirected, like this:
python3 calories.txt < test.txt
Inside the calories file, each food is in the following format: apples 500
The problem I am having is that whenever I calculate the values for a person, the list never seems to go back to being empty.
import sys
food = {}
eaten = {}
finished = {}
total = 0
#mappings
def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1]
            food[key] = value
def calculate(x):
    a = []
    for keys,values in x.items():
        for c in values:
            try:
                a.append(int(food[c]))
            except:
                a.append(100)
    print("before",a)
    a = []
    total = sum(a) # Problem here
    print("after",a)
    print(total)
def main():
    calories(sys.argv[1])
    for line in sys.stdin:
        lines = line.strip().split(',')
        for c in lines:
            values = lines[0]
            keys = lines[1:]
            eaten[values] = keys
        calculate(eaten)
if __name__ == '__main__':
    main()
Edit - forgot to include what test.txt would look like:
joe,almonds,almonds,blue cheese,cabbage,mayonnaise,cherry pie,cola
mary,apple pie,avocado,broccoli,butter,danish pastry,lettuce,apple
sandy,zuchini,yogurt,veal,tuna,taco,pumpkin pie,macadamia nuts,brazil nuts
trudy,waffles,waffles,waffles,chicken noodle soup,chocolate chip cookie

How to make it easier on yourself:
When reading the calorie data, convert the calories to int() as soon as possible; there is no need to do it every time you want to sum something up.
Dictionaries have a .get(key, default) accessor, so "if the food is not found, use 100" becomes a one-liner without try: ... except:.
This works for me. I am not using sys.stdin; I pass the second file as an argument as well, instead of piping it into the program with <.
I modified some of the parsing to strip whitespace, and calculate now returns a list of (name, calories) tuples.
May it help you to fix it to your liking:
import sys
food = {}   # food name -> calories (int)
eaten = {}  # person -> list of foods eaten
def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1].strip() # ensure no whitespace in the value
            food[key] = int(value)
def getCal(foodlist, defValueUnknown = 100):
    """Get the total calories of a list of foods; unknown foods count as 100."""
    return sum(food.get(x, defValueUnknown) for x in foodlist) # if unknown, assume 100
def calculate(x):
    a = []
    for name,foods in x.items():
        a.append((name, getCal(foods))) # one (name, total) tuple per person
    return a
def main():
    calories(sys.argv[1])
    with open(sys.argv[2]) as f: # parse as file, not piped in via sys.stdin
        for line in f:
            lines = line.strip().split(',')
            values = lines[0].strip()
            keys = [x.strip() for x in lines[1:]] # ensure no surrounding whitespace
            eaten[values] = keys
    calced = calculate(eaten) # calculate after all are read into the dict
    print (calced)
if __name__ == '__main__':
    main()
Output:
[('joe', 1400), ('mary', 1400), ('sandy', 1600), ('trudy', 1000)]
Using sys.stdin and piping just led to my console blinking and waiting for manual input; maybe that is VS related...
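If you do want to keep reading from standard input, one option is to compute each person's total as its line is read, so that no list has to be reset between people. The sketch below is hedged: load_calories and totals_per_person are hypothetical names, the calorie values are made up, and io.StringIO stands in for sys.stdin.

```python
import io


def load_calories(lines):
    """Map food name -> calories from lines such as 'blue cheese 300'."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            table[" ".join(parts[:-1])] = int(parts[-1])
    return table


def totals_per_person(lines, table, default=100):
    """Yield (name, total) per input line; unknown foods count as default."""
    for line in lines:
        name, *foods = [field.strip() for field in line.strip().split(",")]
        if name:  # skip blank lines
            yield name, sum(table.get(food, default) for food in foods)


# Made-up calorie values; in the real program the lines come from the file.
table = load_calories(["almonds 200", "blue cheese 300"])
fake_stdin = io.StringIO("joe,almonds,almonds,blue cheese\nmary,avocado,almonds\n")
results = list(totals_per_person(fake_stdin, table))
print(results)  # [('joe', 700), ('mary', 300)]
```

In the real script, sys.stdin can be passed where fake_stdin is used here.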

How can I simplify and format this function?

So I have this messy code where I want to get every word from frankenstein.txt, sort the words alphabetically, eliminate one- and two-letter words, and write the result to a new file.
def Dictionary():
    d = []
    count = 0
    bad_char = '~!##$%^&*()_+{}|:"<>?\`1234567890-=[]\;\',./ '
    replace = ' '*len(bad_char)
    table = str.maketrans(bad_char, replace)
    infile = open('frankenstein.txt', 'r')
    for line in infile:
        line = line.translate(table)
        for word in line.split():
            if len(word) > 2:
                d.append(word)
                count += 1
    infile.close()
    file = open('dictionary.txt', 'w')
    file.write(str(set(d)))
    file.close()
Dictionary()
How can I simplify it and make it more readable, and how can I make the words appear one per line in the new file (currently it writes them as a horizontal list)? For example:
abbey
abhorred
about
etc....

A few improvements below:
from string import digits, punctuation
def create_dictionary():
    words = set()
    bad_char = digits + punctuation + '...'  # may need more characters
    replace = ' ' * len(bad_char)
    table = str.maketrans(bad_char, replace)
    with open('frankenstein.txt') as infile:
        for line in infile:
            line = line.strip().translate(table)
            for word in line.split():
                if len(word) > 2:
                    words.add(word)
    with open('dictionary.txt', 'w') as outfile:
        # writelines adds no line endings, so append '\n' to get one word per line
        outfile.writelines(word + '\n' for word in sorted(words))  # note 'lines'
A few notes:
follow the style guide;
string contains constants you can use to provide the "bad characters";
you never used count (which was just len(d) anyway);
use the with context manager for file handling; and
using a set from the start prevents duplicates, but they aren't ordered (hence sorted).

Using the re module:
import re
words = set()
with open('frankenstein.txt') as infile:
    for line in infile:
        # split on runs of non-letters; a set has update(), not extend()
        words.update(x for x in re.split(r'[^A-Za-z]+', line) if len(x) > 2)
with open('dictionary.txt', 'w') as outfile:
    # writelines adds no line endings, so append '\n' to each word
    outfile.writelines(word + '\n' for word in sorted(words))
In r'[^A-Za-z]+' in re.split, replace 'A-Za-z' with the characters that you want to keep in dictionary.txt.
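A variation on the same idea uses re.findall to pick out the words directly instead of splitting on everything else. This is a sketch under assumptions: unique_words is a hypothetical helper, lowercasing is an extra step that the question did not ask for, the sample lines are invented, and io.StringIO stands in for the output file.

```python
import io
import re


def unique_words(lines, min_length=3):
    """Return the sorted unique lowercase words with at least min_length letters."""
    words = set()
    for line in lines:
        words.update(
            word.lower()
            for word in re.findall(r"[A-Za-z]+", line)
            if len(word) >= min_length
        )
    return sorted(words)


# Invented sample lines standing in for frankenstein.txt.
sample = ["The abbey, I abhorred it!\n", "It is about 2 miles to the abbey.\n"]
words = unique_words(sample)

out = io.StringIO()  # stands in for open('dictionary.txt', 'w')
out.write("\n".join(words) + "\n")  # one word per line
print(out.getvalue())
```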
