Count frequency of words under given index in a file - python-3.x

I am trying to count the occurrences of the words at a specific index (column) in my file and print the result as a dictionary.
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as file:
        content_of_file = file.readlines()
        dict_of_fruit_count = {}
        for line in content_of_file:
            line = line[0:-1]
            line = line.split("\t")
            for fruit in line:
                fruit = line[1]
                dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
    return dict_of_fruit_count
print(count_by_fruit())
Output: {'apple': 6, 'banana': 6, 'orange': 3}
I am getting this output, but it doesn't count the frequency of the words correctly. After searching around, I couldn't find a proper solution. Could anyone help me identify my mistake?
My file has the following content (the data is separated with tabs; I wrote "\t" in the example because Stack Overflow alters the formatting):
I am line one with \t apple \t from 2018
I am line two with \t orange \t from 2017
I am line three with \t apple \t from 2016
I am line four with \t banana \t from 2010
I am line five with \t banana \t from 1999

You are looping over the same line too many times. Notice that the counts you are getting are all 3 times what you expect.
Also, in Python you do not need to read the entire file into memory. Just iterate over the file object line by line.
Try:
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as f_in:
        dict_of_fruit_count = {}
        for line in f_in:
            fruit = line.split("\t")[1]
            dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
    return dict_of_fruit_count
Which can be further simplified to:
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name) as f_in:
        dict_of_fruit_count = {}
        for fruit in (line.split('\t')[1] for line in f_in):
            dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
    return dict_of_fruit_count
Or, if you can use Counter:
from collections import Counter
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name) as f_in:
        return dict(Counter(line.split('\t')[1] for line in f_in))
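The versions above index column 1 directly, so a blank or short line would raise IndexError, and any spaces around the value become part of the key. A minimal sketch that guards against both, assuming a hypothetical helper named count_column that takes any iterable of lines (an open file works too):

```python
from collections import Counter


def count_column(lines, index=1, sep="\t"):
    """Count the values found in column `index` of each separated line."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) <= index:
            continue  # skip blank or malformed lines instead of raising IndexError
        counts[fields[index].strip()] += 1  # strip stray spaces around the value
    return dict(counts)


# In-memory stand-in for the file from the question.
sample = [
    "I am line one with\tapple\tfrom 2018\n",
    "I am line two with\torange\tfrom 2017\n",
    "I am line three with\tapple\tfrom 2016\n",
    "I am line four with\tbanana\tfrom 2010\n",
    "I am line five with\tbanana\tfrom 1999\n",
    "\n",
]
print(count_column(sample))  # {'apple': 2, 'orange': 1, 'banana': 2}
```

With a real file, the call would be count_column(f_in) inside the with block.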

The problem is the for fruit in line: loop. Splitting each line on the tabs produces three parts. If you loop over those three parts and add one to the count on every pass, your counts end up 3 times as large as they should be.
Below is how I would write this function, using generator expressions and Counter.
from collections import Counter
def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as file:
        lines = (line[:-1] for line in file)
        fruit = (line.split('\t')[1] for line in lines)
        return Counter(fruit)
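To see the tripling concretely, here is a small sketch that runs the inner-loop version and the one-update-per-line version side by side. The lines are an in-memory stand-in for the first three rows of the question's file, and the names buggy and fixed are only illustrative:

```python
lines = [
    "I am line one with\tapple\tfrom 2018",
    "I am line two with\torange\tfrom 2017",
    "I am line three with\tapple\tfrom 2016",
]

buggy, fixed = {}, {}
for line in lines:
    parts = line.split("\t")
    for _ in parts:  # runs once per field, i.e. three times per line
        buggy[parts[1]] = buggy.get(parts[1], 0) + 1
    fixed[parts[1]] = fixed.get(parts[1], 0) + 1  # runs once per line

print(buggy)  # {'apple': 6, 'orange': 3}
print(fixed)  # {'apple': 2, 'orange': 1}
```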

Related

How to print 2 lines under a line from a text file?

Edit: I want to print the 2 lines below the code entered by the user, but it doesn't seem to work.
My text file looks like this:
86947367
banana
5
78364721
apple
3
35619833
orange
2
84716491
sweets
8
46389121
chicken
10
I have tried :
file = ('read_it.txt')
user = input('Enter code')
with open(file, 'r') as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        if line == user:
            print("{}\n{}".format(lines[i+1], lines[i+2]))
But I get an output of 2 blank lines.
file = 'filename.txt'
user = input('Enter code')
with open(file, 'r') as f:
    lines = [line.strip() for line in f.readlines()]  # Strip \n and \t from text
    for i, line in enumerate(lines):  # enumerate will count and keep track of the lines
        if line == user:
            print("{}\n{}".format(lines[i+1], lines[i+2]))
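The lookup above raises IndexError if the matching code sits within the last two lines of the file. A minimal sketch that uses slicing to avoid that, assuming a hypothetical helper named lines_after that works on any list of lines:

```python
def lines_after(lines, code, count=2):
    """Return the `count` lines that follow the first line equal to `code`."""
    cleaned = [line.strip() for line in lines]
    try:
        start = cleaned.index(code.strip()) + 1
    except ValueError:
        return []  # the code does not appear in the file
    return cleaned[start:start + count]  # slicing never raises IndexError


# In-memory stand-in for part of the question's file.
sample = ["86947367\n", "banana\n", "5\n", "78364721\n", "apple\n", "3\n"]
print(lines_after(sample, "78364721"))  # ['apple', '3']
print(lines_after(sample, "00000000"))  # []
```

With a real file, pass f.readlines() as the first argument.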

Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function to count the number of each sequence ID that occurs in this file (it's a sample blast file)
The picture above is the input file I'm dealing with.
def count_seq(input):
    dic1={}
    count=0
    for line in input:
        if line.startswith('#'):
            continue
        if line.find('hits found'):
            line=line.split('\t')
            if line[1] in dic1:
                dic1[line]+=1
            else:
                dic1[line]=1
    return dic1
Above is my code, which when called just returns an empty dictionary: {}
I'm trying to count how many times each of the sequence IDs (the second element of the last 13 lines) occurs, e.g. FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!
Maybe this is what you're after:
def count_seq(input_file):
    dic1={}
    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            # skip blank lines and comment lines
            if line and not line.startswith('#'):
                line = line.split()
                seq_id = line[1]
                if seq_id not in dic1:
                    dic1[seq_id] = 1
                else:
                    dic1[seq_id] += 1
    return dic1
print(count_seq("blast_file"))
This is a fitting case for collections.defaultdict. Let f be the file object. Assuming the sequences are in the second column, it's only a few lines of code as shown.
from collections import defaultdict
d = defaultdict(int)
seqs = (line.split()[1] for line in f if not line.strip().startswith("#"))
for seq in seqs:
    d[seq] += 1
See if it works!
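The same count can also be sketched with collections.Counter, which does the tallying itself. The function name count_seq_ids and the sample rows below are made up for illustration (only the ID FO203510.1 comes from the question), and the sketch assumes every data line has at least two whitespace-separated fields:

```python
from collections import Counter


def count_seq_ids(lines):
    """Count the second whitespace-separated field of every non-comment line."""
    ids = (
        line.split()[1]
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    return Counter(ids)


# Hypothetical rows standing in for the BLAST output file.
sample = [
    "# BLASTN (made-up header)\n",
    "# 3 hits found\n",
    "query1\tFO203510.1\t98.5\n",
    "query1\tFO203510.1\t97.0\n",
    "query1\tXX000001.1\t90.2\n",
    "\n",
]
print(count_seq_ids(sample))  # Counter({'FO203510.1': 2, 'XX000001.1': 1})
```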

How to sort text document, keep order and only unique lines

I have a text document with words, one per line:
text1
text2
text3
text2
text4
text4
text2
text3
Now I want to remove all duplicates, keeping only the unique lines in their original order:
text1
text2
text3
text4
I have tried several solutions, but none of them works correctly for me.
This one keeps only unique lines,
with open(r'C:\folder\filedoc.txt', 'r') as lines:
    lines_set = {line.strip() for line in lines}
with open(r'C:\folder\filedoc.txt', 'w') as out:
    for line in lines_set:
        out.write(line + '\n')
but not the order:
1. text2
2. text5
3. text3
4. text4
5. text1
This one keeps the order, but keeps the repeated words too:
with open(r'C:\folder\filedoc.txt', 'r') as lines:
    lines_set = []
    for line in lines:
        if line.strip() not in lines_set:
            lines_set.append(line.strip())
This one works well, but only with input text:
with open(r'C:\my_path\doc.txt', 'r') as lines:
    lines_set = []
    for line in lines:
        if line.strip() not in lines_set:
            lines_set.append(line.strip())
I don't want to use input; I need to deduplicate the ordered list itself. On each cycle I add a new word to the text file, but under a certain condition (not on every cycle) I want to remove all duplicated words at once. I need a continually growing list, one word per line, that keeps its original order after the duplicates are removed.
This code works correctly for me, exactly as I need, but the returned list gives wrong results in many other conditions when I go this way with def and functions:
def loadlines1(f):
    with open(f, 'r') as lines:
        lines_set = []
        for line in lines:
            if line.strip() not in lines_set:
                lines_set.append(line.strip())
        return lines_set
def loadlines2(f):
    with open(f, 'r') as lines:
        lines_set = []
        for line in lines:
            lines_set.append(line.strip())
        return lines_set
def removeDuplicates(l):
    out = list(set(l))
    for i in enumerate(out):
        out[i[0]] = l.index(i[1])
    out.sort()
    for i in enumerate(out):
        out[i[0]] = l[i[1]]
    return out
def savelines(f, l):
    open(f, 'w').write('\n'.join(l))
lines = loadlines2(r'C:\folder\filedoc.txt')
stripped_lines = removeDuplicates(lines)
savelines('doc.txt', stripped_lines)
It would be good if I could avoid any return analysis.
Now I have found this one, but I'm not sure how to apply it:
lines_seen = set()
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
And maybe this one too:
with open(r'C:\folder\filedoc.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)
with open(r'C:\folder\filedoc.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            a.add(line)
So can you help me figure out this problem, please?
The best solution, as I imagine it (if it is possible at all; I don't know exactly how to do it), would go like this: read all lines in my document and find all repeated words (rather than comparing only against a new one, as in the variant with input), then remove the extra copies and keep only the unique words, then write the resulting list back over the previous document. So maybe something like this at the end of each cycle in which the condition was met. But I'm not sure; maybe there is a better and easier way.
You can get a list in original order with all duplicates removed by doing something like this:
from collections import OrderedDict
no_duplicates = list(OrderedDict.fromkeys(f.readlines()))
And then all you have to do is write it back to the file.
This should work:
from collections import OrderedDict
with open('file.txt', 'r') as f:
    items = list(OrderedDict.fromkeys(f.readlines()))
with open('file.txt', 'w') as f:
    for item in items:
        f.write(item)
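One caveat with keying on the raw lines: if the last line of the file has no trailing newline, it will not compare equal to an earlier copy that does. A minimal sketch that strips the newline first and relies on plain dict preserving insertion order (guaranteed since Python 3.7); the name unique_in_order is only illustrative:

```python
def unique_in_order(lines):
    """Drop repeated lines, keeping the first occurrence of each."""
    stripped = (line.rstrip("\n") for line in lines)
    return list(dict.fromkeys(stripped))


# The final "text3" has no newline, as the last line of a file often does not.
sample = ["text1\n", "text2\n", "text3\n", "text2\n",
          "text4\n", "text4\n", "text2\n", "text3"]
print(unique_in_order(sample))  # ['text1', 'text2', 'text3', 'text4']
```

When writing the result back, join the items with "\n" since the newlines were stripped.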

keep order as it was saved and keep only unique words in text file list

I need to keep the lines in the order in which they were saved in a txt file (each new line is appended at the bottom) and preserve that order after removing duplicate words. So if I add words in a loop one by one:
line A
line B
line C
line D
line E
Here I have three solutions, but none of them works correctly for me.
The first keeps only unique words:
with open(r'C:\my_path\doc.txt', 'r') as lines:
    lines_set = {line.strip() for line in lines}
with open(r'D:\path\file.txt', 'w') as out:
    for line in lines_set:
        out.write(line + '\n')
but destroys order:
1. line B
2. line E
3. line C
4. line D
5. line A
The second keeps the order, but keeps the repeated words too:
with open(r'C:\my_path\doc.txt', 'r') as lines:
    lines_set = []
    for line in lines:
        if line.strip() not in lines_set:
            lines_set.append(line.strip())
The last one works well, but only with input text:
with open(r'C:\my_path\doc.txt', 'r') as lines:
    lines_set = []
    for line in lines:
        if line.strip() not in lines_set:
            lines_set.append(line.strip())
In some cases I have no input at all, and in others the input differs, so I need to deduplicate the ordered list itself.
Can you help me figure this out, please?
loadLines is almost the same as the function you show twice, but it allows duplicates. removeDuplicates strips duplicates. saveLines writes a list to a file, delimiting the entries with newlines. All functions preserve order.
#Load lines with duplicates
def loadLines(f):
    with open(f, 'r') as lines:
        lines_set = []
        for line in lines:
            lines_set.append(line.strip())
        return lines_set
#Search list "l", return list without duplicates.
def removeDuplicates(l):
    out = list(set(l))
    # Replace each unique item with the index of its first occurrence in "l"
    for i in enumerate(out):
        out[i[0]] = l.index(i[1])
    out.sort()
    # Map the sorted indices back to the items
    for i in enumerate(out):
        out[i[0]] = l[i[1]]
    return out
#Write the lines "l" to filepath "f"
def saveLines(f, l):
    with open(f, 'w') as out:
        out.write('\n'.join(l))
lines = loadLines('doc.txt')
print(lines)
stripped_lines = removeDuplicates(lines)
print(stripped_lines)
saveLines('doc.txt', stripped_lines)
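removeDuplicates above calls l.index for every unique item, which rescans the list each time. A common alternative, sketched here under the hypothetical name remove_duplicates, makes a single pass and tracks what has already been seen in a set:

```python
def remove_duplicates(items):
    """Return items in first-seen order with later repeats dropped."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:  # set membership is a constant-time check on average
            seen.add(item)
            result.append(item)
    return result


print(remove_duplicates(["line A", "line B", "line A", "line C", "line B"]))
# ['line A', 'line B', 'line C']
```

It should be usable as a drop-in replacement between loadLines and saveLines, since it takes and returns a list of strings.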

How can I simplify and format this function?

So I have this messy code where I wanted to get every word from frankenstein.txt, sort them alphabetically, eliminate one- and two-letter words, and write them into a new file.
def Dictionary():
    d = []
    count = 0
    bad_char = '~!##$%^&*()_+{}|:"<>?\`1234567890-=[]\;\',./ '
    replace = ' '*len(bad_char)
    table = str.maketrans(bad_char, replace)
    infile = open('frankenstein.txt', 'r')
    for line in infile:
        line = line.translate(table)
        for word in line.split():
            if len(word) > 2:
                d.append(word)
                count += 1
    infile.close()
    file = open('dictionary.txt', 'w')
    file.write(str(set(d)))
    file.close()
Dictionary()
How can I simplify it and make it more readable, and how can I make the words write vertically in the new file (at the moment it writes a horizontal list)? Like this:
abbey
abhorred
about
etc....
A few improvements below:
from string import digits, punctuation
def create_dictionary():
    words = set()
    bad_char = digits + punctuation + '...'  # may need more characters
    replace = ' ' * len(bad_char)
    table = str.maketrans(bad_char, replace)
    with open('frankenstein.txt') as infile:
        for line in infile:
            line = line.strip().translate(table)
            for word in line.split():
                if len(word) > 2:
                    words.add(word)
    with open('dictionary.txt', 'w') as outfile:
        # writelines does not add newlines itself, so append one to each word
        outfile.writelines(word + '\n' for word in sorted(words))
A few notes:
follow the style guide (PEP 8);
the string module contains constants you can use to provide the "bad characters";
you never used count (which was just len(d) anyway);
use the with context manager for file handling; and
using a set from the start prevents duplicates, but sets aren't ordered (hence sorted).
Using the re module:
import re
words = set()
with open('frankenstein.txt') as infile:
    for line in infile:
        # set.update adds every item; '+' splits on runs of non-letters
        words.update(x for x in re.split(r'[^A-Za-z]+', line) if len(x) > 2)
with open('dictionary.txt', 'w') as outfile:
    outfile.writelines(word + '\n' for word in sorted(words))
In the pattern r'[^A-Za-z]+' passed to re.split, replace 'A-Za-z' with the characters you want to keep in dictionary.txt.
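As a variation, re.findall can collect the letter runs directly instead of splitting on everything else. The sketch below is one possible take, not the answerers' code: the function name extract_words is hypothetical, and lower-casing the words is an added assumption so that "Abbey" and "abbey" count as one entry.

```python
import re


def extract_words(text, min_length=3):
    """Return the sorted, unique, lower-cased words of at least `min_length` letters."""
    words = re.findall(r"[A-Za-z]+", text)
    return sorted({w.lower() for w in words if len(w) >= min_length})


# Made-up sample text standing in for frankenstein.txt.
sample = "I was about to go to the abbey; he abhorred it, the Abbey!"
print("\n".join(extract_words(sample)))
# abbey
# abhorred
# about
# the
# was
```

Joining with "\n" is what puts one word per line, so the same string can be written to dictionary.txt.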
