Python3 Function returning only first values - python-3.x

I'm trying to make a function that opens a textfile, counts the words and then returns the top 7 words in percent. However when I run the code with "return" it only retrieves the first value but "print" gives me what I need + a "None".
def count_wfrequency(fname):
with open(fname,"r") as f:
dictionary = dict()
for line in f:
line = line.rstrip()
delaorden = line.split()
for word in delaorden:
dictionary[word] = dictionary.get(word,0) + 1
tmp = list()
for key, value in dictionary.items():
tmp.append((value, key))
tmp.sort(reverse=True)
for value, key in tmp[:7]:
b = ("{0:.0f}%".format(value / len(tmp)* 100))
print(key, b, end=", ")
the 9%, to 6%, of 5%, and 4%, in 3%, a 3%, would 2%, Frequency of words: None
What am I doing wrong?
edit: To clarify, I want to make it so that it prints out "Frequency of words: the 9%, to 6%...and so on.
Trying to use the function like this:
if args.word_frequency:
print("Frequency of words: {}".format(funk.count_wfrequency(args.filename)))

Related

How do I convert a text file into a dictionary

I would like to parse the following text file into a dictionary:
Train A
Travelled 150km
No longer in use
Stored in warehouse
Train B
Travelled 100km
Used by X company
Daily usage
Actively upgrading
The end result dictionary should have Train A and Train B as keys, and the rest of values as list of values:
{
'Train A': ['Travelled 150km', 'No longer in use', 'Stored in warehouse'],
'Train B': ['Travelled 100km', 'Used by X company', 'Daily usage', 'Actively upgrading']
}
I've currently tried
with open('file.txt') as f:
data = f.read().split('\n')
dict = {}
for i in data:
key = i[0]
value = i[1:]
d[key] = value
print(dict)
Really not too sure where im wrong. I want to split the \n after Train A, where Train A is Key and all the other information listed is the value
Your units are separated by blank lines - so, you should first split by two newlines, not by one. The following implementation is somewhat inefficient (it splits the same variable twice), but it works, and you can improve it if you want:
[{x.split("\n")[0]: x.split("\n")[1:]} for x in data.split("\n\n")]
#[{'Train A': ['Travelled 150km', 'No longer in use',
# 'Stored in warehouse']},
# {'Train B': ['Travelled 100km', 'Used by X company',
# 'Daily usage', 'Actively upgrading'}]
You are close. You need to split the files using empty line first ('\n\n'), then continue on with your idea.
with open('file.txt') as f:
data = f.read().split('\n\n') # <=== this is what's missing
print(data)
d = {}
for i in data:
i = i.split('\n')
key = i[0]
value = i[1:]
d[key] = value
print(d)

I want to make a dictionary of trigrams out of a text file, but something is wrong and I do not know what it is

I have written a program which is counting trigrams that occur 5 times or more in a text file. The trigrams should be printed out according to their frequency.
I cannot find the problem!
I get the following error message:
list index out of range
I have tried to make the range bigger but that did not work out
f = open("bsp_file.txt", encoding="utf-8")
text = f.read()
f.close()
words = []
for word in text.split():
word = word.strip(",.:;-?!-–—_ ")
if len(word) != 0:
words.append(word)
trigrams = {}
for i in range(len(words)):
word = words[i]
nextword = words[i + 1]
nextnextword = words[i + 2]
key = (word, nextword, nextnextword)
trigrams[key] = trigrams.get(key, 0) + 1
l = list(trigrams.items())
l.sort(key=lambda x: x[1])
l.reverse()
for key, count in l:
if count < 5:
break
word = key[0]
nextword = key[1]
nextnextword = key[2]
print(word, nextword, nextnextword, count)
The result should look like this:(simplified)
s = "this is a trigram which is an example............."
this is a
is a trigram
a trigram which
trigram which is
which is an
is an example
As the comments pointed out, you're iterating over your list words with i, and you try to access words[i+1], when i will reach the last cell of words, i+1 will be out of range.
I suggest you read this tutorial to generate n-grams with pure python: http://www.albertauyeung.com/post/generating-ngrams-python/
Answer
If you don't have much time to read it all here's the function I recommend adaptated from the link:
def get_ngrams_count(words, n):
# generates a list of Tuples representing all n-grams
ngrams_tuple = zip(*[words[i:] for i in range(n)])
# turn the list into a dictionary with the counts of all ngrams
ngrams_count = {}
for ngram in ngrams_tuple:
if ngram not in ngrams_count:
ngrams_count[ngram] = 0
ngrams_count[ngram] += 1
return ngrams_count
trigrams = get_ngrams_count(words, 3)
Please note that you can make this function a lot simpler by using a Counter (which subclasses dict, so it will be compatible with your code) :
from collections import Counter
def get_ngrams_count(words, n):
# turn the list into a dictionary with the counts of all ngrams
return Counter(zip(*[words[i:] for i in range(n)]))
trigrams = get_ngrams_count(words, 3)
Side Notes
You can use the bool argument reverse in .sort() to sort your list from most common to least common:
l = list(trigrams.items())
l.sort(key=lambda x: x[1], reverse=True)
this is a tad faster than sorting your list in ascending order and then reverse it with .reverse()
A more generic function for the printing of your sorted list (will work for any n-grams and not just tri-grams):
for ngram, count in l:
if count < 5:
break
# " ".join(ngram) will combine all elements of ngram in a string, separated with spaces
print(" ".join(ngram), count)

Was solving a hacker problem and some test cases didnt pass

Given the names and grades for each student in a Physics class of students, store them in a nested list and print the name(s) of any student(s) having the second lowest grade.
Note: If there are multiple students with the same grade, order their names alphabetically and print each name on a new line.
Input Format
The first line contains an integer, , the number of students.
The subsequent lines describe each student over lines; the first line contains a student's name, and the second line contains their grade.
Constraints
There will always be one or more students having the second lowest grade.
Output Format
Print the name(s) of any student(s) having the second lowest grade in Physics; if there are multiple students, order their names alphabetically and print each one on a new line.
This is my code:
list = []
for _ in range(int(input())):
name = input()
score = float(input())
new = [name, score]
list.append(new)
def snd_highest(val):
return val[1]
list.sort(key = snd_highest)
list.sort()
value = list[1]
grade = value[1]
for a,b in list:
if b == grade:
print (a)
This is the test case:
4
Rachel
-50
Mawer
-50
Sheen
-50
Shaheen
51
And the expected output is Shaheen but i got the other 3.
Please explain.
To find the second lowest value, you have actually just sorted your list in ascending order and just taken the second value in the list by using the below code
value = list[1]
grade = value[1]
Imagine this is your list after sorting:
[['Sheen', 50.0], ['mawer', 50.0], ['rachel', 50.0], ['shaheen', 51.0]]
According to value = list[1], the program chooses "value = ['mawer', 50.0]".
Then the rest of your program takes the grade from this value and outputs the corresponding name, that's why this isn't working as per what you need, you need to write logic to find the lowest value and then find the second lowest, this current program just assumes the lowest value is in the second position in the list.
Try doing this
if __name__ == '__main__':
students = []
for _ in range(int(input())):
name = input()
score = float(input())
new = [name, score]
students.append(new)
def removeMinimum(oldlist):
oldlist = sorted(oldlist, key=lambda x: x[1])
min_ = min(students, key=lambda x: x[1])
newlist = []
for a in range(0, len(oldlist)):
if min_[1] != oldlist[a][1]:
newlist.append(oldlist[a])
return newlist
students = removeMinimum(students);
# find the second minimum value
min_ = min(students, key=lambda x: x[1])
# sort alphabetic order
students = sorted(students, key=lambda x: x[0])
for a in range(0, len(students)):
if min_[1] == students[a][1]:
print(students[a][0])
I hope this may help you to pass all your test cases. Thank you.
# These functions will be used for sorting
def getSecond(ele):
return ele[1]
def getFirst(ele):
return ele[0]
studendList = []
sortedList = []
secondLowestStudents = []
# Reading input from STDIN and saving in nested list [["stud1": <score>], ["stud2", <score>]]
for _ in range(int(input())):
name = input()
score = float(input())
studendList.append([name, score])
# sort the list by score and save it in a new list studendList (remove the duplicate score as well - see, if x[1] not in sortedList)
studendList.sort(key=getSecond)
[sortedList.append(x[1]) for x in studendList if x[1] not in sortedList]
# Get the second lowest grade
secondLowest = sortedList[1]
# Now sort the origin list by the name fetch the student list having the secondLowest grade
studendList.sort(key=getFirst)
[secondLowestStudents.append(x[0]) for x in studendList if x[1] == secondLowest]
# Print the student's name having second-lowest grade
for st in secondLowestStudents:
print(st)

Never resets list

I am trying to create a calorie counter the standard input goes like this:
python3 calories.txt < test.txt
Inside calories the food is the following format: apples 500
The problem I am having is that whenever I calculate the values for the person it seems to never return to an empty list..
import sys
food = {}
eaten = {}
finished = {}
total = 0
#mappings
def calories(x):
with open(x,"r") as file:
for line in file:
lines = line.strip().split()
key = " ".join(lines[0:-1])
value = lines[-1]
food[key] = value
def calculate(x):
a = []
for keys,values in x.items():
for c in values:
try:
a.append(int(food[c]))
except:
a.append(100)
print("before",a)
a = []
total = sum(a) # Problem here
print("after",a)
print(total)
def main():
calories(sys.argv[1])
for line in sys.stdin:
lines = line.strip().split(',')
for c in lines:
values = lines[0]
keys = lines[1:]
eaten[values] = keys
calculate(eaten)
if __name__ == '__main__':
main()
Edit - forgot to include what test.txt would look like:
joe,almonds,almonds,blue cheese,cabbage,mayonnaise,cherry pie,cola
mary,apple pie,avocado,broccoli,butter,danish pastry,lettuce,apple
sandy,zuchini,yogurt,veal,tuna,taco,pumpkin pie,macadamia nuts,brazil nuts
trudy,waffles,waffles,waffles,chicken noodle soup,chocolate chip cookie
How to make it easier on yourself:
When reading the calories-data, convert the calories to int() asap, no need to do it every time you want to sum up somthing that way.
Dictionary has a .get(key, defaultvalue) accessor, so if food not found, use 100 as default is a 1-liner w/o try: ... except:
This works for me, not using sys.stdin but supplying the second file as file as well instead of piping it into the program using <.
I modified some parsings to remove whitespaces and return a [(name,cal),...] tuplelist from calc.
May it help you to fix it to your liking:
def calories(x):
with open(x,"r") as file:
for line in file:
lines = line.strip().split()
key = " ".join(lines[0:-1])
value = lines[-1].strip() # ensure no whitespaces in
food[key] = int(value)
def getCal(foodlist, defValueUnknown = 100):
"""Get sum / total calories of a list of ingredients, unknown cost 100."""
return sum( food.get(x,defValueUnknown ) for x in foodlist) # calculate it, if unknown assume 100
def calculate(x):
a = []
for name,foods in x.items():
a.append((name, getCal(foods))) # append as tuple to list for all names/foods eaten
return a
def main():
calories(sys.argv[1])
with open(sys.argv[2]) as f: # parse as file, not piped in via sys.stdin
for line in f:
lines = line.strip().split(',')
for c in lines:
values = lines[0].strip()
keys = [x.strip() for x in lines[1:]] # ensure no whitespaces in
eaten[values] = keys
calced = calculate(eaten) # calculate after all are read into the dict
print (calced)
Output:
[('joe', 1400), ('mary', 1400), ('sandy', 1600), ('trudy', 1000)]
Using sys.stdin and piping just lead to my console blinking and waiting for manual input - maybe VS related...

Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function to count the number of each sequence ID that occurs in this file (it's a sample blast file)
The picture above is the input file I'm dealing with.
def count_seq(input):
dic1={}
count=0
for line in input:
if line.startswith('#'):
continue
if line.find('hits found'):
line=line.split('\t')
if line[1] in dic1:
dic1[line]+=1
else:
dic1[line]=1
return dic1
Above is my code which when called just returns empty brackets {}
So I'm trying to count how many times each of the sequence IDs (second element of last 13 lines) occur eg: FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!
Maybe this is what you're after:
def count_seq(input_file):
dic1={}
with open(input_file, "r") as f:
for line in f:
line = line.strip()
if not line.startswith('#'):
line = line.split()
seq_id = line[1]
if not seq_id in dic1:
dic1[seq_id] = 1
else:
dic1[seq_id] += 1
return dic1
print(count_seq("blast_file"))
This is a fitting case for collections.defaultdict. Let f be the file object. Assuming the sequences are in the second column, it's only a few lines of code as shown.
from collections import defaultdict
d = defaultdict(int)
seqs = (line.split()[1] for line in f if not line.strip().startswith("#"))
for seq in seqs:
d[seq] += 1
See if it works!

Resources